Vocabulary

class forte.data.vocabulary.Vocabulary(method, need_pad, use_unk)

This class will store “Elements” that are added, assign “Ids” to them and return “Representations” if queried. These three are the main concepts in this class.

  1. Element: Any hashable instance that the user wants to store.

  2. Id: Each element has a unique Id, which is an integer.

  3. Representation: Depending on the configuration, the representation of an element is either an integer (in this case, the Id itself) or a one-hot vector (in this case, a list of integers).
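The relationship between the three concepts can be illustrated with a small standalone sketch. It does not use Forte; the names `element2id`, `add`, and `one_hot` are hypothetical helpers, and the choice to ignore repeated elements is an assumption made for the illustration.

```python
from typing import Dict, Hashable, List

# Hypothetical stand-in for the vocabulary's internal state; not Forte code.
element2id: Dict[Hashable, int] = {}


def add(element: Hashable) -> int:
    """Assign the next free integer Id to an element.

    Assumption for this sketch: adding an element twice keeps its first Id.
    """
    if element not in element2id:
        element2id[element] = len(element2id)
    return element2id[element]


def one_hot(idx: int, size: int) -> List[int]:
    """Return a list of `size` integers with a 1 at position `idx`."""
    return [1 if i == idx else 0 for i in range(size)]


for token in ["cat", "dog", "cat", "bird"]:
    add(token)

# Integer representation: the Id itself.
index_repr = element2id["dog"]
# One-hot representation: a list of integers as long as the vocabulary.
one_hot_repr = one_hot(element2id["dog"], len(element2id))
```

Here "dog" is the second distinct element, so its Id is 1 and its one-hot vector has a 1 in the second position.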

There are two special elements.

  1. One is the <PAD> element, which is mapped to an Id of 0 (indexing) or -1 (one-hot) and whose representation depends on the settings.

  2. The other is the <UNK> element, which, if added to the vocabulary, is used as the default element when a queried element is not found.

The table below shows how the Vocabulary class behaves under different settings. Element0 means the first element that is added to the vocabulary. Elements added later will be element1, element2 and so on. They follow the same behavior as element0. For readability, they are not listed in the table.

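Vocabulary behavior under different settings.

+---------------+----------------------+-------------+-------------+-------------------+-------------------+
| vocab_method  | raw (handle outside) | indexing    | indexing    | one-hot           | one-hot           |
+===============+======================+=============+=============+===================+===================+
| need_pad      | assume False         | True        | False       | True              | False             |
+---------------+----------------------+-------------+-------------+-------------------+-------------------+
| get_pad_value | None                 | 0           | None        | [0,0,0]           | None              |
+---------------+----------------------+-------------+-------------+-------------------+-------------------+
| inner_mapping | None                 | 0:pad       | 0:element0  | -1:<PAD>          | 0:element0        |
|               |                      | 1:element0  |             | 0:element0        |                   |
+---------------+----------------------+-------------+-------------+-------------------+-------------------+
| element2repr  | raise Error          | pad->0      | element0->0 | <PAD>->[0,0,0]    | element0->[1,0,0] |
|               |                      | element0->1 |             | element0->[1,0,0] |                   |
+---------------+----------------------+-------------+-------------+-------------------+-------------------+
| id2element    | raise Error          | 0->pad      | 0->element0 | -1-><PAD>         | 0->element0       |
|               |                      | 1->element0 |             | 0->element0       |                   |
|               |                      |             |             | (be careful)      |                   |
+---------------+----------------------+-------------+-------------+-------------------+-------------------+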
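The indexing and one-hot columns of the table can be reproduced with a short standalone sketch. This is not Forte's implementation: `TableVocab` and `build` are hypothetical names, the raw method and <UNK> handling are left out, and the one-hot vector length is assumed to equal the number of non-<PAD> elements, as the three-element vectors in the table suggest.

```python
from typing import Dict, Hashable, List, Optional, Union

PAD = "<PAD>"
Repr = Union[int, List[int]]


class TableVocab:
    """Hypothetical re-creation of the table; not Forte's implementation."""

    def __init__(self, method: str, need_pad: bool) -> None:
        if method not in ("indexing", "one-hot"):
            raise ValueError(f"unsupported method in this sketch: {method}")
        self.method = method
        self.need_pad = need_pad
        self.element2id: Dict[Hashable, int] = {}
        # With one-hot, <PAD> takes Id -1 so real elements still start at 0.
        self.next_id = -1 if (need_pad and method == "one-hot") else 0
        if need_pad:
            self.add_element(PAD)

    def add_element(self, element: Hashable) -> None:
        if element not in self.element2id:
            self.element2id[element] = self.next_id
            self.next_id += 1

    def element2repr(self, element: Hashable) -> Repr:
        idx = self.element2id[element]
        if self.method == "indexing":
            return idx
        # One-hot: Id -1 (<PAD>) matches no position, so it is all zeros.
        return [1 if i == idx else 0 for i in range(self.next_id)]

    def get_pad_value(self) -> Optional[Repr]:
        return self.element2repr(PAD) if self.need_pad else None


def build(method: str, need_pad: bool) -> TableVocab:
    """Create a vocabulary holding element0, element1 and element2."""
    vocab = TableVocab(method, need_pad)
    for element in ("element0", "element1", "element2"):
        vocab.add_element(element)
    return vocab


indexing_pad = build("indexing", True)
one_hot_pad = build("one-hot", True)
```

With padding enabled, `indexing_pad` shifts element0 to Id 1, while `one_hot_pad` keeps element0 at Id 0 and gives <PAD> the all-zero vector.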

Parameters
  • method (str) – The method used to represent elements in the vocabulary (raw, indexing, or one-hot, as listed in the table above).

  • need_pad (bool) – Whether to add the <PAD> element to the vocabulary.

  • use_unk (bool) – Whether to add the <UNK> element to the vocabulary. Queries for elements that are not found in the vocabulary are directed to the <UNK> element.

method

Same as above.

Type

str

need_pad

Same as above.

Type

bool

use_unk

Same as above.

Type

bool

next_id

The id that will be used when the next element is added.

Type

int

element2id_dict

This stores the mapping from element to id.

Type

dict

id2element_dict

This stores the mapping from id to element.

Type

dict

add_element(element)

This function will add an element to the vocabulary.

Parameters

element (Hashable) – The element to be added.

id2element(idx)

This function will map an id to its element.

Parameters

idx (int) – The queried id of element.

Returns

The corresponding element, if it exists. Check the behavior of this function under different settings in the class documentation.

Raises

KeyError – If the id is not found.

element2repr(element)

This function will map an element to its representation.

Parameters

element (Hashable) – The queried element.

Returns

The corresponding representation of the element. Check the behavior of this function under different settings in the class documentation.

Return type

Union[int, List[int]]

Raises

KeyError – If the element is not found and the vocabulary does not use the <UNK> element.
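The fallback described above can be sketched as a plain dictionary lookup. This is an illustration only, not Forte's code: `lookup_id` is a hypothetical helper, and the Id given to <UNK> here is arbitrary.

```python
from typing import Dict, Hashable

UNK = "<UNK>"


def lookup_id(element2id: Dict[Hashable, int], element: Hashable,
              use_unk: bool) -> int:
    """Return the Id of `element`, falling back to <UNK> when allowed."""
    if element in element2id:
        return element2id[element]
    if use_unk:
        # Unknown elements are directed to the <UNK> element.
        return element2id[UNK]
    raise KeyError(element)


# The Id assigned to <UNK> is an assumption made for this sketch.
with_unk = {UNK: 0, "cat": 1}
without_unk = {"cat": 0}

known_id = lookup_id(with_unk, "cat", use_unk=True)
fallback_id = lookup_id(with_unk, "zebra", use_unk=True)

try:
    lookup_id(without_unk, "zebra", use_unk=False)
    raised = False
except KeyError:
    raised = True
```

A caller that cannot tolerate silent fallbacks can check has_element first, or build the vocabulary without <UNK> and handle the KeyError.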

has_element(element)

This function checks whether an element has been added to the vocabulary.

Parameters

element (Hashable) – The queried element.

Returns

Whether element is found.

Return type

bool

items()

This function will loop over the (element, id) pairs stored in this class.

Returns

An iterable of (element, id) pairs.

Return type

Iterable[Tuple]
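One possible use of such (element, id) pairs is to build the reverse mapping and decode a sequence of ids. The sketch below is standalone: `pairs` is a hand-written stand-in for what the method might yield with indexing and padding enabled, and `invert` and `decode` are hypothetical helpers.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

PAD = "<PAD>"


def invert(pairs: Iterable[Tuple[Hashable, int]]) -> Dict[int, Hashable]:
    """Build an id -> element mapping from (element, id) pairs."""
    return {idx: element for element, idx in pairs}


# Hand-written stand-in for the pairs; not produced by Forte.
pairs = [(PAD, 0), ("cat", 1), ("dog", 2)]
id2element = invert(pairs)


def decode(ids: List[int]) -> List[Hashable]:
    """Map ids back to elements, dropping padding positions."""
    return [id2element[i] for i in ids if id2element[i] != PAD]


decoded = decode([2, 1, 0, 0])
```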

get_dict()

This function will return the inner mapping from element to id.

Returns

The maintained mapping from element to id.

Return type

dict

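get_pad_value()

This function will get the representation of the <PAD> element for the vocabulary.

Returns

The representation of the <PAD> element, or None if the vocabulary has no <PAD> element. Check the behavior of this function under different settings in the class documentation.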

Return type

Union[None, int, List[int]]