Vocabulary

class forte.data.vocabulary.Vocabulary(method='indexing', use_pad=True, use_unk=True, special_tokens=None, do_counting=True, pad_value=None, unk_value=None)[source]

This class stores “Elements” that are added, assigns “Ids” to them, and returns “Representations” when queried. These three are the main concepts in this class.

  1. Element: Any hashable instance that the user wants to store.

  2. Id: Each element has a unique Id, which is an integer.

  3. Representation: Depending on the configuration, the representation of an element is either an integer (in which case it is the “Id”) or a one-hot vector (in which case it is a list of integers).

The class adopts the special elements from Texar-Pytorch, which are:

  1. <PAD>: mapped to Id 0 or -1, with a representation that depends on the settings.

  2. <UNK>: if added to the vocabulary, it is the default element returned when the queried element is not found.

Note that these two special tokens are necessary for the system in certain cases and thus must be present in the vocabulary. The behavior of these special tokens is predefined based on the settings. To get around the default behavior (for example, if you have a predefined vocabulary with a different setup), you can instruct the class not to add these tokens automatically and use mark_special_element() instead.

The following table shows how the Vocabulary class behaves under different settings. Element0 means the first element that is added to the vocabulary. Elements added later are element1, element2 and so on. They follow the same behavior as element0 and, for readability, are not listed in the table.

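Vocabulary behavior under different settings.

=============  ============================================  =====================  ===========  ===================================  =================
vocab_method   custom (handled and implemented by the user)  indexing               indexing     one-hot                              one-hot
=============  ============================================  =====================  ===========  ===================================  =================
need_pad       assume False                                  True                   False        True                                 False
get_pad_value  None                                          0                      None         [0,0,0]                              None
inner_mapping  None                                          0:<PAD>, 1:element0    0:element0   -1:<PAD>, 0:element0                 0:element0
element2repr   raise Error                                   <PAD>->0, element0->1  element0->0  <PAD>->[0,0,0], element0->[1,0,0]    element0->[1,0,0]
id2element     raise Error                                   0-><PAD>, 1->element0  0->element0  -1-><PAD>, 0->element0 (be careful)  0->element0
=============  ============================================  =====================  ===========  ===================================  =================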
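The four non-custom columns of the table can be reproduced with a small stand-alone sketch. It does not call Forte: `expected_behavior` is a hypothetical helper written only to illustrate the table, and it assumes that <UNK> and other special tokens are disabled, since the table does not list them.

```python
from typing import Dict, List, Tuple, Union

Repr = Union[int, List[int]]


def expected_behavior(
    method: str, use_pad: bool, elements: List[str]
) -> Tuple[Dict[int, str], Dict[str, Repr]]:
    """Hypothetical helper (not part of Forte): returns the
    (id -> element, element -> representation) mappings listed in the table.
    Assumes no <UNK> and no extra special tokens."""
    if method not in ("indexing", "one-hot"):
        raise ValueError(f"Unsupported method: {method}")

    id_to_element: Dict[int, str] = {}
    if use_pad:
        # Per the table: <PAD> takes id 0 for "indexing" and -1 for "one-hot".
        id_to_element[0 if method == "indexing" else -1] = "<PAD>"

    # Regular elements start at 1 only when <PAD> already occupies id 0.
    first_id = 1 if (use_pad and method == "indexing") else 0
    for offset, element in enumerate(elements):
        id_to_element[first_id + offset] = element

    reprs: Dict[str, Repr]
    if method == "indexing":
        reprs = {element: idx for idx, element in id_to_element.items()}
    else:
        size = len(elements)
        # Id -1 never matches a position, so <PAD> becomes the all-zero vector.
        reprs = {
            element: [1 if pos == idx else 0 for pos in range(size)]
            for idx, element in id_to_element.items()
        }
    return id_to_element, reprs


one_hot_ids, one_hot_reprs = expected_behavior(
    "one-hot", True, ["element0", "element1", "element2"]
)
indexing_ids, indexing_reprs = expected_behavior(
    "indexing", True, ["element0", "element1", "element2"]
)
```

With three regular elements, the one-hot column has a vector length of 3, which matches the `[0,0,0]` and `[1,0,0]` entries in the table.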

Parameters
  • method (str) – The method used to represent elements in the vocabulary. Currently “indexing” and “one-hot” are supported.

  • use_pad (bool) – Whether to add <PAD> element to the vocabulary on creation. It will be added to the vocabulary first, but the id of it depends on the specific settings.

  • use_unk (bool) – Whether to add <UNK> element to the vocabulary on creation. Elements that are not found in vocabulary will be directed to <UNK> element. It will be added right after the <PAD> element if provided.

  • special_tokens (Optional[List[str]]) – Additional special tokens to be added. They are added one by one at the beginning of the vocabulary, right after the <UNK> token.

  • do_counting (bool) – Whether the vocabulary class will count the elements.

  • pad_value (Optional[Any]) – A customized value/representation to be used for padding, for example, following the PyTorch convention you may want to use -100. This value is only needed when use_pad is True. Default is None, where the value of padding is determined by the system.

  • unk_value (Optional[Any]) – A customized value/representation to be used for unknown value (unk). This value is only needed when use_unk is True. Default is None, where the value of UNK is determined by the system.
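The insertion order described by use_pad, use_unk and special_tokens can be sketched as follows. `initial_elements` is a hypothetical helper, not a Forte function; it only lists the order in which the special elements are added and says nothing about the ids they receive, which depend on the method as shown in the table.

```python
from typing import List, Optional


def initial_elements(
    use_pad: bool = True,
    use_unk: bool = True,
    special_tokens: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper (not part of Forte): the order in which special
    elements enter a new vocabulary, according to the parameter descriptions."""
    order: List[str] = []
    if use_pad:
        order.append("<PAD>")  # added first
    if use_unk:
        order.append("<UNK>")  # added right after <PAD>
    order.extend(special_tokens or [])  # then the additional special tokens
    return order


default_order = initial_elements(special_tokens=["<BOS>", "<EOS>"])
no_pad_order = initial_elements(use_pad=False, special_tokens=["<BOS>"])
```

Regular elements added through add_element() come after everything in this list.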

method

Same as above.

Type

str

use_pad

Same as above.

Type

bool

use_unk

Same as above.

Type

bool

do_counting

Same as above.

Type

bool

get_count(e)[source]

Get the counts of the vocabulary element.

Parameters

e (Union[~ElementType, int]) – The element to get counts for. It can be the element id or the element’s raw type.

Return type

int

Returns

The count of the element.

mark_special_element(element_id, element_name, representation=None)[source]

Mark a particular (already existing) index in the vocabulary as a special required element (i.e. PAD or UNK).

Parameters
  • element_id (int) – The id to be set for the special element.

  • element_name (str) – The name of this element to be set, it can be one of PAD, UNK.

  • representation (Optional[Any]) – The representation/value that this element should be assigned. Default is None, then its representation will be computed from the internal indexing.

is_special_token(element_id)[source]

Check whether the element is a special token.

add_special_element(element, element_id=None, representation=None, special_token_name=None)[source]

This function adds special elements to the vocabulary, such as the UNK, PAD, BOS and CLS symbols. Some special tokens will not be filtered by any VocabFilter. Some special tokens have their own unique behavior in the system.

Note

Most of the time, you do not have to call this method yourself, but should let the init function handle it.

Parameters
  • element (str) – The surface form of this special element.

  • element_id (Optional[int]) – The id to be used for this special token. If not provided, the vocabulary will use the next id internally. If the provided id is occupied, a ValueError will be thrown. The id can be any integer, including negative ones.

  • representation – The representation you want to assign to this special token. If None, the representation may be computed based on the index (which depends on the vocabulary setting).

  • special_token_name (Optional[str]) – An internal name of this special token. This only matters for the base special tokens: <PAD> or <UNK>, and the name should be “PAD” and “UNK” respectively. Any other name here is considered invalid, and a ValueError will be thrown if provided.
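The id rules for element_id can be illustrated with a stand-alone sketch. `assign_special_id` is a hypothetical function, not Forte code, and "next id" is assumed here to mean one more than the largest id in use; the source does not say how the vocabulary picks it.

```python
from typing import Dict, Optional


def assign_special_id(
    id_to_element: Dict[int, str], element: str, element_id: Optional[int] = None
) -> int:
    """Hypothetical sketch (not part of Forte) of the documented id rules."""
    if element_id is None:
        # Assumption: "next id" is one past the largest id currently used.
        element_id = max(id_to_element, default=-1) + 1
    elif element_id in id_to_element:
        # Documented behavior: an occupied id raises ValueError.
        raise ValueError(f"Id {element_id} is already occupied.")
    id_to_element[element_id] = element
    return element_id


mapping: Dict[int, str] = {0: "element0"}
pad_id = assign_special_id(mapping, "<PAD>", element_id=-1)  # negative ids are allowed
bos_id = assign_special_id(mapping, "<BOS>")  # no id given, so the next one is used
```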

add_element(element, representation=None, count=1)[source]

This function adds a regular element to the vocabulary.

Parameters
  • element (~ElementType) – The element to be added.

  • representation (Optional[Any]) – The value to use as the vocabulary representation of this element. For example, you may want to use -100 for tokens that PyTorch should ignore. Note that the class does not check whether this representation is already used by another element, so the caller has to manage this itself.

  • count (int) – The amount by which the count of this element is incremented. The default is 1 (i.e. the element is considered to appear once on every add). This value only has an effect if do_counting is True.

Return type

int

Returns

The internal id of the element.
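The interaction between count and do_counting can be sketched with a small stand-in class. `CountingIndex` is hypothetical and much simpler than the real Vocabulary; in particular, returning the existing id when an element is added again is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, Hashable


class CountingIndex:
    """Hypothetical stand-in (not Forte's Vocabulary) that assigns ids
    and optionally counts how often each element is added."""

    def __init__(self, do_counting: bool = True) -> None:
        self.do_counting = do_counting
        self.ids: Dict[Hashable, int] = {}
        self.counts: Counter = Counter()

    def add_element(self, element: Hashable, count: int = 1) -> int:
        if element not in self.ids:
            self.ids[element] = len(self.ids)
        if self.do_counting:
            # The count only has an effect when counting is enabled.
            self.counts[element] += count
        return self.ids[element]


index = CountingIndex()
first_id = index.add_element("cat")
second_id = index.add_element("cat", count=3)  # e.g. merging a pre-computed count
index.add_element("dog")
```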

id2element(idx)[source]

This function will map id to element.

Parameters

idx (int) – The queried id of element.

Return type

~ElementType

Returns

The corresponding element if it exists. Check the behavior of this function under different settings in the class documentation.

Raises

KeyError – If the id is not found.

element2repr(element)[source]

This function will map element to representation.

Parameters

element (Union[~ElementType, Any]) – The queried element. It can be either the same type as the element, or string (for the special tokens).

Returns

The corresponding representation of the element. Check the behavior of this function under different settings in the class documentation.

Return type

Union[int, List[int]]

Raises

KeyError – If element is not found and vocabulary does not use <UNK> element.
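The lookup rule described here (fall back to <UNK> when it is in use, otherwise raise KeyError) can be sketched as below. `lookup` is a hypothetical function operating on a plain dictionary, not the Forte method itself.

```python
from typing import Dict, Optional


def lookup(reprs: Dict[str, int], element: str, unk: Optional[str] = "<UNK>") -> int:
    """Hypothetical sketch (not part of Forte) of the documented fallback."""
    if element in reprs:
        return reprs[element]
    if unk is not None and unk in reprs:
        # Unknown elements are directed to the <UNK> element.
        return reprs[unk]
    # Without <UNK>, a missing element is an error.
    raise KeyError(element)


with_unk = {"<PAD>": 0, "<UNK>": 1, "cat": 2}
known = lookup(with_unk, "cat")
unknown = lookup(with_unk, "zebra")
```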

to_dict()[source]

Create a dictionary from the vocabulary storing all the known elements.

Return type

Dict[~ElementType, Any]

Returns

The vocabulary as a Dict from ElementType to the representation of the element (could be Integer or One-hot vector, depending on the settings of this class).

has_element(element)[source]

This function checks whether an element is added to vocabulary.

Parameters

element (Union[~ElementType, str]) – The queried element.

Returns

Whether element is found.

Return type

bool

vocab_items()[source]

This function loops over the (element, id) pairs inside this class.

Returns

An iterable of (element, id) pairs.

Return type

Iterable[Tuple]

get_pad_value()[source]

This function gets the representation of the PAD element for the vocabulary. The representation depends on the settings of this class: it can be an integer or a list of integers (e.g. a vector).

Returns

The PAD element. Check the behavior of this function in the class documentation.

Return type

Union[None, int, List[int]]

filter(vocab_filter)[source]

This function creates a new vocabulary object based on the current vocabulary, with the elements rejected by vocab_filter removed (for example, elements that appear fewer times than a minimum count). Calling this function causes a full iteration over the vocabulary, so it should normally be called after the whole vocabulary has been collected from the dataset.

Parameters

vocab_filter (VocabFilter) – The filter used to filter the vocabulary.

Return type

Vocabulary

Returns

A new vocabulary after filtering.

VocabFilter

class forte.data.vocabulary.VocabFilter(vocab)[source]

Base class for vocabulary filters, which are used to implement constraints that select a subset of the vocabulary. For example, one can filter out vocabulary elements that occur less often than a certain frequency.

Parameters

vocab (Vocabulary) – The vocabulary object to be filtered.

filter(element_id)[source]

Given the element id, it will determine whether the element should be filtered out.

Parameters

element_id (int) – The element id to be checked.

Return type

bool

Returns

Whether the element should be filtered out.

FrequencyVocabFilter

class forte.data.vocabulary.FrequencyVocabFilter(vocab, min_frequency=-1, max_frequency=-1)[source]

A frequency-based filter. It filters out vocabulary elements that appear fewer than min_frequency times or more than max_frequency times. A check is skipped if its threshold value is negative.

Parameters
  • vocab (Vocabulary) – The vocabulary object.

  • min_frequency (int) – The min frequency threshold, default -1 (i.e. no frequency check for min).

  • max_frequency (int) – The max frequency threshold, default -1 (i.e. no frequency check for max).
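The threshold rule can be sketched as a plain predicate. `should_filter` is a hypothetical function, not the Forte implementation; it works on a raw count rather than an element id and leaves out any special handling of special tokens.

```python
from typing import Dict


def should_filter(count: int, min_frequency: int = -1, max_frequency: int = -1) -> bool:
    """Hypothetical sketch (not part of Forte): True means "filter out"."""
    # A negative threshold disables the corresponding check.
    if min_frequency >= 0 and count < min_frequency:
        return True
    if max_frequency >= 0 and count > max_frequency:
        return True
    return False


counts: Dict[str, int] = {"the": 50, "cat": 3, "zyx": 1}
kept = [word for word, count in counts.items() if not should_filter(count, 2, 10)]
kept_by_default = [word for word, count in counts.items() if not should_filter(count)]
```

With both thresholds left at -1, nothing is filtered, which matches the defaults in the signature.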

filter(element_id)[source]

Given the element id, it will determine whether the element should be filtered out.

Parameters

element_id (int) – The element id to be checked.

Return type

bool

Returns

Whether the element should be filtered out.