Training System

Forte promotes the convention of separating data pre-processing (which is domain dependent) from the actual training process. This is done by creating an intermediate layer that extracts raw features from data packs. In this documentation, we will visit several components of this system, including:

  • Train Preprocessor that defines the structure of this process.

  • Extractor that converts data to features and back.

  • Converter that converts features into matrices.

  • Predictor that builds data packs from model outputs automatically.

  • Evaluator that conducts evaluation on the resulting pack.

Train Preprocessor

class forte.train_preprocessor.TrainPreprocessor(pack_iterator)[source]

TrainPreprocessor provides the functionality of doing pre-processing work including building vocabulary, extracting the features, batching and padding (optional). The processed data will be provided by its method get_train_batch_iterator(), which will return an iterator over the batch of pre-processed data. Please refer to the documentation of that method for how the pre-processing is done.

A main part of the TrainPreprocessor is that it maintains a list of extractors (BaseExtractor) that extract features. These can be provided either by calling the add_extractor function, or by passing a request through initialize, where the configuration under the request key will be used to create the extractor instances.

The parsed components will be stored, and can be accessed via the request property of this class.

Parameters

pack_iterator (Iterator[DataPack]) – An iterator of DataPack.

Note

For the request parameter, the user does not necessarily need to provide a converter. If no converter is specified, a default converter of type Converter will be used.

add_extractor(name, extractor, is_input, converter=None)[source]

Extractors can be added to the preprocessor directly via this method.

Parameters
  • name (str) – The name/identifier of this extractor; the name should be unique among extractors.

  • extractor (BaseExtractor) – The extractor instance to be added.

  • is_input (bool) – Whether this extractor will be used as input or output.

  • converter (Optional[Converter]) – The converter instance to be applied after running the extractor.

static default_configs()[source]

Returns a dictionary of default hyper-parameters.

{
    "preprocess": {
        "device": "cpu",
    },
    "dataset": DataPackDataset.default_hparams()
}

Here:

  • “preprocess.device”: The device of the produced batches. For GPU training, set to the current CUDA device.

  • “dataset”: This contains all the configurable options of DataPackDataset.

property request

A Dict containing all the information needed for the pre-processing. This is obtained by parsing the input request.

An example request is:

request = {
    "context_type": "ft.onto.base_ontology.Sentence",
    "schemes": {
        "text_tag": {
            "extractor": {
                "class_name":
                    "forte.data.extractor.AttributeExtractor",
                "config": {
                    # ... more configuration of the extractor
                }
            }
        },
        "ner_tag": {
            "extractor": {
                "class_name":
                    "forte.data.extractor.BioSeqTaggingExtractor",
                "config": {
                    # ... more configuration of the extractor
                }
            }
        }
    }
}

Here:

  • “context_type”: A class of type Annotation. Defines the granularity used to separate data into different groups. All extractors will operate based on this. For example, if context_type is Sentence, then the features of each extractor will represent the information of a sentence. If this value is None, then all extractors will operate on the whole data pack.

  • “schemes”: Dict. A Dict containing the information about the pre-processing. The keys are the tags provided by the input request; each value is a Dict containing the information needed to pre-process that feature.

  • “schemes.tag.extractor”: An instance of type BaseExtractor.

  • “schemes.tag.converter”: An instance of type Converter.

  • “schemes.tag.type”: Either TrainPreprocessor.DATA_INPUT or TrainPreprocessor.DATA_OUTPUT, denoting whether this feature is an input or output feature.

Return type

Dict

property device

The device of the produced batches. For GPU training, set to current CUDA device.

Return type

device

property config

A Config maintaining all the configurable options for this TrainPreprocessor.

Return type

HParams

get_train_batch_iterator()[source]

This method mainly has four steps:

  1. Iterate over DataPack via pack iterator

  2. Extract Feature from DataPack

  3. Batch Feature

  4. (optional) Pad a batch of Feature

It will return an iterator of a batch of pre-processed data.

Return type

Iterator[Batch]

Returns

An Iterator of type Batch

Please refer to collate() in DataPackDataset for details about its structure.
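
To make the steps above concrete, here is a minimal sketch of how a TrainPreprocessor is typically wired together. The pack iterator, the extractor configuration, and what is done with each batch are placeholders for your own pipeline; only the TrainPreprocessor calls shown (the constructor, add_extractor, and get_train_batch_iterator) come from the API documented here, and the extractor initialization mirrors the config keys documented in the Extractor section below.

from forte.train_preprocessor import TrainPreprocessor
from forte.data.extractors.attribute_extractor import AttributeExtractor

# Placeholder: an Iterator[DataPack], e.g. produced by a reader pipeline.
train_pack_iterator = ...

# Assumed extractor setup: extract the text attribute of Token entries.
text_extractor = AttributeExtractor()
text_extractor.initialize({
    "entry_type": "ft.onto.base_ontology.Token",
    "attribute": "text",
})

train_preprocessor = TrainPreprocessor(pack_iterator=train_pack_iterator)
train_preprocessor.add_extractor(
    name="text_tag", extractor=text_extractor, is_input=True
)

# Each yielded batch holds the converted (and optionally padded) features
# of the registered extractors; see collate() in DataPackDataset.
for batch in train_preprocessor.get_train_batch_iterator():
    ...  # feed the batch to the model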

Converter

class forte.data.converter.converter.Converter(config=None)[source]

This class has the functionality of converting a batch of Feature to a MatrixLike type which can be a Numpy array, a PyTorch Tensor, or a nested list.

It can also perform padding for the given batch of Feature if user requested it. Please refer to the request parameter in TrainPreprocessor for details.

Parameters

config (Union[Dict, HParams, None]) – An instance of Dict or Config that provides all configurable options. See default_configs() for available options and default values.

static default_configs()[source]

Returns a dictionary of default hyper-parameters.

{
    "to_numpy": True,
    "to_torch": True
}

Here:

  • “to_numpy”: bool. Whether to convert to numpy.ndarray. Default is True.

  • “to_torch”: bool. Whether to convert to torch.tensor. Default is True.

Note

If need_pad in forte.data.converter.Feature is False and to_numpy and to_torch is True, it will raise an exception if the target data cannot be converted to numpy.ndarray or torch.tensor.

Note

If need_pad in forte.data.converter.Feature is True and to_torch is True, to_torch will overwrite the effect of to_numpy.

convert(features)[source]

Convert a list of Features to matrix-like form, where:

  1. The outermost dimension will always be the batch dimension (i.e. len(output) equals the number of input features).

  2. The type can be:

    2.1 A List of primitive int or another List

    2.2 A numpy.ndarray

    2.3 A torch.Tensor

If need_pad in forte.data.converter.Feature is True, it will pad all features with given pad_value stored inside forte.data.converter.Feature.

If to_numpy is True, it will try to convert data into numpy.ndarray.

If to_torch is True, it will try to convert data into torch.tensor.

Parameters

features (List[Feature]) – A list of forte.data.converter.Feature

Return type

Tuple[Union[TensorType, ndarray, List], Sequence[Union[TensorType, ndarray, List]]]

Returns

A Tuple containing two elements.

1. The first element is a MatrixLike type representing the batch of data.

2. The second element is a list of MatrixLike types representing the masks along different feature dimensions.

Example 1:

data = [[1,2,3], [4,5], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_numpy": True,
                       "to_torch": False})

output_data, masks = converter.convert(features)

# output_data is:
# np.array([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=np.long)

# masks is:
# [
#     np.array([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
#              dtype=np.bool)
# ]

Example 2:

data = [[[1,2,3], [4,5]], [[3]]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 2,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_numpy": True,
                       "to_torch": False})

output_data, masks = converter.convert(features)

# output_data is:
# np.array([[[1,2,3], [4,5,0]], [[3,0,0], [0,0,0]]],
#          dtype=np.long)


# masks is:
# [
#     np.array([[1,1], [1,0]], dtype=np.bool),
#     np.array([[[1,1,1], [1,1,0]],
#              [[1,0,0], [0,0,0]]], dtype=np.bool)
# ]

Example 3:

data = [[1,2,3,0], [4,5,0,0], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": False,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter()

output_data, _ = converter.convert(features)

# output_data is:
# torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)

Example 4:

data = [[1,2,3], [4,5], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_torch": True})

output_data, masks = converter.convert(features)

# output_data is:
# torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)

# masks is:
# [
#     torch.tensor([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
#                  dtype=torch.bool)
# ]

Feature

class forte.data.converter.feature.Feature(data, metadata, vocab=None)[source]

This class represents a type of feature for a single data instance. The Feature can be multiple dimensions. It has methods to do padding and retrieve the actual multi-dimension data.

Parameters
  • data (List) – A list of features, where each feature can be the value or another list of features. Typically this should be the output from extract() in BaseExtractor.

  • metadata (Dict) –

    A dictionary storing meta-data for this feature. Mandatory fields include: dim, dtype.

    • dim indicates the total number of dimension for this feature.

    • dtype is the value type. For example, it can be torch.long.

  • vocab (Optional[Vocabulary]) – An optional field recording the Vocabulary used to build this feature.

Please refer to data() for the typical usage of this class.

property leaf_feature

Returns: True if current feature is leaf feature. Otherwise, False.

Return type

bool

property dtype

Returns: The data type of this feature.

property sub_features

Returns: A list of sub features. Raises an exception if the current feature is a leaf feature.

Return type

List[Feature]

property meta_data

Returns: A Dict of meta data describing this feature.

Return type

Dict

property vocab

Returns: The Vocabulary used to build this feature.

Return type

Optional[Vocabulary]

property dim

Returns: The dimension of this feature.

Return type

int

property need_pad

Returns: Whether the Feature need to be padded.

Return type

bool

pad(max_len)[source]

Pad the current feature dimension with the given max_len. It will use pad_value to do the padding.

Parameters

max_len (int) – The padded length.

property data

It will return the actual data stored. Internally, it will recursively retrieve data from inner dimension features. Meanwhile, it will also return a list of masks representing the mask along different dimensions.

Return type

Tuple[List[Any], List[Any]]

Returns

A Tuple where

The first element is the actual data representing this feature.

The second element is a list of masks. masks[i] in this list represents the mask along i-th dimension.

Here are some examples for how the padding works:

Example 1 (1-dim feature, no padding):

data = [2,7,8]
meta_data = {
    "pad_value": 0,
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

data, masks = feature.data

# data is:
# [2,7,8]

# masks is:
# [
#   [1,1,1]
# ]

Example 2 (1-dim feature, scalar padding):

data = [2,7,8]
meta_data = {
    "pad_value": 0,
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

feature.pad(max_len=4)

data, masks = feature.data

# data is:
# [2,7,8,0]

# masks is:
# [
#   [1,1,1,0]
# ]

Example 3 (2-dim feature, scalar padding):

data = [[1,2,5], [3], [1,5]]
meta_data = {
    "pad_value": 0,
    "dim": 2,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

feature.pad(max_len=4)
for sub_feature in feature.sub_features:
    sub_feature.pad(max_len=3)

data, masks = feature.data

# data is:
# [[1,2,5], [3,0,0], [1,5,0], [0,0,0]]

# masks is:
# [
#   [1,1,1,0],
#   [[1,1,1], [1,0,0], [1,1,0], [0,0,0]]
# ]

Example 4 (1-dim feature, vector padding):

data = [[0,1,0],[1,0,0]]
meta_data = {
    "pad_value": [0,0,1],
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

feature.pad(max_len=3)

data, masks = feature.data

# data is:
# [[0,1,0], [1,0,0], [0,0,1]]

# masks is:
# [
#  [1,1,0]
# ]

Extractor

BaseExtractor

class forte.data.base_extractor.BaseExtractor[source]

The functionality of an Extractor is as follows. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

  1. Build vocabulary.

  2. Extract feature from datapack.

  3. Perform pre-evaluation action on datapack.

  4. Add prediction to datapack.

Explanation:

  • Vocabulary: The vocabulary is maintained as an attribute of the extractor. It stores the mapping from each element to an index (an integer) and to a representation, which could be an index integer or a one-hot vector depending on the configuration of the vocabulary. Check Vocabulary for more details.

  • Feature: A feature basically wraps the data we want from one instance in a datapack. For example, the instance can be one sentence in a datapack; the data wrapped by the feature could then be the token text of this sentence. The data is already converted to a list of indices using the vocabulary. Besides the data, other information such as the raw data before indexing and some meta_data will also be stored in the feature. Check Feature for more details.

  • Remove feature / Add prediction: Removing a feature means removing the existing data from the datapack; if we remove the feature from the pack, then extracting the feature will return an empty list. Adding a prediction means adding the prediction from the model back to the datapack. If a datapack has some old data (for example, the gold data in the test set), we can first remove that data and then add our model prediction to the pack.

config

An instance of Dict or Config that provides configurable options. See default_configs() for available options and default values.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here:

  • vocab_method (str): What type of vocabulary is used for this extractor. custom, indexing, and one-hot are supported; default is indexing. Check the behavior of the vocabulary under different settings in Vocabulary.

  • context_type (str): The fully qualified name of the context used to group the extracted features; for example, it could be ft.onto.base_ontology.Sentence. If this is None, features from the whole data pack will be grouped together. Default is None. This value could be mandatory for some processors, which will be documented and specified by the specific processor implementation.

  • vocab_use_unk (bool): Whether the <UNK> element should be added to the vocabulary. Default is True.

  • need_pad (bool): Whether the <PAD> element should be added to the vocabulary, and whether the feature needs to be batched and padded. Default is True.

  • pad_value (int): A customized value/representation to be used for padding. This value is only needed when need_pad is True. Default is None, where the value of padding is determined by the system.

  • unk_value (int): A customized value/representation to be used for unknown values (unk). This value is only needed when vocab_use_unk is True. Default is None, where the value of UNK is determined by the system.
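
Put together, the options above could look like the following configuration dict. This is a sketch: the values shown are the documented defaults, except context_type, which is set to Sentence here purely for illustration.

extractor_config = {
    "vocab_method": "indexing",    # "custom", "indexing", or "one-hot"
    "context_type": "ft.onto.base_ontology.Sentence",
    "vocab_use_unk": True,         # add <UNK> to the vocabulary
    "need_pad": True,              # add <PAD>; batch and pad the feature
    "pad_value": None,             # None: let the system pick the value
    "unk_value": None,             # None: let the system pick the value
}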

property vocab

Getter of the vocabulary class.

Returns: The vocabulary. None if the vocabulary is not set.

Return type

Optional[Vocabulary]

predefined_vocab(predefined)[source]

Populate the vocabulary with predefined values. You can also extend this method to customize the ways to handle the vocabulary.

Overwrite instruction:

  1. Take out elements from predefined.

  2. Modify the elements based on the needs of the extractor.

  3. Use add() function to add the element into vocabulary.

Parameters

predefined (Iterable) – A collection containing the elements to be added to the vocabulary.

update_vocab(pack, context=None)[source]

Populate the vocabulary needed by the extractor. This can be implemented by a specific extractor. The populated vocabulary can be used to map features/items to numeric representations. If you use a pre-specified vocabulary, you may not need to use this function.

Overwrite instructions:

  1. Get all entries of the type of interest, such as all the Token entries in the data pack.

  2. Use Vocabulary.add to add those elements into self._vocab.

Parameters
  • pack (DataPack) – The input data pack.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

abstract extract(pack, context=None)[source]

This method should be implemented to extract features from a datapack.

Parameters
  • pack (DataPack) – The input data pack that contains the features.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Returns: Features inside this instance, stored as a Feature instance.

Return type

Feature

pre_evaluation_action(pack, context)[source]

This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. For example, you can remove entries or remove some attributes of the entry. By default, this function will not do anything.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • context (Optional[Annotation]) – The context is an Annotation entry where data are extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

add_to_pack(pack, predictions, context=None)[source]

Add prediction of a model (normally in the form of a tensor) back to the pack. This function should have knowledge of the structure of the prediction to correctly populate the data pack values.

This function can be roughly considered as the reverse operation of extract().

Overwrite instruction:

  1. Get all entries from one instance in the pack.

  2. Convert predictions into elements that need to be assigned to entries. You can use id2element() to convert integers in the prediction into elements via the vocabulary maintained by the extractor.

  3. Add the elements to the corresponding entries as needed.

Parameters
  • pack (DataPack) – The datapack to add predictions back.

  • predictions (Any) – This is the output of the model, the format of which will be determined by the predict function defined in the Predictor.

  • context (Optional[Annotation]) – The context is an Annotation entry where predictions will be added to. This has the same meaning with context as in extract(). If None, then the whole data pack will be used as the context. Default is None.

AttributeExtractor

class forte.data.extractors.attribute_extractor.AttributeExtractor[source]

AttributeExtractor extracts features from the attributes of entries. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here:

  • “attribute”: str. The name of the attribute we want to extract from the entry. This attribute should be present in the entry definition. There are some built-in attributes for certain entry types, such as text for Annotation entries; tid is available for all entries. The default value is tid.

  • “entry_type”: str. The fully qualified name of the entry to extract attributes from. The default value is None, but this value must be present or a ProcessorConfigError will be raised.

update_vocab(pack, context=None)[source]

Get all attributes of one instance and add them into the vocabulary.

Parameters
  • pack (DataPack) – The data pack input to extract vocabulary.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

extract(pack, context=None)[source]

Extract the attribute of an entry of the configured entry type. The entry type is specified via the extractor config entry_type.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Return type

Feature

Returns

Features (attributes) for the instances within the provided context; they will be converted to representations based on the vocabulary configuration.
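
As an illustration, the following sketch runs an AttributeExtractor over the Token text within each Sentence, first building the vocabulary and then extracting indexed features. Here train_packs is a placeholder iterable of DataPack objects, and the initialize call simply passes the configuration keys documented above.

from forte.data.extractors.attribute_extractor import AttributeExtractor
from ft.onto.base_ontology import Sentence

train_packs = ...  # placeholder: an iterable of DataPack

extractor = AttributeExtractor()
extractor.initialize({
    "entry_type": "ft.onto.base_ontology.Token",
    "attribute": "text",
})

# First pass: populate the vocabulary from the training packs.
for pack in train_packs:
    for sentence in pack.get(Sentence):
        extractor.update_vocab(pack, context=sentence)

# Second pass: extract one indexed Feature per sentence.
for pack in train_packs:
    for sentence in pack.get(Sentence):
        feature = extractor.extract(pack, context=sentence)
        data, masks = feature.data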

pre_evaluation_action(pack, context)[source]

This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove all attributes defined in the config (set them to None). You can overwrite this function by yourself.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • context (Optional[Annotation]) – The context is an Annotation entry where data are extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

add_to_pack(pack, predictions, context=None)[source]

Add the prediction for attributes to the data pack. We assume the number of predictions in the iterable to be the same as the number of the entries of the defined type in the data pack.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • predictions (Iterable[SupportsInt]) – This is the output of the model, which should be the class index for the attribute.

  • context (Optional[Annotation]) – The context is an Annotation entry where predictions will be added to. This has the same meaning with context as in extract(). If None, then the whole data pack will be used as the context. Default is None.

LinkExtractor

class forte.data.extractors.relation_extractor.LinkExtractor[source]

This extractor extracts relation type features from data packs. This extractor expects the parent and child of the relation to be Annotation entries.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here:

  • “entry_type”: The target relation entry type, which should be a Link entry.

  • “attribute”: The attribute of the relation to extract.

  • “index_annotation”: The annotation type used to index the parent and child nodes of the relations.
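
For example, a configuration for extracting semantic role links could look like the sketch below; PredicateLink and its arg_type attribute are illustrative choices rather than the only valid ones.

link_extractor_config = {
    "entry_type": "ft.onto.base_ontology.PredicateLink",
    "attribute": "arg_type",
    "index_annotation": "ft.onto.base_ontology.Token",
}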

update_vocab(pack, context=None)[source]

Add the values of the relation attribute to the vocabulary.

Parameters
  • pack (DataPack) – The input data pack.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Returns

None

extract(pack, context=None)[source]

Extract link data as features from the context.

Parameters
  • pack (DataPack) – The input data pack that contains the features.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Return type

Feature

add_to_pack(pack, predictions, context=None)[source]

Convert prediction back to Links inside the data pack.

Parameters
  • pack (DataPack) – The datapack to add predictions back.

  • predictions (List[Tuple[Tuple[int, int], Tuple[int, int], int]]) – This is the output of the model. Each prediction is a triplet: the first element is the parent, the second element is the child (both indexed by the index_annotation of this extractor), and the last element is the index of the relation attribute.

  • context (Optional[Annotation]) – The context is an Annotation entry where predictions will be added to. This has the same meaning with context as in extract(). If None, then the whole data pack will be used as the context. Default is None.

SubwordExtractor

class forte.data.extractors.subword_extractor.SubwordExtractor[source]

SubwordExtractor extracts features from the subwords of entries. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

Parameters

config – An instance of Dict or Config.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here:

  • “pretrained_model_name”: The name of the pretrained BERT model. Must be the same as the one used by the subword tokenizer.

  • “subword_class”: The fully qualified name of the class of the subword; default is ft.onto.base_ontology.Subword.
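
A configuration sketch using the two options above; the model name is an illustrative assumption and must match the one used by the subword tokenizer in your pipeline.

subword_extractor_config = {
    "pretrained_model_name": "bert-base-uncased",
    "subword_class": "ft.onto.base_ontology.Subword",
}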

extract(pack, context=None)[source]

Extract the subword feature of one instance.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Returns

a feature that contains the extracted data.

Return type

Feature

CharExtractor

class forte.data.extractors.char_extractor.CharExtractor[source]

CharExtractor extracts features from the text of entries. The text will be split into characters.

classmethod default_configs()[source]

Returns a dictionary of default configuration parameters.

Here:

  • “max_char_length”: int The maximum number of characters for one token in the text, default is None, which means no limit will be set.

  • “entry_type”: str The fully qualified name of an annotation type entry. Characters will be extracted based on these entries. Default is Token, which means characters of tokens will be extracted.
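
A configuration sketch using the two options above; the character limit is an arbitrary illustrative value.

char_extractor_config = {
    "max_char_length": 45,
    "entry_type": "ft.onto.base_ontology.Token",
}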

update_vocab(pack, context=None)[source]

Add all characters to the vocabulary.

Parameters
  • pack (DataPack) – The input data pack.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

extract(pack, context=None)[source]

Extract the character feature of one instance.

Parameters
  • pack (DataPack) – The datapack to extract features from.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Return type

Feature

Returns

A feature that contains the characters of each specified annotation.

BioSeqTaggingExtractor

class forte.data.extractors.seqtagging_extractor.BioSeqTaggingExtractor[source]

BioSeqTaggingExtractor extracts features by performing BIO encoding on the attribute of an entry and aligning the tags to the tagging_unit entries. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

initialize(config)[source]

Initialize the extractor based on the provided configuration.

Parameters

config (Union[Dict, HParams]) – The configuration of the extractor, it can be a Dict or Config. See default_configs() for available options and default values.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here, additional parameters are added from the parent class:

  • entry_type (str): Required. The fully qualified name of an Annotation entry to extract attribute from. For example, for an NER task, it could be ft.onto.base_ontology.EntityMention.

  • attribute (str): Required. The attribute name of the entry from which labels are extracted.

  • tagging_unit (str): Required. The fully qualified name of the unit entries for tagging; the tagging labels will align to these units, e.g. ft.onto.base_ontology.Token.

  • pad_value (int): A customized value/representation to be used for padding. This value is only needed when need_pad is True. Default is -100, following the PyTorch convention.

  • is_bert (bool): Indicates whether a BERT model is used. If True, padding will be added to the beginning and end of a sentence, corresponding to the special tokens ([CLS], [SEP]) used in BERT. Default is False.

For example, the config can be:

{
    "entry_type": "ft.onto.base_ontology.EntityMention",
    "attribute": "ner_type",
    "tagging_unit": "ft.onto.base_ontology.Token"
}

The extractor will extract the BIO NER tags for instances. A possible feature can be:

[[None, "O"], ["LOC", "B"], ["LOC", "I"], [None, "O"],
[None, "O"], ["PER", "B"], [None, "O"]]

predefined_vocab(predefined)[source]

Add predefined tags into the vocabulary, i.e. one can construct the tag vocabulary without exploring the training data.

Parameters

predefined (Iterable) – A set of pre-defined tags.

update_vocab(pack, context=None)[source]

Add all the tags from one instance into the vocabulary.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

extract(pack, context=None)[source]

Extract the sequence tagging feature of one instance. If the vocabulary of this extractor is set, then the extracted tag sequences will be converted to the tag ids (int).

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Returns (Feature): A feature that contains the extracted BIO sequence and other metadata.

Return type

Feature

pre_evaluation_action(pack, context=None)[source]

This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove tags in the instance. You can overwrite this function by yourself.

Parameters
  • pack (DataPack) – The datapack to be processed.

  • context (Optional[Annotation]) – The context is an Annotation entry where data are extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

add_to_pack(pack, predictions, context=None)[source]

Add the prediction results to the data pack. We make the following assumptions about the predictions:

  1. If we encounter “I” while its tag is different from the previous tag, we will consider this “I” as a “B” and start a new tag here.

  2. We will truncate the prediction according to the number of entries. If the prediction contains <PAD> elements, they will be removed.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • predictions (List[int]) – This is the output of the model, which contains the index for attributes of one instance.

  • context (Optional[Annotation]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.

Predictor

class forte.processors.base.batch_processor.Predictor[source]

Predictor is a special type of batch processor that uses extractors (BaseExtractor) to collect features from data packs and to write the predictions back.

Predictor extends the PackingBatchProcessor class and implements the predict and pack functions using the extractors.

add_extractor(name, extractor, is_input, converter=None)[source]

Extractors can be added to the preprocessor directly via this method.

Parameters
  • name (str) – The name/identifier of this extractor; the name should be unique among extractors.

  • extractor (BaseExtractor) – The extractor instance to be added.

  • is_input (bool) – Whether this extractor will be used as input or output.

  • converter (Optional[Converter]) – The converter instance to be applied after running the extractor.

Returns

None

classmethod define_batcher()[source]

Define a specific batcher for this processor. The single-pack BaseBatchProcessor initializes the batcher to be a ProcessingBatcher, while MultiPackBatchProcessor initializes a multi-pack batcher accordingly.

Return type

ProcessingBatcher

classmethod default_configs()[source]

Defines the default configs for batching processor.

Return type

Dict[str, Any]

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and global resources are registered in resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

pack(pack, predict_results, context=None)[source]

The function that task processors should implement. It is the custom function defining how to add the predicted output back to the data pack.

Parameters
  • pack (~PackType) – The pack to add entries or fields to.

  • predict_results (Dict) – The prediction results returned by predict(). This processor will add these results to the provided pack as entry and attributes.

  • context (Optional[Annotation]) – The context entry that the prediction is performed, and the pack operation should be performed related to this range annotation. If None, then we consider the whole data pack is used as the context.

predict(data_batch)[source]

The function that task processors should implement. Make predictions for the input data_batch.

Parameters

data_batch (Dict) – A batch of instances in our dict format.

Return type

Dict

Returns

The prediction results in dict format.
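
The sketch below shows what a concrete Predictor subclass could look like, assuming a PyTorch sequence-tagging model. The batch layout (a "text_tag" entry exposing a "data" tensor), the "ner_tag" output key, and the constructor signature are assumptions chosen to mirror the extractor tags used earlier in this section; the actual layout is determined by the registered extractors and converters.

import torch
from forte.processors.base.batch_processor import Predictor


class TaggingPredictor(Predictor):
    """A hypothetical predictor wrapping a PyTorch sequence tagger."""

    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def predict(self, data_batch):
        # Assumed layout: the converted feature of the "text_tag" extractor.
        inputs = data_batch["text_tag"]["data"]
        with torch.no_grad():
            logits = self.model(inputs)
        # Return one tag id per token under the output extractor's tag name;
        # the framework passes this dict on to pack() / add_to_pack().
        return {"ner_tag": logits.argmax(dim=-1).tolist()}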

Evaluation

Base Evaluator

class forte.evaluation.base.base_evaluator.Evaluator[source]

The base class of the evaluator.

abstract consume_next(pred_pack, ref_pack)[source]

The actual consume function that will be called by the pipeline. This function will deal with the basic pipeline status and call the consume_next function.

Parameters
  • pred_pack (~PackType) – The prediction pack, which should contain the system predicted results.

  • ref_pack (~PackType) – The reference pack, which should contain the reference to score on.

abstract get_result()[source]

The evaluator gathers the results, and the score should be obtained here.

Return type

Any

expected_types_and_attributes(pred_pack_expectation, ref_pack_expectation)[source]

If the evaluator requires certain input types and attributes for pred_pack or ref_pack, the user can specify them with this function.

Parameters
  • pred_pack_expectation (Dict[str, Set[str]]) – The expected types and attributes of prediction pack.

  • ref_pack_expectation (Dict[str, Set[str]]) – The expected types and attributes of reference pack.

pred_pack_record(record_meta)[source]

Method to add the output type record of the prediction datapack of the current processor to forte.data.base_pack.BaseMeta.record.

Parameters

record_meta (Dict[str, Set[str]]) – The field in the datapack for type records that needs to be filled in for consistency checking.

ref_pack_record(record_meta)[source]

Method to add the output type record of the reference datapack of the current processor to forte.data.base_pack.BaseMeta.record.

Parameters

record_meta (Dict[str, Set[str]]) – The field in the datapack for records that needs to be filled in for consistency checking.

check_record(pred_pack, ref_pack)[source]

Method to check type consistency if enforce_consistency is enabled for the pipeline. If any expected type or its attribute does not exist in the pred_pack or ref_pack record of the previous pipeline component, an error of ExpectedRecordNotFound will be raised.

Parameters
  • pred_pack (~PackType) – The prediction pack, which should contain the system predicted results.

  • ref_pack (~PackType) – The reference pack, which should contain the reference to score on.

writes_record(pred_pack, ref_pack)[source]

Method to write records of the output type of the current processor to the datapack. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

Parameters
  • pred_pack (~PackType) – The prediction pack, which should contain the system predicted results.

  • ref_pack (~PackType) – The reference pack, which should contain the reference to score on.

Task Evaluators

class forte.evaluation.ner_evaluator.CoNLLNEREvaluator[source]

initialize(resources, configs)[source]

Initialize the evaluator with resources and configs. This method is called by the pipeline during the initialization.

Parameters
  • resources (Resources) – An object of class Resources that holds references to objects that can be shared throughout the pipeline.

  • configs (HParams) – A configuration to initialize the evaluator. This evaluator expects the following (key, value) pairs: “entry_type” (str), the entry to be evaluated; “tagging_unit” (str), the tagging unit that the evaluation is performed on, e.g. “ft.onto.base_ontology.Sentence”; “attribute” (str), the attribute of the entry to be evaluated.

classmethod default_configs()[source]

Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.
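
A configuration sketch using the keys documented above; the entry and attribute mirror the BioSeqTaggingExtractor example earlier in this section and are illustrative assumptions rather than the only valid values.

ner_evaluator_config = {
    "entry_type": "ft.onto.base_ontology.EntityMention",
    "attribute": "ner_type",
    "tagging_unit": "ft.onto.base_ontology.Sentence",
}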

consume_next(pred_pack, ref_pack)[source]

The actual consume function that will be called by the pipeline. This function will deal with the basic pipeline status and call the consume_next function.

Parameters
  • pred_pack (DataPack) – The prediction pack, which should contain the system predicted results.

  • ref_pack (DataPack) – The reference pack, which should contain the reference to score on.

get_result()[source]

The evaluator gathers the results, and the score should be obtained here.

Return type

Dict