Training System

Forte advocates separating the (domain-dependent) data preprocessing from the actual training process. This is done by creating an intermediate layer that extracts raw features from data packs. In this documentation, we will visit several components of this system, including:

  • Train Preprocessor that defines the structure of this process.

  • Extractor that converts between data and features, in both directions.

  • Converter that converts features into tensors (matrices).

  • Predictor that automatically builds data packs from model outputs.

  • Evaluator that conducts evaluation on the resulting pack.

Train Preprocessor

class forte.train_preprocessor.TrainPreprocessor(pack_iterator, request, config=None)[source]

TrainPreprocessor provides the pre-processing functionality, which includes building the vocabulary, extracting the features, batching, and padding (optional). The main functionality is provided by its method get_train_batch_iterator(), which returns an iterator over batches of preprocessed data. Please refer to the documentation of that method for how the pre-processing is done.

TrainPreprocessor will maintain a Config that stores all the configurable parameters for various components.

TrainPreprocessor will also accept a user request. Internally it will parse this user request and store the parsed result.

Parameters
  • pack_iterator (Iterator[DataPack]) – An iterator of DataPack.

  • request (Dict) – A request that specifies how to do train pre-processing. Please refer to request() for details.

  • config – A Dict or Config that configures this preprocessor. See default_configs() for the defaults.

Note

For the request parameter, the user does not necessarily need to provide a converter. If no converter is specified, a default converter of type Converter will be used.

static default_configs()[source]

Returns a dictionary of default hyper-parameters.

{
    "preprocess": {
        "device": "cpu",
    },
    "dataset": DataPackDataset.default_hparams()
}

Here:

“preprocess.device”:

The device of the produced batches. For GPU training, set to current CUDA device.

“dataset”:

This contains all the configurable options of DataPackDataset.

property request

A Dict containing all the information needed for doing the pre-processing. This is obtained by parsing the input request.

An example request is:

request = {
    "scope": ft.onto.Sentence,
    "schemes": {
        "text_tag": {
            "extractor": forte.data.extractor.AttributeExtractor,
            "converter": forte.data.converter.Converter,
            "type": TrainPreprocessor.DATA_INPUT,
        },
        "char_tag": {
            "extractor": forte.data.extractor.CharExtractor,
            "converter": forte.data.converter.Converter,
            "type": TrainPreprocessor.DATA_INPUT,
        },
        "ner_tag": {
            "extractor":
                forte.data.extractor.BioSeqTaggingExtractor,
            "converter": forte.data.converter.Converter,
            "type": TrainPreprocessor.DATA_OUTPUT,
        },
    },
}

Here:

“scope”: Entry

A class of type Entry. The granularity used to separate data into different examples. For example, if the scope is Sentence, then each training example will represent the information of one sentence.

“schemes”: Dict

A Dict containing the information about the pre-processing. Each key is a tag provided by the input request; each value is a Dict containing the information for pre-processing that feature.

“schemes.tag.extractor”: Extractor

An instance of type BaseExtractor.

“schemes.tag.converter”: Converter

An instance of type Converter.

“schemes.tag.type”: TrainPreprocessor.DATA_INPUT/DATA_OUTPUT

Denotes whether this feature is an input or output feature.

property device

The device of the produced batches. For GPU training, set to current CUDA device.

property config

A Config maintaining all the configurable options for this TrainPreprocessor.

get_train_batch_iterator()[source]

This method mainly has four steps:

  1. Iterate over DataPack via pack iterator

  2. Extract Feature from DataPack

  3. Batch Feature

  4. (optional) Pad a batch of Feature

It returns an iterator over batches of preprocessed data.

Returns

An Iterator of type Batch

Please refer to collate() in DataPackDataset for details about its structure.
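
Putting these pieces together, a typical training loop might look like the following minimal sketch (pack_iterator, request, model, num_epochs, and train_step are placeholders, not Forte APIs; the request format is described under request() above):

train_preprocessor = TrainPreprocessor(
    pack_iterator=pack_iterator,  # an Iterator[DataPack], e.g. from a reader
    request=request,              # see the request property above
    config={"preprocess": {"device": "cpu"}},
)

for epoch in range(num_epochs):
    # Each batch follows the structure produced by
    # DataPackDataset.collate().
    for batch in train_preprocessor.get_train_batch_iterator():
        train_step(model, batch)  # placeholder training step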

Converter

class forte.data.converter.converter.Converter(config)[source]

This class has the functionality of converting a batch of Feature to a PyTorch tensor. It can also pad the given batch of Feature if the user requests it. Please refer to the request parameter in TrainPreprocessor for details.

Parameters

config – An instance of Dict or Config that provides all configurable options. See default_configs() for available options and default values.

static default_configs()[source]

Returns a dictionary of default hyper-parameters.

{
    "to_numpy": True,
    "to_torch": True
}

Here:

“to_numpy”: bool

Whether to convert to numpy.ndarray. Default is True.

“to_torch”: bool

Whether to convert to torch.Tensor. Default is True.

Note

If need_pad in forte.data.converter.Feature is False and to_numpy or to_torch is True, an exception will be raised if the target data cannot be converted to a numpy.ndarray or torch.Tensor.

Note

If need_pad in forte.data.converter.Feature is True and to_torch is True, to_torch will overwrite the effect of to_numpy.

convert(features)[source]

Convert a list of Features to actual data, where:

1. The outermost dimension will always be the batch dimension (i.e., len(output) == len(features)).

2. The type can be:

    2.1 A List of primitive int or another List

    2.2 A numpy.ndarray

    2.3 A torch.Tensor

If need_pad in forte.data.converter.Feature is True, it will pad all features with given pad_value stored inside forte.data.converter.Feature.

If to_numpy is True, it will try to convert data into numpy.ndarray.

If to_torch is True, it will try to convert data into torch.tensor.

Parameters

features (List[Feature]) – A list of forte.data.converter.Feature

Returns

A Tuple containing two elements.

1. The first element is either a List, a numpy.ndarray, or a torch.Tensor representing the batch of data.

2. The second element is a List or numpy.ndarray representing masks along different feature dimensions.

Example 1:

data = [[1,2,3], [4,5], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_numpy": True,
                       "to_torch": False})

output_data, masks = converter.convert(features)

# output_data is:
# np.array([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=np.long)

# masks is:
# [
#     np.array([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
#              dtype=np.bool)
# ]

Example 2:

data = [[[1,2,3], [4,5]], [[3]]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 2,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_numpy": True,
                       "to_torch": False})

output_data, masks = converter.convert(features)

# output_data is:
# np.array([[[1,2,3], [4,5,0]], [[3,0,0], [0,0,0]]],
#          dtype=np.long)


# masks is:
# [
#     np.array([[1,1], [1,0]], dtype=np.bool),
#     np.array([[[1,1,1], [1,1,0]],
#              [[1,0,0], [0,0,0]]], dtype=np.bool)
# ]

Example 3:

data = [[1,2,3,0], [4,5,0,0], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": False,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({})  # defaults: to_numpy=True, to_torch=True

output_data, _ = converter.convert(features)

# output_data is:
# torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)

Example 4:

data = [[1,2,3], [4,5], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_torch": True})

output_data, masks = converter.convert(features)

# output_data is:
# torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)

# masks is:
# [
#     torch.tensor([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
#                  dtype=torch.bool)
# ]

Extractor

BaseExtractor

class forte.data.extractor.base_extractor.BaseExtractor(config)[source]

The functionality of Extractor is as follows. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

  1. Build vocabulary.

  2. Extract feature from datapack.

  3. Perform pre-evaluation action on datapack.

  4. Add prediction to datapack.

Explanation:

Vocabulary:

The vocabulary is maintained inside the extractor. It stores the mapping from each element to an index (an integer) and to a representation, which could be the index integer or a one-hot vector depending on the configuration of the vocabulary. Check Vocabulary for more details.

Feature:

A feature basically wraps the data we want from one instance in a datapack. For example, the instance can be one sentence in a datapack, and the data wrapped by the feature could be the token texts of this sentence. The data is converted to a list of indices using the vocabulary. Besides the data, other information such as the raw data before indexing and some meta data will also be stored in the feature. Check Feature for more details.

Remove feature / Add prediction:

Removing a feature means removing the existing data from the datapack; once a feature is removed from the pack, extracting that feature will return an empty list. Adding a prediction means adding the model's prediction back to the datapack. If a datapack contains old data (for example, the gold data in a test set), we can first remove that data and then add our model's predictions to the pack.

Parameters

config – An instance of Dict or Config that provides all configurable options. See default_configs() for available options and default values. entry_type is a required key and has no default value.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here:

“entry_type” (Type[Entry])

Required. The ontology type that the extractor will extract features from.

“vocab_method” (str)

The type of vocabulary used for this extractor. raw, indexing, and one-hot are supported; the default is indexing. Check the behavior of the vocabulary under different settings in Vocabulary.

“need_pad” (bool)

Whether the <PAD> element should be added to the vocabulary, and whether the feature needs to be batched and padded. Default is True.

“vocab_use_unk” (bool)

Whether the <UNK> element should be added to the vocabulary. Default is True.
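
For example, a full config for a concrete extractor might look like the following sketch (Token is only an illustrative entry type; any Entry subclass works):

from ft.onto.base_ontology import Token

config = {
    "entry_type": Token,         # required, no default value
    "vocab_method": "indexing",  # "raw", "indexing", or "one-hot"
    "need_pad": True,
    "vocab_use_unk": True,
}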

property vocab

Getter of the vocabulary class.

Returns: The vocabulary. None if the vocabulary is not set.

predefined_vocab(predefined)[source]

Populate the vocabulary with predefined values. You can also extend this method to customize the ways to handle the vocabulary.

Overwrite instruction:

  1. Take out elements from predefined.

  2. Make modifications to the elements based on the needs of the extractor.

  3. Use add() function to add the element into vocabulary.

Parameters

predefined (Iterable) – A collection that contains the elements to be added into the vocabulary.

update_vocab(pack, instance)[source]

Populate the vocabulary by taking the elements from one instance. For example, when the instance is a Sentence and we want to add all Tokens from the sentence into the vocabulary, we would call this function.

If you use a pre-specified vocabulary, you may not need to use this function.

Overwrite instructions:

  1. Get all entries from one instance in the pack. You would probably use the pack.get function to acquire the Entry objects that you need.

  2. Get elements that are needed from entries. This process will be very different for different extractors. For example, you might want to get the token text from one sentence. Or you might want to get the tags for a sequence for one sentence.

  3. Use add() to add those elements into the vocabulary.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will get elements.
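
A minimal sketch of an overridden update_vocab following these steps, inside a custom extractor subclass, assuming the instance is a Sentence and the elements are Token texts:

from ft.onto.base_ontology import Token

def update_vocab(self, pack, instance):
    # Step 1: get the Token entries covered by this instance;
    # Step 2: take their text as the elements;
    # Step 3: add each element into the vocabulary.
    for token in pack.get(Token, instance):
        self.add(token.text)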

extract(pack, instance)[source]

Extract the feature for one instance in a pack.

Overwrite instruction:

  1. Get all entries from one instance in the pack.

  2. Get elements that are needed from entries. For example, the token text or sequence tags.

  3. Construct a feature and return it.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will extract the feature.

Returns (Feature):

A feature that contains the extracted data.
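
A matching extract might look like the sketch below, under the same Sentence/Token assumption. element2repr is assumed here to be the extractor's element-to-representation lookup (the forward counterpart of the id2element() mentioned under add_to_pack() below), and the meta_data values are illustrative:

import numpy as np
from forte.data.converter import Feature
from ft.onto.base_ontology import Token

def extract(self, pack, instance):
    # Map each token text to its vocabulary representation.
    data = [self.element2repr(token.text)
            for token in pack.get(Token, instance)]
    meta_data = {
        "pad_value": 0,  # illustrative pad value
        "dim": 1,
        "dtype": np.long,
        "need_pad": True,
    }
    return Feature(data, metadata=meta_data, vocab=self.vocab)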

pre_evaluation_action(pack, instance)[source]

This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. For example, you can remove entries or remove some attributes of the entry. By default, this function will not do anything.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance on which the extractor performs the pre-evaluation action.

add_to_pack(pack, instance, prediction)[source]

Add prediction of a model (normally in the form of a tensor) back to the pack. This function should have knowledge of the structure of the prediction to correctly populate the data pack values.

This function can be roughly considered as the reverse operation of extract().

Overwrite instruction:

  1. Get all entries from one instance in the pack.

  2. Convert prediction into elements that need to be assigned to entries. You might need to use id2element() to convert index in the prediction into element via the vocabulary maintained by the extractor.

  3. Add the element to corresponding entry based on the need.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance to which the extractor adds the prediction.

  • prediction (Any) – This is the output of the model, whose format is determined by the predict function that the user defines and passes into our framework.

AttributeExtractor

class forte.data.extractor.attribute_extractor.AttributeExtractor(config)[source]

AttributeExtractor extracts features from the attributes of entries. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here:

  • “attribute”: str The name of the attribute we want to extract from the entry, for example, the text attribute of Token. The default is text.
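
For instance, a sketch that extracts part-of-speech tags instead of the default text attribute (pos is the attribute name on Token in ft.onto.base_ontology):

from forte.data.extractor.attribute_extractor import AttributeExtractor
from ft.onto.base_ontology import Token

extractor = AttributeExtractor({
    "entry_type": Token,
    "attribute": "pos",  # extract the pos attribute instead of text
})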

update_vocab(pack, instance)[source]

Get all attributes of one instance and add them into the vocabulary.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will extract the feature.

extract(pack, instance)[source]

Extract attributes of one instance. For example, the text of tokens in one sentence.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will extract the feature.

Returns (Feature):

A feature that contains the extracted data.

pre_evaluation_action(pack, instance)[source]

This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove the attribute. You can overwrite this function yourself.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance on which the extractor performs the pre-evaluation action.

add_to_pack(pack, instance, prediction)[source]

Add the prediction for the attribute to the instance. If the prediction is an iterable object, we assume each element in the prediction corresponds to one entry. If the prediction is a single element, we assume there is only one entry in the instance.

Extending this class will need to handle the specific prediction data types. The default implementation assumes the data type is integer.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance to which the extractor adds the prediction.

  • prediction (Iterable[Union[int, Any]]) – This is the output of the model, which contains the index for attributes of one instance.

CharExtractor

class forte.data.extractor.char_extractor.CharExtractor(config)[source]

CharExtractor extracts features from the text of entries; the text will be split into characters. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

“max_char_length”: int

The maximum number of characters for one token in the text.
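
A sketch of a CharExtractor config (entry_type comes from the base class and is required; Token and the length limit are illustrative):

from forte.data.extractor.char_extractor import CharExtractor
from ft.onto.base_ontology import Token

char_extractor = CharExtractor({
    "entry_type": Token,
    "max_char_length": 25,  # cap each token at 25 characters
})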

update_vocab(pack, instance)[source]

Add all characters into the vocabulary.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will get text from.

extract(pack, instance)[source]

Extract the character feature of one instance.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will extract the feature.

Returns (Feature):

A feature that contains the extracted data.

BioSeqTaggingExtractor

class forte.data.extractor.seqtagging_extractor.BioSeqTaggingExtractor(config)[source]

BioSeqTaggingExtractor extracts the feature by performing BIO encoding on the attribute of the entry and aligning it to the tagging_unit entry. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

Parameters

config – An instance of Dict or Config. See default_configs() for available options and default values.

classmethod default_configs()[source]

Returns a dictionary of default hyper-parameters.

Here:

  • entry_type (Type[Entry]): Required. The ontology type that the extractor will extract features from.

  • attribute (str): Required. The attribute name of the entry from which labels are extracted.

  • tagging_unit (Type[Entry]): Required. The tagging label will align to the tagging_unit Entry.

For example, the config can be:

{
    "entry_type": EntityMention,
    "attribute": "ner_type",
    "tagging_unit": Token
}

The extractor will extract the BIO-NER style tags for instances. A possible feature can be:

[[None, "O"], ["LOC", "B"], ["LOC", "I"], [None, "O"],
[None, "O"], ["PER", "B"], [None, "O"]]
predefined_vocab(predefined)[source]

Add predefined tags into the vocabulary, i.e., one can construct the tag vocabulary without exploring the training data.

Parameters

predefined (Iterable[str]) – A set of pre-defined tags.
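
For example, reusing the config above, one can seed the tag vocabulary from a known label set (the CoNLL-style entity types are illustrative; the extractor is expected to combine them with the B/I/O scheme internally):

from forte.data.extractor.seqtagging_extractor import BioSeqTaggingExtractor
from ft.onto.base_ontology import EntityMention, Token

extractor = BioSeqTaggingExtractor({
    "entry_type": EntityMention,
    "attribute": "ner_type",
    "tagging_unit": Token,
})
extractor.predefined_vocab({"PER", "LOC", "ORG", "MISC"})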

update_vocab(pack, instance)[source]

Add all the tags from one instance into the vocabulary.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will extract the feature.

extract(pack, instance)[source]

Extract the sequence tagging feature of one instance. If the vocabulary of this extractor is set, then the extracted tag sequences will be converted to the tag ids (int).

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance from which the extractor will extract the feature.

Returns (Feature):

A feature that contains the extracted data.

pre_evaluation_action(pack, instance)[source]

This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove the tags in the instance. You can overwrite this function yourself.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance on which the extractor performs the pre-evaluation action.

add_to_pack(pack, instance, prediction)[source]

Add the prediction for the attribute to the instance. We make the following assumptions about the prediction.

  1. If we encounter “I” while its tag is different from the previous tag, we will consider this “I” as a “B” and start a new tag here.

  2. We will truncate the prediction according to the number of entries. If the prediction contains <PAD> elements, they will be removed.

Parameters
  • pack (DataPack) – The datapack that contains the current instance.

  • instance (Annotation) – The instance to which the extractor adds the prediction.

  • prediction (Iterable[Union[int, Any]]) – This is the output of the model, which contains the index for attributes of one instance.

Predictor

class forte.processors.base.batch_processor.Predictor[source]

This class is used to perform prediction on features extracted from the datapack and to add the predictions back to the datapack.

pack(pack, inputs)[source]

This function exists only for compatibility reasons; it is not actually used in this class.

static define_batcher()[source]

Define a specific batcher for this processor. A single-pack BatchProcessor initializes the batcher to be a ProcessingBatcher, and MultiPackBatchProcessor initializes the batcher to be a MultiPackProcessingBatcher.

classmethod default_configs()[source]

A default config contains the field for batcher.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (Config) – The configuration passed in to set up this component.

predict(data_batch)[source]

The function that task processors should implement. Make predictions for the input data_batch.

Parameters

data_batch (dict) – A batch of instances in our dict format.

Returns

The prediction results in dict format.
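
A sketch of what a task-specific predictor might look like (the model attribute, tag names, and tensor handling are assumptions for illustration, not Forte APIs):

from forte.processors.base.batch_processor import Predictor

class NerPredictor(Predictor):
    """A hypothetical NER predictor."""

    def predict(self, data_batch):
        # data_batch holds the converted tensors keyed by the scheme
        # tags from the request (e.g. "text_tag"); self.model is an
        # assumed attribute holding the trained model.
        logits = self.model(data_batch["text_tag"])
        preds = logits.argmax(dim=-1)
        # Key the predictions by the output tag so the framework can
        # add them back to the pack via the extractor.
        return {"ner_tag": preds}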

Feature

class forte.data.converter.Feature(data, metadata, vocab=None)[source]

This class represents a type of feature for a single data instance. A Feature can have multiple dimensions. It has methods to do padding and to retrieve the actual multi-dimensional data.

Parameters
  • data (list) – A list of features, where each feature can be the value or another list of features. Typically this should be the output from extract() in BaseExtractor.

  • metadata (dict) –

    A dictionary storing meta-data for this feature. Mandatory fields include: dim, dtype.

    dim indicates the total number of dimensions of this feature.

    dtype is the value type. For example, it can be torch.long.

  • vocab (Vocabulary) – An optional field: the Vocabulary used to build this feature.

Please refer to data() for the typical usage of this class.

property leaf_feature

Returns: True if the current feature is a leaf feature; otherwise, False.

property dtype

Returns: The data type of this feature.

property sub_features

Returns: A list of sub features. Raises an exception if the current feature is a leaf feature.

property meta_data

Returns: A Dict of meta data describing this feature.

property vocab

Returns: The Vocabulary used to build this feature.

property dim

Returns: The dimension of this feature.

property need_pad

Returns: Whether the Feature needs to be padded.

pad(max_len)[source]

Pad the current feature dimension with the given max_len. It will use pad_value to do the padding.

Parameters

max_len (int) – The padded length.

property data

It will return the actual data stored. Internally, it will recursively retrieve data from inner dimension features. Meanwhile, it will also return a list of masks representing the mask along different dimensions.

Returns

A Tuple where

The first element is the actual data representing this feature.

The second element is a list of masks; masks[i] in this list represents the mask along the i-th dimension.

Here are some examples for how the padding works:

Example 1 (1-dim feature, no padding):

data = [2,7,8]
meta_data = {
    "pad_value": 0,
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

data, masks = feature.data

# data is:
# [2,7,8]

# masks is:
# [
#   [1,1,1]
# ]

Example 2 (1-dim feature, scalar padding):

data = [2,7,8]
meta_data = {
    "pad_value": 0,
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

feature.pad(max_len=4)

data, masks = feature.data

# data is:
# [2,7,8,0]

# masks is:
# [
#   [1,1,1,0]
# ]

Example 3 (2-dim feature, scalar padding):

data = [[1,2,5], [3], [1,5]]
meta_data = {
    "pad_value": 0,
    "dim": 2,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

feature.pad(max_len=4)
for sub_feature in feature.sub_features:
    sub_feature.pad(max_len=3)

data, masks = feature.data

# data is:
# [[1,2,5], [3,0,0], [1,5,0], [0,0,0]]

# masks is:
# [
#   [1,1,1,0],
#   [[1,1,1], [1,0,0], [1,1,0], [0,0,0]]
# ]

Example 4 (1-dim feature, vector padding):

data = [[0,1,0],[1,0,0]]
meta_data = {
    "pad_value": [0,0,1],
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)

feature.pad(max_len=3)

data, masks = feature.data

# data is:
# [[0,1,0], [1,0,0], [0,0,1]]

# masks is:
# [
#  [1,1,0]
# ]

Evaluation

Base Evaluator

class forte.evaluation.base.base_evaluator.Evaluator[source]

The base class of the evaluator.

abstract consume_next(pred_pack, ref_pack)[source]

Consume the prediction pack and the reference pack to compute evaluation results.

Parameters
  • pred_pack – The prediction datapack, which should contain the system predicted results.

  • ref_pack – The reference datapack, which should contain the reference to score on.

abstract get_result()[source]

The evaluator gathers the results, and the score can be obtained here.
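
A minimal sketch of a concrete evaluator (the Token entry type and pos attribute compared here are illustrative choices, not requirements of the base class):

from forte.evaluation.base.base_evaluator import Evaluator
from ft.onto.base_ontology import Token

class PosAccuracyEvaluator(Evaluator):
    """A hypothetical evaluator computing token-level POS accuracy."""

    def __init__(self):
        super().__init__()
        self.correct = 0
        self.total = 0

    def consume_next(self, pred_pack, ref_pack):
        # Compare the predicted attribute against the reference,
        # token by token.
        for pred, ref in zip(pred_pack.get(Token), ref_pack.get(Token)):
            self.total += 1
            if pred.pos == ref.pos:
                self.correct += 1

    def get_result(self):
        return {"accuracy": self.correct / max(self.total, 1)}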

Task Evaluators

class forte.evaluation.ner_evaluator.CoNLLNEREvaluator[source]

consume_next(pred_pack, refer_pack)[source]

Consume the prediction pack and the reference pack to compute evaluation results.

Parameters
  • pred_pack – The prediction datapack, which should contain the system predicted results.

  • refer_pack – The reference datapack, which should contain the reference to score on.

get_result()[source]

The evaluator gathers the results, and the score can be obtained here.