Training System¶
Forte advocates a convention that separates data preprocessing (domain dependent) from the actual training process. This is done by creating an intermediate layer that extracts raw features from data packs. In this documentation, we will visit several components of this system, including:
Train Preprocessor, which defines the structure of this process.
Extractor, which converts between data pack entries and features.
Converter, which turns features into matrices (tensors).
Predictor, which builds data packs from model outputs automatically.
Evaluator, which conducts evaluation on the resulting packs.
Train Preprocessor¶
- class forte.train_preprocessor.TrainPreprocessor(pack_iterator, request, config=None)[source]¶
TrainPreprocessor provides the functionality of doing pre-processing work, including building the vocabulary, extracting features, batching, and padding (optional). The main functionality is provided by its method get_train_batch_iterator(), which returns an iterator over batches of preprocessed data. Please refer to the documentation of that method for how the pre-processing is done.
TrainPreprocessor maintains a Config that stores all the configurable parameters for various components.
TrainPreprocessor also accepts a user request. Internally, it will parse this user request and store the parsed result.
- Parameters
pack_iterator (Iterator[DataPack]) – An iterator of DataPack.
request (Dict) – A request that specifies how to do train pre-processing. Please refer to request() for details.
config – A Dict or Config that configures this preprocessor. See default_configs() for the defaults.
Note
For the request parameter, the user does not necessarily need to provide a converter. If no converter is specified, a default converter of type Converter will be picked.

- static default_configs()[source]¶
Returns a dictionary of default hyper-parameters.

{
    "preprocess": {
        "device": "cpu",
    },
    "dataset": DataPackDataset.default_hparams()
}
Here:

- "preprocess.device":
The device on which the batches are produced. For GPU training, set this to the current CUDA device.
- "dataset":
This contains the same configurable options as DataPackDataset.
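For example, to produce batches on a GPU, one might override the device as follows (a minimal sketch; pack_iterator and request are assumed to be defined as in the request() example below):

config = {
    "preprocess": {
        "device": "cuda:0",  # produce batches on the current CUDA device
    },
}
train_preprocessor = TrainPreprocessor(pack_iterator, request, config=config)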
- property request¶
A Dict containing all the information needed for doing the pre-processing. This is obtained by parsing the input request.

An example request is:

request = {
    "scope": ft.onto.Sentence,
    "schemes": {
        "text_tag": {
            "extractor": forte.data.extractor.AttributeExtractor,
            "converter": forte.data.converter.Converter,
            "type": TrainPreprocessor.DATA_INPUT,
        },
        "char_tag": {
            "extractor": forte.data.extractor.CharExtractor,
            "converter": forte.data.converter.Converter,
            "type": TrainPreprocessor.DATA_INPUT,
        },
        "ner_tag": {
            "extractor": forte.data.extractor.BioSeqTaggingExtractor,
            "converter": forte.data.converter.Converter,
            "type": TrainPreprocessor.DATA_OUTPUT,
        },
    }
}
Here:

- "scope": Entry
A class of type Entry. The scope defines the granularity used to separate the data into different examples. For example, if the scope is Sentence, then each training example will represent the information of one sentence.
- "schemes": Dict
A Dict containing the information about the pre-processing. Each key is a tag provided by the input request; each value is a Dict containing the information for pre-processing that feature.
- "schemes.tag.extractor": Extractor
An instance of type BaseExtractor.
- "schemes.tag.converter": Converter
An instance of type Converter.
- "schemes.tag.type": TrainPreprocessor.DATA_INPUT/DATA_OUTPUT
Denotes whether this feature is an input or an output feature.
- property device¶
The device on which the batches are produced. For GPU training, set this to the current CUDA device.
- property config¶
A Config maintaining all the configurable options for this TrainPreprocessor.
- get_train_batch_iterator()[source]¶
This method mainly has four steps:

1. Iterate over DataPack via the pack iterator.
2. Extract Feature from each DataPack.
3. Batch the Features.
4. (Optional) Pad each batch of Features.

It returns an iterator over batches of preprocessed data.

- Returns
An Iterator of type Batch. Please refer to collate() in DataPackDataset for details about its structure.
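As a rough usage sketch (assuming pack_iterator and request are defined as in the example above; the training-loop body is elided):

from forte.train_preprocessor import TrainPreprocessor

train_preprocessor = TrainPreprocessor(
    pack_iterator=pack_iterator,  # an Iterator[DataPack] you provide
    request=request,              # the request Dict shown above
)

for batch in train_preprocessor.get_train_batch_iterator():
    # Each batch bundles the converted (and padded) tensors for the
    # tags declared in the request, e.g. "text_tag" and "ner_tag".
    ...  # forward pass, loss computation, optimizer step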
Converter¶
- class forte.data.converter.converter.Converter(config)[source]¶
This class has the functionality of converting a batch of Feature to a PyTorch tensor. It can also pad the given batch of Feature if the user requests it. Please refer to the request parameter in TrainPreprocessor for details.

- Parameters
config – An instance of Dict or Config that provides all configurable options. See default_configs() for available options and default values.
- static default_configs()[source]¶
Returns a dictionary of default hyper-parameters.

{
    "to_numpy": True,
    "to_torch": True
}

Here:

- "to_numpy": bool
Whether to convert to numpy.ndarray. Default is True.
- "to_torch": bool
Whether to convert to torch.tensor. Default is True.
Note
If need_pad in forte.data.converter.Feature is False and to_numpy or to_torch is True, an exception will be raised if the target data cannot be converted to numpy.ndarray or torch.tensor.

Note
If need_pad in forte.data.converter.Feature is True and to_torch is True, to_torch will overwrite the effect of to_numpy.
- convert(features)[source]¶
Convert a list of Features to the actual data, where:

1. The outermost dimension will always be the batch dimension (i.e., len(output) == len(features)).
2. The output type can be:
2.1 a List of primitive ints or of nested Lists;
2.2 a numpy.ndarray;
2.3 a torch.Tensor.

If need_pad in forte.data.converter.Feature is True, all features will be padded with the pad_value stored inside forte.data.converter.Feature. If to_numpy is True, it will try to convert the data into numpy.ndarray. If to_torch is True, it will try to convert the data into torch.tensor.

- Parameters
features (List[Feature]) – A list of forte.data.converter.Feature.

- Returns
A Tuple containing two elements:

1. The first element is a List, numpy.ndarray, or torch.tensor representing the batch of data.
2. The second element is a List of masks (numpy.ndarray or torch.tensor) along the different feature dimensions.
Example 1:

data = [[1,2,3], [4,5], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_numpy": True, "to_torch": False})
output_data, masks = converter.convert(features)

# output_data is:
# np.array([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=np.long)

# masks is:
# [
#     np.array([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
#              dtype=np.bool)
# ]
Example 2:

data = [[[1,2,3], [4,5]], [[3]]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 2,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_numpy": True, "to_torch": False})
output_data, masks = converter.convert(features)

# output_data is:
# np.array([[[1,2,3], [4,5,0]], [[3,0,0], [0,0,0]]],
#          dtype=np.long)

# masks is:
# [
#     np.array([[1,1], [1,0]], dtype=np.bool),
#     np.array([[[1,1,1], [1,1,0]],
#               [[1,0,0], [0,0,0]]], dtype=np.bool)
# ]
Example 3:

data = [[1,2,3,0], [4,5,0,0], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": False,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter()
output_data, _ = converter.convert(features)

# output_data is:
# torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)
Example 4:

data = [[1,2,3], [4,5], [6,7,8,9]]
meta_data = {
    "pad_value": 0,
    "need_pad": True,
    "dim": 1,
    "dtype": np.long
}
features = [Feature(i, metadata=meta_data) for i in data]
converter = Converter({"to_torch": True})
output_data, masks = converter.convert(features)

# output_data is:
# torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)

# masks is:
# [
#     torch.tensor([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
#                  dtype=torch.bool)
# ]
Extractor¶
BaseExtractor¶
- class forte.data.extractor.base_extractor.BaseExtractor(config)[source]¶
The functionality of an Extractor is as follows. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

1. Build the vocabulary.
2. Extract features from a datapack.
3. Perform the pre-evaluation action on a datapack.
4. Add predictions to a datapack.
Explanation:

- Vocabulary:
The vocabulary is maintained as an inner class of the extractor. It stores the mapping from each element to an index, which is an integer, and to a representation, which can be an index integer or a one-hot vector depending on the configuration of the vocabulary. Check Vocabulary for more details.
- Feature:
A feature basically wraps the data we want from one instance in a datapack. For example, the instance can be one sentence in a datapack, and the data wrapped by the feature could be the token texts of this sentence. The data is converted to a list of indices using the vocabulary. Besides the data, other information such as the raw data before indexing and some meta data will also be stored in the feature. Check Feature for more details.
- Remove feature / Add prediction:
Removing a feature means removing the existing data in the datapack; if we remove the feature from the pack, then extracting the feature will return an empty list. Adding a prediction means adding the prediction from a model back to the datapack. If a datapack has some old data (for example, the gold data in a test set), we can first remove that data and then add our model predictions to the pack.
- Parameters
config – An instance of Dict or Config that provides all configurable options. See default_configs() for available options and default values. entry_type is a required key; it has no default value.
- classmethod default_configs()[source]¶
Returns a dictionary of default hyper-parameters.

Here:

- "entry_type": Type[Entry]
Required. The ontology type that the extractor will get the feature from.
- "vocab_method": str
The type of vocabulary used for this extractor. raw, indexing, and one-hot are supported; the default is indexing. Check the behavior of the vocabulary under the different settings in Vocabulary.
- "need_pad": bool
Whether the <PAD> element should be added to the vocabulary, and whether the feature needs to be batched and padded. Default is True.
- "vocab_use_unk": bool
Whether the <UNK> element should be added to the vocabulary. Default is True.
- property vocab¶
Getter of the vocabulary class.

Returns: The vocabulary, or None if the vocabulary is not set.
- predefined_vocab(predefined)[source]¶
Populate the vocabulary with predefined values. You can also extend this method to customize how the vocabulary is handled.

Overriding instructions:

1. Take the elements out of predefined.
2. Modify the elements based on the needs of the extractor.
3. Use the add() function to add the elements into the vocabulary.

- Parameters
predefined (Iterable) – A collection that contains the elements to be added into the vocabulary.
- update_vocab(pack, instance)[source]¶
Populate the vocabulary by taking the elements from one instance. For example, when the instance is a Sentence and we want to add all Tokens from the sentence into the vocabulary, we might call this function.

If you use a pre-specified vocabulary, you may not need to use this function.

Overriding instructions:

1. Get all entries from one instance in the pack. You would probably use the pack.get() function to acquire the Entry that you need.
2. Get the elements that are needed from the entries. This process will differ between extractors. For example, you might want to get the token texts from one sentence, or the tags of a sequence for one sentence.
3. Use add() to add those elements into the vocabulary.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will get the elements.
- extract(pack, instance)[source]¶
Extract the feature for one instance in a pack.

Overriding instructions:

1. Get all entries from one instance in the pack.
2. Get the elements that are needed from the entries, for example, the token texts or the sequence tags.
3. Construct a feature and return it.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will extract the feature.

- Returns (Feature):
A feature that contains the extracted data.
- pre_evaluation_action(pack, instance)[source]¶
This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. For example, you can remove entries or remove some attributes of an entry. By default, this function does nothing.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance on which the pre-evaluation action is performed.
- add_to_pack(pack, instance, prediction)[source]¶
Add the prediction of a model (normally in the form of a tensor) back to the pack. This function should have knowledge of the structure of the prediction to correctly populate the data pack values.

This function can be roughly considered as the reverse operation of extract().

Overriding instructions:

1. Get all entries from one instance in the pack.
2. Convert the prediction into the elements that need to be assigned to the entries. You might need to use id2element() to convert an index in the prediction into an element via the vocabulary maintained by the extractor.
3. Add the elements to the corresponding entries as needed.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance to which the extractor adds the prediction.
prediction (Any) – The output of the model, whose format is determined by the predict function the user defines and passes in to the framework.
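To make the overriding contract concrete, here is a minimal sketch of a hypothetical subclass that extracts token texts. The add() function and the vocab property are documented above; element2repr (mapping an element to its vocabulary representation) and the exact metadata fields are assumptions for illustration:

from forte.data.converter import Feature
from forte.data.extractor.base_extractor import BaseExtractor
from ft.onto.base_ontology import Token

class TokenTextExtractor(BaseExtractor):  # hypothetical example
    def update_vocab(self, pack, instance):
        # Add each token's text from this instance into the vocabulary.
        for token in pack.get(Token, instance):
            self.add(token.text)

    def extract(self, pack, instance):
        # Map each token text to its vocabulary representation.
        data = [self.element2repr(token.text)
                for token in pack.get(Token, instance)]
        meta = {"pad_value": 0, "dim": 1, "dtype": int}  # illustrative fields
        return Feature(data, metadata=meta, vocab=self.vocab)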
AttributeExtractor¶
- class forte.data.extractor.attribute_extractor.AttributeExtractor(config)[source]¶
AttributeExtractor extracts features from an attribute of the entry. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.
- classmethod default_configs()[source]¶
Returns a dictionary of default hyper-parameters.

Here:

- "attribute": str
The name of the attribute we want to extract from the entry, for example, the text attribute of Token. The default is text.
- update_vocab(pack, instance)[source]¶
Get all attributes of one instance and add them into the vocabulary.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will extract the feature.
- extract(pack, instance)[source]¶
Extract the attributes of one instance, for example, the texts of the tokens in one sentence.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will extract the feature.

- Returns (Feature):
A feature that contains the extracted data.
- pre_evaluation_action(pack, instance)[source]¶
This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove the attribute. You can override this function yourself.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance on which the pre-evaluation action is performed.
- add_to_pack(pack, instance, prediction)[source]¶
Add the prediction for the attribute to the instance. If the prediction is an iterable object, we assume each of its elements corresponds to one entry. If the prediction is a single element, we assume there is only one entry in the instance.

Classes extending this one will need to handle their specific prediction data types. The default implementation assumes the data type is integer.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance to which the extractor adds the prediction.
prediction (Iterable[Union[int, Any]]) – The output of the model, which contains the indices for the attributes of one instance.
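As a usage sketch, the following configures an AttributeExtractor with the keys documented above (ft.onto.base_ontology is assumed to provide Token and Sentence; train_packs is assumed to be defined elsewhere):

from forte.data.extractor.attribute_extractor import AttributeExtractor
from ft.onto.base_ontology import Sentence, Token

extractor = AttributeExtractor({
    "entry_type": Token,         # required: extract from Token entries
    "attribute": "text",         # extract the token text (the default)
    "vocab_method": "indexing",  # map each distinct value to an index
})

# Build the vocabulary over the training packs before extracting features:
# for pack in train_packs:
#     for sent in pack.get(Sentence):
#         extractor.update_vocab(pack, sent)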
CharExtractor¶
- class forte.data.extractor.char_extractor.CharExtractor(config)[source]¶
CharExtractor extracts features from the text of the entry; the text is split into characters. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.
- classmethod default_configs()[source]¶
Returns a dictionary of default hyper-parameters.

Here:

- "max_char_length": int
The maximum number of characters for one token in the text.
- update_vocab(pack, instance)[source]¶
Add all characters into the vocabulary.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will get the text.
- extract(pack, instance)[source]¶
Extract the character feature of one instance.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will extract the feature.

- Returns (Feature):
A feature that contains the extracted data.
BioSeqTaggingExtractor¶
- class forte.data.extractor.seqtagging_extractor.BioSeqTaggingExtractor(config)[source]¶
BioSeqTaggingExtractor extracts the feature by performing BIO encoding on an attribute of the entry and aligning the tags to the tagging_unit entry. Most of the time, a user will not need to call this class explicitly; it will be called by the framework.

- Parameters
config – An instance of Dict or Config. See default_configs() for available options and default values.
- classmethod default_configs()[source]¶
Returns a dictionary of default hyper-parameters.

Here:

- "entry_type": Type[Entry]
Required. The ontology type that the extractor will get the feature from.
- "attribute": str
Required. The name of the attribute of the entry from which labels are extracted.
- "tagging_unit": Type[Entry]
Required. The tagging labels will be aligned to the tagging_unit entry.

For example, the config can be:

{
    "entry_type": EntityMention,
    "attribute": "ner_type",
    "tagging_unit": Token
}

The extractor will extract BIO-NER style tags for instances. A possible feature can be:

[[None, "O"], ["LOC", "B"], ["LOC", "I"], [None, "O"],
 [None, "O"], ["PER", "B"], [None, "O"]]
- predefined_vocab(predefined)[source]¶
Add predefined tags into the vocabulary; that is, one can construct the tag vocabulary without exploring the training data.

- Parameters
predefined (Iterable[str]) – A set of pre-defined tags.
- update_vocab(pack, instance)[source]¶
Add all the tags from one instance into the vocabulary.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will extract the tags.
- extract(pack, instance)[source]¶
Extract the sequence tagging feature of one instance. If the vocabulary of this extractor is set, the extracted tag sequences will be converted to tag ids (int).

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance from which the extractor will extract the feature.

- Returns (Feature):
A feature that contains the extracted data.
- pre_evaluation_action(pack, instance)[source]¶
This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove the tags in the instance. You can override this function yourself.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance on which the extractor performs the pre-evaluation action.
- add_to_pack(pack, instance, prediction)[source]¶
Add the prediction for the attribute to the instance. We make the following assumptions about the prediction:

1. If we encounter an "I" tag whose type differs from the previous tag, we consider this "I" as a "B" and start a new tag here (see the sketch after this entry).
2. We truncate the prediction according to the number of entries. If the prediction contains <PAD> elements, they will be removed.

- Parameters
pack (DataPack) – The datapack that contains the current instance.
instance (Annotation) – The instance to which the extractor adds the prediction.
prediction (Iterable[Union[int, Any]]) – The output of the model, which contains the indices for the attributes of one instance.
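The first assumption can be illustrated with a short sketch; this mirrors the re-segmentation rule described above and is not the extractor's actual internal code:

def bio_to_spans(tags):
    """tags: (type, indicator) pairs, e.g. ("LOC", "B"); returns
    (type, start, end) spans over the tagging units."""
    spans, current = [], None
    for i, (typ, ind) in enumerate(tags):
        if ind == "B" or (ind == "I" and (current is None or current[0] != typ)):
            if current:
                spans.append(current)
            current = (typ, i, i + 1)           # open a new span
        elif ind == "I":
            current = (typ, current[1], i + 1)  # extend the running span
        else:  # "O"
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

# bio_to_spans([("LOC", "B"), ("LOC", "I"), (None, "O"), ("PER", "I")])
# -> [("LOC", 0, 2), ("PER", 3, 4)]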
Predictor¶
- class forte.processors.base.batch_processor.Predictor[source]¶
This class is used to perform prediction on the features extracted from a datapack and to add the predictions back to the datapack.

- pack(pack, inputs)[source]¶
This function exists only for compatibility reasons; it is not actually used in this class.
- static define_batcher()[source]¶
Define a specific batcher for this processor. A single-pack BatchProcessor initializes the batcher to be a ProcessingBatcher, while a MultiPackBatchProcessor initializes the batcher to be a MultiPackProcessingBatcher.
- initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and global resources will be registered into resources. The implementation should set up the states of the component.

- Parameters
resources (Resources) – A global resource registry. Users can register shareable resources here, for example, the vocabulary.
configs (Config) – The configuration passed in to set up this component.
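As a hedged sketch, a Predictor is normally placed in a Forte Pipeline, which calls initialize() for it; the reader, the configured predictor instance, and data_path below are assumed to be defined elsewhere and are not part of the documented API above:

from forte.data.data_pack import DataPack
from forte.pipeline import Pipeline

pipeline = Pipeline[DataPack]()
pipeline.set_reader(reader)  # a reader you provide
pipeline.add(predictor)      # a configured Predictor instance
pipeline.initialize()        # calls initialize() on each component

for pack in pipeline.process_dataset(data_path):
    ...  # predictions have been added back into each pack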
Feature¶
- class forte.data.converter.Feature(data, metadata, vocab=None)[source]¶
This class represents one type of feature for a single data instance. The Feature can have multiple dimensions. It has methods to do padding and to retrieve the actual multi-dimensional data.

- Parameters
data (list) – A list of features, where each feature can be a value or another list of features. Typically this is the output of extract() in BaseExtractor.
metadata (dict) – A dictionary storing the meta data for this feature. Mandatory fields include dim and dtype. dim indicates the total number of dimensions of this feature; dtype is the value type, for example, torch.long.
vocab (Vocabulary) – An optional field for the Vocabulary used to build this feature.

Please refer to data() for the typical usage of this class.
- property leaf_feature¶
Returns: True if the current feature is a leaf feature; otherwise, False.
- property dtype¶
Returns: The data type of this feature.
- property sub_features¶
Returns: A list of sub features. Raises an exception if the current feature is a leaf feature.
- property meta_data¶
Returns: A Dict of meta data describing this feature.
- property vocab¶
Returns: The Vocabulary used to build this feature.
- property dim¶
Returns: The dimension of this feature.
- property need_pad¶
Returns: Whether the Feature needs to be padded.
- pad(max_len)[source]¶
Pad the current feature dimension to the given max_len, using pad_value to do the padding.

- Parameters
max_len (int) – The padded length.
- property data¶
Returns the actual data stored. Internally, it recursively retrieves the data from the inner-dimension features. It also returns a list of masks representing the masks along the different dimensions.

- Returns
A Tuple where:

1. The first element is the actual data representing this feature.
2. The second element is a list of masks; masks[i] represents the mask along the i-th dimension.

Here are some examples of how the padding works:
Example 1 (1-dim feature, no padding):

data = [2,7,8]
meta_data = {
    "pad_value": 0,
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)
data, masks = feature.data

# data is:
# [2,7,8]

# masks is:
# [
#     [1,1,1]
# ]
Example 2 (1-dim feature, scalar padding):

data = [2,7,8]
meta_data = {
    "pad_value": 0,
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)
feature.pad(max_len=4)
data, masks = feature.data

# data is:
# [2,7,8,0]

# masks is:
# [
#     [1,1,1,0]
# ]
Example 3 (2-dim feature, scalar padding):

data = [[1,2,5], [3], [1,5]]
meta_data = {
    "pad_value": 0,
    "dim": 2,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)
feature.pad(max_len=4)
for sub_feature in feature.sub_features:
    sub_feature.pad(max_len=3)
data, masks = feature.data

# data is:
# [[1,2,5], [3,0,0], [1,5,0], [0,0,0]]

# masks is:
# [
#     [1,1,1,0],
#     [[1,1,1], [1,0,0], [1,1,0], [0,0,0]]
# ]
Example 4 (1-dim feature, vector padding):

data = [[0,1,0], [1,0,0]]
meta_data = {
    "pad_value": [0,0,1],
    "dim": 1,
    "dtype": torch.long
}
feature = Feature(data, metadata=meta_data)
feature.pad(max_len=3)
data, masks = feature.data

# data is:
# [[0,1,0], [1,0,0], [0,0,1]]

# masks is:
# [
#     [1,1,0]
# ]
Evaluation¶
Base Evaluator¶
- class forte.evaluation.base.base_evaluator.Evaluator[source]¶
The base class of evaluators.

- abstract consume_next(pred_pack, ref_pack)[source]¶
Consume the prediction pack and the reference pack to compute the evaluation results.

- Parameters
pred_pack – The prediction datapack, which should contain the system's predicted results.
ref_pack – The reference datapack, which should contain the references to score against.
Task Evaluators¶
- class forte.evaluation.ner_evaluator.CoNLLNEREvaluator[source]¶

- consume_next(pred_pack, ref_pack)[source]¶
Consume the prediction pack and the reference pack to compute the evaluation results.

- Parameters
pred_pack – The prediction datapack, which should contain the system's predicted results.
ref_pack – The reference datapack, which should contain the references to score against.
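A minimal usage sketch (assuming pred_packs and ref_packs are aligned iterables of DataPacks produced elsewhere, e.g. by a prediction pipeline):

from forte.evaluation.ner_evaluator import CoNLLNEREvaluator

evaluator = CoNLLNEREvaluator()
for pred_pack, ref_pack in zip(pred_packs, ref_packs):
    # Accumulate evaluation statistics pack by pack.
    evaluator.consume_next(pred_pack, ref_pack)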