Training System¶
Forte promotes the convention of separating data pre-processing (which is domain dependent) from the actual training process. This is done by adding an intermediate layer that extracts raw features from data packs. This documentation covers the components of this system, which include:
Train Preprocessor that defines the structure of this process.
Extractor that converts between data pack content and features, in both directions.
Converter that creates matrices.
Predictor that builds data pack from model output automatically.
Evaluator that conducts evaluation on the resulting pack.
Train Preprocessor¶
-
class
forte.train_preprocessor.
TrainPreprocessor
(pack_iterator)[source]¶ TrainPreprocessor handles the pre-processing work: building the vocabulary, extracting features, batching, and (optionally) padding. The processed data is provided by its method
get_train_batch_iterator()
, which returns an iterator over batches of pre-processed data. Please refer to the documentation of that method for how the pre-processing is done.
A main part of the TrainPreprocessor is that it maintains a list of extractors (BaseExtractor) that extract features. Extractors can be provided either by calling the add_extractor function or by passing a request in through initialize, in which case the configuration under the request key is used to create the extractor instances.
The parsed components are stored and can be accessed via the request property of this class.
- Parameters
pack_iterator (Iterator[DataPack]) – An iterator of
DataPack
.
Note
For the request parameter, the user does not need to provide a converter. If no converter is specified, a default converter of type
Converter
will be picked.
-
add_extractor
(name, extractor, is_input, converter=None)[source]¶ Extractors can be added to the preprocessor directly via this method.
- Parameters
name (
str
) – The name/identifier of this extractor. Names must be unique across extractors.
extractor (
BaseExtractor
) – The extractor instance to be added.
is_input (
bool
) – Whether this extractor will be used as input or output.
converter (
Optional
[Converter
]) – The converter instance to be applied after running the extractor.
Returns:
-
static
default_configs
()[source]¶ Returns a dictionary of default hyper-parameters.
    {
        "preprocess": {
            "device": "cpu",
        },
        "dataset": DataPackDataset.default_hparams()
    }
Here:
“preprocess.device”: The device of the produced batches. For GPU training, set to current CUDA device.
“dataset”: Contains the same configurable options as
DataPackDataset
.
-
property
request
¶ A Dict containing all the information needed for doing the pre-processing. This is obtained by parsing the input request.
An example request is:
    request = {
        "context_type": "ft.onto.base_ontology.Sentence",
        "schemes": {
            "text_tag": {
                "extractor": {
                    "class_name": "forte.data.extractor.AttributeExtractor",
                    "config": {
                        # ... more configuration of the extractor
                    }
                }
            },
            "ner_tag": {
                "extractor": {
                    "class_name": "forte.data.extractor.BioSeqTaggingExtractor",
                    "config": {
                        # ... more configuration of the extractor
                    }
                }
            }
        }
    }
Here:
“context_type”: Annotation. The class given as
context_type
defines the granularity used to separate data into groups, and all extractors operate on that basis. For example, if context_type is Sentence, the features of each extractor represent the information of one sentence. If this value is None, all extractors operate on the whole data pack.
“schemes”: Dict. A Dict describing how to do the pre-processing. Each key is a tag provided by the input request. Each value is a Dict with the pre-processing information for that feature.
“schemes.tag.extractor”: An instance of type
BaseExtractor
.
“schemes.tag.converter”: An instance of type
Converter
.
“schemes.tag.type”: TrainPreprocessor.DATA_INPUT/DATA_OUTPUT. Denotes whether this feature is an input or an output feature.
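As a rough illustration of the structure described here, the sketch below writes a request as a plain dictionary and lists its schemes. The extractor class paths are copied from the example request, and the ner_tag config values come from the BioSeqTaggingExtractor example on this page. The text_tag config, the string values used for "type" (placeholders standing in for TrainPreprocessor.DATA_INPUT/DATA_OUTPUT), and the helper list_schemes are assumptions made for illustration and are not part of Forte.

```python
# Illustrative request dictionary. The "type" strings are placeholders,
# not the real TrainPreprocessor.DATA_INPUT / DATA_OUTPUT constants.
request = {
    "context_type": "ft.onto.base_ontology.Sentence",
    "schemes": {
        "text_tag": {
            "extractor": {
                "class_name": "forte.data.extractor.AttributeExtractor",
                # Assumed config: extract the text of each Token.
                "config": {
                    "entry_type": "ft.onto.base_ontology.Token",
                    "attribute": "text",
                },
            },
            "type": "data_input",
        },
        "ner_tag": {
            "extractor": {
                "class_name": "forte.data.extractor.BioSeqTaggingExtractor",
                "config": {
                    "entry_type": "ft.onto.base_ontology.EntityMention",
                    "attribute": "ner_type",
                    "tagging_unit": "ft.onto.base_ontology.Token",
                },
            },
            "type": "data_output",
        },
    },
}


def list_schemes(req):
    """Hypothetical helper: (tag, extractor class, type) triples, sorted by tag."""
    return [
        (tag, scheme["extractor"]["class_name"], scheme["type"])
        for tag, scheme in sorted(req["schemes"].items())
    ]


summary = list_schemes(request)
for tag, class_name, kind in summary:
    print(tag, class_name, kind)
```

Each tag under "schemes" names one feature, so a model with one input and one output feature needs two entries, as in this sketch.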
- Return type
-
property
device
¶ The device of the produced batches. For GPU training, set to current CUDA device.
- Return type
device
-
property
config
¶ A
Config
maintaining all the configurable options for this TrainPreprocessor.
- Return type
-
get_train_batch_iterator
()[source]¶ This method runs the pre-processing steps of this class (feature extraction, batching, and optional padding) and returns an iterator over batches of pre-processed data.
- Return type
- Returns
An Iterator of type
Batch
. Please refer to
collate()
in DataPackDataset
for details about its structure.
Converter¶
-
class
forte.data.converter.converter.
Converter
(config=None)[source]¶ This class converts a batch of
Feature
to a MatrixLike type, which can be a NumPy array, a PyTorch Tensor, or a nested list.
It can also pad the given batch of
Feature
if the user requests it. Please refer to the request parameter in TrainPreprocessor
for details.
- Parameters
config (
Union
[Dict
,HParams
,None
]) – An instance of Dict or Config
that provides all configurable options. See default_configs()
for available options and default values.
-
static
default_configs
()[source]¶ Returns a dictionary of default hyper-parameters.
{ "to_numpy": True, "to_torch": True }
Here:
“to_numpy”: bool Whether to convert to numpy.ndarray. Default is True.
“to_torch”: bool Whether to convert to torch.tensor. Default is True.
Note
If need_pad in
forte.data.converter.Feature
is False and to_numpy and to_torch are True, an exception is raised if the target data cannot be converted to numpy.ndarray or torch.tensor.
Note
If need_pad in
forte.data.converter.Feature
is True and to_torch is True, to_torch overrides the effect of to_numpy.
-
convert
(features)[source]¶ Convert a list of Features to matrix-like form, where
1. The outermost dimension is always the batch dimension (i.e. len(output) equals the number of features).
2. The type can be:
2.1 A List of primitive int or another List
2.2 A numpy.ndarray
2.3 A torch.Tensor
If need_pad in
forte.data.converter.Feature
is True, it will pad all features with the given pad_value stored inside forte.data.converter.Feature
.
If to_numpy is True, it will try to convert data into numpy.ndarray.
If to_torch is True, it will try to convert data into torch.tensor.
- Parameters
features (
List
[Feature
]) – A list of forte.data.converter.Feature
- Return type
Tuple
[Union
[TensorType
,ndarray
,List
],Sequence
[Union
[TensorType
,ndarray
,List
]]]
- Returns
A Tuple containing two elements.
1. The first element is a MatrixLike type representing the batch of data.
2. The second element is a MatrixLike type representing masks along different feature dimensions.
Example 1:
    import numpy as np
    from forte.data.converter.converter import Converter
    from forte.data.converter.feature import Feature

    data = [[1,2,3], [4,5], [6,7,8,9]]
    meta_data = {
        "pad_value": 0,
        "need_pad": True,
        "dim": 1,
        "dtype": np.long
    }
    features = [Feature(i, metadata=meta_data) for i in data]
    converter = Converter({"to_numpy": True, "to_torch": False})
    output_data, masks = converter.convert(features)

    # output_data is:
    # np.array([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=np.long)
    # masks is:
    # [
    #     np.array([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
    #              dtype=np.bool)
    # ]
Example 2:
    data = [[[1,2,3], [4,5]], [[3]]]
    meta_data = {
        "pad_value": 0,
        "need_pad": True,
        "dim": 2,
        "dtype": np.long
    }
    features = [Feature(i, metadata=meta_data) for i in data]
    converter = Converter({"to_numpy": True, "to_torch": False})
    output_data, masks = converter.convert(features)

    # output_data is:
    # np.array([[[1,2,3], [4,5,0]], [[3,0,0], [0,0,0]]],
    #          dtype=np.long)
    # masks is:
    # [
    #     np.array([[1,1], [1,0]], dtype=np.bool),
    #     np.array([[[1,1,1], [1,1,0]],
    #               [[1,0,0], [0,0,0]]], dtype=np.bool)
    # ]
Example 3:
    data = [[1,2,3,0], [4,5,0,0], [6,7,8,9]]
    meta_data = {
        "pad_value": 0,
        "need_pad": False,
        "dim": 1,
        "dtype": np.long
    }
    features = [Feature(i, metadata=meta_data) for i in data]
    # Default config: to_numpy and to_torch are both True.
    converter = Converter()
    output_data, _ = converter.convert(features)

    # output_data is:
    # torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)
Example 4:
    data = [[1,2,3], [4,5], [6,7,8,9]]
    meta_data = {
        "pad_value": 0,
        "need_pad": True,
        "dim": 1,
        "dtype": np.long
    }
    features = [Feature(i, metadata=meta_data) for i in data]
    converter = Converter({"to_torch": True})
    output_data, masks = converter.convert(features)

    # output_data is:
    # torch.tensor([[1,2,3,0], [4,5,0,0], [6,7,8,9]], dtype=torch.long)
    # masks is:
    # [
    #     torch.tensor([[1,1,1,0], [1,1,0,0], [1,1,1,1]],
    #                  dtype=torch.bool)
    # ]
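The padding and mask behaviour shown in these examples can be sketched in plain Python for the one-dimensional case. This is a minimal stand-in and not the Converter implementation: the function name pad_batch is hypothetical, and it returns nested lists rather than NumPy arrays or tensors.

```python
from typing import List, Tuple


def pad_batch(
    batch: List[List[int]], pad_value: int = 0
) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence to the longest length in the batch.

    Returns the padded batch and a 0/1 mask of the same shape, where 1
    marks a real value and 0 marks a padded position.
    """
    max_len = max((len(seq) for seq in batch), default=0)
    padded, mask = [], []
    for seq in batch:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_value] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


padded, mask = pad_batch([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
print(padded)
print(mask)
```

The input is the same batch as in Example 1, so the padded values and the mask can be compared with the output listed there.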
Feature¶
-
class
forte.data.converter.feature.
Feature
(data, metadata, vocab=None)[source]¶ This class represents a type of feature for a single data instance. A Feature can have multiple dimensions. It has methods to do padding and to retrieve the actual multi-dimensional data.
- Parameters
data (
List
) – A list of features, where each feature can be a value or another list of features. Typically this is the output from extract()
in BaseExtractor
.
metadata (
Dict
) – A dictionary storing meta-data for this feature. Mandatory fields include dim and dtype.
dim indicates the total number of dimensions of this feature.
dtype is the value type. For example, it can be torch.long.
vocab (
Optional
[Vocabulary
]) – An optional field for the Vocabulary
used to build this feature.
Please refer to
data()
for the typical usage of this class.
-
property
leaf_feature
¶ Returns: True if the current feature is a leaf feature. Otherwise, False.
- Return type
-
property
dtype
¶ Returns: The data type of this feature.
-
property
sub_features
¶ Returns: A list of sub features. Raises an exception if the current feature is a leaf feature.
-
property
vocab
¶ Returns: The
Vocabulary
used to build this feature.
- Return type
-
pad
(max_len)[source]¶ Pad the current feature dimension to the given max_len, using pad_value for the padding.
- Parameters
max_len (int) – The padded length.
-
property
data
¶ It will return the actual data stored. Internally, it will recursively retrieve data from inner dimension features. Meanwhile, it will also return a list of masks representing the mask along different dimensions.
- Return type
- Returns
A Tuple where
The first element is the actual data representing this feature.
The second element is a list of masks. masks[i] in this list represents the mask along i-th dimension.
Here are some examples of how the padding works:
Example 1 (1-dim feature, no padding):
    import torch
    from forte.data.converter.feature import Feature

    data = [2,7,8]
    meta_data = {
        "pad_value": 0,
        "dim": 1,
        "dtype": torch.long
    }
    feature = Feature(data, metadata=meta_data)
    data, masks = feature.data

    # data is:
    # [2,7,8]
    # masks is:
    # [
    #     [1,1,1]
    # ]
Example 2 (1-dim feature, scalar padding):
    data = [2,7,8]
    meta_data = {
        "pad_value": 0,
        "dim": 1,
        "dtype": torch.long
    }
    feature = Feature(data, metadata=meta_data)
    feature.pad(max_len=4)
    data, masks = feature.data

    # data is:
    # [2,7,8,0]
    # masks is:
    # [
    #     [1,1,1,0]
    # ]
Example 3 (2-dim feature, scalar padding):
    data = [[1,2,5], [3], [1,5]]
    meta_data = {
        "pad_value": 0,
        "dim": 2,
        "dtype": torch.long
    }
    feature = Feature(data, metadata=meta_data)
    feature.pad(max_len=4)
    for sub_feature in feature.sub_features:
        sub_feature.pad(max_len=3)
    data, masks = feature.data

    # data is:
    # [[1,2,5], [3,0,0], [1,5,0], [0,0,0]]
    # masks is:
    # [
    #     [1,1,1,0],
    #     [[1,1,1], [1,0,0], [1,1,0], [0,0,0]]
    # ]
Example 4 (1-dim feature, vector padding):
    data = [[0,1,0],[1,0,0]]
    meta_data = {
        "pad_value": [0,0,1],
        "dim": 1,
        "dtype": torch.long
    }
    feature = Feature(data, metadata=meta_data)
    feature.pad(max_len=3)
    data, masks = feature.data

    # data is:
    # [[0,1,0], [1,0,0], [0,0,1]]
    # masks is:
    # [
    #     [1,1,0]
    # ]
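To make the per-dimension masks concrete, here is a plain-Python sketch of padding a two-dimensional feature, using the same input as Example 3. The function pad_2d is a hypothetical stand-in, not the Feature class. It assumes every row already fits within the target lengths and does no truncation.

```python
def pad_2d(data, outer_len, inner_len, pad_value=0):
    """Pad a list of lists to outer_len rows of inner_len values each.

    Returns the padded data and a list of two masks: masks[0] marks
    which rows are real, masks[1] marks which values inside each row
    are real. Assumes len(data) <= outer_len and len(row) <= inner_len.
    """
    n_new_rows = outer_len - len(data)
    outer_mask = [1] * len(data) + [0] * n_new_rows
    # Rows added by outer padding start empty and are filled below.
    rows = [list(row) for row in data] + [[] for _ in range(n_new_rows)]
    padded, inner_mask = [], []
    for row in rows:
        n_pad = inner_len - len(row)
        padded.append(row + [pad_value] * n_pad)
        inner_mask.append([1] * len(row) + [0] * n_pad)
    return padded, [outer_mask, inner_mask]


data, masks = pad_2d([[1, 2, 5], [3], [1, 5]], outer_len=4, inner_len=3)
print(data)
print(masks[0])
print(masks[1])
```

The outer mask has one entry per row, while the inner mask has the same shape as the padded data, which mirrors the masks[i] convention described for the data property.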
Extractor¶
BaseExtractor¶
-
class
forte.data.base_extractor.
BaseExtractor
[source]¶ The functionality of an Extractor is as follows. Most of the time, a user will not need to call this class explicitly; it is called by the framework.
Build vocabulary.
Extract feature from datapack.
Perform pre-evaluation action on datapack.
Add prediction to datapack.
Explanation:
Vocabulary: The vocabulary is maintained as an attribute of the extractor. It stores the mapping from each element to its index (an integer) and to its representation, which can be an index integer or a one-hot vector depending on the configuration of the vocabulary. Check
Vocabulary
for more details.
Feature: A feature wraps the data we want from one instance in a datapack. For example, the instance can be one sentence in a datapack, and the data wrapped by the feature can be the token text of this sentence. The data is already converted to a list of indexes using the vocabulary. Besides the data, other information such as the raw data before indexing and some meta_data is also stored in the feature. Check
Feature
for more details.
Remove feature / Add prediction: Removing a feature means removing the existing data from the datapack. If we remove the feature from the pack, extracting the feature returns an empty list. Adding a prediction means adding the model's prediction back to the datapack. If a datapack has some old data (for example, the golden data in the test set), we can first remove that data and then add our model prediction to the pack.
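The difference between the index and one-hot representations can be illustrated with a toy vocabulary. ToyVocab below is a hypothetical stand-in written for this sketch; it is not Forte's Vocabulary class, and it omits special elements such as <PAD> and <UNK>.

```python
class ToyVocab:
    """Toy vocabulary mapping elements to indexes and representations."""

    def __init__(self, method="indexing"):
        if method not in ("indexing", "one-hot"):
            raise ValueError(f"unsupported method: {method}")
        self.method = method
        self._element2id = {}
        self._id2element = []

    def add(self, element):
        """Add an element if it is new; indexes follow insertion order."""
        if element not in self._element2id:
            self._element2id[element] = len(self._id2element)
            self._id2element.append(element)

    def element2repr(self, element):
        """Return the index, or a one-hot list, depending on the method."""
        idx = self._element2id[element]
        if self.method == "indexing":
            return idx
        return [1 if i == idx else 0 for i in range(len(self._id2element))]

    def id2element(self, idx):
        """Map an index (for example a model prediction) back to its element."""
        return self._id2element[idx]


index_vocab = ToyVocab("indexing")
one_hot_vocab = ToyVocab("one-hot")
for tag in ["LOC", "PER", "ORG"]:
    index_vocab.add(tag)
    one_hot_vocab.add(tag)

indexed = [index_vocab.element2repr(t) for t in ["PER", "LOC"]]
one_hot = one_hot_vocab.element2repr("PER")
print(indexed, one_hot, index_vocab.id2element(2))
```

Extraction goes from element to representation, and adding predictions back to a pack goes the other way, from index to element.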
-
config
¶ An instance of Dict or
Config
that provides configurable options. See default_configs()
for available options and default values.
-
classmethod
default_configs
()[source]¶ Returns a dictionary of default hyper-parameters.
Here:
vocab_method (str): The type of vocabulary used for this extractor. custom, indexing, and one-hot are supported. Default is indexing. Check the behavior of the vocabulary under different settings in
Vocabulary
.
context_type (str): The fully qualified name of the context used to group the extracted features, for example ft.onto.base_ontology.Sentence. If this is None, features from the whole data pack will be grouped together. Default is None. This value can be mandatory for some processors, which will be documented and specified by the specific processor implementation.
vocab_use_unk (bool): Whether the <UNK> element should be added to the vocabulary. Default is True.
need_pad (bool): Whether the <PAD> element should be added to the vocabulary, and whether the feature needs to be batched and padded. Default is True.
pad_value (int): A customized value/representation to be used for padding. This value is only needed when need_pad is True. Default is None, in which case the padding value is determined by the system.
unk_value (int): A customized value/representation to be used for unknown values (unk). This value is only needed when vocab_use_unk is True. Default is None, in which case the value of UNK is determined by the system.
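Collected into one place, the defaults listed here would look roughly like the dictionary below. This is a sketch assembled from the descriptions in this section, not the literal return value of default_configs(), which may contain further keys. The override at the end is only an example.

```python
# Defaults as described in this section (assumed, not read from Forte).
base_extractor_defaults = {
    "vocab_method": "indexing",
    "context_type": None,
    "vocab_use_unk": True,
    "need_pad": True,
    "pad_value": None,
    "unk_value": None,
}

# Example override: one-hot vocabulary, features grouped per sentence.
custom_config = {
    **base_extractor_defaults,
    "vocab_method": "one-hot",
    "context_type": "ft.onto.base_ontology.Sentence",
}

print(custom_config)
```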
-
property
vocab
¶ Getter of the vocabulary class.
Returns: The vocabulary. None if the vocabulary is not set.
- Return type
-
predefined_vocab
(predefined)[source]¶ Populate the vocabulary with predefined values. You can also extend this method to customize the ways to handle the vocabulary.
Overwrite instruction:
Take out elements from predefined.
Make modification on the elements based on the need of the extractor.
Use
add()
function to add the element into vocabulary.
- Parameters
predefined (Iterable) – A collection that contains the elements to be added into the vocabulary.
-
update_vocab
(pack, context=None)[source]¶ Populate the vocabulary needed by the extractor. This can be implemented by a specific extractor. The populated vocabulary can be used to map features/items to numeric representations. If you use a pre-specified vocabulary, you may not need to use this function.
Overwrite instructions:
1. Get all entries of the type of interest, such as all the Tokens in the data pack.
2. Use add() of Vocabulary to add those elements into self._vocab.
- Parameters
pack (
DataPack
) – The input data pack.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
-
abstract
extract
(pack, context=None)[source]¶ This method should be implemented to extract features from a datapack.
- Parameters
pack (
DataPack
) – The input data pack that contains the features.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
Returns: Features inside this instance, stored as a Feature instance.
- Return type
-
pre_evaluation_action
(pack, context)[source]¶ This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. For example, you can remove entries or remove some attributes of the entry. By default, this function will not do anything.
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
context (
Optional
[Annotation
]) – The context is an Annotation entry where data are extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
-
add_to_pack
(pack, predictions, context=None)[source]¶ Add prediction of a model (normally in the form of a tensor) back to the pack. This function should have knowledge of the structure of the prediction to correctly populate the data pack values.
This function can be roughly considered as the reverse operation of
extract()
.
Overwrite instructions:
Get all entries from one instance in the pack.
Convert predictions into elements that need to be assigned to entries. You can use
id2element()
to convert integers in the prediction into elements via the vocabulary maintained by the extractor.
Add the elements to the corresponding entries as needed.
- Parameters
pack (
DataPack
) – The datapack to add predictions back to.
predictions (
Any
) – The output of the model, whose format is determined by the predict function defined in the Predictor.
context (
Optional
[Annotation
]) – The context is an Annotation entry to which predictions will be added. This has the same meaning as context in extract()
. If None, then the whole data pack will be used as the context. Default is None.
AttributeExtractor¶
-
class
forte.data.extractors.attribute_extractor.
AttributeExtractor
[source]¶ AttributeExtractor extracts features from an attribute of an entry. Most of the time, a user will not need to call this class explicitly; it is called by the framework.
-
classmethod
default_configs
()[source]¶ Returns a dictionary of default hyper-parameters.
Here:
“attribute”: str The name of the attribute we want to extract from the entry. This attribute should be present in the entry definition. Some entry types have built-in attributes, such as text for Annotation entries.
tid
should also be available for any entry. The default value is tid
.
“entry_type”: str The fully qualified name of the entry to extract attributes from. The default value is None, but this value must be present or a ProcessorConfigError will be thrown.
-
update_vocab
(pack, context=None)[source]¶ Get all attributes of one instance and add them into the vocabulary.
- Parameters
pack (
DataPack
) – The data pack input to extract vocabulary.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
-
extract
(pack, context=None)[source]¶ Extract the attribute of an entry of the configured entry type. The entry type is passed in via the extractor config entry_type.
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
- Return type
- Returns
Features (attributes) for the instance within the provided context. They will be converted to the representation based on the vocabulary configuration.
-
pre_evaluation_action
(pack, context)[source]¶ This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove all attributes defined in the config (set them to None). You can overwrite this function by yourself.
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
context (
Optional
[Annotation
]) – The context is an Annotation entry where data are extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
-
add_to_pack
(pack, predictions, context=None)[source]¶ Add the prediction for attributes to the data pack. We assume the number of predictions in the iterable to be the same as the number of the entries of the defined type in the data pack.
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
predictions (
Iterable
[SupportsInt
]) – The output of the model, which should be the class index for the attribute.
context (
Optional
[Annotation
]) – The context is an Annotation entry to which predictions will be added. This has the same meaning as context in extract()
. If None, then the whole data pack will be used as the context. Default is None.
LinkExtractor¶
-
class
forte.data.extractors.relation_extractor.
LinkExtractor
[source]¶ This extractor extracts relation type features from data packs. This extractor expects the parent and child of the relation to be Annotation entries.
-
classmethod
default_configs
()[source]¶ Returns a dictionary of default hyper-parameters.
Here:
“entry_type”: The target relation entry type, should be a Link entry.
“attribute”: The attribute of the relation to extract.
“index_annotation”: The annotation object used to index the head and child node of the relations.
-
update_vocab
(pack, context=None)[source]¶ Update values of relation attributes to the vocabulary.
- Parameters
pack (
DataPack
) – The input data pack.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
- Returns
None
-
extract
(pack, context=None)[source]¶ Extract link data as features from the context.
- Parameters
pack (
DataPack
) – The input data pack that contains the features.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
Returns:
- Return type
-
add_to_pack
(pack, predictions, context=None)[source]¶ Convert prediction back to Links inside the data pack.
- Parameters
pack (
DataPack
) – The datapack to add predictions back to.
predictions (
List
[Tuple
[Tuple
[int
,int
],Tuple
[int
,int
],int
]]) – The output of the model, as a list of triplets. The first element of each triplet identifies the parent and the second element identifies the child. These two are indexed by the index_annotation of this extractor. The last element is the index of the relation attribute.
context (
Optional
[Annotation
]) – The context is an Annotation entry to which predictions will be added. This has the same meaning as context in extract()
. If None, then the whole data pack will be used as the context. Default is None.
SubwordExtractor¶
-
class
forte.data.extractors.subword_extractor.
SubwordExtractor
[source]¶ SubwordExtractor extracts features from the subwords of an entry. Most of the time, a user will not need to call this class explicitly; it is called by the framework.
- Parameters
config – An instance of Dict or
Config
-
classmethod
default_configs
()[source]¶ Returns a dictionary of default hyper-parameters.
Here:
“pretrained_model_name”: The name of the pretrained BERT model. Must be the same as the one used in the subword tokenizer.
“subword_class”: the fully qualified name of the class of the subword, default is ft.onto.base_ontology.Subword.
-
extract
(pack, context=None)[source]¶ Extract the subword feature of one instance.
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
- Returns
a feature that contains the extracted data.
- Return type
Feature
CharExtractor¶
-
class
forte.data.extractors.char_extractor.
CharExtractor
[source]¶ CharExtractor extracts features from the text of an entry. The text is split into characters.
-
classmethod
default_configs
()[source]¶ Returns a dictionary of default configuration parameters.
Here:
“max_char_length”: int The maximum number of characters for one token in the text, default is None, which means no limit will be set.
“entry_type”: str The fully qualified name of an annotation type entry. Characters will be extracted based on these entries. Default is Token, which means characters of tokens will be extracted.
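The effect of max_char_length can be sketched in plain Python. The helper split_chars is hypothetical and works on plain strings instead of Token entries. Treating the limit as simple truncation of each token is an assumption made for this sketch.

```python
from typing import List, Optional


def split_chars(
    tokens: List[str], max_char_length: Optional[int] = None
) -> List[List[str]]:
    """Split each token into characters.

    At most max_char_length characters are kept per token. None means
    no limit, because slicing with [:None] keeps the whole list.
    """
    return [list(token)[:max_char_length] for token in tokens]


limited = split_chars(["Forte", "is", "modular"], max_char_length=4)
unlimited = split_chars(["Forte", "is"])
print(limited)
print(unlimited)
```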
-
update_vocab
(pack, context=None)[source]¶ Add all characters into the vocabulary.
- Parameters
pack (
DataPack
) – The input data pack.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
-
extract
(pack, context=None)[source]¶ Extract the character feature of one instance.
- Parameters
pack (
DataPack
) – The datapack to extract features from.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
- Return type
- Returns
An iterator of features that contain the characters of each specified annotation.
BioSeqTaggingExtractor¶
-
class
forte.data.extractors.seqtagging_extractor.
BioSeqTaggingExtractor
[source]¶ BioSeqTaggingExtractor will extract the feature by performing BIO encoding for the attribute of an entry and aligning it to the tagging_unit entry. Most of the time, a user will not need to call this class explicitly; it is called by the framework.
-
initialize
(config)[source]¶ Initialize the extractor based on the provided configuration.
- Parameters
config (
Union
[Dict
,HParams
]) – The configuration of the extractor. It can be a Dict or Config
. See default_configs()
for available options and default values.
-
classmethod
default_configs
()[source]¶ Returns a dictionary of default hyper-parameters.
Here, additional parameters are added from the parent class:
entry_type (str): Required. The fully qualified name of an Annotation entry to extract attribute from. For example, for an NER task, it could be ft.onto.base_ontology.EntityMention.
attribute (str): Required. The attribute name of the entry from which labels are extracted.
tagging_unit (str): Required. The fully qualified name of the units for tagging. The tagging labels will align to these units, e.g. ft.onto.base_ontology.Token.
pad_value (int): A customized value/representation to be used for padding. This value is only needed when need_pad is True. Default is -100 to follow PyTorch convention.
is_bert (bool): Indicates whether a BERT model is used. If True, padding will be added to the beginning and end of a sentence, corresponding to the special tokens ([CLS], [SEP]) used in BERT. Default is False.
For example, the config can be:
{ "entry_type": "ft.onto.base_ontology.EntityMention", "attribute": "ner_type", "tagging_unit": "ft.onto.base_ontology.Token" }
The extractor will extract the BIO NER tags for instances. A possible feature can be:
[[None, "O"], ["LOC", "B"], ["LOC", "I"], [None, "O"], [None, "O"], ["PER", "B"], [None, "O"]]
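How such a feature arises can be sketched with a small BIO encoder over tagging-unit indexes. The function bio_encode and its span format (start, exclusive end, label) are assumptions for this sketch and are not part of Forte. The real extractor works on entries in a data pack rather than on index spans.

```python
from typing import List, Optional, Tuple


def bio_encode(
    num_units: int, mentions: List[Tuple[int, int, str]]
) -> List[List[Optional[str]]]:
    """Encode non-overlapping mentions as [label, position] pairs.

    Each mention is (start, end_exclusive, label) over tagging-unit
    indexes. Units outside any mention get [None, "O"].
    """
    tags: List[List[Optional[str]]] = [[None, "O"] for _ in range(num_units)]
    for start, end, label in mentions:
        for i in range(start, end):
            tags[i] = [label, "B" if i == start else "I"]
    return tags


# Seven tokens, a two-token LOC mention and a one-token PER mention.
tags = bio_encode(7, [(1, 3, "LOC"), (5, 6, "PER")])
print(tags)
```

With these spans the sketch reproduces the example feature listed here.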
-
predefined_vocab
(predefined)[source]¶ Add predefined tags into the vocabulary, i.e. one can construct the tag vocabulary without exploring the training data.
- Parameters
predefined (
Iterable
) – A set of pre-defined tags.
-
update_vocab
(pack, context=None)[source]¶ Add all the tags from one instance into the vocabulary.
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
-
extract
(pack, context=None)[source]¶ Extract the sequence tagging feature of one instance. If the vocabulary of this extractor is set, then the extracted tag sequences will be converted to the tag ids (int).
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
- Returns (Feature): a feature that contains the extracted BIO sequence and other metadata.
- Return type
-
pre_evaluation_action
(pack, context=None)[source]¶ This function is performed on the pack before the evaluation stage, allowing one to perform some actions before the evaluation. By default, this function will remove tags in the instance. You can overwrite this function by yourself.
- Parameters
pack (
DataPack
) – The datapack to be processed.
context (
Optional
[Annotation
]) – The context is an Annotation entry where data are extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
-
add_to_pack
(pack, predictions, context=None)[source]¶ Add the prediction results to data pack. The predictions are
We make the following assumptions for the prediction.
If we encounter “I” while its tag is different from the previous tag, we will consider this “I” as a “B” and start a new tag here.
We will truncate the prediction according to the number of entries. If the prediction contains <PAD> elements, this should remove them.
- Parameters
pack (
DataPack
) – The datapack that contains the current instance.
predictions (
List
[int
]) – The output of the model, which contains the attribute indexes for one instance.
context (
Optional
[Annotation
]) – The context is an Annotation entry where features will be extracted within its range. If None, then the whole data pack will be used as the context. Default is None.
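The two assumptions stated for add_to_pack can be sketched as a small decoder. The function bio_decode is hypothetical. It takes already decoded [label, position] pairs instead of the integer indexes the real method receives, and it returns index spans instead of writing entries into a pack.

```python
def bio_decode(tags, num_units):
    """Turn [label, position] pairs into (start, end_exclusive, label) spans.

    The prediction is truncated to num_units, which drops any padded
    positions at the end. An "I" whose label differs from the currently
    open span is treated as a "B" and starts a new span.
    """
    tags = tags[:num_units]
    spans = []
    start, current = None, None
    for i, (label, pos) in enumerate(tags):
        if pos == "I" and label != current:
            pos = "B"
        # A "B" or an "O" closes the span that is currently open.
        if pos in ("B", "O") and current is not None:
            spans.append((start, i, current))
            start, current = None, None
        if pos == "B":
            start, current = i, label
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


predicted = [
    ["LOC", "B"], ["PER", "I"], ["PER", "I"], [None, "O"],
    ["ORG", "I"], [None, "O"], [None, "O"],
]
# Only five tagging units exist, so the last two positions are padding.
spans = bio_decode(predicted, num_units=5)
print(spans)
```

In this input the first "I" carries PER after a LOC span, so it opens a new span, and the trailing ORG "I" with no open span is also promoted to a "B".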
Predictor¶
-
class
forte.processors.base.batch_processor.
Predictor
[source]¶ Predictor is a special type of batch processor that uses
BaseExtractor
to collect features from data packs, and also uses Extractors to write the prediction back.
Predictor implements the PackingBatchProcessor class, and implements the predict and pack functions using the extractors.
-
add_extractor
(name, extractor, is_input, converter=None)[source]¶ Extractors can be added to the predictor directly via this method.
- Parameters
name (
str
) – The name/identifier of this extractor. Names must be unique across extractors.
extractor (
BaseExtractor
) – The extractor instance to be added.
is_input (
bool
) – Whether this extractor will be used as input or output.
converter (
Optional
[Converter
]) – The converter instance to be applied after running the extractor.
- Returns
None
-
classmethod
define_batcher
()[source]¶ Define a specific batcher for this processor. A single-pack
BaseBatchProcessor
initializes the batcher to be a ProcessingBatcher
, and a MultiPackBatchProcessor
initializes the batcher to be a MultiPackBatchProcessor
.
- Return type
-
initialize
(resources, configs)[source]¶ The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with
configs
, and will register global resources into resource
. The implementation should set up the states of the component.
-
pack
(pack, predict_results, context=None)[source]¶ The function that task processors should implement. It is the custom function on how to add the predicted output back to the data pack.
- Parameters
pack (~PackType) – The pack to add entries or fields to.
predict_results (Dict) – The prediction results returned by predict(). This processor will add these results to the provided pack as entries and attributes.
context (Optional[Annotation]) – The context entry within which the prediction is performed. The pack operation should be performed relative to this range annotation. If None, the whole data pack is used as the context.
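The role of the context argument can be pictured with a small stand-alone sketch. ToyToken and toy_pack are hypothetical names, spans are plain character offsets, and one predicted tag per token is assumed. None of this is Forte's implementation; it only shows predictions being written back to the entries inside a context range, or to all entries when the context is None.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyToken:
    """Hypothetical token entry with a character span and one attribute."""

    begin: int
    end: int
    tag: Optional[str] = None


def toy_pack(tokens: List[ToyToken], predicted_tags: List[str],
             context: Optional[Tuple[int, int]] = None) -> None:
    # With no context, every token in the (toy) pack is a target.
    if context is None:
        targets = tokens
    else:
        start, stop = context
        targets = [t for t in tokens if start <= t.begin and t.end <= stop]
    if len(targets) != len(predicted_tags):
        raise ValueError("One predicted tag is expected per token in context")
    # Write each prediction back as an attribute of its token.
    for token, tag in zip(targets, predicted_tags):
        token.tag = tag


tokens = [ToyToken(0, 4), ToyToken(5, 9), ToyToken(10, 15)]
toy_pack(tokens, ["B-PER", "O"], context=(0, 9))
```

Only the two tokens inside the range (0, 9) receive a tag; the third token lies outside the context and is left untouched.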
-
Feature¶
-
class
forte.data.converter.
Feature
(data, metadata, vocab=None)[source]¶ This class represents a type of feature for a single data instance. A Feature can have multiple dimensions. It has methods to do padding and retrieve the actual multi-dimensional data.
- Parameters
data (List) – A list of features, where each feature can be a value or another list of features. Typically this should be the output from extract() in BaseExtractor.
metadata (Dict) – A dictionary storing meta-data for this feature. Mandatory fields include: dim, dtype.
dim indicates the total number of dimensions for this feature.
dtype is the value type. For example, it can be torch.long.
vocab (Optional[Vocabulary]) – An optional field specifying the Vocabulary used to build this feature.
Please refer to data() for the typical usage of this class.
-
property
leaf_feature
¶ Returns: True if current feature is leaf feature. Otherwise, False.
- Return type
-
property
dtype
¶ Returns: The data type of this feature.
-
property
sub_features
¶ Returns: A list of sub features. Raises an exception if the current feature is a leaf feature.
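The relationship between leaf_feature and sub_features can be sketched with a toy nested structure. ToyFeature is a hypothetical class written for illustration only. It assumes that a feature with dim equal to 1 is a leaf and that higher-dimensional features wrap one sub-feature per inner list, and it raises ValueError as an arbitrary choice of exception type.

```python
from typing import Any, List


class ToyFeature:
    """Hypothetical nested feature; mirrors the leaf/sub-feature idea only."""

    def __init__(self, data: List[Any], dim: int) -> None:
        self.dim = dim
        if dim == 1:
            # A leaf stores the raw values directly.
            self._values = list(data)
            self._subs: List["ToyFeature"] = []
        else:
            # A non-leaf wraps one lower-dimensional feature per item.
            self._values = []
            self._subs = [ToyFeature(item, dim - 1) for item in data]

    @property
    def leaf_feature(self) -> bool:
        return self.dim == 1

    @property
    def sub_features(self) -> List["ToyFeature"]:
        if self.leaf_feature:
            raise ValueError("A leaf feature has no sub-features")
        return self._subs


feature = ToyFeature([[1, 2, 5], [3], [1, 5]], dim=2)
```

The 2-dim feature above holds three leaf sub-features, and asking any of those leaves for sub_features raises an error.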
-
property
vocab
¶ Returns: The Vocabulary used to build this feature.
- Return type
-
pad
(max_len)[source]¶ Pad the current feature dimension with the given max_len. It will use pad_value to do the padding.
- Parameters
max_len (int) – The padded length.
-
property
data
¶ It will return the actual data stored. Internally, it will recursively retrieve data from inner dimension features. Meanwhile, it will also return a list of masks representing the mask along different dimensions.
- Return type
- Returns
A Tuple where
The first element is the actual data representing this feature.
The second element is a list of masks. masks[i] in this list represents the mask along i-th dimension.
Here are some examples for how the padding works:
Example 1 (1-dim feature, no padding):
    import torch
    from forte.data.converter import Feature

    data = [2, 7, 8]
    meta_data = {"pad_value": 0, "dim": 1, "dtype": torch.long}
    feature = Feature(data, metadata=meta_data)
    data, masks = feature.data
    # data is:  [2, 7, 8]
    # masks is: [[1, 1, 1]]
Example 2 (1-dim feature, scalar padding):
    data = [2, 7, 8]
    meta_data = {"pad_value": 0, "dim": 1, "dtype": torch.long}
    feature = Feature(data, metadata=meta_data)
    feature.pad(max_len=4)
    data, masks = feature.data
    # data is:  [2, 7, 8, 0]
    # masks is: [[1, 1, 1, 0]]
Example 3 (2-dim feature, scalar padding):
    data = [[1, 2, 5], [3], [1, 5]]
    meta_data = {"pad_value": 0, "dim": 2, "dtype": torch.long}
    feature = Feature(data, metadata=meta_data)
    feature.pad(max_len=4)
    for sub_feature in feature.sub_features:
        sub_feature.pad(max_len=3)
    data, masks = feature.data
    # data is:  [[1, 2, 5], [3, 0, 0], [1, 5, 0], [0, 0, 0]]
    # masks is: [
    #     [1, 1, 1, 0],
    #     [[1, 1, 1], [1, 0, 0], [1, 1, 0], [0, 0, 0]]
    # ]
Example 4 (1-dim feature, vector padding):
    data = [[0, 1, 0], [1, 0, 0]]
    meta_data = {"pad_value": [0, 0, 1], "dim": 1, "dtype": torch.long}
    feature = Feature(data, metadata=meta_data)
    feature.pad(max_len=3)
    data, masks = feature.data
    # data is:  [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
    # masks is: [[1, 1, 0]]
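The padding and mask behaviour shown in the examples can also be reproduced without Forte. The following is a minimal pure-Python sketch: pad_with_mask and pad_2d_with_masks are hypothetical helper names, and the code only imitates the documented results rather than the library's internals.

```python
from typing import Any, List, Tuple


def pad_with_mask(values: List[Any], max_len: int,
                  pad_value: Any) -> Tuple[List[Any], List[int]]:
    """Pad a flat list to max_len and return (padded, mask)."""
    if len(values) > max_len:
        raise ValueError("max_len is shorter than the data")
    n_pad = max_len - len(values)
    padded = list(values) + [pad_value] * n_pad
    # 1 marks a real element, 0 marks a padded position.
    mask = [1] * len(values) + [0] * n_pad
    return padded, mask


def pad_2d_with_masks(rows: List[List[Any]], outer_len: int, inner_len: int,
                      pad_value: Any = 0) -> Tuple[List[List[Any]], List[Any]]:
    """Pad a list of lists along both dimensions; return (data, masks)."""
    # Outer dimension: missing rows are added as empty lists first.
    padded_rows, outer_mask = pad_with_mask(rows, outer_len, [])
    data, inner_masks = [], []
    # Inner dimension: every row, including added ones, is padded.
    for row in padded_rows:
        padded, mask = pad_with_mask(row, inner_len, pad_value)
        data.append(padded)
        inner_masks.append(mask)
    return data, [outer_mask, inner_masks]


data, masks = pad_2d_with_masks([[1, 2, 5], [3], [1, 5]], 4, 3)
```

With these inputs, data and masks match the values listed in Example 3, and pad_with_mask alone covers the 1-dim cases, including the vector pad value of Example 4.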
Evaluation¶
Base Evaluator¶
-
class
forte.evaluation.base.base_evaluator.
Evaluator
[source]¶ The base class of the evaluator.
-
abstract
consume_next
(pred_pack, ref_pack)[source]¶ The actual consume function that will be called by the pipeline. This function will deal with the basic pipeline status and call the consume_next function.
- Parameters
pred_pack (~PackType) – The prediction pack, which should contain the system predicted results.
ref_pack (~PackType) – The reference pack, which should contain the reference to score on.
-
abstract
get_result
()[source]¶ The evaluator gathers the results, and the score should be obtained here.
- Return type
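The division of labour between consume_next and get_result can be sketched as an accumulate-then-report pattern. ToyEvaluator and ToyAccuracyEvaluator are hypothetical classes, and packs are modelled as plain lists of labels, which is an assumption made only to keep the example self-contained.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ToyEvaluator(ABC):
    """Hypothetical base class mirroring the two abstract methods above."""

    @abstractmethod
    def consume_next(self, pred_pack, ref_pack) -> None:
        ...

    @abstractmethod
    def get_result(self):
        ...


class ToyAccuracyEvaluator(ToyEvaluator):
    """Packs are modelled as plain lists of labels (an assumption)."""

    def __init__(self) -> None:
        self.correct = 0
        self.total = 0

    def consume_next(self, pred_pack: List[str], ref_pack: List[str]) -> None:
        # Accumulate counts for one (prediction, reference) pair.
        for pred, ref in zip(pred_pack, ref_pack):
            self.correct += int(pred == ref)
            self.total += 1

    def get_result(self) -> Dict[str, float]:
        # Report the score gathered over all consumed pairs.
        accuracy = self.correct / self.total if self.total else 0.0
        return {"accuracy": accuracy}


evaluator = ToyAccuracyEvaluator()
evaluator.consume_next(["PER", "O"], ["PER", "LOC"])
evaluator.consume_next(["O", "O"], ["O", "O"])
```

Each call to consume_next only updates running counts; the score is computed once, when get_result is called.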
-
expected_types_and_attributes
(pred_pack_expectation, ref_pack_expectation)[source]¶ If the evaluator requires particular input types and attributes for pred_pack or ref_pack, users can add the required types and attributes with this function.
-
pred_pack_record
(record_meta)[source]¶ Method to add output type record of prediction datapack of current processor to
forte.data.base_pack.BaseMeta.record
.
-
ref_pack_record
(record_meta)[source]¶ Method to add output type record of reference datapack of current processor to
forte.data.base_pack.BaseMeta.record
.
-
check_record
(pred_pack, ref_pack)[source]¶ Method to check type consistency if enforce_consistency is enabled for the pipeline. If any expected type or its attribute does not exist in the pred_pack or ref_pack record of the previous pipeline component, an ExpectedRecordNotFound error will be raised.
- Parameters
pred_pack (~PackType) – The prediction pack, which should contain the system predicted results.
ref_pack (~PackType) – The reference pack, which should contain the reference to score on.
-
writes_record
(pred_pack, ref_pack)[source]¶ Method to write records of the output type of the current processor to the datapack. The key of each record should be the entry type, and the values should be attributes of that entry type. This information is used for consistency checking if the pipeline is initialized with enforce_consistency=True.
- Parameters
pred_pack (~PackType) – The prediction pack, which should contain the system predicted results.
ref_pack (~PackType) – The reference pack, which should contain the reference to score on.
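The record format described above (entry type as key, attributes as values) lends itself to a simple consistency check. The sketch below is an assumption-laden illustration: ToyExpectedRecordNotFound and toy_check_record are hypothetical names, records are modelled as dictionaries of attribute sets, and the Token type with its "pos" attribute is used purely as sample data.

```python
from typing import Dict, Set


class ToyExpectedRecordNotFound(Exception):
    """Hypothetical stand-in for the ExpectedRecordNotFound error."""


def toy_check_record(expectation: Dict[str, Set[str]],
                     record: Dict[str, Set[str]]) -> None:
    # Keys are entry type names; values are sets of attribute names.
    for entry_type, attributes in expectation.items():
        if entry_type not in record:
            raise ToyExpectedRecordNotFound(f"Missing type: {entry_type}")
        missing = attributes - record[entry_type]
        if missing:
            raise ToyExpectedRecordNotFound(
                f"Missing attributes of {entry_type}: {sorted(missing)}")


# Sample record, as an upstream component might have written it.
record = {
    "ft.onto.base_ontology.Sentence": set(),
    "ft.onto.base_ontology.Token": {"pos"},
}
toy_check_record({"ft.onto.base_ontology.Token": {"pos"}}, record)
```

The final call passes silently because the expected type and attribute are both present; an expectation naming an unrecorded type or attribute would raise the error instead.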
-
abstract
Task Evaluators¶
-
class
forte.evaluation.ner_evaluator.
CoNLLNEREvaluator
[source]¶ -
initialize
(resources, configs)[source]¶ Initialize the evaluator with resources and configs. This method is called by the pipeline during the initialization.
- Parameters
resources (Resources) – An object of class Resources that holds references to objects that can be shared throughout the pipeline.
configs (HParams) – A configuration to initialize the evaluator. This evaluator is expected to hold the following (key, value) pairs:
- “entry_type” (str): The entry to be evaluated.
- “tagging_unit” (str): The tagging unit that the evaluation is performed on, e.g. “ft.onto.base_ontology.Sentence”.
- “attribute” (str): The attribute of the entry to be evaluated.
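A configuration carrying the three documented keys might look like the sketch below. Only the tagging_unit value comes from the description above; the entry_type and attribute values are assumed examples, and missing_keys is a hypothetical helper, not part of the evaluator.

```python
from typing import Any, Dict, List

# Illustrative configuration for the three documented keys.
ner_eval_configs: Dict[str, Any] = {
    # Assumed value: the entry type holding the predicted entities.
    "entry_type": "ft.onto.base_ontology.EntityMention",
    # Value taken from the example in the parameter description.
    "tagging_unit": "ft.onto.base_ontology.Sentence",
    # Assumed value: the attribute that stores the entity label.
    "attribute": "ner_type",
}

REQUIRED_KEYS = ("entry_type", "tagging_unit", "attribute")


def missing_keys(configs: Dict[str, Any]) -> List[str]:
    """Return the documented keys that are absent from configs."""
    return [key for key in REQUIRED_KEYS if key not in configs]
```

A quick call such as missing_keys(ner_eval_configs) can flag an incomplete configuration before it is handed to the pipeline.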
-
classmethod
default_configs
()[source]¶ Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.
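Replacing missing values with defaults can be pictured as a recursive dictionary merge. The sketch below is one plausible reading of that behaviour and not the library's actual routine: fill_defaults is a hypothetical helper, and the default values of None are placeholders rather than the evaluator's real defaults.

```python
from typing import Any, Dict


def fill_defaults(user: Dict[str, Any],
                  defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Return defaults overridden by user values, recursing into dicts."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = fill_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


# Placeholder defaults; the real default values are not stated here.
defaults = {"entry_type": None, "tagging_unit": None, "attribute": None}
configs = fill_defaults({"attribute": "ner_type"}, defaults)
```

Keys supplied by the user win, keys left out fall back to the defaults, and the defaults dictionary itself is left unmodified.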
-