Processors

Base Processors

BaseProcessor

class forte.processors.base.base_processor.BaseProcessor[source]

Base class inherited by all kinds of processors, such as trainers, predictors, and evaluators.

record(record_meta)[source]

Method to add the output record of the current processor to forte.data.data_pack.Meta.record. The key of the record should be the entry type and the values should be attributes of the entry type. All the information is used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.

Parameters

record_meta (Dict[str, Set[str]]) – The field in the datapack storing type records that needs to be filled in for consistency checking.

expected_types_and_attributes()[source]

Method to add expected types and attributes for the input of the current processor, which would be checked before running the processor if the pipeline is initialized with enforce_consistency=True.

Return type

Dict[str, Set[str]]

check_record(input_pack)[source]

Method to check type consistency if the pipeline is initialized with enforce_consistency=True. If any expected type or its attribute does not exist in the datapack record of the previous pipeline component, an ExpectedRecordNotFound error will be raised.

Parameters

input_pack (~PackType) – The input datapack.

write_record(input_pack)[source]

Method to write records of the output type of the current processor to the datapack. The key of the record should be the entry type and the values should be attributes of the entry type. All the information is used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.

Parameters

input_pack (~PackType) – The input datapack.

classmethod default_configs()[source]

Returns a dict of configurations of the processor with default values. Used to replace the missing values of input configs during pipeline construction.

Return type

Dict[str, Any]
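
The sketch below illustrates how a simple pack processor might override these hooks. It is a minimal, illustrative example only: the processor name, the entry type (ft.onto.base_ontology.Token), the "pos" attribute, and the config keys are assumptions for demonstration, not part of the Forte API; the _process hook is the standard per-pack processing method of PackProcessor.

from typing import Any, Dict, Set

from forte.data.data_pack import DataPack
from forte.processors.base.pack_processor import PackProcessor


class MyTaggerProcessor(PackProcessor):
    """Illustrative processor; entry and attribute names are examples only."""

    def expected_types_and_attributes(self) -> Dict[str, Set[str]]:
        # Require that an upstream component has produced Token entries.
        return {"ft.onto.base_ontology.Token": set()}

    def record(self, record_meta: Dict[str, Set[str]]):
        # Declare that this processor writes the "pos" attribute of Token.
        record_meta["ft.onto.base_ontology.Token"] = {"pos"}

    @classmethod
    def default_configs(cls) -> Dict[str, Any]:
        # These values fill in keys missing from user-provided configs.
        return {"model_path": None, "batch_size": 1}

    def _process(self, input_pack: DataPack):
        pass  # the actual tagging logic would go here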

BasePackProcessor

class forte.processors.base.pack_processor.BasePackProcessor[source]

The base class of processors that process one pack in a streaming way. If you are looking for batching (which might happen across packs), refer to BaseBatchProcessor.

BaseBatchProcessor

class forte.processors.base.batch_processor.BaseBatchProcessor[source]

The base class of processors that process data in batches. This processor enables easy data batching by analyzing the context and data objects. The context defines the scope of analysis for a particular task.

For example, in dependency parsing, the context is normally a sentence, in entity coreference, the context is normally a document. The processor will create data batches relative to the context.

Key fields in this processor:

  • batcher: The processing batcher used for this processor. The batcher will also keep track of the relation between the pack and the batch data.

  • use_coverage_index: If true, the index will be built based on the requests.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (Optional[HParams]) – The configuration passed in to set up this component.

flush()[source]

Indicate that there will be no more packs to be passed in, handle what’s remaining in the buffer.

classmethod default_configs()[source]

Defines the default configs for batching processor.

Return type

Dict[str, Any]

abstract classmethod define_batcher()[source]

Define a specific batcher for this processor. The single-pack BaseBatchProcessor initializes the batcher to be a ProcessingBatcher, and MultiPackBatchProcessor initializes the batcher to be a MultiPackBatchProcessor.

Return type

ProcessingBatcher

PackingBatchProcessor

class forte.processors.base.batch_processor.PackingBatchProcessor[source]

This class extends the BaseBatchProcessor class and provides additional utilities to align and pack the extracted results back to the data pack.

To implement this processor, one needs to implement: 1. The predict function, which makes predictions for each input data batch. 2. The pack function, which adds the prediction values back to the data pack.

Users implementing the processor only need to be concerned with a single batch; the alignment between the data batch and the data pack is maintained by the system.
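
A minimal sketch of these two hooks is shown below. The entry type (ft.onto.base_ontology.EntityMention), the "entities" key, and the dummy prediction values are illustrative assumptions; the batcher definition and config setup required by a real batch processor are omitted, and we assume pack receives the result slice for a single context (the slicing is handled by pack_all).

from typing import Any, Dict, List

from ft.onto.base_ontology import EntityMention
from forte.processors.base.batch_processor import PackingBatchProcessor


class DummyEntityProcessor(PackingBatchProcessor):
    """Illustrative only; batcher definition and configs are omitted."""

    def predict(self, data_batch: dict) -> Dict[str, List[Any]]:
        # Return one list of (begin, end, type) spans per instance in the batch.
        batch_size = len(data_batch["context"])
        return {"entities": [[(0, 4, "PER")] for _ in range(batch_size)]}

    def pack(self, pack, predict_results: Dict[str, Any], context=None):
        # Write the predicted spans for this pack back as EntityMention entries.
        # Offsets are assumed to be relative to the pack text in this sketch.
        for begin, end, ner_type in predict_results["entities"]:
            mention = EntityMention(pack, begin, end)
            mention.ner_type = ner_type
            pack.add_entry(mention)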

predict(data_batch)[source]

The function that task processors should implement. Make predictions for the input data_batch.

Parameters

data_batch (dict) – A batch of instances in our dict format.

Return type

Dict[str, List[Any]]

Returns

The prediction results in dictionary form.

pack(pack, predict_results, context=None)[source]

The function that task processors should implement. It is the custom function defining how to add the predicted output back to the data pack.

Parameters
  • pack (~PackType) – The pack to add entries or fields to.

  • predict_results (Dict[str, Any]) – The prediction results returned by predict(). This processor will add these results to the provided pack as entry and attributes.

  • context (Optional[Annotation]) – The context entry on which the prediction is performed; the pack operation should be performed relative to this range annotation. If None, the whole data pack is used as the context.

pack_all(packs, contexts, output_dict)[source]

Pack the prediction results contained in the output_dict back to the corresponding packs.

Parameters
  • packs (List[~PackType]) – The list of data packs corresponding to the output batches.

  • contexts (List[Optional[Annotation]]) – The list of contexts corresponding to the output batches.

  • output_dict (Dict[str, List[Any]]) – Stores the output in a specific format. The keys are string names that specify data. The value is a list of data in the shape of (batch_size, Any). There might be additional structures inside Any as specific implementation choices.

MultiPackBatchProcessor

class forte.processors.base.batch_processor.MultiPackBatchProcessor[source]

This class defines the base batch processor for MultiPack.

RequestPackingProcessor

class forte.processors.base.batch_processor.RequestPackingProcessor[source]

A processor that implements the packing batch processor, using a variation of the fixed-size batcher, FixedSizeRequestDataPackBatcher, which uses the DataPack.get_data function with the context_type and requests parameters.

classmethod define_batcher()[source]

Define a specific batcher for this processor. The single-pack BaseBatchProcessor initializes the batcher to be a ProcessingBatcher, and MultiPackBatchProcessor initializes the batcher to be a MultiPackBatchProcessor.

Return type

ProcessingBatcher

classmethod default_configs()[source]

Defines the default configs for batching processor.

Return type

Dict[str, Any]

FixedSizeBatchProcessor

class forte.processors.base.batch_processor.FixedSizeBatchProcessor[source]

A processor that implements the packing batch processor, using a fixed-size batcher, FixedSizeDataPackBatcher.

classmethod default_configs()[source]

Defines the default configs for batching processor.

Return type

Dict[str, Any]

Predictor

class forte.processors.base.batch_processor.Predictor[source]

Predictor is a special type of batch processor that uses BaseExtractor instances to collect features from data packs, and also uses extractors to write the predictions back.

Predictor extends the PackingBatchProcessor class and implements the predict and pack functions using the extractors.

add_extractor(name, extractor, is_input, converter=None)[source]

Extractors can be added to the predictor directly via this method.

Parameters
  • name (str) – The name/identifier of this extractor; the name should be unique across extractors.

  • extractor (BaseExtractor) – The extractor instance to be added.

  • is_input (bool) – Whether this extractor will be used as input or output.

  • converter (Optional[Converter]) – The converter instance to be applied after running the extractor.

Returns

None
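
A hedged sketch of wiring extractors into a predictor follows. Here text_extractor and tag_extractor are placeholders for concrete BaseExtractor instances built for a specific task; converters, model loading, and pipeline setup are omitted.

# Sketch only: `text_extractor` and `tag_extractor` stand in for concrete
# BaseExtractor instances; the model and converters are omitted.
from forte.processors.base.batch_processor import Predictor

predictor = Predictor()
predictor.add_extractor("text", text_extractor, is_input=True)
predictor.add_extractor("ner_tag", tag_extractor, is_input=False)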

classmethod define_batcher()[source]

Define a specific batcher for this processor. The single-pack BaseBatchProcessor initializes the batcher to be a ProcessingBatcher, and MultiPackBatchProcessor initializes the batcher to be a MultiPackBatchProcessor.

Return type

ProcessingBatcher

classmethod default_configs()[source]

Defines the default configs for batching processor.

Return type

Dict[str, Any]

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

pack(pack, predict_results, context=None)[source]

The function that task processors should implement. It is the custom function defining how to add the predicted output back to the data pack.

Parameters
  • pack (~PackType) – The pack to add entries or fields to.

  • predict_results (Dict) – The prediction results returned by predict(). This processor will add these results to the provided pack as entry and attributes.

  • context (Optional[Annotation]) – The context entry on which the prediction is performed; the pack operation should be performed relative to this range annotation. If None, the whole data pack is used as the context.

predict(data_batch)[source]

The function that task processors should implement. Make predictions for the input data_batch.

Parameters

data_batch (Dict) – A batch of instances in our dict format.

Return type

Dict

Returns

The prediction results in dictionary form.

Pack Processors

PackProcessor

class forte.processors.base.pack_processor.PackProcessor[source]

The base class of processors that process one DataPack at a time.

MultiPackProcessor

class forte.processors.base.pack_processor.MultiPackProcessor[source]

The base class of processors that process one MultiPack at a time.

Task Processors

ElizaProcessor

class forte.processors.nlp.eliza_processor.ElizaProcessor[source]

The ElizaProcessor, adapted from https://github.com/wadetb/eliza. This processor generates responses based on the Eliza rules. For more information, please refer to https://dl.acm.org/doi/10.1145/365153.365168.

This processor is not parallel-ready because it keeps an internal state memory.

SubwordTokenizer

class forte.processors.nlp.subword_tokenizer.SubwordTokenizer[source]

Subword tokenizer using a pretrained BERT model.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

record(record_meta)[source]

Method to add the output type record of the current processor to forte.data.data_pack.Meta.record.

Parameters

record_meta (Dict[str, Set[str]]) – The field in the data pack storing type records needed for consistency checking.

Returns

None

expected_types_and_attributes()[source]

Method to add the expected types for the input of the current processor, which would be checked before running the processor if the pipeline is initialized with enforce_consistency=True or enforce_consistency() is enabled for the pipeline.

Return type

Dict[str, Set[str]]

classmethod default_configs()[source]

Returns the configuration with default values.

Here:

  • tokenizer_configs contains all default hyper-parameters of BERTTokenizer; this processor passes all these configurations on to the tokenizer to create the tokenizer instance.

  • segment_unit contains an Annotation entry type used to split the text into smaller units. For example, setting this to ft.onto.base_ontology.Sentence will make this tokenizer do tokenization on a sentence basis, which could be more efficient when the alignment is used.

  • token_source contains the entry name from which the tokens come. For example, setting this to ft.onto.base_ontology.Token will make this tokenizer split sub-words based on these tokens. The default value is ft.onto.base_ontology.Token. If this value is set to None, the word_tokenization function of this class will be used for tokenization.

Note that if segment_unit or token_source is provided, the check_record() will check if certain types are written before this processor.

Returns: Default configuration value for the tokenizer.
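
A hedged usage sketch of these config keys follows. It assumes `pipeline` is an already-constructed forte.pipeline.Pipeline, and the pretrained_model_name value inside tokenizer_configs is an assumed option of the underlying BERTTokenizer.

pipeline.add(
    SubwordTokenizer(),
    config={
        # Passed through to the underlying BERTTokenizer (assumed option name).
        "tokenizer_configs": {"pretrained_model_name": "bert-base-uncased"},
        # Tokenize one sentence at a time.
        "segment_unit": "ft.onto.base_ontology.Sentence",
        # Split sub-words from existing Token entries.
        "token_source": "ft.onto.base_ontology.Token",
    },
)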

CoNLLNERPredictor

class forte.processors.nlp.ner_predictor.CoNLLNERPredictor[source]

A named entity recognizer trained according to Ma, Xuezhe, and Eduard Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF”.

Note that to use CoNLLNERPredictor, the ontology of the Pipeline must include ft.onto.base_ontology.Token and ft.onto.base_ontology.Sentence.
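
A pipeline sketch satisfying this requirement is shown below. It assumes CoNLL03Reader (which produces Sentence and Token entries) as the reader; the data path is a placeholder, and the model and resource configuration the predictor needs are omitted.

from forte.pipeline import Pipeline
from forte.data.readers import CoNLL03Reader
from forte.processors.nlp.ner_predictor import CoNLLNERPredictor

pipeline = Pipeline()
pipeline.set_reader(CoNLL03Reader())   # provides Sentence and Token entries
pipeline.add(CoNLLNERPredictor())      # model/resource configs omitted in this sketch
pipeline.initialize()

for pack in pipeline.process_dataset("path/to/conll03_data"):
    pass  # predicted entity annotations are written into each pack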

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

predict(data_batch)[source]

The function that task processors should implement. Make predictions for the input data_batch.

Parameters

data_batch (dict) – A batch of instances in our dict format.

Return type

Dict[str, Dict[str, List[ndarray]]]

Returns

The prediction results in dictionary form.

pack(pack, predict_results, _=None)[source]

Write the prediction results back to the datapack by writing the predicted NER tags to the original tokens.

get_batch_tensor(data, device=None)[source]

Get the tensors to be fed into the model.

Parameters
  • data – The batch of input data to be converted into tensors.

  • device – The device on which the tensors should be created.

Return type

Tuple[Tensor, Tensor, Tensor, Tensor]

Returns

A tuple where

  • words: A tensor of shape [batch_size, batch_length] representing the word ids in the batch

  • chars: A tensor of shape [batch_size, batch_length, char_length] representing the char ids for each word in the batch

  • masks: A tensor of shape [batch_size, batch_length] representing the indices to be masked in the batch. 1 indicates no masking.

  • lengths: A tensor of shape [batch_size] representing the length of each sentence in the batch

classmethod default_configs()[source]

Default config for the NER predictor.

SRLPredictor

class forte.processors.nlp.srl_predictor.SRLPredictor[source]

A semantic role labeler trained according to He, Luheng, et al., “Jointly predicting predicates and arguments in neural semantic role labeling”.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (Optional[HParams]) – The configuration passed in to set up this component.

predict(data_batch)[source]

The function that task processors should implement. Make predictions for the input data_batch.

Parameters

data_batch (dict) – A batch of instances in our dict format.

Return type

Dict[str, List[List[Tuple[Span, List[Tuple[Span, str]]]]]]

Returns

The prediction results in dictionary form.

pack(pack, predict_results, _=None)[source]

The function that task processors should implement. It is the custom function defining how to add the predicted output back to the data pack.

Parameters
  • pack (DataPack) – The pack to add entries or fields to.

  • predict_results (Dict[str, List[List[Tuple[Span, List[Tuple[Span, str]]]]]]) – The prediction results returned by predict(). This processor will add these results to the provided pack as entry and attributes.

  • context – The context entry on which the prediction is performed; the pack operation should be performed relative to this range annotation. If None, the whole data pack is used as the context.

classmethod default_configs()[source]

This defines the default configuration structure for the predictor.

VocabularyProcessor

class forte.processors.misc.vocabulary_processor.VocabularyProcessor[source]

Builds a vocabulary from the input DataPack and writes the result into the shared resources.

Alphabet

class forte.processors.misc.vocabulary_processor.Alphabet(name, word_cnt=None, keep_growing=True, ignore_case_in_query=True, other_embeddings=None)[source]
Parameters
  • name – The name of the alphabet

  • keep_growing (bool) – If True, new instances not found during get_index will be added to the vocabulary.

  • ignore_case_in_query (bool) – If True, the Alphabet will try to query the lower-cased input from its vocabulary if it cannot find the input in its keys.

get_index(instance)[source]
Parameters

instance – the input token

Returns

the index of the queried token in the dictionary

save(output_directory, name=None)[source]

Save both alphabet records to the given directory.

Parameters
  • output_directory – Directory to save model and weights.

  • name – The alphabet saving name, optional.
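
A small usage sketch based only on the constructor and methods documented above; the token values and the directory name are illustrative.

from forte.processors.misc.vocabulary_processor import Alphabet

alphabet = Alphabet("words", keep_growing=True, ignore_case_in_query=True)
idx = alphabet.get_index("forte")    # unseen tokens are added while keep_growing is True
same = alphabet.get_index("Forte")   # falls back to the lower-cased "forte" entry
alphabet.save("./alphabet_dir")      # writes the alphabet records to this directory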

PeriodSentenceSplitter

class forte.processors.misc.simple_processors.PeriodSentenceSplitter[source]

A processor that creates sentences based on periods.

WhiteSpaceTokenizer

class forte.processors.misc.simple_processors.WhiteSpaceTokenizer[source]

A simple processor that splits text into tokens based on white space.
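
A minimal end-to-end sketch using these two processors follows; it assumes StringReader from forte.data.readers as the input source and uses an arbitrary example string.

from forte.pipeline import Pipeline
from forte.data.readers import StringReader
from forte.processors.misc.simple_processors import (
    PeriodSentenceSplitter,
    WhiteSpaceTokenizer,
)
from ft.onto.base_ontology import Sentence, Token

pipeline = Pipeline()
pipeline.set_reader(StringReader())
pipeline.add(PeriodSentenceSplitter())
pipeline.add(WhiteSpaceTokenizer())
pipeline.initialize()

pack = pipeline.process("Forte builds NLP pipelines. This is a second sentence.")
print([s.text for s in pack.get(Sentence)])
print([t.text for t in pack.get(Token)])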

RemoteProcessor

class forte.processors.misc.remote_processor.RemoteProcessor[source]

RemoteProcessor wraps up the interactions with a remote Forte endpoint. Each input DataPack from the upstream component will be serialized and packed into a POST request to be sent to a remote service, which should return a response that can be parsed into a DataPack to update the input. Example usage:

# Assume that a Forte service is running on "localhost:8008".
Pipeline() \
    .set_reader(plaintext_reader(), {"input_path":"some/path"}) \
    .add(RemoteProcessor(), {"url": "http://localhost:8008"})
initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

record(record_meta)[source]

Method to add the output type records of the RemoteProcessor. The records are queried from the remote service. The types and attributes are populated from all the components in the remote pipeline.

Parameters

record_meta (Dict[str, Set[str]]) – The field in the datapack storing type records that needs to be filled in for consistency checking.

expected_types_and_attributes()[source]

Method to add expected types and attributes for the input of RemoteProcessor. This should be the expected_types_and_attributes of the first processor in the remote pipeline.

set_test_mode(app)[source]

Configure the processor into test mode. This should only be called from a pytest program.

Parameters

app – A fastapi app from a Forte pipeline.

classmethod default_configs()[source]

This defines a basic config structure for RemoteProcessor. Following are the keys for this dictionary:

  • url: URL of the remote service end point. Default value is “http://localhost:8008”.

  • validation: Information for validation.

    • do_init_type_check: Validate the pipeline by checking the info of the remote pipeline with the expected attributes. Default to False.

    • input_format: The expected input format of the remote service. Default to “string”.

    • expected_name: The expected pipeline name. Default to ‘’.

Returns

A dictionary with the default config for this processor.

Return type

dict
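
A hedged sketch of these config keys follows; `pipeline` is assumed to be an already-constructed forte.pipeline.Pipeline, and the expected_name value is purely illustrative.

pipeline.add(
    RemoteProcessor(),
    config={
        "url": "http://localhost:8008",
        "validation": {
            "do_init_type_check": True,          # check remote pipeline info before running
            "input_format": "string",
            "expected_name": "remote_pipeline",  # illustrative name
        },
    },
)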

LowerCaserProcessor

class forte.processors.misc.lowercaser_processor.LowerCaserProcessor[source]
classmethod default_configs()[source]

Default configurations for this processor. It contains the following configuration values:

  • “custom_substitutions”: a dictionary containing the character mappings used to conduct lowercasing, e.g., {“İ”: “i”}. The lengths (len) of the two strings must be the same.

Returns:

Return type

Dict[str, Any]
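
A short usage sketch, assuming `pipeline` is an already-constructed forte.pipeline.Pipeline; the single mapping shown is the example from the description above.

pipeline.add(
    LowerCaserProcessor(),
    config={"custom_substitutions": {"İ": "i"}},  # character mapping used when lowercasing
)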

DeleteOverlapEntry

class forte.processors.misc.delete_overlap_entries.DeleteOverlapEntry[source]

A processor to delete overlapping annotations in a data pack. When annotations overlap, the first one (based on the iteration order) will be kept and the rest will be removed.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

classmethod default_configs()[source]

The entry_type config determines which type of annotation is checked for duplication. This value should be the name of a class that is a subclass of Annotation; otherwise, a ValueError will be raised.

Returns

None.
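
A usage sketch, assuming `pipeline` is an already-constructed forte.pipeline.Pipeline; EntityMention is shown as one example of an Annotation subclass.

pipeline.add(
    DeleteOverlapEntry(),
    config={"entry_type": "ft.onto.base_ontology.EntityMention"},
)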

AttributeMasker

class forte.processors.misc.attribute_masking_processor.AttributeMasker[source]
initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. Users can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Default config for this processor.

Example usage is shown below

{
    "requests": {
        "ft.onto.base_ontology.Token": ["pos"]
    }
}

Here:

  • “requests” (dict): The entry types and fields required. The keys of the requests dict are the entry types whose fields need to be masked, and the values are lists of field names.

Return type

Dict[str, Any]
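
A usage sketch mirroring the example config above, assuming `pipeline` is an already-constructed forte.pipeline.Pipeline.

pipeline.add(
    AttributeMasker(),
    config={"requests": {"ft.onto.base_ontology.Token": ["pos"]}},  # mask the "pos" field of Token
)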

AnnotationRemover

class forte.processors.misc.annotation_remover.AnnotationRemover[source]
classmethod default_configs()[source]

Returns a dict of configurations of the processor with default values. Used to replace the missing values of input configs during pipeline construction.

Return type

Dict[str, Any]