Processors¶
Base Processors¶
BaseProcessor¶
class forte.processors.base.base_processor.BaseProcessor[source]¶
Base class inherited by all kinds of processors, such as trainers, predictors, and evaluators.
record(record_meta)[source]¶
Method to add the output record of the current processor to forte.data.data_pack.Meta.record. The key of the record should be the entry type and the values should be the attributes of the entry type. All this information is used for consistency checking if the pipeline is initialized with enforce_consistency=True.
expected_types_and_attributes()[source]¶
Method to add the expected types and attributes for the input of the current processor, which are checked before running the processor if the pipeline is initialized with enforce_consistency=True.
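For instance, a custom processor might declare these records as in the following minimal sketch (the processor name and the Token/"pos" types are illustrative assumptions; the _process implementation is omitted):

    from typing import Dict, Set

    from forte.processors.base.pack_processor import PackProcessor


    class HypotheticalTagger(PackProcessor):
        def expected_types_and_attributes(self) -> Dict[str, Set[str]]:
            # Require Token entries to exist before this processor runs.
            return {"ft.onto.base_ontology.Token": set()}

        def record(self, record_meta: Dict[str, Set[str]]):
            # Declare that this processor adds a "pos" attribute to tokens,
            # so downstream consistency checks can rely on it.
            record_meta["ft.onto.base_ontology.Token"] = {"pos"}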
check_record(input_pack)[source]¶
Method to check type consistency if the pipeline is initialized with enforce_consistency=True. If any expected type or its attribute does not exist in the datapack record of the previous pipeline component, an error of ExpectedRecordNotFound will be raised.
- Parameters
  input_pack (~PackType) – The input datapack.
write_record(input_pack)[source]¶
Method to write records of the output type of the current processor to the datapack. The key of the record should be the entry type and the values should be the attributes of the entry type. All this information is used for consistency checking if the pipeline is initialized with enforce_consistency=True.
- Parameters
  input_pack (~PackType) – The input datapack.
BasePackProcessor¶
class forte.processors.base.pack_processor.BasePackProcessor[source]¶
The base class of processors that process one pack in a streaming way. If you are looking for batching (which might happen across packs), refer to BaseBatchProcessor.
BaseBatchProcessor¶
class forte.processors.base.batch_processor.BaseBatchProcessor[source]¶
The base class of processors that process data in batches. This processor enables easy data batching by analyzing the context and data objects. The context defines the scope of analysis for a particular task.
For example, in dependency parsing the context is normally a sentence, while in entity coreference the context is normally a document. The processor will create data batches relative to the context.
Key fields in this processor:
batcher: The processing batcher used for this processor. The batcher will also keep track of the relation between the pack and the batch data.
use_coverage_index: If true, the index will be built based on the requests.
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
flush()[source]¶
Indicate that there will be no more packs to be passed in, and handle what remains in the buffer.
abstract classmethod define_batcher()[source]¶
Define a specific batcher for this processor. The single pack BaseBatchProcessor initializes the batcher to be a ProcessingBatcher, and MultiPackBatchProcessor initializes the batcher to be a MultiPackBatchProcessor.
- Return type
  ProcessingBatcher
PackingBatchProcessor¶
class forte.processors.base.batch_processor.PackingBatchProcessor[source]¶
This class extends the BaseBatchProcessor class and provides additional utilities to align the extracted results and pack them back into the data pack.
To implement this processor, one needs to implement: 1. The predict function, which makes predictions for each input data batch. 2. The pack function, which adds the prediction values back to the data pack.
Users implementing the processor only need to be concerned with a single batch; the alignment between the data batch and the data pack is maintained by the system. A sketch of such a subclass is shown below.
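A minimal sketch, assuming a hypothetical tagging task, a batcher that yields token text under data_batch["Token"]["text"], and that predict_results arriving at pack has been sliced down to a single context (the exact batch layout depends on the configured batcher and requests):

    from typing import Any, Dict

    from forte.processors.base.batch_processor import PackingBatchProcessor
    from ft.onto.base_ontology import Token


    class HypotheticalPOSTagger(PackingBatchProcessor):
        def predict(self, data_batch: Dict) -> Dict[str, Any]:
            # Emit one placeholder tag per token in each batched instance;
            # a real processor would run its model here.
            token_texts = data_batch["Token"]["text"]
            return {"pos": [["NN"] * len(ts) for ts in token_texts]}

        def pack(self, pack, predict_results: Dict[str, Any], context=None):
            # Write the tags back onto the Token entries in this context.
            for token, tag in zip(pack.get(Token, context), predict_results["pos"]):
                token.pos = tag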
predict(data_batch)[source]¶
The function that task processors should implement. Make predictions for the input data_batch.
pack(pack, predict_results, context=None)[source]¶
The function that task processors should implement. It is the custom function for adding the predicted output back to the data pack.
- Parameters
  pack (~PackType) – The pack to add entries or fields to.
  predict_results (Dict[str, Any]) – The prediction results returned by predict(). This processor will add these results to the provided pack as entries and attributes.
  context (Optional[Annotation]) – The context entry on which the prediction is performed; the pack operation should be performed relative to this range annotation. If None, the whole data pack is used as the context.
pack_all(packs, contexts, output_dict)[source]¶
Pack the prediction results contained in output_dict back into the corresponding packs.
- Parameters
  packs (List[~PackType]) – The list of data packs corresponding to the output batches.
  contexts (List[Optional[Annotation]]) – The list of contexts corresponding to the output batches.
  output_dict (Dict[str, List[Any]]) – Stores the output in a specific format. The keys are string names that specify data. The value is a list of data in the shape of (batch_size, Any). There might be additional structures inside Any as specific implementation choices.
MultiPackBatchProcessor¶
RequestPackingProcessor¶
class forte.processors.base.batch_processor.RequestPackingProcessor[source]¶
A processor that implements the packing batch processor, using a variation of the fixed size batcher, FixedSizeRequestDataPackBatcher, which uses the DataPack.get_data function with the context_type and requests parameters.
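A hedged configuration sketch for a subclass of this processor; the nesting of the batcher keys below is an assumption based on the batcher described above:

    # Batch sentences in groups of 8, requesting Token entries within
    # each sentence context.
    config = {
        "batcher": {
            "batch_size": 8,
            "context_type": "ft.onto.base_ontology.Sentence",
            "requests": {"ft.onto.base_ontology.Token": []},
        }
    }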
classmethod define_batcher()[source]¶
Define a specific batcher for this processor. The single pack BaseBatchProcessor initializes the batcher to be a ProcessingBatcher, and MultiPackBatchProcessor initializes the batcher to be a MultiPackBatchProcessor.
- Return type
  ProcessingBatcher
FixedSizeBatchProcessor¶
class forte.processors.base.batch_processor.FixedSizeBatchProcessor[source]¶
A processor that implements the packing batch processor, using a fixed size batcher, FixedSizeDataPackBatcher.
Predictor¶
class forte.processors.base.batch_processor.Predictor[source]¶
Predictor is a special type of batch processor that uses BaseExtractor to collect features from data packs, and also uses extractors to write the predictions back.
Predictor implements the PackingBatchProcessor class, and implements the predict and pack functions using the extractors.
add_extractor(name, extractor, is_input, converter=None)[source]¶
Extractors can be added to the predictor directly via this method, as shown in the sketch after this list.
- Parameters
  name (str) – The name/identifier of this extractor; the name should be different between different extractors.
  extractor (BaseExtractor) – The extractor instance to be added.
  is_input (bool) – Whether this extractor will be used as input or output.
  converter (Optional[Converter]) – The converter instance to be applied after running the extractor.
- Returns
  None
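A hedged usage sketch; only the add_extractor signature is taken from this page, while the extractor class and its construction are illustrative assumptions:

    from forte.data.extractors.attribute_extractor import AttributeExtractor

    predictor = MyPredictor()  # hypothetical Predictor subclass
    # Extractor that builds the model's input features.
    predictor.add_extractor(
        name="text", extractor=AttributeExtractor(), is_input=True
    )
    # Extractor used to write the predictions back to the pack.
    predictor.add_extractor(
        name="ner_tag", extractor=AttributeExtractor(), is_input=False
    )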
classmethod define_batcher()[source]¶
Define a specific batcher for this processor. The single pack BaseBatchProcessor initializes the batcher to be a ProcessingBatcher, and MultiPackBatchProcessor initializes the batcher to be a MultiPackBatchProcessor.
- Return type
  ProcessingBatcher
classmethod default_configs()[source]¶
Defines the default configs for the batching processor.
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
pack(pack, predict_results, context=None)[source]¶
The function that task processors should implement. It is the custom function for adding the predicted output back to the data pack.
- Parameters
  pack (~PackType) – The pack to add entries or fields to.
  predict_results (Dict) – The prediction results returned by predict(). This processor will add these results to the provided pack as entries and attributes.
  context (Optional[Annotation]) – The context entry on which the prediction is performed; the pack operation should be performed relative to this range annotation. If None, the whole data pack is used as the context.
Task Processors¶
ElizaProcessor¶
class forte.processors.nlp.eliza_processor.ElizaProcessor[source]¶
The ElizaProcessor, adapted from https://github.com/wadetb/eliza. This processor generates responses based on the Eliza rules. For more information, please refer to https://dl.acm.org/doi/10.1145/365153.365168.
This processor is not parallel-ready because it keeps an internal state in memory.
SubwordTokenizer¶
class forte.processors.nlp.subword_tokenizer.SubwordTokenizer[source]¶
Subword tokenizer using a pretrained BERT model.
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
record(record_meta)[source]¶
Method to add the output type record of the current processor to forte.data.data_pack.Meta.record.
expected_types_and_attributes()[source]¶
Method to add the expected type for the current processor input, which would be checked before running the processor if the pipeline is initialized with enforce_consistency=True or enforce_consistency() was enabled for the pipeline.
classmethod default_configs()[source]¶
Returns the configuration with default values.
Here:
tokenizer_configs contains all default hyper-parameters in BERTTokenizer; this processor will pass all these configurations on to the tokenizer to create the tokenizer instance.
segment_unit contains an Annotation entry type used to split the text into smaller units. For example, setting this to ft.onto.base_ontology.Sentence will make this tokenizer do tokenization on a sentence basis, which could be more efficient when the alignment is used.
token_source contains the entry name of where the tokens come from. For example, setting this to ft.onto.base_ontology.Token will make this tokenizer split the sub-words based on this token. The default value will use ft.onto.base_ontology.Token. If this value is set to None, the word_tokenization function of this class will be used to do tokenization.
Note that if segment_unit or token_source is provided, check_record() will check whether those types are written before this processor.
Returns: Default configuration value for the tokenizer.
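A hedged configuration example matching the keys above (pipeline is a Pipeline instance built elsewhere; the pretrained model name passed through tokenizer_configs is an assumption):

    pipeline.add(
        SubwordTokenizer(),
        config={
            "tokenizer_configs": {"pretrained_model_name": "bert-base-uncased"},
            "segment_unit": "ft.onto.base_ontology.Sentence",
            "token_source": "ft.onto.base_ontology.Token",
        },
    )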
CoNLLNERPredictor¶
class forte.processors.nlp.ner_predictor.CoNLLNERPredictor[source]¶
A Named Entity Recognizer trained according to Ma, Xuezhe, and Eduard Hovy. “End-to-end sequence labeling via bi-directional lstm-cnns-crf.”
Note that to use CoNLLNERPredictor, the ontology of the Pipeline must include ft.onto.base_ontology.Token and ft.onto.base_ontology.Sentence.
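A hedged pipeline sketch; the reader choice and dataset path are illustrative, and the trained model resources the predictor needs are omitted here:

    from forte.pipeline import Pipeline
    from forte.data.readers import CoNLL03Reader
    from forte.processors.nlp.ner_predictor import CoNLLNERPredictor

    pipe = Pipeline()
    pipe.set_reader(CoNLL03Reader())  # produces Sentence and Token entries
    pipe.add(CoNLLNERPredictor())
    pipe.initialize()
    for pack in pipe.process_dataset("path/to/conll03"):
        pass  # predicted NER tags are now attached to the tokens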
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
predict(data_batch)[source]¶
The function that task processors should implement. Make predictions for the input data_batch.
pack(pack, predict_results, _=None)[source]¶
Write the prediction results back to the datapack by writing the predicted NER tags to the original tokens.
get_batch_tensor(data, device=None)[source]¶
Get the tensors to be fed into the model.
- Parameters
- Return type
  Tuple[Tensor, Tensor, Tensor, Tensor]
- Returns
  A tuple where
  words: A tensor of shape [batch_size, batch_length] representing the word ids in the batch.
  chars: A tensor of shape [batch_size, batch_length, char_length] representing the char ids for each word in the batch.
  masks: A tensor of shape [batch_size, batch_length] representing the indices to be masked in the batch; 1 indicates no masking.
  lengths: A tensor of shape [batch_size] representing the length of each sentence in the batch.
SRLPredictor¶
class forte.processors.nlp.srl_predictor.SRLPredictor[source]¶
A Semantic Role Labeler trained according to He, Luheng, et al. “Jointly predicting predicates and arguments in neural semantic role labeling.”
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
predict(data_batch)[source]¶
The function that task processors should implement. Make predictions for the input data_batch.
pack(pack, predict_results, _=None)[source]¶
The function that task processors should implement. It is the custom function for adding the predicted output back to the data pack.
- Parameters
  pack (DataPack) – The pack to add entries or fields to.
  predict_results (Dict[str, List[List[Tuple[Span, List[Tuple[Span, str]]]]]]) – The prediction results returned by predict(). This processor will add these results to the provided pack as entries and attributes.
  context – The context entry on which the prediction is performed; the pack operation should be performed relative to this range annotation. If None, the whole data pack is used as the context.
VocabularyProcessor¶
Alphabet¶
PeriodSentenceSplitter¶
WhiteSpaceTokenizer¶
RemoteProcessor¶
class forte.processors.misc.remote_processor.RemoteProcessor[source]¶
RemoteProcessor wraps up the interactions with a remote Forte end point. Each input DataPack from the upstream component will be serialized and packed into a POST request to be sent to a remote service, which should return a response that can be parsed into a DataPack to update the input. Example usage:

    # Assume that a Forte service is running on "localhost:8008".
    Pipeline() \
        .set_reader(PlainTextReader(), {"input_path": "some/path"}) \
        .add(RemoteProcessor(), {"url": "http://localhost:8008"})
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
record(record_meta)[source]¶
Method to add the output type record of RemoteProcessor. The records are queried from the remote service. The types and attributes are populated from all the components in the remote pipeline.
expected_types_and_attributes()[source]¶
Method to add the expected types and attributes for the input of RemoteProcessor. This should be the expected_types_and_attributes of the first processor in the remote pipeline.
set_test_mode(app)[source]¶
Configure the processor into test mode. This should only be called from a pytest program.
- Parameters
  app – A fastapi app from a Forte pipeline.
classmethod default_configs()[source]¶
This defines a basic config structure for RemoteProcessor. The following are the keys of this dictionary:
url: URL of the remote service end point. Default value is “http://localhost:8008”.
validation: Information for validation, with the following sub-keys:
  do_init_type_check: Validate the pipeline by checking the info of the remote pipeline against the expected attributes. Defaults to False.
  input_format: The expected input format of the remote service. Defaults to “string”.
  expected_name: The expected pipeline name. Defaults to ‘’.
- Returns
  A dictionary with the default config for this processor.
- Return type
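A hedged example config following the keys above (the expected pipeline name is a hypothetical placeholder):

    config = {
        "url": "http://localhost:8008",
        "validation": {
            "do_init_type_check": True,
            "input_format": "string",
            "expected_name": "my_remote_pipeline",  # hypothetical name
        },
    }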
LowerCaserProcessor¶
class forte.processors.misc.lowercaser_processor.LowerCaserProcessor[source]¶
DeleteOverlapEntry¶
class forte.processors.misc.delete_overlap_entries.DeleteOverlapEntry[source]¶
A processor to delete overlapping annotations in a data pack. When annotations overlap, the first one (based on the iteration order) is kept and the rest are removed.
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
classmethod default_configs()[source]¶
The entry_type config determines which type of annotation is checked for duplication. This value should be the name of a class that is a sub-class of Annotation; otherwise a ValueError will be raised.
- Returns
  None.
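A hedged usage sketch, assuming a Pipeline instance named pipeline and removing overlapping Sentence annotations:

    pipeline.add(
        DeleteOverlapEntry(),
        config={"entry_type": "ft.onto.base_ontology.Sentence"},
    )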
AttributeMasker¶
class forte.processors.misc.attribute_masking_processor.AttributeMasker[source]¶
initialize(resources, configs)[source]¶
The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.
classmethod default_configs()[source]¶
Default config for this processor.
Example usage is shown below:

    {
        "requests": {
            "ft.onto.base_ontology.Token": ["pos"]
        }
    }

Here:
“requests”: dict. The entry types and fields required. The keys of the requests dict are the entry types whose fields need to be masked, and the value is a list of field names.
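A hedged usage sketch, assuming a Pipeline instance named pipeline and masking the "pos" field of Token entries:

    pipeline.add(
        AttributeMasker(),
        config={"requests": {"ft.onto.base_ontology.Token": ["pos"]}},
    )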