Data Augmentation

Data Augmentation Processors

BaseDataAugmentProcessor

class forte.processors.data_augment.base_data_augment_processor.BaseDataAugmentProcessor[source]

The base class of processors that augment data. This processor instantiates replacement ops where specific data augmentation algorithms are implemented. The replacement ops will run the algorithms and the processor will create Forte data structures based on the augmented inputs.

DataAugProcessor

class forte.processors.data_augment.data_aug_processor.DataAugProcessor[source]

This is a base data augmentation processor that instantiates data augmentation ops and applies them to Forte data structures. It can handle augmentation of multiple ontology types simultaneously and copy other existing Forte entries from the source data pack to the augmented data pack, based on the policies specified in the other_entry_policy configuration.

ReplacementDataAugmentProcessor

class forte.processors.data_augment.base_data_augment_processor.ReplacementDataAugmentProcessor[source]

Most of the Data Augmentation(DA) methods can be considered as replacement-based methods with different levels: character, word, sentence or document.

BaseElasticSearchDataSelector

class forte.processors.base.data_selector_for_da.BaseElasticSearchDataSelector(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The base elastic search indexer for data selector. This class creates an ElasticSearchIndexer and searches for documents according to the user-provided search keys. Currently supported search criteria: random-based and query-based. It then yields the corresponding datapacks of the selected documents.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and global resources will be registered into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are json, jsonpickle and pickle. Default is json.

Return type

Dict[str, Any]

RandomDataSelector

class forte.processors.base.data_selector_for_da.RandomDataSelector(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]
classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are json, jsonpickle and pickle. Default is json.

Return type

Dict[str, Any]

QueryDataSelector

class forte.processors.base.data_selector_for_da.QueryDataSelector(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]
classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are json, jsonpickle and pickle. Default is json.

Return type

Dict[str, Any]

UDAIterator

class forte.processors.data_augment.algorithms.UDA.UDAIterator(sup_iterator, unsup_iterator, softmax_temperature=1.0, confidence_threshold=- 1, reduction='mean')[source]

This iterator wraps the Unsupervised Data Augmentation (UDA) algorithm by calculating the unsupervised loss automatically during each iteration. It takes both a supervised and an unsupervised data iterator as input.

The unsupervised data should contain the original input and the augmented input. The original and augmented inputs should be in the same training example.

During each iteration, the iterator will return the supervised and unsupervised batches. Users can call the calculate_uda_loss() to get the UDA loss and combine it with the supervised loss for model training.

It uses tricks such as prediction sharpening and confidence masking. Please refer to the UDA paper for more details. (https://arxiv.org/abs/1904.12848)

Parameters
  • sup_iterator (DataIterator) – The iterator for supervised data. Each item is a training/evaluation/test example with key-value pairs as inputs.

  • unsup_iterator (DataIterator) – The iterator for unsupervised data. Each training example in it should contain both the original and augmented data.

  • softmax_temperature (float) – The softmax temperature for sharpening the distribution. The value should be larger than 0. Defaults to 1.

  • confidence_threshold (float) – The threshold for confidence-masking. It is a threshold of the probability in [0, 1], rather than of the logit. If set to -1, the threshold will be ignored. Defaults to -1.

  • reduction (str) –

    Default: ‘mean’. This is the same as the reduction argument in texar.torch.losses.info_loss.kl_divg_loss_with_logits(). The loss will be a scalar tensor if the reduction is not 'none'. Specifies the reduction to apply to the output:

    • 'none': no reduction will be applied.

    • 'batchmean': the sum of the output will be divided by the batch size.

    • 'sum': the output will be summed.

    • 'mean': the output will be divided by the number of elements in the output.

calculate_uda_loss(logits_orig, logits_aug)[source]

This function calculates the KL divergence between the output probabilities of the original input and the augmented input. The two inputs should have the same shape, and their last dimension should be the probability distribution.

Parameters
  • logits_orig (Tensor) – A tensor containing the logits of the original data.

  • logits_aug (Tensor) – A tensor containing the logits of the augmented data. Must have the same shape as logits_orig.

Return type

Tensor

Returns

The loss, as a pytorch scalar float tensor if the reduction is not 'none', otherwise a tensor with the same shape as logits_orig.
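
Example usage (a minimal sketch): the model, optimizer, criterion, and batch field names below are placeholders rather than part of Forte, and the iterator is assumed to yield (supervised batch, unsupervised batch) pairs as described above.

from forte.processors.data_augment.algorithms.UDA import UDAIterator

uda_iterator = UDAIterator(
    sup_iterator, unsup_iterator,
    softmax_temperature=0.8, confidence_threshold=0.6, reduction="mean")

for sup_batch, unsup_batch in uda_iterator:
    optimizer.zero_grad()

    # Supervised loss on the labeled batch (field names are placeholders).
    sup_logits = model(sup_batch["input_ids"])
    sup_loss = criterion(sup_logits, sup_batch["labels"])

    # Consistency (UDA) loss between original and augmented unlabeled inputs.
    logits_orig = model(unsup_batch["orig_input_ids"])
    logits_aug = model(unsup_batch["aug_input_ids"])
    uda_loss = uda_iterator.calculate_uda_loss(logits_orig, logits_aug)

    (sup_loss + uda_loss).backward()
    optimizer.step()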

Data Augmentation Ops

TextReplacementOp

class forte.processors.data_augment.algorithms.text_replacement_op.TextReplacementOp(configs)[source]

The base class that holds the data augmentation algorithm. The replace() method is left to be implemented by subclasses.

abstract replace(input_anno)[source]

Most data augmentation algorithms can be considered as replacement-based methods on different levels. This function takes in an annotation as input and returns the augmented string.

Parameters

input_anno (Annotation) – the input annotation to be replaced.

Return type

Tuple[bool, str]

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

SingleAnnotationAugmentOp

class forte.processors.data_augment.algorithms.single_annotation_op.SingleAnnotationAugmentOp(configs)[source]

This class extends the BaseDataAugmentationOp to only allow augmentation of one annotation at a time. This operation should be used when we only want to augment one type of annotation in the whole data pack. Thus, to use this operation, the developer only needs to specify how a single annotation will be processed as a part of their augmentation method. The single_annotation_augment() method is left to be implemented by the subclass. This function specifies what type of augmentation a given annotation (of a predefined type) will undergo.

augment(data_pack)[source]

This method is not to be modified when using SingleAnnotationAugmentOp. It applies the augmentation logic specified by the single_annotation_augment() method to each annotation of the specified type individually.

Parameters

data_pack – The input data pack whose annotations (of the configured type) will be augmented.

Return type

bool

Returns

A boolean value indicating if the augmentation was successful (True) or unsuccessful (False).

abstract single_annotation_augment(input_anno)[source]

This function takes in one annotation at a time and performs the desired augmentation on it. The developer needs to specify the logic that will be adopted to process one annotation of a given type. This method cannot implement augmentation logic that takes in multiple annotations of the same type.

Parameters

input_anno (Annotation) – The annotation that needs to be augmented.

Return type

Tuple[bool, str]

Returns

A tuple, where the first element is a boolean value indicating whether the augmentation happens, and the second element is the replaced string.

classmethod default_configs()[source]
Return type

Dict[str, Any]

Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:
  • augment_entry:

    Defines the entry the processor will augment. It should be a fully qualified name of the entry class. Default value is “ft.onto.base_ontology.Token”.
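
For illustration, a minimal sketch of a custom op built on this class is shown below. The class and its behavior (upper-casing each annotation's text) are hypothetical and only demonstrate the single_annotation_augment() contract.

from typing import Tuple

from forte.data.ontology.top import Annotation
from forte.processors.data_augment.algorithms.single_annotation_op import (
    SingleAnnotationAugmentOp,
)


class UpperCaseAugmentOp(SingleAnnotationAugmentOp):
    """Hypothetical example op: upper-cases the text of each annotation."""

    def single_annotation_augment(self, input_anno: Annotation) -> Tuple[bool, str]:
        # Return (True, new_text) to indicate that the augmentation happened.
        return True, input_anno.text.upper()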

DistributionReplacementOp

class forte.processors.data_augment.algorithms.DistributionReplacementOp(configs)[source]

This class is a replacement op to replace the input word with a new word that is sampled by a sampler from a distribution.

single_annotation_augment(input_anno)[source]

This function replaces a word by sampling from a distribution.

Parameters

input_anno (Annotation) – the input annotation.

Return type

Tuple[bool, str]

Returns

A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced word.

cofigure_sampler()[source]

This function sets the sampler that will be used by the distribution replacement op. The sampler will be set according to the configuration values.

Return type

None

classmethod default_configs()[source]
Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:

  • prob:

    The probability of replacing the input; it should fall in [0, 1]. Default value is 0.1.

  • sampler_data:

    A dictionary representing the configurations required to create the required sampler.

    • type:

      The type of sampler to be used (pass the path of the class which defines the required sampler)

    • kwargs:

      This dictionary contains the data that is to be fed to the required sampler. Two possible keys are sampler_data and data_path. If both are passed, the data read from the file pointed to by data_path will be used.

      • sampler_data:

        Raw input to the sampler. This will be passed as the sampler_data config to the required sampler.

      • data_path:

        The path to the file that contains the input that will be given to the sampler. For example, when using UniformSampler, data_path will point to a file (or URL) containing a list of values to be used as sampler_data in UniformSampler.

    {
        "type": "forte.processors.data_augment.algorithms.sampler.UniformSampler",
        "kwargs":{
            "sample": ["apple", "banana", "orange"]
        }
    }
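
For example, the op could be constructed with the sampler configuration shown above; the key names follow the default_configs documented here, and the probability value is illustrative.

from forte.processors.data_augment.algorithms import DistributionReplacementOp

op = DistributionReplacementOp(configs={
    "prob": 0.3,
    "sampler_data": {
        "type": "forte.processors.data_augment.algorithms.sampler.UniformSampler",
        "kwargs": {"sample": ["apple", "banana", "orange"]},
    },
})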
    

Sampler

class forte.processors.data_augment.algorithms.sampler.Sampler(configs)[source]

An abstract sampler class.

UniformSampler

class forte.processors.data_augment.algorithms.sampler.UniformSampler(configs)[source]

A sampler that samples a word from a uniform distribution.

Config Values:

  • sampler_data: a list of words that this sampler uniformly samples from.

UnigramSampler

class forte.processors.data_augment.algorithms.sampler.UnigramSampler(configs)[source]

A sampler that samples a word from a unigram distribution.

Config Values:

  • sampler_data: (dict) The key is a word, the value is the word count or a probability. This sampler samples from this word distribution. Example:

    'sampler_data': {
            "apple": 1,
            "banana": 2,
            "orange": 3
    }
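
A minimal usage sketch: the counts are illustrative, and drawing a word through a sample() method is an assumption about the Sampler interface.

from forte.processors.data_augment.algorithms.sampler import UnigramSampler

sampler = UnigramSampler(configs={
    "sampler_data": {"apple": 1, "banana": 2, "orange": 3},
})
word = sampler.sample()  # e.g. "orange"; sample() is assumed from the Sampler interface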
    

MachineTranslator

class forte.processors.data_augment.algorithms.machine_translator.MachineTranslator(src_lang, tgt_lang, device)[source]

This class is a wrapper for machine translation models.

Parameters
  • src_lang (str) – The source language.

  • tgt_lang (str) – The target language.

  • device (str) – “cuda” for gpu, “cpu” otherwise.

abstract translate(src_text)[source]

This function translates the input text into the target language.

Parameters

src_text (str) – The input text in source language.

Return type

str

Returns

The output text in the target language.

MarianMachineTranslator

class forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator(src_lang='en', tgt_lang='fr', device='cpu')[source]

This class is a wrapper for the Marian Machine Translator (https://huggingface.co/transformers/model_doc/marian.html). Please refer to their doc for supported languages.

translate(src_text)[source]

This function translates the input text into the target language.

Parameters

src_text (str) – The input text in source language.

Return type

str

Returns

The output text in the target language.
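
A small round-trip sketch; the underlying Marian models are downloaded from the Hugging Face hub on first use, and "cuda" can be passed as the device if a GPU is available.

from forte.processors.data_augment.algorithms.machine_translator import (
    MarianMachineTranslator,
)

en_to_fr = MarianMachineTranslator(src_lang="en", tgt_lang="fr", device="cpu")
fr_to_en = MarianMachineTranslator(src_lang="fr", tgt_lang="en", device="cpu")

french = en_to_fr.translate("The weather is nice today.")
round_trip = fr_to_en.translate(french)  # a paraphrase of the original sentence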

BackTranslationOp

class forte.processors.data_augment.algorithms.back_translation_op.BackTranslationOp(configs)[source]

This class is a replacement op using back translation to generate data with the same semantic meanings. The input is translated to another language, then translated back to the original language, with pretrained machine-translation models.

It will sample from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

single_annotation_augment(input_anno)[source]

This function replaces a piece of text with back translation.

Parameters

input_anno (Annotation) – An annotation, could be a word, sentence or document.

Return type

Tuple[bool, str]

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

classmethod default_configs()[source]
Return type

Dict[str, Any]

Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:

  • augment_entry (str):

    This indicates the entry that needs to be augmented. By default, this value is set to ft.onto.base_ontology.Sentence.

  • prob (float):

    The probability of replacement, should fall in [0, 1]. The default value is 0.5.

  • src_language (str):

    The source language of back translation.

  • tgt_language (str):

    The target language of back translation.

  • model_to (str):

    The full qualified name of the model from source language to target language.

  • model_back (str):

    The full qualified name of the model from target language to source language.

  • device (str):

    “cpu” for the CPU or “cuda” for GPU. The default value is “cpu”.
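
An example configuration matching the keys above; the model paths point to the MarianMachineTranslator wrapper described later in this section, and the remaining values are illustrative.

from forte.processors.data_augment.algorithms.back_translation_op import (
    BackTranslationOp,
)

op = BackTranslationOp(configs={
    "augment_entry": "ft.onto.base_ontology.Sentence",
    "prob": 0.5,
    "src_language": "en",
    "tgt_language": "fr",
    "model_to": "forte.processors.data_augment.algorithms."
                "machine_translator.MarianMachineTranslator",
    "model_back": "forte.processors.data_augment.algorithms."
                  "machine_translator.MarianMachineTranslator",
    "device": "cpu",
})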

DictionaryReplacementOp

class forte.processors.data_augment.algorithms.dictionary_replacement_op.DictionaryReplacementOp(configs)[source]

This class is a replacement op utilizing dictionaries, such as WORDNET, to replace the input word with a synonym. Part-of-Speech (optional) can be provided to the wordnet for retrieving synonyms with the same POS. It will sample from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

single_annotation_augment(input_anno)[source]

This function replaces a word with synonyms from a WORDNET dictionary.

Parameters

input_anno (Token) – The input word.

Return type

Tuple[bool, str]

Returns

A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

classmethod default_configs()[source]
Return type

Dict[str, Any]

Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:
  • dictionary (dict):

    The fully qualified name of the dictionary class.

  • prob (float):

    The probability of replacement, should fall in [0, 1]. Default value is 0.1.

  • lang (str):

    The language of the text.

Dictionary

class forte.processors.data_augment.algorithms.dictionary.Dictionary[source]

This class defines a dictionary for word replacement. Given an input word and its pos_tag (optional), the dictionary will output its synonyms, antonyms, hypernyms and hyponyms.

get_synonyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Return type

List[str]

Returns

Synonyms of the word.

get_antonyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Return type

List[str]

Returns

Antonyms of the word.

get_hypernyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Return type

List[str]

Returns

Hypernyms of the word.

get_hyponyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Return type

List[str]

Returns

Hyponyms of the word.

WordnetDictionary

class forte.processors.data_augment.algorithms.dictionary.WordnetDictionary[source]

This class wraps the nltk WORDNET to replace the input word with a synonym/antonym/hypernym/hyponym. Part-of-Speech (optional) can be provided to the wordnet for retrieving words with the same POS.

get_lemmas(word, pos_tag='', lang='eng', lemma_type='SYNONYM')[source]

This function gets synonyms/antonyms/hypernyms/hyponyms from a WORDNET dictionary.

Parameters
  • word (str) – The input token.

  • pos_tag (str) – The NLTK POS tag.

  • lang (str) – The input language.

  • lemma_type (str) –

    The type of words to replace, must be one of the following:

    • 'SYNONYM'

    • 'ANTONYM'

    • 'HYPERNYM'

    • 'HYPONYM'

get_synonyms(word, pos_tag='', lang='eng')[source]

This function retrieves synonyms of a word from a WORDNET dictionary.

Return type

List[str]

get_antonyms(word, pos_tag='', lang='eng')[source]

This function retrieves antonyms of a word from a WORDNET dictionary.

Return type

List[str]

get_hypernyms(word, pos_tag='', lang='eng')[source]

This function retrieves hypernyms of a word from a WORDNET dictionary.

Return type

List[str]

get_hyponyms(word, pos_tag='', lang='eng')[source]

This function retrieves hyponyms of a word from a WORDNET dictionary.

Return type

List[str]
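
A brief usage sketch; the NLTK WordNet corpus must be available (e.g. via nltk.download("wordnet")), and the outputs shown in the comments are illustrative.

from forte.processors.data_augment.algorithms.dictionary import WordnetDictionary

wordnet_dict = WordnetDictionary()
wordnet_dict.get_synonyms("happy", pos_tag="JJ", lang="eng")   # e.g. ["felicitous", "glad", ...]
wordnet_dict.get_antonyms("happy", pos_tag="JJ", lang="eng")   # e.g. ["unhappy"]
wordnet_dict.get_hypernyms("dog", pos_tag="NN", lang="eng")    # e.g. ["canine", "domestic_animal", ...]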

TypoReplacementOp

class forte.processors.data_augment.algorithms.typo_replacement_op.TypoReplacementOp(configs)[source]

This class is a replacement op using a pre-defined spelling mistake dictionary to simulate spelling mistakes.

single_annotation_augment(input_anno)[source]

This function replaces a word with a typo drawn from a typo dictionary.

Parameters

input_anno (Annotation) – The input annotation.

Return type

Tuple[bool, str]

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

classmethod default_configs()[source]
Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:
  • prob (float): The probability of replacement, should fall in [0, 1]. Default value is 0.1.

  • dict_path (str): The URL or the path to the pre-defined typo JSON file. The key is a word we want to replace. The value is a list containing various typos of the corresponding key.

  • typo_generator (str): A generator that takes in a word and outputs the replacement typo.

CharacterFlipOp

class forte.processors.data_augment.algorithms.character_flip_op.CharacterFlipOp(configs)[source]

A uniform generator that randomly flips a character with a similar-looking character from a predefined dictionary imported from https://github.com/facebookresearch/AugLy/blob/main/augly/text/augmenters/utils.py. (For example: "the cat drank milk" -> "t/-/3 c@t d12@nk m!|_1<".)

Parameters
  • string – The input string whose characters need to be replaced.

  • dict_path – The URL or the path to the pre-defined typo JSON file.

  • configs (Union[HParams, Dict[str, Any]]) – prob (float): The probability of replacement, should fall in [0, 1].

single_annotation_augment(input_anno)[source]

Takes in the annotated string and performs the character flip operation on it, randomly augmenting a few characters from it based on the probability value in the configs.

Parameters

input_anno (Annotation) – the input annotation.

Return type

Tuple[bool, str]

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the final augmented string.

classmethod default_configs()[source]
Return type

Dict[str, Any]

Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:
  • dict_path (str):

    The URL or the path to the pre-defined typo JSON file. One example dictionary, used as the default value, is provided at https://raw.githubusercontent.com/ArnavParekhji/temporaryJson/main/character_flip.json.

  • prob (float):

    The probability of replacement. This value should fall in [0, 1]. Default value is 0.1.
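
A usage sketch with the default dictionary URL mentioned above; the probability value is illustrative.

from forte.processors.data_augment.algorithms.character_flip_op import (
    CharacterFlipOp,
)

op = CharacterFlipOp(configs={
    "dict_path": "https://raw.githubusercontent.com/ArnavParekhji/"
                 "temporaryJson/main/character_flip.json",
    "prob": 0.3,
})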

WordSplittingOp

class forte.processors.data_augment.algorithms.word_splitting_op.RandomWordSplitDataAugmentOp(configs)[source]

This class creates an operation to perform Random Word Splitting. It randomly chooses n words in a sentence and splits each word at a random position, where n = alpha * input length; alpha indicates the percent of the words in a sentence that are changed.

augment(data_pack)[source]

This function splits a given word at a random position and replaces the original word with 2 split parts of it.

Return type

bool

classmethod default_configs()[source]
Returns

A dictionary with the default config for this processor.

Additional keys for determining how many words will be split:
  • alpha (float):

    0 <= alpha <= 1. Indicates the percent of the words in a sentence that are changed.

  • augment_entry (str):

    Defines the entry the processor will augment. It should be a fully qualified name of the entry class. For example, “ft.onto.base_ontology.Sentence”.

BaseDataAugmentationOp

class forte.processors.data_augment.algorithms.base_data_augmentation_op.BaseDataAugmentationOp(configs)[source]

This is the most basic augmentation Op; it gives users the most freedom to implement their own augmentation logic. Users are expected to use the provided utility functions to implement that logic, which will then be instantiated into new data packs. This Op requires a relatively stronger understanding of Forte’s internal setup.

perform_augmentation(input_pack)[source]

Function to apply the defined augmentation and instantiate the result into a new data pack. This data pack is then returned.

Parameters

input_pack – The data pack holding the annotations to be replaced.

Return type

DataPack

Returns

A new data pack holding the text after replacement.

modify_index(index, old_spans, new_spans, is_begin, is_inclusive)[source]

A helper function to map an index before replacement to the index after replacement. An index is the character offset in the data pack. The old_spans are the inputs of replacement, and the new_spans are the outputs. Each span has a start and end index. The old_spans and new_spans are anchors for the mapping, because we depend on them to determine the position change of the index. Given an index, the function will find the nearest old span before the index and calculate the difference between the position of the old span and its corresponding new span. The position change is then applied to the input index, and the updated index is returned.

An inserted span might be included as a part of another span. For example, given a sentence “I love NLP.”, if we insert a token “Yeah” at the beginning of the sentence (index=0), the Sentence should include the new Token, i.e., the Sentence will have a start index equal to 0. In this case, the parameter is_inclusive should be True. However, for another Token “I”, it should not include the new token, so its start index will be larger than 0. In that case, is_inclusive should be False.

The input index could be the start or end index of a span, i.e., the left or right boundary of the span. If there is an insertion in the span, we should treat the two boundaries in different ways. For example, we have a paragraph with two sentences: “I love NLP! You love NLP too.” If we append another “!” to the end of the first sentence, when modifying the end index of the first Sentence, it should be pushed right to include the extra exclamation mark. In this case, is_begin is False. However, if we prepend an “And” to the second sentence, when modifying the start index of the second Sentence, it should be pushed left to include the new Token. In this case, is_begin is True.

Parameters
  • index (int) – The index to map.

  • old_spans (List[Span]) – The spans before replacement. It should be a sorted list in ascending order.

  • new_spans (List[Span]) – The spans after replacement. It should be a sorted list in ascending order.

  • is_begin (bool) – True if the input index is the start index of a span.

  • is_inclusive (bool) – True if the span constructed by the aligned index should include inserted spans.

Return type

int

Returns

The aligned index.

If the old spans are [0, 1], [2, 3], [4, 6], the new spans are [0, 4], [5, 7], [8, 11], the input index is 3, and there are no insertions, the algorithm will first locate the last span with a begin index less than or equal to the target index ([2, 3]), and find the corresponding span in the new spans ([5, 7]). Then we calculate the delta index (7 - 3 = 4) and update our input index (3 + 4 = 7). The output is then 7.

Note that when the input index falls inside one of the old spans, instead of on a boundary, we compute the return index so that it keeps the same offset from the beginning of the span it belongs to. In the above example, if we change the input index from 3 to 5, the output becomes 9, because we locate the input index in the third span [4, 6] and use the same offset 5 - 4 = 1 to compute the output 8 + 1 = 9.

When insertion is considered, there will be spans with the same begin index, for example, [0, 1], [1, 1], [1, 2]. The span [1, 1] indicates an insertion at index 1, because the insertion can be considered as a replacement of an empty input span with a length of 0. The output will be affected by whether to include the inserted span (is_inclusive), and by whether the input index is a begin or end index of its span (is_begin).

If the old spans are [0, 1], [1, 1], [1, 2], the new spans are [0, 2], [2, 4], [4, 5], and the input index is 1, the output will be 2 if both is_inclusive and is_begin are True, because the inserted [1, 1] should be included in the span. If is_inclusive=True but is_begin=False, the output will be 4 because the index is an end index of the span.
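
The worked examples above can be written as calls like the following sketch, where op is assumed to be an instance of a concrete subclass of this Op and Span comes from forte.data.span.

from forte.data.span import Span

old_spans = [Span(0, 1), Span(2, 3), Span(4, 6)]
new_spans = [Span(0, 4), Span(5, 7), Span(8, 11)]

op.modify_index(3, old_spans, new_spans, is_begin=False, is_inclusive=False)  # -> 7
op.modify_index(5, old_spans, new_spans, is_begin=False, is_inclusive=False)  # -> 9

# With an insertion at index 1 (the zero-length old span [1, 1]):
old_spans = [Span(0, 1), Span(1, 1), Span(1, 2)]
new_spans = [Span(0, 2), Span(2, 4), Span(4, 5)]

op.modify_index(1, old_spans, new_spans, is_begin=True, is_inclusive=True)   # -> 2
op.modify_index(1, old_spans, new_spans, is_begin=False, is_inclusive=True)  # -> 4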

insert_span(inserted_text, data_pack, pos)[source]

This is a utility function to insert a new text span into a data pack. The inserted span will not have any annotation associated with it. After getting the inserted text, it will register the input & output for the later batch process of building the new data pack. The insertion at each position can only occur once. If there is already an insertion at the current position, it will abort the insertion and return False.

Parameters
  • inserted_text (str) – The text string to insert.

  • data_pack (DataPack) – The datapack for insertion.

  • pos (int) – The position(index) of insertion.

Return type

bool

Returns

A bool value. True if the insertion happened, False otherwise.

insert_annotated_span(inserted_text, data_pack, pos, annotation_type)[source]

This is a utility function to insert a new annotation into a data pack. After getting the inserted text, it will register the input & output for the later batch process of building the new data pack. The insertion at each position can only occur once. If there is already an insertion at the current position, it will abort the insertion and return False.

Parameters
  • inserted_text (str) – The text string to insert.

  • data_pack (DataPack) – The datapack for insertion.

  • pos (int) – The position(index) of insertion.

  • annotation_type (str) – The type of annotation this span represents.

Return type

bool

Returns

A bool value. True if the insertion happened, False otherwise.

delete_annotation(input_anno)[source]

This is a utility function to delete an annotation. If an attempt is made to delete the same annotation twice, the function will terminate and return False.

Parameters

input_anno (Annotation) – The annotation to remove.

Return type

bool

Returns

A bool value. True if the deletion happened, False otherwise.

delete_span(data_pack, begin, end)[source]

This is a utility function to delete a span of text. If an attempt is made to delete the same span twice, the function will terminate and return False. If the method deletes only a portion of an existing annotation, the annotation will be calibrated to represent the remaining part of the span. Moreover, if the deleted span covers an entire annotation, that annotation will be deleted.

Parameters
  • data_pack (DataPack) – The data pack from which the span of text will be deleted.

  • begin (int) – The starting position of the span to delete.

  • end (int) – The ending position of the span to delete.

Return type

bool

Returns

A bool value. True if the deletion happened, False otherwise.

replace_annotations(replacement_anno, replaced_text)[source]

This is a utility function to record a replacement of the text in an annotation. With this function, the text inside an annotation can be replaced with another text. If an attempt is made to replace the same annotation twice, the function will terminate and return False.

Parameters

  • replacement_anno (Annotation) – The annotation whose text will be replaced.

  • replaced_text (str) – The new text to put in place of the annotation's original text.

Return type

bool

Returns

A bool value. True if the replacement happened, False otherwise.

clear_states()[source]

This function clears the states. It should be called after processing a multipack.

get_maps()[source]

This function simply returns the produced data pack and entry maps after augmentation.

Return type

Tuple[Dict[int, int], Dict[int, Dict[int, int]]]

Returns

A tuple of two elements. The first element is the data pack map (dict) and the second element is the entry maps (dict)

abstract augment(data_pack)[source]

This method is left to be implemented by the user of this Op. The user can use any of the given utility functions to perform augmentation.

Parameters

data_pack (DataPack) – the input data pack to augment

Return type

bool

Returns

A boolean value indicating if the augmentation was successful (True) or unsuccessful (False).

classmethod default_configs()[source]
Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:
  • other_entry_policy:

    A dict specifying the policies for other entries. The key should be a fully qualified class name. The policy (the value of the dict) specifies how to process the corresponding entries after replacement.

    If the policy is “auto_align”, the span of the entry will be automatically modified according to its original location. However, some spans might become invalid after the augmentation, for example, the tokens within a replaced sentence may disappear.

    Annotations not in the “other_entry_policy” will not be copied to the new data pack. The Links and Groups will be copied as well if the annotations they are attached to are copied. Example:

    'other_entry_policy': {
        "ft.onto.base_ontology.Document": "auto_align",
        "ft.onto.base_ontology.Sentence": "auto_align",
    }
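
For illustration, the following minimal sketch (the class is hypothetical, not part of Forte) implements augment() with the replace_annotations() utility to reverse the text of every Sentence, relying on the default other_entry_policy for everything else.

from forte.data.data_pack import DataPack
from forte.processors.data_augment.algorithms.base_data_augmentation_op import (
    BaseDataAugmentationOp,
)
from ft.onto.base_ontology import Sentence


class ReverseSentenceOp(BaseDataAugmentationOp):
    """Hypothetical example op: reverses the text of each Sentence."""

    def augment(self, data_pack: DataPack) -> bool:
        success = True
        for sentence in data_pack.get(Sentence):
            # replace_annotations() returns False if this annotation was
            # already replaced; propagate that as an overall failure.
            success = self.replace_annotations(sentence, sentence.text[::-1]) and success
        return success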
    

EmbeddingSimilarityReplacementOp

class forte.processors.data_augment.algorithms.embedding_similarity_replacement_op.EmbeddingSimilarityReplacementOp(configs)[source]

This class is a replacement op leveraging pre-trained word embeddings, such as word2vec and glove, to replace the input word with another word with similar word embedding. By default, the replacement word is randomly chosen from the top k words with the most similar embeddings.

single_annotation_augment(input_anno)[source]

This function replaces a word with another word whose pretrained embedding is similar.

Parameters

input_anno (Annotation) – The input annotation.

Return type

Tuple[bool, str]

Returns

A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced word.

classmethod default_configs()[source]
Return type

Dict[str, Any]

Returns

A dictionary with the default config for this processor.

Following are the keys for this dictionary:
  • vocab_path (str):

    The absolute path to the vocabulary file for the pretrained embeddings.

  • embed_hparams (dict):

    The hyper-parameters to initialize the texar.torch.data.Embedding object.

  • top_k (int):

    The number of most similar words to choose from.
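
An example configuration following the keys above; the file paths and hyper-parameter values are illustrative and must match your own embedding files and the hparams expected by texar.torch.data.Embedding.

embedding_similarity_config = {
    "vocab_path": "/path/to/glove.vocab.txt",    # one token per line
    "embed_hparams": {
        "dim": 50,                               # embedding dimension
        "file": "/path/to/glove.6B.50d.txt",     # pretrained embedding file
    },
    "top_k": 5,                                  # sample the replacement from the 5 nearest words
}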

UniformTypoGenerator

class forte.processors.data_augment.algorithms.typo_replacement_op.UniformTypoGenerator(dict_path)[source]

A uniform generator that generates a typo from a typo dictionary.

Parameters
  • word – The input word that needs to be replaced.

  • dict_path (str) –

    The URL or the path to the pre-defined typo JSON file. The key is a word we want to replace. The value is a list containing various typos of the corresponding key.

    {
        "apparent": ["aparent", "apparant"],
        "bankruptcy": ["bankrupcy", "banruptcy"],
        "barbecue": ["barbeque"]
    }
    

RandomSwapDataAugmentOp

class forte.processors.data_augment.algorithms.eda_ops.RandomSwapDataAugmentOp(configs)[source]

Data augmentation operation for the Random Swap operation. Randomly choose two words in the sentence and swap their positions. Do this n times, where n = alpha * input length.

augment(data_pack)[source]

This method is left to be implemented by the user of this Op. The user can use any of the given utility functions to perform augmentation.

Parameters

data_pack (DataPack) – the input data pack to augment

Return type

bool

Returns

A boolean value indicating if the augmentation was successful (True) or unsuccessful (False).

classmethod default_configs()[source]

Additional keys for Random Swap:

  • augment_entry (str):

    Defines the entry the processor will augment. It should be a fully qualified name of the entry class. For example, “ft.onto.base_ontology.Sentence”.

  • alpha:

    0 <= alpha <= 1. Indicates the percent of the words in a sentence that are changed. The processor will perform the Random Swap operation (input length * alpha) times. Default value is 0.1.

Returns

A dictionary with the default config for this processor.

RandomInsertionDataAugmentOp

class forte.processors.data_augment.algorithms.eda_ops.RandomInsertionDataAugmentOp(configs)[source]

Data augmentation operation for the Random Insertion operation. Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times, where n = alpha * input length.

augment(data_pack)[source]

This method is left to be implemented by the user of this Op. The user can use any of the given utility functions to perform augmentation.

Parameters

data_pack (DataPack) – the input data pack to augment

Return type

bool

Returns

A boolean value indicating if the augmentation was successful (True) or unsuccessful (False).

classmethod default_configs()[source]

Additional keys for Random Insertion:

  • augment_entry (str):

    Defines the entry the processor will augment. It should be a fully qualified name of the entry class. For example, “ft.onto.base_ontology.Sentence”.

  • alpha:

    0 <= alpha <= 1. Indicates the percent of the words in a sentence that are changed. The processor will perform the Random Insertion operation (input length * alpha) times. Default value is 0.1.

  • stopwords:

    A list of stopwords for the language.

  • insertion_op_config:

    A dictionary representing the configuration of the operation used to take random annotations from the source data pack, augment them based on specified rules, and insert them in random positions.

    • type:

      The type of data augmentation operation to be used (pass the path of the class which defines the required operation)

    • kwargs:

      This dictionary contains the data that is to be fed to the required operation (Make sure to be well versed with the required configurations of the operation that is defined in the type config).

    {
        "type": "forte.processors.data_augment.algorithms."
        "dictionary_replacement_op.DictionaryReplacementOp",
        "kwargs":{
            "dictionary_class": (
                "forte.processors.data_augment."
                "algorithms.dictionary.WordnetDictionary"
            ),
            "prob": 1.0,
            "lang": "eng",
        }
    }
    
Returns

A dictionary with the default config for this processor. By default, we use Dictionary Replacement with Wordnet to get synonyms to insert.

RandomDeletionDataAugmentOp

class forte.processors.data_augment.algorithms.eda_ops.RandomDeletionDataAugmentOp(configs)[source]

Data augmentation operation for the Random Deletion operation. Randomly remove each word in the sentence with probability alpha.

augment(data_pack)[source]

This method is left to be implemented by the user of this Op. The user can use any of the given utility functions to perform augmentation.

Parameters

data_pack (DataPack) – the input data pack to augment

Return type

bool

Returns

A boolean value indicating if the augmentation was successful (True) or unsuccessful (False).

classmethod default_configs()[source]
Returns

A dictionary with the default config for this processor. Additional keys for Random Deletion:

  • augment_entry (str):

    Defines the entry the processor will augment. It should be a fully qualified name of the entry class. For example, “ft.onto.base_ontology.Sentence”.

  • alpha:

    0 <= alpha <= 1. The probability to delete each word. Default value is 0.1.

Data Augmentation Models

Reinforcement Learning

class forte.models.da_rl.aug_wrapper.MetaAugmentationWrapper(augmentation_model, augmentation_optimizer, input_mask_ids, device, num_aug)[source]

A wrapper adding data augmentation to a Bert model with arbitrary tasks. This is used to perform reinforcement learning for joint data augmentation learning and model training.

See: https://arxiv.org/pdf/1910.12795.pdf

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/augmentation/generator.py

Let \(\theta\) be the parameters of the downstream (classifier) model. Let \(\phi\) be the parameters of the augmentation model. Equations to update \(\phi\):

\[
\begin{aligned}
\theta'(\phi) &= \theta - \nabla_{\theta} L_{train}(\theta, \phi) \\
\phi &= \phi - \nabla_{\phi} L_{val}(\theta'(\phi))
\end{aligned}
\]
Parameters
  • augmentation_model (Module) – A Bert-based model for data augmentation. E.g. BertForMaskedLM. Model requirement: masked language modeling; the output logits of this model are of shape [batch_size, seq_length, token_size].

  • augmentation_optimizer (Optimizer) – An optimizer that is associated with augmentation_model. E.g. Adam optimizer.

  • input_mask_ids (int) – Bert token id of ‘[MASK]’. This is used to randomly mask out tokens from the input sentence during training.

  • device (device) – The CUDA device to run the model on.

  • num_aug (int) – The number of samples from the augmentation model for every augmented training instance.

Example usage:

aug_wrapper = MetaAugmentationWrapper(
    aug_model, aug_optim, mask_id, device, num_aug)
for batch in training_data:
    # Train augmentation model params.
    aug_wrapper.reset_model()
    for instance in batch:
        # Augmented example with params phi exposed
        aug_instance_features = \
            aug_wrapper.augment_instance(instance_features)
        # Model is the downstream Bert model.
        model.zero_grad()
        loss = model(aug_instance_features)
        meta_model = MetaModule(model)
        meta_model = aug_wrapper.update_meta_model(
            meta_model, loss, model, optim)

        # Compute gradient of the augmentation model on validation data.
        for val_batch in validation_data:
            val_loss = meta_model(val_batch_features)
            val_loss = val_loss / num_training_instance / num_aug \
                / num_val_batch
            val_loss.backward()
    # update augmentation model params.
    aug_wrapper.update_phi()

    # train classifier with augmented batch
    aug_batch_features = aug_wrapper.augment_batch(batch_features)
    optim.zero_grad()
    loss = model(aug_batch_features)
    loss.backward()
    optim.step()
augment_instance(features)[source]

Augment a training instance.

Parameters

features (Tuple[Tensor, …]) –

A tuple of Bert features of one training instance. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [seq_len].

input_mask is a tensor of shape [seq_len], with 1 indicating an unmasked token and 0 a masked one.

segment_ids is a tensor of shape [seq_len]. label_ids is a tensor of shape [seq_len].

Return type

Tuple[Tensor, …]

Returns

A tuple of Bert features of augmented training instances. (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).

input_probs_aug is a tensor of soft Bert embeddings, distributions over vocabulary. It has shape [num_aug, seq_len, token_size]. It keeps \(\phi\) as variable so that after passing it as an input to the classifier, the gradients of \(\theta\) will also apply to \(\phi\).

input_mask_aug is a tensor of shape [num_aug, seq_len]; it repeats the input input_mask num_aug times so that it corresponds to the mask of each token in input_probs_aug.

segment_ids_aug is a tensor of shape [num_aug, seq_len]; it repeats the input segment_ids num_aug times so that it corresponds to the token type of each token in input_probs_aug.

label_ids_aug is a tensor of shape [num_aug, seq_len]; it repeats the input label_ids num_aug times so that it corresponds to the label of each token in input_probs_aug.

augment_batch(batch_features)[source]

Augment a batch of training instances. Append augmented instances to the input instances.

Parameters

batch_features (Tuple[Tensor, …]) –

A tuple of Bert features of a batch training instances. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [batch_size, seq_len].

input_mask, segment_ids, label_ids are all tensors of shape [batch_size, seq_len].

Return type

Tuple[Tensor, …]

Returns

A tuple of Bert features of augmented training instances. (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).

input_probs_aug is a tensor of soft Bert embeddings. It has shape [batch_size * 2, seq_len, token_size].

input_mask_aug is a tensor of shape [batch_size * 2, seq_len], it concatenates two input input_mask, the first one corresponds to the mask of the tokens in the original bert instance, the second one corresponds to the mask of the augmented bert instance.

segment_ids_aug is a tensor of shape [batch_size * 2, seq_len], it concatenates two input segment_ids, the first one corresponds to the segment id of the tokens in the original bert instance, the second one corresponds to the segment id of the augmented bert instance.

label_ids_aug is a tensor of shape [batch_size * 2, seq_len], it concatenates two input label_ids, the first one corresponds to the labels of the original bert instance, the second one corresponds to the labels of the augmented bert instance.

eval_batch(batch_features)[source]

Evaluate a batch of training instances.

Parameters

batch_features (Tuple[Tensor, …]) –

A tuple of Bert features of a batch training instances. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [batch_size, seq_len].

input_mask, segment_ids, label_ids are all tensors of shape [batch_size, seq_len].

Return type

FloatTensor

Returns

The masked language modeling loss of one evaluation batch. It is a torch.FloatTensor of shape [1,].

update_meta_model(meta_model, loss, model, optimizer)[source]

Update the parameters within the MetaModel according to the downstream model loss.

MetaModel is used to calculate \(\nabla_{\phi} L_{val}(\theta'(\phi))\), where it needs gradients applied to \(\phi\).

Parameter updates are performed in this function, and the gradient change is later applied to \(\theta\) and \(\phi\) using validation data.

Parameters
  • meta_model (MetaModule) – A meta model whose parameters will be updated in-place by the deltas calculated from the input loss.

  • loss (Tensor) – The loss of the downstream model that has taken the augmented training instances as input.

  • model (Module) – The downstream Bert model.

  • optimizer (Optimizer) – The optimizer that is associated with the model.

Return type

MetaModule

Returns

The same input meta_model with the updated parameters.

class forte.models.da_rl.magic_model.MetaModule(module)[source]

A class extending torch.nn.ModuleList that registers the parameters of a torch.nn.Module and performs memory-efficient parameter updates locally.

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/magic_module.py

It implements the calculation: \(L(\theta - \nabla_{\theta} L_{train}(\theta, \phi))\).

Parameters

module (Module) – A torch.nn.Module.

This class can be used for a simple input module whose sub-modules do not rely on helper functions or attributes outside this class to perform their forward().

Otherwise, since forward() calls the input module’s forward(), in order to perform forward() of the sub-modules of the input module correctly, this class needs to extend those sub-modules that define the methods needed for their forward(), so that it inherits their methods to perform the sub-module’s forward().

For example, if the input module is BERTClassifier, _get_noise_shape(), _split_heads(), _combine_heads() from its sub-modules (E.g. BERTEncoder) are needed to be exposed in this class to perform their forward(). Please refer to TexarBertMetaModule for instructions on creating a subclass from this one for a specific input module.

forward(*args, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class forte.models.da_rl.magic_model.TexarBertMetaModule(module)[source]

A subclass that extends MetaModule to do parameter updates locally for texar-pytorch Bert related modules. E.g. texar.torch.modules.BERTClassifier

Please refer to its base class MetaModule for more details.

Parameters

module (Module) – A torch.nn.Module.

This class extends EmbedderBase and MultiheadAttentionEncoder, such that it inherits the methods needed to perform forward() of the modules that utilize these methods, e.g. BERTEncoder.

Some notes of the order of the base classes that this class extends:

MetaModule should be the first one, so that its forward() will call MetaModule.forward() instead of the forward() of the other base classes, such as texar.torch.modules.MultiheadAttentionEncoder.forward(). If MetaModule is not the first one, then a forward() should be defined in this class, such that it is called correctly.

Example

def forward(self, *args, **kwargs):
    return MetaModule.forward(self, *args, **kwargs)