# Data Augmentation¶

## Data Augmentation Processors¶

### ReplacementDataAugmentProcessor¶

class forte.processors.data_augment.base_data_augment_processor.ReplacementDataAugmentProcessor[source]

Most Data Augmentation (DA) methods can be considered replacement-based methods operating at different levels: character, word, sentence, or document.

### DataSelector¶

class forte.processors.base.data_selector_for_da.BaseElasticSearchDataSelector(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The base elastic search indexer for data selector. This class creates an ElasticSearchIndexer and searches for documents according to the user-provided search keys. Currently supported search criteria: random-based and query-based. It then yields the corresponding datapacks of the selected documents.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and global resources will be registered into resources. The implementation should set up the states of the component.

Parameters
• resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

• configs (Config) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values, used to fill in the missing values of input configs during pipeline construction.

Here:
• zip_pack (bool): whether to zip the results. The default value is False.

• serialize_method (str): The method used to serialize the data. Currently available options are “jsonpickle” and “pickle”. Default is “jsonpickle”.
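As a rough sketch of this merging behavior, missing keys in the user configs fall back to the defaults. The merge_configs helper below is a hypothetical illustration, not Forte's actual implementation:

```python
# Illustrative sketch of how default_configs values fill in missing keys
# of a user-supplied config dict. merge_configs is a hypothetical helper,
# not Forte's actual implementation.

def merge_configs(user_configs, default_configs):
    """Start from the defaults and overwrite them with user-provided keys."""
    merged = dict(default_configs)
    merged.update(user_configs)
    return merged

defaults = {"zip_pack": False, "serialize_method": "jsonpickle"}
configs = merge_configs({"zip_pack": True}, defaults)
# zip_pack comes from the user; serialize_method falls back to the default.
```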

### UDAIterator¶

class forte.processors.data_augment.algorithms.UDA.UDAIterator(sup_iterator, unsup_iterator, softmax_temperature=1.0, confidence_threshold=-1, reduction='mean')[source]

This iterator wraps the Unsupervised Data Augmentation (UDA) algorithm by calculating the unsupervised loss automatically during each iteration. It takes both a supervised and an unsupervised data iterator as input.

The unsupervised data should contain the original input and the augmented input. The original and augmented inputs should be in the same training example.

During each iteration, the iterator will return the supervised and unsupervised batches. Users can call the calculate_uda_loss() to get the UDA loss and combine it with the supervised loss for model training.

It uses tricks such as prediction sharpening and confidence masking. Please refer to the UDA paper for more details. (https://arxiv.org/abs/1904.12848)

Parameters
• sup_iterator – The iterator for supervised data. Each item is a training/evaluation/test example with key-value pairs as inputs.

• unsup_iterator – The iterator for unsupervised data. Each training example in it should contain both the original and augmented data.

• softmax_temperature – The softmax temperature for sharpening the distribution. The value should be larger than 0. Defaults to 1.

• confidence_threshold – The threshold for confidence-masking. It is a threshold of the probability in [0, 1], rather than of the logit. If set to -1, the threshold will be ignored. Defaults to -1.

• reduction

Default: ‘mean’. This is the same as the reduction argument in texar.torch.losses.info_loss.kl_divg_loss_with_logits(). The loss will be a scalar tensor if the reduction is not 'none'. Specifies the reduction to apply to the output:

• 'none': no reduction will be applied.

• 'batchmean': the sum of the output will be divided by the batch size.

• 'sum': the output will be summed.

• 'mean': the output will be divided by the number of elements in the output.

calculate_uda_loss(logits_orig, logits_aug)[source]

This function calculates the KL divergence between the output probabilities of the original input and the augmented input. The two inputs should have the same shape, and their last dimension should be the probability distribution.

Parameters
• logits_orig – A tensor containing the logits of the original data.

• logits_aug – A tensor containing the logits of the augmented data. Must have the same shape as logits_orig.

Returns

The loss, as a PyTorch scalar float tensor if the reduction is not 'none', otherwise a tensor with the same shape as logits_orig.
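The loss computation can be sketched without texar, using plain Python on a single example's logits. The uda_loss function below is an illustrative reimplementation of the sharpening, confidence-masking, and KL steps, not the library's code:

```python
import math

# An illustrative, dependency-free sketch of the UDA unsupervised loss for
# one example: KL divergence between the sharpened original prediction and
# the augmented prediction, with optional confidence masking. The real
# implementation relies on texar.torch's KL-divergence loss over tensors.

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def uda_loss(logits_orig, logits_aug,
             softmax_temperature=1.0, confidence_threshold=-1):
    # Sharpen the original prediction with a low softmax temperature.
    p = softmax(logits_orig, softmax_temperature)
    q = softmax(logits_aug)
    # Confidence masking: drop examples whose max probability is too low.
    if confidence_threshold != -1 and max(p) < confidence_threshold:
        return 0.0
    # KL(p || q): penalize the augmented prediction for drifting from p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

loss = uda_loss([2.0, 0.5, 0.1], [1.5, 0.8, 0.2], softmax_temperature=0.4)
```

A temperature below 1 sharpens the target distribution p, and a threshold of -1 disables confidence masking, matching the parameter defaults above.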

## Data Augmentation Ops¶

### TextReplacementOp¶

class forte.processors.data_augment.algorithms.text_replacement_op.TextReplacementOp(configs)[source]

The base class that holds the data augmentation algorithm. The replace() method is left to be implemented by subclasses.

abstract replace(input_anno)[source]

Most data augmentation algorithms can be considered replacement-based methods operating at different levels. This function takes an annotation as input and returns the augmented string.

Parameters

input_anno – the input annotation to be replaced.

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.
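The subclassing pattern can be sketched with a standalone toy that mirrors the (replaced, new_text) return contract. The class names here are invented, and Forte's actual replace() receives an Annotation rather than a raw string:

```python
from abc import ABC, abstractmethod

# A standalone toy mirroring the replacement-op interface. ToyReplacementOp
# and LowercaseOp are invented names; Forte's replace() takes an Annotation.

class ToyReplacementOp(ABC):
    def __init__(self, configs):
        self.configs = configs

    @abstractmethod
    def replace(self, text):
        """Return a tuple (replaced: bool, new_text: str)."""

class LowercaseOp(ToyReplacementOp):
    """Replaces the input with its lowercased form."""

    def replace(self, text):
        new_text = text.lower()
        return new_text != text, new_text

op = LowercaseOp(configs={})
replaced, out = op.replace("Hello World")
# replaced is True, out is "hello world"
```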

### DistributionReplacementOp¶

class forte.processors.data_augment.algorithms.DistributionReplacementOp(configs)[source]

This class is a replacement op to replace the input word with a new word that is sampled by a sampler from a distribution.

Config Values:
• prob:

The probability of replacing the input; it should fall in [0, 1].

• sampler_data:

A dictionary representing the configurations required to create the required sampler.

type:

The type of sampler to be used (pass the path of the class which defines the required sampler)

kwargs:

This dictionary contains the data that is to be fed to the required sampler. Two possible values are sampler_data and data_path. If both parameters are passed, the data read from the file pointed to by data_path will be considered.

• sampler_data:

Raw input to the sampler. This will be passed as the sampler_data config to the required sampler.

• data_path:

The path to the file that contains the input that will be given to the sampler. For example, when using UniformSampler, data_path will point to a file (or URL) containing a list of values to be used as sampler_data in UniformSampler.

```json
{
    "type": "forte.processors.data_augment.algorithms.sampler.UniformSampler",
    "kwargs": {
        "sample": ["apple", "banana", "orange"]
    }
}
```

replace(input_anno)[source]

This function replaces a word by sampling from a distribution.

Parameters

input_anno (Annotation) – The input annotation.

Returns

A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced word.

cofigure_sampler()[source]

This function sets the sampler that will be used by the distribution replacement op. The sampler is configured according to the configuration values.

### Sampler¶

class forte.processors.data_augment.algorithms.sampler.UniformSampler(configs)[source]

A sampler that samples a word from a uniform distribution.

Config Values:
• sampler_data: (list)

A list of words that this sampler uniformly samples from.

class forte.processors.data_augment.algorithms.sampler.UnigramSampler(configs)[source]

A sampler that samples a word from a unigram distribution.

Config Values:
• sampler_data: (dict)

The key is a word, the value is the word count or a probability. This sampler samples from this word distribution. Example:

```python
'sampler_data': {
    "apple": 1,
    "banana": 2,
    "orange": 3
}
```
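Both samplers can be sketched with the standard library's random module. These toy functions show how the sampler_data configs above are consumed, but they are illustrative helpers, not Forte's actual classes:

```python
import random

# Toy versions of UniformSampler and UnigramSampler, showing how the
# sampler_data configs are consumed. Illustrative only, not Forte's code.

def uniform_sample(sampler_data, rng):
    # sampler_data: a list of words, each equally likely.
    return rng.choice(sampler_data)

def unigram_sample(sampler_data, rng):
    # sampler_data: {word: count or probability}; weights need not sum to 1.
    words = list(sampler_data)
    weights = [sampler_data[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)
w1 = uniform_sample(["apple", "banana", "orange"], rng)
w2 = unigram_sample({"apple": 1, "banana": 2, "orange": 3}, rng)
```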


### MachineTranslator¶

class forte.processors.data_augment.algorithms.machine_translator.MachineTranslator(src_lang, tgt_lang, device)[source]

This class is a wrapper for machine translation models.

Parameters
• src_lang – The source language.

• tgt_lang – The target language.

• device – “cuda” for GPU, “cpu” otherwise.

abstract translate(src_text)[source]

This function translates the input text into the target language.

Parameters

src_text (str) – The input text in the source language.

Returns

The output text in the target language.

### MarianMachineTranslator¶

class forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator(src_lang='en', tgt_lang='fr', device='cpu')[source]

This class is a wrapper for the Marian Machine Translator (https://huggingface.co/transformers/model_doc/marian.html). Please refer to their doc for supported languages.

translate(src_text)[source]

This function translates the input text into the target language.

Parameters

src_text (str) – The input text in the source language.

Returns

The output text in the target language.

### BackTranslationOp¶

class forte.processors.data_augment.algorithms.back_translation_op.BackTranslationOp(configs)[source]

This class is a replacement op using back translation to generate data with the same semantic meanings. The input is translated to another language, then translated back to the original language, with pretrained machine-translation models.

It will sample from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

The configuration should have the following fields:

Config Values:
• prob (float): The probability of replacement, should fall in [0, 1].

• src_lang (str): The source language of back translation.

• tgt_lang (str): The target language of back translation.

• model_to (str): The fully qualified name of the model from the source language to the target language.

• model_back (str): The fully qualified name of the model from the target language to the source language.

• device (str): “cpu” for the CPU or “cuda” for GPU.
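The replacement logic can be sketched with stub translators standing in for the model_to/model_back models. The phrase table below is invented purely for illustration:

```python
import random

# A sketch of the back-translation decision logic, with stub lookup-table
# "translators" standing in for the pretrained model_to / model_back
# models. The phrase table is invented purely for illustration.

TO_FR = {"the weather is nice": "il fait beau"}      # stub en -> fr model
BACK_EN = {"il fait beau": "the weather is good"}    # stub fr -> en model

def back_translate(text, prob, rng):
    # Bernoulli draw: with probability `prob`, replace via round-trip MT.
    if rng.random() >= prob:
        return False, text
    pivot = TO_FR.get(text, text)        # translate to the target language
    result = BACK_EN.get(pivot, pivot)   # translate back to the source
    return True, result

rng = random.Random(0)
replaced, out = back_translate("the weather is nice", prob=1.0, rng=rng)
# replaced is True; out is the paraphrase "the weather is good"
```

The round trip through another language is what produces a paraphrase with the same semantic meaning; with real MT models the pivot step generalizes beyond a fixed table.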

replace(input_anno)[source]

This function replaces a piece of text with back translation.

Parameters

input_anno (Entry) – An annotation, could be a word, sentence or document.

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

### DictionaryReplacementOp¶

class forte.processors.data_augment.algorithms.dictionary_replacement_op.DictionaryReplacementOp(configs)[source]

This class is a replacement op utilizing dictionaries, such as WORDNET, to replace the input word with a synonym. A Part-of-Speech tag (optional) can be provided to the wordnet for retrieving synonyms with the same POS. It will sample from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

The config should contain the following fields:
• dictionary: The fully qualified name of the dictionary class.

• prob: The probability of replacement, should fall in [0, 1].

• lang: The language of the text.

replace(input_anno)[source]

This function replaces a word with synonyms from a WORDNET dictionary.

Parameters

input_anno (Token) – The input word.

Returns

A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

### Dictionary¶

class forte.processors.data_augment.algorithms.dictionary.Dictionary[source]

This class defines a dictionary for word replacement. Given an input word and its pos_tag (optional), the dictionary will output its synonyms, antonyms, hypernyms and hyponyms.

get_synonyms(word, pos_tag='', lang='eng')[source]
Parameters
• word (str) – The input string.

• pos_tag (str) – The Part-of-Speech tag for substitution.

• lang (str) – The language of the input string.

Returns

Synonyms of the word.

get_antonyms(word, pos_tag='', lang='eng')[source]
Parameters
• word (str) – The input string.

• pos_tag (str) – The Part-of-Speech tag for substitution.

• lang (str) – The language of the input string.

Returns

Antonyms of the word.

get_hypernyms(word, pos_tag='', lang='eng')[source]
Parameters
• word (str) – The input string.

• pos_tag (str) – The Part-of-Speech tag for substitution.

• lang (str) – The language of the input string.

Returns

Hypernyms of the word.

get_hyponyms(word, pos_tag='', lang='eng')[source]
Parameters
• word (str) – The input string.

• pos_tag (str) – The Part-of-Speech tag for substitution.

• lang (str) – The language of the input string.

Returns

Hyponyms of the word.
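A minimal in-memory stand-in for this interface might look like the following. The tiny hand-written lexicon is purely illustrative, whereas Forte's WordnetDictionary backs these methods with NLTK WordNet:

```python
# A minimal in-memory stand-in for the Dictionary interface. The lexicon
# is a tiny invented sample; Forte's WordnetDictionary backs these lookups
# with NLTK WordNet instead.

class ToyDictionary:
    LEXICON = {
        "happy": {"SYNONYM": ["glad", "joyful"], "ANTONYM": ["sad"]},
        "car": {"SYNONYM": ["automobile"], "HYPERNYM": ["vehicle"]},
    }

    def get_lemmas(self, word, lemma_type):
        # Unknown words or missing relations yield an empty list.
        return self.LEXICON.get(word, {}).get(lemma_type, [])

    def get_synonyms(self, word, pos_tag="", lang="eng"):
        return self.get_lemmas(word, "SYNONYM")

    def get_antonyms(self, word, pos_tag="", lang="eng"):
        return self.get_lemmas(word, "ANTONYM")

    def get_hypernyms(self, word, pos_tag="", lang="eng"):
        return self.get_lemmas(word, "HYPERNYM")

d = ToyDictionary()
synonyms = d.get_synonyms("happy")  # ["glad", "joyful"]
```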

### WordnetDictionary¶

class forte.processors.data_augment.algorithms.dictionary.WordnetDictionary[source]

This class wraps the nltk WORDNET to replace the input word with a synonym/antonym/hypernym/hyponym. A Part-of-Speech tag (optional) can be provided to the wordnet for retrieving words with the same POS.

get_lemmas(word, pos_tag='', lang='eng', lemma_type='SYNONYM')[source]

This function gets synonyms/antonyms/hypernyms/hyponyms from a WORDNET dictionary.

Parameters
• word (str) – The input token.

• pos_tag (str) – The NLTK POS tag.

• lang (str) – The input language.

• lemma_type (str) –

The type of words to replace, must be one of the following:

• 'SYNONYM'

• 'ANTONYM'

• 'HYPERNYM'

• 'HYPONYM'

get_synonyms(word, pos_tag='', lang='eng')[source]

This function gets synonyms of the word from a WORDNET dictionary.

get_antonyms(word, pos_tag='', lang='eng')[source]

This function gets antonyms of the word from a WORDNET dictionary.

get_hypernyms(word, pos_tag='', lang='eng')[source]

This function gets hypernyms of the word from a WORDNET dictionary.

get_hyponyms(word, pos_tag='', lang='eng')[source]

This function gets hyponyms of the word from a WORDNET dictionary.

### TypoReplacementOp¶

class forte.processors.data_augment.algorithms.typo_replacement_op.TypoReplacementOp(configs)[source]

This class is a replacement op using a pre-defined spelling mistake dictionary to simulate spelling mistakes.

Parameters

configs –

The config should contain:

• prob (float): The probability of replacement, should fall in [0, 1].

• dict_path (str): The URL or path to the pre-defined typo JSON file. The key is a word we want to replace; the value is a list containing various typos of the corresponding key.

• typo_generator (str): A generator that takes in a word and outputs the replacement typo.

replace(input_anno)[source]

This function replaces a word with a typo drawn from a typo dictionary.

Parameters

input_anno (Annotation) – The input annotation.

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.
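The lookup-and-sample logic can be sketched as follows. The typo entries below are invented, while the real op loads them from the file at dict_path:

```python
import random

# A sketch of typo replacement: look the word up in a typo dictionary and,
# with probability `prob`, swap in one of its known misspellings. The typo
# entries are invented; the real op loads them from dict_path.

TYPO_DICT = {
    "because": ["becuase", "becasue"],
    "receive": ["recieve"],
}

def typo_replace(word, prob, rng):
    typos = TYPO_DICT.get(word)
    # No known typos, or the Bernoulli draw says keep the original.
    if not typos or rng.random() >= prob:
        return False, word
    return True, rng.choice(typos)

rng = random.Random(0)
replaced, out = typo_replace("receive", prob=1.0, rng=rng)
# replaced is True; out is "recieve"
```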

### WordSplittingOp¶

class forte.processors.data_augment.algorithms.word_splitting_processor.RandomWordSplitDataAugmentProcessor[source]

This class creates a processor to perform Random Word Splitting. It randomly chooses n words in a sentence and splits each word at a random position, where n = alpha * input length. alpha indicates the percentage of the words in a sentence that are changed.

classmethod default_configs()[source]
Returns

A dictionary with the default config for this processor. An additional key determines how many words will be split:
• alpha: 0 <= alpha <= 1, indicates the percentage of the words in a sentence that are changed.

Config Values:

• other_entry_policy (dict):

A dict specifying the policies for other entries. The key should be a fully qualified class name. The policy (the value of the dict) specifies how to process the corresponding entries after replacement.

If the policy is “auto_align”, the span of the entry will be automatically modified according to its original location. However, some spans might become invalid after the augmentation, for example, the tokens within a replaced sentence may disappear.

Annotations not in the “other_entry_policy” will not be copied to the new data pack. The Links and Groups will be copied as well if the annotations they are attached to are copied.

Example:
```python
'other_entry_policy': {
    "kwargs": {
        "ft.onto.base_ontology.Document": "auto_align",
        "ft.onto.base_ontology.Sentence": "auto_align",
    }
}
```

• augment_pack_names (dict): The name of the data pack that will contain the augmented text (Default: augmented_input_src). To update it, pass a dict of the form:

Example:
```python
'augment_pack_names': {
    "kwargs": {
        "input_src": "augmented_input_src",
    }
}
```

• alpha (float):

The probability of splitting, in [0, 1] (Default: 0.1).
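The splitting logic can be sketched as follows; the helper name and details are illustrative, not the processor's actual internals:

```python
import random

# A sketch of random word splitting: pick n = round(alpha * num_words)
# words and split each at a random interior position. Illustrative only,
# not the processor's actual internals.

def random_word_split(sentence, alpha, rng):
    words = sentence.split()
    n = round(alpha * len(words))
    # Only words of length >= 2 can be split at an interior position.
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    for i in rng.sample(candidates, min(n, len(candidates))):
        w = words[i]
        cut = rng.randint(1, len(w) - 1)
        words[i] = w[:cut] + " " + w[cut:]
    return " ".join(words)

rng = random.Random(0)
out = random_word_split("data augmentation helps generalization", 0.5, rng)
# Two of the four words are split, yielding six tokens.
```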

### CharacterFlipOp¶

class forte.processors.data_augment.algorithms.character_flip_op.CharacterFlipOp(configs)[source]

A uniform generator that randomly flips a character with a similar-looking character from a predefined dictionary imported from https://github.com/facebookresearch/AugLy/blob/main/augly/text/augmenters/utils.py (for example: “the cat drank milk” -> “t/-/3 c@t d12@nk m!|_1<”).

Parameters
• string – the input string whose characters need to be replaced.

• dict_path (str) – the URL or path to the pre-defined typo JSON file.

• configs – prob (float): the probability of replacement, should fall in [0, 1].

replace(input_anno)[source]

Takes in the annotated string and performs the character flip operation, randomly augmenting a few characters based on the probability value in the configs.

Parameters

input_anno – the input annotation.

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the final augmented string.
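The flip operation can be sketched with a small mapping in the spirit of the AugLy table; both the mapping and the helper below are illustrative:

```python
import random

# A sketch of character flipping: each character is replaced, with
# probability `prob`, by a similar-looking glyph from a small mapping.
# The mapping is a tiny invented sample in the spirit of AugLy's table.

FLIP_MAP = {"a": "@", "e": "3", "i": "!", "l": "|_", "t": "+"}

def character_flip(text, prob, rng):
    out = []
    flipped = False
    for ch in text:
        if ch.lower() in FLIP_MAP and rng.random() < prob:
            out.append(FLIP_MAP[ch.lower()])
            flipped = True
        else:
            out.append(ch)
    return flipped, "".join(out)

rng = random.Random(0)
flipped, out = character_flip("the cat drank milk", prob=1.0, rng=rng)
# With prob=1.0 every mappable character is flipped.
```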

## Data Augmentation Models¶

### Reinforcement Learning¶

class forte.models.da_rl.MetaAugmentationWrapper(augmentation_model, augmentation_optimizer, input_mask_ids, device, num_aug)[source]

A wrapper adding data augmentation to a Bert model with arbitrary tasks. This is used to perform reinforcement learning for joint data augmentation learning and model training.

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/augmentation/generator.py

Let $$\theta$$ be the parameters of the downstream (classifier) model. Let $$\phi$$ be the parameters of the augmentation model. Equations to update $$\phi$$:

$$\theta'(\phi) = \theta - \nabla_{\theta} L_{train}(\theta, \phi)$$

$$\phi = \phi - \nabla_{\phi} L_{val}(\theta'(\phi))$$
Parameters
• augmentation_model – A Bert-based model for data augmentation, e.g. BertForMaskedLM. Model requirement: masked language modeling; the output logits of this model are of shape [batch_size, seq_length, token_size].

• augmentation_optimizer – An optimizer that is associated with augmentation_model. E.g. Adam optimizer.

• input_mask_ids – Bert token id of ‘[MASK]’. This is used to randomly mask out tokens from the input sentence during training.

• device – The CUDA device to run the model on.

• num_aug – The number of samples from the augmentation model for every augmented training instance.

Example usage:

```python
aug_wrapper = MetaAugmentationWrapper(
    aug_model, aug_optim, mask_id, device, num_aug)
for batch in training_data:
    # Train augmentation model params.
    aug_wrapper.reset_model()
    for instance in batch:
        # Augmented example with params phi exposed
        aug_instance_features = \
            aug_wrapper.augment_instance(instance_features)
        # Model is the downstream Bert model.
        loss = model(aug_instance_features)
        meta_model = MetaModule(model)
        meta_model = aug_wrapper.update_meta_model(
            meta_model, loss, model, optim)

    # Compute gradient of the augmentation model on validation data.
    for val_batch in validation_data:
        val_loss = meta_model(val_batch_features)
        val_loss = val_loss / num_training_instance / num_aug \
            / num_val_batch
        val_loss.backward()
    # Update augmentation model params.
    aug_wrapper.update_phi()

    # Train the classifier with the augmented batch.
    aug_batch_features = aug_wrapper.augment_batch(batch_features)
    loss = model(aug_batch_features)
    loss.backward()
    optim.step()
```

augment_instance(features)[source]

Augment a training instance.

Parameters

features

A tuple of Bert features of one training instance. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [seq_len].

input_mask is a tensor of shape [seq_len], with 1 indicating a token that is not masked and 0 a masked token.

segment_ids is a tensor of shape [seq_len]. label_ids is a tensor of shape [seq_len].

Returns

A tuple of Bert features of augmented training instances. (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).

input_probs_aug is a tensor of soft Bert embeddings, distributions over vocabulary. It has shape [num_aug, seq_len, token_size]. It keeps $$\phi$$ as variable so that after passing it as an input to the classifier, the gradients of $$\theta$$ will also apply to $$\phi$$.

input_mask_aug is a tensor of shape [num_aug, seq_len]; it concatenates num_aug copies of the input input_mask so that it corresponds to the mask of each token in input_probs_aug.

segment_ids_aug is a tensor of shape [num_aug, seq_len]; it concatenates num_aug copies of the input segment_ids so that it corresponds to the token type of each token in input_probs_aug.

label_ids_aug is a tensor of shape [num_aug, seq_len]; it concatenates num_aug copies of the input label_ids so that it corresponds to the label of each token in input_probs_aug.

augment_batch(batch_features)[source]

Augment a batch of training instances. Append augmented instances to the input instances.

Parameters

batch_features

A tuple of Bert features of a batch training instances. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [batch_size, seq_len].

input_mask, segment_ids, label_ids are all tensors of shape [batch_size, seq_len].

Returns

A tuple of Bert features of augmented training instances. (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).

input_probs_aug is a tensor of soft Bert embeddings. It has shape [batch_size * 2, seq_len, token_size].

input_mask_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two input input_mask tensors: the first corresponds to the mask of the tokens in the original bert instance, the second to the mask of the augmented bert instance.

segment_ids_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two input segment_ids tensors: the first corresponds to the segment ids of the tokens in the original bert instance, the second to the segment ids of the augmented bert instance.

label_ids_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two input label_ids tensors: the first corresponds to the labels of the original bert instance, the second to the labels of the augmented bert instance.

eval_batch(batch_features)[source]

Evaluate a batch of training instances.

Parameters

batch_features

A tuple of Bert features of a batch training instances. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [batch_size, seq_len].

input_mask, segment_ids, label_ids are all tensors of shape [batch_size, seq_len].

Returns

The masked language modeling loss of one evaluation batch. It is a torch.FloatTensor of shape [1,].

update_meta_model(meta_model, loss, model, optimizer)[source]

Update the parameters within the MetaModel according to the downstream model loss.

MetaModel is used to calculate $$\nabla_{\phi} L_{val}(\theta'(\phi))$$, where it needs gradients applied to $$\phi$$.

Perform parameter updates in this function, and later applies gradient change to $$\theta$$ and $$\phi$$ using validation data.

Parameters
• meta_model – A meta model whose parameters will be updated in-place by the deltas calculated from the input loss.

• loss – The loss of the downstream model that has taken the augmented training instances as input.

• model – The downstream Bert model.

• optimizer – The optimizer that is associated with the model.

Returns

The same input meta_model with the updated parameters.

class forte.models.da_rl.MetaModule(module)[source]

A class extending torch.nn.ModuleList that registers the parameters of a torch.nn.Module and performs memory-efficient parameter updates locally.

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/magic_module.py

It implements the calculation: $$L(\theta - \nabla_{\theta} L_{train}(\theta, \phi))$$.

Parameters

module – A torch.nn.Module.

This class can be used directly for a simple input module whose sub-modules do not rely on helper functions or attributes outside this class to perform their forward().

Otherwise, since forward() calls the input module’s forward(), this class needs to extend those sub-modules that define the methods required by their forward(), so that it inherits those methods and the sub-modules’ forward() can be performed correctly.

For example, if the input module is BERTClassifier, then _get_noise_shape(), _split_heads(), and _combine_heads() from its sub-modules (e.g. BERTEncoder) need to be exposed in this class to perform their forward(). Please refer to TexarBertMetaModule for instructions on creating a subclass from this one for a specific input module.

forward(*args, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class forte.models.da_rl.TexarBertMetaModule(module)[source]

A subclass that extends MetaModule to perform parameter updates locally for texar-pytorch Bert-related modules, e.g. texar.torch.modules.BERTClassifier.

Please refer to its base class MetaModule for more details.

Parameters

module – A torch.nn.Module.

This class extends EmbedderBase and MultiheadAttentionEncoder, such that it inherits the methods needed to perform forward() of the modules that utilize them, e.g. BERTEncoder.

Some notes of the order of the base classes that this class extends:

MetaModule should be the first one, so that its forward() will call MetaModule.forward() instead of the forward() of the other base classes, such as texar.torch.modules.MultiheadAttentionEncoder.forward(). If MetaModule is not the first one, then a forward() should be defined in this class, such that it is called correctly.

Example

```python
def forward(self, *args, **kwargs):
    return MetaModule.forward(self, *args, **kwargs)
```