Data Augmentation

Data Augmentation Processors

ReplacementDataAugmentProcessor

class forte.processors.data_augment.base_data_augment_processor.ReplacementDataAugmentProcessor[source]

Most Data Augmentation (DA) methods can be considered replacement-based methods operating at different levels: character, word, sentence, or document.

DataSelector

class forte.processors.base.data_selector_for_da.BaseElasticSearchDataSelector(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The base Elasticsearch indexer for the data selector. This class creates an ElasticSearchIndexer and searches for documents according to user-provided search keys. The currently supported search criteria are random-based and query-based. It then yields the corresponding data packs of the selected documents.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor or reader will be initialized with configs, and global resources will be registered into resources. The implementation should set up the state of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (Config) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations for the reader with default values. These are used to replace missing values of the input configs during pipeline construction.

{
    "name": "reader"
}
class forte.processors.data_augment.selector_index_processor.DataSelectorIndexProcessor[source]

This selector index directly reuses the default PackIndexProcessor.

class forte.processors.ir.elastic_search_index_processor.ElasticSearchPackIndexProcessor[source]

This processor indexes the data packs into an Elasticsearch index.

UDAIterator

class forte.processors.data_augment.algorithms.UDA.UDAIterator(sup_iterator, unsup_iterator, softmax_temperature=1.0, confidence_threshold=-1, reduction='mean')[source]

This iterator wraps the Unsupervised Data Augmentation (UDA) algorithm by calculating the unsupervised loss automatically during each iteration. It takes both a supervised and an unsupervised data iterator as input.

The unsupervised data should contain the original input and the augmented input. The original and augmented inputs should be in the same training example.

During each iteration, the iterator returns the supervised and unsupervised batches. Users can call calculate_uda_loss() to get the UDA loss and combine it with the supervised loss for model training.

It uses techniques such as prediction sharpening and confidence masking. Please refer to the UDA paper (https://arxiv.org/abs/1904.12848) for more details. A minimal usage sketch is given after the parameter list below.

Parameters
  • sup_iterator – The iterator for supervised data. Each item is a training/evaluation/test example with key-value pairs as inputs.

  • unsup_iterator – The iterator for unsupervised data. Each training example in it should contain both the original and augmented data.

  • softmax_temperature – The softmax temperature for sharpening the distribution. The value should be larger than 0. Defaults to 1.

  • confidence_threshold – The threshold for confidence-masking. It is a threshold of the probability in [0, 1], rather than of the logit. If set to -1, the threshold will be ignored. Defaults to -1.

  • reduction

    Specifies the reduction to apply to the output. This is the same as the reduction argument in texar.torch.losses.info_loss.kl_divg_loss_with_logits(). The loss will be a scalar tensor unless the reduction is 'none'. Defaults to 'mean'. The options are:

    • 'none': no reduction will be applied.

    • 'batchmean': the sum of the output will be divided by the batch size.

    • 'sum': the output will be summed.

    • 'mean': the output will be divided by the number of elements in the output.
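
A minimal training-loop sketch following the pattern described above. This is not a verbatim Forte example: classifier, criterion, optimizer, sup_iterator, and unsup_iterator are assumed to be defined elsewhere, and the batch field names ("input", "label", "orig", "aug") as well as the unsup_weight scalar are illustrative placeholders, not part of the documented API.

# Hypothetical training loop; only UDAIterator and calculate_uda_loss()
# come from the documented API, everything else is assumed to exist.
uda_iterator = UDAIterator(
    sup_iterator,
    unsup_iterator,
    softmax_temperature=0.85,
    confidence_threshold=0.45,
    reduction="mean",
)

for sup_batch, unsup_batch in uda_iterator:
    # Supervised loss on labeled data.
    sup_logits = classifier(sup_batch["input"])
    sup_loss = criterion(sup_logits, sup_batch["label"])

    # Consistency (UDA) loss between the original and augmented inputs.
    logits_orig = classifier(unsup_batch["orig"])
    logits_aug = classifier(unsup_batch["aug"])
    unsup_loss = uda_iterator.calculate_uda_loss(logits_orig, logits_aug)

    # Combine supervised and unsupervised losses; unsup_weight is a chosen scalar.
    loss = sup_loss + unsup_weight * unsup_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()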

calculate_uda_loss(logits_orig, logits_aug)[source]

This function calculates the KL divergence between the output probabilities of the original input and the augmented input. The two inputs must have the same shape, and their last dimension should be the probability distribution.

Parameters
  • logits_orig – A tensor containing the logits of the original data.

  • logits_aug – A tensor containing the logits of the augmented data. It must have the same shape as logits_orig.

Returns

The loss, as a PyTorch scalar float tensor if the reduction is not 'none'; otherwise a tensor with the same shape as logits_orig.

Data Augmentation Ops

TextReplacementOp

class forte.processors.data_augment.algorithms.text_replacement_op.TextReplacementOp(configs)[source]

The base class that holds the data augmentation algorithm. The replace() method is left to be implemented by subclasses.

abstract replace(input_anno)[source]

Most data augmentation algorithms can be considered replacement-based methods at different levels. This function takes in an annotation as input and returns the augmented string.

Parameters

input_anno – the input annotation to be replaced.

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.
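
A minimal sketch of a custom subclass, assuming the op receives a Forte Annotation and returns the (replaced, text) tuple described above; the reversal logic is purely illustrative.

from typing import Tuple

from forte.data.ontology.top import Annotation
from forte.processors.data_augment.algorithms.text_replacement_op import (
    TextReplacementOp,
)


class ReverseReplacementOp(TextReplacementOp):
    # A toy op that "augments" an annotation by reversing its text.
    def replace(self, input_anno: Annotation) -> Tuple[bool, str]:
        # Return whether a replacement happened and the replacement string.
        return True, input_anno.text[::-1]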

DistributionReplacementOp

class forte.processors.data_augment.algorithms.DistributionReplacementOp(sampler, configs)[source]

This class is a replacement op that replaces the input word with a new word sampled from a distribution by the given sampler.

Parameters
  • sampler – The sampler that samples a word from a distribution.

  • configs – The config should contain prob, the probability of replacing the input; it should fall in [0, 1].

replace(input_anno)[source]

This function replaces a word by sampling from a distribution.

Parameters

input_anno (Annotation) – The input annotation.

Returns

A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced word.

Sampler

class forte.processors.data_augment.algorithms.sampler.UniformSampler(word_list)[source]

A sampler that samples a word from a uniform distribution.

Parameters

word_list – A list of words that this sampler uniformly samples from.

class forte.processors.data_augment.algorithms.sampler.UnigramSampler(unigram)[source]

A sampler that samples a word from a unigram distribution.

Parameters

unigram – A dictionary. The key is a word, the value is the word count or a probability. This sampler samples from this word distribution.
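
A sketch combining the samplers with DistributionReplacementOp. The word lists and counts are illustrative, and it is assumed that configs accepts a mapping containing the prob field described above (wrap it in Forte's Config class if your version requires it).

from forte.processors.data_augment.algorithms import DistributionReplacementOp
from forte.processors.data_augment.algorithms.sampler import (
    UniformSampler,
    UnigramSampler,
)

# Sample uniformly from a fixed word list.
uniform_sampler = UniformSampler(word_list=["apple", "banana", "orange"])

# Sample proportionally to word counts (or probabilities).
unigram_sampler = UnigramSampler(unigram={"apple": 5, "banana": 2, "orange": 1})

# Replace an input word with probability 0.3, drawing the replacement
# from the uniform sampler.
op = DistributionReplacementOp(uniform_sampler, {"prob": 0.3})
# replaced, new_word = op.replace(token_annotation)  # token_annotation: an Annotation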

MachineTranslator

class forte.processors.data_augment.algorithms.machine_translator.MachineTranslator(src_lang, tgt_lang, device)[source]

This class is a wrapper for machine translation models.

Parameters
  • src_lang – The source language.

  • tgt_lang – The target language.

  • device – “cuda” for gpu, “cpu” otherwise.

abstract translate(src_text)[source]

This function translates the input text into the target language.

Parameters

src_text (str) – The input text in the source language.

Returns

The output text in the target language.

MarianMachineTranslator

class forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator(src_lang='en', tgt_lang='fr', device='cpu')[source]

This class is a wrapper for the Marian Machine Translator (https://huggingface.co/transformers/model_doc/marian.html). Please refer to their doc for supported languages.

translate(src_text)[source]

This function translates the input text into the target language.

Parameters

src_text (str) – The input text in the source language.

Returns

The output text in the target language.
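
A usage sketch, assuming the pretrained Marian model for the chosen language pair can be downloaded by the underlying transformers library.

from forte.processors.data_augment.algorithms.machine_translator import (
    MarianMachineTranslator,
)

# English -> French translation on the CPU.
translator = MarianMachineTranslator(src_lang="en", tgt_lang="fr", device="cpu")
print(translator.translate("Data augmentation improves model robustness."))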

BackTranslationOp

class forte.processors.data_augment.algorithms.back_translation_op.BackTranslationOp(configs)[source]

This class is a replacement op using back translation to generate data with the same semantic meanings. The input is translated to another language, then translated back to the original language, with pretrained machine-translation models.

It will sample from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

The configuration should have the following fields:

Parameters
  • prob (float) – The probability of replacement, should fall in [0, 1].

  • src_lang (str) – The source language of back translation.

  • tgt_lang (str) – The target language of back translation.

  • model_to (str) – The fully qualified name of the model translating from the source language to the target language.

  • model_back (str) – The fully qualified name of the model translating from the target language back to the source language.

  • device (str) – “cpu” for the CPU or “cuda” for the GPU.
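
A configuration sketch using the field names above, assuming the MarianMachineTranslator documented earlier is used for both translation directions.

# Hypothetical config values; the model paths match the MarianMachineTranslator
# module documented above, the probability and languages are illustrative.
back_translation_configs = {
    "prob": 0.5,
    "src_lang": "en",
    "tgt_lang": "fr",
    "model_to": "forte.processors.data_augment.algorithms"
                ".machine_translator.MarianMachineTranslator",
    "model_back": "forte.processors.data_augment.algorithms"
                  ".machine_translator.MarianMachineTranslator",
    "device": "cpu",
}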

replace(input_anno)[source]

This function replaces a piece of text with back translation.

Parameters

input_anno (Entry) – An annotation; it could be a word, sentence, or document.

Returns

A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

DictionaryReplacementOp

class forte.processors.data_augment.algorithms.dictionary_replacement_op.DictionaryReplacementOp(configs)[source]

This class is a replacement op that utilizes dictionaries, such as WORDNET, to replace the input word with a synonym. An optional Part-of-Speech tag can be provided to the WordNet lookup for retrieving synonyms with the same POS. It will sample from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

The config should contain the following fields:
  • dictionary: The fully qualified name of the dictionary class.

  • prob: The probability of replacement, should fall in [0, 1].

  • lang: The language of the text.
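
A configuration sketch using the field names above, assuming the WordnetDictionary documented below is used as the dictionary class; the probability is illustrative.

dictionary_replacement_configs = {
    # Fully qualified name of the dictionary class documented below.
    "dictionary": "forte.processors.data_augment.algorithms"
                  ".dictionary.WordnetDictionary",
    "prob": 0.3,
    "lang": "eng",
}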

replace(input_anno)[source]

This function replaces a word with a synonym from a WORDNET dictionary.

Parameters

input_anno (Annotation) – The input annotation.

Returns

A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.

Dictionary

class forte.processors.data_augment.algorithms.dictionary.Dictionary[source]

This class defines a dictionary for word replacement. Given an input word and optionally its pos_tag, the dictionary outputs its synonyms, antonyms, hypernyms, and hyponyms.

get_synonyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Returns

Synonyms of the word.

get_antonyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Returns

Antonyms of the word.

get_hypernyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Returns

Hypernyms of the word.

get_hyponyms(word, pos_tag='', lang='eng')[source]
Parameters
  • word (str) – The input string.

  • pos_tag (str) – The Part-of-Speech tag for substitution.

  • lang (str) – The language of the input string.

Returns

Hyponyms of the word.

WordnetDictionary

class forte.processors.data_augment.algorithms.dictionary.WordnetDictionary[source]

This class wraps the NLTK WordNet to replace the input word with a synonym/antonym/hypernym/hyponym. An optional Part-of-Speech tag can be provided to the WordNet lookup for retrieving words with the same POS.

get_lemmas(word, pos_tag='', lang='eng', lemma_type='SYNONYM')[source]

This function gets synonyms/antonyms/hypernyms/hyponyms from a WORDNET dictionary.

Parameters
  • word (str) – The input token.

  • pos_tag (str) – The NLTK POS tag.

  • lang (str) – The input language.

  • lemma_type (str) –

    The type of words to replace, must be one of the following:

    • 'SYNONYM'

    • 'ANTONYM'

    • 'HYPERNYM'

    • 'HYPONYM'

get_synonyms(word, pos_tag='', lang='eng')[source]

This function retrieves synonyms of a word from the WORDNET dictionary.

get_antonyms(word, pos_tag='', lang='eng')[source]

This function retrieves antonyms of a word from the WORDNET dictionary.

get_hypernyms(word, pos_tag='', lang='eng')[source]

This function retrieves hypernyms of a word from the WORDNET dictionary.

get_hyponyms(word, pos_tag='', lang='eng')[source]

This function retrieves hyponyms of a word from the WORDNET dictionary.
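
A usage sketch, assuming the NLTK WordNet corpora have been downloaded; the example word is illustrative.

import nltk

from forte.processors.data_augment.algorithms.dictionary import WordnetDictionary

# One-time download of the WordNet data used by NLTK.
nltk.download("wordnet")
nltk.download("omw-1.4")

dictionary = WordnetDictionary()
print(dictionary.get_synonyms("happy", lang="eng"))
print(dictionary.get_antonyms("happy", lang="eng"))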

Data Augmentation Models

Reinforcement Learning

class forte.models.da_rl.MetaAugmentationWrapper(augmentation_model, augmentation_optimizer, input_mask_ids, device, num_aug)[source]

A wrapper that adds data augmentation to a Bert model with arbitrary downstream tasks. It is used to perform reinforcement learning for joint data augmentation learning and model training.

See: https://arxiv.org/pdf/1910.12795.pdf

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/augmentation/generator.py

Let \(\theta\) be the parameters of the downstream (classifier) model. Let \(\phi\) be the parameters of the augmentation model. Equations to update \(\phi\):

\[\begin{aligned}
\theta'(\phi) &= \theta - \nabla_{\theta} L_{train}(\theta, \phi) \\
\phi &= \phi - \nabla_{\phi} L_{val}(\theta'(\phi))
\end{aligned}\]
Parameters
  • augmentation_model – A Bert-based model for data augmentation, e.g. BertForMaskedLM. Model requirement: masked language modeling; the output logits of this model are of shape [batch_size, seq_length, token_size].

  • augmentation_optimizer – An optimizer that is associated with augmentation_model. E.g. Adam optimizer.

  • input_mask_ids – Bert token id of ‘[MASK]’. This is used to randomly mask out tokens from the input sentence during training.

  • device – The CUDA device to run the model on.

  • num_aug – The number of samples from the augmentation model for every augmented training instance.

Example usage:

aug_wrapper = MetaAugmentationWrapper(
    aug_model, aug_optim, mask_id, device, num_aug)
for batch in training_data:
    # Train augmentation model params.
    aug_wrapper.reset_model()
    for instance in batch:
        # Augmented example with params phi exposed
        aug_instance_features = \
            aug_wrapper.augment_instance(instance_features)
        # Model is the downstream Bert model.
        model.zero_grad()
        loss = model(aug_instance_features)
        meta_model = MetaModule(model)
        meta_model = aug_wrapper.update_meta_model(
            meta_model, loss, model, optim)

        # Compute the gradient of the augmentation model on validation data.
        for val_batch in validation_data:
            val_loss = meta_model(val_batch_features)
            val_loss = val_loss / num_training_instance / num_aug \
                / num_val_batch
            val_loss.backward()
    # update augmentation model params.
    aug_wrapper.update_phi()

    # train classifier with augmented batch
    aug_batch_features = aug_wrapper.augment_batch(batch_features)
    optim.zero_grad()
    loss = model(aug_batch_features)
    loss.backward()
    optim.step()
augment_instance(features)[source]

Augment a training instance.

Parameters

features

A tuple of Bert features of one training instance. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [seq_len].

input_mask is a tensor of shape [seq_len], with 1 indicating a position that is not masked and 0 indicating a masked position.

segment_ids is a tensor of shape [seq_len]. label_ids is a tensor of shape [seq_len].

Returns

A tuple of Bert features of augmented training instances. (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).

input_probs_aug is a tensor of soft Bert embeddings, i.e. distributions over the vocabulary. It has shape [num_aug, seq_len, token_size]. It keeps \(\phi\) as a variable so that, after it is passed as input to the classifier, the gradients of \(\theta\) also apply to \(\phi\).

input_mask_aug is a tensor of shape [num_aug, seq_len]; it repeats the input input_mask num_aug times so that it corresponds to the mask of each token in input_probs_aug.

segment_ids_aug is a tensor of shape [num_aug, seq_len]; it repeats the input segment_ids num_aug times so that it corresponds to the token type of each token in input_probs_aug.

label_ids_aug is a tensor of shape [num_aug, seq_len]; it repeats the input label_ids num_aug times so that it corresponds to the label of each token in input_probs_aug.

augment_batch(batch_features)[source]

Augment a batch of training instances. The augmented instances are appended to the input instances.

Parameters

batch_features

A tuple of Bert features of a batch of training instances. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [batch_size, seq_len].

input_mask, segment_ids, label_ids are all tensors of shape [batch_size, seq_len].

Returns

A tuple of Bert features of augmented training instances. (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).

input_probs_aug is a tensor of soft Bert embeddings. It has shape [batch_size * 2, seq_len, token_size].

input_mask_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two copies of the input input_mask, where the first corresponds to the mask of the tokens in the original Bert instances and the second to the mask of the augmented Bert instances.

segment_ids_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two copies of the input segment_ids, where the first corresponds to the segment ids of the tokens in the original Bert instances and the second to the segment ids of the augmented Bert instances.

label_ids_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two copies of the input label_ids, where the first corresponds to the labels of the original Bert instances and the second to the labels of the augmented Bert instances.

eval_batch(batch_features)[source]

Evaluate a batch of training instances.

Parameters

batch_features

A tuple of Bert features of a batch of training instances. (input_ids, input_mask, segment_ids, label_ids).

input_ids is a tensor of Bert token ids. It has shape [batch_size, seq_len].

input_mask, segment_ids, label_ids are all tensors of shape [batch_size, seq_len].

Returns

The masked language modeling loss of one evaluation batch. It is a torch.FloatTensor of shape [1,].

update_meta_model(meta_model, loss, model, optimizer)[source]

Update the parameters within the meta model (MetaModule) according to the downstream model loss.

The meta model is used to calculate \(\nabla_{\phi} L_{val}(\theta'(\phi))\), which requires gradients to be applied to \(\phi\).

This function performs the parameter updates; the gradient changes to \(\theta\) and \(\phi\) are applied later using validation data.

Parameters
  • meta_model – A meta model whose parameters will be updated in-place by the deltas calculated from the input loss.

  • loss – The loss of the downstream model that has taken the augmented training instances as input.

  • model – The downstream Bert model.

  • optimizer – The optimizer that is associated with the model.

Returns

The same input meta_model with the updated parameters.

class forte.models.da_rl.MetaModule(module)[source]

A class extending torch.nn.ModuleList that registers the parameters of a torch.nn.Module and performs memory-efficient parameter updates locally.

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/magic_module.py

It implements the calculation: \(L(\theta - \nabla_{\theta} L_{train}(\theta, \phi))\).

Parameters

module – A torch.nn.Module.

This class can be used for a simple input module whose sub-modules do not rely on helper functions or attributes outside this class to perform their forward().

Otherwise, since forward() calls the input module's forward(), this class needs to extend the sub-modules that define the methods required by their forward(), so that it inherits those methods and can run each sub-module's forward() correctly.

For example, if the input module is BERTClassifier, then _get_noise_shape(), _split_heads(), and _combine_heads() from its sub-modules (e.g. BERTEncoder) need to be exposed in this class to perform their forward(). Please refer to TexarBertMetaModule for instructions on creating a subclass of this one for a specific input module.

forward(*args, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class forte.models.da_rl.TexarBertMetaModule(module)[source]

A subclass that extends MetaModule to perform parameter updates locally for texar-pytorch Bert-related modules, e.g. texar.torch.modules.BERTClassifier.

Please refer to its base class MetaModule for more details.

Parameters

module – A torch.nn.Module.

This class extends EmbedderBase and MultiheadAttentionEncoder, so that it inherits the methods needed to perform forward() of the modules that utilize them, e.g. BERTEncoder.

Some notes on the order of the base classes that this class extends:

MetaModule should come first, so that this class's forward() resolves to MetaModule.forward() instead of the forward() of the other base classes, such as texar.torch.modules.MultiheadAttentionEncoder.forward(). If MetaModule is not first, then a forward() should be defined in this class so that the correct forward() is called, as in the example below.

Example

def forward(self, *args, **kwargs):
    return MetaModule.forward(self, *args, **kwargs)