Data Augmentation

Data Augmentation Processors

ReplacementDataAugmentProcessor

DataSelector
class forte.processors.base.data_selector_for_da.BaseElasticSearchDataSelector(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

The base ElasticSearch indexer for the data selector. This class creates an ElasticSearchIndexer and searches for documents according to user-provided search keys. Two search criteria are currently supported: random-based and query-based. It then yields the data packs of the selected documents.

initialize(resources, configs)

The pipeline calls the initialize method at the start of processing. The processor and reader are initialized with configs, and global resources are registered into resources. The implementation should set up the states of the component.

Parameters
- resources (Resources) – A global resource registry. Users can register shareable resources here, for example, the vocabulary.
- configs (Config) – The configuration passed in to set up this component.
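A minimal pipeline sketch. Hedged: QueryDataSelector is assumed here to be a concrete query-based subclass of this selector, and the selector is assumed to act as a pipeline reader; only the constructor arguments come from the documented signature above.

    from forte.data.data_pack import DataPack
    from forte.pipeline import Pipeline
    # Hypothetical concrete subclass implementing the query-based criterion.
    from forte.processors.base.data_selector_for_da import QueryDataSelector

    pipeline = Pipeline[DataPack]()
    pipeline.set_reader(QueryDataSelector(from_cache=False))
    # Calls initialize(resources, configs) on every registered component.
    pipeline.initialize()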
UDAIterator

Data Augmentation Ops

TextReplacementOp
class forte.processors.data_augment.algorithms.text_replacement_op.TextReplacementOp(configs)

The base class that holds the data augmentation algorithm. The replace() method is left to be implemented by subclasses.

abstract replace(input_anno)

Most data augmentation algorithms can be considered replacement-based methods operating at different levels. This function takes an annotation as input and returns the augmented string.

Parameters
- input_anno – The input annotation to be replaced.

Returns
- A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.
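A minimal sketch of a subclass. The class name and the trivial reversal rule are illustrative, not part of the API; only TextReplacementOp and replace() come from the documentation above.

    from forte.data.ontology.top import Annotation
    from forte.processors.data_augment.algorithms.text_replacement_op import (
        TextReplacementOp,
    )

    class ReverseWordOp(TextReplacementOp):
        """A toy op that replaces an annotation's text with its reverse."""

        def replace(self, input_anno: Annotation):
            # Always replace, returning the reversed surface text.
            return True, input_anno.text[::-1]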
DistributionReplacementOp
class forte.processors.data_augment.algorithms.DistributionReplacementOp(sampler, configs)

This class is a replacement op that replaces the input word with a new word sampled by a sampler from a distribution.

Parameters
- sampler – The sampler that samples a word from a distribution.
- configs – The config should contain prob, the probability of whether to replace the input; it should fall in [0, 1].

replace(input_anno)

This function replaces a word by sampling from a distribution.

Parameters
- input_anno (Annotation) – The input annotation.

Returns
- A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced word.
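A hedged usage sketch. UniformSampler, its module path, and its word-list constructor are assumptions suggested by the Sampler section below, not verified signatures.

    from forte.processors.data_augment.algorithms import DistributionReplacementOp
    from forte.processors.data_augment.algorithms.sampler import UniformSampler

    # Sample uniformly from a fixed candidate list (assumed constructor).
    sampler = UniformSampler(["good", "great", "fine"])
    op = DistributionReplacementOp(sampler, configs={"prob": 0.5})
    # token_annotation: an existing Annotation (e.g. a Token) from a DataPack.
    # replaced is True when the Bernoulli draw chooses to replace.
    replaced, new_word = op.replace(token_annotation)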
Sampler

MachineTranslator

MarianMachineTranslator
class forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator(src_lang='en', tgt_lang='fr', device='cpu')

This class is a wrapper for the Marian Machine Translator (https://huggingface.co/transformers/model_doc/marian.html). Please refer to their documentation for supported languages.
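A hedged usage sketch: the translate() method is assumed from the MachineTranslator base class above; its exact signature is not shown on this page.

    from forte.processors.data_augment.algorithms.machine_translator import (
        MarianMachineTranslator,
    )

    mt = MarianMachineTranslator(src_lang="en", tgt_lang="fr", device="cpu")
    french_text = mt.translate("The weather is nice today.")  # assumed API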
BackTranslationOp
class forte.processors.data_augment.algorithms.back_translation_op.BackTranslationOp(configs)

This class is a replacement op that uses back translation to generate data with the same semantic meaning. The input is translated to another language, then translated back to the original language, using pretrained machine-translation models.

It samples from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

The configuration should have the following fields:

Parameters
- prob (float) – The probability of replacement, should fall in [0, 1].
- src_lang (str) – The source language of back translation.
- tgt_lang (str) – The target language of back translation.
- model_to (str) – The fully qualified name of the model from source language to target language.
- model_back (str) – The fully qualified name of the model from target language to source language.
- device (str) – "cpu" for the CPU or "cuda" for GPU.
replace(input_anno)

This function replaces a piece of text with its back translation.

Parameters
- input_anno (Entry) – An annotation, which could be a word, sentence, or document.

Returns
- A tuple, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.
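A hedged configuration sketch built from the fields listed above. The model paths reuse MarianMachineTranslator from this page; passing the config as a plain dict is an assumption.

    from forte.processors.data_augment.algorithms.back_translation_op import (
        BackTranslationOp,
    )

    marian = ("forte.processors.data_augment.algorithms."
              "machine_translator.MarianMachineTranslator")
    op = BackTranslationOp(configs={
        "prob": 0.9,           # replace 90% of the inputs
        "src_lang": "en",
        "tgt_lang": "fr",
        "model_to": marian,    # en -> fr
        "model_back": marian,  # fr -> en
        "device": "cpu",
    })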
DictionaryReplacementOp
class forte.processors.data_augment.algorithms.dictionary_replacement_op.DictionaryReplacementOp(configs)

This class is a replacement op that utilizes dictionaries, such as WORDNET, to replace the input word with a synonym. A part of speech (optional) can be provided to WordNet to retrieve synonyms with the same POS. It samples from a Bernoulli distribution to decide whether to replace the input, with prob as the probability of replacement.

The config should contain the following fields:
- dictionary: The fully qualified name of the dictionary class.
- prob: The probability of replacement, should fall in [0, 1].
- lang: The language of the text.
replace(input_anno)

This function replaces a word with synonyms from a WORDNET dictionary.

Parameters
- input_anno (Annotation) – The input annotation.

Returns
- A tuple of two values, where the first element is a boolean value indicating whether the replacement happens, and the second element is the replaced string.
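A hedged configuration sketch using the fields above. The dictionary path points at the WordnetDictionary documented below; passing a plain dict is an assumption.

    from forte.processors.data_augment.algorithms.dictionary_replacement_op import (
        DictionaryReplacementOp,
    )

    op = DictionaryReplacementOp(configs={
        "dictionary": ("forte.processors.data_augment.algorithms."
                       "dictionary.WordnetDictionary"),
        "prob": 0.8,    # replace 80% of eligible words
        "lang": "eng",  # WordNet language code
    })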
Dictionary
class forte.processors.data_augment.algorithms.dictionary.Dictionary

This class defines a dictionary for word replacement. Given an input word and its pos_tag (optional), the dictionary outputs its synonyms, antonyms, hypernyms, and hyponyms.
WordnetDictionary
class forte.processors.data_augment.algorithms.dictionary.WordnetDictionary

This class wraps the NLTK WORDNET to replace the input word with a synonym/antonym/hypernym/hyponym. A part of speech (optional) can be provided to WordNet to retrieve words with the same POS.

get_lemmas(word, pos_tag='', lang='eng', lemma_type='SYNONYM')

This function gets synonyms/antonyms/hypernyms/hyponyms from a WORDNET dictionary.

get_synonyms(word, pos_tag='', lang='eng')

This function gets synonyms of a word from a WORDNET dictionary.

get_antonyms(word, pos_tag='', lang='eng')

This function gets antonyms of a word from a WORDNET dictionary.
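A hedged usage sketch: the calls follow the signatures above with default arguments; the 'HYPERNYM' value for lemma_type is an assumption, and NLTK's wordnet corpus must be downloaded first.

    import nltk
    from forte.processors.data_augment.algorithms.dictionary import (
        WordnetDictionary,
    )

    nltk.download("wordnet")  # one-time corpus download
    wn = WordnetDictionary()
    synonyms = wn.get_synonyms("happy", lang="eng")
    antonyms = wn.get_antonyms("happy", lang="eng")
    # lemma_type value below is an assumption, mirroring the 'SYNONYM' default.
    hypernyms = wn.get_lemmas("dog", lang="eng", lemma_type="HYPERNYM")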
Data Augmentation Models

Reinforcement Learning
class forte.models.da_rl.MetaAugmentationWrapper(augmentation_model, augmentation_optimizer, input_mask_ids, device, num_aug)

A wrapper that adds data augmentation to a BERT model with arbitrary tasks. It is used to perform reinforcement learning for joint data augmentation learning and model training.

See: https://arxiv.org/pdf/1910.12795.pdf

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/augmentation/generator.py

Let \(\theta\) be the parameters of the downstream (classifier) model, and let \(\phi\) be the parameters of the augmentation model. The equations to update \(\phi\) are:

\[
\begin{aligned}
\theta'(\phi) &= \theta - \nabla_{\theta} L_{train}(\theta, \phi) \\
\phi &= \phi - \nabla_{\phi} L_{val}(\theta'(\phi))
\end{aligned}
\]

Parameters
- augmentation_model – A BERT-based model for data augmentation, e.g. BertForMaskedLM. Model requirement: masked language modeling; the output logits of this model have shape [batch_size, seq_length, token_size].
- augmentation_optimizer – An optimizer associated with augmentation_model, e.g. the Adam optimizer.
- input_mask_ids – The BERT token id of '[MASK]'. This is used to randomly mask out tokens from the input sentence during training.
- device – The CUDA device to run the model on.
- num_aug – The number of samples drawn from the augmentation model for every augmented training instance.
Example usage:
    aug_wrapper = MetaAugmentationWrapper(
        aug_model, aug_optim, mask_id, device, num_aug)
    for batch in training_data:
        # Train augmentation model parameters.
        aug_wrapper.reset_model()
        for instance in batch:
            # Augmented example with parameters phi exposed.
            aug_instance_features = \
                aug_wrapper.augment_instance(instance_features)
            # model is the downstream BERT model.
            model.zero_grad()
            loss = model(aug_instance_features)
            meta_model = MetaModule(model)
            meta_model = aug_wrapper.update_meta_model(
                meta_model, loss, model, optim)

        # Compute the gradient of the augmentation model on validation data.
        for val_batch in validation_data:
            val_loss = meta_model(val_batch_features)
            val_loss = val_loss / num_training_instance / num_aug \
                / num_val_batch
            val_loss.backward()

        # Update augmentation model parameters.
        aug_wrapper.update_phi()

        # Train the classifier with the augmented batch.
        aug_batch_features = aug_wrapper.augment_batch(batch_features)
        optim.zero_grad()
        loss = model(aug_batch_features)
        loss.backward()
        optim.step()
augment_instance(features)

Augment a training instance.

Parameters
- features – A tuple of BERT features of one training instance: (input_ids, input_mask, segment_ids, label_ids). input_ids is a tensor of BERT token ids of shape [seq_len]. input_mask is a tensor of shape [seq_len], with 1 indicating an unmasked token and 0 a masked one. segment_ids and label_ids are tensors of shape [seq_len].
Returns
A tuple of BERT features of the augmented training instances: (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).
- input_probs_aug is a tensor of soft BERT embeddings (distributions over the vocabulary) of shape [num_aug, seq_len, token_size]. It keeps \(\phi\) as a variable so that after passing it as an input to the classifier, the gradients of \(\theta\) will also apply to \(\phi\).
- input_mask_aug is a tensor of shape [num_aug, seq_len]; it repeats the input input_mask num_aug times so that it matches the mask of each token in input_probs_aug.
- segment_ids_aug is a tensor of shape [num_aug, seq_len]; it repeats the input segment_ids num_aug times so that it matches the token type of each token in input_probs_aug.
- label_ids_aug is a tensor of shape [num_aug, seq_len]; it repeats the input label_ids num_aug times so that it matches the label of each token in input_probs_aug.
augment_batch(batch_features)

Augment a batch of training instances and append the augmented instances to the input instances.

Parameters
- batch_features – A tuple of BERT features of a batch of training instances: (input_ids, input_mask, segment_ids, label_ids). input_ids is a tensor of BERT token ids of shape [batch_size, seq_len]. input_mask, segment_ids, and label_ids are all tensors of shape [batch_size, seq_len].
Returns
A tuple of BERT features of the augmented training instances: (input_probs_aug, input_mask_aug, segment_ids_aug, label_ids_aug).
- input_probs_aug is a tensor of soft BERT embeddings of shape [batch_size * 2, seq_len, token_size].
- input_mask_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two copies of the input input_mask, the first corresponding to the mask of the tokens in the original BERT instances and the second to the mask of the augmented BERT instances.
- segment_ids_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two copies of the input segment_ids, the first corresponding to the segment ids of the original BERT instances and the second to those of the augmented BERT instances.
- label_ids_aug is a tensor of shape [batch_size * 2, seq_len]; it concatenates two copies of the input label_ids, the first corresponding to the labels of the original BERT instances and the second to those of the augmented BERT instances.
eval_batch(batch_features)

Evaluate a batch of training instances.

Parameters
- batch_features – A tuple of BERT features of a batch of training instances: (input_ids, input_mask, segment_ids, label_ids). input_ids is a tensor of BERT token ids of shape [batch_size, seq_len]. input_mask, segment_ids, and label_ids are all tensors of shape [batch_size, seq_len].

Returns
- The masked language modeling loss of one evaluation batch, a torch.FloatTensor of shape [1,].
update_meta_model(meta_model, loss, model, optimizer)

Update the parameters within the MetaModule according to the downstream model loss.

MetaModule is used to calculate \(\nabla_{\phi} L_{val}(\theta'(\phi))\), which requires gradients to be applied to \(\phi\).

This function performs the parameter updates; the gradient changes to \(\theta\) and \(\phi\) are applied later using validation data.

Parameters
- meta_model – A meta model whose parameters will be updated in-place by the deltas calculated from the input loss.
- loss – The loss of the downstream model that took the augmented training instances as input.
- model – The downstream BERT model.
- optimizer – The optimizer associated with the model.

Returns
- The same input meta_model with updated parameters.
class forte.models.da_rl.MetaModule(module)

A class extending torch.nn.ModuleList that registers the parameters of a torch.nn.Module and performs memory-efficient parameter updates locally.

This code is adapted from: https://github.com/tanyuqian/learning-data-manipulation/blob/master/magic_module.py

It implements the calculation \(L(\theta - \nabla_{\theta} L_{train}(\theta, \phi))\).

Parameters
- module – A torch.nn.Module.

This class can be used directly for a simple input module whose sub-modules do not rely on helper functions or attributes outside of this class to perform their forward(). Otherwise, since forward() calls the input module's forward(), this class needs to extend the sub-modules that define the methods required by their forward(), so that it inherits those methods and can run each sub-module's forward() correctly.

For example, if the input module is BERTClassifier, then _get_noise_shape(), _split_heads(), and _combine_heads() from its sub-modules (e.g. BERTEncoder) need to be exposed in this class to perform their forward(). Please refer to TexarBertMetaModule for instructions on creating a subclass of this one for a specific input module.
forward(*args, **kwargs)

Defines the computation performed at every call. Should be overridden by all subclasses.

Note: Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this function, since the former takes care of running the registered hooks while the latter silently ignores them.
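A minimal sketch of wrapping a simple module, per the note above that simple input modules can be used directly. The Linear model here is illustrative, not from the docs.

    import torch
    from forte.models.da_rl import MetaModule

    net = torch.nn.Linear(4, 2)
    meta = MetaModule(net)         # registers net's parameters locally
    out = meta(torch.randn(3, 4))  # delegates to net's forward()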
class forte.models.da_rl.TexarBertMetaModule(module)

A subclass that extends MetaModule to perform parameter updates locally for texar-pytorch BERT-related modules, e.g. texar.torch.modules.BERTClassifier.

Please refer to its base class MetaModule for more details.

Parameters
- module – A torch.nn.Module.

This class extends EmbedderBase and MultiheadAttentionEncoder, so that it inherits the methods needed to perform forward() of the modules that utilize them, e.g. BERTEncoder.

Some notes on the order of the base classes that this class extends:

MetaModule should be the first one, so that its forward() resolves to MetaModule.forward() instead of the forward() of the other base classes, such as texar.torch.modules.MultiheadAttentionEncoder.forward(). If MetaModule is not the first one, then a forward() should be defined in this class so that the correct one is called.

Example
    def forward(self, *args, **kwargs):
        return MetaModule.forward(self, *args, **kwargs)
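A self-contained illustration of this ordering rule (toy classes, not Forte APIs): Python's method resolution order picks forward() from the first listed base.

    class A:
        def forward(self):
            return "A.forward"

    class B:
        def forward(self):
            return "B.forward"

    class FirstA(A, B):  # like listing MetaModule first
        pass

    class FirstB(B, A):  # the other base's forward() wins
        pass

    print(FirstA().forward())  # prints "A.forward"
    print(FirstB().forward())  # prints "B.forward"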