Handling Structured Data in DataPack

Retrieve data

DataPack.get() and DataPack.get_data() are the methods most commonly used to retrieve data from a DataPack. Let’s start by introducing DataPack.get(), which returns a generator that yields the requested data instances.

We can set up the data_pack using the following code.

[6]:
import os

from forte.data.data_pack import DataPack
from forte.pipeline import Pipeline
from forte.utils import utils
from ft.onto.base_ontology import (
    Token,
    Sentence,
    Document,
    AudioAnnotation,
    AudioUtterance,
)
from forte.data.ontology import Annotation
from forte.data.readers import OntonotesReader, AudioReader
# notebook should be running from project root folder
data_path = os.path.abspath(
    os.path.join("data_samples", "ontonotes/one_file")
)
pipeline: Pipeline = Pipeline()
pipeline.set_reader(OntonotesReader())
pipeline.initialize()
data_pack: DataPack = pipeline.process_one(data_path)
WARNING:root:Re-declared a new class named [ConstituentNode], which is probably used in import.

The following code shows how to retrieve data instances and access their data fields.

[7]:
for doc_idx, instance in enumerate(data_pack.get(Document)):
    print(doc_idx, "document instance:  ", instance)
    print(doc_idx, "document text:  ", instance.text)
0 document instance:   Document(document_class=[], sentiment={}, classifications=<forte.data.ontology.core.FDict object at 0x7f0654e37a50>)
0 document text:   The Indonesian billionaire James Riady has agreed to pay $ 8.5 million and plead guilty to illegally donating money for Bill Clinton 's 1992 presidential campaign . He admits he was trying to influence American policy on China .

As we can see, we can get the data instance from the generator returned by data_pack.get(Document), and we can access the document text via instance.text.

By contrast, DataPack.get_data() returns a generator that generates dictionaries containing the requested data, and each dictionary has a scope covering a certain range of data in the DataPack.

To understand this, let’s consider a dummy case: given that there is one document in the DataPack instance data_pack, we want to get the full document from data_pack.

Then we can run the following code to get the full document.

[8]:
for doc_idx, doc_d in enumerate(data_pack.get_data(context_type=Document)):
    print(doc_idx, ":  ", doc_d['context'])
0 :   The Indonesian billionaire James Riady has agreed to pay $ 8.5 million and plead guilty to illegally donating money for Bill Clinton 's 1992 presidential campaign . He admits he was trying to influence American policy on China .

As we can see, the generator yields one dictionary per iteration (in this dummy case, there is only one iteration), and the document text is retrieved via the dictionary key 'context'.

To better understand this, let’s consider a more concrete case. Since the document contains two sentences, suppose we want to retrieve text data sentence by sentence for a linguistic analysis task. In other words, we expect two dictionaries in the generator, and each dictionary stores a sentence.

We can get each sentence by the following code.

[9]:
data_generator = data_pack.get_data(context_type=Sentence)
for sent_idx, sent_d in enumerate(data_generator):
    print(sent_idx, sent_d['context'])
0 The Indonesian billionaire James Riady has agreed to pay $ 8.5 million and plead guilty to illegally donating money for Bill Clinton 's 1992 presidential campaign .
1 He admits he was trying to influence American policy on China .

As we can see, we get the two sentences by two iterations.

So far, we have introduced two examples to explain the first parameter, context_type, which controls the granularity of the data context. Depending on the task, we can generate data at different granularities. We changed context_type from Document to Sentence for sentence-level tasks, and we can further change it to Token for token-level tasks, as sketched below.
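
For instance, a minimal sketch of iterating token by token (not an executed cell; the output is omitted) could look like:

# A sketch: with context_type=Token, the generator yields one dictionary per token.
data_generator = data_pack.get_data(context_type=Token)
for token_idx, token_d in enumerate(data_generator):
    print(token_idx, token_d['context'])
    if token_idx >= 2:  # only print the first few tokens
        break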

Suppose we don’t want to analyze the first sentence in data_pack. The skip_k parameter skips the first k instances of context_type and starts generating data from the (k+1)-th instance. In this case, we want to start generating from the second instance, so we set skip_k to 1 to skip the first one.

[10]:
data_generator = data_pack.get_data(context_type=Sentence, skip_k=1)
for sent_idx, sent_d in enumerate(data_generator):
    print(sent_idx, sent_d['context'])
0 He admits he was trying to influence American policy on China .

We have introduced three “data types”, Document, Sentence, and Token. They are three common data entries for text analysis.

They are all subclasses of Annotation, the parent class for text data entries, and each of them records a text span, i.e. the range of data we explained above. However, such retrieval is usually not flexible enough for a real task.

Suppose we want to do part-of-speech tagging for each sentence. This means we need to tag the POS of each Token within each sentence, so we need both Token and Sentence data entries. Moreover, since we want to analyze POS sentence by sentence, the Token entries and their POS tags are best nested inside the retrieved Sentence data. As before, we set context_type to Sentence, and we introduce the request parameter, which supports retrieving Token entries and their POS tags within the scope of the Sentence context type.

See the example below for how to set requests, and for simplicity, we still skip the first sentence.

[11]:
requests = {
    Token: ["pos"],
}
data_generator = data_pack.get_data(context_type=Sentence, request=requests, skip_k=1)
for sent_idx, sent_d in enumerate(data_generator):
    print(sent_idx, sent_d['context'])
    print(sent_d['Token']['pos'])
    print("Token list length:", len(sent_d['Token']["text"]))
    print("POS list length:", len(sent_d['Token']['pos']))

0 He admits he was trying to influence American policy on China .
['PRP' 'VBZ' 'PRP' 'VBD' 'VBG' 'TO' 'VB' 'JJ' 'NN' 'IN' 'NNP' '.']
Token list length: 12
POS list length: 12

From the example, we can see that requests is a dictionary whose keys are Annotation-type data entries and whose values are the requested data entry attributes. The retrieved data dictionary sent_d now has the key 'Token', and sent_d['Token'] is a dictionary with a key 'pos'. These are exactly the data entries we requested.

Moreover, we should pay attention to the range of the Token data: each value of sent_d['Token'] is a list of data that all fall within one sentence, and the lists all have the same length since each list item corresponds to one Token.

See the example below for the disassembled data and their correspondence.

[12]:
data_generator = data_pack.get_data(context_type=Sentence, request=requests, skip_k=1)
for sent_idx, sent_d in enumerate(data_generator):
    print(sent_idx, sent_d['context'])
    for token_txt, token_pos in (zip(sent_d['Token']['text'], sent_d['Token']['pos'])):
        print(token_txt, token_pos)
0 He admits he was trying to influence American policy on China .
He PRP
admits VBZ
he PRP
was VBD
trying VBG
to TO
influence VB
American JJ
policy NN
on IN
China NNP
. .
[13]:
# initialize a token data dictionary
data_generator = data_pack.get_data(context_type=Token, skip_k=1)
token_d = next(data_generator)

print(doc_d.keys()) # document data dictionary
print(sent_d.keys()) # sentence data dictionary
print(token_d.keys()) # token data dictionary

dict_keys(['context', 'offset', 'tid'])
dict_keys(['context', 'offset', 'tid', 'Token'])
dict_keys(['context', 'offset', 'tid'])

Checking the dictionary keys for the document, sentence, and token data returned by the get_data method, we see up to four data fields. Except for the 'Token' field we requested earlier, the other three ('context', 'offset', 'tid') are returned by default.

A natural question arises: do those data classes have a parent class with the common attributes 'context', 'offset', and 'tid'? The answer is yes. We have the Annotation class, which represents generic text data.

* context: the data within the scope of the context type.
* offset: the index of the first character of the context in the text.
* tid: the id of the data entry instance.
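
As a quick, illustrative check (assuming sent_d from the sentence example above is still in scope), the offset lines up with the position of the context inside the DataPack's text:

# A sketch: 'offset' is where 'context' starts in data_pack.text (illustrative assumption).
offset = sent_d['offset']
context = sent_d['context']
print(data_pack.text[offset:offset + len(context)] == context)  # expected: True
print(sent_d['tid'])  # id of the underlying Sentence entry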

Below we will dive into the attributes of the Annotation class.

Annotation

In Forte, each annotation has an attribute span, which represents the begin and end of that annotation’s annotation-specific data. For the Annotation type, the range means the begin and end indices of the characters covered by the annotation in the text payload of the DataPack.

For a Token instance, which is a subtype of Annotation, its annotation-specific data is text, and therefore the range means the begin and end character indices of that Token instance. For a Recording instance, which is a subtype of AudioAnnotation, its annotation-specific data is audio, and the range means the begin and end sample indices of that Recording instance.
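
For example, a minimal sketch (reusing the data_pack built above) of inspecting a Token's span could look like:

# A sketch: a Token's span indexes into the DataPack's text payload.
token = next(data_pack.get(Token))
print(token.span.begin, token.span.end)
print(data_pack.text[token.span.begin:token.span.end] == token.text)  # expected: True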

As we are extending Forte’s capabilities to handle more modalities, we also have a parent class for audio data entries, AudioAnnotation.

AudioAnnotation

Based on the idea of “range”, in the example below, the entry AudioUtterance will be searched for in DataPack.audio_annotations, and the requested data field speaker will be included in the generator’s data.

For the AudioAnnotation type, the range means the begin and end indices of the sound samples covered by the annotation in the audio payload of the DataPack.

For example, if a user wants to get AudioAnnotation data from a DataPack instance pack, they can call the function as in the code below. It returns a generator that the user can iterate over. AudioAnnotation is passed into the method as the context_type parameter.
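
The snippet below is an illustrative sketch rather than an executed cell: it assumes pack is a DataPack produced by an audio pipeline (for example, one using AudioReader) that contains AudioUtterance entries with a speaker attribute.

# A sketch of retrieving audio data; `pack` is assumed to hold audio annotations.
audio_requests = {
    AudioUtterance: ["speaker"],
}
data_generator = pack.get_data(
    context_type=AudioAnnotation, request=audio_requests
)
for idx, audio_d in enumerate(data_generator):
    # 'context' holds the audio samples covered by each AudioAnnotation.
    print(idx, audio_d["AudioUtterance"]["speaker"])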

Build Coverage Index

DataPack.get() is commonly used to retrieve entries from a datapack. In some cases, we are only interested in getting entries from a specific range. DataPack.get() allows users to set range_annotation, which controls the search area of the sub-types. If DataPack.get() is called frequently with queries related to the range_annotation, you may consider building the coverage index regarding the related entry types. Users can call DataPack.build_coverage_for(context_type, covered_type) to create a mapping between a pair of entry types and target entries covered in ranges specified by outer entries.

For example, if you need to get all the Tokens from some Sentence, you can write your code as:

[14]:
# Iterate through all the sentences in the pack.
for sentence in data_pack.get(Sentence):
    # Take all tokens from a sentence
    token_entries = data_pack.get(
        entry_type=Token, range_annotation=sentence
    )

However, the snippet above may become a bottleneck if you have a lot of Sentence and Token entries inside the DataPack. To speed up this process, you can build a coverage index first:

[15]:
# Build coverage index between `Token` and `Sentence`
data_pack.build_coverage_for(
    context_type=Sentence,
    covered_type=Token
)

This DataPack.build_coverage_for(context_type, covered_type) function is able to build a mapping from context_type to covered_type, allowing faster retrieval of inner entries covered by outer entries inside the datapack. We also provide a function called DataPack.covers(context_entry, covered_entry) for coverage checking. It returns True if the span of covered_entry is covered by the span of context_entry.