Data

Ontology

base

class forte.data.span.Span(begin, end)[source]

A class recording the span of annotations. Span objects can be totally ordered according to their begin as the first sort key and end as the second sort key.

Parameters
  • begin (int) – The offset of the first character in the span.

  • end (int) – The offset of the last character in the span + 1. So the span is a left-closed and right-open interval [begin, end).
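The total ordering described above can be pictured with a small, self-contained sketch; this is a hypothetical stand-in for illustration, not the actual forte implementation:

```python
from functools import total_ordering

@total_ordering
class Span:
    """Minimal stand-in: a left-closed, right-open interval [begin, end)."""
    def __init__(self, begin: int, end: int):
        self.begin = begin
        self.end = end

    def __eq__(self, other):
        return (self.begin, self.end) == (other.begin, other.end)

    def __lt__(self, other):
        # begin is the first sort key, end is the second
        return (self.begin, self.end) < (other.begin, other.end)

spans = sorted([Span(5, 9), Span(0, 4), Span(0, 2)])
print([(s.begin, s.end) for s in spans])  # [(0, 2), (0, 4), (5, 9)]
```

Because the ordering is total, spans can be kept in sorted containers and compared directly.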

core

Entry

class forte.data.ontology.core.Entry(pack)[source]

The base class inherited by all NLP entries. This is the main data type for all in-text NLP analysis results. The main sub-types are Annotation, Link, Generics, and Group.

A forte.data.ontology.top.Annotation object represents a span in text.

A forte.data.ontology.top.Link object represents a binary link relation between two entries.

A forte.data.ontology.top.Generics object represents a generic entry that is not associated with a particular text span.

A forte.data.ontology.top.Group object represents a collection of multiple entries.

Main Attributes:

  • embedding: The embedding vectors (numpy array of floats) of this entry.

Parameters

pack (~ContainerType) – Each entry should be associated with one pack upon creation.

property embedding

Get the embedding vectors (numpy array of floats) of the entry.

property tid

Get the id of this entry.

Return type

int

Returns

id of the entry

property pack_id

Get the id of the pack that contains this entry.

Return type

int

Returns

id of the pack that contains this entry.

entry_type()[source]

Return the full name of this entry type.

Return type

str

abstract set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent (Entry) – The parent entry.

abstract set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child (Entry) – The child entry

abstract get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link from the given DataPack.

abstract get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link from the given DataPack.

class forte.data.ontology.core.BaseGroup(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members; no duplicates are allowed.

This is the BaseGroup interface. Specific member constraints are defined in the inherited classes.
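A minimal sketch of the member-set behavior described above, assuming members are tracked by their tids; the class and attribute names here are hypothetical, not the forte internals:

```python
class GroupSketch:
    """Illustrative group: members stored as a set of entry ids (tids),
    so adding the same entry twice has no effect."""
    def __init__(self):
        self._member_ids = set()

    def add_member(self, tid: int):
        self._member_ids.add(tid)

    def add_members(self, tids):
        for tid in tids:
            self.add_member(tid)

    def get_members(self):
        # return in a stable order for display
        return sorted(self._member_ids)

g = GroupSketch()
g.add_members([101, 102, 101])   # the duplicate 101 is ignored
print(g.get_members())           # [101, 102]
```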

abstract add_member(member)[source]

Add one entry to the group.

Parameters

member (~EntryType) – One member to be added to the group.

add_members(members)[source]

Add members to the group.

Parameters

members (Iterable[~EntryType]) – An iterator of members to be added to the group.

abstract get_members()[source]

Get the member entries in the group.

Return type

List[~EntryType]

Returns

Instances of Entry that are the members of the group.

top

class forte.data.ontology.top.Generics(pack)[source]
class forte.data.ontology.top.Annotation(pack, begin, end)[source]

Annotation type entries, such as “token”, “entity mention” and “sentence”. Each annotation has a Span corresponding to its offset in the text.

Parameters
  • pack (~PackType) – The container that this annotation will be added to.

  • begin (int) – The offset of the first character in the annotation.

  • end (int) – The offset of the last character in the annotation + 1.

get(entry_type, components=None, include_sub_type=True)[source]

This function wraps the get() method to find entries “covered” by this annotation. See that method for more information.

Example

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from each sentence created by NLTKTokenizer.
    token_entries = sentence.get(
        entry_type=Token,
        components='NLTKTokenizer')
    ...

In the above code snippet, we get entries of type Token within each sentence which were generated by NLTKTokenizer. You can consider building a coverage index between Token and Sentence if this snippet is used frequently.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]

class forte.data.ontology.top.AudioAnnotation(pack, begin, end)[source]

AudioAnnotation type entries, such as “recording” and “audio utterance”. Each audio annotation has a Span corresponding to its offset in the audio. Most methods in this class are the same as the ones in Annotation, except that it replaces property text with audio.

Parameters
  • pack (~PackType) – The container that this audio annotation will be added to.

  • begin (int) – The offset of the first sample in the audio annotation.

  • end (int) – The offset of the last sample in the audio annotation + 1.
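Because begin and end are sample offsets rather than character offsets, converting a time window into span offsets requires the sample rate. A small illustrative helper (not part of forte):

```python
def time_to_span(start_sec: float, end_sec: float, sample_rate: int):
    """Convert a time window in seconds to (begin, end) sample offsets,
    matching the left-closed, right-open convention [begin, end)."""
    begin = int(start_sec * sample_rate)
    end = int(end_sec * sample_rate)
    return begin, end

# A 0.5-second utterance starting at 1.0 s in 16 kHz audio:
print(time_to_span(1.0, 1.5, 16000))  # (16000, 24000)
```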

get(entry_type, components=None, include_sub_type=True)[source]

This function wraps the get() method to find entries “covered” by this audio annotation. See that method for more information. For usage details, refer to forte.data.ontology.top.Annotation.get().

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]

class forte.data.ontology.top.Link(pack, parent=None, child=None)[source]

Link type entries, such as “predicate link”. Each link has a parent node and a child node.

Parameters
  • pack (~PackType) – The container that this annotation will be added to.

  • parent (Optional[Entry]) – the parent entry of the link.

  • child (Optional[Entry]) – the child entry of the link.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent (Entry) – The parent entry.

set_child(child)[source]

This will set the child of the current instance with given Entry. The child is saved internally by its pack specific index key.

Parameters

child (Entry) – The child entry.

get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link.

class forte.data.ontology.top.Group(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members; no duplicates are allowed.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group. The update will be populated to the corresponding list in DataStore of self.pack.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group. The function will retrieve a list of the member entries’ tids from DataStore and convert them to entry objects on the fly.

Return type

List[Entry]

Returns

A list of instances of Entry that are the members of the group.

class forte.data.ontology.top.MultiPackGeneric(pack)[source]
class forte.data.ontology.top.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.

class forte.data.ontology.top.MultiPackLink(pack, parent=None, child=None)[source]

This is used to link entries in a MultiPack, which is designed to support cross-pack linking; this can support applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers: the entry’s tid, plus an additional index indicating which pack it comes from.
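The two-integer addressing can be sketched as follows. This is an illustrative stand-in: the real class resolves entries through the multi-pack rather than storing bare tuples, and the names below are hypothetical:

```python
class MultiPackLinkSketch:
    """Illustrative: parent and child stored as (pack_index, tid) tuples,
    so a link can point across the packs held in a multi-pack."""
    def __init__(self):
        self._parent = None
        self._child = None

    def set_parent(self, pack_index: int, tid: int):
        self._parent = (pack_index, tid)

    def set_child(self, pack_index: int, tid: int):
        self._child = (pack_index, tid)

    def parent_id(self):        # tid of the parent entry
        return self._parent[1]

    def parent_pack_id(self):   # which pack the parent lives in
        return self._parent[0]

    def child_id(self):
        return self._child[1]

    def child_pack_id(self):
        return self._child[0]

link = MultiPackLinkSketch()
link.set_parent(0, 42)   # entry 42 in the first pack
link.set_child(1, 7)     # entry 7 in the second pack
print(link.parent_pack_id(), link.parent_id())  # 0 42
```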

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

parent_id()[source]

Return the tid of the parent entry.

Return type

int

Returns

The tid of the parent entry.

child_id()[source]

Return the tid of the child entry.

Return type

int

Returns

The tid of the child entry.

parent_pack_id()[source]

Return the pack_id of the parent pack.

Return type

int

Returns

The pack_id of the parent pack.

child_pack_id()[source]

Return the pack_id of the child pack.

Return type

int

Returns

The pack_id of the child pack.

set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

Parameters

parent (Entry) – The parent of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

Parameters

child (Entry) – The child of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link.

class forte.data.ontology.top.Query(pack)[source]

An entry type representing queries for information retrieval tasks.

Parameters

pack (~PackType) – Data pack reference to which this query will be added

add_result(pid, score)[source]

Set the result score for a particular pack (based on the pack id).

Parameters
  • pid (str) – the pack id.

  • score (float) – the score for the pack

Returns

None

update_results(pid_to_score)[source]

Updates the results for this query.

Parameters

pid_to_score (Dict[str, float]) – A dict containing pack id -> score mapping
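The result bookkeeping described by add_result() and update_results() amounts to maintaining a pack-id-to-score mapping; a minimal sketch (hypothetical class, not the forte implementation):

```python
class QuerySketch:
    """Illustrative: a query keeps a pack-id -> score mapping of results."""
    def __init__(self, value: str):
        self.value = value
        self.results = {}

    def add_result(self, pid: str, score: float):
        # set the score for a single pack
        self.results[pid] = score

    def update_results(self, pid_to_score):
        # merge a whole batch of pack-id -> score pairs
        self.results.update(pid_to_score)

q = QuerySketch("where is forte")
q.add_result("pack_1", 0.9)
q.update_results({"pack_2": 0.4, "pack_3": 0.7})

# Rank candidate packs by retrieval score:
ranked = sorted(q.results, key=q.results.get, reverse=True)
print(ranked)  # ['pack_1', 'pack_3', 'pack_2']
```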

Packs

BasePack

class forte.data.base_pack.BasePack(pack_name=None)[source]

The base class of DataPack and MultiPack.

Parameters

pack_name (Optional[str]) – a string name of the pack.

delete_entry(entry)[source]

Remove the entry from the pack.

Parameters

entry (~EntryType) – The entry to be removed.

Returns

None

add_entry(entry, component_name=None)[source]

Add an Entry object to the BasePack object. Allow duplicate entries in a pack.

Parameters
  • entry (Union[Entry, int]) – An Entry object to be added to the pack.

  • component_name (Optional[str]) – A name to record that the entry is created by this component.

Return type

~EntryType

Returns

The input entry itself

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that have not been manually added to the pack.

Parameters

component (Optional[str]) – Overwrite the component record with this.

Returns

None

to_string(drop_record=False, json_method='jsonpickle', indent=None)[source]

Return the string representation (JSON encoded) of this pack.

Parameters
  • drop_record (Optional[bool]) – Whether to drop the creation records, default is False.

  • json_method (str) – What method is used to convert the data pack to JSON. Only supports jsonpickle for now. Default value is jsonpickle.

  • indent (Optional[int]) – The indent used for json string.

Returns: String representation of the data pack.

Return type

str

serialize(output_path, zip_pack=False, drop_record=False, serialize_method='jsonpickle', indent=None)[source]

Serializes the data pack to the provided path. The output of this function depends on the serialization method chosen.

Parameters
  • output_path (Union[str, Path]) – The path to write data to.

  • zip_pack (bool) – Whether to compress the result with gzip.

  • drop_record (bool) – Whether to drop the creation records, default is False.

  • serialize_method (str) – The method used to serialize the data. Currently supports jsonpickle (outputs str) and Python’s built-in pickle (outputs bytes).

  • indent (Optional[int]) – Whether to indent the file if written as JSON.

Returns: Results of serialization.

set_control_component(component)[source]

Record the current component that is taking control of this pack.

Parameters

component (str) – The component that is going to take control


record_field(entry_id, field_name)[source]

Record who modifies the entry; this will be called in Entry.

Parameters
  • entry_id (int) – The id of the entry.

  • field_name (str) – The name of the field modified.


on_entry_creation(entry, component_name=None)[source]

Call this when adding a new entry; it will be called in Entry when its __init__ function is called. This method performs the following two operations when creating a new entry:

  • All dataclass attributes of the entry to be created are stored in the class-level dictionary of Entry called cached_attributes_data. This is used to initialize the corresponding entry’s data store entry.

  • On creation of the data store entry, this method associates getter and setter properties with all dataclass attributes of this entry to allow direct interaction between the attributes of the entry and their copy stored in the data store. For example, the setter method updates the data store value of an attribute of a given entry whenever the attribute in the entry’s object is updated.

Parameters
  • entry (Entry) – The entry to be added.

  • component_name (Optional[str]) – A name to record that the entry is created by this component.

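The getter/setter association described above can be sketched with plain Python properties that write through to a shared store. All names below are hypothetical stand-ins for the real Entry/DataStore machinery:

```python
class DataStore:
    """Toy stand-in: raw entry attributes live here, keyed by tid."""
    def __init__(self):
        self._fields = {}

    def set_attr(self, tid, name, value):
        self._fields.setdefault(tid, {})[name] = value

    def get_attr(self, tid, name):
        return self._fields[tid][name]

def make_store_property(name):
    # The setter writes through to the store and the getter reads from it,
    # so the entry object and the store never go out of sync.
    def getter(self):
        return self._store.get_attr(self.tid, name)
    def setter(self, value):
        self._store.set_attr(self.tid, name, value)
    return property(getter, setter)

class TokenSketch:
    pos = make_store_property("pos")

    def __init__(self, store, tid):
        self._store = store
        self.tid = tid

store = DataStore()
tok = TokenSketch(store, tid=1)
tok.pos = "NOUN"                  # routed into the store
print(store.get_attr(1, "pos"))   # NOUN
```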

get_entry(tid)[source]

Look up the entry_index with tid. The specific implementation depends on the actual class.

Return type

~EntryType

get_entry_raw(tid)[source]

Retrieve the raw entry data in list format from DataStore.

Return type

List

abstract property links

A List container of all links in this data pack.

abstract property groups

A List container of all groups in this pack.

abstract get(entry_type, **kwargs)[source]

Implementations of this method should allow obtaining the entries in entry ordering. If there are orders defined between the entries, those should be used first. Otherwise, the insertion order should be used (FIFO).

Parameters

entry_type (Union[str, Type[~EntryType]]) – The type of the entry to obtain.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the provided arguments.

get_single(entry_type)[source]

Take a single entry of type entry_type from this data pack. This is useful when the target entry type appears only once in the DataPack, e.g., a Document entry, or when you just intend to take the first one.

Parameters

entry_type (Union[str, Type[~EntryType]]) – The entry type to be retrieved.

Return type

~EntryType

Returns

A single data entry.

get_ids_by_creator(component)[source]

Look up the component_index with key component. This will return the entry ids that are created by the component.

Parameters

component (str) – The component (creator) to find ids for.

Return type

Set[int]

Returns

A set of entry ids that are created by the component.

is_created_by(entry, components)[source]

Check if the entry is created by any of the provided components.

Parameters
  • entry (Entry) – The entry to check.

  • components (Union[str, List[str]]) – The component name(s) to check against.

Return type

bool

Returns

True if the entry is created by the component, False otherwise.

get_entries_from(component)[source]

Look up all entries from the component as an unordered set.

Parameters

component (str) – The component (creator) to get the entries. It is normally the full qualified name of the creator class, but it may also be customized based on the implementation.

Return type

Set[~EntryType]

Returns

The set of entries that are created by the input component.

get_ids_from(components)[source]

Look up entries using a list of components (creators). This will find each creator iteratively and combine the result.

Parameters

components (List[str]) – The list of components to find.

Return type

Set[int]

Returns

The set of entry ids that are created from these components.

get_entries_of(entry_type, exclude_sub_types=False)[source]

Return all entries of this particular type without orders. If you need to get the annotations based on the entry ordering, use forte.data.base_pack.BasePack.get().

Parameters
  • entry_type (Type[~EntryType]) – The type of the entry you are looking for.

  • exclude_sub_types – Whether to ignore the inherited sub types of the provided entry_type. Default is False.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the type constraint.

DataPack

class forte.data.data_pack.DataPack(pack_name=None)[source]

A DataPack contains a piece of natural language text and a collection of NLP entries (annotations, links, and groups). The natural language text could be a document, paragraph or in any other granularity.

Parameters

pack_name (Optional[str]) – A name for this data pack.

property text

Get the first text data stored in the DataPack. If there is no text payload in the DataPack, it will return an empty string.

Parameters

text_payload_index – the index of the text payload. Defaults to 0.

Raises

ValueError – raised when the index is out of bound of the text payload list.

Return type

str

Returns

text data in the text payload.

property audio

Return the audio data from the first audio payload in the DataPack.

property image

Return the image data from the first image payload in the data pack.

get_image(index)[source]

Return the image data from the image payload at the specified index.

Parameters

index (int) – image payload index for retrieving the image data.

Returns

image payload data at the specified index.

property all_annotations

An iterator of all annotations in this data pack.

Returns: Iterator of all annotations, of type Annotation.

Return type

Iterator[Annotation]

property num_annotations

Number of annotations in this data pack.

Returns: Number of annotations.

Return type

int

property all_links

An iterator of all links in this data pack.

Returns: Iterator of all links, of type Link.

Return type

Iterator[Link]

property num_links

Number of links in this data pack.

Returns: Number of links.

Return type

int

property all_groups

An iterator of all groups in this data pack.

Returns: Iterator of all groups, of type Group.

Return type

Iterator[Group]

property num_groups

Number of groups in this data pack.

Returns: Number of groups.

property all_generic_entries

An iterator of all generic entries in this data pack.

Returns: Iterator of all generic entries, of type Generics.

Return type

Iterator[Generics]

property num_generics_entries

Number of generics entries in this data pack.

Returns: Number of generics entries.

property all_audio_annotations

An iterator of all audio annotations in this data pack.

Returns: Iterator of all audio annotations, of type AudioAnnotation.

Return type

Iterator[AudioAnnotation]

property num_audio_annotations

Number of audio annotations in this data pack.

Returns: Number of audio annotations.

property annotations

A SortedList container of all annotations in this data pack.

Returns: SortedList of all annotations, of type Annotation.

property generics

A SortedList container of all generic entries in this data pack.

Returns: SortedList of all generic entries, of type Generics.

property audio_annotations

A SortedList container of all audio annotations in this data pack.

Returns: SortedList of all audio annotations, of type AudioAnnotation.

property links

A List container of all links in this data pack.

Returns: List of all links, of type Link.

property groups

A List container of all groups in this data pack.

Returns: List of all groups, of type Group.

get_payload_at(modality, payload_index)[source]

Get Payload of requested modality at the requested payload index.

Parameters
  • modality (IntEnum) – data modality among “text”, “audio”, “image”

  • payload_index (int) – the zero-based index of the Payload in this DataPack’s Payload entries of the requested modality.

Raises

ValueError – raised when the requested modality is not supported.

Returns

Payload entry containing text data, image or audio data.

get_payload_data_at(modality, payload_index)[source]

Get the data of the Payload of the requested modality at the requested payload index.

Parameters
  • modality (IntEnum) – data modality among “text”, “audio”, “image”

  • payload_index (int) – the zero-based index of the Payload in this DataPack’s Payload entries of the requested modality.

Raises

ValueError – raised when the requested modality is not supported.

Return type

Union[str, ndarray]

Returns

different data types for different data modalities.

  1. str data for text data.

  2. Numpy array for image and audio data.

get_span_text(begin, end, text_payload_index=0)[source]

Get the text in the data pack contained in the span.

Parameters
  • begin (int) – begin index to query.

  • end (int) – end index to query.

  • text_payload_index (int) – the zero-based index of the TextPayload in this DataPack’s TextPayload entries. Defaults to 0.

Return type

str

Returns

The text within this span.

get_span_audio(begin, end, audio_payload_index=0)[source]

Get the audio in the data pack contained in the span. begin and end represent the starting and ending indices of the span in audio payload respectively. Each index corresponds to one sample in audio time series.

Parameters
  • begin (int) – begin index to query.

  • end (int) – end index to query.

  • audio_payload_index – the zero-based index of the AudioPayload in this DataPack’s AudioPayload entries. Defaults to 0.

Return type

ndarray

Returns

The audio within this span.

add_text(text)[source]

Add a text payload to this data pack.

Parameters

text – Text to be added.

set_text(text, replace_func=None, text_payload_index=0)[source]

Set text for TextPayload at a specified index or add a new TextPayload in the DataPack.

Raises

ValueError – raised when the text payload index is out of range.

Parameters
  • text (str) – a str text.

  • replace_func (Optional[Callable[[str], List[Tuple[Span, str]]]]) – function that replace text. Defaults to None.

  • text_payload_index (int) – the zero-based index of the TextPayload in this DataPack’s TextPayload entries. If it’s 0, it adds a new TextPayload if there is no text payload in the data pack.

set_audio(audio, sample_rate, audio_payload_index=0)[source]

Set audio for AudioPayload at a specified index or add a new AudioPayload in the DataPack.

Raises

ValueError – raised when the audio payload index is out of range.

Parameters
  • audio (ndarray) – A numpy array storing the audio waveform.

  • sample_rate (int) – An integer specifying the sample rate.

  • audio_payload_index (int) – the zero-based index of the AudioPayload in this DataPack’s AudioPayload entries. Defaults to 0, and it adds a new audio payload if there is no audio payload in the data pack.

add_audio(audio)[source]

Add an AudioPayload storing the audio given in the parameters.

Parameters

audio – A numpy array storing the audio.

add_image(image)[source]

Add an ImagePayload storing the image given in the parameters.

Parameters

image – A numpy array storing the image.

set_image(image, image_payload_index=0)[source]

Set the image payload of the DataPack object.

Parameters
  • image – A numpy array storing the image.

  • image_payload_index (int) – the zero-based index of the ImagePayload in this DataPack’s ImagePayload entries. Defaults to 0.

get_original_text(text_payload_index=0)[source]

Get original unmodified text from the DataPack object.

Parameters

text_payload_index (int) – the zero-based index of the TextPayload in this DataPack’s entries. Defaults to 0.

Returns

Original text after applying the replace_back_operations of the DataPack object to the modified text.

get_original_span(input_processed_span, align_mode='relaxed')[source]

Function to obtain span of the original text that aligns with the given span of the processed text.

Parameters
  • input_processed_span (Span) – Span of the processed text for which the corresponding span of the original text is desired.

  • align_mode (str) –

    The strictness criteria for alignment in the ambiguous cases, that is, if a part of input_processed_span spans a part of the inserted span, then align_mode controls whether to use the span fully or ignore it completely according to the following possible values:

    • ”strict” - do not allow ambiguous input, give ValueError.

    • ”relaxed” - consider spans on both sides.

    • ”forward” - align looking forward, that is, ignore the span towards the left, but consider the span towards the right.

    • ”backward” - align looking backwards, that is, ignore the span towards the right, but consider the span towards the left.

Returns

Span of the original text that aligns with input_processed_span

Example

  • Let o-up1, o-up2, … and m-up1, m-up2, … denote the unprocessed spans of the original and modified string respectively. Note that each o-up would have a corresponding m-up of the same size.

  • Let o-pr1, o-pr2, … and m-pr1, m-pr2, … denote the processed spans of the original and modified string respectively. Note that each o-pr is modified to a corresponding m-pr that may be of a different size than o-pr.

  • Original string: <–o-up1–> <-o-pr1-> <—-o-up2—-> <—-o-pr2—-> <-o-up3->

  • Modified string: <–m-up1–> <—-m-pr1—-> <—-m-up2—-> <-m-pr2-> <-m-up3->

  • Note that self.inverse_original_spans that contains modified processed spans and their corresponding original spans, would look like - [(o-pr1, m-pr1), (o-pr2, m-pr2)]

>>> data_pack = DataPack()
>>> original_text = "He plays in the park"
>>> data_pack.set_text(original_text,
...                    lambda _: [(Span(0, 2), "She")])
>>> data_pack.text
'She plays in the park'
>>> input_processed_span = Span(0, len("She plays"))
>>> orig_span = data_pack.get_original_span(input_processed_span)
>>> data_pack.get_original_text()[orig_span.begin: orig_span.end]
'He plays'
classmethod deserialize(data_source, serialize_method='jsonpickle', zip_pack=False)[source]

Deserialize a Data Pack from a string. This internally calls the _deserialize() function from BasePack.

Parameters
  • data_source (Union[Path, str]) – The path storing data source.

  • serialize_method (str) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.

  • zip_pack (bool) – Boolean value indicating whether the input source is zipped.

Return type

DataPack

Returns

A data pack object deserialized from the string.

delete_entry(entry)[source]

Delete an Entry object from the DataPack. This finds the entry in the index and removes it from the index. Note that entries will only appear in the index if add_entry (or _add_entry_with_check) is called.

Please note that deleting an entry does not guarantee the deletion of the related entries.

Parameters

entry (~EntryType) – An Entry object to be deleted from the pack.

get_data(context_type, request=None, skip_k=0, payload_index=0)[source]

Fetch data from entries in the data_pack of type context_type. Data includes “span”, annotation-specific default data fields and specific data fields by “request”.

Annotation-specific data fields means:

  • “text” for Type[Annotation]

  • “audio” for Type[AudioAnnotation]

Currently, we do not support Groups and Generics in the request.

Example

requests = {
    base_ontology.Sentence:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
    },
}
pack.get_data(base_ontology.Sentence, requests)
Parameters
  • context_type (Union[str, Type[Annotation], Type[AudioAnnotation]]) –

    The granularity of the data context, which could be any Annotation or AudioAnnotation type. Behavior varies under different context_type values:

    • str type will be converted into either Annotation type or AudioAnnotation type.

    • Type[Annotation]: the default data field for getting context data is text. This function iterates all_annotations to search target entry data.

    • Type[AudioAnnotation]: the default data field for getting context data is audio which stores audio data in numpy arrays. This function iterates all_audio_annotations to search target entry data.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) –

    The entry types and fields the user wants to request. The keys of the request dict are the required entry types, and the value should be either:

    • a list of field names or

    • a dict which accepts three keys: “fields”, “component”, and “unit”.

      • By setting “fields” (list), users specify the requested fields of the entry. If “fields” is not specified, only the default fields will be returned.

      • By setting “component” (list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components.

      • By setting “unit” (string), users can specify a unit by which the annotations are indexed.

    Note that for all annotation types, “span” fields and annotation-specific data fields are returned by default.

    For all link types, “child” and “parent” fields are returned by default.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

  • payload_index (int) – the zero-based index of the Payload in this DataPack’s Payload entries of a particular modality. The modality is dependent on context_type. Defaults to 0.

Return type

Iterator[Dict[str, Any]]

Returns

A data generator, which generates one piece of data (a dict containing the required entries, fields, and context).

build_coverage_for(context_type, covered_type)[source]

Users can call this function to build a coverage index for specific types. The index provides an in-memory mapping from entries of context_type to the entries “covered” by them. See forte.data.data_pack.DataIndex for more details.

Parameters
  • context_type (Type[Annotation]) – The type of the outer (covering) entries.

  • covered_type (Type[~EntryType]) – The type of the entries covered by context_type.

covers(context_entry, covered_entry)[source]

Check if the covered_entry is covered by (i.e., within the span of) the context_entry.

See in_span() and in_audio_span() for the definition of in span.

Parameters
  • context_entry (Union[Annotation, AudioAnnotation]) – The context entry.

  • covered_entry (~EntryType) – The entry to be checked on whether it is in span of the context entry.

Returns (bool): True if in span.

Return type

bool

get(entry_type, range_annotation=None, components=None, include_sub_type=True)[source]

This function is used to get data from a data pack with various methods.

Depending on the provided arguments, the function will apply several different filters to the returned data.

The entry_type is mandatory, where all the entries matching this type will be returned. The sub-types of the provided entry type will be also returned if include_sub_type is set to True (which is the default behavior).

The range_annotation controls the search area of the sub-types. An entry E will be returned if in_span() or in_audio_span() returns True. If this function is called frequently with queries related to the range_annotation, please consider building the coverage index for the related entry types. Users can call build_coverage_for() to build a mapping between a pair of entry types, from outer entries to the target entries covered by their ranges.

The components list will filter the results by the component (i.e., the creator of the entry). If components is provided, only the entries created by one of the components will be returned.

Example

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from a sentence created by NLTKTokenizer.
    token_entries = input_pack.get(
        entry_type=Token,
        range_annotation=sentence,
        component='NLTKTokenizer')
    ...

In the above code snippet, we get entries of type Token within each sentence which were generated by NLTKTokenizer. You can consider building a coverage index between Token and Sentence if this snippet is used frequently:

# Build coverage index between `Token` and `Sentence`
input_pack.build_coverage_for(
    context_type=Sentence,
    covered_type=Token
)

After building the index from the snippet above, you will be able to retrieve the tokens covered by each sentence much faster.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • range_annotation (Union[Annotation, AudioAnnotation, None]) – The range of entries requested. If None, will return valid entries in the range of whole data pack.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type (bool) – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]

update(datapack)[source]

Update the attributes and properties of the current DataPack with another DataPack.

Parameters

datapack (DataPack) – The reference DataPack to update from.

BaseMeta

class forte.data.base_pack.BaseMeta(pack_name=None)[source]

Basic Meta information for both DataPack and MultiPack.

Parameters

pack_name (Optional[str]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

record

Initialized as a dictionary. This is not a required field. The keys of the record should be the entry types and the values should be attributes of those entry types. All the information will be used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.

Meta

class forte.data.data_pack.Meta(pack_name=None, language='eng', span_unit='character', sample_rate=None, info=None)[source]

Basic Meta information associated with each instance of DataPack.

Parameters
  • pack_name (Optional[str]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

  • language (str) – The language used by this data pack, default is English.

  • span_unit (str) – The unit used for interpreting the Span object of this data pack. Default is character.

  • sample_rate (Optional[int]) – An integer specifying the sample rate of audio payload. Default is None.

  • info (Optional[Dict[str, str]]) – Store additional string-based information that the user adds.

pack_name

storing the provided pack_name.

language

storing the provided language.

sample_rate

storing the provided sample_rate.

info

storing the provided info.

record

Initialized as a dictionary. This is not a required field. The keys of the record should be the entry types and the values should be attributes of those entry types. All the information will be used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.

DataIndex

class forte.data.data_pack.DataIndex[source]

A set of indexes used in DataPack, note that this class is used by the DataPack internally.

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. component_index, the index from each component to the entries generated by that component

  4. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  5. group_index, the index from group members to groups.

  6. _coverage_index, the index that maps from an annotation to the entries it covers. _coverage_index is a dict of dicts, where the key is a tuple of the outer entry type and the inner entry type. The outer entry type should be an annotation type. The value is a dict, where the key is the tid of the outer entry, and the value is a set of tids that are covered by the outer entry. We say an Annotation A covers an entry E if one of the following conditions is met: 1. E is of Annotation type, and E.begin >= A.begin, E.end <= A.end; 2. E is of Link type, and both E’s parent and child nodes are Annotations that are covered by A.
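The covering conditions above can be sketched in plain Python (toy classes, not Forte's actual implementation): an annotation A covers an annotation E when E's span lies inside A's span, and covers a link when it covers both the link's parent and child.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    begin: int
    end: int

@dataclass
class Link:
    parent: Annotation
    child: Annotation

def covers(a: Annotation, entry) -> bool:
    # Condition 1: an annotation is covered if its span lies inside A's span.
    if isinstance(entry, Annotation):
        return entry.begin >= a.begin and entry.end <= a.end
    # Condition 2: a link is covered if both its parent and child are covered.
    if isinstance(entry, Link):
        return covers(a, entry.parent) and covers(a, entry.child)
    return False

sent = Annotation(0, 20)
tok1, tok2 = Annotation(0, 5), Annotation(6, 10)
print(covers(sent, Link(tok1, tok2)))  # True
```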

coverage_index(outer_type, inner_type)[source]

Get the coverage index from outer_type to inner_type.

Parameters
  • outer_type (Type[Union[Annotation, AudioAnnotation]]) – The outer entry type of the coverage index.

  • inner_type (Type[~EntryType]) – The inner (covered) entry type of the coverage index.

Return type

Optional[Dict[int, Set[int]]]

Returns

If the coverage index does not exist, return None. Otherwise, return a dict.

get_covered(data_pack, context_annotation, inner_type)[source]

Get the entries covered by a certain context annotation.

Parameters
  • data_pack (DataPack) – The data pack to search in.

  • context_annotation (Union[Annotation, AudioAnnotation]) – The context annotation to search in.

  • inner_type (Type[~EntryType]) – The inner type to be searched for.

Return type

Set[int]

Returns

Entry ID of type inner_type that is covered by context_annotation.

build_coverage_index(data_pack, outer_type, inner_type)[source]

Build the coverage index from outer_type to inner_type.

Parameters
  • data_pack (DataPack) – The data pack to build coverage for.

  • outer_type (Type[Union[Annotation, AudioAnnotation]]) – an annotation or AudioAnnotation type.

  • inner_type (Type[~EntryType]) – an entry type, can be Annotation, Link, Group, AudioAnnotation.

have_overlap(entry1, entry2)[source]

Check whether the two annotations have overlap in span.

Parameters
  • entry1 (Union[Annotation, AudioAnnotation]) – The first annotation to check.

  • entry2 (Union[Annotation, AudioAnnotation]) – The second annotation to check.

Return type

bool
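A minimal sketch of the overlap test on half-open [begin, end) spans (plain Python, not the Forte implementation): two such spans overlap exactly when each begins before the other ends.

```python
def have_overlap(span1, span2):
    # span1 and span2 are (begin, end) pairs over half-open intervals [begin, end).
    b1, e1 = span1
    b2, e2 = span2
    return b1 < e2 and b2 < e1

print(have_overlap((0, 5), (3, 8)))  # True: [0, 5) and [3, 8) share [3, 5)
print(have_overlap((0, 5), (5, 8)))  # False: touching endpoints do not overlap
```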

in_span(inner_entry, span)[source]

Check whether the inner entry is within the given span. The criteria are as follows:

Annotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.

Link entries: if the parent and child of the link are both of Annotation type, this link will be considered in span if both parent and child are in_span() of the provided span. If either the parent or the child is not of type Annotation, this function will always return False.

Group entries: if the child type of the group is Annotation type, then the group will be considered in span if all the elements are in_span() of the provided span. If the child type is not Annotation type, this function will always return False.

Other entries (i.e., Generics and AudioAnnotation): they will not be considered in_span() of any spans. The function will always return False.

Parameters
  • inner_entry (Union[int, Entry]) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.

Return type

bool

Returns

True if the inner_entry is considered to be in span of the provided span.
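The group rule above can be sketched the same way (toy code with hypothetical names, not the Forte implementation): a group of annotations is in span only if every member is.

```python
def annotation_in_span(entry, span):
    # entry and span are (begin, end) pairs over half-open [begin, end) intervals.
    return entry[0] >= span[0] and entry[1] <= span[1]

def group_in_span(members, span):
    # A group is in span only when all of its annotation members are in span.
    return all(annotation_in_span(m, span) for m in members)

print(group_in_span([(2, 4), (5, 9)], (0, 10)))   # True
print(group_in_span([(2, 4), (9, 12)], (0, 10)))  # False
```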

in_audio_span(inner_entry, span)[source]

Check whether the inner entry is within the given audio span. This method is identical to in_span() except that it operates on the audio payload of the data pack. The criteria are as follows:

AudioAnnotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.

Link entries: if the parent and child of the link are both of AudioAnnotation type, this link will be considered in span if both parent and child are in_span() of the provided span. If either the parent or the child is not of type AudioAnnotation, this function will always return False.

Group entries: if the child type of the group is AudioAnnotation type, then the group will be considered in span if all the elements are in_span() of the provided span. If the child type is not AudioAnnotation type, this function will always return False.

Other entries (i.e., Generics and Annotation): they will not be considered in_span() of any spans. The function will always return False.

Parameters
  • inner_entry (Union[int, Entry]) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.

Return type

bool

Returns

True if the inner_entry is considered to be in span of the provided span.

MultiPack

MultiPackMeta

class forte.data.multi_pack.MultiPackMeta(pack_name=None)[source]

Meta information of a MultiPack.

MultiPack

class forte.data.multi_pack.MultiPack(pack_name=None)[source]

A MultiPack contains multiple DataPacks and a collection of cross-pack entries (such as links and groups).

relink(packs)[source]

Re-link the reference of the multi-pack to other entries, including the data packs in it.

Parameters

packs (Iterator[DataPack]) – a data pack iterator.

Returns

None

get_subentry(pack_idx, entry_id)[source]

Get sub_entry from the multi pack. This method uses pack_id (a unique identifier assigned to each data pack) to get a pack from the multi pack, and then returns its sub_entry with entry_id.

Note that this changed after v0.0.1: previously, pack_idx was used as a list index to access/reference a pack within the multi pack (and then get the sub_entry).

Parameters
  • pack_idx (int) – The pack_id for the data_pack in the multi pack.

  • entry_id (int) – the id for the entry from the pack with pack_id

Returns

The sub-entry with id entry_id from the pack whose pack_id equals pack_idx.
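The id-based lookup (as opposed to list indexing) can be pictured with a plain-Python sketch (toy structures and hypothetical names, not the Forte internals):

```python
# Each pack has a globally unique pack_id and its own table of entries by id.
packs_by_id = {
    101: {"entries": {7: "Sentence(0, 12)"}},
    205: {"entries": {3: "Token(0, 4)"}},
}

def get_subentry(pack_idx, entry_id):
    # pack_idx is the pack_id key, not a position in a list of packs.
    pack = packs_by_id[pack_idx]
    return pack["entries"][entry_id]

print(get_subentry(205, 3))  # Token(0, 4)
```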

remove_pack(index_of_pack, clean_invalid_entries=False, purge_lists=False)[source]

Remove a data pack at index index_of_pack from this multi pack.

In a multi pack, the data pack to be removed may be associated with some multi pack entries, such as MultiPackLinks that are connected with other packs. These entries would become dangling and invalid, so they need to be removed. One can consider removing these links before calling this function, or set clean_invalid_entries to True so that they will be automatically pruned. If purge_lists is set to True, the empty slots left in this multi pack's internal lists by the removed pack will also be purged; this changes the indices of the packs that follow the removed pack, so users are responsible for managing such changes if those indices are used or stored anywhere after the purge.

Parameters
  • index_of_pack (int) – The index of pack for removal from the multi pack. If invalid, no pack will be deleted.

  • clean_invalid_entries (bool) – Switch for automatically cleaning the entries associated with the data pack being deleted which will become invalid after the removal of the pack. Default is False.

  • purge_lists (bool) – Switch for automatically removing the empty slots left in this multi pack's lists by the removed pack. This changes the indices of the packs that follow the removed pack, so users are responsible for managing such changes if those indices are used or stored anywhere after the purge. Default is False.

Return type

bool

Returns

True if successful.

Raises

ValueError – if clean_invalid_entries is set to False and the DataPack to be removed has entries (in links, groups) associated with it.

purge_deleted_packs()[source]

Purge the placeholders (previously set to -1, empty, or None to keep indices unchanged) that deleted packs leave in the internal lists. Caution: purging removes the empty slots from the lists of this multi pack, which changes the indices of the packs that follow the deleted pack(s); users are responsible for managing such changes if those indices are used or stored anywhere in their code after purging.

Return type

bool

Returns

True if successful.
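The index bookkeeping described above can be sketched in plain Python (a toy list, not the Forte internals): removal leaves a placeholder so indices stay stable, and purging compacts the list, shifting the indices of later packs.

```python
packs = ["pack_a", "pack_b", "pack_c"]

# remove_pack without purging: leave a placeholder so indices stay stable.
packs[1] = None
assert packs.index("pack_c") == 2  # index unchanged

# purge_deleted_packs: drop placeholders; later packs shift left.
packs = [p for p in packs if p is not None]
print(packs.index("pack_c"))  # 1 -- callers must update any stored indices
```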

add_pack(ref_name=None, pack_name=None)[source]

Create a data pack and add it to this multi pack. If ref_name is provided, it will be used to index the data pack. Otherwise, a default name based on the pack id will be created for this data pack. The created data pack will be returned.

Parameters
  • ref_name (Optional[str]) – The pack name used to reference this data pack from the multi pack. If none, the reference name will not be set.

  • pack_name (Optional[str]) – The pack name of the data pack (itself). If none, the name will not be set.

Returns: The newly created data pack.

Return type

DataPack

add_pack_(pack, ref_name=None)[source]

Add an existing data pack to the multi pack.

Parameters
  • pack (DataPack) – The existing data pack.

  • ref_name (Optional[str]) – The name to be used in this multi pack.

Returns

None

get_pack_at(index)[source]

Get data pack at provided index.

Parameters

index (int) – The index of the pack.

Return type

DataPack

Returns

The pack at the index.

get_pack_index(pack_id)[source]

Get the pack index from the global pack id.

Parameters

pack_id (int) – The global pack id to find.

Return type

int

Returns

The index of the pack in this multi pack.

get_pack(name)[source]

Get data pack of name.

Parameters

name (str) – The name of the pack.

Return type

DataPack

Returns

The pack that has that name.

property packs

Get the list of data packs in the order they were added.

Return type

List[DataPack]

Returns

List of data packs contained in this multi-pack.

rename_pack(old_name, new_name)[source]

Rename the pack to a new name. If the new_name is already taken, a ValueError will be raised. If the old_name is not found, a KeyError will be raised, just as with a missing key in a dictionary.

Parameters
  • old_name (str) – The old name of the pack.

  • new_name (str) – The new name to be assigned for the pack.

Returns

None

property all_links

An iterator of all links in this multi pack.

Return type

Iterator[MultiPackLink]

Returns

Iterator of all links, of type MultiPackLink.

property num_links

Number of links in this multi pack.

Return type

int

Returns

Number of links.

property all_groups

An iterator of all groups in this multi pack.

Return type

Iterator[MultiPackGroup]

Returns

Iterator of all groups, of type MultiPackGroup.

property num_groups

Number of groups in this multi pack.

Return type

int

Returns

Number of groups.

property generic_entries

An iterator of all generics in this multi pack.

Return type

Iterator[MultiPackGeneric]

Returns

Iterator of all generics, of type MultiPackGeneric.

property links

A List container of all links in this multi pack.

Returns: List of all links, of type MultiPackLink.

property groups

A List container of all groups in this multi pack.

Returns: List of all groups, of type MultiPackGroup.

property generics

A SortedList container of all generic entries in this multi pack.

Returns: SortedList of generics

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that have not yet been manually added to the pack.

Parameters

component (Optional[str]) – Overwrite the component record with this.

Returns

None

get_single_pack_data(pack_index, context_type, request=None, skip_k=0)[source]

Get pack data from one of the packs, specified by pack_index. This is equivalent to calling get_data() on that DataPack.

Parameters
  • pack_index (int) – The index of a single pack.

  • context_type (Type[Annotation]) – The granularity of the data context, which could be any Annotation type.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) – The entry types and fields required. The keys of the dict are the required entry types, and each value should be either a list of field names or a dict. If the value is a dict, accepted items include “fields”, “component”, and “unit”. By setting “component” (a list), users can specify the components by which the entries are generated. If “component” is not specified, entries generated by all components will be returned. By setting “unit” (a string), users can specify a unit by which the annotations are indexed. Note that for all annotations, “text” and “span” fields are given by default; for all links, “child” and “parent” fields are given by default.

  • skip_k (int) – Will skip the first k instances and generate data from the (k + 1)-th instance.

Return type

Iterator[Dict[str, Any]]

Returns

A data generator, which generates one piece of data (a dict containing the required annotations and context).

get_cross_pack_data(request)[source]

Note

This function is not finished.

Get data via the links and groups across data packs. The keys could be MultiPack entries (i.e., MultiPackLink and MultiPackGroup). The values specify the detailed entry information to be retrieved. If a value is a list of field names, the returned results will contain all the specified fields.

One can also call this method with more constraints by providing a dictionary, which can contain the following keys:

  • “fields”, this specifies the attribute field names to be obtained

  • “unit”, this specifies the unit used to index the annotation

  • “component”, this specifies a constraint to take only the entries created by the specified component.

The data request logic is similar to that of get_data() function in DataPack, but applied on MultiPack entries.

Example:

requests = {
    MultiPackLink:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_cross_pack_data(requests)
Parameters

request (Dict[Type[Union[MultiPackLink, MultiPackGroup]], Union[Dict, List]]) – A dict containing the data request. The keys are the types to be requested, and the fields are the detailed constraints.

Returns

None

get(entry_type, components=None, include_sub_type=True)[source]

Get entries of entry_type from this multi pack.

Example:

for relation in pack.get(
        CrossDocEntityRelation,
        components="relation_creator"):
    print(relation.get_parent())

In the above code snippet, we get entries of type CrossDocEntityRelation which were generated by a component named relation_creator.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of the entries requested.

  • components (Union[str, List[str], None]) – The component generating the entries requested. If None, all valid entries generated by any component will be returned.

  • include_sub_type – whether to return the sub types of the queried entry_type. True by default.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the arguments, following the order of entries (first sort by entry comparison, then by insertion)

classmethod deserialize(data_path, serialize_method='jsonpickle', zip_pack=False)[source]

Deserialize a MultiPack from a string. Note that this will only deserialize the native multi pack content, which means the associated DataPacks contained in the MultiPack will not be recovered. A follow-up step needs to be performed to add the data packs back to the multi pack.

This internally calls the _deserialize() function from BasePack.

Parameters
  • data_path (Union[Path, str]) – The serialized string of a Multi pack to be deserialized.

  • serialize_method (str) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.

  • zip_pack (bool) – Boolean value indicating whether the input source is zipped.

Return type

MultiPack

Returns

A MultiPack object deserialized from the string.

MultiPackGroup

class forte.data.multi_pack.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.

Readers

BaseReader

class forte.data.base_reader.BaseReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic data reader class. To be inherited by all data readers.

Parameters
  • from_cache (bool) – Decide whether to read from cache if a cache file exists. By default (False), the reader will only read from the original file and use the cache file path for caching; it will not read from the cache_directory. If True, the reader will try to read a datapack from the caching file.

  • cache_directory (Optional[str]) –

    The base directory to place the path of the caching files. Each collection is contained in one cached file, under this directory. The cached location for each collection is computed by _cache_key_function().

    Note

    A collection is the data returned by _collect().

  • append_to_cache (bool) – Decide whether to append to the cache file if it already exists. By default (False), we will overwrite the existing caching file. If True, we will append the datapack to the end of the caching file.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and will register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are jsonpickle and pickle. Default is jsonpickle.

parse_pack(collection)[source]

Calls _parse_pack() to create packs from the collection. This internally sets up the component metadata. Users should implement the _parse_pack() method.

Return type

Iterator[~PackType]

text_replace_operation(text)[source]

Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

Parameters

text (str) – The original data text to be cleaned.

Return type

List[Tuple[Span, str]]

Returns

The replacement operations.
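
As an illustration of the (span, str) contract, here is a plain-Python sketch of how such replacement operations could be applied to a noisy text. This is not Forte's implementation, and apply_replace_ops is a hypothetical name:

```python
# A minimal sketch: apply ((begin, end), replacement) operations to a
# text, processing spans right to left so earlier offsets stay valid.
def apply_replace_ops(text, ops):
    # ops: list of ((begin, end), replacement) pairs over the original text
    for (begin, end), repl in sorted(ops, key=lambda o: o[0][0], reverse=True):
        text = text[:begin] + repl + text[end:]
    return text

# Replace a non-breaking space with a normal space and "!!" with ".".
cleaned = apply_replace_ops(
    "Hello\u00a0world!!", [((5, 6), " "), ((11, 13), ".")]
)
# → "Hello world."
```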

set_profiling(enable_profiling=True)[source]

Set profiling option.

Parameters

enable_profiling (bool) – A boolean of whether to enable profiling for the reader or not (the default is True).

timer_yield(pack)[source]

Wrapper generator for time profiling. Inserts timers around yield to support time profiling for the reader.

Parameters

pack (~PackType) – DataPack passed from self.iter()

iter(*args, **kwargs)[source]

An iterator over the entire dataset, yielding all the Packs read from the data source(s), either as a list or as an Iterator depending on lazy. If not reading from cache, this should call collect.

Parameters
  • args – One or more input data sources, for example, most DataPack readers accept data_source as file/folder path.

  • kwargs – Iterator of DataPacks.

Return type

Iterator[~PackType]

record(record_meta)[source]

Modify the pack meta record field of the reader’s output. The key of the record should be the entry type and the values should be attributes of the entry type. All the information will be used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.

Parameters

record_meta (Dict[str, Set[str]]) – the field in the datapack for type records that needs to be filled in for consistency checking.

cache_data(collection, pack, append)[source]

Cache the data pack under the cache directory.

After you call this method, the dataset reader will use its cache_directory to store a cache of BasePack read from every document passed to read, serialized as one string-formatted BasePack. If the cache file for a given file_path exists, we read the BasePack from the cache. If the cache file does not exist, we will create it on our first pass through the data.

Parameters
  • collection (Any) – The collection is a piece of data from the _collect() function, to be read to produce DataPack(s). During caching, a cache key is computed based on the data in this collection.

  • pack (~PackType) – The data pack to be cached.

  • append (bool) – Whether to allow appending to the cache.

read_from_cache(cache_filename)[source]

Reads one or more Packs from cache_filename, and yields Pack(s) from the cache file.

Parameters

cache_filename (Union[Path, str]) – Path to the cache file.

Return type

Iterator[~PackType]

Returns

List of cached data packs.

finish(resource)[source]

The pipeline will call this function at the end of the pipeline to notify all the components. The user can implement this function to release resources used by this component. The component can also add objects to the resources.

Parameters

resource (Resources) – A global resource registry.

set_text(pack, text)[source]

Assign the text value to the DataPack. This function will pass the text_replace_operation to the DataPack to conduct the pre-processing step.

Parameters
  • pack (DataPack) – The DataPack to assign value for.

  • text (str) – The original text to be recorded in this dataset.

PackReader

class forte.data.base_reader.PackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

A Pack Reader reads data into DataPack.

MultiPackReader

class forte.data.base_reader.MultiPackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic MultiPack data reader class. To be inherited by all data readers which return MultiPack.

CoNLL03Reader

ConllUDReader

class forte.data.readers.conllu_ud_reader.ConllUDReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

ConllUDReader is designed to read in the Universal Dependencies 2.4 dataset.

BaseDeserializeReader

RawDataDeserializeReader

RecursiveDirectoryDeserializeReader

HTMLReader

MSMarcoPassageReader

class forte.data.readers.ms_marco_passage_reader.MSMarcoPassageReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

MultiPackSentenceReader

MultiPackTerminalReader

OntonotesReader

PlainTextReader

ProdigyReader

RACEMultiChoiceQAReader

StringReader

SemEvalTask8Reader

OpenIEReader

SquadReader

class forte.datasets.mrc.squad_reader.SquadReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

Reader for processing Stanford Question Answering Dataset (SQuAD).

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span.

Dataset can be downloaded at https://rajpurkar.github.io/SQuAD-explorer/.

SquadReader reads each paragraph in the dataset as a separate Document, with the questions concatenated behind the paragraph to form a Passage. Phrases are MRC answers marked as text spans. Each MRCQuestion has a list of answers.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are jsonpickle and pickle. Default is jsonpickle.

record(record_meta)[source]

Method to add the output type record of SquadReader, which is ft.onto.base_ontology.Document, with an empty set to forte.data.data_pack.Meta.record.

Parameters

record_meta (Dict[str, Set[str]]) – the field in the datapack for type records that needs to be filled in for consistency checking.

ClassificationDatasetReader

Selector

Selector

class forte.data.selector.Selector[source]

DummySelector

class forte.data.selector.DummySelector[source]

Do nothing; return the data pack itself, which can be either a DataPack or a MultiPack.

SinglePackSelector

class forte.data.selector.SinglePackSelector[source]

This is the base class that selects a DataPack from a MultiPack.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be considered for selection.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value to indicate whether pack will be returned.
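
The selection protocol can be sketched in plain Python. This models a MultiPack as a simple dict and is only an illustration of the will_select contract, not Forte code; SuffixSelector is a hypothetical example:

```python
# Minimal sketch of the selector protocol: yield only the packs for which
# will_select returns True.
class SuffixSelector:
    def __init__(self, suffix):
        self.suffix = suffix

    def will_select(self, pack_name, pack, multi_pack):
        # Select packs whose name ends with the configured suffix.
        return pack_name.endswith(self.suffix)

    def select(self, multi_pack):
        # The MultiPack is modeled here as a dict of name -> pack.
        for name, pack in multi_pack.items():
            if self.will_select(name, pack, multi_pack):
                yield pack

packs = {"src_en": "english text", "tgt_de": "german text"}
selected = list(SuffixSelector("_en").select(packs))
# → ["english text"]
```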

NameMatchSelector

class forte.data.selector.NameMatchSelector(select_name=None)[source]

Select a DataPack from a MultiPack with specified name. This implementation takes special care for backward compatibility.

Deprecated:

selector = NameMatchSelector(select_name="foo")
selector = NameMatchSelector("foo")

Now:

selector = NameMatchSelector()
selector.initialize(
    configs={
        "select_name": "foo"
    }
)

WARNING: Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be considered for selection.

  • multi_pack (MultiPack) – The original multi pack.

Returns

A boolean value to indicate whether pack will be returned.

RegexNameMatchSelector

class forte.data.selector.RegexNameMatchSelector(select_name=None)[source]

Select a DataPack from a MultiPack using a regex.

This implementation takes special care for backward compatibility.

Deprecated:

selector = RegexNameMatchSelector(select_name="^.*\\d$")
selector = RegexNameMatchSelector("^.*\\d$")

Now:

selector = RegexNameMatchSelector()
selector.initialize(
    configs={
        "select_name": "^.*\\d$"
    }
)

Warning

Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be considered for selection.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value to indicate whether pack will be returned.

FirstPackSelector

class forte.data.selector.FirstPackSelector[source]

Select the first pack from the MultiPack and yield it.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be considered for selection.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value to indicate whether pack will be returned.

AllPackSelector

class forte.data.selector.AllPackSelector[source]

Select all the packs from MultiPack and yield them.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be considered for selection.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value to indicate whether pack will be returned.

Index

BaseIndex

class forte.data.index.BaseIndex[source]

A set of indexes used in BasePack:

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  4. group_index, the index from group members to groups.

update_basic_index(entries)[source]

Build or update the basic indexes, including

(1) entry_index, the index from each tid to the corresponding entry;

(2) type_index, the index from each type to the entries of that type;

(3) component_index, the index from each component to the entries generated by that component.

Parameters

entries (list) – a list of entries to be added into the basic index.
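
Conceptually, these basic indexes are dictionaries keyed by tid, type, and component. The following plain-Python sketch illustrates the idea only; it is not Forte's internal data structure:

```python
from collections import defaultdict

entry_index = {}                    # tid -> entry
type_index = defaultdict(set)       # type -> {tid, ...}
component_index = defaultdict(set)  # component -> {tid, ...}

def update_basic_index(entries):
    # Each entry is modeled as a dict carrying tid/type/component fields.
    for e in entries:
        entry_index[e["tid"]] = e
        type_index[e["type"]].add(e["tid"])
        component_index[e["component"]].add(e["tid"])

update_basic_index([
    {"tid": 1, "type": "Sentence", "component": "reader"},
    {"tid": 2, "type": "Token", "component": "tokenizer"},
])
# type_index["Token"] → {2}
```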

query_by_type_subtype(t)[source]

Look up the entry indices that are instances of entry_type, including children classes of entry_type.

Note

all the known types to this data pack will be scanned to find all sub-types. This method will try to cache the sub-type information after the first call, but the cached information could be invalidated by other operations (such as adding new items to the data pack).

Parameters

t (Type[~EntryType]) – The type of the entry you are looking for.

Return type

Set[int]

Returns

A set of entry ids. The entries are instances of entry_type ( and also includes instances of the subclasses of entry_type).

build_link_index(links)[source]

Build the link_index, the index from child and parent nodes to links, based on the links in the dataset.

link_index consists of two sub-indexes: “child_index” is the index from child nodes to their corresponding links, and “parent_index” is the index from parent nodes to their corresponding links.

build_group_index(groups)[source]

Build group_index, the index from group members to groups.

Returns

None

link_index(tid, as_parent=True)[source]

Look up the link_index with key tid. If the link index is not built, this will throw a PackIndexError.

Parameters
  • tid (int) – the tid of the entry being looked up.

  • as_parent (bool) – If as_parent is True, will look up link_index["parent_index"] and return the tids of links whose parent is tid. Otherwise, will look up link_index["child_index"] and return the tids of links whose child is tid.

Return type

Set[int]

group_index(tid)[source]

Look up the group_index with key tid. If the index is not built, this will raise a PackIndexError.

Return type

Set[int]

update_link_index(links)[source]

Update link_index with the provided links, the index from child and parent to links.

link_index consists of two sub-indexes:

  • “child_index” is the index from child nodes to their corresponding links

  • “parent_index” is the index from parent nodes to their corresponding links.

Parameters

links (List[~LinkType]) – a list of links to be added into the index.

update_group_index(groups)[source]

Build or update group_index, the index from group members to groups.

Parameters

groups (List[~GroupType]) – a list of groups to be added into the index.
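
The group index is conceptually an inverted mapping from a member's tid to the groups that contain it. A plain-Python sketch of the idea (not Forte's implementation):

```python
from collections import defaultdict

group_index = defaultdict(set)  # member tid -> {group tid, ...}

def update_group_index(groups):
    # Each group is modeled as a (group_tid, [member_tid, ...]) pair.
    for group_tid, members in groups:
        for member in members:
            group_index[member].add(group_tid)

update_group_index([(100, [1, 2]), (101, [2, 3])])
# group_index[2] → {100, 101}
```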

Store

BaseStore

class forte.data.base_store.BaseStore[source]

The base class which will be used by DataStore.

serialize(output_path, serialize_method='json', save_attribute=True, indent=None)[source]

Serializes the data store to the provided path. The output of this function depends on the serialization method chosen.

Parameters
  • output_path (str) – The path to write data to.

  • serialize_method (str) – The method used to serialize the data. Currently supports json (outputs json dictionary).

  • save_attribute (bool) – Boolean value indicating whether users want to save attributes for field checks later during deserialization. Attributes and their indices for every entry type will be saved.

  • indent (Optional[int]) – The indentation level to use if the output is written as JSON.

Returns: Results of serialization.

to_string(json_method='json', save_attribute=True, indent=None)[source]

Return the string representation (json encoded) of this data store.

Parameters
  • json_method (str) – What method is used to convert data pack to json. Only supports json for now. Default value is json.

  • save_attribute (bool) – Boolean value indicating whether users want to save attributes for field checks later during deserialization. Attributes and their indices for every entry type will be saved.

Returns: String representation of the data pack.

Return type

str

abstract add_entry_raw(type_name, tid=None, allow_duplicate=True, attribute_data=None)[source]

This function provides a general implementation to add all types of entries to the data store, namely Annotation, AudioAnnotation, ImageAnnotation, Link, Group and Generics. It returns the tid of the inserted entry.

Parameters
  • type_name (str) – The fully qualified type name of the new Entry.

  • tid (Optional[int]) – tid of the Entry that is being added. It’s optional, and it will be auto-assigned if not given.

  • allow_duplicate (bool) – Whether we allow duplicate in the DataStore. When it’s set to False, the function will return the tid of existing entry if a duplicate is found. Default value is True.

  • attribute_data (Optional[List]) – A list that stores attributes relevant to the entry being added. The attributes passed in attribute_data must be present in that entry’s type_attributes and must only be those which are relevant to the initialization of the entry. For example, the begin and end positions when creating an entry of type Annotation.

Return type

int

Returns

tid of the entry.

abstract all_entries(entry_type_name)[source]

Retrieve all entry data of entry type entry_type_name and entries of subclasses of entry type entry_type_name.

Parameters

entry_type_name (str) – the type name of entries that the user wants to retrieve.

Yields

Iterator of raw entry data in list format.

Return type

Iterator[List]

abstract num_entries(entry_type_name)[source]

Compute the number of entries of given entry_type_name and entries of subclasses of entry type entry_type_name.

Parameters

entry_type_name (str) – the type name of the entries to be counted.

Return type

int

Returns

The number of entries of given entry_type_name.

abstract set_attribute(tid, attr_name, attr_value)[source]

This function locates the entry data with tid and sets its attr_name with attr_value.

Parameters
  • tid (int) – Unique Id of the entry.

  • attr_name (str) – Name of the attribute.

  • attr_value (Any) – Value of the attribute.

abstract get_attribute(tid, attr_name)[source]

This function finds the value of attr_name in entry with tid.

Parameters
  • tid (int) – Unique id of the entry.

  • attr_name (str) – Name of the attribute.

Returns

The value of attr_name for the entry with tid.

abstract delete_entry(tid)[source]

This function removes the entry with tid from the data store.

Parameters

tid (int) – Unique id of the entry.

abstract get_entry(tid)[source]

Look up the tid_ref_dict or tid_idx_dict with key tid. Return the entry and its type_name.

Parameters

tid (int) – Unique id of the entry.

Return type

Tuple[List, str]

Returns

The entry which tid corresponds to and its type_name.

abstract get_entry_index(tid)[source]

Look up the tid_ref_dict or tid_idx_dict with key tid. Return the index_id of the entry.

Parameters

tid (int) – Unique id of the entry.

Return type

int

Returns

Index of the entry which tid corresponds to in the entry_type list.

abstract get(type_name, include_sub_type, range_span=None)[source]

This function fetches entries from the data store of type type_name.

Parameters
  • type_name (str) – The index of the list in self.__elements.

  • include_sub_type (bool) – A boolean to indicate whether get its subclass.

  • range_span (Optional[Tuple[int, int]]) – A tuple that contains the begin and end indices of the searching range of annotation-like entries.

Return type

Iterator[List]

Returns

An iterator of the entries matching the provided arguments.

abstract next_entry(tid)[source]

Get the next entry of the same type as the tid entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

The next entry of the same type as the tid entry.

abstract prev_entry(tid)[source]

Get the previous entry of the same type as the tid entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

The previous entry of the same type as the tid entry.

Data Store

DataStore

class forte.data.data_store.DataStore(onto_file_path=None, dynamically_add_type=True)[source]
classmethod deserialize(data_source, serialize_method='json', check_attribute=True, suppress_warning=True, accept_unknown_attribute=True)[source]

Deserialize a DataStore from serialized data in data_source.

Parameters
  • data_source (str) – The path storing data source.

  • serialize_method (str) – The method used to serialize the data, this should be the same as how serialization is done. The current option is json.

  • check_attribute (bool) – Boolean value indicating whether users want to check compatibility of attributes. Only applicable when the data being serialized is done with save_attribute set to True in BaseStore.serialize. If true, it will compare fields of the serialized object and the current DataStore class. If there are fields that have different orders in the current class and the serialized object, it switches the order of fields to match the current class. If there are fields that appear in the current class, but not in the serialized object, it handles those fields with accept_unknown_attribute. If there are fields that appear in the serialized object, but not in the current class, it drops those fields.

  • suppress_warning (bool) – Boolean value indicating whether users want to see warnings when it checks attributes. Only applicable when check_attribute is set to True. If true, it will log warnings when there are mismatched fields.

  • accept_unknown_attribute (bool) – Boolean value indicating whether users want to fill fields that appear in the current class, but not in the serialized object, with none. Only applicable when check_attribute is set to True. If false, it will raise a ValueError if there are any contradictions in fields.

Raises

ValueError – raised when 1. the serialized object has unknown fields, but accept_unknown_attribute is False. 2. the serialized object does not store attributes, but check_attribute is True. 3. the serialized object does not support json deserialization. We may change this error when we have other options for deserialization.

Return type

DataStore

Returns

A DataStore object deserialized from the string.

get_annotation_sorting_fn(type_name)[source]

This function creates a lambda function used to sort entries of the given type, which must be a subclass of Annotation. The lambda function sorts annotation-type entries based on their begin and end indices. It first fetches the positions where the begin and end indices are stored for the data store entry specified by type_name; these positions are then used to build the sorting function for entries of type type_name.

Parameters

type_name (str) – A string representing a fully qualified type name of the entry whose sorting function we want to fetch.

Returns

A lambda function representing the sorting function for entries of type type_name.
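
The generated function essentially sorts entries by their (begin, end) positions. The following plain-Python sketch shows the idea under the assumption that an entry is a list whose begin and end sit at known indices; it is not the actual DataStore code:

```python
# Sketch: given the index positions at which begin and end are stored in
# a data store entry (modeled as a list), build a (begin, end) sorter.
def make_sorting_fn(begin_idx, end_idx):
    return lambda entries: sorted(
        entries, key=lambda e: (e[begin_idx], e[end_idx])
    )

sort_fn = make_sorting_fn(0, 1)
ordered = sort_fn([[6, 10, "s2"], [0, 5, "s1"], [0, 3, "m1"]])
# → [[0, 3, "m1"], [0, 5, "s1"], [6, 10, "s2"]]
```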

fetch_entry_type_data(type_name, attributes=None)[source]

This function takes a fully qualified type_name class name and, optionally, a set of tuples each representing an attribute and its required type (only needed when the type_name class name represents an entry being added from a user-defined ontology). It creates a dictionary where each key is an attribute of the entry and each value is the type information of that attribute.

There are two cases in which a fully qualified type_name class name can be handled:

  1. If the class being added is an existing entry: information about this entry is available through its dataclass attributes and their respective types, so we use the _get_entry_attributes_by_class method to fetch it.

  2. If the class being added is a user-defined entry: we fetch the information about the entry’s attributes and their types from the attributes argument.

Parameters
  • type_name (str) – A fully qualified name of an entry class.

  • attributes (Optional[Set[Tuple[str, str]]]) –

    This argument is used when parsing ontology files. The entries in the set are tuples of two elements.

    attributes = {
        ('passage_id', 'str'),
        ('author', 'str')
    }

Returns: A dictionary with attributes as keys and type information as values. For each attribute, the type information is represented by a tuple of two elements: the first element is the unsubscripted version of the attribute’s type and the second element is the type arguments for the same. The type_dict is used to populate the type information for attributes of an entry specified by type_name in _type_attributes. For example,

type_dict = {
    "document_class": (list, (str,)),
    "sentiment": (dict, (str, float)),
    "classifications": (FDict, (str, Classification))
}

Return type

Dict[str, Tuple]

get_attr_type(type_name, attr_name)[source]

Retrieve the type information of a given attribute attr_name in an entry of type type_name.

Parameters
  • type_name (str) – The type name of the entry whose attribute type needs to be fetched.

  • attr_name (str) – The name of the attribute in the entry whose type information needs to be fetched.

Return type

Tuple[Any, Tuple]

Returns

The type information of the required attribute. This information is stored in the _type_attributes dictionary of the Data Store.

all_entries(entry_type_name)[source]

Retrieve all entry data of entry type entry_type_name and entries of subclasses of entry type entry_type_name.

Parameters

entry_type_name (str) – the type name of entries that the user wants to retrieve.

Yields

Iterator of raw entry data in list format.

Return type

Iterator[List]

num_entries(entry_type_name)[source]

Compute the number of entries of given entry_type_name and entries of subclasses of entry type entry_type_name.

Parameters

entry_type_name (str) – the type name of the entries to be counted.

Return type

int

Returns

The number of entries of given entry_type_name.

get_datastore_attr_idx(type_name, attr)[source]

This function returns the index at which a given attribute attr is stored in the Data Store entry of type type_name.

Parameters
  • type_name (str) – The fully qualified type name of the entry.

  • attr (str) – The name of the attribute whose index needs to be fetched.

Return type

int

Returns

An integer representing the attribute’s position in the Data Store entry.

initialize_and_validate_entry(entry, attribute_data)[source]

This function performs validation checks on the initial attributes added to a data store entry. It also modifies the values of certain attributes to fit the data store’s purpose of storing primitive types. For example, in the data store entry of type Group, the attribute member_type is converted from an object to a str. When initializing entries, this function makes certain assumptions based on the type of entry:

  • if the entry is of type Annotation or AudioAnnotation, we assume that attribute_data is a list of two elements, indicating the begin and end index of the annotation respectively.

  • if the entry is of type Group or MultiPackGroup, we assume that attribute_data is a list of one element representing the group’s member type.

  • if the entry is of type Link or MultiPackLink, we assume that attribute_data is a list of two elements representing the link’s parent and child type respectively.

Parameters

entry (list) – The initial version of the data store entry whose values need to be validated.

Return type

List

Returns

The list that represents the entry, with all its values validated and modified (if necessary).

add_entry_raw(type_name, tid=None, allow_duplicate=True, attribute_data=None)[source]

This function provides a general implementation to add all types of entries to the data store, namely Annotation, AudioAnnotation, ImageAnnotation, Link, Group and Generics. It returns the tid of the inserted entry.

Parameters
  • type_name (str) – The fully qualified type name of the new Entry.

  • tid (Optional[int]) – tid of the Entry that is being added. It’s optional, and it will be auto-assigned if not given.

  • allow_duplicate (bool) – Whether we allow duplicate in the DataStore. When it’s set to False, the function will return the tid of existing entry if a duplicate is found. Default value is True.

  • attribute_data (Optional[List]) – A list that stores attributes relevant to the entry being added. The attributes passed in attribute_data must be present in that entry’s type_attributes and must only be those which are relevant to the initialization of the entry. For example, the begin and end positions when creating an entry of type Annotation.

Return type

int

Returns

tid of the entry.
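
For annotation-like entries, the general shape of this operation can be sketched in plain Python, with entries modeled as lists kept sorted by (begin, end). This is an illustration of the idea only, not DataStore’s actual layout:

```python
import bisect

store = {"ft.onto.base_ontology.Sentence": []}  # type_name -> entry list
next_tid = [1]

def add_entry_raw(type_name, attribute_data):
    # attribute_data for an annotation-like entry: [begin, end].
    tid = next_tid[0]
    next_tid[0] += 1
    entry = [attribute_data[0], attribute_data[1], tid, type_name]
    # Keep the entry list sorted by (begin, end).
    bisect.insort(store[type_name], entry)
    return tid

add_entry_raw("ft.onto.base_ontology.Sentence", [6, 10])
add_entry_raw("ft.onto.base_ontology.Sentence", [0, 5])
# The list is now sorted: the span beginning at 0 comes first, then 6.
```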

set_attribute(tid, attr_name, attr_value)[source]

This function locates the entry data with tid and sets its attr_name with attr_value. It first finds attr_id according to attr_name. tid, attr_id, and attr_value are passed to set_attr().

Parameters
  • tid (int) – Unique Id of the entry.

  • attr_name (str) – Name of the attribute.

  • attr_value (Any) – Value of the attribute.

Raises

KeyError – when tid or attr_name is not found.

get_attribute(tid, attr_name)[source]

This function finds the value of attr_name in entry with tid. It locates the entry data with tid and finds attr_id of its attribute attr_name. tid and attr_id are passed to get_attr().

Parameters
  • tid (int) – Unique id of the entry.

  • attr_name (str) – Name of the attribute.

Return type

Any

Returns

The value of attr_name for the entry with tid.

Raises

KeyError – when tid or attr_name is not found.

delete_entry(tid)[source]

This function locates the entry data with tid and removes it from the data store. It removes the entry from tid_ref_dict or tid_idx_dict and finds its index in the list. If it is an annotation-like entry, we retrieve the entry from tid_ref_dict and bisect the list to find its index. If it is a non-annotation-like entry, we get the type_name and its index in the list directly from tid_idx_dict.

Parameters

tid (int) – Unique id of the entry.

Raises
  • KeyError – when entry with tid is not found.

  • RuntimeError – when internal storage is inconsistent.

get_entry(tid)[source]

This function finds the entry with tid. It returns the entry and its type_name.

Parameters

tid (int) – Unique id of the entry.

Return type

Tuple[List, str]

Returns

The entry which tid corresponds to and its type_name.

Raises
  • ValueError – An error occurred when input tid is not found.

  • KeyError – An error occurred when entry_type is not found.

get_entry_index(tid)[source]

Look up the tid_ref_dict and tid_idx_dict with key tid. Return the index_id of the entry.

Parameters

tid (int) – Unique id of the entry.

Return type

int

Returns

Index of the entry which tid corresponds to in the entry_type list.

Raises

ValueError – An error occurred when no corresponding entry is found.

get_length(type_name)[source]

This function finds the length of the type_name entry list. It does not count the None placeholders that appear in non-annotation-like entry lists.

Parameters

type_name (str) – The fully qualified type name of a type.

Return type

int

Returns

The count of not None entries.

co_iterator_annotation_like(type_names, range_span=None)[source]

Given two or more type names, iterate their entry lists from beginning to end together.

For each type, the entry list is sorted by the begin and end fields. The co_iterator_annotation_like function iterates those sorted lists together and yields each entry in sorted order. This task is quite similar to merging several sorted lists into one sorted list. Internally, a min-heap determines the order of yielded items, based on:

  • start index of the entry.

  • end index of the entry.

  • the index of the entry type name in input parameter type_names.

The precedence of those values indicates their priority in the min heap ordering.

Lastly, the range_span argument determines the start and end positions of the span range within which entries of the types in type_names are fetched. If two entries have the same begin and end fields, their order is decided by the order of the user-input type_names (the type that appears first in the target type list is returned first). For entries with the exact same begin, end, and type_name, the order is determined arbitrarily.

For example, let’s say we have two entry types, Sentence and EntityMention, with two entries each. The two Sentence entries have spans (0,5) and (6,10). Similarly, the two EntityMention entries have spans (0,5) and (15,20).

# Iterate Sentence and EntityMention entries together
# within the span range (0, 12).
entries = list(
    co_iterator_annotation_like(
        type_names=[
            "ft.onto.base_ontology.Sentence",
            "ft.onto.base_ontology.EntityMention"
        ],
        range_span=(0, 12)
    )
)

# Collect (type name, begin, end) for each fetched entry.
result = [
    [type(anno).__name__, anno.begin, anno.end]
    for anno in entries
]

# resulting value
result = [
    ['Sentence', 0, 5],
    ['EntityMention', 0, 5],
    ['Sentence', 6, 10]
]

From this we can see how range_span affects which entries will be fetched and also how the function chooses the order in which entries are fetched.
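The min-heap merge described above can be sketched in plain Python with heapq.merge, ordering by (begin, end, position of the type name in type_names). The function and the tuple-based entry representation below are simplified assumptions, not Forte's internal list format.

```python
import heapq

def co_iterate(entry_lists, type_names, range_span=None):
    """Merge per-type sorted entry lists into one sorted stream.

    entry_lists: dict mapping type name -> list of (begin, end) spans,
    each list already sorted by (begin, end).
    """
    keyed = []
    for rank, name in enumerate(type_names):
        # Sort key: begin, then end, then the position of the type
        # name in the input type_names list.
        keyed.append(
            [(begin, end, rank, name) for begin, end in entry_lists[name]]
        )
    for begin, end, _rank, name in heapq.merge(*keyed):
        if range_span is not None:
            lo, hi = range_span
            if not (lo <= begin and end <= hi):
                continue
        yield name, begin, end
```

Running this on the Sentence/EntityMention data above reproduces the ordering shown: the Sentence entry at (0,5) is yielded before the EntityMention entry at (0,5) because Sentence appears first in type_names.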

Parameters
  • type_names (List[str]) – a list of string type names

  • range_span (Optional[Tuple[int, int]]) – a tuple that indicates the start and end index of the range in which we want to get required entries

Return type

Iterator[List]

Returns

An iterator of entry elements.

get(type_name, include_sub_type=True, range_span=None)[source]

This function fetches entries of type type_name from the data store. If include_sub_type is set to True and type_name is one of [Annotation, Group, Link], this function also fetches entries of the subtypes of type_name. Otherwise, it only fetches entries of type type_name.

Parameters
  • type_name (str) – The fully qualified name of the entry.

  • include_sub_type (bool) – A boolean indicating whether to also fetch entries of the subclasses of type_name.

  • range_span (Optional[Tuple[int, int]]) – A tuple that contains the begin and end indices of the searching range of entries.

Return type

Iterator[List]

Returns

An iterator of the entries matching the provided arguments.
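The include_sub_type behavior can be sketched with issubclass over a toy class-keyed store. The classes and the store layout here are illustrative assumptions, not the DataStore's actual type registry.

```python
# Toy entry type hierarchy standing in for Forte's ontology classes.
class Annotation: ...
class Sentence(Annotation): ...
class EntityMention(Annotation): ...

def get(store, query_type, include_sub_type=True):
    """Yield entries whose type matches query_type.

    store: dict mapping an entry class to its list of entries.
    With include_sub_type=True, entries of subclasses also match.
    """
    for entry_type, entries in store.items():
        if entry_type is query_type or (
            include_sub_type and issubclass(entry_type, query_type)
        ):
            yield from entries
```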

iter(type_name)[source]

This function iterates over all type_name entries, skipping the None placeholders that appear in non-annotation-like entry lists.

Parameters

type_name (str) – The fully qualified type name of a type.

Return type

Iterator[List]

Returns

An iterator of the entries.
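Both iter() and get_length() skip the None placeholders that deletions leave in non-annotation-like entry lists. A minimal sketch of that shared behavior, assuming a plain Python list as the entry list:

```python
def iter_entries(entry_list):
    """Yield entries, skipping None placeholders left by deletions."""
    for entry in entry_list:
        if entry is not None:
            yield entry

def get_length(entry_list):
    """Count entries that are not None placeholders."""
    return sum(1 for entry in entry_list if entry is not None)
```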

next_entry(tid)[source]

Get the next entry of the same type as the tid entry. This calls get_entry() to find the current index and uses it to locate the next entry. For a non-annotation type, entries are kept in insertion order, so next_entry returns the next inserted entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

A list of attributes representing the next entry of the same type as the tid entry, or None if the tid entry is the last element in the entry list.

Raises

IndexError – An error occurred when accessing an index outside the entry list.

prev_entry(tid)[source]

Get the previous entry of the same type as the tid entry. This calls get_entry() to find the current index and uses it to locate the previous entry. For a non-annotation type, entries are kept in insertion order, so prev_entry returns the previous inserted entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

A list of attributes representing the previous entry of the same type as the tid entry, or None if the tid entry is the first element in the entry list.

Raises

IndexError – An error occurred when accessing an index outside the entry list.
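The index-based navigation of next_entry and prev_entry can be sketched on a plain list. The functions below are a toy model working directly on an index rather than a tid, which is a simplification of the documented behavior.

```python
def next_entry(entry_list, index):
    """Return the entry after `index`, or None at the end of the list."""
    if not 0 <= index < len(entry_list):
        raise IndexError(f"index {index} out of range")
    return entry_list[index + 1] if index + 1 < len(entry_list) else None

def prev_entry(entry_list, index):
    """Return the entry before `index`, or None at the start of the list."""
    if not 0 <= index < len(entry_list):
        raise IndexError(f"index {index} out of range")
    return entry_list[index - 1] if index > 0 else None
```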

DataPack Dataset

DataPackIterator

class forte.data.data_pack_dataset.DataPackIterator(pack_iterator, context_type, request=None, skip_k=0)[source]

An iterator generating data examples from a stream of data packs.

Parameters
  • pack_iterator (Iterator[DataPack]) – An iterator of DataPack.

  • context_type (Type[Annotation]) – The granularity of a single example which could be any Annotation type. For example, it can be Sentence, then each training example will represent the information of a sentence.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) – The request of type Dict sent to DataPack to query specific data.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Returns

An Iterator that each time produces a Tuple of a tid (of type int) and a data pack (of type DataPack).

Here is an example usage:

file_path: str = "data_samples/data_pack_dataset_test"
reader = CoNLL03Reader()
context_type = Sentence
request = {Sentence: []}
skip_k = 0

train_pl: Pipeline = Pipeline()
train_pl.set_reader(reader)
train_pl.initialize()
pack_iterator: Iterator[PackType] = (
    train_pl.process_dataset(file_path)
)

iterator: DataPackIterator = DataPackIterator(pack_iterator,
                                              context_type,
                                              request,
                                              skip_k)

for tid, data_pack in iterator:
    ...  # process tid and data_pack

Note

For parameters context_type, request, skip_k, please refer to get_data() in DataPack.

DataPackDataset

class forte.data.data_pack_dataset.DataPackDataset(data_source, feature_schemes, hparams=None, device=None)[source]

A dataset representing data packs. Iterating a DataIterator over this DataPackDataset produces batches of examples parsed by a reader from the given data packs.

process(raw_example)[source]

Given an input which is a single data example, extract features from it.

Parameters

raw_example (tuple(dict, DataPack)) –

A Tuple where

Return type

Dict[str, Feature]

Returns

A Dict mapping from user-specified tags to the Feature extracted.

Note

Please refer to request() for details about user-specified tags.

collate(examples)[source]

Given a batch of output from process(), produce pre-processed data as well as masks and features.

Parameters

examples (List[Dict[str, Feature]]) – A List of result from process().

Return type

Batch

Returns

A Texar-PyTorch Batch. It can be treated as a Dict with the following structure:

  • "data": List, np.ndarray, or torch.Tensor. The pre-processed data.

    Please refer to Converter for details.

  • "masks": np.ndarray or torch.Tensor. All the masks for the pre-processed data.

    Please refer to Converter for details.

  • "features": List[Feature]. A List of Feature. This is useful when users want to do customized pre-processing.

    Please refer to Feature for details.

{
    "tag_a": {
        "data": <tensor>,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    },
    "tag_b": {
        "data": Tensor,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    }
}

Note

The first level key in returned batch is the user-specified tags. Please refer to request() for details about user-specified tags.

RawExample

forte.data.data_pack_dataset.RawExample

alias of Tuple[int, forte.data.data_pack.DataPack]

FeatureCollection

forte.data.data_pack_dataset.FeatureCollection

alias of Dict[str, forte.data.converter.feature.Feature]

Batchers

ProcessingBatcher

class forte.data.batchers.ProcessingBatcher[source]

This defines the basic interface of the batcher used in BaseBatchProcessor. This batcher only batches data sequentially. It receives new packs dynamically and caches the current packs so that the processors can write prediction results back into the data packs.

initialize(config)[source]

The implementation should initialize the batcher and set up its internal states. This function will be called at the pipeline initialize stage.

Returns

None

flush()[source]

Flush the remaining data.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

A triplet containing the data pack, context instance, and batched data.

Note

For backward compatibility, this function returns a list of None contexts.

get_batch(input_pack)[source]

By feeding a data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of data pack, context instance, and batched data.

Parameters

input_pack (~PackType) – The input data pack to get features from.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

An iterator of tuples containing the data pack, context instance, and batched data.

Note

For backward compatibility, this function returns a list of None as contexts.

classmethod default_configs()[source]

Define the basic configuration of a batcher. Implementations of the batcher can extend this function to include more configurable parameters but must keep the existing ones defined in this base class.

Here, the available parameters are:

  • use_coverage_index: A boolean indicating whether the batcher will try to build the coverage index based on the data request. Default is True.

  • cross_pack: A boolean indicating whether the batcher can cross data pack boundaries when there is not enough data to fill the batch.

Return type

Dict[str, Any]

Returns

The default configuration.
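The "extend but keep the base keys" contract of default_configs can be sketched with super(). The class names and the added batch_size key below are illustrative, not the actual Forte classes.

```python
class ToyBatcher:
    @classmethod
    def default_configs(cls):
        # Base configuration shared by all batchers in this sketch.
        return {"use_coverage_index": True, "cross_pack": True}

class ToyFixedSizeBatcher(ToyBatcher):
    @classmethod
    def default_configs(cls):
        # Extend the base configuration; keep the existing keys intact.
        config = super().default_configs()
        config.update({"batch_size": 10})
        return config
```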

FixedSizeDataPackBatcherWithExtractor

class forte.data.batchers.FixedSizeDataPackBatcherWithExtractor[source]

This batcher uses extractors to extract features from the dataset and groups them into batches. This class adds two more pools: instance_pool, which records the instances from which features are extracted, and feature_pool, which records features before they can be yielded in batches.

initialize(config)[source]

The implementation should initialize the batcher and set up its internal states. This function will be called at the pipeline initialize stage.

Returns

None

add_feature_scheme(tag, scheme)[source]

Add feature scheme to the batcher.

Parameters
  • tag (str) – The name/tag of the scheme.

  • scheme (Dict[str, Any]) – The scheme content, which should be a dict containing the extractor and converter used to create features.

collate(features_collection)[source]

This function uses the Converter interface to turn a list of features into batches, where each feature is converted to a tensor/matrix format. The resulting features are organized as a dictionary, where the keys are the feature names/tags and the values are the converted features. Each feature contains the data and mask in MatrixLike form, as well as the original raw features.

Parameters

features_collection (List[Dict[str, Feature]]) – A list of features.

Return type

Dict[str, Dict[str, Any]]

Returns

An instance of Dict[str, Union[Tensor, Dict]], which is a batch of features.

flush()[source]

Flush data in batches. Each return value contains a tuple of 3 items: the corresponding data pack, the list of annotation objects that represent the context type, and the features.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict[str, Dict[str, Any]]]]

get_batch(input_pack)[source]

By feeding a data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of data pack, context instance, and batched data.

Parameters

input_pack (~PackType) – The input data pack to get features from.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

An iterator of tuples containing the data pack, context instance, and batched data.

classmethod default_configs()[source]

Defines the configuration of this batcher, here:

  • context_type: The context scope to extract data from. It can be an annotation class or a string that is the fully qualified name of the annotation class.

  • feature_scheme: A dictionary of (extractor name, extractor) entries that can be used to extract features.

  • batch_size: The batch size; default is 10.

Return type

Dict[str, Any]

Returns

The default configuration structure.

FixedSizeRequestDataPackBatcher

class forte.data.batchers.FixedSizeRequestDataPackBatcher[source]
initialize(config)[source]

The implementation should initialize the batcher and set up its internal states. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

The configuration of a batcher.

Here:

  • context_type (str): The fully qualified name of an Annotation type, which will be used as the context to retrieve data from. For example, if a ft.onto.Sentence type is provided, then it will extract data within each sentence.

  • requests: The request detail. See get_data() on what a request looks like.

Return type

Dict

Returns

The default configuration structure and default value.

FixedSizeMultiPackProcessingBatcher

class forte.data.batchers.FixedSizeMultiPackProcessingBatcher[source]

A Batcher used in MultiPackBatchProcessor.

Note

This implementation is not finished.

The Batcher calls the ProcessingBatcher inherently on each specified data pack in the MultiPack.

Since there are many ways to query a MultiPack, we delegate the task to subclasses, such as:

  • query all packs with the same context and input_info.

  • query different packs with different context and input_info.

Since the batcher saves the data_pack_pool on the fly, it is not trivial to batch and slice multiple data packs at the same time.

initialize(config)[source]

The implementation should initialize the batcher and set up its internal states. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

Define the basic configuration of a batcher. Implementations of the batcher can extend this function to include more configurable parameters but must keep the existing ones defined in this base class.

Here, the available parameters are:

  • use_coverage_index: A boolean indicating whether the batcher will try to build the coverage index based on the data request. Default is True.

  • cross_pack: A boolean indicating whether the batcher can cross data pack boundaries when there is not enough data to fill the batch.

Return type

Dict

Returns

The default configuration.

FixedSizeDataPackBatcher

class forte.data.batchers.FixedSizeDataPackBatcher[source]
initialize(config)[source]

The implementation should initialize the batcher and set up its internal states. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

The configuration of a batcher.

Here:

  • batch_size: The batch size; default is 10.

Return type

Dict

Returns

The default configuration structure and default value.

Caster

Caster

class forte.data.caster.Caster[source]

MultiPackBoxer

class forte.data.caster.MultiPackBoxer[source]

This class creates a MultiPack from a DataPack. The resulting MultiPack will contain only the original DataPack, indexed by the pack_name.

cast(pack)[source]

Auto-box the DataPack into a MultiPack by simple wrapping.

Parameters

pack (DataPack) – The DataPack to be boxed

Return type

MultiPack

Returns

The boxed MultiPack.

classmethod default_configs()[source]

Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.

MultiPackUnboxer

class forte.data.caster.MultiPackUnboxer[source]

This passes on a single DataPack within the MultiPack.

cast(pack)[source]

Unbox the MultiPack into a DataPack, using pack_index to take the unique pack.

Parameters

pack (MultiPack) – The MultiPack to be unboxed.

Return type

DataPack

Returns

A DataPack taken from the MultiPack.

classmethod default_configs()[source]

Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.

Container

EntryContainer

class forte.data.container.EntryContainer[source]

Types

ReplaceOperationsType

forte.data.types.ReplaceOperationsType

alias of List[Tuple[forte.data.span.Span, str]]

DataRequest

forte.data.types.DataRequest

alias of Dict[Type[forte.data.ontology.core.Entry], Union[Dict, List]]

MatrixLike

forte.data.types.MatrixLike

alias of Union[torch._C.TensorType, numpy.ndarray, List]

Data Utilities

maybe_download

forte.data.data_utils.maybe_download(urls: List[str], path: Union[str, PathLike], filenames: Optional[List[str]] = None, extract: bool = False, num_gdrive_retries: int = 1) → List[str][source]
forte.data.data_utils.maybe_download(urls: str, path: Union[str, PathLike], filenames: Optional[str] = None, extract: bool = False, num_gdrive_retries: int = 1) → str

Downloads a set of files.

Parameters
  • urls (Union[List[str], str]) – A (list of) URLs to download files.

  • path (Union[str, ~PathLike]) – The destination path to save the files.

  • filenames (Union[List[str], str, None]) – A (list of) strings of the file names. If given, must have the same length as urls. If None, filenames are extracted from the urls.

  • extract (bool) – Whether to extract compressed files.

  • num_gdrive_retries (int) – An integer specifying the number of attempts to download file from Google Drive. Default value is 1.

Returns

A list of paths to the downloaded files.
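When filenames is None, the names are extracted from the URLs. The fallback can be sketched as below; this is a simplified stand-in, not the actual implementation (which, among other things, handles Google Drive links specially).

```python
from urllib.parse import urlparse
import posixpath

def filenames_from_urls(urls):
    """Derive a file name from each URL's path component (simplified)."""
    if isinstance(urls, str):
        urls = [urls]
    # basename of the URL path, e.g. ".../data/train.zip" -> "train.zip"
    return [posixpath.basename(urlparse(url).path) for url in urls]
```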

batch_instances

forte.data.data_utils_io.batch_instances(instances)[source]

Merge a list of instances.

merge_batches

forte.data.data_utils_io.merge_batches(batches)[source]

Merge a list of batches.

slice_batch

forte.data.data_utils_io.slice_batch(batch, start, length)[source]

Return a sliced batch of size length from start in batch.
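Assuming a batch is a dict of parallel lists (one per field; this layout is an assumption, as the real format comes from the reader), slicing can be sketched as:

```python
def slice_batch(batch, start, length):
    """Return a sub-batch of `length` items starting at `start`.

    Assumes `batch` is a dict of parallel lists (one per field).
    """
    return {key: values[start:start + length] for key, values in batch.items()}
```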

dataset_path_iterator

forte.data.data_utils_io.dataset_path_iterator(dir_path, file_extension)[source]

An iterator returning the file paths in a directory containing files of the given datasets.

Return type

Iterator[str]
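A directory walk filtered by extension can be sketched as below; this is a simplified stand-in for the documented behavior, not Forte's implementation.

```python
import os
from typing import Iterator

def dataset_path_iterator(dir_path: str, file_extension: str) -> Iterator[str]:
    """Yield paths of files under dir_path ending with file_extension."""
    for root, _dirs, files in os.walk(dir_path):
        for name in sorted(files):
            if name.endswith(file_extension):
                yield os.path.join(root, name)
```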

Entry Utilities

create_utterance

forte.data.common_entry_utils.create_utterance(input_pack, text, speaker)[source]

Create an utterance in the datapack. This is composed of three steps:

  1. Append the utterance text to the data pack.

  2. Create an Utterance entry on the text.

  3. Set the speaker of the utterance to the provided speaker.

Parameters
  • input_pack (DataPack) – The data pack to add utterance into.

  • text (str) – The text of the utterance.

  • speaker (str) – The speaker name to be associated with the utterance.
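The three steps above can be sketched on a toy dict-based pack; the dict layout is an illustrative stand-in, not the real DataPack API.

```python
def create_utterance(pack, text, speaker):
    """Sketch of the three documented steps on a toy dict-based pack."""
    # 1. Append the utterance text to the pack.
    begin = len(pack["text"])
    pack["text"] += text
    # 2. Create an Utterance-like entry covering the appended span.
    utterance = {"begin": begin, "end": len(pack["text"])}
    # 3. Set the speaker of the utterance.
    utterance["speaker"] = speaker
    pack["utterances"].append(utterance)
    return utterance
```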

get_last_utterance

forte.data.common_entry_utils.get_last_utterance(input_pack, target_speaker)[source]

Get the last utterance from a particular speaker. An utterance is an entry of type Utterance.

Parameters
  • input_pack (DataPack) – The data pack to find utterances.

  • target_speaker (str) – The name of the target speaker.

Return type

Optional[Utterance]

Returns

The last Utterance from the speaker if found, None otherwise.