Data

Ontology

base

class forte.data.span.Span(begin, end)[source]

A class recording the span of annotations. Span objects can be totally ordered according to their begin as the first sort key and end as the second sort key.

Parameters
  • begin (int) – The offset of the first character in the span.

  • end (int) – The offset of the last character in the span + 1. So the span is a left-closed and right-open interval [begin, end).
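
For example, a minimal sketch of this ordering (assuming Forte is installed):

from forte.data.span import Span

# Spans sort by begin first, then by end.
spans = [Span(5, 10), Span(0, 4), Span(0, 2)]
print(sorted(spans))  # ordered as (0, 2), (0, 4), (5, 10)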

core

class forte.data.ontology.core.Entry(pack)[source]

The base class inherited by all NLP entries. This is the main data type for all in-text NLP analysis results. The main sub-types are Annotation, Link and Group.

A forte.data.ontology.top.Annotation object represents a span in text.

A forte.data.ontology.top.Link object represents a binary link relation between two entries.

A forte.data.ontology.top.Group object represents a collection of multiple entries.

self.embedding

The embedding vectors (numpy array of floats) of this entry.

Parameters

pack – Each entry should be associated with one pack upon creation.

property embedding

Get the embedding vectors (numpy array of floats) of the entry.

property tid

Get the id of this entry.

Returns:

property pack_id

Get the id of the pack that contains this entry.

Returns:

as_pointer(from_entry)[source]

Return this entry as a pointer relative to the from_entry.

Parameters

from_entry – The entry to point from.

Returns

A pointer to this entry from the from_entry.

resolve_pointer(ptr)[source]

Resolve the provided pointer ptr into an entry, relative to this entry.

Parameters

ptr

Returns:

entry_type()[source]

Return the full name of this entry type.

abstract set_parent(parent)[source]

This will set the parent of the current instance to the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent – The parent entry.

abstract set_child(child)[source]

This will set the child of the current instance to the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child – The child entry

abstract get_parent()[source]

Get the parent entry of the link.

Returns

An instance of Entry that is the parent of the link from the given DataPack.

abstract get_child()[source]

Get the child entry of the link.

Returns

An instance of Entry that is the child of the link from the given DataPack.

class forte.data.ontology.core.BaseGroup(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members, no duplications allowed.

This is the BaseGroup interface. Specific member constraints are defined in the inherited classes.

abstract add_member(member)[source]

Add one entry to the group.

Parameters

member – One member to be added to the group.

add_members(members)[source]

Add members to the group.

Parameters

members – An iterator of members to be added to the group.

abstract get_members()[source]

Get the member entries in the group.

Returns

Instances of Entry that are the members of the group.

top

class forte.data.ontology.top.Generics(pack)[source]
class forte.data.ontology.top.Annotation(pack, begin, end)[source]

Annotation type entries, such as “token”, “entity mention” and “sentence”. Each annotation has a Span corresponding to its offset in the text.

Parameters
  • pack (PackType) – The container that this annotation will be added to.

  • begin (int) – The offset of the first character in the annotation.

  • end (int) – The offset of the last character in the annotation + 1.

set_span(begin, end)[source]

Set the span of the annotation.
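
For example, a hedged sketch of creating an annotation; Sentence is assumed to be the built-in ft.onto.base_ontology.Sentence, an Annotation subtype:

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Sentence  # assumed built-in Annotation subtype

pack = DataPack()
pack.set_text("Forte organizes NLP results as annotations.")

# begin and end are character offsets into the pack text.
sentence = Sentence(pack, 0, len(pack.text))
pack.add_entry(sentence)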

class forte.data.ontology.top.Link(pack, parent=None, child=None)[source]

Link type entries, such as “predicate link”. Each link has a parent node and a child node.

Parameters
  • pack (EntryContainer) – The container that this annotation will be added to.

  • parent (Entry, optional) – the parent entry of the link.

  • child (Entry, optional) – the child entry of the link.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

set_parent(parent)[source]

This will set the parent of the current instance to the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent – The parent entry.

set_child(child)[source]

This will set the child of the current instance with given Entry. The child is saved internally by its pack specific index key.

Parameters

child – The child entry.

property parent

Get tid of the parent node. To get the object of the parent node, call get_parent().

property child

Get tid of the child node. To get the object of the child node, call get_child().

get_parent()[source]

Get the parent entry of the link.

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Returns

An instance of Entry that is the child of the link.

class forte.data.ontology.top.Group(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members, no duplications allowed.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Returns

A set of instances of Entry that are the members of the group.
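
For example, a hedged sketch of building a coreference group, assuming the built-in ft.onto.base_ontology types EntityMention and CoreferenceGroup:

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import EntityMention, CoreferenceGroup  # assumed types

pack = DataPack()
pack.set_text("Alice met Bob. She greeted him.")

alice = EntityMention(pack, 0, 5)   # "Alice"
she = EntityMention(pack, 15, 18)   # "She"
for mention in (alice, she):
    pack.add_entry(mention)

# Members are stored as a set, so duplicates are ignored.
group = CoreferenceGroup(pack)
group.add_members([alice, she])
pack.add_entry(group)
print(group.get_members())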

class forte.data.ontology.top.MultiPackGeneric(pack)[source]
class forte.data.ontology.top.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Returns

Instances of Entry that are the members of the group.

class forte.data.ontology.top.MultiPackLink(pack, parent=None, child=None)[source]

This is used to link entries in a MultiPack. It is designed to support cross-pack linking, which enables applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers: one additional index indicates which pack the node comes from.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

parent_id()[source]

Return the tid of the parent entry.

Returns: The tid of the parent entry.

child_id()[source]

Return the tid of the child entry.

Returns: The tid of the child entry.

parent_pack_id()[source]

Return the pack_id of the parent pack.

Returns: The pack_id of the parent pack.

child_pack_id()[source]

Return the pack_id of the child pack.

Returns: The pack_id of the child pack.

set_parent(parent)[source]

This will set the parent of the current instance with given Entry. The parent is saved internally as a tuple: pack index and entry.tid. Pack index is the index of the data pack in the multi-pack.

Parameters

parent – The parent of the link, which is an Entry from a data pack, it has access to the pack index and its own tid in the pack.

set_child(child)[source]

This will set the child of the current instance with given Entry. The child is saved internally as a tuple: pack index and entry.tid. Pack index is the index of the data pack in the multi-pack.

Parameters

child – The child of the link, which is an Entry from a data pack, it has access to the pack index and its own tid in the pack.

get_parent()[source]

Get the parent entry of the link.

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Returns

An instance of Entry that is the child of the link.

class forte.data.ontology.top.Query(pack)[source]

An entry type representing queries for information retrieval tasks.

Parameters

pack (Data pack) – The data pack to which this query will be added.

add_result(pid, score)[source]

Set the result score for a particular pack (based on the pack id).

Parameters
  • pid – the pack id.

  • score – the score for the pack

Returns:

update_results(pid_to_score)[source]

Updates the results for this query.

Parameters

pid_to_score (dict) – A dict containing pack id -> score mapping
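
For example, a hedged sketch of recording retrieval scores on a Query entry; the pack ids used below are illustrative placeholders:

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Query

pack = DataPack()
pack.set_text("what is forte?")

query = Query(pack)
pack.add_entry(query)

# 101 and 102 stand in for real pack ids produced by a retriever.
query.add_result(pid=101, score=0.87)
query.update_results({102: 0.42})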

Packs

BasePack

class forte.data.base_pack.BasePack(pack_name=None)[source]

The base class of DataPack and MultiPack.

Parameters

pack_name (str, optional) – a string name of the pack.

abstract delete_entry(entry)[source]

Remove the entry from the pack.

Parameters

entry – The entry to be removed.

Returns:

add_entry(entry, component_name=None)[source]

Add an Entry object to the BasePack object. Allow duplicate entries in a pack.

Parameters
  • entry (Entry) – An Entry object to be added to the pack.

  • component_name (str) – A name to record that the entry is created by this component.

Returns

The input entry itself

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that are not added to the pack manually.

Parameters

component (str) – Overwrite the component record with this.

Returns:

serialize(drop_record=False)[source]

Serializes a pack to a string.

set_control_component(component)[source]

Record the current component that is taking control of this pack.

Parameters

component – The component that is going to take control

Returns:

record_field(entry_id, field_name)[source]

Record who modifies the entry. This will be called in Entry.

Parameters
  • entry_id – The id of the entry.

  • field_name – The name of the field modified.

Returns:

on_entry_creation(entry, component_name=None)[source]

Call this when adding a new entry. This will be called in Entry when its __init__ function is called.

Parameters
  • entry (Entry) – The entry to be added.

  • component_name (str) – A name to record that the entry is created by this component.

Returns:

regret_creation(entry)[source]

Will remove the entry from the pending entries internal state of the pack.

Parameters

entry – The entry that we would not add the the pack anymore.

Returns:

get_entry(tid)[source]

Look up the entry_index with key tid. The specific implementation depends on the actual class.

abstract get(entry_type, **kwargs)[source]

Implementations of this method should provide a way to obtain the entries in entry ordering. If there are orders defined between the entries, they should be used first. Otherwise, the insertion order should be used (FIFO).

Parameters

entry_type – The type of the entry to obtain.

Returns

An iterator of the entries matching the provided arguments.

get_single(entry_type)[source]

Take a single entry of type entry_type from this data pack. This is useful when the target entry type appears only once in the DataPack, e.g., a Document entry, or when you simply intend to take the first one.

Parameters

entry_type – The entry type to be retrieved.

Returns

A single data entry.
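
For example, fetching the single Document annotation of a pack (a minimal sketch; Document is assumed from the built-in ft.onto.base_ontology):

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Document  # assumed built-in Annotation subtype

pack = DataPack()
pack.set_text("A one-document pack.")
pack.add_entry(Document(pack, 0, len(pack.text)))

document = pack.get_single(Document)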

get_ids_by_creator(component)[source]

Look up the component_index with key component. This will return the entry ids that are created by the component

Parameters

component – The component (creator) to find ids for.

Returns

A set of entry ids that are created by the component.

is_created_by(entry, components)[source]

Check if the entry is created by any of the provided components.

Parameters
  • entry – The entry to check.

  • components – The list of component names.

Returns (bool):

True if the entry is created by any of the components, False otherwise.

get_entries_from(component)[source]

Look up all entries from the component as an unordered set.

Parameters
  • component – The component (creator) to get the entries. It is normally the fully qualified name of the creator class, but it may also be customized based on the implementation.

Returns

The set of entry ids that are created by the input component.

get_ids_from(components)[source]

Look up entries using a list of components (creators). This will find each creator iteratively and combine the result.

Parameters

components (List[str]) – The list of components to find.

Returns

The list of entry ids that are created from these components.

get_ids_by_type_subtype(entry_type)[source]

Look up the type_index with key entry_type.

Parameters

entry_type – The type of the entry you are looking for.

Returns

A set of entry ids. The entries are instances of entry_type (and also include instances of the subclasses of entry_type).

get_entries_of(entry_type, exclude_sub_types=False)[source]

Return all entries of this particular type, without order. If you need to get the annotations based on the entry ordering, use forte.data.base_pack.get().

Parameters
  • entry_type – The type of the entry you are looking for.

  • exclude_sub_types (bool) – Whether to ignore the inherited sub types of the provided entry_type. Default is False.

Returns

An iterator of the entries matching the type constraint.

DataPack

class forte.data.data_pack.DataPack(pack_name=None)[source]

A DataPack contains a piece of natural language text and a collection of NLP entries (annotations, links, and groups). The natural language text could be a document, paragraph or in any other granularity.

Parameters

pack_name (str, optional) – A name for this data pack.

property text

Return the text of the data pack

property all_annotations

An iterator of all annotations in this data pack.

Returns: Iterator of all annotations, of type Annotation.

property num_annotations

Number of annotations in this data pack.

Returns: (int) Number of annotations.

property all_links

An iterator of all links in this data pack.

Returns: Iterator of all links, of type Link.

property num_links

Number of links in this data pack.

Returns: Number of links.

property all_groups

An iterator of all groups in this data pack.

Returns: Iterator of all groups, of type Group.

property num_groups

Number of groups in this data pack.

Returns: Number of groups.

property all_generic_entries

An iterator of all generic entries in this data pack.

Returns: Iterator of all generic entries, of type Generics.

property num_generics_entries

Number of generics entries in this data pack.

Returns: Number of generics entries.

get_span_text(span)[source]

Get the text in the data pack contained in the span

Parameters

span (Span) – Span object which contains a begin and an end index

Returns

The text within this span
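
A minimal sketch:

from forte.data.data_pack import DataPack
from forte.data.span import Span

pack = DataPack()
pack.set_text("Hello Forte")

print(pack.get_span_text(Span(6, 11)))  # "Forte"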

get_original_text()[source]

Get original unmodified text from the DataPack object.

Returns

Original text, obtained by applying the replace_back_operations of the DataPack object to the modified text.

get_original_span(input_processed_span, align_mode='relaxed')[source]

Function to obtain span of the original text that aligns with the given span of the processed text.

Parameters
  • input_processed_span – Span of the processed text for which the corresponding span of the original text is desired.

  • align_mode – The strictness criteria for alignment in the ambiguous cases, that is, if a part of input_processed_span spans a part of the inserted span, then align_mode controls whether to use the span fully or ignore it completely, according to the following possible values:

    • "strict" - do not allow ambiguous input, give a ValueError.

    • "relaxed" - consider spans on both sides.

    • "forward" - align looking forward, that is, ignore the span towards the left, but consider the span towards the right.

    • "backward" - align looking backwards, that is, ignore the span towards the right, but consider the span towards the left.

Returns

Span of the original text that aligns with input_processed_span

Example

  • Let o-up1, o-up2, … and m-up1, m-up2, … denote the unprocessed spans of the original and modified string respectively. Note that each o-up would have a corresponding m-up of the same size.

  • Let o-pr1, o-pr2, … and m-pr1, m-pr2, … denote the processed spans of the original and modified string respectively. Note that each o-p is modified to a corresponding m-pr that may be of a different size than o-pr.

  • Original string: <–o-up1–> <-o-pr1-> <—-o-up2—-> <—-o-pr2—-> <-o-up3->

  • Modified string: <–m-up1–> <—-m-pr1—-> <—-m-up2—-> <-m-pr2-> <-m-up3->

  • Note that self.inverse_original_spans that contains modified processed spans and their corresponding original spans, would look like - [(o-pr1, m-pr1), (o-pr2, m-pr2)]

>> data_pack = DataPack()
>> original_text = "He plays in the park"
>> data_pack.set_text(original_text,
>>                    lambda _: [(Span(0, 2), "She")])
>> data_pack.text
"She plays in the park"
>> input_processed_span = Span(0, len("She plays"))
>> orig_span = data_pack.get_original_span(input_processed_span)
>> data_pack.get_original_text()[orig_span.begin: orig_span.end]
"He plays"

classmethod deserialize(data_pack_string)[source]

Deserialize a Data Pack from a string. This internally calls the internal _deserialize() function from BasePack.

Parameters

data_pack_string – The serialized string of a data pack to be deserialized.

Returns

A data pack object deserialized from the string.

delete_entry(entry)[source]

Delete an Entry object from the DataPack. This finds the entry in the index and removes it from the index. Note that entries will only appear in the index if add_entry (or _add_entry_with_check) is called.

Please note that deleting an entry does not guarantee the deletion of the related entries.

Parameters

entry (Entry) – An Entry object to be deleted from the pack.

get_data(context_type, request=None, skip_k=0)[source]

Fetch entries from the data_pack of type context_type.

Currently, we do not support Groups and Generics in the request.

Example

requests = {
    base_ontology.Sentence:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_data(base_ontology.Sentence, requests)
Parameters
  • context_type (str) – The granularity of the data context, which could be any Annotation type.

  • request (dict) –

    The entry types and fields required. The keys of the requests dict are the required entry types and the value should be either:

    • a list of field names or

    • a dict which accepts three keys: “fields”, “component”, and “unit”.

      • By setting “fields” (list), users specify the requested fields of the entry. If “fields” is not specified, only the default fields will be returned.

      • By setting “component” (list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components.

      • By setting “unit” (string), users can specify a unit by which the annotations are indexed.

    Note that for all annotation types, “text” and “span” fields are returned by default; for all link types, “child” and “parent” fields are returned by default.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Returns

A data generator, which generates one piece of data (a dict containing the required entries, fields, and context).
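
To consume the generator, iterate it directly. A minimal sketch continuing the request example above (the exact keys inside each yielded dict depend on the requested types and fields):

for data in pack.get_data(base_ontology.Sentence, requests):
    # Each `data` is a dict holding the sentence context plus the
    # requested entries and fields described above.
    print(data)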

build_coverage_for(context_type, covered_type)[source]

Users can call this function to build a coverage index for specific types.

The index provides an in-memory mapping from entries of context_type to the entries “covered” by it. See forte.data.data_pack.DataIndex for more details.

Parameters
  • context_type – The context/covering type.

  • covered_type – The entry to find under the context type.
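
A hedged sketch, assuming the built-in ft.onto.base_ontology types Sentence and Token:

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Sentence, Token  # assumed built-in types

pack = DataPack()
pack.set_text("Forte rocks.")
pack.add_entry(Sentence(pack, 0, 12))
for begin, end in [(0, 5), (6, 11)]:
    pack.add_entry(Token(pack, begin, end))

# Build the in-memory Sentence -> Token coverage index so that
# range-constrained queries use the index instead of scanning.
pack.build_coverage_for(context_type=Sentence, covered_type=Token)
tokens = list(pack.get(Token, range_annotation=pack.get_single(Sentence)))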

iter_in_range(entry_type, range_annotation)[source]

Iterate over the entries of the provided type that fulfill the constraint of the range_annotation. The constraint holds if an entry is in_span of the provided range_annotation.

Internally, if the coverage index between the entry type and the type of the range_annotation is built, then this will create the iterator from the index. Otherwise, the function will iterate over them from scratch (which is slower). If this function is used frequently, it is suggested to build the coverage index.

Parameters
  • entry_type – The type of entry to iterate over.

  • range_annotation – The range annotation that serve as the constraint.

Returns

An iterator of the entries within the range_annotation.

get(entry_type, range_annotation=None, components=None, include_sub_type=True)[source]

This function is used to get data from a data pack with various methods.

Depending on the provided arguments, the function will perform several different filtering of the returned data.

The entry_type is mandatory, where all the entries matching this type will be returned. The sub-types of the provided entry type will be also returned if include_sub_type is set to True (which is the default behavior).

The range_annotation controls the search area of the sub-types. An entry E will be returned if in_span(E, range_annotation) returns True. If this function is called frequently with queries related to the range_annotation, please consider building the coverage index for the related entry types.

The components list will filter the results by the component (i.e the creator of the entry). If components is provided, only the entries created by one of the components will be returned.

Example

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from a sentence created by NLTKTokenizer.
    token_entries = input_pack.get(
        entry_type=Token,
        range_annotation=sentence,
        components='NLTKTokenizer')
    ...

In the above code snippet, we get entries of type Token within each sentence which were generated by NLTKTokenizer. You can consider building a coverage index between Token and Sentence if this snippet is used frequently.

Parameters
  • entry_type (type) – The type of entries requested.

  • range_annotation (Annotation, optional) – The range of entries requested. If None, will return valid entries in the range of whole data_pack.

  • components (str or list, optional) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type (bool) – whether to consider the sub types of the provided entry type. Default True.

MultiPack

class forte.data.multi_pack.MultiPack(pack_name=None)[source]

A MultiPack contains multiple DataPacks and a collection of cross-pack entries (such as links and groups)

add_pack(ref_name=None)[source]

Create a data pack and add it to this multi pack. If ref_name is provided, it will be used to index the data pack. Otherwise, a default name based on the pack id will be created for this data pack. The created data pack will be returned.

Parameters

ref_name (str) – The pack name used to reference this data pack from the multi pack.

Returns: The newly created data pack.
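
A minimal sketch of creating and retrieving packs by name (in a full pipeline, packs are usually created by readers rather than constructed directly):

from forte.data.multi_pack import MultiPack

multi_pack = MultiPack()
source = multi_pack.add_pack(ref_name="source")
target = multi_pack.add_pack(ref_name="target")

source.set_text("Hello world.")
target.set_text("Hallo Welt.")

print(multi_pack.get_pack("source").text)  # "Hello world."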

add_pack_(pack, ref_name=None)[source]

Add an existing data pack to the multi pack.

Parameters
  • pack (DataPack) – The existing data pack.

  • ref_name (str) – The name to be used in this multi pack.

Returns:

get_pack_at(index)[source]

Get data pack at provided index.

Parameters

index – The index of the pack.

Returns: The pack at the index.

get_pack_index(pack_id)[source]

Get the pack index from the global pack id.

Parameters

pack_id – The global pack id to find.

Returns:

get_pack(name)[source]

Get the data pack with the given name.

Parameters

name – The name of the pack.

Returns: The pack that has that name.

property packs

Get the list of data packs, in the order they were added.

Please do not use this to add or delete packs in the list.

Returns: List of data packs contained in this multi-pack.

rename_pack(old_name, new_name)[source]

Rename the pack to a new name. If the new_name is already taken, a ValueError will be raised. If the old_name is not found, then a KeyError will be raised, just as with a missing key in a dictionary.

Parameters
  • old_name – The old name of the pack.

  • new_name – The new name to be assigned for the pack.

Returns:

property all_links

An iterator of all links in this multi pack.

Returns: Iterator of all links, of type MultiPackLink.

property num_links

Number of links in this multi pack.

Returns: Number of links.

property all_groups

An iterator of all groups in this multi pack.

Returns: Iterator of all groups, of type MultiPackGroup.

property num_groups

Number of groups in this multi pack.

Returns: Number of groups.

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that are not added to the pack manually.

Parameters

component (str) – Overwrite the component record with this.

Returns:

get_single_pack_data(pack_index, context_type, request=None, skip_k=0)[source]

Get pack data from one of the packs, specified by the pack index. This is equivalent to calling get_data() in DataPack.

Parameters
  • pack_index (int) – The index of a single pack.

  • context_type (str) – The granularity of the data context, which could be any Annotation type.

  • request (dict) – The entry types and fields required. The keys of the dict are the required entry types and the value should be either a list of field names or a dict. If the value is a dict, accepted items includes “fields”, “component”, and “unit”. By setting “component” (a list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components. By setting “unit” (a string), users can specify a unit by which the annotations are indexed. Note that for all annotations, “text” and “span” fields are given by default; for all links, “child” and “parent” fields are given by default.

  • skip_k – Will skip the first k instances and generate data from the (k + 1)th instance.

Returns

A data generator, which generates one piece of data (a dict containing the required annotations and context).

get_cross_pack_data(request)[source]

NOTE: This function is not finished.

Get data via the links and groups across data packs. The keys could be MultiPack entries (i.e., MultiPackLink and MultiPackGroup). The values specify the detailed entry information to be retrieved. A value can be a list of field names, in which case the returned results will contain all specified fields.

One can also call this method with more constraints by providing a dictionary, which can contain the following keys:

  • “fields”, this specifies the attribute field names to be obtained

  • “unit”, this specifies the unit used to index the annotation

  • “component”, this specifies a constraint to take only the entries created by the specified component.

The data request logic is similar to that of get_data() function in DataPack, but applied on MultiPack entries.

Example:

requests = {
    MultiPackLink:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_cross_pack_data(requests)
Parameters

request – A dict containing the data request. The keys are the types to be requested, and the fields are the detailed constraints.

Returns:

get(entry_type, components=None, include_sub_type=True)[source]

Get entries of entry_type from this multi pack.

Example:

for relation in pack.get(
                    CrossDocEntityRelation,
                    component="relation_creator"
                    ):
    print(relation.get_parent())

In the above code snippet, we get entries of type CrossDocEntityRelation which were generated by a component named relation_creator

Parameters
  • entry_type (type) – The type of the entries requested.

  • components (str or list, optional) – The component generating the entries requested. If None, all valid entries generated by any component will be returned.

  • include_sub_type (bool) – whether to return the sub types of the queried entry_type. True by default.

Returns: An iterator of the entries matching the arguments, following the order of entries (first sort by entry comparison, then by insertion)

classmethod deserialize(string)[source]

Deserialize a Multi Pack from a string. Note that this will only deserialize the native multi pack content, which means the associated DataPacks contained in the MultiPack will not be recovered. A follow-up step needs to be performed to add the data packs back to the multi pack.

This internally calls the internal _deserialize() function from the BasePack.

Parameters

string – The serialized string of a Multi pack to be deserialized.

Returns

A multi pack object deserialized from the string.

delete_entry(entry)[source]

Delete an Entry object from the MultiPack.

Parameters

entry (Entry) – An Entry object to be deleted from the pack.

BaseMeta

class forte.data.base_pack.BaseMeta(pack_name=None)[source]

Basic Meta information for both DataPack and MultiPack.

Parameters

pack_name – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

record

Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

Meta

class forte.data.data_pack.Meta(pack_name=None, language='eng', span_unit='character')[source]

Basic Meta information associated with each instance of DataPack.

Parameters
  • pack_name – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

  • language – The language used by this data pack, default is English.

  • span_unit – The unit used for interpreting the Span object of this data pack. Default is character.

record

Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

BaseIndex

class forte.data.base_pack.BaseIndex[source]

A set of indexes used in BasePack:

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  4. group_index, the index from group members to groups.

update_basic_index(entries)[source]

Build or update the basic indexes, including

(1) entry_index, the index from each tid to the corresponding entry;

(2) type_index, the index from each type to the entries of that type;

(3) component_index, the index from each component to the entries generated by that component.

Parameters

entries (list) – a list of entries to be added into the basic index.

build_link_index(links)[source]

Build the link_index, the index from child and parent nodes to links. It will build the index with the links in the dataset.

link_index consists of two sub-indexes: “child_index” is the index from child nodes to their corresponding links, and “parent_index” is the index from parent nodes to their corresponding links.

Returns:

build_group_index(groups)[source]

Build group_index, the index from group members to groups.

Returns:

link_index(tid, as_parent=True)[source]

Look up the link_index with key tid. If the link index is not built, this will throw a PackIndexError.

Parameters
  • tid (int) – the tid of the entry being looked up.

  • as_parent (bool) – If as_parent is True, will look up link_index["parent_index"] and return the tids of links whose parent is tid. Otherwise, will look up link_index["child_index"] and return the tids of links whose child is tid.

group_index(tid)[source]

Look up the group_index with key tid. If the index is not built, this will raise a PackIndexError.

update_link_index(links)[source]

Update link_index with the provided links, the index from child and parent to links.

link_index consists of two sub-indexes:
  • “child_index” is the index from child nodes to their corresponding links

  • “parent_index” is the index from parent nodes to their corresponding links.

Parameters

links (list) – a list of links to be added into the index.

update_group_index(groups)[source]

Build or update group_index, the index from group members to groups.

Parameters

groups (list) – a list of groups to be added into the index.

DataIndex

class forte.data.data_pack.DataIndex[source]

A set of indexes used in DataPack, note that this class is used by the DataPack internally.

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. component_index, the index from each component to the entries generated by that component

  4. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  5. group_index, the index from group members to groups.

  6. _coverage_index, the index that maps from an annotation to the entries it covers. _coverage_index is a dict of dict, where the key is a tuple of the outer entry type and the inner entry type. The outer entry type should be an annotation type. The value is a dict, where the key is the tid of the outer entry, and the value is a set of tids that are covered by the outer entry. We say an Annotation A covers an entry E if one of the following conditions is met: 1. E is of Annotation type, and E.begin >= A.begin, E.end <= A.end; 2. E is of Link type, and both E’s parent and child nodes are Annotations that are covered by A.

coverage_index(outer_type, inner_type)[source]

Get the coverage index from outer_type to inner_type.

Parameters
  • outer_type (type) – an annotation type.

  • inner_type (type) – an entry type.

Returns

If the coverage index does not exist, return None. Otherwise, return a dict.

build_coverage_index(data_pack, outer_type, inner_type)[source]

Build the coverage index from outer_type to inner_type.

Parameters
  • data_pack (DataPack) – The data pack to build coverage for.

  • outer_type (type) – an annotation type.

  • inner_type (type) – an entry type, can be Annotation, Link, Group.

have_overlap(entry1, entry2)[source]

Check whether the two annotations have overlap in span.

Parameters
  • entry1 (str or Annotation) – An Annotation object to be checked, or the tid of the Annotation.

  • entry2 (str or Annotation) – Another Annotation object to be checked, or the tid of the Annotation.

in_span(inner_entry, span)[source]

Check whether the inner entry is within the given span. The criteria are as follows:

Annotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.

Link entries: if the parent and child of the link are both of Annotation type, the link will be considered in span if both the parent and the child are in_span of the provided span. If either the parent or the child is not of type Annotation, this function will always return False.

Group entries: if the child type of the group is Annotation type, then the group will be considered in span if all the elements are in_span of the provided span. If the child type is not Annotation type, this function will always return False.

Other entries (i.e Generics): they will not be considered in_span of any spans. The function will always return False.

Parameters
  • inner_entry (int or Entry) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.

Returns

True if the inner_entry is considered to be in span of the provided span.

Readers

BaseReader

class forte.data.base_reader.BaseReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic data reader class. To be inherited by all data readers.

Parameters
  • from_cache (bool, optional) – Decide whether to read from cache if cache file exists. By default (False), the reader will only read from the original file and use the cache file path for caching, it will not read from the cache_directory. If True, the reader will try to read a datapack from the caching file.

  • cache_directory (str, optional) – The base directory to place the path of the caching files. Each collection is contained in one cached file, under this directory. The cached location for each collection is computed by _cache_key_function(). Note: A collection is the data returned by _collect().

  • append_to_cache (bool, optional) – Decide whether to append to the cache file if it already exists. By default (False), we will overwrite the existing caching file. If True, we will append the cached datapack to the end of the caching file.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (Config) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

{
    "name": "reader"
}
parse_pack(collection)[source]

Calls _parse_pack() to create packs from the collection. This internally sets up the component metadata. Users should implement the _parse_pack() method.

text_replace_operation(text)[source]

Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

Parameters

text – The original data text to be cleaned.

Returns (List[Tuple[Tuple[int, int], str]]): the replacement operations.

set_profiling(enable_profiling=True)[source]

Set profiling option.

Parameters

enable_profiling – A boolean of whether to enable profiling for the reader or not (the default is True).

timer_yield(pack)[source]

Wrapper generator for time profiling. Insert timers around ‘yield’ to support time profiling for reader.

Parameters

pack – DataPack passed from self.iter()

iter(*args, **kwargs)[source]

An iterator over the entire dataset, giving all the Packs read from the data source(s), either as a list or an Iterator depending on lazy. If not reading from cache, this should call collect().

Parameters
  • args – One or more input data sources, for example, most DataPack readers accept data_source as file/folder path.

  • kwargs – Iterator of DataPacks.

record(record_meta)[source]

Modify the pack meta record field of the reader’s output. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

Parameters

record_meta – the field in the datapack for type record that need to fill in for consistency checking.

cache_data(collection, pack, append)[source]

Specify the path to the cache directory.

After you call this method, the dataset reader will use its cache_directory to store a cache of BasePack read from every document passed to read(), serialized as one string-formatted BasePack. If the cache file for a given file_path exists, we read the BasePack from the cache. If the cache file does not exist, we will create it on our first pass through the data.

Parameters
  • collection – The collection is a piece of data from the _collect() function, to be read to produce DataPack(s). During caching, a cache key is computed based on the data in this collection.

  • pack – The data pack to be cached.

  • append – Whether to allow appending to the cache.

read_from_cache(cache_filename)[source]

Reads one or more Packs from cache_filename, and yields Pack(s) from the cache file.

Parameters

cache_filename – Path to the cache file.

Returns: List of cached data packs.

finish(resources)[source]

The pipeline will call this function at the end of the pipeline to notify all the components. The user can implement this function to release resources used by this component. The component can also add objects to the resources.

Parameters

resources (Resources) – A global resource registry.

set_text(pack, text)[source]

Assign the text value to the DataPack. This function will pass the text_replace_operation to the DataPack to conduct the pre-processing step.

Parameters
  • pack – The DataPack to assign value for.

  • text – The original text to be recorded in this dataset.

PackReader

class forte.data.base_reader.PackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

A Pack Reader reads data into DataPack.
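
A hedged sketch of a PackReader in use, assuming the built-in StringReader (imported from forte.data.readers), the Pipeline class from forte.pipeline, and that StringReader accepts an in-memory list of strings:

from forte.pipeline import Pipeline
from forte.data.readers import StringReader

pipeline = Pipeline()
pipeline.set_reader(StringReader())
pipeline.initialize()

# Each input string becomes one DataPack.
for pack in pipeline.process_dataset(["Forte reads this string into a DataPack."]):
    print(pack.text)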

MultiPackReader

class forte.data.base_reader.MultiPackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic MultiPack data reader class. To be inherited by all data readers which return MultiPack.

CoNLL03Reader

ConllUDReader

class forte.data.readers.conllu_ud_reader.ConllUDReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

ConllUDReader is designed to read the Universal Dependencies 2.4 dataset.

BaseDeserializeReader

RawDataDeserializeReader

RecursiveDirectoryDeserializeReader

HTMLReader

MSMarcoPassageReader

class forte.data.readers.ms_marco_passage_reader.MSMarcoPassageReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

MultiPackSentenceReader

MultiPackTerminalReader

OntonotesReader

PlainTextReader

ProdigyReader

RACEMultiChoiceQAReader

StringReader

SemEvalTask8Reader

OpenIEReader

DataPack Dataset

DataPackIterator

class forte.data.data_pack_dataset.DataPackIterator(pack_iterator, context_type, request=None, skip_k=0)[source]

An iterator over single data example from multiple data packs.

Parameters
  • pack_iterator (Iterator[DataPack]) – An iterator of DataPack.

  • context_type – The granularity of a single example which could be any Annotation type. For example, it can be Sentence, then each training example will represent the information of a sentence.

  • request – The request of type Dict sent to DataPack to query specific data.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Returns

An Iterator that each time produces a Tuple of an tid (of type int) and a data pack (of type DataPack).

Here is an example usage:
file_path: str = "data_samples/data_pack_dataset_test"
reader = CoNLL03Reader()
context_type = Sentence
request = {Sentence: []}
skip_k = 0

train_pl: Pipeline = Pipeline()
train_pl.set_reader(reader)
train_pl.initialize()
pack_iterator: Iterator[PackType] = (
    train_pl.process_dataset(file_path))

iterator: DataPackIterator = DataPackIterator(pack_iterator,
                                              context_type,
                                              request,
                                              skip_k)

for tid, data_pack in iterator:
    ...  # process tid and data_pack

Note

For parameters context_type, request, skip_k, please refer to get_data() in DataPack.

DataPackDataset

class forte.data.data_pack_dataset.DataPackDataset(data_source, feature_schemes, hparams=None, device=None)[source]

A dataset representing data packs. Calling a DataIterator over this DataPackDataset will produce an iterator over batches of examples parsed by a reader from the given data packs.

Parameters
  • data_source – A data source of type DataPackDataSource.

  • feature_schemes (dict) – A Dict containing all the information to do data pre-processing. This is exactly the same as the schemes in feature_resource. Please refer to feature_resource() in TrainPreprocessor for details.

  • hparams – A dict or an instance of texar.torch.HParams containing hyperparameters. See default_hparams() in DatasetBase for the defaults.

  • device – The device of the produced batches. For GPU training, set to current CUDA device.

process(raw_example)[source]

Given an input which is a single data example, extract feature from it.

Parameters

raw_example (tuple(dict, DataPack)) –

A Tuple where

The first element is a Dict produced by get_data() in DataPack.

The second element is an instance of type DataPack.

Returns

A Dict mapping from user-specified tags to the Feature extracted.

Note

Please refer to feature_resource() in TrainPreprocessor for details about user-specified tags.

collate(examples)[source]

Given a batch of output from process(), produce pre-processed data as well as masks and features.

Parameters

examples – A List of result from process().

Returns

A Texar-PyTorch Batch. It can be treated as a Dict with the following structure:

{
    "tag_a": {
        "data": <tensor>,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    },
    "tag_b": {
        "data": Tensor,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    }
}
"data": List or np.ndarray or torch.tensor

The pre-processed data.

Please refer to Converter for details.

"masks": np.ndarray or torch.tensor

All the masks for pre-processed data.

Please refer to Converter for details.

"features": List[Feature]

A List of Feature. This is useful when users want to do customized pre-processing.

Please refer to Feature for details.

Note

The first level key in returned batch is the user-specified tags. Please refer to feature_resource() in TrainPreprocessor for details about user-specified tags.

Batchers

ProcessingBatcher

class forte.data.batchers.ProcessingBatcher(cross_pack=True)[source]

This defines the basic interface of the Batcher used in BatchProcessor. This Batcher only batches data sequentially. It receives new packs dynamically and caches the current packs so that the processors can pack prediction results into the data packs.

Parameters
  • cross_pack (bool, optional) – Whether to allow batches to go across data packs when there is not enough data at the end.

initialize(_)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This batcher will be called at the pipeline initialize stage.

Returns:

flush()[source]

Flush the remaining data.

Returns

A tuple containing the datapack, the instance, and the batch data. In the basic ProcessingBatcher, to be compatible with the existing implementation, the instance is not needed, thus None is used.

get_batch(input_pack, context_type, requests)[source]

Returns an iterator of tuples containing the datapack, the instance, and the batch data. In the basic ProcessingBatcher, to be compatible with the existing implementation, the instance is not needed, thus None is used.

Data Utilities

maybe_download

forte.data.data_utils.maybe_download(urls, path, filenames=None, extract=False)[source]

Downloads a set of files.

Parameters
  • urls – A (list of) URLs to download files.

  • path – The destination path to save the files.

  • filenames – A (list of) strings of the file names. If given, must have the same length with urls. If None, filenames are extracted from urls.

  • extract – Whether to extract compressed files.

Returns

A list of paths to the downloaded files.
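
A hedged usage sketch; the URL and file name below are placeholders:

from forte.data.data_utils import maybe_download

paths = maybe_download(
    urls=["https://example.com/dataset.zip"],  # placeholder URL
    path="data/",
    filenames=["dataset.zip"],
    extract=True,
)
print(paths)  # local paths of the downloaded (and extracted) files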

batch_instances

forte.data.data_utils_io.batch_instances(instances)[source]

Merge a list of instances.

merge_batches

forte.data.data_utils_io.merge_batches(batches)[source]

Merge a list of batches.

slice_batch

forte.data.data_utils_io.slice_batch(batch, start, length)[source]

Return a sliced batch of size length from start in batch.

dataset_path_iterator

forte.data.data_utils_io.dataset_path_iterator(dir_path, file_extension)[source]

An iterator returning the file paths in a directory containing files of the given datasets.