Data

Ontology

base

class forte.data.span.Span(begin, end)[source]

A class recording the span of annotations. Span objects can be totally ordered according to their begin as the first sort key and end as the second sort key.

Parameters
  • begin (int) – The offset of the first character in the span.

  • end (int) – The offset of the last character in the span + 1. So the span is a left-closed and right-open interval [begin, end).
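
A short usage sketch of the ordering behavior (output shown in comments follows the rule above):

from forte.data.span import Span

a = Span(0, 5)
b = Span(0, 7)
c = Span(3, 7)

# Total order: begin is compared first, then end, so a < b < c.
ordered = sorted([c, b, a])
print([(s.begin, s.end) for s in ordered])  # [(0, 5), (0, 7), (3, 7)]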

core

class forte.data.ontology.core.Entry(pack)[source]

The base class inherited by all NLP entries. This is the main data type for all in-text NLP analysis results. The main sub-types are Annotation, Link and Group.

A forte.data.ontology.top.Annotation object represents a span in text.

A forte.data.ontology.top.Link object represents a binary link relation between two entries.

A forte.data.ontology.top.Group object represents a collection of multiple entries.

self.embedding

The embedding vectors (numpy array of floats) of this entry.

Parameters

pack – Each entry should be associated with one pack upon creation.

property embedding

Get the embedding vectors (numpy array of floats) of the entry.

property tid

Get the id of this entry.

property pack_id

Get the id of the pack that contains this entry.

as_pointer(from_entry)[source]

Return a pointer to this entry relative to the from_entry.

Parameters

from_entry – The entry to point from.

Returns

A pointer to this entry from the from_entry.

resolve_pointer(ptr)[source]

Resolve the provided pointer ptr from this entry into the entry it points to.

Parameters

ptr – The pointer to be resolved.

abstract set_parent(parent)[source]

This will set the parent of the current instance to the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent – The parent entry.

abstract set_child(child)[source]

This will set the child of the current instance to the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child – The child entry

abstract get_parent()[source]

Get the parent entry of the link.

Returns

An instance of Entry that is the parent of the link from the given DataPack.

abstract get_child()[source]

Get the child entry of the link.

Returns

An instance of Entry that is the child of the link from the given DataPack.

class forte.data.ontology.core.BaseGroup(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members; no duplication is allowed.

This is the BaseGroup interface. Specific member constraints are defined in the inherited classes.

abstract add_member(member)[source]

Add one entry to the group.

Parameters

member – One member to be added to the group.

add_members(members)[source]

Add members to the group.

Parameters

members – An iterator of members to be added to the group.

abstract get_members()[source]

Get the member entries in the group.

Returns

Instances of Entry that are the members of the group.

top

class forte.data.ontology.top.Generics(pack)[source]
class forte.data.ontology.top.Annotation(pack, begin, end)[source]

Annotation type entries, such as “token”, “entity mention” and “sentence”. Each annotation has a Span corresponding to its offset in the text.

Parameters
  • pack (PackType) – The container that this annotation will be added to.

  • begin (int) – The offset of the first character in the annotation.

  • end (int) – The offset of the last character in the annotation + 1.

set_span(begin, end)[source]

Set the span of the annotation.
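
Here is a minimal sketch of creating an annotation (assuming the standard Token type from ft.onto.base_ontology, a subclass of Annotation shipped with Forte):

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Token

pack = DataPack(pack_name="example")
pack.set_text("Forte annotates text.")
token = Token(pack, 0, 5)  # the span [0, 5) covers "Forte"
pack.add_entry(token)
print(token.text)          # "Forte"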

class forte.data.ontology.top.Link(pack, parent=None, child=None)[source]

Link type entries, such as “predicate link”. Each link has a parent node and a child node.

Parameters
  • pack (EntryContainer) – The container that this link will be added to.

  • parent (Entry, optional) – the parent entry of the link.

  • child (Entry, optional) – the child entry of the link.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

set_parent(parent)[source]

This will set the parent of the current instance to the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent – The parent entry.

set_child(child)[source]

This will set the child of the current instance to the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child – The child entry.

property parent

Get tid of the parent node. To get the object of the parent node, call get_parent().

property child

Get tid of the child node. To get the object of the child node, call get_child().

get_parent()[source]

Get the parent entry of the link.

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Returns

An instance of Entry that is the child of the link.
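
A hedged sketch of creating a link (assuming EntityMention and RelationLink from ft.onto.base_ontology, Forte's standard ontology):

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import EntityMention, RelationLink

pack = DataPack()
pack.set_text("People moved into downtown.")
parent = EntityMention(pack, 0, 6)    # "People"
child = EntityMention(pack, 18, 26)   # "downtown"
pack.add_entry(parent)
pack.add_entry(child)

link = RelationLink(pack, parent, child)
link.rel_type = "Entity-Destination"
pack.add_entry(link)
print(link.get_parent().text, "->", link.get_child().text)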

class forte.data.ontology.top.Group(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members; no duplication is allowed.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Returns

A set of instances of Entry that are the members of the group.
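
A minimal sketch of grouping entries (assuming EntityMention and CoreferenceGroup from ft.onto.base_ontology):

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import CoreferenceGroup, EntityMention

pack = DataPack()
pack.set_text("Marie won. She was happy.")
m1 = EntityMention(pack, 0, 5)    # "Marie"
m2 = EntityMention(pack, 11, 14)  # "She"
pack.add_entry(m1)
pack.add_entry(m2)

group = CoreferenceGroup(pack)
group.add_members([m1, m2])       # members are stored as a set
pack.add_entry(group)
print([m.text for m in group.get_members()])  # "Marie" and "She"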

class forte.data.ontology.top.MultiPackGeneric(pack)[source]
class forte.data.ontology.top.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Returns

Instances of Entry that are the members of the group.

class forte.data.ontology.top.MultiPackLink(pack, parent=None, child=None)[source]

This is used to link entries in a MultiPack, which is designed to support cross-pack linking; this can support applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers, one additional index on which pack it comes from.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

parent_id()[source]

Return the tid of the parent entry.

Returns: The tid of the parent entry.

child_id()[source]

Return the tid of the child entry.

Returns: The tid of the child entry.

parent_pack_id()[source]

Return the pack_id of the parent pack.

Returns: The pack_id of the parent pack.

child_pack_id()[source]

Return the pack_id of the child pack.

Returns: The pack_id of the child pack.

set_parent(parent)[source]

This will set the parent of the current instance with given Entry. The parent is saved internally as a tuple: pack index and entry.tid. Pack index is the index of the data pack in the multi-pack.

Parameters

parent – The parent of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

set_child(child)[source]

This will set the child of the current instance with given Entry. The child is saved internally as a tuple: pack index and entry.tid. Pack index is the index of the data pack in the multi-pack.

Parameters

child – The child of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

get_parent()[source]

Get the parent entry of the link.

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Returns

An instance of Entry that is the child of the link.

class forte.data.ontology.top.Query(pack)[source]

An entry type representing queries for information retrieval tasks.

Parameters

pack (DataPack) – The data pack to which this query will be added.

add_result(pid, score)[source]

Set the result score for a particular pack (based on the pack id).

Parameters
  • pid – the pack id.

  • score – the score for the pack

update_results(pid_to_score)[source]

Updates the results for this query.

Parameters

pid_to_score (dict) – A dict containing pack id -> score mapping
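
An illustrative sketch (the pack ids below are placeholders; in practice they come from the pack_id of candidate document packs):

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Query

pack = DataPack()
pack.set_text("what is forte")
query = Query(pack)
pack.add_entry(query)

query.add_result(pid=12345, score=0.87)            # a single result
query.update_results({67890: 0.42, 13579: 0.91})   # bulk update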

Packs

BasePack

class forte.data.base_pack.BasePack(pack_name=None)[source]

The base class of DataPack and MultiPack.

Parameters

pack_name (str, optional) – a string name of the pack.

abstract delete_entry(entry)[source]

Remove the entry from the pack.

Parameters

entry – The entry to be removed.

add_entry(entry, component_name=None)[source]

Add an Entry object to the BasePack object. Allow duplicate entries in a pack.

Parameters
  • entry (Entry) – An Entry object to be added to the pack.

  • component_name (str) – A name to record that the entry is created by this component.

Returns

The input entry itself

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that have not been manually added to the pack.

Parameters

component (str) – Overwrite the component record with this.

serialize(drop_record=False)[source]

Serializes a pack to a string.

set_control_component(component)[source]

Record the current component that is taking control of this pack.

Parameters

component – The component that is going to take control

record_field(entry_id, field_name)[source]

Record who modifies the entry. This will be called in Entry.

Parameters
  • entry_id – The id of the entry.

  • field_name – The name of the field modified.

on_entry_creation(entry, component_name=None)[source]

Call this when adding a new entry; this will be called in Entry when its __init__ function is called.

Parameters
  • entry (Entry) – The entry to be added.

  • component_name (str) – A name to record that the entry is created by this component.

regret_creation(entry)[source]

Will remove the entry from the pending entries internal state of the pack.

Parameters

entry – The entry that we would no longer add to the pack.

get_entry(tid)[source]

Look up the entry_index with the key tid. The specific implementation depends on the actual class.

get_single(entry_type)[source]

Take a single entry of type entry_type from this data pack. This is useful when the target entry type appears only once in the DataPack, e.g., a Document entry, or when you just intend to take the first one.

Parameters

entry_type – The entry type to be retrieved.

Returns

A single data entry.
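
Illustrative usage (assuming the Document type from ft.onto.base_ontology):

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Document

pack = DataPack()
pack.set_text("A pack with exactly one document.")
pack.add_entry(Document(pack, 0, len(pack.text)))

doc = pack.get_single(Document)
print(doc.text)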

get_ids_by_creator(component)[source]

Look up the component_index with key component.

get_entries_by_creator(component)[source]

Return all entries created by the given component, as an unordered set.

Parameters

component – The component to get the entries.

get_ids_by_creators(components)[source]

Look up component_index using a list of components.

get_ids_by_type(entry_type)[source]

Look up the type_index with key entry_type.

Parameters

entry_type – The type of the entry you are looking for.

Returns

A set of entry tids. The entries are instances of entry_type (and also include instances of the subclasses of entry_type).

get_entries_by_type(entry_type)[source]

Return all entries of this particular type, in no particular order. If you need the natural order of the annotations, use forte.data.data_pack.get_entries().

Parameters

entry_type – The type of the entry you are looking for.

DataPack

class forte.data.data_pack.DataPack(pack_name=None)[source]

A DataPack contains a piece of natural language text and a collection of NLP entries (annotations, links, and groups). The natural language text could be a document, paragraph or in any other granularity.

Parameters

pack_name (str, optional) – A name for this data pack.

validate(entry)[source]

Validate whether this entry type can be added. This method is called by the __init__() method when an instance of Entry is being added to the pack.

Parameters

entry – The entry itself.

property text

Return the text of the data pack

property all_annotations

An iterator of all annotations in this data pack.

Returns: Iterator of all annotations, of type Annotation.

property num_annotations

Number of annotations in this data pack.

Returns: (int) The number of annotations.

property all_links

An iterator of all links in this data pack.

Returns: Iterator of all links, of type Link.

property num_links

Number of links in this data pack.

Returns: Number of links.

property all_groups

An iterator of all groups in this data pack.

Returns: Iterator of all groups, of type Group.

property num_groups

Number of groups in this data pack.

Returns: Number of groups.

property all_generic_entries

An iterator of all generic entries in this data pack.

Returns: Iterator of all generic entries, of type Generics.

property num_generics_entries

Number of generics entries in this data pack.

Returns: Number of generics entries.

get_span_text(span)[source]

Get the text in the data pack contained in the span

Parameters

span (Span) – Span object which contains a begin and an end index

Returns

The text within this span

get_original_text()[source]

Get original unmodified text from the DataPack object.

Returns

The original text, obtained by applying the replace_back_operations of the DataPack object to the modified text.

get_original_span(input_processed_span, align_mode='relaxed')[source]

Function to obtain span of the original text that aligns with the given span of the processed text.

Parameters
  • input_processed_span – Span of the processed text for which the corresponding span of the original text is desired.

  • align_mode – The strictness criteria for alignment in ambiguous cases, that is, if a part of input_processed_span covers a part of an inserted span, then align_mode controls whether to use that span fully or ignore it completely, according to the following possible values:

    • "strict" – do not allow ambiguous input, raise a ValueError

    • "relaxed" – consider spans on both sides

    • "forward" – align looking forward, that is, ignore the span towards the left, but consider the span towards the right

    • "backward" – align looking backwards, that is, ignore the span towards the right, but consider the span towards the left

Returns

Span of the original text that aligns with input_processed_span

Example

  • Let o-up1, o-up2, … and m-up1, m-up2, … denote the unprocessed spans of the original and modified string respectively. Note that each o-up would have a corresponding m-up of the same size.

  • Let o-pr1, o-pr2, … and m-pr1, m-pr2, … denote the processed spans of the original and modified string respectively. Note that each o-pr is modified to a corresponding m-pr that may be of a different size than o-pr.

  • Original string: <--o-up1--> <-o-pr1-> <----o-up2----> <----o-pr2----> <-o-up3->

  • Modified string: <--m-up1--> <----m-pr1----> <----m-up2----> <-m-pr2-> <-m-up3->

  • Note that self.inverse_original_spans, which contains the modified processed spans and their corresponding original spans, would look like [(o-pr1, m-pr1), (o-pr2, m-pr2)].

>> data_pack = DataPack()
>> original_text = "He plays in the park"
>> data_pack.set_text(original_text,
>>                    lambda _: [(Span(0, 2), "She")])
>> data_pack.text
"She plays in the park"
>> input_processed_span = Span(0, len("She plays"))
>> orig_span = data_pack.get_original_span(input_processed_span)
>> data_pack.get_original_text()[orig_span.begin: orig_span.end]
"He plays"

classmethod deserialize(string)[source]

Deserialize a DataPack from a string. This internally calls the _deserialize() function from the BasePack.

Parameters

string – The serialized string of a data pack to be deserialized.

Returns

A data pack object deserialized from the string.
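
A round-trip sketch combining serialize() from BasePack with deserialize():

from forte.data.data_pack import DataPack

pack = DataPack(pack_name="demo")
pack.set_text("Serialize me.")

serialized = pack.serialize()
restored = DataPack.deserialize(serialized)
print(restored.text)  # "Serialize me."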

delete_entry(entry)[source]

Delete an Entry object from the DataPack. This finds the entry in the index and removes it from the index. Note that entries will only appear in the index if add_entry (or _add_entry_with_check) is called.

Please note that deleting an entry does not guarantee the deletion of related entries.

Parameters

entry (Entry) – An Entry object to be deleted from the pack.

get_data(context_type, request=None, skip_k=0)[source]

Fetch entries from the data_pack of type context_type.

Currently, we do not support Groups and Generics in the request.

Example

requests = {
    base_ontology.Sentence:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_data(base_ontology.Sentence, requests)
Parameters
  • context_type (str) – The granularity of the data context, which could be any Annotation type.

  • request (dict) –

    The entry types and fields required. The keys of the requests dict are the required entry types and the value should be either:

    • a list of field names or

    • a dict which accepts three keys: “fields”, “component”, and “unit”.

      • By setting “fields” (list), users specify the requested fields of the entry. If “fields” is not specified, only the default fields will be returned.

      • By setting “component” (list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components.

      • By setting “unit” (string), users can specify a unit by which the annotations are indexed.

    Note that for all annotation types, “text” and “span” fields are returned by default; for all link types, “child” and “parent” fields are returned by default.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Returns

A data generator, which generates one piece of data (a dict containing the required entries, fields, and context).

build_coverage_for(context_type, covered_type)[source]

Users can call this function to build a coverage index for specific types. The index provides an in-memory mapping from entries of context_type to the entries “covered” by them. See forte.data.data_pack.DataIndex for more details.

Parameters
  • context_type – The context/covering type.

  • covered_type – The entry to find under the context type.
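
A hedged sketch (assuming Sentence and Token from ft.onto.base_ontology); once the index is built, range queries such as pack.get(Token, sentence) can use the in-memory mapping instead of scanning all annotations:

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Sentence, Token

pack = DataPack()
pack.set_text("Forte is fun.")
sent = Sentence(pack, 0, 13)
pack.add_entry(sent)
for b, e in [(0, 5), (6, 8), (9, 12), (12, 13)]:
    pack.add_entry(Token(pack, b, e))

pack.build_coverage_for(Sentence, Token)
print([t.text for t in pack.get(Token, range_annotation=sent)])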

get(entry_type, range_annotation=None, components=None)[source]

This function is used to get data from a data pack with various methods.

Example

for sentence in input_pack.get(Sentence):
    token_entries = input_pack.get(entry_type=Token,
                                   range_annotation=sentence,
                                   components=token_component)
    ...

In the above code snippet, we get entries of type Token within each sentence, which were generated by token_component.

Parameters
  • entry_type (type) – The type of entries requested.

  • range_annotation (Annotation, optional) – The range of entries requested. If None, will return valid entries in the range of whole data_pack.

  • components (str or list, optional) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

BaseMeta

class forte.data.base_pack.BaseMeta(pack_name=None)[source]

Basic Meta information for both DataPack and MultiPack.

Parameters

pack_name – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

Meta

class forte.data.data_pack.Meta(pack_name=None, language='eng', span_unit='character')[source]

Basic Meta information associated with each instance of DataPack.

Parameters
  • pack_name – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

  • language – The language used by this data pack, default is English.

  • span_unit – The unit used for interpreting the Span object of this data pack. Default is character.

BaseIndex

class forte.data.base_pack.BaseIndex[source]

A set of indexes used in BasePack:

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  4. group_index, the index from group members to groups.

update_basic_index(entries)[source]

Build or update the basic indexes, including

(1) entry_index, the index from each tid to the corresponding entry;

(2) type_index, the index from each type to the entries of that type;

(3) component_index, the index from each component to the entries generated by that component.

Parameters

entries (list) – a list of entries to be added into the basic index.

build_link_index(links)[source]

Build the link_index, the index from child and parent nodes to links. It will build the index with the links in the dataset.

link_index consists of two sub-indexes: “child_index” is the index from child nodes to their corresponding links, and “parent_index” is the index from parent nodes to their corresponding links.

Parameters

links (list) – a list of links to be added into the index.

build_group_index(groups)[source]

Build group_index, the index from group members to groups.

link_index(tid, as_parent=True)[source]

Look up the link_index with key tid. If the link index is not built, this will raise a PackIndexError.

Parameters
  • tid (int) – the tid of the entry being looked up.

  • as_parent (bool) – If as_parent is True, will look up link_index["parent_index"] and return the tids of links whose parent is tid. Otherwise, will look up link_index["child_index"] and return the tids of links whose child is tid.

group_index(tid)[source]

Look up the group_index with key tid. If the index is not built, this will raise a PackIndexError.

update_link_index(links)[source]

Update link_index with the provided links, the index from child and parent to links.

link_index consists of two sub-indexes:
  • “child_index” is the index from child nodes to their corresponding links

  • “parent_index” is the index from parent nodes to their corresponding links.

Parameters

links (list) – a list of links to be added into the index.

update_group_index(groups)[source]

Build or update group_index, the index from group members to groups.

Parameters

groups (list) – a list of groups to be added into the index.

DataIndex

class forte.data.data_pack.DataIndex[source]

A set of indexes used in DataPack. Note that this class is used by the DataPack internally.

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. component_index, the index from each component to the entries generated by that component

  4. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  5. group_index, the index from group members to groups.

  6. _coverage_index, the index that maps from an annotation to the entries it covers. _coverage_index is a dict of dicts, where the key is a tuple of the outer entry type and the inner entry type. The outer entry type should be an annotation type. The value is a dict, where the key is the tid of the outer entry, and the value is a set of tids that are covered by the outer entry. We say an Annotation A covers an entry E if one of the following conditions is met: (1) E is of Annotation type, and E.begin >= A.begin and E.end <= A.end; (2) E is of Link type, and both E’s parent and child nodes are Annotations covered by A.

coverage_index(outer_type, inner_type)[source]

Get the coverage index from outer_type to inner_type.

Parameters
  • outer_type (type) – an annotation type.

  • inner_type (type) – an entry type.

Returns

If the coverage index does not exist, return None. Otherwise, return a dict.

build_coverage_index(data_pack, outer_type, inner_type)[source]

Build the coverage index from outer_type to inner_type.

Parameters
  • data_pack (DataPack) – The data pack to build coverage for.

  • outer_type (type) – an annotation type.

  • inner_type (type) – an entry type, can be Annotation, Link, Group.

have_overlap(entry1, entry2)[source]

Check whether the two annotations have overlap in span.

Parameters
  • entry1 (int or Annotation) – An Annotation object to be checked, or the tid of the Annotation.

  • entry2 (int or Annotation) – Another Annotation object to be checked, or the tid of the Annotation.

in_span(inner_entry, span)[source]

Check whether the inner entry is within the given span. Link entries are considered in a span if both the parent and the child are within the span. Group entries are considered in a span if all the members are within the span.

Parameters
  • inner_entry (int or Entry) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.
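
A hedged sketch of the two checks (assuming Token from ft.onto.base_ontology; passing entry objects directly avoids the need for a populated entry index):

from forte.data.data_pack import DataPack, DataIndex
from forte.data.span import Span
from ft.onto.base_ontology import Token

pack = DataPack()
pack.set_text("Forte is fun.")
t1 = Token(pack, 0, 5)   # "Forte"
t2 = Token(pack, 6, 8)   # "is"
pack.add_entry(t1)
pack.add_entry(t2)

index = DataIndex()
print(index.have_overlap(t1, t2))     # False: [0, 5) and [6, 8) are disjoint
print(index.in_span(t1, Span(0, 8)))  # True: [0, 5) lies within [0, 8)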

Readers

BaseReader

class forte.data.readers.base_reader.BaseReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic data reader class. To be inherited by all data readers.

Parameters
  • from_cache (bool, optional) – Decide whether to read from cache if cache file exists. By default (False), the reader will only read from the original file and use the cache file path for caching, it will not read from the cache_directory. If True, the reader will try to read a datapack from the caching file.

  • cache_directory (str, optional) – The base directory to place the path of the caching files. Each collection is contained in one cached file, under this directory. The cached location for each collection is computed by _cache_key_function(). Note: A collection is the data returned by _collect().

  • append_to_cache (bool, optional) – Decide whether to append write if the cache file already exists. By default (False), we will overwrite the existing caching file. If True, we will append the serialized datapack to the end of the caching file.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (Config) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

{
    "name": "reader"
}
parse_pack(collection)[source]

Calls _parse_pack() to create packs from the collection. This internally sets up the component metadata. Users should implement the _parse_pack() method.

text_replace_operation(text)[source]

Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

Parameters

text – The original data text to be cleaned.

Returns (List[Tuple[Tuple[int, int], str]]): the replacement operations.
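
An illustrative implementation of this contract (written standalone for brevity; in practice this would override the method on a reader subclass, and the "<br>" pattern is arbitrary):

from typing import List, Tuple
from forte.data.span import Span

def text_replace_operation(text: str) -> List[Tuple[Span, str]]:
    # Replace every "<br>" tag with a single space; each (span, str)
    # pair replaces the span's content with the given string.
    ops = []
    start = text.find("<br>")
    while start != -1:
        ops.append((Span(start, start + 4), " "))
        start = text.find("<br>", start + 4)
    return ops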

iter(*args, **kwargs)[source]

An iterator over the entire dataset, yielding all the Packs read from the data source(s). If not reading from cache, this should call _collect().

Parameters
  • args – One or more input data sources, for example, most DataPack readers accept data_source as file/folder path.

  • kwargs – Additional keyword arguments for the reader.

Returns: An iterator of DataPacks.

cache_data(collection, pack, append)[source]

Specify the path to the cache directory.

After you call this method, the dataset reader will use its cache_directory to store a cache of BasePack read from every document passed to read(), serialized as one string-formatted BasePack. If the cache file for a given file_path exists, we read the BasePack from the cache. If the cache file does not exist, we will create it on our first pass through the data.

Parameters
  • collection – The collection is a piece of data from the _collect() function, to be read to produce DataPack(s). During caching, a cache key is computed based on the data in this collection.

  • pack – The data pack to be cached.

  • append – Whether to allow appending to the cache.

read_from_cache(cache_filename)[source]

Reads one or more Packs from cache_filename, and yields Pack(s) from the cache file.

Parameters

cache_filename – Path to the cache file.

Returns: List of cached data packs.

finish(resources)[source]

The pipeline will call this function at the end of the pipeline to notify all the components. The user can implement this function to release resources used by this component. The component can also add objects to the resources.

Parameters

resources (Resources) – A global resource registry.

PackReader

class forte.data.readers.base_reader.PackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

A Pack Reader reads data into DataPack.

set_text(pack, text)[source]

Assign the text value to the DataPack. This function will pass the text_replace_operation to the DataPack to conduct the pre-processing step.

Parameters
  • pack – The DataPack to assign value for.

  • text – The original text to be recorded in this dataset.
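
A minimal sketch of a custom PackReader (LineReader is hypothetical; the hooks _collect, _parse_pack, and _cache_key_function are the private methods referenced earlier in this section):

from typing import Iterator
from forte.data.data_pack import DataPack
from forte.data.readers.base_reader import PackReader

class LineReader(PackReader):
    """Reads each line of a text file into its own DataPack."""

    def _collect(self, file_path: str) -> Iterator[str]:
        # Each collection here is one line of the input file.
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                yield line.strip()

    def _parse_pack(self, line: str) -> Iterator[DataPack]:
        pack = DataPack()
        self.set_text(pack, line)  # applies text_replace_operation
        yield pack

    def _cache_key_function(self, line: str) -> str:
        return str(hash(line))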

MultiPackReader

class forte.data.readers.base_reader.MultiPackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic MultiPack data reader class. To be inherited by all data readers which return MultiPack.

CoNLL03Reader

class forte.data.readers.conll03_reader.CoNLL03Reader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

CoNLL03Reader is designed to read in the CoNLL03 dataset.

The dataset is from the following paper, Sang, Erik F., and Fien De Meulder. “Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition.” arXiv preprint cs/0306050 (2003).

Data could be downloaded from https://deepai.org/dataset/conll-2003-english

Data format: data files contain one line “-DOCSTART- -X- -X- O” to represent the start of a document. After that, each line contains one word, and an empty line represents the start of a new sentence. Each line contains four fields: the word, its part-of-speech tag, its chunk tag, and its named entity tag.

Example

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O

ConllUDReader

class forte.data.readers.conllu_ud_reader.ConllUDReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

ConllUDReader is designed to read in the Universal Dependencies 2.4 dataset.

BaseDeserializeReader

class forte.data.readers.deserialize_reader.BaseDeserializeReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

RawDataDeserializeReader

class forte.data.readers.deserialize_reader.RawDataDeserializeReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

This reader assumes the data passed in are raw DataPack strings.

RecursiveDirectoryDeserializeReader

class forte.data.readers.deserialize_reader.RecursiveDirectoryDeserializeReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

This reader finds all the files under the directory and reads each one as a DataPack.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

{
    "name": "reader"
}

HTMLReader

class forte.data.readers.html_reader.HTMLReader(*args, **kwargs)[source]

HTMLReader is designed to read in a list of HTML strings.

It takes in a list of HTML strings, cleans the HTML tags, and stores the cleaned text in the pack.

text_replace_operation(text)[source]

Replace HTML tag locations with blank strings.

Parameters

text – The original html text to be cleaned.

Returns: List[Tuple[Span, str]]: the replacement operations

MSMarcoPassageReader

class forte.data.readers.ms_marco_passage_reader.MSMarcoPassageReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

MultiPackSentenceReader

class forte.data.readers.multipack_sentence_reader.MultiPackSentenceReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

MultiPackSentenceReader is designed to read a directory of files and convert each file’s contents into a data pack. This class yields a multipack with pack input_pack_name containing the file’s contents. It additionally packs an empty pack with name output_pack_name into the multipack.

classmethod default_configs()[source]

Returns a dictionary of hyperparameters with default values.

{
    "name": "multipack_sentence_reader",
    "input_pack_name": "input_src",
    "output_pack_name": "output_tgt"
}

Here:

“name”: str

Name of the reader

“input_pack_name”: str

Name of the input pack. This name can be used to retrieve the input pack from the multipack.

“output_pack_name”: str

Name of the output pack. This name can be used to retrieve the output pack from the multipack.

MultiPackTerminalReader

class forte.data.readers.multipack_terminal_reader.MultiPackTerminalReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

A reader designed to read text from the terminal.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

{
    "name": "reader"
}

OntonotesReader

class forte.data.readers.ontonotes_reader.OntonotesReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

OntonotesReader is designed to read in the English OntoNotes v5.0 data in the datasets used by the CoNLL 2011/2012 shared tasks. To use this Reader, you must follow the instructions provided here (v12 release), which will allow you to download the CoNLL-style annotations for the OntoNotes v5.0 release – LDC2013T19.tgz obtained from LDC.

Parameters

column_format

A list of strings indicating which field each column in a line corresponds to. The length of the list should be equal to the number of columns in the files to be read. Available field types include:

  • "document_id"

  • "part_number"

  • "word"

  • "pos_tag"

  • "lemmatised_word"

  • "framenet_id"

  • "word_sense"

  • "speaker"

  • "entity_label"

  • "coreference"

  • "*predicate_labels"

Field types marked with * indicate a variable-column field: it could span multiple fields. Only one such field is allowed in the format specification.

If a column should be ignored, fill in None at the corresponding position.

class ParsedFields(word, predicate_labels, document_id, part_number, pos_tag, lemmatised_word, framenet_id, word_sense, speaker, entity_label, coreference)[source]
property word

Alias for field number 0

property predicate_labels

Alias for field number 1

property document_id

Alias for field number 2

property part_number

Alias for field number 3

property pos_tag

Alias for field number 4

property lemmatised_word

Alias for field number 5

property framenet_id

Alias for field number 6

property word_sense

Alias for field number 7

property speaker

Alias for field number 8

property entity_label

Alias for field number 9

property coreference

Alias for field number 10

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (Config) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dictionary of default hyperparameters.

{
    "name": "reader",
    "column_format": [
        "document_id",
        "part_number",
        None,
        "word",
        "pos_tag",
        None,
        "lemmatised_word",
        "framenet_id",
        "word_sense",
        "speaker",
        "entity_label",
        "*predicate_labels",
        "coreference",
    ]
}

Here:

“column_format”: list

A List of default column types.

Note

A None field means that column in the dataset file will be ignored during parsing.

PlainTextReader

class forte.data.readers.plaintext_reader.PlainTextReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

PlainTextReader is designed to read in plain text datasets.

text_replace_operation(text)[source]

Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

Parameters

text – The original data text to be cleaned.

Returns (List[Tuple[Tuple[int, int], str]]): the replacement operations.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

{
    "name": "reader"
}

ProdigyReader

class forte.data.readers.prodigy_reader.ProdigyReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

ProdigyReader is designed to read in Prodigy output text.

RACEMultiChoiceQAReader

class forte.data.readers.race_multi_choice_qa_reader.RACEMultiChoiceQAReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

RACEMultiChoiceQAReader is designed to read in the RACE multi-choice QA dataset.

StringReader

class forte.data.readers.string_reader.StringReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

StringReader is designed to read in a list of string variables.
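
An illustrative end-to-end usage (assuming the Pipeline class from forte.pipeline):

from forte.pipeline import Pipeline
from forte.data.readers.string_reader import StringReader

pipeline = Pipeline()
pipeline.set_reader(StringReader())
pipeline.initialize()

for pack in pipeline.process_dataset(["Hello world.", "A second document."]):
    print(pack.text)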

SemEvalTask8Reader

class forte.data.readers.sem_eval_task8_reader.SemEvalTask8Reader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

SemEvalTask8Reader is designed to read in SemEval Task-8 dataset. The data can be obtained here: http://www.kozareva.com/downloads.html

Hendrickx, Iris, et al. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. https://www.aclweb.org/anthology/S10-1006.pdf

An example of the dataset:

8   "<e1>People</e1> have been moving back into <e2>downtown</e2>."
Entity-Destination(e1,e2)
Comment:

This example will be converted into one Sentence, “People have been moving back into downtown.”, and one RelationLink with link = RelationLink(parent=People, child=downtown) and link.rel_type = Entity-Destination in the DataPack.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

{
    "name": "reader"
}

OpenIEReader

class forte.data.readers.openie_reader.OpenIEReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

OpenIEReader is designed to read in the Open IE dataset used by the Open Information Extraction task. The related paper can be found here. The related source code for generating this dataset can be found here. To use this Reader, you must follow the dataset format. Each line in the dataset should contain the following fields:

<sentence>\t<predicate_head>\t<full_predicate>\t<arg1>\t<arg2>...

You can also find the dataset format here.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (Config) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

{
    "name": "reader"
}

DataPack Dataset

DataPackIterator

class forte.data.data_pack_dataset.DataPackIterator(pack_iterator, context_type, request=None, skip_k=0)[source]

An iterator over single data examples from multiple data packs.

Parameters
  • pack_iterator (Iterator[DataPack]) – An iterator of DataPack.

  • context_type – The granularity of a single example which could be any Annotation type. For example, it can be Sentence, then each training example will represent the information of a sentence.

  • request – The request of type Dict sent to DataPack to query specific data.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Returns

An Iterator that each time produces a Tuple of a tid (of type int) and a data pack (of type DataPack).

Here is an example usage:
file_path: str = "data_samples/data_pack_dataset_test"
reader = CoNLL03Reader()
context_type = Sentence
request = {Sentence: []}
skip_k = 0

train_pl: Pipeline = Pipeline()
train_pl.set_reader(reader)
train_pl.initialize()
pack_iterator: Iterator[PackType] = train_pl.process_dataset(file_path)

iterator: DataPackIterator = DataPackIterator(pack_iterator,
                                              context_type,
                                              request,
                                              skip_k)

for tid, data_pack in iterator:
    ...  # process tid and data_pack

Note

For parameters context_type, request, skip_k, please refer to get_data() in DataPack.

DataPackDataset

class forte.data.data_pack_dataset.DataPackDataset(data_source, feature_schemes, hparams=None, device=None)[source]

A dataset representing data packs. Iterating a DataIterator over this DataPackDataset will produce an Iterator over batches of examples parsed by a reader from the given data packs.

Parameters
  • data_source – A data source of type DataPackDataSource.

  • feature_schemes (dict) – A Dict containing all the information to do data pre-processing. This is exactly the same as the schemes in feature_resource. Please refer to feature_resource() in TrainPreprocessor for details.

  • hparams – A dict or instance of texar.torch.HParams containing hyperparameters. See default_hparams() in DatasetBase for the defaults.

  • device – The device of the produced batches. For GPU training, set to current CUDA device.

process(raw_example)[source]

Given an input, which is a single data example, extract features from it.

Parameters

raw_example (tuple(dict, DataPack)) –

A Tuple where

The first element is a Dict produced by get_data() in DataPack.

The second element is an instance of type DataPack.

Returns

A Dict mapping from user-specified tags to the Feature extracted.

Note

Please refer to feature_resource() in TrainPreprocessor for details about user-specified tags.

collate(examples)[source]

Given a batch of output from process(), produce pre-processed data as well as masks and features.

Parameters

examples – A List of result from process().

Returns

A texar Batch. It can be treated as a Dict with the following structure:

{
    "tag_a": {
        "data": <tensor>,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    },
    "tag_b": {
        "data": Tensor,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    }
}
"data": List or np.ndarray or torch.Tensor

The pre-processed data.

Please refer to Converter for details.

"masks": np.ndarray or torch.Tensor

All the masks for pre-processed data.

Please refer to Converter for details.

"features": List[Feature]

A List of Feature. This is useful when users want to do customized pre-processing.

Please refer to Feature for details.

Note

The first level key in returned batch is the user-specified tags. Please refer to feature_resource() in TrainPreprocessor for details about user-specified tags.

Batchers

ProcessingBatcher

class forte.data.batchers.ProcessingBatcher(cross_pack=True)[source]

This defines the basic interface of the Batcher used in BatchProcessor. This Batcher only batches data sequentially. It receives new packs dynamically and caches the current packs so that the processors can pack prediction results into the data packs.

Parameters

cross_pack (bool, optional) – whether to allow batches to go across data packs when there is not enough data at the end.

initialize(_)[source]

The implementation should initialize the batcher and set up its internal states. This batcher will be called at the pipeline initialize stage.

flush()[source]

Flush the remaining data.

Returns

A tuple containing datapack, instance, and batch data. In the basic ProcessingBatcher, to be compatible with existing implementations, instance is not needed, so None is used.

get_batch(input_pack, context_type, requests)[source]

Returns an iterator of tuples, each containing datapack, instance, and batch data. In the basic ProcessingBatcher, to be compatible with existing implementations, instance is not needed, so None is used.

Data Utilities

maybe_download

forte.data.data_utils.maybe_download(urls, path, filenames=None, extract=False)[source]

Downloads a set of files.

Parameters
  • urls – A (list of) URLs to download files.

  • path – The destination path to save the files.

  • filenames – A (list of) strings of the file names. If given, must have the same length as urls. If None, filenames are extracted from urls.

  • extract – Whether to extract compressed files.

Returns

A list of paths to the downloaded files.
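
An illustrative call (the URL is a placeholder):

from forte.data.data_utils import maybe_download

paths = maybe_download(
    urls=["https://example.com/dataset.zip"],  # placeholder URL
    path="data/",
    extract=True,  # also unpack the downloaded archive
)
print(paths)  # local paths of the downloaded files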

batch_instances

forte.data.data_utils_io.batch_instances(instances)[source]

Merge a list of instances.

merge_batches

forte.data.data_utils_io.merge_batches(batches)[source]

Merge a list of batches.

slice_batch

forte.data.data_utils_io.slice_batch(batch, start, length)[source]

Return a sliced batch of size length from start in batch.

dataset_path_iterator

forte.data.data_utils_io.dataset_path_iterator(dir_path, file_extension)[source]

An iterator returning the paths of files with the given file extension under dir_path.
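
Illustrative usage (the directory and extension are placeholders):

from forte.data.data_utils_io import dataset_path_iterator

for file_path in dataset_path_iterator("data_samples/", ".txt"):
    print(file_path)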