Data

Ontology

base

class forte.data.span.Span(begin, end)[source]

A class recording the span of annotations. Span objects can be totally ordered according to their begin as the first sort key and end as the second sort key.

Parameters
  • begin (int) – The offset of the first character in the span.

  • end (int) – The offset of the last character in the span + 1. So the span is a left-closed and right-open interval [begin, end).
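
For illustration, a minimal sketch of the ordering described above (the offsets are arbitrary):

from forte.data.span import Span

a = Span(0, 5)   # covers characters 0..4
b = Span(0, 7)   # same begin, larger end
c = Span(3, 7)

# Spans sort by `begin` first, then by `end`, as described above.
assert sorted([c, b, a]) == [a, b, c]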

core

Entry

class forte.data.ontology.core.Entry(pack)[source]

The base class inherited by all NLP entries. This is the main data type for all in-text NLP analysis results. The main sub-types are Annotation, Link, Generics, and Group.

A forte.data.ontology.top.Annotation object represents a span in text.

A forte.data.ontology.top.Link object represents a binary link relation between two entries.

A forte.data.ontology.top.Generics object represents a general entry that is not associated with a particular text span.

A forte.data.ontology.top.Group object represents a collection of multiple entries.

Main Attributes:

  • embedding: The embedding vectors (numpy array of floats) of this entry.

Parameters

pack (~ContainerType) – Each entry should be associated with one pack upon creation.

property embedding

Get the embedding vectors (numpy array of floats) of the entry.

property tid

Get the id of this entry.

Return type

int

Returns

id of the entry

property pack_id

Get the id of the pack that contains this entry.

Return type

int

Returns

id of the pack that contains this entry.

This function is normally called after deserialization. It can be called when the pack reference of this entry is ready (i.e. after set_pack). The purpose is to convert the Pointer objects into actual entries.

as_pointer(from_entry)[source]

Return a pointer of this entry relative to the from_entry.

Parameters

from_entry (Entry) – the entry to point from.

Returns

A pointer to this entry from the from_entry.

entry_type()[source]

Return the full name of this entry type.

Return type

str

abstract set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent (Entry) – The parent entry.

abstract set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child (Entry) – The child entry

abstract get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link from the given DataPack.

abstract get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link from the given DataPack.

class forte.data.ontology.core.BaseGroup(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members, with no duplicates allowed.

This is the BaseGroup interface. Specific member constraints are defined in the inherited classes.

abstract add_member(member)[source]

Add one entry to the group.

Parameters

member (~EntryType) – One member to be added to the group.

add_members(members)[source]

Add members to the group.

Parameters

members (Iterable[~EntryType]) – An iterator of members to be added to the group.

abstract get_members()[source]

Get the member entries in the group.

Return type

List[~EntryType]

Returns

Instances of Entry that are the members of the group.

top

class forte.data.ontology.top.Generics(pack)[source]
class forte.data.ontology.top.Annotation(pack, begin, end)[source]

Annotation type entries, such as “token”, “entity mention” and “sentence”. Each annotation has a Span corresponding to its offset in the text.

Parameters
  • pack (~PackType) – The container that this annotation will be added to.

  • begin (int) – The offset of the first character in the annotation.

  • end (int) – The offset of the last character in the annotation + 1.

get(entry_type, components=None, include_sub_type=True)[source]

This function wraps the get() method to find entries “covered” by this annotation. See that method for more information.

Example

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from each sentence created by NLTKTokenizer.
    token_entries = sentence.get(
        entry_type=Token,
        component='NLTKTokenizer')
    ...

In the above code snippet, we get entries of type Token within each sentence which were generated by NLTKTokenizer. You can consider building a coverage index between Token and Sentence if this snippet is used frequently.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]

class forte.data.ontology.top.AudioAnnotation(pack, begin, end)[source]

AudioAnnotation type entries, such as “recording” and “audio utterance”. Each audio annotation has a Span corresponding to its offset in the audio. Most methods in this class are the same as the ones in Annotation, except that it replaces property text with audio.

Parameters
  • pack (~PackType) – The container that this audio annotation will be added to.

  • begin (int) – The offset of the first sample in the audio annotation.

  • end (int) – The offset of the last sample in the audio annotation + 1.

get(entry_type, components=None, include_sub_type=True)[source]

This function wraps the get() method to find entries “covered” by this audio annotation. See that method for more information. For usage details, refer to forte.data.ontology.top.Annotation.get().

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]
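
For illustration, a hedged sketch that annotates part of an audio payload using the generic AudioAnnotation type directly (concrete ontologies normally subclass it); the waveform and offsets are arbitrary:

import numpy as np
from forte.data.data_pack import DataPack
from forte.data.ontology.top import AudioAnnotation

pack = DataPack()
# One second of silence at 16 kHz as a stand-in audio payload.
pack.set_audio(audio=np.zeros(16000, dtype=np.float32), sample_rate=16000)

# Annotate the first half second (8000 samples) of the audio.
utterance = pack.add_entry(AudioAnnotation(pack, 0, 8000))
assert utterance.audio.shape == (8000,)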

class forte.data.ontology.top.Link(pack, parent=None, child=None)[source]

Link type entries, such as “predicate link”. Each link has a parent node and a child node.

Parameters
  • pack (~PackType) – The container that this annotation will be added to.

  • parent (Optional[Entry]) – the parent entry of the link.

  • child (Optional[Entry]) – the child entry of the link.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent (Entry) – The parent entry.

set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child (Entry) – The child entry.

property parent

Get tid of the parent node. To get the object of the parent node, call get_parent().

property child

Get tid of the child node. To get the object of the child node, call get_child().

get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link.
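
For illustration, a minimal sketch of a parent/child link between two annotations in one DataPack, using the generic Annotation and Link types directly (concrete ontologies normally subclass them); the text and offsets are arbitrary:

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Annotation, Link

pack = DataPack()
pack.set_text("The cat sat.")

subject = pack.add_entry(Annotation(pack, 0, 7))   # "The cat"
verb = pack.add_entry(Annotation(pack, 8, 11))     # "sat"

link = pack.add_entry(Link(pack, parent=subject, child=verb))
assert link.get_parent().tid == subject.tid
assert link.get_child().tid == verb.tid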

class forte.data.ontology.top.Group(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members, with no duplicates allowed.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.
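
For illustration, a minimal sketch of grouping entries with the generic Annotation and Group types (concrete ontologies such as coreference groups normally subclass them); the text and offsets are arbitrary:

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Annotation, Group

pack = DataPack()
pack.set_text("Alice met Bob. She greeted him.")

alice = pack.add_entry(Annotation(pack, 0, 5))    # "Alice"
she = pack.add_entry(Annotation(pack, 15, 18))    # "She"

group = pack.add_entry(Group(pack))
group.add_members([alice, she])
assert {m.text for m in group.get_members()} == {"Alice", "She"}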

class forte.data.ontology.top.MultiPackGeneric(pack)[source]
class forte.data.ontology.top.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.

class forte.data.ontology.top.MultiPackLink(pack, parent=None, child=None)[source]

This is used to link entries in a MultiPack, which is designed to support cross-pack linking. This can support applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers, with one additional index indicating which pack the node comes from.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

parent_id()[source]

Return the tid of the parent entry.

Return type

int

Returns

The tid of the parent entry.

child_id()[source]

Return the tid of the child entry.

Return type

int

Returns

The tid of the child entry.

parent_pack_id()[source]

Return the pack_id of the parent pack.

Return type

int

Returns

The pack_id of the parent pack.

child_pack_id()[source]

Return the pack_id of the child pack.

Return type

int

Returns

The pack_id of the child pack.

set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

Parameters

parent (Entry) – The parent of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

Parameters

child (Entry) – The child of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link.
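
For illustration, a hedged sketch of linking entries across two packs with MultiPackLink, again using the generic Annotation type for brevity; the texts and offsets are arbitrary:

from forte.data.multi_pack import MultiPack
from forte.data.ontology.top import Annotation, MultiPackLink

m_pack = MultiPack()
en_pack = m_pack.add_pack(ref_name="en")
fr_pack = m_pack.add_pack(ref_name="fr")
en_pack.set_text("Hello world.")
fr_pack.set_text("Bonjour le monde.")

en_sent = en_pack.add_entry(Annotation(en_pack, 0, 12))
fr_sent = fr_pack.add_entry(Annotation(fr_pack, 0, 17))

link = m_pack.add_entry(MultiPackLink(m_pack, en_sent, fr_sent))
assert link.get_parent().tid == en_sent.tid
assert link.get_child().tid == fr_sent.tid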

class forte.data.ontology.top.Query(pack)[source]

An entry type representing queries for information retrieval tasks.

Parameters

pack (~PackType) – Data pack reference to which this query will be added

add_result(pid, score)[source]

Set the result score for a particular pack (based on the pack id).

Parameters
  • pid (str) – the pack id.

  • score (float) – the score for the pack

Returns

None

update_results(pid_to_score)[source]

Updates the results for this query.

Parameters

pid_to_score (Dict[str, float]) – A dict containing pack id -> score mapping
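
For illustration, a minimal sketch of recording retrieval scores on a Query entry; the query text, pack ids, and scores are placeholders:

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Query

pack = DataPack()
pack.set_text("what is the capital of France")

query = pack.add_entry(Query(pack))
query.add_result("doc_pack_1", 0.87)
query.update_results({"doc_pack_2": 0.42, "doc_pack_3": 0.13})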

Packs

BasePack

class forte.data.base_pack.BasePack(pack_name=None)[source]

The base class of DataPack and MultiPack.

Parameters

pack_name (Optional[str]) – a string name of the pack.

abstract delete_entry(entry)[source]

Remove the entry from the pack.

Parameters

entry (~EntryType) – The entry to be removed.

Returns

None

add_entry(entry, component_name=None)[source]

Add an Entry object to the BasePack object. Allow duplicate entries in a pack.

Parameters
  • entry (Entry) – An Entry object to be added to the pack.

  • component_name (Optional[str]) – A name to record that the entry is created by this component.

Return type

~EntryType

Returns

The input entry itself

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that have not been added to the pack manually.

Parameters

component (Optional[str]) – Overwrite the component record with this.

Returns

None

to_string(drop_record=False, json_method='jsonpickle', indent=None)[source]

Return the string representation (json encoded) of this pack.

Parameters
  • drop_record (Optional[bool]) – Whether to drop the creation records, default is False.

  • json_method (str) – What method is used to convert the data pack to json. Only supports jsonpickle for now. The default value is jsonpickle.

  • indent (Optional[int]) – The indent used for json string.

Returns: String representation of the data pack.

Return type

str

serialize(output_path, zip_pack=False, drop_record=False, serialize_method='jsonpickle', indent=None)[source]

Serializes the data pack to the provided path. The output of this function depends on the serialization method chosen.

Parameters
  • output_path (Union[str, Path]) – The path to write data to.

  • zip_pack (bool) – Whether to compress the result with gzip.

  • drop_record (bool) – Whether to drop the creation records, default is False.

  • serialize_method (str) – The method used to serialize the data. Currently supports jsonpickle (outputs str) and Python’s built-in pickle (outputs bytes).

  • indent (Optional[int]) – Whether to indent the file if written as JSON.

Returns: Results of serialization.
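
For illustration, a minimal round-trip sketch using the DataPack subclass (whose deserialize() is documented below); the file name is illustrative:

from forte.data.data_pack import DataPack

pack = DataPack(pack_name="example")
pack.set_text("Forte stores text and annotations together.")

# Write the pack as (optionally indented) jsonpickle output, then load it back.
pack.serialize("example_pack.json", zip_pack=False, indent=2)
restored = DataPack.deserialize("example_pack.json")
assert restored.text == pack.text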

set_control_component(component)[source]

Record the current component that is taking control of this pack.

Parameters

component (str) – The component that is going to take control

Returns

None

record_field(entry_id, field_name)[source]

Record who modifies the entry. This will be called in Entry.

Parameters
  • entry_id (int) – The id of the entry.

  • field_name (str) – The name of the field modified.

Returns

None

on_entry_creation(entry, component_name=None)[source]

Call this when adding a new entry; it will be called in Entry when its __init__ function is called.

Parameters
  • entry (Entry) – The entry to be added.

  • component_name (Optional[str]) – A name to record that the entry is created by this component.

Returns

None

regret_creation(entry)[source]

Will remove the entry from the pending entries internal state of the pack.

Parameters

entry (~EntryType) – The entry that will not be added to the pack anymore.

Returns

None

get_entry(tid)[source]

Look up the entry_index with key tid. The specific implementation depends on the actual class.

Return type

~EntryType

abstract get(entry_type, **kwargs)[source]

Implementations of this method should provide a way to obtain the entries following the entry ordering. If there are orders defined between the entries, they should be used first. Otherwise, the insertion order should be used (FIFO).

Parameters

entry_type (Union[str, Type[~EntryType]]) – The type of the entry to obtain.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the provided arguments.

get_single(entry_type)[source]

Take a single entry of type entry_type from this data pack. This is useful when the target entry type appears only once in the DataPack, e.g., a Document entry, or when you just intend to take the first one.

Parameters

entry_type (Union[str, Type[~EntryType]]) – The entry type to be retrieved.

Return type

~EntryType

Returns

A single data entry.
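
For illustration, a minimal sketch of get_single(), assuming the Document annotation type from ft.onto.base_ontology is available:

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Document

pack = DataPack()
pack.set_text("A single-document pack.")
pack.add_entry(Document(pack, 0, len(pack.text)))

doc = pack.get_single(Document)
assert doc.text == "A single-document pack."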

get_ids_by_creator(component)[source]

Look up the component_index with key component. This will return the entry ids that are created by the component.

Parameters

component (str) – The component (creator) to find ids for.

Return type

Set[int]

Returns

A set of entry ids that are created by the component.

is_created_by(entry, components)[source]

Check if the entry is created by any of the provided components.

Parameters
  • entry (Entry) – The entry to check.

  • components – The component (creator) name(s) to check against.

Return type

bool

Returns

True if the entry is created by any of the provided components, False otherwise.

get_entries_from(component)[source]

Look up all entries from the component as an unordered set.

Parameters

component (str) – The component (creator) to get the entries. It is normally the full qualified name of the creator class, but it may also be customized based on the implementation.

Return type

Set[~EntryType]

Returns

The set of entries that are created by the input component.

get_ids_from(components)[source]

Look up entries using a list of components (creators). This will find each creator iteratively and combine the result.

Parameters

components (List[str]) – The list of components to find.

Return type

Set[int]

Returns

The set of entry ids that are created from these components.

get_entries_of(entry_type, exclude_sub_types=False)[source]

Return all entries of this particular type, in no particular order. If you need to get the annotations based on the entry ordering, use forte.data.base_pack.BasePack.get().

Parameters
  • entry_type (Type[~EntryType]) – The type of the entry you are looking for.

  • exclude_sub_types – Whether to ignore the inherited sub types of the provided entry_type. Default is False.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the type constraint.

DataPack

class forte.data.data_pack.DataPack(pack_name=None)[source]

A DataPack contains a piece of natural language text and a collection of NLP entries (annotations, links, and groups). The natural language text could be a document, paragraph or in any other granularity.

Parameters

pack_name (Optional[str]) – A name for this data pack.

property text

Return the text of the data pack.

Return type

str

property audio

Return the audio of the data pack.

Return type

Optional[ndarray]

property sample_rate

Return the sample rate of the audio data.

Return type

Optional[int]

property all_annotations

An iterator of all annotations in this data pack.

Returns: Iterator of all annotations, of type Annotation.

Return type

Iterator[Annotation]

property num_annotations

Number of annotations in this data pack.

Returns: Number of annotations.

Return type

int

property all_links

An iterator of all links in this data pack.

Returns: Iterator of all links, of type Link.

Return type

Iterator[Link]

property num_links

Number of links in this data pack.

Returns: Number of links.

Return type

int

property all_groups

An iterator of all groups in this data pack.

Returns: Iterator of all groups, of type Group.

Return type

Iterator[Group]

property num_groups

Number of groups in this data pack.

Returns: Number of groups.

property all_generic_entries

An iterator of all generic entries in this data pack.

Returns: Iterator of all generic entries, of type Generics.

Return type

Iterator[Generics]

property num_generics_entries

Number of generics entries in this data pack.

Returns: Number of generics entries.

property all_audio_annotations

An iterator of all audio annotations in this data pack.

Returns: Iterator of all audio annotations, of type AudioAnnotation.

Return type

Iterator[AudioAnnotation]

property num_audio_annotations

Number of audio annotations in this data pack.

Returns: Number of audio annotations.

get_span_text(begin, end)[source]

Get the text in the data pack contained in the span.

Parameters
  • begin (int) – begin index to query.

  • end (int) – end index to query.

Return type

str

Returns

The text within this span.

get_span_audio(begin, end)[source]

Get the audio in the data pack contained in the span. begin and end represent the starting and ending indices of the span in audio payload respectively. Each index corresponds to one sample in audio time series.

Parameters
  • begin (int) – begin index to query.

  • end (int) – end index to query.

Return type

ndarray

Returns

The audio within this span.

set_audio(audio, sample_rate)[source]

Set the audio payload and sample rate of the DataPack object.

Parameters
  • audio (ndarray) – A numpy array storing the audio waveform.

  • sample_rate (int) – An integer specifying the sample rate.
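
For illustration, a minimal sketch of attaching audio to a DataPack and slicing a span of samples; the waveform and sample rate are placeholders:

import numpy as np
from forte.data.data_pack import DataPack

pack = DataPack()
waveform = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
pack.set_audio(audio=waveform, sample_rate=16000)

first_100_ms = pack.get_span_audio(0, 1600)   # 1600 samples at 16 kHz = 100 ms
assert first_100_ms.shape == (1600,)
assert pack.sample_rate == 16000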

get_original_text()[source]

Get original unmodified text from the DataPack object.

Returns

The original text, obtained by applying the replace_back_operations of the DataPack object to the modified text.

get_original_span(input_processed_span, align_mode='relaxed')[source]

Function to obtain span of the original text that aligns with the given span of the processed text.

Parameters
  • input_processed_span (Span) – Span of the processed text for which the corresponding span of the original text is desired.

  • align_mode (str) –

    The strictness criteria for alignment in the ambiguous cases, that is, if a part of input_processed_span spans a part of the inserted span, then align_mode controls whether to use the span fully or ignore it completely according to the following possible values:

    • "strict" - do not allow ambiguous input, give ValueError.

    • "relaxed" - consider spans on both sides.

    • "forward" - align looking forward, that is, ignore the span towards the left, but consider the span towards the right.

    • "backward" - align looking backwards, that is, ignore the span towards the right, but consider the span towards the left.

Returns

Span of the original text that aligns with input_processed_span

Example

  • Let o-up1, o-up2, … and m-up1, m-up2, … denote the unprocessed spans of the original and modified string respectively. Note that each o-up would have a corresponding m-up of the same size.

  • Let o-pr1, o-pr2, … and m-pr1, m-pr2, … denote the processed spans of the original and modified string respectively. Note that each o-pr is modified to a corresponding m-pr that may be of a different size than o-pr.

  • Original string: <--o-up1--> <-o-pr1-> <----o-up2----> <----o-pr2----> <-o-up3->

  • Modified string: <--m-up1--> <----m-pr1----> <----m-up2----> <-m-pr2-> <-m-up3->

  • Note that self.inverse_original_spans that contains modified processed spans and their corresponding original spans, would look like - [(o-pr1, m-pr1), (o-pr2, m-pr2)]

>>> data_pack = DataPack()
>>> original_text = "He plays in the park"
>>> data_pack.set_text(original_text,
...                    lambda _: [(Span(0, 2), "She")])
>>> data_pack.text
"She plays in the park"
>>> input_processed_span = Span(0, len("She plays"))
>>> orig_span = data_pack.get_original_span(input_processed_span)
>>> data_pack.get_original_text()[orig_span.begin: orig_span.end]
"He plays"

classmethod deserialize(data_source, serialize_method='jsonpickle', zip_pack=False)[source]

Deserialize a Data Pack from a data source. This internally calls the _deserialize() function from BasePack.

Parameters
  • data_source (Union[Path, str]) – The path storing data source.

  • serialize_method (str) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.

  • zip_pack (bool) – Boolean value indicating whether the input source is zipped.

Return type

DataPack

Returns

A data pack object deserialized from the string.

delete_entry(entry)[source]

Delete an Entry object from the DataPack. This finds the entry in the index and removes it from the index. Note that entries will only appear in the index if add_entry (or _add_entry_with_check) is called.

Please note that deleting an entry does not guarantee the deletion of the related entries.

Parameters

entry (~EntryType) – An Entry object to be deleted from the pack.

get_data(context_type, request=None, skip_k=0)[source]

Fetch data from entries in the data_pack of type context_type. Data includes “span”, annotation-specific default data fields and specific data fields by “request”.

Annotation-specific data fields means:

  • “text” for Type[Annotation]

  • “audio” for Type[AudioAnnotation]

Currently, we do not support Groups and Generics in the request.

Example

requests = {
    base_ontology.Sentence:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
    },
}
pack.get_data(base_ontology.Sentence, requests)
Parameters
  • context_type (Union[str, Type[Annotation], Type[AudioAnnotation]]) –

    The granularity of the data context, which could be any Annotation or AudioAnnotation type. Behaviors under different context_type varies:

    • str type will be converted into either Annotation type or AudioAnnotation type.

    • Type[Annotation]: the default data field for getting context data is text. This function iterates all_annotations to search target entry data.

    • Type[AudioAnnotation]: the default data field for getting context data is audio which stores audio data in numpy arrays. This function iterates all_audio_annotations to search target entry data.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) –

The entry types and fields the user wants to request. The keys of the requests dict are the required entry types and the value should be either:

    • a list of field names or

    • a dict which accepts three keys: “fields”, “component”, and “unit”.

      • By setting “fields” (list), users specify the requested fields of the entry. If “fields” is not specified, only the default fields will be returned.

      • By setting “component” (list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components.

      • By setting “unit” (string), users can specify a unit by which the annotations are indexed.

    Note that for all annotation types, “span” fields and annotation-specific data fields are returned by default.

    For all link types, “child” and “parent” fields are returned by default.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Return type

Iterator[Dict[str, Any]]

Returns

A data generator, which generates one piece of data (a dict containing the required entries, fields, and context).

build_coverage_for(context_type, covered_type)[source]

User can call this function to build the coverage index for specific types. The index provides an in-memory mapping from entries of context_type to the entries “covered” by it. See forte.data.data_pack.DataIndex for more details.

Parameters
  • context_type – The type of the context (covering) entries, an Annotation or AudioAnnotation type.

  • covered_type – The type of the entries to be covered.

covers(context_entry, covered_entry)[source]

Check if the covered_entry is covered (in span) by the context_entry.

See in_span() and in_audio_span() for the definition of in span.

Parameters
  • context_entry (Union[Annotation, AudioAnnotation]) – The context entry.

  • covered_entry (~EntryType) – The entry to be checked on whether it is in span of the context entry.

Returns (bool): True if in span.

Return type

bool

iter_in_range(entry_type, range_annotation)[source]

Iterate the entries of the provided type within, or fulfilling the constraints of, the range_annotation. The constraint is True if an entry is in_span() or in_audio_span() of the provided range_annotation.

Internally, if the coverage index between the entry type and the type of the range_annotation is built, then this will create the iterator from the index. Otherwise, the function will iterate them from scratch (which is slower). If this function is used frequently, it is suggested to build the coverage index.

Only when range_annotation is an instance of AudioAnnotation will the searching be performed on the list of audio annotations. In other cases (i.e., when range_annotation is None or Annotation), it defaults to a searching process on the list of text annotations.

Parameters
  • entry_type (Type[~EntryType]) – The type of entry to iterate over.

  • range_annotation (Union[Annotation, AudioAnnotation]) – The range annotation that serve as the constraint.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries within the range_annotation.

get(entry_type, range_annotation=None, components=None, include_sub_type=True)[source]

This function is used to get data from a data pack with various methods.

Depending on the provided arguments, the function will perform several different filtering of the returned data.

The entry_type is mandatory, where all the entries matching this type will be returned. The sub-types of the provided entry type will be also returned if include_sub_type is set to True (which is the default behavior).

The range_annotation controls the search area of the sub-types. An entry E will be returned if in_span() or in_audio_span() returns True. If this function is called frequently with queries related to the range_annotation, please consider building the coverage index for the related entry types. Users can call build_coverage_for(context_type, covered_type) in order to build a mapping between a pair of entry types and the target entries that are covered in ranges specified by outer entries.

The components list will filter the results by the component (i.e., the creator of the entry). If components is provided, only the entries created by one of the components will be returned.

Example

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from a sentence created by NLTKTokenizer.
    token_entries = input_pack.get(
        entry_type=Token,
        range_annotation=sentence,
        component='NLTKTokenizer')
    ...

In the above code snippet, we get entries of type Token within each sentence which were generated by NLTKTokenizer. You can consider building a coverage index between Token and Sentence if this snippet is frequently used:

# Build coverage index between `Token` and `Sentence`
input_pack.build_coverage_for(
    context_type=Sentence,
    covered_type=Token
)

After building the index from the snippet above, you will be able to retrieve the tokens covered by sentence much faster.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • range_annotation (Union[Annotation, AudioAnnotation, None]) – The range of entries requested. If None, will return valid entries in the range of whole data pack.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type (bool) – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]

update(datapack)[source]

Update the attributes and properties of the current DataPack with another DataPack.

Parameters

datapack (DataPack) – A reference datapack to update

BaseMeta

class forte.data.base_pack.BaseMeta(pack_name=None)[source]

Basic Meta information for both DataPack and MultiPack.

Parameters

pack_name (Optional[str]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

record

Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

Meta

class forte.data.data_pack.Meta(pack_name=None, language='eng', span_unit='character', sample_rate=None, info=None)[source]

Basic Meta information associated with each instance of DataPack.

Parameters
  • pack_name (Optional[str]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

  • language (str) – The language used by this data pack, default is English.

  • span_unit (str) – The unit used for interpreting the Span object of this data pack. Default is character.

  • sample_rate (Optional[int]) – An integer specifying the sample rate of audio payload. Default is None.

  • info (Optional[Dict[str, str]]) – Stores additional string-based information that the user adds.

pack_name

storing the provided pack_name.

language

storing the provided language.

sample_rate

storing the provided sample_rate.

info

storing the provided info.

record

Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

DataIndex

class forte.data.data_pack.DataIndex[source]

A set of indexes used in DataPack. Note that this class is used by the DataPack internally.

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. component_index, the index from each component to the entries generated by that component

  4. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  5. group_index, the index from group members to groups.

  6. _coverage_index, the index that maps from an annotation to the entries it covers. _coverage_index is a dict of dict, where the key is a tuple of the outer entry type and the inner entry type. The outer entry type should be an annotation type. The value is a dict, where the key is the tid of the outer entry, and the value is a set of tids that are covered by the outer entry. We say an Annotation A covers an entry E if one of the following conditions is met: 1. E is of Annotation type, and E.begin >= A.begin, E.end <= A.end; 2. E is of Link type, and both E's parent and child nodes are Annotations that are covered by A.

coverage_index(outer_type, inner_type)[source]

Get the coverage index from outer_type to inner_type.

Parameters
  • outer_type – The type of the outer (covering) entries.

  • inner_type – The type of the inner (covered) entries.

Return type

Optional[Dict[int, Set[int]]]

Returns

If the coverage index does not exist, return None. Otherwise, return a dict.

get_covered(data_pack, context_annotation, inner_type)[source]

Get the entries covered by a certain context annotation.

Parameters
  • data_pack (DataPack) – The data pack to search in.

  • context_annotation (Union[Annotation, AudioAnnotation]) – The context annotation to search in.

  • inner_type (Type[~EntryType]) – The inner type to be searched for.

Return type

Set[int]

Returns

Entry ids of type inner_type that are covered by context_annotation.

build_coverage_index(data_pack, outer_type, inner_type)[source]

Build the coverage index from outer_type to inner_type.

Parameters
  • data_pack (DataPack) – The data pack to build coverage for.

  • outer_type (Type[Union[Annotation, AudioAnnotation]]) – an annotation or AudioAnnotation type.

  • inner_type (Type[~EntryType]) – an entry type, can be Annotation, Link, Group, AudioAnnotation.

have_overlap(entry1, entry2)[source]

Check whether the two annotations have overlap in span.

Parameters
  • entry1 – The first annotation to check.

  • entry2 – The second annotation to check.

Return type

bool

in_span(inner_entry, span)[source]

Check whether the inner entry is within the given span. The criteria are as follows:

Annotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.

Link entries: if the parent and child of the links are both Annotation type, this link will be considered in span if both parent and child are in_span() of the provided span. If either the parent or the child is not of type Annotation, this function will always return False.

Group entries: if the child type of the group is Annotation type, then the group will be considered in span if all the elements are in_span() of the provided span. If the child type is not Annotation type, this function will always return False.

Other entries (i.e., Generics and AudioAnnotation): they will not be considered in_span() of any spans. The function will always return False.

Parameters
  • inner_entry (Union[int, Entry]) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.

Return type

bool

Returns

True if the inner_entry is considered to be in span of the provided span.

in_audio_span(inner_entry, span)[source]

Check whether the inner entry is within the given audio span. This method is identical to in_span() except that it operates on the audio payload of the data pack. The criteria are as follows:

AudioAnnotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.

Link entries: if the parent and child of the links are both AudioAnnotation type, this link will be considered in span if both parent and child are in_span() of the provided span. If either the parent or the child is not of type AudioAnnotation, this function will always return False.

Group entries: if the child type of the group is AudioAnnotation type, then the group will be considered in span if all the elements are in_span() of the provided span. If the child type is not AudioAnnotation type, this function will always return False.

Other entries (i.e., Generics and Annotation): they will not be considered in_span() of any spans. The function will always return False.

Parameters
  • inner_entry (Union[int, Entry]) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.

Return type

bool

Returns

True if the inner_entry is considered to be in span of the provided span.

MultiPack

MultiPackMeta

class forte.data.multi_pack.MultiPackMeta(pack_name=None)[source]

Meta information of a MultiPack.

MultiPack

class forte.data.multi_pack.MultiPack(pack_name=None)[source]

A MultiPack contains multiple DataPacks and a collection of cross-pack entries (such as links and groups).

relink(packs)[source]

Re-link the reference of the multi-pack to other entries, including the data packs in it.

Parameters

packs (Iterator[DataPack]) – a data pack iterator.

Returns

None

get_subentry(pack_idx, entry_id)[source]

Get a sub-entry from the multi pack. This method uses pack_id (a unique identifier assigned to each data pack) to get a pack from the multi pack, and then returns its sub-entry with entry_id. Note that this differs from how such packs were accessed before the PACK_ID_COMPATIBLE_VERSION, in which pack_idx was used as a list index to access/reference a pack within the multi pack (and then get the sub-entry).

Parameters
  • pack_idx (int) – The pack_id of the data pack in the multi pack.

  • entry_id (int) – The id of the entry within the pack identified by pack_idx.

Returns

The sub-entry with entry_id from the pack whose pack_id equals pack_idx.

remove_pack(index_of_pack, clean_invalid_entries=False, purge_lists=False)[source]

Remove a data pack at index index_of_pack from this multi pack.

In a multi pack, the data pack to be removed may be associated with some multi pack entries, such as MultiPackLinks that are connected with other packs. These entries will become dangling and invalid, and thus need to be removed. One can consider removing these links before calling this function, or set clean_invalid_entries to True so that they will be automatically pruned. If purge_lists is set to True, the empty slots left in the internal lists of this multi pack by the removed pack will also be purged; this changes the indices of the packs that come after the removed one, so the user is responsible for managing such changes if those indices are used or stored elsewhere.

Parameters
  • index_of_pack (int) – The index of pack for removal from the multi pack. If invalid, no pack will be deleted.

  • clean_invalid_entries (bool) – Switch for automatically cleaning the entries associated with the data pack being deleted which will become invalid after the removal of the pack. Default is False.

  • purge_lists (bool) – Switch for automatically removing the empty slots left in the internal lists of this multi pack by the removed pack. Purging changes the indices of the packs that come after the removed one, so the user is responsible for managing such changes if those indices are used or stored elsewhere. Default is False.

Return type

bool

Returns

True if successful.

Raises

ValueError – if clean_invalid_entries is set to False and the DataPack to be removed has entries (in links, groups) associated with it.

purge_deleted_packs()[source]

Purge deleted packs from the internal lists, whose slots were previously set to -1, empty, or None in order to keep the indices unchanged. Caution: purging the deleted packs removes these empty slots from the lists of this multi pack, which changes the indices of the packs that come after the deleted pack(s); the user is responsible for managing such changes if those indices are used or stored somewhere in the user's code after purging.

Return type

bool

Returns

True if successful.

add_pack(ref_name=None, pack_name=None)[source]

Create a data pack and add it to this multi pack. If ref_name is provided, it will be used to index the data pack. Otherwise, a default name based on the pack id will be created for this data pack. The created data pack will be returned.

Parameters
  • ref_name (Optional[str]) – The pack name used to reference this data pack from the multi pack. If none, the reference name will not be set.

  • pack_name (Optional[str]) – The pack name of the data pack (itself). If none, the name will not be set.

Returns: The newly created data pack.

Return type

DataPack

add_pack_(pack, ref_name=None)[source]

Add an existing data pack to the multi pack.

Parameters
  • pack (DataPack) – The existing data pack.

  • ref_name (Optional[str]) – The name to be used to refer to this data pack in the multi pack.

Returns

None

get_pack_at(index)[source]

Get data pack at provided index.

Parameters

index (int) – The index of the pack.

Return type

DataPack

Returns

The pack at the index.

get_pack_index(pack_id)[source]

Get the pack index from the global pack id.

Parameters

pack_id (int) – The global pack id to find.

Return type

int

Returns

The index of the pack in this multi pack.

get_pack(name)[source]

Get data pack of name.

Parameters

name (str) – The name of the pack.

Return type

DataPack

Returns

The pack that has that name.
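
For illustration, a minimal sketch of creating packs inside a MultiPack and retrieving them by reference name or index; the names and texts are placeholders:

from forte.data.multi_pack import MultiPack

m_pack = MultiPack()
source = m_pack.add_pack(ref_name="source", pack_name="source_doc")
target = m_pack.add_pack(ref_name="target", pack_name="target_doc")
source.set_text("Bonjour le monde.")
target.set_text("Hello world.")

assert m_pack.get_pack("source").text == "Bonjour le monde."
assert m_pack.get_pack_at(1).text == "Hello world."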

property packs

Get the list of data packs in the order in which they were added.

Please do not use this to try to add or delete data packs.

Return type

List[DataPack]

Returns

List of data packs contained in this multi-pack.

rename_pack(old_name, new_name)[source]

Rename the pack to a new name. If the new_name is already taken, a ValueError will be raised. If the old_name is not found, a KeyError will be raised, just as with a missing key in a dictionary.

Parameters
  • old_name (str) – The old name of the pack.

  • new_name (str) – The new name to be assigned for the pack.

Returns

None

property all_links

An iterator of all links in this multi pack.

Return type

Iterator[MultiPackLink]

Returns

Iterator of all links, of type MultiPackLink.

property num_links

Number of links in this multi pack.

Return type

int

Returns

Number of links.

property all_groups

An iterator of all groups in this multi pack.

Return type

Iterator[MultiPackGroup]

Returns

Iterator of all groups, of type MultiPackGroup.

property num_groups

Number of groups in this multi pack.

Return type

int

Returns

Number of groups.

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that have not been added to the pack manually.

Parameters

component (Optional[str]) – Overwrite the component record with this.

Returns

None

get_single_pack_data(pack_index, context_type, request=None, skip_k=0)[source]

Get pack data from one of the packs, specified by pack_index. This is equivalent to calling get_data() on that DataPack.

Parameters
  • pack_index (int) – The index of a single pack.

  • context_type (Type[Annotation]) – The granularity of the data context, which could be any Annotation type.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) – The entry types and fields required. The keys of the dict are the required entry types and the value should be either a list of field names or a dict. If the value is a dict, accepted items includes “fields”, “component”, and “unit”. By setting “component” (a list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components. By setting “unit” (a string), users can specify a unit by which the annotations are indexed. Note that for all annotations, “text” and “span” fields are given by default; for all links, “child” and “parent” fields are given by default.

  • skip_k (int) – Will skip the first k instances and generate data from the k + 1 instance.

Return type

Iterator[Dict[str, Any]]

Returns

A data generator, which generates one piece of data (a dict containing the required annotations and context).

get_cross_pack_data(request)[source]

Note

This function is not finished.

Get data via the links and groups across data packs. The keys could be MultiPack entries (i.e. MultiPackLink and MultiPackGroup). The values specify the detailed entry information to be retrieved. A value can be a list of field names, in which case the returned results will contain all the specified fields.

One can also call this method with more constraints by providing a dictionary, which can contain the following keys:

  • “fields”, this specifies the attribute field names to be obtained

  • “unit”, this specifies the unit used to index the annotation

  • “component”, this specifies a constraint to take only the entries created by the specified component.

The data request logic is similar to that of get_data() function in DataPack, but applied on MultiPack entries.

Example:

requests = {
    MultiPackLink:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_cross_pack_data(requests)
Parameters

request (Dict[Type[Union[MultiPackLink, MultiPackGroup]], Union[Dict, List]]) – A dict containing the data request. The keys are the types to be requested, and the fields are the detailed constraints.

Returns

None

get(entry_type, components=None, include_sub_type=True)[source]

Get entries of entry_type from this multi pack.

Example:

for relation in pack.get(
        CrossDocEntityRelation,
        components="relation_creator"):
    print(relation.get_parent())

In the above code snippet, we get entries of type CrossDocEntityRelation which were generated by a component named relation_creator.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of the entries requested.

  • components (Union[str, List[str], None]) – The component generating the entries requested. If None, all valid entries generated by any component will be returned.

  • include_sub_type – whether to return the sub types of the queried entry_type. True by default.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the arguments, following the order of entries (first sort by entry comparison, then by insertion)

classmethod deserialize(data_path, serialize_method='jsonpickle', zip_pack=False)[source]

Deserialize a Multi Pack from a string. Note that this will only deserialize the native multi pack content, which means the associated DataPacks contained in the MultiPack will not be recovered. A follow-up step needs to be performed to add the data packs back to the multi pack.

This internally calls the internal _deserialize() function from the BasePack.

Parameters
  • data_path (Union[Path, str]) – The serialized string of a Multi pack to be deserialized.

  • serialize_method (str) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.

  • zip_pack (bool) – Boolean value indicating whether the input source is zipped.

Return type

MultiPack

Returns

A MultiPack object deserialized from the string.
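
A sketch of the deserialization plus the follow-up step mentioned above (the file paths are illustrative, and the exact re-attachment call depends on the Forte version):

from forte.data.data_pack import DataPack
from forte.data.multi_pack import MultiPack

# Recover the multi pack content itself; the member DataPacks are NOT included.
multi_pack = MultiPack.deserialize(
    "multi_pack.json", serialize_method="jsonpickle", zip_pack=False
)

# Follow-up step: load the member packs that were serialized separately.
# How they are re-attached to the multi pack is version dependent, so that
# call is intentionally not shown here.
member_packs = [DataPack.deserialize(p) for p in ["pack_0.json", "pack_1.json"]]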

delete_entry(entry)[source]

Delete an Entry object from the MultiPack.

Parameters

entry (~EntryType) – An Entry object to be deleted from the pack.

MultiPackGroup

class forte.data.multi_pack.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.
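
A minimal sketch of populating a group (assuming multi_pack is a MultiPack, entity_a and entity_b are entries that belong to it or its member packs, and that entries can be added manually via add_entry()):

# Group two mentions that refer to the same entity.
group = MultiPackGroup(multi_pack)
group.add_member(entity_a)
group.add_member(entity_b)
multi_pack.add_entry(group)

for member in group.get_members():
    print(member.tid)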

Readers

BaseReader

class forte.data.base_reader.BaseReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic data reader class. To be inherited by all data readers.

Parameters
  • from_cache (bool) – Decide whether to read from cache if cache file exists. By default (False), the reader will only read from the original file and use the cache file path for caching, it will not read from the cache_directory. If True, the reader will try to read a datapack from the caching file.

  • cache_directory (Optional[str]) –

    The base directory to place the path of the caching files. Each collection is contained in one cached file, under this directory. The cached location for each collection is computed by _cache_key_function().

    Note

    A collection is the data returned by _collect().

  • append_to_cache (bool) – Decide whether to append write if cache file already exists. By default (False), we will overwrite the existing caching file. If True, we will append the datapack to the end of the caching file.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The reader will be initialized with configs and can register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are jsonpickle and pickle. Default is jsonpickle.

parse_pack(collection)[source]

Calls _parse_pack() to create packs from the collection. This internally sets up the component metadata. Users should implement the _parse_pack() method.

Return type

Iterator[~PackType]
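
A hedged sketch of how the pieces fit together in a reader subclass (the class and file format below are hypothetical; only _collect(), _parse_pack(), set_text() and _cache_key_function() from this documentation are used):

from typing import Iterator

from forte.data.base_reader import PackReader
from forte.data.data_pack import DataPack


class LineReader(PackReader):
    """Hypothetical reader: each line of a text file becomes one DataPack."""

    def _collect(self, file_path: str) -> Iterator[str]:
        # A "collection" here is one line of text.
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                yield line.strip()

    def _parse_pack(self, collection: str) -> Iterator[DataPack]:
        # Some Forte versions prefer self.new_pack() over DataPack() here.
        pack = DataPack()
        self.set_text(pack, collection)
        yield pack

    def _cache_key_function(self, collection: str) -> str:
        # Used to compute the cached location for this collection.
        return str(hash(collection))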

text_replace_operation(text)[source]

Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

Parameters

text (str) – The original data text to be cleaned.

Return type

List[Tuple[Span, str]]

Returns

The replacement operations.
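
For instance, a reader subclass might normalize non-breaking spaces before the pack text is set (a sketch; Span comes from forte.data.span):

from typing import List, Tuple

from forte.data.span import Span


def text_replace_operation(self, text: str) -> List[Tuple[Span, str]]:
    # Replace every non-breaking space with a regular space: each operation
    # says that the characters covered by the Span become the given string.
    return [
        (Span(i, i + 1), " ")
        for i, ch in enumerate(text)
        if ch == "\u00a0"
    ]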

set_profiling(enable_profiling=True)[source]

Set profiling option.

Parameters

enable_profiling (bool) – A boolean of whether to enable profiling for the reader or not (the default is True).

timer_yield(pack)[source]

Wrapper generator for time profiling. Insert timers around ‘yield’ to support time profiling for reader.

Parameters

pack (~PackType) – DataPack passed from self.iter()

iter(*args, **kwargs)[source]

An iterator over the entire dataset, yielding all the packs read from the data source(s). If not reading from cache, this will go through the collection process (see _collect()).

Parameters
  • args – One or more input data sources, for example, most DataPack readers accept data_source as file/folder path.

  • kwargs – Iterator of DataPacks.

Return type

Iterator[~PackType]

record(record_meta)[source]

Modify the pack meta record field of the reader’s output. The keys of the record should be the entry types and the values should be attributes of those entry types. All the information will be used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.

Parameters

record_meta (Dict[str, Set[str]]) – the field in the datapack for type record that need to fill in for consistency checking.
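
A hedged sketch of a reader's record() implementation (the entry types and attributes listed are illustrative):

from typing import Dict, Set


def record(self, record_meta: Dict[str, Set[str]]):
    # Declare what this reader produces: Sentence entries with no extra
    # attributes, and Token entries carrying a "pos" attribute.
    record_meta["ft.onto.base_ontology.Sentence"] = set()
    record_meta["ft.onto.base_ontology.Token"] = {"pos"}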

cache_data(collection, pack, append)[source]

Specify the path to the cache directory.

After you call this method, the dataset reader will use its cache_directory to store a cache of BasePack read from every document passed to read, serialized as one string-formatted BasePack. If the cache file for a given file_path exists, we read the BasePack from the cache. If the cache file does not exist, we will create it on our first pass through the data.

Parameters
  • collection (Any) – The collection is a piece of data from the _collect() function, to be read to produce DataPack(s). During caching, a cache key is computed based on the data in this collection.

  • pack (~PackType) – The data pack to be cached.

  • append (bool) – Whether to allow appending to the cache.

read_from_cache(cache_filename)[source]

Reads one or more Packs from cache_filename, and yields Pack(s) from the cache file.

Parameters

cache_filename (Union[Path, str]) – Path to the cache file.

Return type

Iterator[~PackType]

Returns

List of cached data packs.

finish(resource)[source]

The pipeline will call this function at the end of the pipeline to notify all the components. The user can implement this function to release resources used by this component. The component can also add objects to the resources.

Parameters

resource (Resources) – A global resource registry.

set_text(pack, text)[source]

Assign the text value to the DataPack. This function will pass the text_replace_operation to the DataPack to conduct the pre-processing step.

Parameters
  • pack (DataPack) – The DataPack to assign value for.

  • text (str) – The original text to be recorded in this dataset.

PackReader

class forte.data.base_reader.PackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

A Pack Reader reads data into DataPack.

MultiPackReader

class forte.data.base_reader.MultiPackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic MultiPack data reader class. To be inherited by all data readers which return MultiPack.

CoNLL03Reader

ConllUDReader

class forte.data.readers.conllu_ud_reader.ConllUDReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

ConllUDReader is designed to read in the Universal Dependencies 2.4 dataset.

BaseDeserializeReader

RawDataDeserializeReader

RecursiveDirectoryDeserializeReader

HTMLReader

MSMarcoPassageReader

class forte.data.readers.ms_marco_passage_reader.MSMarcoPassageReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

MultiPackSentenceReader

MultiPackTerminalReader

OntonotesReader

PlainTextReader

ProdigyReader

RACEMultiChoiceQAReader

StringReader

SemEvalTask8Reader

OpenIEReader

SquadReader

class forte.datasets.mrc.squad_reader.SquadReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

Reader for processing Stanford Question Answering Dataset (SQuAD).

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span.

Dataset can be downloaded at https://rajpurkar.github.io/SQuAD-explorer/.

SquadReader reads each paragraph in the dataset as a separate Document, with the questions concatenated after the paragraph to form a Passage. Answers are marked as Phrase text spans. Each MRCQuestion has a list of answers.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are jsonpickle and pickle. Default is jsonpickle.

record(record_meta)[source]

Method to add the output type record of SquadReader, which is ft.onto.base_ontology.Document with an empty set, to forte.data.data_pack.Meta.record.

Parameters

record_meta (Dict[str, Set[str]]) – the field in the datapack for type record that need to fill in for consistency checking.

ClassificationDatasetReader

Selector

Selector

class forte.data.selector.Selector[source]

DummySelector

class forte.data.selector.DummySelector[source]

Do nothing, return the data pack itself, which can be either DataPack or MultiPack.

SinglePackSelector

class forte.data.selector.SinglePackSelector[source]

This is the base class for selectors that select a DataPack from a MultiPack.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.
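
A hedged sketch of a custom selector built on this interface (the class name and selection rule are hypothetical):

from forte.data.data_pack import DataPack
from forte.data.multi_pack import MultiPack
from forte.data.selector import SinglePackSelector


class NonEmptyPackSelector(SinglePackSelector):
    """Hypothetical selector: keep only packs that contain some text."""

    def will_select(
        self, pack_name: str, pack: DataPack, multi_pack: MultiPack
    ) -> bool:
        # Select the pack only if it has non-whitespace text.
        return bool(pack.text.strip())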

NameMatchSelector

class forte.data.selector.NameMatchSelector(select_name=None)[source]

Select a DataPack from a MultiPack with a specified name. This implementation takes special care for backward compatibility.

Deprecated:

selector = NameMatchSelector(select_name="foo")
selector = NameMatchSelector("foo")

Now:

selector = NameMatchSelector()
selector.initialize(
    configs={
        "select_name": "foo"
    }
)

WARNING: Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Returns

A boolean value indicating whether the pack will be selected.

RegexNameMatchSelector

class forte.data.selector.RegexNameMatchSelector(select_name=None)[source]

Select a DataPack from a MultiPack using a regex.

This implementation takes special care for backward compatibility.

Deprecated:

selector = RegexNameMatchSelector(select_name="^.*\\d$")
selector = RegexNameMatchSelector("^.*\\d$")

Now:

selector = RegexNameMatchSelector()
selector.initialize(
    configs={
        "select_name": "^.*\\d$"
    }
)

Warning

Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.

FirstPackSelector

class forte.data.selector.FirstPackSelector[source]

Select the first data pack from the MultiPack and yield it.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.

AllPackSelector

class forte.data.selector.AllPackSelector[source]

Select all the packs from MultiPack and yield them.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.

Index

BaseIndex

class forte.data.index.BaseIndex[source]

A set of indexes used in BasePack:

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  4. group_index, the index from group members to groups.

update_basic_index(entries)[source]

Build or update the basic indexes, including

(1) entry_index, the index from each tid to the corresponding entry;

(2) type_index, the index from each type to the entries of that type;

(3) component_index, the index from each component to the entries generated by that component.

Parameters

entries (list) – a list of entries to be added into the basic index.

query_by_type_subtype(t)[source]

Look up the entry indices that are instances of entry_type, including children classes of entry_type.

Note

All the types known to this data pack will be scanned to find all sub-types. This method will try to cache the sub-type information after the first call, but the cached information could be invalidated by other operations (such as adding new items to the data pack).

Parameters

t (Type[~EntryType]) – The type of the entry you are looking for.

Return type

Set[int]

Returns

A set of entry ids. The entries are instances of entry_type ( and also includes instances of the subclasses of entry_type).

build_link_index(links)[source]

Build the link_index, the index from child and parent nodes to links. It will build the index with the links in the dataset.

link_index consists of two sub-indexes: “child_index” is the index from child nodes to their corresponding links, and “parent_index” is the index from parent nodes to their corresponding links.

build_group_index(groups)[source]

Build group_index, the index from group members to groups.

Returns

None

link_index(tid, as_parent=True)[source]

Look up the link_index with key tid. If the link index is not built, this will throw a PackIndexError.

Parameters
  • tid (int) – the tid of the entry being looked up.

  • as_parent (bool) – If as_parent is True, will look up link_index["parent_index"] and return the tids of links whose parent is tid. Otherwise, will look up link_index["child_index"] and return the tids of links whose child is tid.

Return type

Set[int]

group_index(tid)[source]

Look up the group_index with key tid. If the index is not built, this will raise a PackIndexError.

Return type

Set[int]

update_link_index(links)[source]

Update link_index with the provided links, the index from child and parent nodes to links.

link_index consists of two sub-indexes:

  • “child_index” is the index from child nodes to their corresponding links

  • “parent_index” is the index from parent nodes to their corresponding links.

Parameters

links (List[~LinkType]) – a list of links to be added into the index.

update_group_index(groups)[source]

Build or update group_index, the index from group members to groups.

Parameters

groups (List[~GroupType]) – a list of groups to be added into the index.

Store

BaseStore

class forte.data.base_store.BaseStore[source]

The base class which will be used by DataStore.

abstract add_annotation_raw(type_name, begin, end)[source]

This function adds an annotation entry with begin and end indices to the type_name sorted list in self.__elements, returns the tid for the inserted entry.

Parameters
  • type_name (str) – The index of Annotation sorted list in self.__elements.

  • begin (int) – Begin index of the entry.

  • end (int) – End index of the entry.

Return type

int

Returns

tid of the entry.

abstract add_link_raw(type_name, parent_tid, child_tid)[source]

This function adds a link entry with parent_tid and child_tid to the type_name list in self.__elements, returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The index of Link list in self.__elements.

  • parent_tid (int) – tid of the parent entry.

  • child_tid (int) – tid of the child entry.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

abstract add_group_raw(type_name, member_type)[source]

This function adds a group entry with member_type to the type_name list in self.__elements, returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The index of Group list in self.__elements.

  • member_type (str) – Fully qualified name of its members.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

abstract set_attribute(tid, attr_name, attr_value)[source]

This function locates the entry data with tid and sets its attr_name with attr_value.

Parameters
  • tid (int) – Unique Id of the entry.

  • attr_name (str) – Name of the attribute.

  • attr_value (Any) – Value of the attribute.

abstract set_attr(tid, attr_id, attr_value)[source]

This function locates the entry data with tid and sets its attribute attr_id with value attr_value. Called by set_attribute().

Parameters
  • tid (int) – Unique id of the entry.

  • attr_id (int) – Id of the attribute.

  • attr_value (Any) – value of the attribute.

abstract get_attribute(tid, attr_name)[source]

This function finds the value of attr_name in entry with tid.

Parameters
  • tid (int) – Unique id of the entry.

  • attr_name (str) – Name of the attribute.

Returns

The value of attr_name for the entry with tid.

abstract get_attr(tid, attr_id)[source]

This function locates the entry data with tid and gets the value of attr_id of this entry. Called by get_attribute().

Parameters
  • tid (int) – Unique id of the entry.

  • attr_id (int) – Id of the attribute.

Returns

The value of attr_id for the entry with tid.

abstract delete_entry(tid)[source]

This function removes the entry with tid from the data store.

Parameters

tid (int) – Unique id of the entry.

abstract get_entry(tid)[source]

Look up the entry_dict with key tid. Return the entry and its type_name.

Parameters

tid (int) – Unique id of the entry.

Return type

Tuple[List, str]

Returns

The entry which tid corresponds to and its type_name.

abstract get_entry_index(tid)[source]

Look up the entry_dict with key tid. Return the index_id of the entry.

Parameters

tid (int) – Unique id of the entry.

Return type

int

Returns

Index of the entry which tid corresponds to in the entry_type list.

abstract get(type_name, include_sub_type)[source]

This function fetches entries from the data store of type type_name.

Parameters
  • type_name (str) – The index of the list in self.__elements.

  • include_sub_type (bool) – A boolean to indicate whether to also get entries of its subclasses.

Return type

Iterator[List]

Returns

An iterator of the entries matching the provided arguments.

abstract next_entry(tid)[source]

Get the next entry of the same type as the tid entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

The next entry of the same type as the tid entry.

abstract prev_entry(tid)[source]

Get the previous entry of the same type as the tid entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

The previous entry of the same type as the tid entry.

Data Store

DataStore

class forte.data.data_store.DataStore(onto_file_path=None, dynamically_add_type=True)[source]
add_annotation_raw(type_name, begin, end)[source]

This function adds an annotation entry with begin and end indices to the current data store object. Returns the tid for the inserted entry.

Parameters
  • type_name (str) – The fully qualified type name of the new Annotation.

  • begin (int) – Begin index of the entry.

  • end (int) – End index of the entry.

Return type

int

Returns

tid of the entry.
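
A minimal usage sketch (the type name is illustrative; it must be a fully qualified entry type that the store can resolve):

from forte.data.data_store import DataStore

store = DataStore()
# Add a Sentence annotation covering characters [0, 20) and keep its tid.
tid = store.add_annotation_raw("ft.onto.base_ontology.Sentence", 0, 20)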

add_link_raw(type_name, parent_tid, child_tid)[source]

This function adds a link entry with parent_tid and child_tid to the current data store object. Returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The fully qualified type name of the new Link.

  • parent_tid (int) – tid of the parent entry.

  • child_tid (int) – tid of the child entry.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

add_group_raw(type_name, member_type)[source]

This function adds a group entry with member_type to the current data store object. Returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The fully qualified type name of the new Group.

  • member_type (str) – Fully qualified name of its members.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

set_attribute(tid, attr_name, attr_value)[source]

This function locates the entry data with tid and sets its attr_name with attr_value. It first finds attr_id according to attr_name. tid, attr_id, and attr_value are passed to set_attr().

Parameters
  • tid (int) – Unique Id of the entry.

  • attr_name (str) – Name of the attribute.

  • attr_value (Any) – Value of the attribute.

Raises

KeyError – when tid or attr_name is not found.

get_attribute(tid, attr_name)[source]

This function finds the value of attr_name in entry with tid. It locates the entry data with tid and finds attr_id of its attribute attr_name. tid and attr_id are passed to get_attr().

Parameters
  • tid (int) – Unique id of the entry.

  • attr_name (str) – Name of the attribute.

Return type

Any

Returns

The value of attr_name for the entry with tid.

Raises

KeyError – when tid or attr_name is not found.
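
Continuing the DataStore sketch above ("speaker" is assumed to be a valid attribute of the Sentence type in the loaded ontology):

# Write and read an attribute by name.
store.set_attribute(tid, "speaker", "alice")
value = store.get_attribute(tid, "speaker")  # -> "alice"

try:
    store.get_attribute(tid, "no_such_attribute")
except KeyError:
    # Raised when the attribute name is not found, as documented above.
    pass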

delete_entry(tid)[source]

This function locates the entry data with tid and removes it from the data store. This function first removes it from __entry_dict.

Parameters

tid (int) – Unique id of the entry.

Raises
  • KeyError – when entry with tid is not found.

  • RuntimeError – when internal storage is inconsistent.

get_entry(tid)[source]

This function finds the entry with tid. It returns the entry and its type_name.

Parameters

tid (int) – Unique id of the entry.

Return type

Tuple[List, str]

Returns

The entry which tid corresponds to and its type_name.

Raises
  • ValueError – An error occurred when input tid is not found.

  • KeyError – An error occurred when entry_type is not found.

get_entry_index(tid)[source]

Look up the entry_dict with key tid. Return the index_id of the entry.

Parameters

tid (int) – Unique id of the entry.

Return type

int

Returns

Index of the entry which tid corresponds to in the entry_type list.

Raises

ValueError – An error occurred when no corresponding entry is found.

co_iterator_annotation_like(type_names)[source]

Given two or more type names, iterate their entry lists from beginning to end together.

For every single type, the entry lists are sorted by the begin and end fields. The co_iterator_annotation_like function iterates those sorted lists together and yields each entry in sorted order. This task is quite similar to merging several sorted lists into one sorted list. We internally use a min-heap to decide the order of yielded items, and the ordering is determined by:

  • start index of the entry.

  • end index of the entry.

  • the index of the entry type name in input parameter type_names.

The precedence of those values indicates their priority in the min heap ordering. For example, if two entries have both the same begin and end field, then their order is decided by the order of user input type_name (the type that first appears in the target type list will return first). For entries that have the exact same begin, end and type_name, the order will be determined arbitrarily.

Parameters

type_names (List[str]) – a list of string type names

Return type

Iterator[List]

Returns

An iterator of entry elements.
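
The merge logic can be illustrated with a small, self-contained sketch that is independent of the DataStore internals; it merges two already-sorted (begin, end) lists using the same (begin, end, type index) ordering key:

import heapq

# Two already-sorted entry lists, keyed by (begin, end).
tokens = [(0, 3), (4, 8), (9, 12)]
sentences = [(0, 12)]
entry_lists = [tokens, sentences]
type_names = ["Token", "Sentence"]

heap = []
for type_idx, entries in enumerate(entry_lists):
    if entries:
        begin, end = entries[0]
        # Heap key: (begin, end, position of the type in type_names).
        heapq.heappush(heap, (begin, end, type_idx, 0))

while heap:
    begin, end, type_idx, pos = heapq.heappop(heap)
    print(type_names[type_idx], begin, end)
    entries = entry_lists[type_idx]
    if pos + 1 < len(entries):
        nxt_begin, nxt_end = entries[pos + 1]
        heapq.heappush(heap, (nxt_begin, nxt_end, type_idx, pos + 1))
# Prints: Token 0 3, Sentence 0 12, Token 4 8, Token 9 12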

get(type_name, include_sub_type=True)[source]

This function fetches entries from the data store of type type_name.

Parameters
  • type_name (str) – The fully qualified name of the entry.

  • include_sub_type (bool) – A boolean to indicate whether to also get entries of its subclasses.

Return type

Iterator[List]

Returns

An iterator of the entries matching the provided arguments.

next_entry(tid)[source]

Get the next entry of the same type as the tid entry. Call get_entry() to find the current index and use it to find the next entry. If it is a non-annotation type, it will be sorted in the insertion order, which means next_entry would return the next inserted entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

A list of attributes representing the next entry of the same type as the tid entry. Return None when accessing the next entry of the last element in entry list.

Raises

IndexError – Raised when accessing an index outside of the entry list.

prev_entry(tid)[source]

Get the previous entry of the same type as the tid entry. Call get_entry() to find the current index and use it to find the previous entry. If it is a non-annotation type, it will be sorted in the insertion order, which means prev_entry would return the previous inserted entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

A list of attributes representing the previous entry of the same type as the tid entry. Return None when accessing the previous entry of the first element in entry list.

Raises

IndexError – Raised when accessing an index outside of the entry list.

DataPack Dataset

DataPackIterator

class forte.data.data_pack_dataset.DataPackIterator(pack_iterator, context_type, request=None, skip_k=0)[source]

An iterator generating data examples from a stream of data packs.

Parameters
  • pack_iterator (Iterator[DataPack]) – An iterator of DataPack.

  • context_type (Type[Annotation]) – The granularity of a single example which could be any Annotation type. For example, it can be Sentence, then each training example will represent the information of a sentence.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) – The request of type Dict sent to DataPack to query specific data.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Returns

An iterator that each time produces a tuple of a tid (of type int) and a data pack (of type DataPack).

Here is an example usage:

file_path: str = "data_samples/data_pack_dataset_test"
reader = CoNLL03Reader()
context_type = Sentence
request = {Sentence: []}
skip_k = 0

train_pl: Pipeline = Pipeline()
train_pl.set_reader(reader)
train_pl.initialize()
pack_iterator: Iterator[PackType] = train_pl.process_dataset(file_path)

iterator: DataPackIterator = DataPackIterator(pack_iterator,
                                              context_type,
                                              request,
                                              skip_k)

for tid, data_pack in iterator:
    # process tid and data_pack
    ...

Note

For parameters context_type, request, skip_k, please refer to get_data() in DataPack.

DataPackDataset

class forte.data.data_pack_dataset.DataPackDataset(data_source, feature_schemes, hparams=None, device=None)[source]

A dataset representing data packs. Calling a DataIterator over this DataPackDataset will produce an iterator over batches of examples parsed by a reader from the given data packs.

process(raw_example)[source]

Given an input which is a single data example, extract feature from it.

Parameters

raw_example (tuple(dict, DataPack)) – A tuple where the first element is a data dict (in the format produced by get_data() in DataPack) and the second element is the corresponding DataPack.

Return type

Dict[str, Feature]

Returns

A Dict mapping from user-specified tags to the Feature extracted.

Note

Please refer to request() for details about user-specified tags.

collate(examples)[source]

Given a batch of output from process(), produce pre-processed data as well as masks and features.

Parameters

examples (List[Dict[str, Feature]]) – A List of result from process().

Return type

Batch

Returns

A Texar-PyTorch Batch. It can be treated as a Dict with the following structure:

  • “data”: List, np.ndarray, or torch.Tensor. The pre-processed data.

    Please refer to Converter for details.

  • “masks”: np.ndarray or torch.Tensor. All the masks for the pre-processed data.

    Please refer to Converter for details.

  • “features”: List[Feature]. A list of Feature. This is useful when users want to do customized pre-processing.

    Please refer to Feature for details.

{
    "tag_a": {
        "data": <tensor>,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    },
    "tag_b": {
        "data": Tensor,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    }
}

Note

The first level key in returned batch is the user-specified tags. Please refer to request() for details about user-specified tags.

RawExample

forte.data.data_pack_dataset.RawExample

alias of Tuple[int, forte.data.data_pack.DataPack]

FeatureCollection

forte.data.data_pack_dataset.FeatureCollection

alias of Dict[str, forte.data.converter.feature.Feature]

Batchers

ProcessingBatcher

class forte.data.batchers.ProcessingBatcher[source]

This defines the basic interface of the batcher used in BaseBatchProcessor. This batcher only batches data sequentially. It receives new packs dynamically and caches the current packs so that the processors can pack prediction results into the data packs.

initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

flush()[source]

Flush the remaining data.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

A triplet containing the data pack, the context instance, and the batched data.

Note

For backward compatibility reasons, this function returns a list of None contexts.

get_batch(input_pack)[source]

By feeding a data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of the data pack, the context instance, and the batched data.

Parameters

input_pack (~PackType) – The input data pack to get features from.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

An iterator of tuples, each containing the data pack, the context instance, and the batched data.

Note

For backward compatibility reasons, this function returns a list of None as contexts.

classmethod default_configs()[source]

Define the basic configuration of a batcher. Implementation of the batcher can extend this function to include more configurable parameters but need to keep the existing ones defined in this base class.

Here, the available parameters are:

  • use_coverage_index: A boolean value indicates whether the batcher will try to build the coverage index based on the data request. Default is True.

  • cross_pack: A boolean value indicates whether the batcher can go across the boundary of data packs when there is not enough data to fill the batch.

Return type

Dict[str, Any]

Returns

The default configuration.

FixedSizeDataPackBatcherWithExtractor

class forte.data.batchers.FixedSizeDataPackBatcherWithExtractor[source]

This batcher uses extractors to extract features from the dataset and group them into batches. In this class, more pools are added. One is instance_pool, which is used to record the instances from which features are extracted. The other one is feature_pool, which is used to record the features before they can be yielded in batches.

initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

add_feature_scheme(tag, scheme)[source]

Add feature scheme to the batcher.

Parameters
  • tag (str) – The name/tag of the scheme.

  • scheme (str) – The scheme content, which should be a dict containing the extractor and converter used to create features.

collate(features_collection)[source]

This function uses the Converter interface to turn a list of features into batches, where each feature is converted to a tensor/matrix format. The resulting features are organized as a dictionary, where the keys are the feature names/tags, and the values are the converted features. Each feature contains the data and mask in MatrixLike form, as well as the original raw features.

Parameters

features_collection (List[Dict[str, Feature]]) – A list of features.

Return type

Dict[str, Dict[str, Any]]

Returns

An instance of Dict[str, Union[Tensor, Dict]], which is a batch of features.

flush()[source]

Flush data in batches. Each return value contains a tuple of 3 items: the corresponding data pack, the list of annotation objects that represent the context type, and the features.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict[str, Dict[str, Any]]]]

get_batch(input_pack)[source]

By feeding a data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of the data pack, the context instance, and the batched data.

Parameters

input_pack (~PackType) – The input data pack to get features from.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

An iterator of tuples, each containing the data pack, the context instance, and the batched data.

classmethod default_configs()[source]

Defines the configuration of this batcher, here:

  • context_type: The context scope to extract data from. It could be an annotation class or a string that is the fully qualified name of the annotation class.

  • feature_scheme: A dictionary of (extractor name, extractor) that can be used to extract features.

  • batch_size: The batch size, default is 10.

Return type

Dict[str, Any]

Returns

The default configuration structure.
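
A sketch of the configuration structure this implies (the feature_scheme entries are omitted because their exact shape depends on the extractor and converter objects in use; see add_feature_scheme() above):

batcher_config = {
    "context_type": "ft.onto.base_ontology.Sentence",  # or the class itself
    "feature_scheme": {},  # tag -> extraction scheme, filled in practice
    "batch_size": 10,
}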

FixedSizeRequestDataPackBatcher

class forte.data.batchers.FixedSizeRequestDataPackBatcher[source]
initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

The configuration of a batcher.

Here:

  • context_type (str): The fully qualified name of an Annotation type, which will be used as the context to retrieve data from. For example, if a ft.onto.Sentence type is provided, then it will extract data within each sentence.

  • requests: The request detail. See get_data() on what a request looks like.

Return type

Dict

Returns

The default configuration structure and default value.

FixedSizeMultiPackProcessingBatcher

class forte.data.batchers.FixedSizeMultiPackProcessingBatcher[source]

A Batcher used in MultiPackBatchProcessor.

Note

this implementation is not finished.

The Batcher calls the ProcessingBatcher inherently on each specified data pack in the MultiPack.

Querying a MultiPack is flexible, so we delegate the task to subclasses, which may, for example:

  • query all packs with the same context and input_info.

  • query different packs with different context and input_info.

Since the batcher will save the data_pack_pool on the fly, it is not trivial to do batching and slicing of multiple data packs at the same time.

initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

Define the basic configuration of a batcher. Implementation of the batcher can extend this function to include more configurable parameters but need to keep the existing ones defined in this base class.

Here, the available parameters are:

  • use_coverage_index: A boolean value indicates whether the batcher will try to build the coverage index based on the data request. Default is True.

  • cross_pack: A boolean value indicates whether the batcher can go across the boundary of data packs when there is not enough data to fill the batch.

Return type

Dict

Returns

The default configuration.

FixedSizeDataPackBatcher

class forte.data.batchers.FixedSizeDataPackBatcher[source]
initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

The configuration of a batcher.

Here:

  • batch_size: the batch size, default is 10.

Return type

Dict

Returns

The default configuration structure and default value.

Caster

Caster

class forte.data.caster.Caster[source]

MultiPackBoxer

class forte.data.caster.MultiPackBoxer[source]

This class creates a MultiPack from a DataPack; the resulting MultiPack will only contain the original DataPack, indexed by the pack_name.

cast(pack)[source]

Auto-box the DataPack into a MultiPack by simple wrapping.

Parameters

pack (DataPack) – The DataPack to be boxed

Return type

MultiPack

Returns

The MultiPack that boxes the input DataPack.

classmethod default_configs()[source]

Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.

MultiPackUnboxer

class forte.data.caster.MultiPackUnboxer[source]

This passes on a single DataPack within the MultiPack.

cast(pack)[source]

Unbox the MultiPack into a DataPack by using pack_index to take the unique pack it contains.

Parameters

pack (MultiPack) – The MultiPack to be boxed.

Return type

DataPack

Returns

A DataPack boxed from the MultiPack.

classmethod default_configs()[source]

Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.
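
A hedged sketch of the usual way a caster is used, as a pipeline component placed right after the reader (StringReader is just an example reader, and adding casters via Pipeline.add is assumed to match your Forte version):

from forte.data.caster import MultiPackBoxer
from forte.data.readers import StringReader
from forte.pipeline import Pipeline

# Wrap each DataPack produced by the reader into a MultiPack so that
# downstream MultiPack-based processors can be used.
pipeline = Pipeline()
pipeline.set_reader(StringReader())
pipeline.add(MultiPackBoxer())  # DataPack -> MultiPack, keyed by pack_name
pipeline.initialize()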

Container

EntryContainer

class forte.data.container.EntryContainer[source]

BasePointer

class forte.data.container.BasePointer[source]

Objects to point to other objects in the data pack.

Types

ReplaceOperationsType

forte.data.types.ReplaceOperationsType

alias of List[Tuple[forte.data.span.Span, str]]

DataRequest

forte.data.types.DataRequest

alias of Dict[Type[forte.data.ontology.core.Entry], Union[Dict, List]]

MatrixLike

forte.data.types.MatrixLike

alias of Union[torch._C.TensorType, numpy.ndarray, List]

Data Utilities

maybe_download

forte.data.data_utils.maybe_download(urls: List[str], path: Union[str, PathLike], filenames: Optional[List[str]] = None, extract: bool = False, num_gdrive_retries: int = 1) → List[str][source]
forte.data.data_utils.maybe_download(urls: str, path: Union[str, PathLike], filenames: Optional[str] = None, extract: bool = False, num_gdrive_retries: int = 1) → str

Downloads a set of files.

Parameters
  • urls (Union[List[str], str]) – A (list of) URLs to download files.

  • path (Union[str, ~PathLike]) – The destination path to save the files.

  • filenames (Union[List[str], str, None]) – A (list of) strings of the file names. If given, must have the same length as urls. If None, filenames are extracted from urls.

  • extract (bool) – Whether to extract compressed files.

  • num_gdrive_retries (int) – An integer specifying the number of attempts to download file from Google Drive. Default value is 1.

Returns

A list of paths to the downloaded files.
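
For example (the URL and destination are illustrative):

from forte.data.data_utils import maybe_download

# Download a single archive into ./data and extract it; the local path of
# the downloaded file is returned.
local_path = maybe_download(
    urls="https://example.com/dataset.zip",
    path="./data",
    extract=True,
)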

batch_instances

forte.data.data_utils_io.batch_instances(instances)[source]

Merge a list of instances.

merge_batches

forte.data.data_utils_io.merge_batches(batches)[source]

Merge a list of batches.

slice_batch

forte.data.data_utils_io.slice_batch(batch, start, length)[source]

Return a slice of batch with size length, starting from the index start.

dataset_path_iterator

forte.data.data_utils_io.dataset_path_iterator(dir_path, file_extension)[source]

An iterator over the paths of the files under dir_path that have the given file extension.

Return type

Iterator[str]
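
For example (the directory and extension are illustrative):

from forte.data.data_utils_io import dataset_path_iterator

# Iterate the paths of all .txt files under data_samples/.
for file_path in dataset_path_iterator("data_samples/", "txt"):
    print(file_path)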

Entry Utilities

create_utterance

forte.data.common_entry_utils.create_utterance(input_pack, text, speaker)[source]

Create an utterance in the datapack. This is composed of three steps:

  1. Append the utterance text to the data pack.

  2. Create Utterance entry on the text.

  3. Set the speaker of the utterance to the provided speaker.

Parameters
  • input_pack (DataPack) – The data pack to add utterance into.

  • text (str) – The text of the utterance.

  • speaker (str) – The speaker name to be associated with the utterance.

get_last_utterance

forte.data.common_entry_utils.get_last_utterance(input_pack, target_speaker)[source]

Get the last utterance from a particular speaker. An utterance is an entry of type Utterance.

Parameters
  • input_pack (DataPack) – The data pack to find utterances.

  • target_speaker (str) – The name of the target speaker.

Return type

Optional[Utterance]

Returns

The last Utterance from the speaker if found, None otherwise.
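
A short sketch combining the two utilities above (assuming a freshly created DataPack can be used directly):

from forte.data.common_entry_utils import create_utterance, get_last_utterance
from forte.data.data_pack import DataPack

pack = DataPack()
create_utterance(pack, "Hello, how can I help you?", "ai")
create_utterance(pack, "What is the weather today?", "user")

last_user_utterance = get_last_utterance(pack, "user")
if last_user_utterance is not None:
    print(last_user_utterance.text)  # -> "What is the weather today?"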