Data

Ontology

base

class forte.data.span.Span(begin, end)[source]

A class recording the span of annotations. Span objects can be totally ordered according to their begin as the first sort key and end as the second sort key.

Parameters
  • begin (int) – The offset of the first character in the span.

  • end (int) – The offset of the last character in the span + 1. So the span is a left-closed and right-open interval [begin, end).
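
For illustration, a minimal sketch of the ordering described above (the offsets are arbitrary):

from forte.data.span import Span

a = Span(0, 5)   # covers characters 0..4
b = Span(0, 7)   # same begin, larger end
c = Span(3, 7)

# Spans sort by `begin` first, then by `end`, as described above.
assert sorted([c, b, a]) == [a, b, c]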

core

Entry

class forte.data.ontology.core.Entry(pack)[source]

The base class inherited by all NLP entries. This is the main data type for all in-text NLP analysis results. The main sub-types are Annotation, Link, Generics, and Group.

A forte.data.ontology.top.Annotation object represents a span in text.

A forte.data.ontology.top.Link object represents a binary link relation between two entries.

A forte.data.ontology.top.Generics object represents a general entry that is not associated with a particular text span.

A forte.data.ontology.top.Group object represents a collection of multiple entries.

Main Attributes:

  • embedding: The embedding vectors (numpy array of floats) of this entry.

Parameters

pack (~ContainerType) – Each entry should be associated with one pack upon creation.

property embedding

Get the embedding vectors (numpy array of floats) of the entry.

property tid

Get the id of this entry.

Return type

int

Returns

id of the entry

property pack_id

Get the id of the pack that contains this entry.

Return type

int

Returns

id of the pack that contains this entry.

This function is normally called after deserialization. It can be called when the pack reference of this entry is ready (i.e. after set_pack). The purpose is to convert the Pointer objects into actual entries.

as_pointer(from_entry)[source]

Return a pointer of this entry relative to the from_entry.

Parameters

from_entry (Entry) – the entry to point from.

Returns

A pointer to this entry from the from_entry.

entry_type()[source]

Return the full name of this entry type.

Return type

str

abstract set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent (Entry) – The parent entry.

abstract set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child (Entry) – The child entry

abstract get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link from the given DataPack.

abstract get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link from the given DataPack.

class forte.data.ontology.core.BaseGroup(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members, with no duplicates allowed.

This is the BaseGroup interface. Specific member constraints are defined in the inherited classes.

abstract add_member(member)[source]

Add one entry to the group.

Parameters

member (~EntryType) – One member to be added to the group.

add_members(members)[source]

Add members to the group.

Parameters

members (Iterable[~EntryType]) – An iterator of members to be added to the group.

abstract get_members()[source]

Get the member entries in the group.

Return type

List[~EntryType]

Returns

Instances of Entry that are the members of the group.

top

class forte.data.ontology.top.Generics(pack)[source]
class forte.data.ontology.top.Annotation(pack, begin, end)[source]

Annotation type entries, such as “token”, “entity mention” and “sentence”. Each annotation has a Span corresponding to its offset in the text.

Parameters
  • pack (~PackType) – The container that this annotation will be added to.

  • begin (int) – The offset of the first character in the annotation.

  • end (int) – The offset of the last character in the annotation + 1.

get(entry_type, components=None, include_sub_type=True)[source]

This function wraps the get() method to find entries “covered” by this annotation. See that method for more information.

Example

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from each sentence created by NLTKTokenizer.
    token_entries = sentence.get(
        entry_type=Token,
        component='NLTKTokenizer')
    ...

In the above code snippet, we get entries of type Token within each sentence which were generated by NLTKTokenizer. You can consider building a coverage index between Token and Sentence if this snippet is used frequently.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]

class forte.data.ontology.top.AudioAnnotation(pack, begin, end)[source]

AudioAnnotation type entries, such as “recording” and “audio utterance”. Each audio annotation has a Span corresponding to its offset in the audio. Most methods in this class are the same as the ones in Annotation, except that it replaces property text with audio.

Parameters
  • pack (~PackType) – The container that this audio annotation will be added to.

  • begin (int) – The offset of the first sample in the audio annotation.

  • end (int) – The offset of the last sample in the audio annotation + 1.

get(entry_type, components=None, include_sub_type=True)[source]

This function wraps the get() method to find entries “covered” by this audio annotation. See that method for more information. For usage details, refer to forte.data.ontology.top.Annotation.get().

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]
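
For illustration, a hedged sketch that annotates part of an audio payload using the generic AudioAnnotation type directly (concrete ontologies normally subclass it); the waveform and offsets are arbitrary:

import numpy as np
from forte.data.data_pack import DataPack
from forte.data.ontology.top import AudioAnnotation

pack = DataPack()
# One second of silence at 16 kHz as a stand-in audio payload.
pack.set_audio(audio=np.zeros(16000, dtype=np.float32), sample_rate=16000)

# Annotate the first half second (8000 samples) of the audio.
utterance = pack.add_entry(AudioAnnotation(pack, 0, 8000))
assert utterance.audio.shape == (8000,)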

class forte.data.ontology.top.Link(pack, parent=None, child=None)[source]

Link type entries, such as “predicate link”. Each link has a parent node and a child node.

Parameters
  • pack (~PackType) – The container that this annotation will be added to.

  • parent (Optional[Entry]) – the parent entry of the link.

  • child (Optional[Entry]) – the child entry of the link.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally by its pack-specific index key.

Parameters

parent (Entry) – The parent entry.

set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally by its pack-specific index key.

Parameters

child (Entry) – The child entry.

property parent

Get tid of the parent node. To get the object of the parent node, call get_parent().

property child

Get tid of the child node. To get the object of the child node, call get_child().

get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link.
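
For illustration, a minimal sketch of a parent/child link between two annotations in one DataPack, using the generic Annotation and Link types directly (concrete ontologies normally subclass them); the text and offsets are arbitrary:

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Annotation, Link

pack = DataPack()
pack.set_text("The cat sat.")

subject = pack.add_entry(Annotation(pack, 0, 7))   # "The cat"
verb = pack.add_entry(Annotation(pack, 8, 11))     # "sat"

link = pack.add_entry(Link(pack, parent=subject, child=verb))
assert link.get_parent().tid == subject.tid
assert link.get_child().tid == verb.tid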

class forte.data.ontology.top.Group(pack, members=None)[source]

Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members, with no duplicates allowed.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.
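
For illustration, a minimal sketch of grouping entries with the generic Annotation and Group types (concrete ontologies such as coreference groups normally subclass them); the text and offsets are arbitrary:

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Annotation, Group

pack = DataPack()
pack.set_text("Alice met Bob. She greeted him.")

alice = pack.add_entry(Annotation(pack, 0, 5))    # "Alice"
she = pack.add_entry(Annotation(pack, 15, 18))    # "She"

group = pack.add_entry(Group(pack))
group.add_members([alice, she])
assert {m.text for m in group.get_members()} == {"Alice", "She"}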

class forte.data.ontology.top.MultiPackGeneric(pack)[source]
class forte.data.ontology.top.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.

class forte.data.ontology.top.MultiPackLink(pack, parent=None, child=None)[source]

This is used to link entries in a MultiPack, which is designed to support cross-pack linking. This can support applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers, with one additional index indicating which pack the node comes from.

ParentType

alias of forte.data.ontology.core.Entry

ChildType

alias of forte.data.ontology.core.Entry

parent_id()[source]

Return the tid of the parent entry.

Return type

int

Returns

The tid of the parent entry.

child_id()[source]

Return the tid of the child entry.

Return type

int

Returns

The tid of the child entry.

parent_pack_id()[source]

Return the pack_id of the parent pack.

Return type

int

Returns

The pack_id of the parent pack.

child_pack_id()[source]

Return the pack_id of the child pack.

Return type

int

Returns

The pack_id of the child pack.

set_parent(parent)[source]

This will set the parent of the current instance with the given Entry. The parent is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

Parameters

parent (Entry) – The parent of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

set_child(child)[source]

This will set the child of the current instance with the given Entry. The child is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

Parameters

child (Entry) – The child of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

get_parent()[source]

Get the parent entry of the link.

Return type

Entry

Returns

An instance of Entry that is the parent of the link.

get_child()[source]

Get the child entry of the link.

Return type

Entry

Returns

An instance of Entry that is the child of the link.
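
For illustration, a hedged sketch of linking entries across two packs with MultiPackLink, again using the generic Annotation type for brevity; the texts and offsets are arbitrary:

from forte.data.multi_pack import MultiPack
from forte.data.ontology.top import Annotation, MultiPackLink

m_pack = MultiPack()
en_pack = m_pack.add_pack(ref_name="en")
fr_pack = m_pack.add_pack(ref_name="fr")
en_pack.set_text("Hello world.")
fr_pack.set_text("Bonjour le monde.")

en_sent = en_pack.add_entry(Annotation(en_pack, 0, 12))
fr_sent = fr_pack.add_entry(Annotation(fr_pack, 0, 17))

link = m_pack.add_entry(MultiPackLink(m_pack, en_sent, fr_sent))
assert link.get_parent().tid == en_sent.tid
assert link.get_child().tid == fr_sent.tid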

class forte.data.ontology.top.Query(pack)[source]

An entry type representing queries for information retrieval tasks.

Parameters

pack (~PackType) – Data pack reference to which this query will be added

add_result(pid, score)[source]

Set the result score for a particular pack (based on the pack id).

Parameters
  • pid (str) – the pack id.

  • score (float) – the score for the pack

Returns

None

update_results(pid_to_score)[source]

Updates the results for this query.

Parameters

pid_to_score (Dict[str, float]) – A dict containing pack id -> score mapping
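
For illustration, a minimal sketch of recording retrieval scores on a Query entry; the query text, pack ids, and scores are placeholders:

from forte.data.data_pack import DataPack
from forte.data.ontology.top import Query

pack = DataPack()
pack.set_text("what is the capital of France")

query = pack.add_entry(Query(pack))
query.add_result("doc_pack_1", 0.87)
query.update_results({"doc_pack_2": 0.42, "doc_pack_3": 0.13})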

Packs

BasePack

class forte.data.base_pack.BasePack(pack_name=None)[source]

The base class of DataPack and MultiPack.

Parameters

pack_name (Optional[str]) – a string name of the pack.

abstract delete_entry(entry)[source]

Remove the entry from the pack.

Parameters

entry (~EntryType) – The entry to be removed.

Returns

None

add_entry(entry, component_name=None)[source]

Add an Entry object to the BasePack object. Allow duplicate entries in a pack.

Parameters
  • entry (Entry) – An Entry object to be added to the pack.

  • component_name (Optional[str]) – A name to record that the entry is created by this component.

Return type

~EntryType

Returns

The input entry itself

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that have not been added to the pack manually.

Parameters

component (Optional[str]) – Overwrite the component record with this.

Returns

None

to_string(drop_record=False, json_method='jsonpickle', indent=None)[source]

Return the string representation (json encoded) of this pack.

Parameters
  • drop_record (Optional[bool]) – Whether to drop the creation records, default is False.

  • json_method (str) – What method is used to convert the data pack to json. Only supports jsonpickle for now. The default value is jsonpickle.

  • indent (Optional[int]) – The indent used for json string.

Returns: String representation of the data pack.

Return type

str

serialize(output_path, zip_pack=False, drop_record=False, serialize_method='jsonpickle', indent=None)[source]

Serializes the data pack to the provided path. The output of this function depends on the serialization method chosen.

Parameters
  • output_path (Union[str, Path]) – The path to write data to.

  • zip_pack (bool) – Whether to compress the result with gzip.

  • drop_record (bool) – Whether to drop the creation records, default is False.

  • serialize_method (str) – The method used to serialize the data. Currently supports jsonpickle (outputs str) and Python’s built-in pickle (outputs bytes).

  • indent (Optional[int]) – Whether to indent the file if written as JSON.

Returns: Results of serialization.
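
For illustration, a minimal round-trip sketch using the DataPack subclass (whose deserialize() is documented below); the file name is illustrative:

from forte.data.data_pack import DataPack

pack = DataPack(pack_name="example")
pack.set_text("Forte stores text and annotations together.")

# Write the pack as (optionally indented) jsonpickle output, then load it back.
pack.serialize("example_pack.json", zip_pack=False, indent=2)
restored = DataPack.deserialize("example_pack.json")
assert restored.text == pack.text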

set_control_component(component)[source]

Record the current component that is taking control of this pack.

Parameters

component (str) – The component that is going to take control

Returns

None

record_field(entry_id, field_name)[source]

Record who modifies the entry. This will be called in Entry.

Parameters
  • entry_id (int) – The id of the entry.

  • field_name (str) – The name of the field modified.

Returns

None

on_entry_creation(entry, component_name=None)[source]

Call this when adding a new entry; it will be called in Entry when its __init__ function is called.

Parameters
  • entry (Entry) – The entry to be added.

  • component_name (Optional[str]) – A name to record that the entry is created by this component.

Returns

None

regret_creation(entry)[source]

Will remove the entry from the pending entries internal state of the pack.

Parameters

entry (~EntryType) – The entry that will not be added to the pack anymore.

Returns

None

get_entry(tid)[source]

Look up the entry_index with key tid. The specific implementation depends on the actual class.

Return type

~EntryType

abstract get(entry_type, **kwargs)[source]

Implementations of this method should provide a way to obtain the entries following the entry ordering. If there are orders defined between the entries, they should be used first. Otherwise, the insertion order should be used (FIFO).

Parameters

entry_type (Union[str, Type[~EntryType]]) – The type of the entry to obtain.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the provided arguments.

get_single(entry_type)[source]

Take a single entry of type entry_type from this data pack. This is useful when the target entry type appears only once in the DataPack, e.g., a Document entry, or when you just intend to take the first one.

Parameters

entry_type (Union[str, Type[~EntryType]]) – The entry type to be retrieved.

Return type

~EntryType

Returns

A single data entry.
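
For illustration, a minimal sketch of get_single(), assuming the Document annotation type from ft.onto.base_ontology is available:

from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Document

pack = DataPack()
pack.set_text("A single-document pack.")
pack.add_entry(Document(pack, 0, len(pack.text)))

doc = pack.get_single(Document)
assert doc.text == "A single-document pack."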

get_ids_by_creator(component)[source]

Look up the component_index with key component. This will return the entry ids that are created by the component.

Parameters

component (str) – The component (creator) to find ids for.

Return type

Set[int]

Returns

A set of entry ids that are created by the component.

is_created_by(entry, components)[source]

Check if the entry is created by any of the provided components.

Parameters
  • entry (Entry) – The entry to check.

  • components – The component (creator) name(s) to check against.

Return type

bool

Returns

True if the entry is created by any of the provided components, False otherwise.

get_entries_from(component)[source]

Look up all entries from the component as an unordered set.

Parameters

component (str) – The component (creator) to get the entries. It is normally the full qualified name of the creator class, but it may also be customized based on the implementation.

Return type

Set[~EntryType]

Returns

The set of entries that are created by the input component.

get_ids_from(components)[source]

Look up entries using a list of components (creators). This will find each creator iteratively and combine the result.

Parameters

components (List[str]) – The list of components to find.

Return type

Set[int]

Returns

The set of entry ids that are created from these components.

get_entries_of(entry_type, exclude_sub_types=False)[source]

Return all entries of this particular type, in no particular order. If you need to get the annotations based on the entry ordering, use forte.data.base_pack.BasePack.get().

Parameters
  • entry_type (Type[~EntryType]) – The type of the entry you are looking for.

  • exclude_sub_types – Whether to ignore the inherited sub types of the provided entry_type. Default is False.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the type constraint.

DataPack

class forte.data.data_pack.DataPack(pack_name=None)[source]

A DataPack contains a piece of natural language text and a collection of NLP entries (annotations, links, and groups). The natural language text could be a document, paragraph or in any other granularity.

Parameters

pack_name (Optional[str]) – A name for this data pack.

property text

Return the text of the data pack.

Return type

str

property audio

Return the audio of the data pack.

Return type

Optional[ndarray]

property sample_rate

Return the sample rate of the audio data.

Return type

Optional[int]

property all_annotations

An iterator of all annotations in this data pack.

Returns: Iterator of all annotations, of type Annotation.

Return type

Iterator[Annotation]

property num_annotations

Number of annotations in this data pack.

Returns: Number of annotations.

Return type

int

property all_links

An iterator of all links in this data pack.

Returns: Iterator of all links, of type Link.

Return type

Iterator[Link]

property num_links

Number of links in this data pack.

Returns: Number of links.

Return type

int

property all_groups

An iterator of all groups in this data pack.

Returns: Iterator of all groups, of type Group.

Return type

Iterator[Group]

property num_groups

Number of groups in this data pack.

Returns: Number of groups.

property all_generic_entries

An iterator of all generic entries in this data pack.

Returns: Iterator of all generic entries, of type Generics.

Return type

Iterator[Generics]

property num_generics_entries

Number of generics entries in this data pack.

Returns: Number of generics entries.

property all_audio_annotations

An iterator of all audio annotations in this data pack.

Returns: Iterator of all audio annotations, of type AudioAnnotation.

Return type

Iterator[AudioAnnotation]

property num_audio_annotations

Number of audio annotations in this data pack.

Returns: Number of audio annotations.

get_span_text(begin, end)[source]

Get the text in the data pack contained in the span.

Parameters
  • begin (int) – begin index to query.

  • end (int) – end index to query.

Return type

str

Returns

The text within this span.

get_span_audio(begin, end)[source]

Get the audio in the data pack contained in the span. begin and end represent the starting and ending indices of the span in audio payload respectively. Each index corresponds to one sample in audio time series.

Parameters
  • begin (int) – begin index to query.

  • end (int) – end index to query.

Return type

ndarray

Returns

The audio within this span.

set_audio(audio, sample_rate)[source]

Set the audio payload and sample rate of the DataPack object.

Parameters
  • audio (ndarray) – A numpy array storing the audio waveform.

  • sample_rate (int) – An integer specifying the sample rate.
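
For illustration, a minimal sketch of attaching audio to a DataPack and slicing a span of samples; the waveform and sample rate are placeholders:

import numpy as np
from forte.data.data_pack import DataPack

pack = DataPack()
waveform = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
pack.set_audio(audio=waveform, sample_rate=16000)

first_100_ms = pack.get_span_audio(0, 1600)   # 1600 samples at 16 kHz = 100 ms
assert first_100_ms.shape == (1600,)
assert pack.sample_rate == 16000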

get_original_text()[source]

Get original unmodified text from the DataPack object.

Returns

The original text, obtained by applying the replace_back_operations of the DataPack object to the modified text.

get_original_span(input_processed_span, align_mode='relaxed')[source]

Function to obtain span of the original text that aligns with the given span of the processed text.

Parameters
  • input_processed_span (Span) – Span of the processed text for which the corresponding span of the original text is desired.

  • align_mode (str) –

    The strictness criteria for alignment in the ambiguous cases, that is, if a part of input_processed_span spans a part of the inserted span, then align_mode controls whether to use the span fully or ignore it completely according to the following possible values:

    • "strict" - do not allow ambiguous input, give ValueError.

    • "relaxed" - consider spans on both sides.

    • "forward" - align looking forward, that is, ignore the span towards the left, but consider the span towards the right.

    • "backward" - align looking backwards, that is, ignore the span towards the right, but consider the span towards the left.

Returns

Span of the original text that aligns with input_processed_span

Example

  • Let o-up1, o-up2, … and m-up1, m-up2, … denote the unprocessed spans of the original and modified string respectively. Note that each o-up would have a corresponding m-up of the same size.

  • Let o-pr1, o-pr2, … and m-pr1, m-pr2, … denote the processed spans of the original and modified string respectively. Note that each o-pr is modified to a corresponding m-pr that may be of a different size than o-pr.

  • Original string: <--o-up1--> <-o-pr1-> <----o-up2----> <----o-pr2----> <-o-up3->

  • Modified string: <--m-up1--> <----m-pr1----> <----m-up2----> <-m-pr2-> <-m-up3->

  • Note that self.inverse_original_spans that contains modified processed spans and their corresponding original spans, would look like - [(o-pr1, m-pr1), (o-pr2, m-pr2)]

>>> data_pack = DataPack()
>>> original_text = "He plays in the park"
>>> data_pack.set_text(original_text,
...                    lambda _: [(Span(0, 2), "She")])
>>> data_pack.text
"She plays in the park"
>>> input_processed_span = Span(0, len("She plays"))
>>> orig_span = data_pack.get_original_span(input_processed_span)
>>> data_pack.get_original_text()[orig_span.begin: orig_span.end]
"He plays"

classmethod deserialize(data_source, serialize_method='jsonpickle', zip_pack=False)[source]

Deserialize a Data Pack from a data source. This internally calls the _deserialize() function from BasePack.

Parameters
  • data_source (Union[Path, str]) – The path storing data source.

  • serialize_method (str) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.

  • zip_pack (bool) – Boolean value indicating whether the input source is zipped.

Return type

DataPack

Returns

A data pack object deserialized from the string.

delete_entry(entry)[source]

Delete an Entry object from the DataPack. This finds the entry in the index and removes it from the index. Note that entries will only appear in the index if add_entry (or _add_entry_with_check) is called.

Please note that deleting an entry does not guarantee the deletion of the related entries.

Parameters

entry (~EntryType) – An Entry object to be deleted from the pack.

get_data(context_type, request=None, skip_k=0)[source]

Fetch data from entries in the data_pack of type context_type. Data includes “span”, annotation-specific default data fields and specific data fields by “request”.

Annotation-specific data fields means:

  • “text” for Type[Annotation]

  • “audio” for Type[AudioAnnotation]

Currently, we do not support Groups and Generics in the request.

Example

requests = {
    base_ontology.Sentence:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
    },
}
pack.get_data(base_ontology.Sentence, requests)
Parameters
  • context_type (Union[str, Type[Annotation], Type[AudioAnnotation]]) –

    The granularity of the data context, which could be any Annotation or AudioAnnotation type. Behaviors under different context_type varies:

    • str type will be converted into either Annotation type or AudioAnnotation type.

    • Type[Annotation]: the default data field for getting context data is text. This function iterates all_annotations to search target entry data.

    • Type[AudioAnnotation]: the default data field for getting context data is audio which stores audio data in numpy arrays. This function iterates all_audio_annotations to search target entry data.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) –

The entry types and fields the user wants to request. The keys of the requests dict are the required entry types and the value should be either:

    • a list of field names or

    • a dict which accepts three keys: “fields”, “component”, and “unit”.

      • By setting “fields” (list), users specify the requested fields of the entry. If “fields” is not specified, only the default fields will be returned.

      • By setting “component” (list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components.

      • By setting “unit” (string), users can specify a unit by which the annotations are indexed.

    Note that for all annotation types, “span” fields and annotation-specific data fields are returned by default.

    For all link types, “child” and “parent” fields are returned by default.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Return type

Iterator[Dict[str, Any]]

Returns

A data generator, which generates one piece of data (a dict containing the required entries, fields, and context).

build_coverage_for(context_type, covered_type)[source]

User can call this function to build the coverage index for specific types. The index provides an in-memory mapping from entries of context_type to the entries “covered” by it. See forte.data.data_pack.DataIndex for more details.

Parameters
  • context_type – The type of the context (covering) entries, an Annotation or AudioAnnotation type.

  • covered_type – The type of the entries to be covered.

covers(context_entry, covered_entry)[source]

Check if the covered_entry is covered (in span) by the context_entry.

See in_span() and in_audio_span() for the definition of in span.

Parameters
  • context_entry (Union[Annotation, AudioAnnotation]) – The context entry.

  • covered_entry (~EntryType) – The entry to be checked on whether it is in span of the context entry.

Returns (bool): True if in span.

Return type

bool

iter_in_range(entry_type, range_annotation)[source]

Iterate the entries of the provided type within, or fulfilling the constraints of, the range_annotation. The constraint is True if an entry is in_span() or in_audio_span() of the provided range_annotation.

Internally, if the coverage index between the entry type and the type of the range_annotation is built, then this will create the iterator from the index. Otherwise, the function will iterate them from scratch (which is slower). If this function is used frequently, it is suggested to build the coverage index.

Only when range_annotation is an instance of AudioAnnotation will the searching be performed on the list of audio annotations. In other cases (i.e., when range_annotation is None or Annotation), it defaults to a searching process on the list of text annotations.

Parameters
  • entry_type (Type[~EntryType]) – The type of entry to iterate over.

  • range_annotation (Union[Annotation, AudioAnnotation]) – The range annotation that serve as the constraint.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries within the range_annotation.

get(entry_type, range_annotation=None, components=None, include_sub_type=True)[source]

This function is used to get data from a data pack with various methods.

Depending on the provided arguments, the function will perform several different filtering of the returned data.

The entry_type is mandatory, where all the entries matching this type will be returned. The sub-types of the provided entry type will be also returned if include_sub_type is set to True (which is the default behavior).

The range_annotation controls the search area of the sub-types. An entry E will be returned if in_span() or in_audio_span() returns True. If this function is called frequently with queries related to the range_annotation, please consider building the coverage index for the related entry types. Users can call build_coverage_for(context_type, covered_type) in order to build a mapping between a pair of entry types and the target entries that are covered in ranges specified by outer entries.

The components list will filter the results by the component (i.e., the creator of the entry). If components is provided, only the entries created by one of the components will be returned.

Example

# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from a sentence created by NLTKTokenizer.
    token_entries = input_pack.get(
        entry_type=Token,
        range_annotation=sentence,
        component='NLTKTokenizer')
    ...

In the above code snippet, we get entries of type Token within each sentence which were generated by NLTKTokenizer. You can consider building a coverage index between Token and Sentence if this snippet is frequently used:

# Build coverage index between `Token` and `Sentence`
input_pack.build_coverage_for(
    context_type=Sentence,
    covered_type=Token
)

After building the index from the snippet above, you will be able to retrieve the tokens covered by sentence much faster.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of entries requested.

  • range_annotation (Union[Annotation, AudioAnnotation, None]) – The range of entries requested. If None, will return valid entries in the range of whole data pack.

  • components (Union[str, Iterable[str], None]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.

  • include_sub_type (bool) – whether to consider the sub types of the provided entry type. Default True.

Yields

Each Entry found using this method.

Return type

Iterable[~EntryType]

update(datapack)[source]

Update the attributes and properties of the current DataPack with another DataPack.

Parameters

datapack (DataPack) – A reference datapack to update

BaseMeta

class forte.data.base_pack.BaseMeta(pack_name=None)[source]

Basic Meta information for both DataPack and MultiPack.

Parameters

pack_name (Optional[str]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

record

Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

Meta

class forte.data.data_pack.Meta(pack_name=None, language='eng', span_unit='character', sample_rate=None, info=None)[source]

Basic Meta information associated with each instance of DataPack.

Parameters
  • pack_name (Optional[str]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.

  • language (str) – The language used by this data pack, default is English.

  • span_unit (str) – The unit used for interpreting the Span object of this data pack. Default is character.

  • sample_rate (Optional[int]) – An integer specifying the sample rate of audio payload. Default is None.

  • info (Optional[Dict[str, str]]) – Stores additional string-based information that the user adds.

pack_name

storing the provided pack_name.

language

storing the provided language.

sample_rate

storing the provided sample_rate.

info

storing the provided info.

record

Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.

DataIndex

class forte.data.data_pack.DataIndex[source]

A set of indexes used in DataPack. Note that this class is used by the DataPack internally.

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. component_index, the index from each component to the entries generated by that component

  4. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  5. group_index, the index from group members to groups.

  6. _coverage_index, the index that maps from an annotation to the entries it covers. _coverage_index is a dict of dict, where the key is a tuple of the outer entry type and the inner entry type. The outer entry type should be an annotation type. The value is a dict, where the key is the tid of the outer entry, and the value is a set of tids that are covered by the outer entry. We say an Annotation A covers an entry E if one of the following conditions is met: 1. E is of Annotation type, and E.begin >= A.begin, E.end <= A.end; 2. E is of Link type, and both E's parent and child nodes are Annotations that are covered by A.

coverage_index(outer_type, inner_type)[source]

Get the coverage index from outer_type to inner_type.

Parameters
  • outer_type – The type of the outer (covering) entries.

  • inner_type – The type of the inner (covered) entries.

Return type

Optional[Dict[int, Set[int]]]

Returns

If the coverage index does not exist, return None. Otherwise, return a dict.

get_covered(data_pack, context_annotation, inner_type)[source]

Get the entries covered by a certain context annotation.

Parameters
  • data_pack (DataPack) – The data pack to search in.

  • context_annotation (Union[Annotation, AudioAnnotation]) – The context annotation to search in.

  • inner_type (Type[~EntryType]) – The inner type to be searched for.

Return type

Set[int]

Returns

Entry ids of type inner_type that are covered by context_annotation.

build_coverage_index(data_pack, outer_type, inner_type)[source]

Build the coverage index from outer_type to inner_type.

Parameters
  • data_pack (DataPack) – The data pack to build coverage for.

  • outer_type (Type[Union[Annotation, AudioAnnotation]]) – an annotation or AudioAnnotation type.

  • inner_type (Type[~EntryType]) – an entry type, can be Annotation, Link, Group, AudioAnnotation.

have_overlap(entry1, entry2)[source]

Check whether the two annotations have overlap in span.

Parameters
  • entry1 – The first annotation to check.

  • entry2 – The second annotation to check.

Return type

bool

in_span(inner_entry, span)[source]

Check whether the inner entry is within the given span. The criteria are as follows:

Annotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.

Link entries: if the parent and child of the links are both Annotation type, this link will be considered in span if both parent and child are in_span() of the provided span. If either the parent or the child is not of type Annotation, this function will always return False.

Group entries: if the child type of the group is Annotation type, then the group will be considered in span if all the elements are in_span() of the provided span. If the child type is not Annotation type, this function will always return False.

Other entries (i.e., Generics and AudioAnnotation): they will not be considered in_span() of any spans. The function will always return False.

Parameters
  • inner_entry (Union[int, Entry]) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.

Return type

bool

Returns

True if the inner_entry is considered to be in span of the provided span.

in_audio_span(inner_entry, span)[source]

Check whether the inner entry is within the given audio span. This method is identical to in_span() except that it operates on the audio payload of the data pack. The criteria are as follows:

AudioAnnotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.

Link entries: if the parent and child of the links are both AudioAnnotation type, this link will be considered in span if both parent and child are in_span() of the provided span. If either the parent or the child is not of type AudioAnnotation, this function will always return False.

Group entries: if the child type of the group is AudioAnnotation type, then the group will be considered in span if all the elements are in_span() of the provided span. If the child type is not AudioAnnotation type, this function will always return False.

Other entries (i.e., Generics and Annotation): they will not be considered in_span() of any spans. The function will always return False.

Parameters
  • inner_entry (Union[int, Entry]) – The inner entry object to be checked whether it is within span. The argument can be the entry id or the entry object itself.

  • span (Span) – A Span object to be checked. We will check whether the inner_entry is within this span.

Return type

bool

Returns

True if the inner_entry is considered to be in span of the provided span.

MultiPack

MultiPackMeta

class forte.data.multi_pack.MultiPackMeta(pack_name=None)[source]

Meta information of a MultiPack.

MultiPack

class forte.data.multi_pack.MultiPack(pack_name=None)[source]

A MultiPack contains multiple DataPacks and a collection of cross-pack entries (such as links and groups).

relink(packs)[source]

Re-link the reference of the multi-pack to other entries, including the data packs in it.

Parameters

packs (Iterator[DataPack]) – a data pack iterator.

Returns

None

get_subentry(pack_idx, entry_id)[source]

Get a sub-entry from the multi pack. This method uses pack_id (a unique identifier assigned to each data pack) to get a pack from the multi pack, and then returns its sub-entry with entry_id. Note that this differs from how such packs were accessed before the PACK_ID_COMPATIBLE_VERSION, in which pack_idx was used as a list index to access/reference a pack within the multi pack (and then get the sub-entry).

Parameters
  • pack_idx (int) – The pack_id of the data pack in the multi pack.

  • entry_id (int) – The id of the entry within the pack identified by pack_idx.

Returns

The sub-entry with entry_id from the pack whose pack_id equals pack_idx.

remove_pack(index_of_pack, clean_invalid_entries=False, purge_lists=False)[source]

Remove a data pack at index index_of_pack from this multi pack.

In a multi pack, the data pack to be removed may be associated with some multi pack entries, such as MultiPackLinks that are connected with other packs. These entries will become dangling and invalid, and thus need to be removed. One can consider removing these links before calling this function, or set clean_invalid_entries to True so that they will be automatically pruned. If purge_lists is set to True, the empty slots left in the internal lists of this multi pack by the removed pack will also be purged; this changes the indices of the packs that come after the removed one, so the user is responsible for managing such changes if those indices are used or stored elsewhere.

Parameters
  • index_of_pack (int) – The index of pack for removal from the multi pack. If invalid, no pack will be deleted.

  • clean_invalid_entries (bool) – Switch for automatically cleaning the entries associated with the data pack being deleted which will become invalid after the removal of the pack. Default is False.

  • purge_lists (bool) – Switch for automatically removing the empty slots left in the internal lists of this multi pack by the removed pack. Purging changes the indices of the packs that come after the removed one, so the user is responsible for managing such changes if those indices are used or stored elsewhere. Default is False.

Return type

bool

Returns

True if successful.

Raises

ValueError – if clean_invalid_entries is set to False and the DataPack to be removed has entries (in links, groups) associated with it.

purge_deleted_packs()[source]

Purge deleted packs from the internal lists, whose slots were previously set to -1, empty, or None in order to keep the indices unchanged. Caution: purging the deleted packs removes these empty slots from the lists of this multi pack, which changes the indices of the packs that come after the deleted pack(s); the user is responsible for managing such changes if those indices are used or stored somewhere in the user's code after purging.

Return type

bool

Returns

True if successful.

add_pack(ref_name=None, pack_name=None)[source]

Create a data pack and add it to this multi pack. If ref_name is provided, it will be used to index the data pack. Otherwise, a default name based on the pack id will be created for this data pack. The created data pack will be returned.

Parameters
  • ref_name (Optional[str]) – The pack name used to reference this data pack from the multi pack. If none, the reference name will not be set.

  • pack_name (Optional[str]) – The pack name of the data pack (itself). If none, the name will not be set.

Returns: The newly created data pack.

Return type

DataPack

add_pack_(pack, ref_name=None)[source]

Add an existing data pack to the multi pack.

Parameters
  • pack (DataPack) – The existing data pack.

  • ref_name (Optional[str]) – The name to be used to refer to this data pack in the multi pack.

Returns

None

get_pack_at(index)[source]

Get data pack at provided index.

Parameters

index (int) – The index of the pack.

Return type

DataPack

Returns

The pack at the index.

get_pack_index(pack_id)[source]

Get the pack index from the global pack id.

Parameters

pack_id (int) – The global pack id to find.

Return type

int

Returns

The index of the pack in this multi pack.

get_pack(name)[source]

Get data pack of name.

Parameters

name (str) – The name of the pack.

Return type

DataPack

Returns

The pack that has that name.
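
For illustration, a minimal sketch of creating packs inside a MultiPack and retrieving them by reference name or index; the names and texts are placeholders:

from forte.data.multi_pack import MultiPack

m_pack = MultiPack()
source = m_pack.add_pack(ref_name="source", pack_name="source_doc")
target = m_pack.add_pack(ref_name="target", pack_name="target_doc")
source.set_text("Bonjour le monde.")
target.set_text("Hello world.")

assert m_pack.get_pack("source").text == "Bonjour le monde."
assert m_pack.get_pack_at(1).text == "Hello world."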

property packs

Get the list of data packs in the order in which they were added.

Please do not use this to try to add or delete data packs.

Return type

List[DataPack]

Returns

List of data packs contained in this multi-pack.

rename_pack(old_name, new_name)[source]

Rename the pack to a new name. If the new_name is already taken, a ValueError will be raised. If the old_name is not found, a KeyError will be raised, just as with a missing key in a dictionary.

Parameters
  • old_name (str) – The old name of the pack.

  • new_name (str) – The new name to be assigned for the pack.

Returns

None

property all_links

An iterator of all links in this multi pack.

Return type

Iterator[MultiPackLink]

Returns

Iterator of all links, of type MultiPackLink.

property num_links

Number of links in this multi pack.

Return type

int

Returns

Number of links.

property all_groups

An iterator of all groups in this multi pack.

Return type

Iterator[MultiPackGroup]

Returns

Iterator of all groups, of type MultiPackGroup.

property num_groups

Number of groups in this multi pack.

Return type

int

Returns

Number of groups.

add_all_remaining_entries(component=None)[source]

Calling this function will add the entries that have not been added to the pack manually.

Parameters

component (Optional[str]) – Overwrite the component record with this.

Returns

None

get_single_pack_data(pack_index, context_type, request=None, skip_k=0)[source]

Get pack data from one of the packs, specified by pack_index. This is equivalent to calling get_data() on that DataPack.

Parameters
  • pack_index (int) – The index of a single pack.

  • context_type (Type[Annotation]) – The granularity of the data context, which could be any Annotation type.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) – The entry types and fields required. The keys of the dict are the required entry types and the value should be either a list of field names or a dict. If the value is a dict, accepted items includes “fields”, “component”, and “unit”. By setting “component” (a list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components. By setting “unit” (a string), users can specify a unit by which the annotations are indexed. Note that for all annotations, “text” and “span” fields are given by default; for all links, “child” and “parent” fields are given by default.

  • skip_k (int) – Will skip the first k instances and generate data from the k + 1 instance.

Return type

Iterator[Dict[str, Any]]

Returns

A data generator, which generates one piece of data (a dict containing the required annotations and context).

get_cross_pack_data(request)[source]

Note

This function is not finished.

Get data via the links and groups across data packs. The keys could be MultiPack entries (i.e. MultiPackLink and MultiPackGroup). The values specify the detailed entry information to be retrieved. A value can be a list of field names, in which case the returned results will contain all the specified fields.

One can also call this method with more constraints by providing a dictionary, which can contain the following keys:

  • “fields”, this specifies the attribute field names to be obtained

  • “unit”, this specifies the unit used to index the annotation

  • “component”, this specifies a constraint to take only the entries created by the specified component.

The data request logic is similar to that of get_data() function in DataPack, but applied on MultiPack entries.

Example:

requests = {
    MultiPackLink:
        {
            "component": ["dummy"],
            "fields": ["speaker"],
        },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_cross_pack_data(requests)
Parameters

request (Dict[Type[Union[MultiPackLink, MultiPackGroup]], Union[Dict, List]]) – A dict containing the data request. The keys are the types to be requested, and the fields are the detailed constraints.

Returns

None

get(entry_type, components=None, include_sub_type=True)[source]

Get entries of entry_type from this multi pack.

Example:

for relation in pack.get(
        CrossDocEntityRelation,
        components="relation_creator"):
    print(relation.get_parent())

In the above code snippet, we get entries of type CrossDocEntityRelation which were generated by a component named relation_creator.

Parameters
  • entry_type (Union[str, Type[~EntryType]]) – The type of the entries requested.

  • components (Union[str, List[str], None]) – The component generating the entries requested. If None, all valid entries generated by any component will be returned.

  • include_sub_type – whether to return the sub types of the queried entry_type. True by default.

Return type

Iterator[~EntryType]

Returns

An iterator of the entries matching the arguments, following the order of entries (first sort by entry comparison, then by insertion)

classmethod deserialize(data_path, serialize_method='jsonpickle', zip_pack=False)[source]

Deserialize a Multi Pack from a string. Note that this will only deserialize the native multi pack content, which means the associated DataPacks contained in the MultiPack will not be recovered. A follow-up step needs to be performed to add the data packs back to the multi pack.

This internally calls the internal _deserialize() function from the BasePack.

Parameters
  • data_path (Union[Path, str]) – The serialized string of a Multi pack to be deserialized.

  • serialize_method (str) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.

  • zip_pack (bool) – Boolean value indicating whether the input source is zipped.

Return type

MultiPack

Returns

A MultiPack object deserialized from the string.
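
A sketch of the deserialization plus the follow-up step mentioned above (the file paths are illustrative, and the exact re-attachment call depends on the Forte version):

from forte.data.data_pack import DataPack
from forte.data.multi_pack import MultiPack

# Recover the multi pack content itself; the member DataPacks are NOT included.
multi_pack = MultiPack.deserialize(
    "multi_pack.json", serialize_method="jsonpickle", zip_pack=False
)

# Follow-up step: load the member packs that were serialized separately.
# How they are re-attached to the multi pack is version dependent, so that
# call is intentionally not shown here.
member_packs = [DataPack.deserialize(p) for p in ["pack_0.json", "pack_1.json"]]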

delete_entry(entry)[source]

Delete an Entry object from the MultiPack.

Parameters

entry (~EntryType) – An Entry object to be deleted from the pack.

MultiPackGroup

class forte.data.multi_pack.MultiPackGroup(pack, members=None)[source]

Group type entries, such as “coreference group”. Each group has a set of members.

MemberType

alias of forte.data.ontology.core.Entry

add_member(member)[source]

Add one entry to the group.

Parameters

member (Entry) – One member to be added to the group.

get_members()[source]

Get the member entries in the group.

Return type

List[Entry]

Returns

Instances of Entry that are the members of the group.
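
A minimal sketch of populating a group (assuming multi_pack is a MultiPack, entity_a and entity_b are entries that belong to it or its member packs, and that entries can be added manually via add_entry()):

# Group two mentions that refer to the same entity.
group = MultiPackGroup(multi_pack)
group.add_member(entity_a)
group.add_member(entity_b)
multi_pack.add_entry(group)

for member in group.get_members():
    print(member.tid)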

Readers

BaseReader

class forte.data.base_reader.BaseReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic data reader class. To be inherited by all data readers.

Parameters
  • from_cache (bool) – Decide whether to read from cache if cache file exists. By default (False), the reader will only read from the original file and use the cache file path for caching, it will not read from the cache_directory. If True, the reader will try to read a datapack from the caching file.

  • cache_directory (Optional[str]) –

    The base directory to place the path of the caching files. Each collection is contained in one cached file, under this directory. The cached location for each collection is computed by _cache_key_function().

    Note

    A collection is the data returned by _collect().

  • append_to_cache (bool) – Decide whether to append write if cache file already exists. By default (False), we will overwrite the existing caching file. If True, we will append the datapack to the end of the caching file.

initialize(resources, configs)[source]

The pipeline will call the initialize method at the start of processing. The reader will be initialized with configs and can register global resources into resources. The implementation should set up the states of the component.

Parameters
  • resources (Resources) – A global resource register. User can register shareable resources here, for example, the vocabulary.

  • configs (HParams) – The configuration passed in to set up this component.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are jsonpickle and pickle. Default is jsonpickle.

parse_pack(collection)[source]

Calls _parse_pack() to create packs from the collection. This internally sets up the component metadata. Users should implement the _parse_pack() method.

Return type

Iterator[~PackType]
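
A hedged sketch of how the pieces fit together in a reader subclass (the class and file format below are hypothetical; only _collect(), _parse_pack(), set_text() and _cache_key_function() from this documentation are used):

from typing import Iterator

from forte.data.base_reader import PackReader
from forte.data.data_pack import DataPack


class LineReader(PackReader):
    """Hypothetical reader: each line of a text file becomes one DataPack."""

    def _collect(self, file_path: str) -> Iterator[str]:
        # A "collection" here is one line of text.
        with open(file_path, encoding="utf-8") as f:
            for line in f:
                yield line.strip()

    def _parse_pack(self, collection: str) -> Iterator[DataPack]:
        # Some Forte versions prefer self.new_pack() over DataPack() here.
        pack = DataPack()
        self.set_text(pack, collection)
        yield pack

    def _cache_key_function(self, collection: str) -> str:
        # Used to compute the cached location for this collection.
        return str(hash(collection))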

text_replace_operation(text)[source]

Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

Parameters

text (str) – The original data text to be cleaned.

Return type

List[Tuple[Span, str]]

Returns

The replacement operations.
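
For instance, a reader subclass might normalize non-breaking spaces before the pack text is set (a sketch; Span comes from forte.data.span):

from typing import List, Tuple

from forte.data.span import Span


def text_replace_operation(self, text: str) -> List[Tuple[Span, str]]:
    # Replace every non-breaking space with a regular space: each operation
    # says that the characters covered by the Span become the given string.
    return [
        (Span(i, i + 1), " ")
        for i, ch in enumerate(text)
        if ch == "\u00a0"
    ]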

set_profiling(enable_profiling=True)[source]

Set profiling option.

Parameters

enable_profiling (bool) – A boolean of whether to enable profiling for the reader or not (the default is True).

timer_yield(pack)[source]

Wrapper generator for time profiling. Insert timers around ‘yield’ to support time profiling for reader.

Parameters

pack (~PackType) – DataPack passed from self.iter()

iter(*args, **kwargs)[source]

An iterator over the entire dataset, yielding all the packs read from the data source(s). If not reading from cache, this will go through the collection process (see _collect()).

Parameters
  • args – One or more input data sources, for example, most DataPack readers accept data_source as file/folder path.

  • kwargs – Iterator of DataPacks.

Return type

Iterator[~PackType]

record(record_meta)[source]

Modify the pack meta record field of the reader’s output. The keys of the record should be the entry types and the values should be attributes of those entry types. All the information will be used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.

Parameters

record_meta (Dict[str, Set[str]]) – the field in the datapack for type record that need to fill in for consistency checking.
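
A hedged sketch of a reader's record() implementation (the entry types and attributes listed are illustrative):

from typing import Dict, Set


def record(self, record_meta: Dict[str, Set[str]]):
    # Declare what this reader produces: Sentence entries with no extra
    # attributes, and Token entries carrying a "pos" attribute.
    record_meta["ft.onto.base_ontology.Sentence"] = set()
    record_meta["ft.onto.base_ontology.Token"] = {"pos"}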

cache_data(collection, pack, append)[source]

Specify the path to the cache directory.

After you call this method, the dataset reader will use its cache_directory to store a cache of BasePack read from every document passed to read, serialized as one string-formatted BasePack. If the cache file for a given file_path exists, we read the BasePack from the cache. If the cache file does not exist, we will create it on our first pass through the data.

Parameters
  • collection (Any) – The collection is a piece of data from the _collect() function, to be read to produce DataPack(s). During caching, a cache key is computed based on the data in this collection.

  • pack (~PackType) – The data pack to be cached.

  • append (bool) – Whether to allow appending to the cache.

read_from_cache(cache_filename)[source]

Reads one or more Packs from cache_filename, and yields Pack(s) from the cache file.

Parameters

cache_filename (Union[Path, str]) – Path to the cache file.

Return type

Iterator[~PackType]

Returns

List of cached data packs.

finish(resource)[source]

The pipeline will call this function at the end of the pipeline to notify all the components. The user can implement this function to release resources used by this component. The component can also add objects to the resources.

Parameters

resource (Resources) – A global resource registry.

set_text(pack, text)[source]

Assign the text value to the DataPack. This function will pass the text_replace_operation to the DataPack to conduct the pre-processing step.

Parameters
  • pack (DataPack) – The DataPack to assign value for.

  • text (str) – The original text to be recorded in this dataset.

PackReader

class forte.data.base_reader.PackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

A Pack Reader reads data into DataPack.

MultiPackReader

class forte.data.base_reader.MultiPackReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

The basic MultiPack data reader class. To be inherited by all data readers which return MultiPack.

CoNLL03Reader

ConllUDReader

class forte.data.readers.conllu_ud_reader.ConllUDReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

ConllUDReader is designed to read in the Universal Dependencies 2.4 dataset.

BaseDeserializeReader

RawDataDeserializeReader

RecursiveDirectoryDeserializeReader

HTMLReader

MSMarcoPassageReader

class forte.data.readers.ms_marco_passage_reader.MSMarcoPassageReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

MultiPackSentenceReader

MultiPackTerminalReader

OntonotesReader

PlainTextReader

ProdigyReader

RACEMultiChoiceQAReader

StringReader

SemEvalTask8Reader

OpenIEReader

SquadReader

class forte.datasets.mrc.squad_reader.SquadReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]

Reader for processing Stanford Question Answering Dataset (SQuAD).

Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span.

Dataset can be downloaded at https://rajpurkar.github.io/SQuAD-explorer/.

SquadReader reads each paragraph in the dataset as a separate Document, with the questions concatenated after the paragraph to form a Passage. Answers are marked as Phrase text spans. Each MRCQuestion has a list of answers.

classmethod default_configs()[source]

Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.

Here:

  • zip_pack (bool): whether to zip the results. The default value is False.

  • serialize_method: The method used to serialize the data. Current available options are jsonpickle and pickle. Default is jsonpickle.

record(record_meta)[source]

Method to add the output type record of SquadReader, which is ft.onto.base_ontology.Document with an empty set, to forte.data.data_pack.Meta.record.

Parameters

record_meta (Dict[str, Set[str]]) – the field in the datapack for type record that need to fill in for consistency checking.

ClassificationDatasetReader

Selector

Selector

class forte.data.selector.Selector[source]

DummySelector

class forte.data.selector.DummySelector[source]

Do nothing, return the data pack itself, which can be either DataPack or MultiPack.

SinglePackSelector

class forte.data.selector.SinglePackSelector[source]

This is the base class for selectors that select a DataPack from a MultiPack.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.
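
A hedged sketch of a custom selector built on this interface (the class name and selection rule are hypothetical):

from forte.data.data_pack import DataPack
from forte.data.multi_pack import MultiPack
from forte.data.selector import SinglePackSelector


class NonEmptyPackSelector(SinglePackSelector):
    """Hypothetical selector: keep only packs that contain some text."""

    def will_select(
        self, pack_name: str, pack: DataPack, multi_pack: MultiPack
    ) -> bool:
        # Select the pack only if it has non-whitespace text.
        return bool(pack.text.strip())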

NameMatchSelector

class forte.data.selector.NameMatchSelector(select_name=None)[source]

Select a DataPack from a MultiPack with a specified name. This implementation takes special care for backward compatibility.

Deprecated:

selector = NameMatchSelector(select_name="foo")
selector = NameMatchSelector("foo")

Now:

selector = NameMatchSelector()
selector.initialize(
    configs={
        "select_name": "foo"
    }
)

WARNING: Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Returns

A boolean value indicating whether the pack will be selected.

RegexNameMatchSelector

class forte.data.selector.RegexNameMatchSelector(select_name=None)[source]

Select a DataPack from a MultiPack using a regex.

This implementation takes special care for backward compatibility.

Deprecated:

selector = RegexNameMatchSelector(select_name="^.*\\d$")
selector = RegexNameMatchSelector("^.*\\d$")

Now:

selector = RegexNameMatchSelector()
selector.initialize(
    configs={
        "select_name": "^.*\\d$"
    }
)

Warning

Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.

FirstPackSelector

class forte.data.selector.FirstPackSelector[source]

Select the first data pack from the MultiPack and yield it.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.

AllPackSelector

class forte.data.selector.AllPackSelector[source]

Select all the packs from MultiPack and yield them.

will_select(pack_name, pack, multi_pack)[source]

Implement this method to return a boolean value indicating whether the pack will be selected.

Parameters
  • pack_name (str) – The name of the pack to be selected.

  • pack (DataPack) – The pack to be examined to determine whether it will be selected.

  • multi_pack (MultiPack) – The original multi pack.

Return type

bool

Returns

A boolean value indicating whether the pack will be selected.

Index

BaseIndex

class forte.data.index.BaseIndex[source]

A set of indexes used in BasePack:

  1. entry_index, the index from each tid to the corresponding entry

  2. type_index, the index from each type to the entries of that type

  3. link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links

  4. group_index, the index from group members to groups.

update_basic_index(entries)[source]

Build or update the basic indexes, including

(1) entry_index, the index from each tid to the corresponding entry;

(2) type_index, the index from each type to the entries of that type;

(3) component_index, the index from each component to the entries generated by that component.

Parameters

entries (list) – a list of entries to be added into the basic index.

query_by_type_subtype(t)[source]

Look up the entry indices that are instances of entry_type, including children classes of entry_type.

Note

All the types known to this data pack will be scanned to find all sub-types. This method will try to cache the sub-type information after the first call, but the cached information could be invalidated by other operations (such as adding new items to the data pack).

Parameters

t (Type[~EntryType]) – The type of the entry you are looking for.

Return type

Set[int]

Returns

A set of entry ids. The entries are instances of entry_type ( and also includes instances of the subclasses of entry_type).

build_link_index(links)[source]

Build the link_index, the index from child and parent nodes to links. It will build the index with the links in the dataset.

link_index consists of two sub-indexes: “child_index” is the index from child nodes to their corresponding links, and “parent_index” is the index from parent nodes to their corresponding links.

build_group_index(groups)[source]

Build group_index, the index from group members to groups.

Returns

None

link_index(tid, as_parent=True)[source]

Look up the link_index with key tid. If the link index is not built, this will throw a PackIndexError.

Parameters
  • tid (int) – the tid of the entry being looked up.

  • as_parent (bool) – If as_parent is True, will look up link_index["parent_index"] and return the tids of links whose parent is tid. Otherwise, will look up link_index["child_index"] and return the tids of links whose child is tid.

Return type

Set[int]

group_index(tid)[source]

Look up the group_index with key tid. If the index is not built, this will raise a PackIndexError.

Return type

Set[int]

update_link_index(links)[source]

Update link_index with the provided links, the index from child and parent nodes to links.

link_index consists of two sub-indexes:

  • “child_index” is the index from child nodes to their corresponding links

  • “parent_index” is the index from parent nodes to their corresponding links.

Parameters

links (List[~LinkType]) – a list of links to be added into the index.

update_group_index(groups)[source]

Build or update group_index, the index from group members to groups.

Parameters

groups (List[~GroupType]) – a list of groups to be added into the index.

Store

BaseStore

class forte.data.base_store.BaseStore[source]

The base class which will be used by DataStore.

abstract add_annotation_raw(type_name, begin, end)[source]

This function adds an annotation entry with begin and end indices to the type_name sorted list in self.__elements, returns the tid for the inserted entry.

Parameters
  • type_name (str) – The index of Annotation sorted list in self.__elements.

  • begin (int) – Begin index of the entry.

  • end (int) – End index of the entry.

Return type

int

Returns

tid of the entry.

abstract add_link_raw(type_name, parent_tid, child_tid)[source]

This function adds a link entry with parent_tid and child_tid to the type_name list in self.__elements, returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The index of Link list in self.__elements.

  • parent_tid (int) – tid of the parent entry.

  • child_tid (int) – tid of the child entry.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

abstract add_group_raw(type_name, member_type)[source]

This function adds a group entry with member_type to the type_name list in self.__elements, returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The index of Group list in self.__elements.

  • member_type (str) – Fully qualified name of its members.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

abstract set_attribute(tid, attr_name, attr_value)[source]

This function locates the entry data with tid and sets its attr_name with attr_value.

Parameters
  • tid (int) – Unique Id of the entry.

  • attr_name (str) – Name of the attribute.

  • attr_value (Any) – Value of the attribute.

abstract set_attr(tid, attr_id, attr_value)[source]

This function locates the entry data with tid and sets its attribute attr_id with value attr_value. Called by set_attribute().

Parameters
  • tid (int) – Unique id of the entry.

  • attr_id (int) – Id of the attribute.

  • attr_value (Any) – value of the attribute.

abstract get_attribute(tid, attr_name)[source]

This function finds the value of attr_name in entry with tid.

Parameters
  • tid (int) – Unique id of the entry.

  • attr_name (str) – Name of the attribute.

Returns

The value of attr_name for the entry with tid.

abstract get_attr(tid, attr_id)[source]

This function locates the entry data with tid and gets the value of attr_id of this entry. Called by get_attribute().

Parameters
  • tid (int) – Unique id of the entry.

  • attr_id (int) – Id of the attribute.

Returns

The value of attr_id for the entry with tid.

abstract delete_entry(tid)[source]

This function removes the entry with tid from the data store.

Parameters

tid (int) – Unique id of the entry.

abstract get_entry(tid)[source]

Look up the entry_dict with key tid. Return the entry and its type_name.

Parameters

tid (int) – Unique id of the entry.

Return type

Tuple[List, str]

Returns

The entry which tid corresponds to and its type_name.

abstract get_entry_index(tid)[source]

Look up the entry_dict with key tid. Return the index_id of the entry.

Parameters

tid (int) – Unique id of the entry.

Return type

int

Returns

Index of the entry which tid corresponds to in the entry_type list.

abstract get(type_name, include_sub_type)[source]

This function fetches entries from the data store of type type_name.

Parameters
  • type_name (str) – The index of the list in self.__elements.

  • include_sub_type (bool) – A boolean to indicate whether to also get entries of its subclasses.

Return type

Iterator[List]

Returns

An iterator of the entries matching the provided arguments.

abstract next_entry(tid)[source]

Get the next entry of the same type as the tid entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

The next entry of the same type as the tid entry.

abstract prev_entry(tid)[source]

Get the previous entry of the same type as the tid entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

The previous entry of the same type as the tid entry.

Data Store

DataStore

class forte.data.data_store.DataStore(onto_file_path=None, dynamically_add_type=True)[source]
add_annotation_raw(type_name, begin, end)[source]

This function adds an annotation entry with begin and end indices to the current data store object. Returns the tid for the inserted entry.

Parameters
  • type_name (str) – The fully qualified type name of the new Annotation.

  • begin (int) – Begin index of the entry.

  • end (int) – End index of the entry.

Return type

int

Returns

tid of the entry.
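
A minimal usage sketch (the type name is illustrative; it must be a fully qualified entry type that the store can resolve):

from forte.data.data_store import DataStore

store = DataStore()
# Add a Sentence annotation covering characters [0, 20) and keep its tid.
tid = store.add_annotation_raw("ft.onto.base_ontology.Sentence", 0, 20)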

add_link_raw(type_name, parent_tid, child_tid)[source]

This function adds a link entry with parent_tid and child_tid to the current data store object. Returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The fully qualified type name of the new Link.

  • parent_tid (int) – tid of the parent entry.

  • child_tid (int) – tid of the child entry.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

add_group_raw(type_name, member_type)[source]

This function adds a group entry with member_type to the current data store object. Returns the tid and the index_id for the inserted entry in the list. This index_id is the index of the entry in the type_name list.

Parameters
  • type_name (str) – The fully qualified type name of the new Group.

  • member_type (str) – Fully qualified name of its members.

Return type

Tuple[int, int]

Returns

tid of the entry and its index in the type_name list.

set_attribute(tid, attr_name, attr_value)[source]

This function locates the entry data with tid and sets its attr_name with attr_value. It first finds attr_id according to attr_name. tid, attr_id, and attr_value are passed to set_attr().

Parameters
  • tid (int) – Unique Id of the entry.

  • attr_name (str) – Name of the attribute.

  • attr_value (Any) – Value of the attribute.

Raises

KeyError – when tid or attr_name is not found.

get_attribute(tid, attr_name)[source]

This function finds the value of attr_name in entry with tid. It locates the entry data with tid and finds attr_id of its attribute attr_name. tid and attr_id are passed to get_attr().

Parameters
  • tid (int) – Unique id of the entry.

  • attr_name (str) – Name of the attribute.

Return type

Any

Returns

The value of attr_name for the entry with tid.

Raises

KeyError – when tid or attr_name is not found.
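
Continuing the DataStore sketch above ("speaker" is assumed to be a valid attribute of the Sentence type in the loaded ontology):

# Write and read an attribute by name.
store.set_attribute(tid, "speaker", "alice")
value = store.get_attribute(tid, "speaker")  # -> "alice"

try:
    store.get_attribute(tid, "no_such_attribute")
except KeyError:
    # Raised when the attribute name is not found, as documented above.
    pass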

delete_entry(tid)[source]

This function locates the entry data with tid and removes it from the data store. This function first removes it from __entry_dict.

Parameters

tid (int) – Unique id of the entry.

Raises
  • KeyError – when entry with tid is not found.

  • RuntimeError – when internal storage is inconsistent.

get_entry(tid)[source]

This function finds the entry with tid. It returns the entry and its type_name.

Parameters

tid (int) – Unique id of the entry.

Return type

Tuple[List, str]

Returns

The entry which tid corresponds to and its type_name.

Raises
  • ValueError – An error occurred when input tid is not found.

  • KeyError – An error occurred when entry_type is not found.

get_entry_index(tid)[source]

Look up the entry_dict with key tid. Return the index_id of the entry.

Parameters

tid (int) – Unique id of the entry.

Return type

int

Returns

Index of the entry which tid corresponds to in the entry_type list.

Raises

ValueError – An error occurred when no corresponding entry is found.

co_iterator_annotation_like(type_names)[source]

Given two or more type names, iterate their entry lists from beginning to end together.

For every single type, the entry lists are sorted by the begin and end fields. The co_iterator_annotation_like function iterates those sorted lists together and yields each entry in sorted order. This task is quite similar to merging several sorted lists into one sorted list. We internally use a min-heap to decide the order of yielded items, and the ordering is determined by:

  • start index of the entry.

  • end index of the entry.

  • the index of the entry type name in input parameter type_names.

The precedence of those values indicates their priority in the min heap ordering. For example, if two entries have both the same begin and end field, then their order is decided by the order of user input type_name (the type that first appears in the target type list will return first). For entries that have the exact same begin, end and type_name, the order will be determined arbitrarily.

Parameters

type_names (List[str]) – a list of string type names

Return type

Iterator[List]

Returns

An iterator of entry elements.
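
The merge logic can be illustrated with a small, self-contained sketch that is independent of the DataStore internals; it merges two already-sorted (begin, end) lists using the same (begin, end, type index) ordering key:

import heapq

# Two already-sorted entry lists, keyed by (begin, end).
tokens = [(0, 3), (4, 8), (9, 12)]
sentences = [(0, 12)]
entry_lists = [tokens, sentences]
type_names = ["Token", "Sentence"]

heap = []
for type_idx, entries in enumerate(entry_lists):
    if entries:
        begin, end = entries[0]
        # Heap key: (begin, end, position of the type in type_names).
        heapq.heappush(heap, (begin, end, type_idx, 0))

while heap:
    begin, end, type_idx, pos = heapq.heappop(heap)
    print(type_names[type_idx], begin, end)
    entries = entry_lists[type_idx]
    if pos + 1 < len(entries):
        nxt_begin, nxt_end = entries[pos + 1]
        heapq.heappush(heap, (nxt_begin, nxt_end, type_idx, pos + 1))
# Prints: Token 0 3, Sentence 0 12, Token 4 8, Token 9 12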

get(type_name, include_sub_type=True)[source]

This function fetches entries from the data store of type type_name.

Parameters
  • type_name (str) – The fully qualified name of the entry.

  • include_sub_type (bool) – A boolean to indicate whether to also get entries of its subclasses.

Return type

Iterator[List]

Returns

An iterator of the entries matching the provided arguments.

next_entry(tid)[source]

Get the next entry of the same type as the tid entry. Call get_entry() to find the current index and use it to find the next entry. If it is a non-annotation type, it will be sorted in the insertion order, which means next_entry would return the next inserted entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

A list of attributes representing the next entry of the same type as the tid entry. Return None when accessing the next entry of the last element in entry list.

Raises

IndexError – Raised when accessing an index outside of the entry list.

prev_entry(tid)[source]

Get the previous entry of the same type as the tid entry. Call get_entry() to find the current index and use it to find the previous entry. If it is a non-annotation type, it will be sorted in the insertion order, which means prev_entry would return the previous inserted entry.

Parameters

tid (int) – Unique id of the entry.

Return type

Optional[List]

Returns

A list of attributes representing the previous entry of the same type as the tid entry. Return None when accessing the previous entry of the first element in entry list.

Raises

IndexError – Raised when accessing an index outside of the entry list.

DataPack Dataset

DataPackIterator

class forte.data.data_pack_dataset.DataPackIterator(pack_iterator, context_type, request=None, skip_k=0)[source]

An iterator generating data examples from a stream of data packs.

Parameters
  • pack_iterator (Iterator[DataPack]) – An iterator of DataPack.

  • context_type (Type[Annotation]) – The granularity of a single example which could be any Annotation type. For example, it can be Sentence, then each training example will represent the information of a sentence.

  • request (Optional[Dict[Type[Entry], Union[Dict, List]]]) – The request of type Dict sent to DataPack to query specific data.

  • skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

Returns

An iterator that each time produces a tuple of a tid (of type int) and a data pack (of type DataPack).

Here is an example usage:

file_path: str = "data_samples/data_pack_dataset_test"
reader = CoNLL03Reader()
context_type = Sentence
request = {Sentence: []}
skip_k = 0

train_pl: Pipeline = Pipeline()
train_pl.set_reader(reader)
train_pl.initialize()
pack_iterator: Iterator[PackType] = train_pl.process_dataset(file_path)

iterator: DataPackIterator = DataPackIterator(pack_iterator,
                                              context_type,
                                              request,
                                              skip_k)

for tid, data_pack in iterator:
    # process tid and data_pack
    ...

Note

For parameters context_type, request, skip_k, please refer to get_data() in DataPack.

DataPackDataset

class forte.data.data_pack_dataset.DataPackDataset(data_source, feature_schemes, hparams=None, device=None)[source]

A dataset representing data packs. Calling a DataIterator over this DataPackDataset will produce an iterator over batches of examples parsed by a reader from the given data packs.

process(raw_example)[source]

Given an input which is a single data example, extract feature from it.

Parameters

raw_example (tuple(dict, DataPack)) – A tuple where the first element is a data dict (in the format produced by get_data() in DataPack) and the second element is the corresponding DataPack.

Return type

Dict[str, Feature]

Returns

A Dict mapping from user-specified tags to the Feature extracted.

Note

Please refer to request() for details about user-specified tags.

collate(examples)[source]

Given a batch of output from process(), produce pre-processed data as well as masks and features.

Parameters

examples (List[Dict[str, Feature]]) – A List of result from process().

Return type

Batch

Returns

A Texar-PyTorch Batch. It can be treated as a Dict with the following structure:

  • “data”: List, np.ndarray, or torch.Tensor. The pre-processed data.

    Please refer to Converter for details.

  • “masks”: np.ndarray or torch.Tensor. All the masks for the pre-processed data.

    Please refer to Converter for details.

  • “features”: List[Feature]. A list of Feature. This is useful when users want to do customized pre-processing.

    Please refer to Feature for details.

{
    "tag_a": {
        "data": <tensor>,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    },
    "tag_b": {
        "data": Tensor,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    }
}

Note

The first level key in returned batch is the user-specified tags. Please refer to request() for details about user-specified tags.

RawExample

forte.data.data_pack_dataset.RawExample

alias of Tuple[int, forte.data.data_pack.DataPack]

FeatureCollection

forte.data.data_pack_dataset.FeatureCollection

alias of Dict[str, forte.data.converter.feature.Feature]

Batchers

ProcessingBatcher

class forte.data.batchers.ProcessingBatcher[source]

This defines the basic interface of the batcher used in BaseBatchProcessor. This batcher only batches data sequentially. It receives new packs dynamically and caches the current packs so that the processors can pack prediction results into the data packs.

initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

flush()[source]

Flush the remaining data.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

A triplet containing the data pack, the context instance, and the batched data.

Note

For backward compatibility reasons, this function returns a list of None contexts.

get_batch(input_pack)[source]

By feeding a data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of the data pack, the context instance, and the batched data.

Parameters

input_pack (~PackType) – The input data pack to get features from.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

An iterator of tuples, each containing the data pack, the context instance, and the batched data.

Note

For backward compatibility reasons, this function returns a list of None as contexts.

classmethod default_configs()[source]

Define the basic configuration of a batcher. Implementation of the batcher can extend this function to include more configurable parameters but need to keep the existing ones defined in this base class.

Here, the available parameters are:

  • use_coverage_index: A boolean value indicates whether the batcher will try to build the coverage index based on the data request. Default is True.

  • cross_pack: A boolean value indicates whether the batcher can go across the boundary of data packs when there is not enough data to fill the batch.

Return type

Dict[str, Any]

Returns

The default configuration.

FixedSizeDataPackBatcherWithExtractor

class forte.data.batchers.FixedSizeDataPackBatcherWithExtractor[source]

This batcher uses extractors to extract features from the dataset and group them into batches. In this class, more pools are added. One is instance_pool, which is used to record the instances from which features are extracted. The other one is feature_pool, which is used to record the features before they can be yielded in batches.

initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

add_feature_scheme(tag, scheme)[source]

Add feature scheme to the batcher.

Parameters
  • tag (str) – The name/tag of the scheme.

  • scheme (str) – The scheme content, which should be a dict containing the extractor and converter used to create features.

collate(features_collection)[source]

This function uses the Converter interface to turn a list of features into batches, where each feature is converted to a tensor/matrix format. The resulting features are organized as a dictionary, where the keys are the feature names/tags, and the values are the converted features. Each feature contains the data and mask in MatrixLike form, as well as the original raw features.

Parameters

features_collection (List[Dict[str, Feature]]) – A list of features.

Return type

Dict[str, Dict[str, Any]]

Returns

An instance of Dict[str, Union[Tensor, Dict]], which is a batch of features.

flush()[source]

Flush data in batches. Each return value contains a tuple of 3 items: the corresponding data pack, the list of annotation objects that represent the context type, and the features.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict[str, Dict[str, Any]]]]

get_batch(input_pack)[source]

By feeding a data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of the data pack, the context instance, and the batched data.

Parameters

input_pack (~PackType) – The input data pack to get features from.

Return type

Iterator[Tuple[List[~PackType], List[Optional[Annotation]], Dict]]

Returns

An iterator of tuples, each containing the data pack, the context instance, and the batched data.

classmethod default_configs()[source]

Defines the configuration of this batcher, here:

  • context_type: The context scope to extract data from. It could be an annotation class or a string that is the fully qualified name of the annotation class.

  • feature_scheme: A dictionary of (extractor name, extractor) that can be used to extract features.

  • batch_size: The batch size, default is 10.

Return type

Dict[str, Any]

Returns

The default configuration structure.
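
A sketch of the configuration structure this implies (the feature_scheme entries are omitted because their exact shape depends on the extractor and converter objects in use; see add_feature_scheme() above):

batcher_config = {
    "context_type": "ft.onto.base_ontology.Sentence",  # or the class itself
    "feature_scheme": {},  # tag -> extraction scheme, filled in practice
    "batch_size": 10,
}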

FixedSizeRequestDataPackBatcher

class forte.data.batchers.FixedSizeRequestDataPackBatcher[source]
initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

The configuration of a batcher.

Here:

  • context_type (str): The fully qualified name of an Annotation type, which will be used as the context to retrieve data from. For example, if a ft.onto.Sentence type is provided, then it will extract data within each sentence.

  • requests: The request detail. See get_data() on what a request looks like.

Return type

Dict

Returns

The default configuration structure and default value.

FixedSizeMultiPackProcessingBatcher

class forte.data.batchers.FixedSizeMultiPackProcessingBatcher[source]

A Batcher used in MultiPackBatchProcessor.

Note

this implementation is not finished.

The Batcher calls the ProcessingBatcher inherently on each specified data pack in the MultiPack.

Querying a MultiPack is flexible, so we delegate the task to subclasses, which may, for example:

  • query all packs with the same context and input_info.

  • query different packs with different context and input_info.

Since the batcher will save the data_pack_pool on the fly, it is not trivial to do batching and slicing of multiple data packs at the same time.

initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

Define the basic configuration of a batcher. Implementation of the batcher can extend this function to include more configurable parameters but need to keep the existing ones defined in this base class.

Here, the available parameters are:

  • use_coverage_index: A boolean value indicates whether the batcher will try to build the coverage index based on the data request. Default is True.

  • cross_pack: A boolean value indicates whether the batcher can go across the boundary of data packs when there is not enough data to fill the batch.

Return type

Dict

Returns

The default configuration.

FixedSizeDataPackBatcher

class forte.data.batchers.FixedSizeDataPackBatcher[source]
initialize(config)[source]

The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.

Returns

None

classmethod default_configs()[source]

The configuration of a batcher.

Here:

  • batch_size: the batch size, default is 10.

Return type

Dict

Returns

The default configuration structure and default value.

Caster

Caster

class forte.data.caster.Caster[source]

MultiPackBoxer

class forte.data.caster.MultiPackBoxer[source]

This class creates a MultiPack from a DataPack; the resulting MultiPack will only contain the original DataPack, indexed by the pack_name.

cast(pack)[source]

Auto-box the DataPack into a MultiPack by simple wrapping.

Parameters

pack (DataPack) – The DataPack to be boxed

Return type

MultiPack

Returns

The MultiPack that boxes the input DataPack.

classmethod default_configs()[source]

Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.

MultiPackUnboxer

class forte.data.caster.MultiPackUnboxer[source]

This passes on a single DataPack within the MultiPack.

cast(pack)[source]

Unbox the MultiPack into a DataPack by using pack_index to take the unique pack it contains.

Parameters

pack (MultiPack) – The MultiPack to be boxed.

Return type

DataPack

Returns

A DataPack boxed from the MultiPack.

classmethod default_configs()[source]

Returns a dict of configurations of the component with default values. Used to replace the missing values of input configs during pipeline construction.
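
A hedged sketch of the usual way a caster is used, as a pipeline component placed right after the reader (StringReader is just an example reader, and adding casters via Pipeline.add is assumed to match your Forte version):

from forte.data.caster import MultiPackBoxer
from forte.data.readers import StringReader
from forte.pipeline import Pipeline

# Wrap each DataPack produced by the reader into a MultiPack so that
# downstream MultiPack-based processors can be used.
pipeline = Pipeline()
pipeline.set_reader(StringReader())
pipeline.add(MultiPackBoxer())  # DataPack -> MultiPack, keyed by pack_name
pipeline.initialize()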

Container

EntryContainer

class forte.data.container.EntryContainer[source]

BasePointer

class forte.data.container.BasePointer[source]

Objects to point to other objects in the data pack.

Types

ReplaceOperationsType

forte.data.types.ReplaceOperationsType

alias of List[Tuple[forte.data.span.Span, str]]

DataRequest

forte.data.types.DataRequest

alias of Dict[Type[forte.data.ontology.core.Entry], Union[Dict, List]]

MatrixLike

forte.data.types.MatrixLike

alias of Union[torch._C.TensorType, numpy.ndarray, List]

Data Utilities

maybe_download

forte.data.data_utils.maybe_download(urls: List[str], path: Union[str, PathLike], filenames: Optional[List[str]] = None, extract: bool = False, num_gdrive_retries: int = 1) → List[str][source]
forte.data.data_utils.maybe_download(urls: str, path: Union[str, PathLike], filenames: Optional[str] = None, extract: bool = False, num_gdrive_retries: int = 1) → str

Downloads a set of files.

Parameters
  • urls (Union[List[str], str]) – A (list of) URLs to download files.

  • path (Union[str, ~PathLike]) – The destination path to save the files.

  • filenames (Union[List[str], str, None]) – A (list of) strings of the file names. If given, must have the same length as urls. If None, filenames are extracted from urls.

  • extract (bool) – Whether to extract compressed files.

  • num_gdrive_retries (int) – An integer specifying the number of attempts to download file from Google Drive. Default value is 1.

Returns

A list of paths to the downloaded files.
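
For example (the URL and destination are illustrative):

from forte.data.data_utils import maybe_download

# Download a single archive into ./data and extract it; the local path of
# the downloaded file is returned.
local_path = maybe_download(
    urls="https://example.com/dataset.zip",
    path="./data",
    extract=True,
)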

batch_instances

forte.data.data_utils_io.batch_instances(instances)[source]

Merge a list of instances.

merge_batches

forte.data.data_utils_io.merge_batches(batches)[source]

Merge a list of batches.

slice_batch

forte.data.data_utils_io.slice_batch(batch, start, length)[source]

Return a slice of batch with size length, starting from the index start.

dataset_path_iterator

forte.data.data_utils_io.dataset_path_iterator(dir_path, file_extension)[source]

An iterator over the paths of the files under dir_path that have the given file extension.

Return type

Iterator[str]
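
For example (the directory and extension are illustrative):

from forte.data.data_utils_io import dataset_path_iterator

# Iterate the paths of all .txt files under data_samples/.
for file_path in dataset_path_iterator("data_samples/", "txt"):
    print(file_path)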

Entry Utilities

create_utterance

forte.data.common_entry_utils.create_utterance(input_pack, text, speaker)[source]

Create an utterance in the datapack. This is composed of three steps:

  1. Append the utterance text to the data pack.

  2. Create Utterance entry on the text.

  3. Set the speaker of the utterance to the provided speaker.

Parameters
  • input_pack (DataPack) – The data pack to add utterance into.

  • text (str) – The text of the utterance.

  • speaker (str) – The speaker name to be associated with the utterance.

get_last_utterance

forte.data.common_entry_utils.get_last_utterance(input_pack, target_speaker)[source]

Get the last utterance from a particular speaker. An utterance is an entry of type Utterance.

Parameters
  • input_pack (DataPack) – The data pack to find utterances.

  • target_speaker (str) – The name of the target speaker.

Return type

Optional[Utterance]

Returns

The last Utterance from the speaker if found, None otherwise.
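
A short sketch combining the two utilities above (assuming a freshly created DataPack can be used directly):

from forte.data.common_entry_utils import create_utterance, get_last_utterance
from forte.data.data_pack import DataPack

pack = DataPack()
create_utterance(pack, "Hello, how can I help you?", "ai")
create_utterance(pack, "What is the weather today?", "user")

last_user_utterance = get_last_utterance(pack, "user")
if last_user_utterance is not None:
    print(last_user_utterance.text)  # -> "What is the weather today?"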