Data¶
Ontology¶
base¶
core¶
Entry¶
-
class
forte.data.ontology.core.
Entry
(pack)[source]¶ The base class inherited by all NLP entries. This is the main data type for all in-text NLP analysis results. The main sub-types are Annotation, Link, Generics, and Group.
A forte.data.ontology.top.Annotation object represents a span in text.
A forte.data.ontology.top.Link object represents a binary link relation between two entries.
A forte.data.ontology.top.Generics object represents a generic entry that is not bound to a particular text span.
A forte.data.ontology.top.Group object represents a collection of multiple entries.
Main Attributes:
embedding: The embedding vectors (numpy array of floats) of this entry.
- Parameters
pack (~ContainerType) – Each entry should be associated with one pack upon creation.
-
property
embedding
¶ Get the embedding vectors (numpy array of floats) of the entry.
-
property
pack_id
¶ Get the id of the pack that contains this entry.
- Return type
- Returns
id of the pack that contains this entry.
-
class
forte.data.ontology.core.
BaseLink
(pack, parent=None, child=None)[source]¶ -
abstract
set_parent
(parent)[source]¶ This will set the parent of the current instance to the given Entry. The parent is saved internally by its pack-specific index key.
- Parameters
parent (
Entry
) – The parent entry.
-
abstract
set_child
(child)[source]¶ This will set the child of the current instance to the given Entry. The child is saved internally by its pack-specific index key.
- Parameters
child (
Entry
) – The child entry.
-
class
forte.data.ontology.core.
BaseGroup
(pack, members=None)[source]¶ Group is an entry that represents a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members; no duplicate members are allowed.
This is the
BaseGroup
interface. Specific member constraints are defined in the inherited classes.-
abstract
add_member
(member)[source]¶ Add one entry to the group.
- Parameters
member (~EntryType) – One member to be added to the group.
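For illustration, a minimal sketch of building a group of members, assuming the CoreferenceGroup type from ft.onto.base_ontology and an existing DataPack pack that already contains two entity mentions (mention_a and mention_b are hypothetical names):
from ft.onto.base_ontology import CoreferenceGroup

# mention_a and mention_b are hypothetical EntityMention objects
# that already live in `pack`.
group = CoreferenceGroup(pack)
group.add_member(mention_a)
group.add_member(mention_b)
pack.add_entry(group)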
top¶
-
class
forte.data.ontology.top.
Annotation
(pack, begin, end)[source]¶ Annotation type entries, such as “token”, “entity mention” and “sentence”. Each annotation has a
Span
corresponding to its offset in the text.- Parameters
-
get
(entry_type, components=None, include_sub_type=True)[source]¶ This function wraps the
get()
method to find entries “covered” by this annotation. See that method for more information.
Example
# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from each sentence created by NLTKTokenizer.
    token_entries = sentence.get(
        entry_type=Token, components='NLTKTokenizer')
    ...
In the above code snippet, we get entries of type
Token
within each sentence
which were generated by NLTKTokenizer
. You can consider building a coverage index between Token and Sentence if this snippet is used frequently.- Parameters
entry_type (
Union
[str
,Type
[~EntryType]]) – The type of entries requested.components (
Union
[str
,Iterable
[str
],None
]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.
include_sub_type – whether to consider the subtypes of the provided entry type. Default True.
- Yields
Each Entry found using this method.
- Return type
Iterable
[~EntryType]
-
class
forte.data.ontology.top.
AudioAnnotation
(pack, begin, end)[source]¶ AudioAnnotation type entries, such as “recording” and “audio utterance”. Each audio annotation has a
Span
corresponding to its offset in the audio. Most methods in this class are the same as the ones in Annotation
, except that it replaces the text property with audio.- Parameters
-
get
(entry_type, components=None, include_sub_type=True)[source]¶ This function wraps the
get()
method to find entries “covered” by this audio annotation. See that method for more information. For usage details, refer to forte.data.ontology.top.Annotation.get()
.- Parameters
entry_type (
Union
[str
,Type
[~EntryType]]) – The type of entries requested.components (
Union
[str
,Iterable
[str
],None
]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.
include_sub_type – whether to consider the subtypes of the provided entry type. Default True.
- Yields
Each Entry found using this method.
- Return type
Iterable
[~EntryType]
-
class
forte.data.ontology.top.
Link
(pack, parent=None, child=None)[source]¶ Link type entries, such as “predicate link”. Each link has a parent node and a child node.
- Parameters
-
ParentType
¶ alias of
forte.data.ontology.core.Entry
-
ChildType
¶ alias of
forte.data.ontology.core.Entry
-
set_parent
(parent)[source]¶ This will set the parent of the current instance with given Entry The parent is saved internally by its pack specific index key.
- Parameters
parent (
Entry
) – The parent entry.
-
set_child
(child)[source]¶ This will set the child of the current instance with given Entry. The child is saved internally by its pack specific index key.
- Parameters
child (
Entry
) – The child entry.
-
class
forte.data.ontology.top.
Group
(pack, members=None)[source]¶ Group is an entry that represent a group of other entries. For example, a “coreference group” is a group of coreferential entities. Each group will store a set of members, no duplications allowed.
-
MemberType
¶ alias of
forte.data.ontology.core.Entry
-
-
class
forte.data.ontology.top.
MultiPackGroup
(pack, members=None)[source]¶ Group type entries, such as “coreference group”. Each group has a set of members.
-
MemberType
¶ alias of
forte.data.ontology.core.Entry
-
-
class
forte.data.ontology.top.
MultiPackLink
(pack, parent=None, child=None)[source]¶ This is used to link entries in a
MultiPack
, which is designed to support cross-pack linking. This can support applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers, with one additional index indicating which pack the node comes from.-
ParentType
¶ alias of
forte.data.ontology.core.Entry
-
ChildType
¶ alias of
forte.data.ontology.core.Entry
-
parent_id
()[source]¶ Return the
tid
of the parent entry.- Return type
- Returns
The
tid
of the parent entry.
-
child_id
()[source]¶ Return the
tid
of the child entry.- Return type
- Returns
The
tid
of the child entry.
-
parent_pack_id
()[source]¶ Return the pack_id of the parent pack.
- Return type
- Returns
The pack_id of the parent pack.
-
child_pack_id
()[source]¶ Return the pack_id of the child pack.
- Return type
- Returns
The pack_id of the child pack.
-
set_parent
(parent)[source]¶ This will set the parent of the current instance with given Entry. The parent is saved internally as a tuple:
pack index
andentry.tid
. Pack index is the index of the data pack in the multi-pack.- Parameters
parent (
Entry
) – The parent of the link, which is an Entry from a data pack; it has access to the pack index and its own tid
in the pack.
-
set_child
(child)[source]¶ This will set the child of the current instance with given Entry. The child is saved internally as a tuple:
pack index
andentry.tid
. Pack index is the index of the data pack in the multi-pack.- Parameters
child (
Entry
) – The child of the link, which is an Entry from a data pack; it has access to the pack index and its own tid
in the pack.
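A minimal sketch of creating such a link, assuming src_sent and tgt_sent are hypothetical Sentence entries that already live in two different packs of the same MultiPack mp:
from forte.data.ontology.top import MultiPackLink

# Link a sentence in one pack to its counterpart in another pack;
# both entries must belong to packs inside `mp`.
link = MultiPackLink(mp, parent=src_sent, child=tgt_sent)
mp.add_entry(link)
print(link.parent_pack_id(), link.child_pack_id())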
-
-
class
forte.data.ontology.top.
Query
(pack)[source]¶ An entry type representing queries for information retrieval tasks.
- Parameters
pack (~PackType) – Data pack reference to which this query will be added
Packs¶
BasePack¶
-
class
forte.data.base_pack.
BasePack
(pack_name=None)[source]¶ The base class of
DataPack
andMultiPack
.-
delete_entry
(entry)[source]¶ Remove the entry from the pack.
- Parameters
entry (~EntryType) – The entry to be removed.
- Returns
None
-
add_entry
(entry, component_name=None)[source]¶ Add an
Entry
object to theBasePack
object. Allow duplicate entries in a pack.
-
add_all_remaining_entries
(component=None)[source]¶ Calling this function will add the entries that are not added to the pack manually.
-
to_string
(drop_record=False, json_method='json', indent=None)[source]¶ Return the string representation (json encoded) of this pack.
- Parameters
Returns: String representation of the data pack.
- Return type
-
serialize
(output_path, zip_pack=False, drop_record=False, serialize_method='json', indent=None)[source]¶ Serializes the data pack to the provided path. The output of this function depends on the serialization method chosen.
- Parameters
zip_pack (
bool
) – Whether to compress the result with gzip.drop_record (
bool
) – Whether to drop the creation records, default is False.serialize_method (
str
) – The method used to serialize the data. Currently supports json (outputs str), jsonpickle (outputs str) and Python’s built-in pickle (outputs bytes).indent (
Optional
[int
]) – Whether to indent the file if written as JSON.
Returns: Results of serialization.
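For example, a pack can be written to disk and gzip-compressed in one call (a sketch; the output path is hypothetical):
# Write the pack as indented JSON and compress it with gzip.
pack.serialize(
    "output/example_pack.json.gz",
    zip_pack=True,
    serialize_method="json",
    indent=2,
)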
-
set_control_component
(component)[source]¶ Record the current component that is taking control of this pack.
- Parameters
component (
str
) – The component that is going to take control
Returns:
-
record_field
(entry_id, field_name)[source]¶ Record who modifies the entry; this will be called in
Entry
Returns:
-
on_entry_creation
(entry, component_name=None)[source]¶ Call this when adding a new entry; it will be called in
Entry
when its __init__ function is called. This method does the following two operations when creating a new entry.- All
dataclass
attributes of the entry to be created are stored in the class level dictionary of
Entry
called cached_attributes_data
. This is used to initialize the corresponding entry's data store entry.
- On creation of the data store entry, this method associates
getter
and setter
properties to all dataclass attributes of this entry to allow direct interaction between the attributes of the entry and their copy being stored in the data store. For example, the setter method updates the data store value of an attribute of a given entry whenever the attribute in the entry’s object is updated.
- Parameters
Returns:
-
get_entry
(tid)[source]¶ Look up the entry_index with
tid
. Specific implementation depends on the actual class.- Return type
~EntryType
-
abstract property
links
¶ A List container of all links in this data pack.
-
abstract property
groups
¶ A List container of all groups in this pack.
-
abstract
get
(entry_type, **kwargs)[source]¶ The implementation of this method should provide a way to obtain the entries in entry ordering. If there are orders defined between the entries, they should be used first. Otherwise, the insertion order should be used (FIFO).
-
get_single
(entry_type)[source]¶ Take a single entry of type
entry_type
from this data pack. This is useful when the target entry type appears only once in the DataPack
, e.g., a Document entry, or when you just intend to take the first one.
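For example, to fetch the single Document annotation of a pack (a sketch, assuming the Document type from ft.onto.base_ontology):
from ft.onto.base_ontology import Document

# Retrieve the one Document entry and read its text.
doc = pack.get_single(Document)
print(doc.text)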
-
get_ids_by_creator
(component)[source]¶ Look up the component_index with key component. This will return the entry ids that are created by the component.
-
is_created_by
(entry, components)[source]¶ Check if the entry is created by any of the provided components.
-
get_ids_from
(components)[source]¶ Look up entries using a list of components (creators). This will find each creator iteratively and combine the result.
-
get_entries_of
(entry_type, exclude_sub_types=False)[source]¶ Return all entries of this particular type in no particular order. If you need to get the annotations based on the entry ordering, use
forte.data.base_pack.BasePack.get()
.
-
DataPack¶
-
class
forte.data.data_pack.
DataPack
(pack_name=None)[source]¶ A
DataPack
contains a piece of natural language text and a collection of NLP entries (annotations, links, and groups). The natural language text could be a document, paragraph or in any other granularity.-
property
text
¶ Get the first text data stored in the DataPack. If there is no text payload in the DataPack, it will return an empty string.
- Parameters
text_payload_index – the index of the text payload. Defaults to 0.
- Raises
ValueError – raised when the index is out of bound of the text payload list.
- Return type
- Returns
text data in the text payload.
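A minimal sketch of creating a pack, setting its text payload, and reading it back:
from forte.data.data_pack import DataPack

pack = DataPack(pack_name="example")
pack.set_text("Forte stores text in payloads.")
print(pack.text)  # "Forte stores text in payloads."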
-
property
audio
¶ Return the audio data from the first audio payload in the DataPack.
-
property
image
¶ Return the image data from the first image payload in the data pack.
-
get_image
(index)[source]¶ Return the image data from the image payload at the specified index.
- Parameters
index (
int
) – image payload index for retrieving the image data.- Returns
image payload data at the specified index.
-
property
all_annotations
¶ An iterator of all annotations in this data pack.
Returns: Iterator of all annotations, of type
Annotation
.- Return type
-
property
num_annotations
¶ Number of annotations in this data pack.
Returns: (int) Number of annotations.
- Return type
-
property
all_links
¶ An iterator of all links in this data pack.
Returns: Iterator of all links, of type
Link
.
-
property
num_links
¶ Number of links in this data pack.
Returns: Number of the links.
- Return type
-
property
all_groups
¶ An iterator of all groups in this data pack.
Returns: Iterator of all groups, of type
Group
.
-
property
num_groups
¶ Number of groups in this data pack.
Returns: Number of groups.
-
property
all_generic_entries
¶ An iterator of all generic entries in this data pack.
Returns: Iterator of generic entries.
-
property
num_generics_entries
¶ Number of generics entries in this data pack.
Returns: Number of generics entries.
-
property
all_audio_annotations
¶ An iterator of all audio annotations in this data pack.
Returns: Iterator of all audio annotations, of type
AudioAnnotation
.- Return type
-
property
num_audio_annotations
¶ Number of audio annotations in this data pack.
Returns: Number of audio annotations.
-
property
annotations
¶ A SortedList container of all annotations in this data pack.
Returns: SortedList of all annotations, of type
Annotation
.
-
property
generics
¶ A SortedList container of all generic entries in this data pack.
Returns: SortedList of generics
-
property
audio_annotations
¶ A SortedList container of all audio annotations in this data pack.
Returns: SortedList of all audio annotations, of type
AudioAnnotation
.
-
property
links
¶ A List container of all links in this data pack.
Returns: List of all links, of type
Link
.
-
property
groups
¶ A List container of all groups in this data pack.
Returns: List of all groups, of type
Group
.
-
get_payload_at
(modality, payload_index)[source]¶ Get Payload of requested modality at the requested payload index.
- Parameters
- Raises
ValueError – raised when the requested modality is not supported.
- Returns
Payload entry containing text data, image or audio data.
-
get_payload_data_at
(modality, payload_index)[source]¶ Get the payload data of the requested modality at the requested payload index.
- Parameters
- Raises
ValueError – raised when the requested modality is not supported.
- Return type
- Returns
different data types for different data modalities.
str data for text data.
Numpy array for image and audio data.
-
get_span_text
(begin, end, text_payload_index=0)[source]¶ Get the text in the data pack contained in the span.
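For example (a sketch on a freshly created pack):
pack.set_text("Hello world")
# Characters 0..5 of the default text payload.
print(pack.get_span_text(0, 5))  # "Hello"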
-
get_span_audio
(begin, end, audio_payload_index=0)[source]¶ Get the audio in the data pack contained in the span. begin and end represent the starting and ending indices of the span in audio payload respectively. Each index corresponds to one sample in audio time series.
-
set_text
(text, replace_func=None, text_payload_index=0)[source]¶ Set text for TextPayload at a specified index or add a new TextPayload in the DataPack.
- Raises
ValueError – raised when the text payload index is out of range.
- Parameters
text (
str
) – the input text to be assigned to this pack.replace_func (
Optional
[Callable
[[str
],List
[Tuple
[Span
,str
]]]]) – function that replace text. Defaults to None.text_payload_index (
int
) – the zero-based index used to locate a TextPayload in this DataPack, default 0. This allows one to set multiple texts per DataPack. A DataPack by default contains one such TextPayload; if text_payload_index is larger than 0, more than one TextPayload needs to be added beforehand, otherwise a ValueError is raised.
-
set_audio
(audio, sample_rate, audio_payload_index=0)[source]¶ Set audio for AudioPayload at a specified index or add a new AudioPayload in the DataPack.
- Raises
ValueError – raised when the audio payload index is out of range.
- Parameters
audio (
ndarray
) – A numpy array storing the audio waveform.sample_rate (
int
) – An integer specifying the sample rate.audio_payload_index (
int
) – the zero-based index of the AudioPayload in this DataPack’s AudioPayload entries. Defaults to 0, and it adds a new audio payload if there is no audio payload in the data pack.
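A minimal sketch of attaching audio to a pack; the waveform here is a hypothetical one second of silence at 16 kHz:
import numpy as np

waveform = np.zeros(16000)  # hypothetical silent waveform
pack.set_audio(audio=waveform, sample_rate=16000)
print(pack.audio.shape)  # (16000,)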
-
add_audio
(audio)[source]¶ Add an AudioPayload storing the audio given in the parameters.
- Parameters
audio – A numpy array storing the audio.
-
add_image
(image)[source]¶ Add an ImagePayload storing the image given in the parameters.
- Parameters
image – A numpy array storing the image.
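A minimal sketch of adding an image payload; the array here is a hypothetical 2x2 black RGB image:
import numpy as np

image = np.zeros((2, 2, 3), dtype=np.uint8)  # hypothetical tiny image
pack.add_image(image)
print(pack.image.shape)  # (2, 2, 3)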
-
set_image
(image, image_payload_index=0)[source]¶ Set the image payload of the
DataPack
object.- Parameters
image – A numpy array storing the image.
image_payload_index (
int
) – the zero-based index of the ImagePayload in this DataPack’s ImagePayload entries. Defaults to 0.
-
get_original_text
(text_payload_index=0)[source]¶ Get original unmodified text from the
DataPack
object.
-
get_original_span
(input_processed_span, align_mode='relaxed')[source]¶ Function to obtain span of the original text that aligns with the given span of the processed text.
- Parameters
input_processed_span (
Span
) – Span of the processed text for which the corresponding span of the original text is desired.align_mode (
str
) –The strictness criteria for alignment in the ambiguous cases, that is, if a part of input_processed_span spans a part of the inserted span, then align_mode controls whether to use the span fully or ignore it completely according to the following possible values:
"strict" - do not allow ambiguous input, give ValueError.
"relaxed" - consider spans on both sides.
"forward" - align looking forward, that is, ignore the span towards the left, but consider the span towards the right.
"backward" - align looking backwards, that is, ignore the span towards the right, but consider the span towards the left.
- Returns
Span of the original text that aligns with input_processed_span
Example
Let o-up1, o-up2, … and m-up1, m-up2, … denote the unprocessed spans of the original and modified string respectively. Note that each o-up would have a corresponding m-up of the same size.
Let o-pr1, o-pr2, … and m-pr1, m-pr2, … denote the processed spans of the original and modified string respectively. Note that each o-p is modified to a corresponding m-pr that may be of a different size than o-pr.
Original string: <--o-up1--> <-o-pr1-> <----o-up2----> <----o-pr2----> <-o-up3->
Modified string: <--m-up1--> <----m-pr1----> <----m-up2----> <-m-pr2-> <-m-up3->
Note that self.inverse_original_spans, which contains modified processed spans and their corresponding original spans, would look like [(o-pr1, m-pr1), (o-pr2, m-pr2)].
>> data_pack = DataPack()
>> original_text = "He plays in the park"
>> data_pack.set_text(original_text,\
>>     lambda _: [(Span(0, 2), "She")])
>> data_pack.text
"She plays in the park"
>> input_processed_span = Span(0, len("She plays"))
>> orig_span = data_pack.get_original_span(input_processed_span)
>> data_pack.get_original_text()[orig_span.begin: orig_span.end]
"He plays"
-
classmethod
deserialize
(data_source, serialize_method='json', zip_pack=False)[source]¶ Deserialize a Data Pack from a string. This internally calls the
_deserialize()
function from BasePack
.- Parameters
data_source (
Union
[Path
,str
]) – The path storing data source.serialize_method (
str
) – The method used to serialize the data, this should be the same as how serialization is done. The current options are json, jsonpickle and pickle. The default method is json.zip_pack (
bool
) – Boolean value indicating whether the input source is zipped.
- Return type
- Returns
A data pack object deserialized from the string.
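For example, the pack serialized in the earlier sketch can be loaded back as follows (the path is hypothetical):
from forte.data.data_pack import DataPack

pack = DataPack.deserialize(
    "output/example_pack.json.gz",
    serialize_method="json",
    zip_pack=True,
)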
-
delete_entry
(entry)[source]¶ Delete an
Entry
object from the DataPack
. This finds the entry in the index and removes it from the index. Note that entries will only appear in the index if add_entry (or _add_entry_with_check) is called. Please note that deleting an entry does not guarantee the deletion of the related entries.
- Parameters
entry (~EntryType) – An
Entry
object to be deleted from the pack.
-
get_data
(context_type, request=None, skip_k=0, payload_index=0)[source]¶ Fetch data from entries in the data_pack of type context_type. Data includes “span”, annotation-specific default data fields and specific data fields by “request”.
Annotation-specific data fields means:
“text” for
Type[Annotation]
“audio” for
Type[AudioAnnotation]
Currently, we do not support Groups and Generics in the request.
Example
requests = {
    base_ontology.Sentence: {
        "component": ["dummy"],
        "fields": ["speaker"],
    },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {},
}
pack.get_data(base_ontology.Sentence, requests)
- Parameters
context_type (
Union
[str
,Type
[Annotation
],Type
[AudioAnnotation
]]) – The granularity of the data context, which could be any Annotation or AudioAnnotation type. Behaviors under different context_type vary:
A str type will be converted into either an Annotation type or an AudioAnnotation type.
Type[Annotation]: the default data field for getting context data is text. This function iterates all_annotations to search target entry data.
Type[AudioAnnotation]: the default data field for getting context data is audio, which stores audio data in numpy arrays. This function iterates all_audio_annotations to search target entry data.
request (
Optional
[Dict
[Type
[Entry
],Union
[Dict
,List
]]]) – The entry types and fields the user wants to request. The keys of the requests dict are the required entry types and the value should be either:
a list of field names or
a dict which accepts three keys: “fields”, “component”, and “unit”.
By setting “fields” (list), users specify the requested fields of the entry. If “fields” is not specified, only the default fields will be returned.
By setting “component” (list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components.
By setting “unit” (string), users can specify a unit by which the annotations are indexed.
Note that for all annotation types, “span” fields and annotation-specific data fields are returned by default.
For all link types, “child” and “parent” fields are returned by default.
skip_k (
int
) – Will skip the first skip_k instances and generate data from the (offset + 1)th instance.payload_index (
int
) – the zero-based index of the Payload in this DataPack's Payload entries of a particular modality. The modality is dependent on context_type
. Defaults to 0.
- Return type
- Returns
A data generator, which generates one piece of data (a dict containing the required entries, fields, and context).
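A minimal usage sketch, assuming the Sentence and Token types from ft.onto.base_ontology; the exact keys of each yielded dict depend on the request, but typically include the context text and one sub-dict per requested entry type:
from ft.onto.base_ontology import Sentence, Token

requests = {Token: ["pos"]}
for data in pack.get_data(Sentence, request=requests):
    print(data["context"])        # the sentence text
    print(data["Token"]["pos"])   # requested Token fields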
-
build_coverage_for
(context_type, covered_type)[source]¶ Users can call this function to build a coverage index for specific types. The index provides an in-memory mapping from entries of context_type to the entries “covered” by them. See
forte.data.data_pack.DataIndex
for more details.- Parameters
context_type (
Type
[Union
[Annotation
,AudioAnnotation
]]) – The context/covering type.covered_type (
Type
[~EntryType]) – The entry to find under the context type.
-
covers
(context_entry, covered_entry)[source]¶ Check if the covered_entry is covered (in span) by the context_entry.
See
in_span()
andin_audio_span()
for the definition of in span.- Parameters
context_entry (
Union
[Annotation
,AudioAnnotation
]) – The context entry.covered_entry (~EntryType) – The entry to be checked on whether it is in span of the context entry.
Returns (bool): True if in span.
- Return type
-
get
(entry_type, range_annotation=None, components=None, include_sub_type=True, get_raw=False)[source]¶ This function is used to get data from a data pack with various methods.
Depending on the provided arguments, the function will perform several different filtering of the returned data.
The
entry_type
is mandatory, where all the entries matching this type will be returned. The sub-types of the provided entry type will be also returned ifinclude_sub_type
is set to True (which is the default behavior).The
range_annotation
controls the search area of the sub-types. An entry E will be returned if in_span()
or in_audio_span()
returns True. If this function is called frequently with queries related to the range_annotation
, please consider building the coverage index for the related entry types. Users can call build_coverage_for(context_type, covered_type)()
in order to build a mapping between a pair of entry types and the target entries that are covered in ranges specified by outer entries.
The components
list will filter the results by the component (i.e. the creator of the entry). If components
is provided, only the entries created by one of the components
will be returned.
Example
# Iterate through all the sentences in the pack.
for sentence in input_pack.get(Sentence):
    # Take all tokens from a sentence created by NLTKTokenizer.
    token_entries = input_pack.get(
        entry_type=Token,
        range_annotation=sentence,
        components='NLTKTokenizer')
    ...
In the above code snippet, we get entries of type
Token
within each sentence
which were generated by NLTKTokenizer
. You can consider building a coverage index between Token
and Sentence
if this snippet is used frequently:
# Build coverage index between `Token` and `Sentence`
input_pack.build_coverage_for(
    context_type=Sentence,
    covered_type=Token
)
After building the index from the snippet above, you will be able to retrieve the tokens covered by sentence much faster.
- Parameters
entry_type (
Union
[str
,Type
[~EntryType]]) – The type of entries requested.range_annotation (
Union
[Annotation
,AudioAnnotation
,int
,None
]) – The range of entries requested. This value can be given by an entry object or thetid
of that entry. If None, will return valid entries in the range of whole data pack.components (
Union
[str
,Iterable
[str
],None
]) – The component (creator) generating the entries requested. If None, will return valid entries generated by any component.include_sub_type (
bool
) – whether to consider the sub types of the provided entry type. Default True.get_raw (
bool
) – boolean to indicate if the entry should be returned in its primitive form as opposed to an object. False by default
- Yields
Each Entry found using this method.
- Return type
Iterable
[~EntryType]
MultiPack¶
-
class
forte.data.multi_pack.
MultiPack
(pack_name=None)[source]¶ A
MultiPack
contains multiple DataPacks and a collection of cross-pack entries (such as links and groups).-
relink
(packs)[source]¶ Re-link the reference of the multi-pack to other entries, including the data packs in it.
-
get_subentry
(pack_idx, entry_id)[source]¶ Get sub_entry from multi pack. This method uses pack_id (a unique identifier assigned to a datapack) to get a pack from the multi pack, and then returns its sub_entry with entry_id.
Note that this changed from the way such packs were accessed before v0.0.1, in which pack_idx was used as a list index to access/reference a pack within the multi pack (and then get the sub_entry).
-
remove_pack
(index_of_pack, clean_invalid_entries=False, purge_lists=False)[source]¶ Remove a data pack at index index_of_pack from this multi pack.
In a multi pack, the data pack to be removed may be associated with some multi pack entries, such as MultiPackLinks that are connected with other packs. These entries will become dangling and invalid, and thus need to be removed. One can consider removing these links before calling this function, or set clean_invalid_entries to True so that they will be automatically pruned. A purge of the lists in this multi pack can be requested by setting purge_lists to True; this removes the empty slots left in the lists of this multi pack by the removed pack, which changes the indexes of the packs after the removed pack, so the user is responsible for managing such changes if those indexes are used or stored somewhere after purging the lists.
- Parameters
index_of_pack (
int
) – The index of pack for removal from the multi pack. If invalid, no pack will be deleted.clean_invalid_entries (
bool
) – Switch for automatically cleaning the entries associated with the data pack being deleted which will become invalid after the removal of the pack. Default is False.purge_lists (
bool
) – Switch for automatically removing the empty slots left in the lists of this multi pack by the removed pack. This changes the indexes of the packs after the removed pack, so the user is responsible for managing such changes if those indexes are used or stored somewhere after purging the lists. Default is False.
- Return type
- Returns
True if successful.
- Raises
ValueError – if
clean_invalid_entries
is set to False and the DataPack to be removed has entries (in links, groups) associated with it.
-
purge_deleted_packs
()[source]¶ Purge deleted packs from the lists; these slots were previously set to -1, empty, or None in order to keep the indexes unchanged. Caution: purging the deleted packs from the lists in this multi pack removes the empty slots, causing the indexes of the packs after the deleted pack(s) to change, so the user is responsible for managing such changes if any pack index is used or stored somewhere in user code after purging.
- Return type
- Returns
True if successful.
-
add_pack
(ref_name=None, pack_name=None)[source]¶ Create a data pack and add it to this multi pack. If ref_name is provided, it will be used to index the data pack. Otherwise, a default name based on the pack id will be created for this data pack. The created data pack will be returned.
- Parameters
Returns: The newly created data pack.
- Return type
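A minimal sketch of assembling a MultiPack from two newly created packs:
from forte.data.multi_pack import MultiPack

mp = MultiPack()
src = mp.add_pack(ref_name="source")
tgt = mp.add_pack(ref_name="target")
src.set_text("Hello world.")
tgt.set_text("Bonjour le monde.")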
-
property
packs
¶ Get the list of data packs in the order they were added.
Please do not try to modify this list directly.
-
rename_pack
(old_name, new_name)[source]¶ Rename the pack to a new name. If the new_name is already taken, a
ValueError
will be raised. If the old_name is not found, then a KeyError
will be raised, just as with a missing key in a dictionary.
-
property
all_links
¶ An iterator of all links in this multi pack.
- Return type
- Returns
Iterator of all links, of type
MultiPackLink
.
-
property
all_groups
¶ An iterator of all groups in this multi pack.
- Return type
- Returns
Iterator of all groups, of type
MultiPackGroup
.
-
property
generic_entries
¶ An iterator of all generics in this multi pack.
- Return type
- Returns
Iterator of all generics, of type
MultiPackGeneric
.
-
property
links
¶ A List container of all links in this multi pack.
Returns: List of all links, of type
MultiPackLink
.
-
property
groups
¶ A List container of all groups in this multi pack.
Returns: List of all groups, of type
MultiPackGroup
.
-
property
generics
¶ A SortedList container of all generic entries in this multi pack.
Returns: SortedList of generics
-
add_all_remaining_entries
(component=None)[source]¶ Calling this function will add the entries that are not added to the pack manually.
-
get_single_pack_data
(pack_index, context_type, request=None, skip_k=0)[source]¶ Get pack data from one of the packs specified by the index. This is equivalent to calling the
get_data()
inDataPack
.- Parameters
pack_index (
int
) – The index of a single pack.context_type (
Type
[Annotation
]) – The granularity of the data context, which could be any Annotation type.request (
Optional
[Dict
[Type
[Entry
],Union
[Dict
,List
]]]) – The entry types and fields required. The keys of the dict are the required entry types and the value should be either a list of field names or a dict. If the value is a dict, accepted items includes “fields”, “component”, and “unit”. By setting “component” (a list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components. By setting “unit” (a string), users can specify a unit by which the annotations are indexed. Note that for all annotations, “text” and “span” fields are given by default; for all links, “child” and “parent” fields are given by default.skip_k (
int
) – Will skip the first k instances and generate data from the k + 1 instance.
- Return type
- Returns
A data generator, which generates one piece of data (a dict containing the required annotations and context).
-
get_cross_pack_data
(request)[source]¶ Note
This function is not finished.
Get data via the links and groups across data packs. The keys could be MultiPack entries (i.e. MultiPackLink and MultiPackGroup). The values specify the detailed entry information to be retrieved. A value can be a list of field names, in which case the returned results will contain all specified fields.
One can also call this method with more constraints by providing a dictionary, which can contain the following keys:
“fields”, this specifies the attribute field names to be obtained
“unit”, this specifies the unit used to index the annotation
“component”, this specifies a constraint to take only the entries created by the specified component.
The data request logic is similar to that of
get_data()
function inDataPack
, but applied on MultiPack entries.Example:
requests = {
    MultiPackLink: {
        "component": ["dummy"],
        "fields": ["speaker"],
    },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_cross_pack_data(requests)
- Parameters
request (
Dict
[Type
[Union
[MultiPackLink
,MultiPackGroup
]],Union
[Dict
,List
]]) – A dict containing the data request. The keys are the types to be requested, and the fields are the detailed constraints.- Returns
None
-
get
(entry_type, components=None, include_sub_type=True, get_raw=False)[source]¶ Get entries of
entry_type
from this multi pack.Example:
for relation in pack.get(
        CrossDocEntityRelation,
        components="relation_creator"):
    print(relation.get_parent())
In the above code snippet, we get entries of type
CrossDocEntityRelation
which were generated by a component named relation_creator.
- Parameters
entry_type (
Union
[str
,Type
[~EntryType]]) – The type of the entries requested.components (
Union
[str
,List
[str
],None
]) – The component generating the entries requested. If None, all valid entries generated by any component will be returned.include_sub_type (
bool
) – whether to return the sub types of the queried entry_type. True by default.get_raw (
bool
) – boolean to indicate if the entry should be returned in its primitive form as opposed to an object. False by default
- Return type
Iterator
[~EntryType]- Returns
An iterator of the entries matching the arguments, following the order of entries (first sort by entry comparison, then by insertion)
-
classmethod
deserialize
(data_path, serialize_method='json', zip_pack=False)[source]¶ Deserialize a Multi Pack from a string. Note that this will only deserialize the native multi pack content, which means the associated DataPacks contained in the MultiPack will not be recovered. A follow-up step needs to be performed to add the data packs back to the multi pack.
This internally calls the
_deserialize()
function from the BasePack
.- Parameters
data_path (
Union
[Path
,str
]) – The serialized string of a Multi pack to be deserialized.serialize_method (
str
) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.zip_pack (
bool
) – Boolean value indicating whether the input source is zipped.
- Return type
- Returns
A data pack object deserialized from the string.
-
BaseMeta¶
-
class
forte.data.base_pack.
BaseMeta
(pack_name=None)[source]¶ Basic Meta information for both
DataPack
andMultiPack
.- Parameters
pack_name (
Optional
[str
]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.
-
record
¶ Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.
Meta¶
-
class
forte.data.data_pack.
Meta
(pack_name=None, language='eng', span_unit='character', sample_rate=None, info=None)[source]¶ Basic Meta information associated with each instance of
DataPack
.- Parameters
pack_name (
Optional
[str
]) – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.
language (
str
) – The language used by this data pack, default is English.span_unit (
str
) – The unit used for interpreting the Span object of this data pack. Default is character.sample_rate (
Optional
[int
]) – An integer specifying the sample rate of audio payload. Default is None.info (
Optional
[Dict
[str
,str
]]) – Store additional string based information that the user add.
-
pack_name
¶ storing the provided pack_name.
-
language
¶ storing the provided language.
-
sample_rate
¶ storing the provided sample_rate.
-
info
¶ storing the provided info.
-
record
¶ Initialized as a dictionary. This is not a required field. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purpose if the pipeline is initialized with enforce_consistency=True.
DataIndex¶
-
class
forte.data.data_pack.
DataIndex
[source]¶ A set of indexes used in DataPack; note that this class is used by the DataPack internally.
entry_index, the index from each tid to the corresponding entry.
type_index, the index from each type to the entries of that type.
component_index, the index from each component to the entries generated by that component.
link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links.
group_index, the index from group members to groups.
_coverage_index, the index that maps from an annotation to the entries it covers. _coverage_index is a dict of dicts, where the key is a tuple of the outer entry type and the inner entry type. The outer entry type should be an annotation type. The value is a dict, where the key is the tid of the outer entry, and the value is a set of tid that are covered by the outer entry. We say an Annotation A covers an entry E if one of the following conditions is met:
1. E is of Annotation type, and E.begin >= A.begin, E.end <= A.end.
2. E is of Link type, and both E's parent and child nodes are Annotations that are covered by A.
-
coverage_index
(outer_type, inner_type)[source]¶ Get the coverage index from
outer_type
to inner_type
.- Parameters
outer_type (
Type
[Union
[Annotation
,AudioAnnotation
]]) – an annotation or AudioAnnotation type.inner_type (
Type
[~EntryType]) – an entry type.
- Return type
- Returns
If the coverage index does not exist, return None. Otherwise, return a dict.
-
get_covered
(data_pack, context_annotation, inner_type)[source]¶ Get the entries covered by a certain context annotation
- Parameters
data_pack (
DataPack
) – The data pack to search for.context_annotation (
Union
[Annotation
,AudioAnnotation
]) – The context annotation to search in.inner_type (
Type
[~EntryType]) – The inner type to be searched for.
- Return type
- Returns
Entry ID of type inner_type that is covered by context_annotation.
-
build_coverage_index
(data_pack, outer_type, inner_type)[source]¶ Build the coverage index from
outer_type
to inner_type
.- Parameters
data_pack (
DataPack
) – The data pack to build coverage for.outer_type (
Type
[Union
[Annotation
,AudioAnnotation
]]) – an annotation or AudioAnnotation type.inner_type (
Type
[~EntryType]) – an entry type, can be Annotation, Link, Group, AudioAnnotation.
-
have_overlap
(entry1, entry2)[source]¶ Check whether the two annotations have overlap in span.
- Parameters
entry1 (
Union
[Annotation
,int
,AudioAnnotation
]) – AnAnnotation
orAudioAnnotation
object to be checked, or thetid
of the Annotation.entry2 (
Union
[Annotation
,int
,AudioAnnotation
]) – AnotherAnnotation
orAudioAnnotation
object to be checked, or thetid
of the Annotation.
- Return type
-
in_span
(inner_entry, span)[source]¶ Check whether the
inner entry
is within the given span
. The criteria are as follows:
Annotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.
Link entries: if the parent and child of the links are both Annotation type, this link will be considered in span if both parent and child are
in_span()
of the provided span. If either the parent and the child is not of type Annotation, this function will always return False.Group entries: if the child type of the group is Annotation type, then the group will be considered in span if all the elements are
in_span()
of the provided span. If the child type is not Annotation type, this function will always return False.Other entries (i.e Generics and AudioAnnotation): they will not be considered
in_span()
of any spans. The function will always return False.- Parameters
- Return type
- Returns
True if the inner_entry is considered to be in span of the provided span.
-
in_audio_span
(inner_entry, span)[source]¶ Check whether the
inner entry
is within the given audio span. This method is identical to in_span()
except that it operates on the audio payload of the data pack. The criteria are as follows:
AudioAnnotation entries: they are considered in a span if the begin is not smaller than span.begin and the end is not larger than span.end.
Link entries: if the parent and child of the links are both AudioAnnotation type, this link will be considered in span if both parent and child are
in_span()
of the provided span. If either the parent and the child is not of type AudioAnnotation, this function will always return False.Group entries: if the child type of the group is AudioAnnotation type, then the group will be considered in span if all the elements are
in_span()
of the provided span. If the child type is not AudioAnnotation type, this function will always return False.Other entries (i.e Generics and Annotation): they will not be considered
in_span()
of any spans. The function will always return False.- Parameters
- Return type
- Returns
True if the inner_entry is considered to be in span of the provided span.
MultiPack¶
MultiPackMeta¶
MultiPack¶
-
class
forte.data.multi_pack.
MultiPack
(pack_name=None)[source] A
MultiPack
contains multiple DataPacks and a collection of cross-pack entries (such as links and groups).-
relink
(packs)[source] Re-link the reference of the multi-pack to other entries, including the data packs in it.
-
get_subentry
(pack_idx, entry_id)[source] Get sub_entry from multi pack. This method uses pack_id (a unique identifier assigned to a datapack) to get a pack from the multi pack, and then returns its sub_entry with entry_id.
Note that this changed from the way such packs were accessed before v0.0.1, in which pack_idx was used as a list index to access/reference a pack within the multi pack (and then get the sub_entry).
-
remove_pack
(index_of_pack, clean_invalid_entries=False, purge_lists=False)[source] Remove a data pack at index index_of_pack from this multi pack.
In a multi pack, the data pack to be removed may be associated with some multi pack entries, such as MultiPackLinks that are connected with other packs. These entries will become dangling and invalid, and thus need to be removed. One can consider removing these links before calling this function, or set clean_invalid_entries to True so that they will be automatically pruned. A purge of the lists in this multi pack can be requested by setting purge_lists to True; this removes the empty slots left in the lists of this multi pack by the removed pack, which changes the indexes of the packs after the removed pack, so the user is responsible for managing such changes if those indexes are used or stored somewhere after purging the lists.
- Parameters
index_of_pack (
int
) – The index of pack for removal from the multi pack. If invalid, no pack will be deleted.clean_invalid_entries (
bool
) – Switch for automatically cleaning the entries associated with the data pack being deleted which will become invalid after the removal of the pack. Default is False.purge_lists (
bool
) – Switch for automatically removing the empty slots left in the lists of this multi pack by the removed pack. This changes the indexes of the packs after the removed pack, so the user is responsible for managing such changes if those indexes are used or stored somewhere after purging the lists. Default is False.
- Return type
- Returns
True if successful.
- Raises
ValueError – if
clean_invalid_entries
is set to False and the DataPack to be removed has entries (in links, groups) associated with it.
-
purge_deleted_packs
()[source] Purge deleted packs from the lists; these slots were previously set to -1, empty, or None in order to keep the indexes unchanged. Caution: purging the deleted packs from the lists in this multi pack removes the empty slots, causing the indexes of the packs after the deleted pack(s) to change, so the user is responsible for managing such changes if any pack index is used or stored somewhere in user code after purging.
- Return type
- Returns
True if successful.
-
add_pack
(ref_name=None, pack_name=None)[source] Create a data pack and add it to this multi pack. If ref_name is provided, it will be used to index the data pack. Otherwise, a default name based on the pack id will be created for this data pack. The created data pack will be returned.
- Parameters
Returns: The newly created data pack.
- Return type
-
add_pack_
(pack, ref_name=None)[source] Add an existing data pack to the multi pack.
-
get_pack_at
(index)[source] Get data pack at provided index.
-
get_pack_index
(pack_id)[source] Get the pack index from the global pack id.
-
get_pack
(name)[source] Get data pack of name.
-
property
packs
Get the list of data packs in the order they were added.
Please do not try to modify this list directly.
-
rename_pack
(old_name, new_name)[source] Rename the pack to a new name. If the new_name is already taken, a
ValueError
will be raised. If the old_name is not found, then a KeyError
will be raised, just as with a missing key in a dictionary.
-
property
all_links
An iterator of all links in this multi pack.
- Return type
- Returns
Iterator of all links, of type
MultiPackLink
.
-
property
num_links
Number of links in this multi pack.
- Return type
- Returns
Number of links.
-
property
all_groups
An iterator of all groups in this multi pack.
- Return type
- Returns
Iterator of all groups, of type
MultiPackGroup
.
-
property
num_groups
Number of groups in this multi pack.
- Return type
- Returns
Number of groups.
-
property
generic_entries
An iterator of all generics in this multi pack.
- Return type
- Returns
Iterator of all generics, of type
MultiPackGeneric
.
-
property
links
A List container of all links in this multi pack.
Returns: List of all links, of type
MultiPackLink
.
-
property
groups
A List container of all groups in this multi pack.
Returns: List of all groups, of type
MultiPackGroup
.
-
property
generics
A SortedList container of all generic entries in this multi pack.
Returns: SortedList of generics
-
add_all_remaining_entries
(component=None)[source] Calling this function will add the entries that are not added to the pack manually.
-
get_single_pack_data
(pack_index, context_type, request=None, skip_k=0)[source] Get pack data from one of the packs specified by the index. This is equivalent to calling the
get_data()
inDataPack
.- Parameters
pack_index (
int
) – The index of a single pack.context_type (
Type
[Annotation
]) – The granularity of the data context, which could be any Annotation type.request (
Optional
[Dict
[Type
[Entry
],Union
[Dict
,List
]]]) – The entry types and fields required. The keys of the dict are the required entry types and the value should be either a list of field names or a dict. If the value is a dict, accepted items includes “fields”, “component”, and “unit”. By setting “component” (a list), users can specify the components by which the entries are generated. If “component” is not specified, will return entries generated by all components. By setting “unit” (a string), users can specify a unit by which the annotations are indexed. Note that for all annotations, “text” and “span” fields are given by default; for all links, “child” and “parent” fields are given by default.skip_k (
int
) – Will skip the first k instances and generate data from the k + 1 instance.
- Return type
- Returns
A data generator, which generates one piece of data (a dict containing the required annotations and context).
-
get_cross_pack_data
(request)[source] Note
This function is not finished.
Get data via the links and groups across data packs. The keys could be MultiPack entries (i.e. MultiPackLink and MultiPackGroup). The values specify the detailed entry information to be retrieved. A value can be a list of field names, in which case the returned results will contain all specified fields.
One can also call this method with more constraints by providing a dictionary, which can contain the following keys:
“fields”, this specifies the attribute field names to be obtained
“unit”, this specifies the unit used to index the annotation
“component”, this specifies a constraint to take only the entries created by the specified component.
The data request logic is similar to that of
get_data()
function inDataPack
, but applied on MultiPack entries.Example:
requests = {
    MultiPackLink: {
        "component": ["dummy"],
        "fields": ["speaker"],
    },
    base_ontology.Token: ["pos", "sense"],
    base_ontology.EntityMention: {
        "unit": "Token",
    },
}
pack.get_cross_pack_data(requests)
- Parameters
request (
Dict
[Type
[Union
[MultiPackLink
,MultiPackGroup
]],Union
[Dict
,List
]]) – A dict containing the data request. The keys are the types to be requested, and the fields are the detailed constraints.- Returns
None
-
get
(entry_type, components=None, include_sub_type=True, get_raw=False)[source] Get entries of
entry_type
from this multi pack.Example:
for relation in pack.get(
        CrossDocEntityRelation,
        components="relation_creator"):
    print(relation.get_parent())
In the above code snippet, we get entries of type
CrossDocEntityRelation
which were generated by a component named relation_creator.
- Parameters
entry_type (
Union
[str
,Type
[~EntryType]]) – The type of the entries requested.components (
Union
[str
,List
[str
],None
]) – The component generating the entries requested. If None, all valid entries generated by any component will be returned.include_sub_type (
bool
) – whether to return the sub types of the queried entry_type. True by default.get_raw (
bool
) – boolean to indicate if the entry should be returned in its primitive form as opposed to an object. False by default
- Return type
Iterator
[~EntryType]- Returns
An iterator of the entries matching the arguments, following the order of entries (first sort by entry comparison, then by insertion)
-
classmethod
deserialize
(data_path, serialize_method='json', zip_pack=False)[source] Deserialize a Multi Pack from a string. Note that this will only deserialize the native multi pack content, which means the associated DataPacks contained in the MultiPack will not be recovered. A follow-up step needs to be performed to add the data packs back to the multi pack.
This internally calls the
_deserialize()
function from the BasePack
.- Parameters
data_path (
Union
[Path
,str
]) – The serialized string of a Multi pack to be deserialized.serialize_method (
str
) – The method used to serialize the data, this should be the same as how serialization is done. The current options are jsonpickle and pickle. The default method is jsonpickle.zip_pack (
bool
) – Boolean value indicating whether the input source is zipped.
- Return type
- Returns
A data pack object deserialized from the string.
-
MultiPackLink¶
-
class
forte.data.multi_pack.
MultiPackLink
(pack, parent=None, child=None)[source] This is used to link entries in a
MultiPack
, which is designed to support cross-pack linking. This can support applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers, with one additional index indicating which pack the node comes from.-
ParentType
alias of
forte.data.ontology.core.Entry
-
ChildType
alias of
forte.data.ontology.core.Entry
-
parent_id
()[source] Return the
tid
of the parent entry.- Return type
- Returns
The
tid
of the parent entry.
-
child_id
()[source] Return the
tid
of the child entry.- Return type
- Returns
The
tid
of the child entry.
-
parent_pack_id
()[source] Return the pack_id of the parent pack.
- Return type
- Returns
The pack_id of the parent pack.
-
child_pack_id
()[source] Return the pack_id of the child pack.
- Return type
- Returns
The pack_id of the child pack.
-
set_parent
(parent)[source] This will set the parent of the current instance with given Entry. The parent is saved internally as a tuple:
pack index
andentry.tid
. Pack index is the index of the data pack in the multi-pack.- Parameters
parent (
Entry
) – The parent of the link, which is an Entry from a data pack; it has access to the pack index and its own tid
in the pack.
-
set_child
(child)[source] This will set the child of the current instance with given Entry. The child is saved internally as a tuple:
pack index
andentry.tid
. Pack index is the index of the data pack in the multi-pack.- Parameters
child (
Entry
) – The child of the link, which is an Entry from a data pack; it has access to the pack index and its own tid
in the pack.
-
get_parent
()[source] Get the parent entry of the link.
-
MultiPackGroup¶
-
class
forte.data.multi_pack.
MultiPackGroup
(pack, members=None)[source] Group type entries, such as “coreference group”. Each group has a set of members.
-
MemberType
alias of
forte.data.ontology.core.Entry
-
Readers¶
BaseReader¶
-
class
forte.data.base_reader.
BaseReader
(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]¶ The basic data reader class. To be inherited by all data readers.
- Parameters
from_cache (
bool
) – Decide whether to read from cache if cache file exists. By default (False
), the reader will only read from the original file and use the cache file path for caching; it will not read from the cache_directory
. If True
, the reader will try to read a datapack from the caching file.
cache_directory (
Optional
[str
]) – The base directory in which to place the caching files. Each collection is contained in one cached file under this directory. The cached location for each collection is computed by
_cache_key_function()
.
Note
A collection is the data returned by
_collect()
.
append_to_cache (
bool
) – Decide whether to append to the cache file if it already exists. By default (False
), we will overwrite the existing caching file. If True
, we will append the datapack to the end of the caching file.
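A minimal sketch of how a concrete reader is typically used, assuming the Pipeline class from forte.pipeline and the StringReader listed later in this section (both outside the BaseReader API itself); the exact data-source arguments accepted depend on the concrete reader:
from forte.pipeline import Pipeline
from forte.data.readers import StringReader

pipeline = Pipeline()
pipeline.set_reader(StringReader())
pipeline.initialize()

# Each input string becomes one DataPack.
for pack in pipeline.process_dataset(["Forte is a data-centric framework."]):
    print(pack.text)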
-
initialize
(resources, configs)[source]¶ The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with
configs
, and register global resources intoresource
. The implementation should set up the states of the component.
-
classmethod
default_configs
()[source]¶ Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.
Here:
zip_pack (bool): whether to zip the results. The default value is False.
serialize_method: The method used to serialize the data. Current available options are json, jsonpickle and pickle. Default is json.
-
parse_pack
(collection)[source]¶ Calls
_parse_pack()
to create packs from the collection. This internally sets up the component metadata. Users should implement the _parse_pack()
method.- Return type
Iterator
[~PackType]
-
text_replace_operation
(text)[source]¶ Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.
- Parameters
text (
str
) – The original data text to be cleaned.
- Returns (List[Tuple[Tuple[int, int], str]]):
the replacement operations.
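A minimal sketch of an override, assuming a custom PackReader subclass that replaces every tab character in the noisy input with a single space; Span objects are used for the replaced regions, matching the ReplaceOperationsType alias documented later on this page:
import re
from typing import List, Tuple

from forte.data.base_reader import PackReader
from forte.data.span import Span

class TabNormalizingReader(PackReader):
    def text_replace_operation(self, text: str) -> List[Tuple[Span, str]]:
        # One (span, replacement) pair per tab found in the original text.
        return [(Span(m.start(), m.end()), " ") for m in re.finditer("\t", text)]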
-
set_profiling
(enable_profiling=True)[source]¶ Set profiling option.
- Parameters
enable_profiling (
bool
) – A boolean of whether to enable profiling for the reader or not (the default is True).
-
timer_yield
(pack)[source]¶ Wrapper generator for time profiling. Insert timers around ‘yield’ to support time profiling for reader.
- Parameters
pack (~PackType) – DataPack passed from self.iter()
-
iter
(*args, **kwargs)[source]¶ An iterator over the entire dataset, yielding all the Packs read from the data source(s), either as a list or as an iterator depending on laziness. If not reading from cache, this should call
collect
.
- Parameters
args – One or more input data sources, for example, most DataPack readers accept data_source as file/folder path.
kwargs – Iterator of DataPacks.
- Return type
Iterator
[~PackType]
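In practice iter is usually driven indirectly through a pipeline; a minimal sketch, assuming PlainTextReader accepts a folder of text files and using a hypothetical path:
from forte.pipeline import Pipeline
from forte.data.readers import PlainTextReader

pl = Pipeline()
pl.set_reader(PlainTextReader())
pl.initialize()

# process_dataset() internally drives the reader's iter() over the data source.
for pack in pl.process_dataset("data_samples/plain_text/"):
    print(pack.text[:80])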
-
record
(record_meta)[source]¶ Modify the pack meta record field of the reader’s output. The key of the record should be the entry type and values should be attributes of the entry type. All the information would be used for consistency checking purposes if the pipeline is initialized with enforce_consistency=True.
-
cache_data
(collection, pack, append)[source]¶ Specify the path to the cache directory.
After you call this method, the dataset reader will use its cache_directory to store a cache of the BasePack read from every document passed to read, serialized as one string-formatted BasePack. If the cache file for a given file_path exists, we read the BasePack from the cache. If the cache file does not exist, we will create it on our first pass through the data.
- Parameters
-
read_from_cache
(cache_filename)[source]¶ Reads one or more Packs from
cache_filename
, and yields Pack(s) from the cache file.
-
finish
(resource)[source]¶ The pipeline will call this function at the end of the pipeline to notify all the components. The user can implement this function to release resources used by this component. The component can also add objects to the resources.
- Parameters
resource (
Resources
) – A global resource registry.
PackReader¶
MultiPackReader¶
CoNLL03Reader¶
ConllUDReader¶
-
class
forte.data.readers.conllu_ud_reader.
ConllUDReader
(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]¶ ConllUDReader
is designed to read in the Universal Dependencies 2.4 dataset.
BaseDeserializeReader¶
RawDataDeserializeReader¶
RecursiveDirectoryDeserializeReader¶
HTMLReader¶
MSMarcoPassageReader¶
MultiPackSentenceReader¶
MultiPackTerminalReader¶
OntonotesReader¶
PlainTextReader¶
ProdigyReader¶
RACEMultiChoiceQAReader¶
StringReader¶
SemEvalTask8Reader¶
OpenIEReader¶
SquadReader¶
-
class
forte.datasets.mrc.squad_reader.
SquadReader
(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)[source]¶ Reader for processing Stanford Question Answering Dataset (SQuAD).
Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span.
Dataset can be downloaded at https://rajpurkar.github.io/SQuAD-explorer/.
SquadReader reads each paragraph in the dataset as a separate Document; the questions are concatenated behind the paragraph to form a Passage. Phrases are MRC answers marked as text spans. Each MRCQuestion has a list of answers.
-
classmethod
default_configs
()[source]¶ Returns a dict of configurations of the reader with default values. Used to replace the missing values of input configs during pipeline construction.
Here:
zip_pack (bool): whether to zip the results. The default value is False.
serialize_method: The method used to serialize the data. Current available options are json, jsonpickle and pickle. Default is json.
-
record
(record_meta)[source]¶ Method to add the output type record of this reader, which is ft.onto.base_ontology.Document with an empty set, to
forte.data.data_pack.Meta.record
.
ClassificationDatasetReader¶
Selector¶
DummySelector¶
SinglePackSelector¶
NameMatchSelector¶
-
class
forte.data.selector.
NameMatchSelector
(select_name=None)[source]¶ Select a DataPack from a MultiPack with the specified name. This implementation takes special care for backward compatibility.
Deprecated:
selector = NameMatchSelector(select_name="foo")
selector = NameMatchSelector("foo")
Now:
selector = NameMatchSelector()
selector.initialize(
    configs={"select_name": "foo"}
)
WARNING: Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.
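A minimal sketch of the recommended initialization style in use, assuming a MultiPack multi_pack that contains a pack registered under the name "source" (the select call is assumed to yield the matching DataPack objects):
from forte.data.selector import NameMatchSelector

selector = NameMatchSelector()
selector.initialize(configs={"select_name": "source"})

for pack in selector.select(multi_pack):  # multi_pack is assumed to exist
    print(pack.pack_name)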
RegexNameMatchSelector¶
-
class
forte.data.selector.
RegexNameMatchSelector
(select_name=None)[source]¶ Select a DataPack from a MultiPack using a regex. This implementation takes special care for backward compatibility.
Deprecated:
selector = RegexNameMatchSelector(select_name="^.*\\d$")
selector = RegexNameMatchSelector("^.*\\d$")
Now:
selector = RegexNameMatchSelector()
selector.initialize(
    configs={"select_name": "^.*\\d$"}
)
Warning
Passing parameters through __init__ is deprecated, and does not work well with pipeline serialization.
FirstPackSelector¶
Index¶
BaseIndex¶
-
class
forte.data.index.
BaseIndex
[source]¶ A set of indexes used in BasePack:
entry_index, the index from each tid to the corresponding entry
type_index, the index from each type to the entries of that type
link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links
group_index, the index from group members to groups.
-
update_basic_index
(entries)[source]¶ Build or update the basic indexes, including
(1) entry_index, the index from each tid to the corresponding entry;
(2) type_index, the index from each type to the entries of that type;
(3) component_index, the index from each component to the entries generated by that component.
- Parameters
entries (list) – a list of entries to be added into the basic index.
-
query_by_type_subtype
(t)[source]¶ Look up the entry indices that are instances of entry_type, including children classes of entry_type.
Note
All the known types to this data pack will be scanned to find all sub-types. This method will try to cache the sub-type information after the first call, but the cached information could be invalidated by other operations (such as adding new items to the data pack).
-
build_link_index
(links)[source]¶ Build the link_index, the index from child and parent nodes to links. The index is built from the links in the dataset. link_index consists of two sub-indexes: “child_index” is the index from child nodes to their corresponding links, and “parent_index” is the index from parent nodes to their corresponding links.
-
build_group_index
(groups)[source]¶ Build
group_index
, the index from group members to groups.- Returns
None
-
link_index
(tid, as_parent=True)[source]¶ Look up the link_index with key
tid
. If the link index is not built, this will throw a PackIndexError.
- Parameters
- Return type
-
group_index
(tid)[source]¶ Look up the group_index with key
tid
. If the index is not built, this will raise a PackIndexError
.
-
update_link_index
(links)[source]¶ Update
link_index
with the provided links, the index from child and parent to links. link_index consists of two sub-indexes:
“child_index” is the index from child nodes to their corresponding links
“parent_index” is the index from parent nodes to their corresponding links.
- Parameters
links (
List
[~LinkType]) – a list of links to be added into the index.
-
update_group_index
(groups)[source]¶ Build or update
group_index
, the index from group members to groups.- Parameters
groups (
List
[~GroupType]) – a list of groups to be added into the index.
Store¶
BaseStore¶
-
class
forte.data.base_store.
BaseStore
[source]¶ The base class which will be used by
DataStore
.-
serialize
(output_path, serialize_method='json', save_attribute=True, indent=None)[source]¶ Serializes the data store to the provided path. The output of this function depends on the serialization method chosen.
- Parameters
output_path (
str
) – The path to write data to.serialize_method (
str
) – The method used to serialize the data. Currently supports json (outputs json dictionary).save_attribute (
bool
) – Boolean value indicating whether users want to save attributes for field checks later during deserialization. Attributes and their indices for every entry type will be saved.indent (
Optional
[int
]) – The indentation level to use if the output is written as JSON; None means no indentation.
Returns: Results of serialization.
-
to_string
(json_method='json', save_attribute=True, indent=None)[source]¶ Return the string representation (json encoded) of this data store.
- Parameters
json_method (
str
) – What method is used to convert data pack to json. Only supports json for now. Default value is json.save_attribute (
bool
) – Boolean value indicating whether users want to save attributes for field checks later during deserialization. Attributes and their indices for every entry type will be saved.
Returns: String representation of the data store.
- Return type
-
abstract
add_entry_raw
(type_name, tid=None, allow_duplicate=True, attribute_data=None)[source]¶ This function provides a general implementation to add all types of entries to the data store. It can add entries of type Annotation, AudioAnnotation, ImageAnnotation, Link, Group and Generics. Returns the
tid
for the inserted entry.- Parameters
type_name (
str
) – The fully qualified type name of the new Entry.tid (
Optional
[int
]) –tid
of the Entry that is being added. It’s optional, and it will be auto-assigned if not given.allow_duplicate (
bool
) – Whether we allow duplicates in the DataStore. When it’s set to False, the function will return the tid
of the existing entry if a duplicate is found. Default value is True.
attribute_data (
Optional
[List
]) – A list that stores attributes relevant to the entry being added. The attributes passed in attribute_data must be present in that entry’s type_attributes and must only be those which are relevant to the initialization of the entry. For example, the begin and end positions when creating an entry of type Annotation
.
- Return type
- Returns
tid
of the entry.
-
abstract
all_entries
(entry_type_name)[source]¶ Retrieve all entry data of entry type
entry_type_name
and entries of subclasses of entry type entry_type_name
.
-
abstract
num_entries
(entry_type_name)[source]¶ Compute the number of entries of given
entry_type_name
and entries of subclasses of entry type entry_type_name
.
-
abstract
set_attribute
(tid, attr_name, attr_value)[source]¶ This function locates the entry data with
tid
and sets its attr_name
with attr_value
.
-
abstract
get_attribute
(tid, attr_name)[source]¶ This function finds the value of
attr_name
in the entry with tid
.
-
abstract
delete_entry
(tid)[source]¶ This function removes the entry with
tid
from the data store.- Parameters
tid (
int
) – Unique id of the entry.
-
abstract
get_entry
(tid)[source]¶ Look up the tid_ref_dict or tid_idx_dict with key
tid
. Return the entry and its type_name
.
-
abstract
get_entry_index
(tid)[source]¶ Look up the tid_ref_dict or tid_idx_dict with key
tid
. Return the index_id
of the entry.
-
abstract
get
(type_name, include_sub_type, range_span=None)[source]¶ This function fetches entries from the data store of type
type_name
.- Parameters
- Return type
- Returns
An iterator of the entries matching the provided arguments.
-
Data Store¶
DataStore¶
-
class
forte.data.data_store.
DataStore
(onto_file_path=None, dynamically_add_type=True)[source]¶ -
classmethod
deserialize
(data_source, serialize_method='json', check_attribute=True, suppress_warning=True, accept_unknown_attribute=True)[source]¶ Deserialize a DataStore from serialized data in data_source.
- Parameters
data_source (
str
) – The path storing data source.serialize_method (
str
) – The method used to serialize the data, this should be the same as how serialization is done. The current option is json.check_attribute (
bool
) – Boolean value indicating whether users want to check compatibility of attributes. Only applicable when the data being serialized is done with save_attribute set to True in BaseStore.serialize. If true, it will compare fields of the serialized object and the current DataStore class. If there are fields that have different orders in the current class and the serialized object, it switches the order of fields to match the current class. If there are fields that appear in the current class, but not in the serialized object, it handles those fields with accept_unknown_attribute. If there are fields that appear in the serialized object, but not in the current class, it drops those fields.suppress_warning (
bool
) – Boolean value indicating whether users want to see warnings when it checks attributes. Only applicable when check_attribute is set to True. If true, it will log warnings when there are mismatched fields.accept_unknown_attribute (
bool
]) – Boolean value indicating whether users want to fill fields that appear in the current class, but not in the serialized object, with None. Only applicable when check_attribute is set to True. If false, it will raise a ValueError if there are any contradictions in fields.
- Raises
ValueError – raised when 1. the serialized object has unknown fields, but accept_unknown_attribute is False. 2. the serialized object does not store attributes, but check_attribute is True. 3. the serialized object does not support json deserialization. We may change this error when we have other options for deserialization.
- Return type
- Returns
A data store object deserialized from the string.
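A minimal round-trip sketch using the documented serialize and deserialize calls; the file path is hypothetical:
from forte.data.data_store import DataStore

store = DataStore()
store.add_entry_raw(
    type_name="ft.onto.base_ontology.Sentence",
    tid=101,
    attribute_data=[0, 10],   # begin and end of the annotation
)
store.serialize("sentences.json", serialize_method="json", save_attribute=True)

# Attributes were saved, so field compatibility can be checked when reading back.
restored = DataStore.deserialize(
    "sentences.json", serialize_method="json", check_attribute=True
)
print(restored.get_entry(101)[0])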
-
get_annotation_sorting_fn
(type_name)[source]¶ This function creates a lambda method to generate the sorted list of an entry of given type. The type of the entry must be a successor of
Annotation
. It creates a lambda function that sorts annotation type entries based on their begin and end index. The function first fetches the indices of the positions where the begin and end index is stored for the data store entry specified by type_name. These index positions are then used to create the lambda function to sort the data store entries given by type_name.
- Parameters
type_name (
str
) – A string representing a fully qualified type name of the entry whose sorting function we want to fetch.- Returns
A lambda function representing the sorting function for entries of type type_name.
-
fetch_entry_type_data
(type_name, attributes=None)[source]¶ This function takes a fully qualified
type_name
class name and a set of tuples representing an attribute and its required type (only in the case where the type_name
class name represents an entry being added from a user defined ontology) and creates a dictionary where the key is an attribute of the entry and the value is the type information of that attribute. There are two cases in which a fully qualified
type_name
class name can be handled:
- If the class being added is of an existing entry: This means
that there is information present about this entry through its dataclass attributes and their respective types. Thus, we use the _get_entry_attributes_by_class method to fetch this information.
- If the class being added is of a user defined entry: In this
case, we fetch the information about the entry’s attributes and their types from the
attributes
argument.
- Parameters
- Returns: A dictionary representing attributes as key and type
information as value. For each attribute, the type information is represented by a tuple of two elements. The first element is the unsubscripted version of the attribute’s type and the second element is the type arguments for the same. The type_dict is used to populate the type information for attributes of an entry specified by
type_name
in _type_attributes. For example:
type_dict = {
    "document_class": (list, (str,)),
    "sentiment": (dict, (str, float)),
    "classifications": (FDict, (str, Classification))
}
-
get_attr_type
(type_name, attr_name)[source]¶ Retrieve the type information of a given attribute
attr_name
in an entry of typetype_name
- Parameters
- Return type
- Returns
The type information of the required attribute. This information is stored in the
_type_attributes
dictionary of the Data Store.
-
all_entries
(entry_type_name)[source]¶ Retrieve all entry data of entry type
entry_type_name
and entries of subclasses of entry type entry_type_name
.
-
num_entries
(entry_type_name)[source]¶ Compute the number of entries of given
entry_type_name
and entries of subclasses of entry type entry_type_name
.
-
get_datastore_attr_idx
(type_name, attr)[source]¶ This function returns the index where a given attribute attr is stored in the Data Store entry of type type_name.
-
initialize_and_validate_entry
(entry, attribute_data)[source]¶ This function performs validation checks on the initial attributes added to a data store entry. This function also modifies the value of certain attributes to fit the data store’s purpose of storing primitive types. For example, in the data store entry of type
Group
, attribute member_type is converted from an object to str. When initializing entries, this function makes certain assumptions based on the type of entry:
- if the entry is of type Annotation or AudioAnnotation, we assume that attribute_data is a list of two elements, indicating the begin and end index of the annotation respectively.
- if the entry is of type Group or MultiPackGroup, we assume that attribute_data is a list of one element representing the group’s member type.
- if the entry is of type Link or MultiPackLink, we assume that attribute_data is a list of two elements representing the link’s parent and child type respectively.
-
add_entry_raw
(type_name, tid=None, allow_duplicate=True, attribute_data=None)[source]¶ This function provides a general implementation to add all types of entries to the data store. It can add entries of type Annotation, AudioAnnotation, ImageAnnotation, Link, Group and Generics. Returns the
tid
for the inserted entry.- Parameters
type_name (
str
) – The fully qualified type name of the new Entry.tid (
Optional
[int
]) –tid
of the Entry that is being added. It’s optional, and it will be auto-assigned if not given.allow_duplicate (
bool
) – Whether we allow duplicates in the DataStore. When it’s set to False, the function will return the tid
of the existing entry if a duplicate is found. Default value is True.
attribute_data (
Optional
[List
]) – A list that stores attributes relevant to the entry being added. The attributes passed in attribute_data must be present in that entry’s type_attributes and must only be those which are relevant to the initialization of the entry. For example, the begin and end positions when creating an entry of type Annotation
.
- Return type
- Returns
tid
of the entry.
-
get_attribute_positions
(type_name)[source]¶ This function returns a dictionary where the keys are the attributes of the entry of type type_name and the values are the index positions where these attributes are stored in the data store entry of this type. For example:
positions = data_store.get_attribute_positions(
    "ft.onto.base_ontology.Document"
)
# positions = {
#     "begin": 2,
#     "end": 3,
#     "payload_idx": 4,
#     "document_class": 5,
#     "sentiment": 6,
#     "classifications": 7
# }
-
transform_data_store_entry
(entry)[source]¶ This method converts a raw data store entry into a format more easily understandable to users. Data Store entries are stored as lists and are not very easily understandable. This method converts
DataStore
entries from a list format to a dictionary-based format where the keys are the names of the attributes of an entry and the values are the values of the corresponding attributes in the data store entry. For example:
>>> data_store = DataStore()
>>> tid = data_store.add_entry_raw(
...     type_name = 'ft.onto.base_ontology.Sentence',
...     tid = 101, attribute_data = [0,10])
>>> entry = data_store.get_entry(tid)[0]
>>> transformed_entry = data_store.transform_data_store_entry(entry)
>>> transformed_entry == { 'begin': 0, 'end': 10, 'payload_idx': 0,
...     'speaker': None, 'part_id': None, 'sentiment': {},
...     'classification': {}, 'classifications': {}, 'tid': 101,
...     'type': 'ft.onto.base_ontology.Sentence'}
True
-
set_attribute
(tid, attr_name, attr_value)[source]¶ This function locates the entry data with
tid
and sets its attr_name with attr_value. It first finds attr_id according to attr_name. tid, attr_id, and attr_value are passed to set_attr().
-
get_attribute
(tid, attr_name)[source]¶ This function finds the value of
attr_name
in the entry with tid. It locates the entry data with tid and finds the attr_id of its attribute attr_name. tid and attr_id are passed to get_attr().
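A minimal sketch of reading and writing a single attribute, reusing the Sentence attributes shown in the transform_data_store_entry example above:
from forte.data.data_store import DataStore

store = DataStore()
tid = store.add_entry_raw(
    type_name="ft.onto.base_ontology.Sentence",
    attribute_data=[0, 10],
)
store.set_attribute(tid, "speaker", "alice")
print(store.get_attribute(tid, "speaker"))  # -> "alice"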
-
delete_entry
(tid)[source]¶ This function locates the entry data with
tid
and removes it from the data store. This function removes it from tid_ref_dict or tid_idx_dict and finds its index in the list. If it is an annotation-like entry, we retrieve the entry from tid_ref_dict and bisect the list to find its index. If it is an non-annotation-like entry, we get the type_name and its index in the list directly from tid_idx_dict.- Parameters
tid (
int
) – Unique id of the entry.- Raises
KeyError – when entry with
tid
is not found.RuntimeError – when internal storage is inconsistent.
-
get_entry
(tid)[source]¶ This function finds the entry with
tid
. It returns the entry and itstype_name
.
-
get_entry_index
(tid)[source]¶ Look up the tid_ref_dict and tid_idx_dict with key
tid
. Return the index_id
of the entry.- Parameters
tid (
int
) – Unique id of the entry.- Return type
- Returns
Index of the entry which
tid
corresponds to in the entry_type
list.- Raises
ValueError – An error occurred when no corresponding entry is found.
-
get_length
(type_name)[source]¶ This function finds the length of the type_name entry list. It should not count None placeholders that appear in non-annotation-like entry lists.
-
co_iterator_annotation_like
(type_names, range_span=None)[source]¶ Given two or more type names, iterate their entry lists from beginning to end together.
For every single type, their entry lists are sorted by the begin and end fields. The co_iterator_annotation_like function will iterate those sorted lists together, and yield each entry in sorted order. This task is quite similar to merging several sorted lists into one sorted list. We internally use a MinHeap to determine the order of yielded items, and the ordering is determined by:
end index of the entry.
the index of the entry type name in input parameter
type_names
.
The precedence of those values indicates their priority in the min heap ordering.
For example, if two entries have both the same begin and end fields, then their order is decided by the order of the user input type_name (the type that appears first in the target type list is returned first). For entries that have the exact same begin, end and type_name, the order will be determined arbitrarily. Lastly, the range_span argument determines the start and end position of the span range within which entries specified by type_name need to be fetched.
For example, let’s say we have two entry types, Sentence and EntityMention. Each type has two entries. The two entries of type Sentence range over spans (0,5) and (6,10). Similarly, the two entries of type EntityMention have spans (0,3) and (15,20).
# function signature
entries = list(
    co_iterator_annotation_like(
        type_names = [
            "ft.onto.base_ontology.Sentence",
            "ft.onto.base_ontology.EntityMention"
        ],
        range_span = (0,12)
    )
)

# Fetching result
result = [
    [type(anno).__name__, anno.begin, anno.end]
    for anno in entries
]

# return
result = [
    ['Sentence', 0, 5],
    ['EntityMention', 0, 5],
    ['Sentence', 6, 10]
]
From this we can see how range_span affects which entries will be fetched and also how the function chooses the order in which entries are fetched.
-
get
(type_name, include_sub_type=True, range_span=None)[source]¶ This function fetches entries from the data store of type
type_name
. If include_sub_type is set to True and type_name is in [Annotation, Group, List], this function also fetches entries of subtypes of type_name. Otherwise, it only fetches entries of type type_name.
- Parameters
- Return type
- Returns
An iterator of the entries matching the provided arguments.
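A minimal sketch of fetching entries by type, assuming two sentences added with add_entry_raw; range_span restricts the result to entries whose span falls within the given range:
from forte.data.data_store import DataStore

store = DataStore()
store.add_entry_raw("ft.onto.base_ontology.Sentence", attribute_data=[0, 5])
store.add_entry_raw("ft.onto.base_ontology.Sentence", attribute_data=[6, 10])

# Only the first sentence falls inside the (0, 5) range here.
for raw_entry in store.get(
    "ft.onto.base_ontology.Sentence", include_sub_type=True, range_span=(0, 5)
):
    print(raw_entry)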
-
iter
(type_name)[source]¶ This function iterates over all type_name entries. It skips None placeholders that appear in non-annotation-like entry lists.
-
next_entry
(tid)[source]¶ Get the next entry of the same type as the
tid
entry. Call get_entry() to find the current index and use it to find the next entry. If it is a non-annotation type, entries are sorted in insertion order, which means next_entry would return the next inserted entry.
- Parameters
tid (
int
) – Unique id of the entry.- Return type
- Returns
A list of attributes representing the next entry of the same type as the
tid
entry. Return None when accessing the next entry of the last element in entry list.- Raises
IndexError – An error occurred when accessing an index out of the entry list.
-
prev_entry
(tid)[source]¶ Get the previous entry of the same type as the
tid
entry. Call get_entry() to find the current index and use it to find the previous entry. If it is a non-annotation type, entries are sorted in insertion order, which means prev_entry would return the previously inserted entry.
- Parameters
tid (
int
) – Unique id of the entry.- Return type
- Returns
A list of attributes representing the previous entry of the same type as the
tid
entry. Return None when accessing the previous entry of the first element in entry list.- Raises
IndexError – An error occurred when accessing an index out of the entry list.
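A minimal sketch of walking neighbouring entries of the same type, again using raw Sentence entries:
from forte.data.data_store import DataStore

store = DataStore()
first_tid = store.add_entry_raw("ft.onto.base_ontology.Sentence", attribute_data=[0, 5])
second_tid = store.add_entry_raw("ft.onto.base_ontology.Sentence", attribute_data=[6, 10])

print(store.next_entry(first_tid))   # raw attribute list of the (6, 10) sentence
print(store.prev_entry(first_tid))   # None: nothing precedes the first sentence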
DataPack Dataset¶
DataPackIterator¶
-
class
forte.data.data_pack_dataset.
DataPackIterator
(pack_iterator, context_type, request=None, skip_k=0)[source]¶ An iterator generating data examples from a stream of data packs.
- Parameters
pack_iterator (
Iterator
[DataPack
]) – An iterator ofDataPack
.context_type (
Type
[Annotation
]) – The granularity of a single example which could be anyAnnotation
type. For example, it can beSentence
, then each training example will represent the information of a sentence.request (
Optional
[Dict
[Type
[Entry
],Union
[Dict
,List
]]]) – The request of type Dict sent toDataPack
to query specific data.skip_k (
int
) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.
- Returns
An Iterator that each time produces a Tuple of a tid (of type int) and a data pack (of type
DataPack
).
Here is an example usage:
file_path: str = "data_samples/data_pack_dataset_test"
reader = CoNLL03Reader()
context_type = Sentence
request = {Sentence: []}
skip_k = 0

train_pl: Pipeline = Pipeline()
train_pl.set_reader(reader)
train_pl.initialize()
pack_iterator: Iterator[PackType] = train_pl.process_dataset(file_path)

iterator: DataPackIterator = DataPackIterator(pack_iterator,
                                              context_type,
                                              request,
                                              skip_k)

for tid, data_pack in iterator:
    ...  # process tid and data_pack
Note
For parameters context_type, request, skip_k, please refer to
get_data()
inDataPack
.
DataPackDataset¶
-
class
forte.data.data_pack_dataset.
DataPackDataset
(data_source, feature_schemes, hparams=None, device=None)[source]¶ A dataset representing data packs. Calling a
DataIterator
over this DataPackDataset will produce an iterator over batches of examples parsed by a reader from the given data packs.
- Parameters
data_source (
DataPackIterator
) – A data source of typeDataPackIterator
.feature_schemes (
Dict
) – A Dict containing all the information to do data pre-processing. This is exactly the same as the schemes inrequest()
.hparams (
Union
[Dict
,HParams
,None
]) – A dict or instance ofHParams
containing hyperparameters. Seedefault_hparams()
inDatasetBase
for the defaults.device (
Optional
[device
]) – The device of the produced batches. For GPU training, set to current CUDA device.
-
process
(raw_example)[source]¶ Given an input which is a single data example, extract feature from it.
- Parameters
raw_example (tuple(dict, DataPack)) –
A Tuple where
The first element is a Dict produced by
get_data()
inDataPack
.The second element is an instance of type
DataPack
.
- Return type
- Returns
A Dict mapping from user-specified tags to the
Feature
extracted.Note
Please refer to
request()
for details about user-specified tags.
-
collate
(examples)[source]¶ Given a batch of output from
process()
, produce pre-processed data as well as masks and features.- Parameters
examples (
List
[Dict
[str
,Feature
]]) – A List of result fromprocess()
.- Return type
- Returns
A Texar Pytorch
Batch
It can be treated as a Dict with the following structure:”data”: List or np.ndarray or torch.tensor The pre-processed data.
Please refer to
Converter
for details.”masks”: np.ndarray or torch.tensor All the masks for pre-processed data.
Please refer to
Converter
for details.”features”: List[Feature] A List of
Feature
. This is useful when users want to do customized pre-processing.Please refer to
Feature
for details.
{
    "tag_a": {
        "data": <tensor>,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    },
    "tag_b": {
        "data": Tensor,
        "masks": [<tensor1>, <tensor2>, ...],
        "features": [<feature1>, <feature2>, ...]
    }
}
Note
The first level key in returned batch is the user-specified tags. Please refer to
request()
for details about user-specified tags.
RawExample¶
-
forte.data.data_pack_dataset.
RawExample
¶ alias of
Tuple
[int
,forte.data.data_pack.DataPack
]
FeatureCollection¶
-
forte.data.data_pack_dataset.
FeatureCollection
¶ alias of
Dict
[str
,forte.data.converter.feature.Feature
]
Batchers¶
ProcessingBatcher¶
-
class
forte.data.batchers.
ProcessingBatcher
[source]¶ This defines the basic interface of the batcher used in
BaseBatchProcessor
. This Batcher only batches data sequentially. It receives new packs dynamically and caches the current packs so that the processors can pack prediction results into the data packs.
initialize
(config)[source]¶ The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.
- Returns
None
-
get_batch
(input_pack)[source]¶ By feeding data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of datapack, context instance and batched data.
- Parameters
input_pack (~PackType) – The input data pack to get features from.
- Return type
Iterator
[Tuple
[List
[~PackType],List
[Optional
[Annotation
]],Dict
]]- Returns
An iterator of tuples, each containing a data pack, a context instance, and batch data.
Note
For backward compatibility reasons, this function returns a list of None as contexts.
-
classmethod
default_configs
()[source]¶ Define the basic configuration of a batcher. Implementation of the batcher can extend this function to include more configurable parameters but need to keep the existing ones defined in this base class.
Here, the available parameters are:
use_coverage_index: A boolean value indicating whether the batcher will try to build the coverage index based on the data request. Default is True.
cross_pack: A boolean value indicating whether the batcher can go across the boundary of data packs when there is not enough data to fill the batch.
-
FixedSizeDataPackBatcherWithExtractor¶
-
class
forte.data.batchers.
FixedSizeDataPackBatcherWithExtractor
[source]¶ This batcher uses an extractor to extract features from the dataset and groups them into batches. In this class, more pools are added. One is instance_pool, which is used to record the instances from which features are extracted. The other one is feature_pool, which is used to record features before they can be yielded in a batch.
-
initialize
(config)[source]¶ The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.
- Returns
None
-
collate
(features_collection)[source]¶ This function uses the
Converter
interface to turn a list of features into batches, where each feature is converted to tensor/matrix format. The resulting features are organized as a dictionary, where the keys are the feature names/tags, and the values are the converted features. Each feature contains the data and mask in MatrixLike form, as well as the original raw features.
-
flush
()[source]¶ Flush data in batches. Each return value contains a tuple of 3 items: the corresponding data pack, the list of annotation objects that represent the context type, and the features.
-
get_batch
(input_pack)[source]¶ By feeding data pack to this function, formatted features will be yielded based on the batching logic. Each element in the iterator is a triplet of datapack, context instance and batched data.
-
classmethod
default_configs
()[source]¶ Defines the configuration of this batcher, here:
context_type: The context scope to extract data from. It could be an annotation class or a string that is the fully qualified name of the annotation class.
feature_scheme: A dictionary of (extractor name, extractor) that can be used to extract features.
batch_size: The batch size, default is 10.
-
FixedSizeRequestDataPackBatcher¶
-
class
forte.data.batchers.
FixedSizeRequestDataPackBatcher
[source]¶ -
initialize
(config)[source]¶ The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.
- Returns
None
-
classmethod
default_configs
()[source]¶ The configuration of a batcher.
Here:
context_type (str): The fully qualified name of an Annotation type, which will be used as the context to retrieve data from. For example, if a ft.onto.Sentence type is provided, then it will extract data within each sentence.
requests: The request detail. See
get_data()
on what a request looks like.
- Return type
- Returns
The default configuration structure and default value.
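A hedged sketch of what such a configuration could look like; the request body below is purely illustrative (see get_data() for the full request format):
batcher_config = {
    "context_type": "ft.onto.base_ontology.Sentence",
    "requests": {
        # request Token entries within each Sentence context (illustrative only)
        "ft.onto.base_ontology.Token": [],
    },
}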
-
FixedSizeMultiPackProcessingBatcher¶
-
class
forte.data.batchers.
FixedSizeMultiPackProcessingBatcher
[source]¶ A Batcher used in
MultiPackBatchProcessor
.Note
this implementation is not finished.
The Batcher calls the ProcessingBatcher inherently on each specified data pack in the MultiPack.
Querying a MultiPack is flexible, so we delegate the task to the subclasses, such as:
query all packs with the same
context
andinput_info
.query different packs with different
context
andinput_info
.
Since the batcher will save the data_pack_pool on the fly, it’s not trivial to do batching and slicing of multiple data packs at the same time.
-
initialize
(config)[source]¶ The implementation should initialize the batcher and setup the internal states of this batcher. This function will be called at the pipeline initialize stage.
- Returns
None
-
classmethod
default_configs
()[source]¶ Define the basic configuration of a batcher. Implementation of the batcher can extend this function to include more configurable parameters but need to keep the existing ones defined in this base class.
Here, the available parameters are:
use_coverage_index: A boolean value indicating whether the batcher will try to build the coverage index based on the data request. Default is True.
cross_pack: A boolean value indicating whether the batcher can go across the boundary of data packs when there is not enough data to fill the batch.
- Return type
- Returns
The default configuration.
Caster¶
MultiPackBoxer¶
Types¶
ReplaceOperationsType¶
-
forte.data.types.
ReplaceOperationsType
¶ alias of
List
[Tuple
[forte.data.span.Span
,str
]]
DataRequest¶
-
forte.data.types.
DataRequest
¶ alias of
Dict
[Type
[forte.data.ontology.core.Entry
],Union
[Dict
,List
]]
Data Utilities¶
maybe_download¶
-
forte.data.data_utils.
maybe_download
(urls: List[str], path: Union[str, PathLike], filenames: Optional[List[str]] = None, extract: bool = False, num_gdrive_retries: int = 1) → List[str][source]¶ -
forte.data.data_utils.
maybe_download
(urls: str, path: Union[str, PathLike], filenames: Optional[str] = None, extract: bool = False, num_gdrive_retries: int = 1) → str Downloads a set of files.
- Parameters
urls (
Union
[List
[str
],str
]) – A (list of) URLs to download files.path (
Union
[str
, ~PathLike]) – The destination path to save the files.filenames (
Union
[List
[str
],str
,None
]) – A (list of) strings of the file names. If given, must have the same length withurls
. If None, filenames are extracted fromurls
.extract (
bool
) – Whether to extract compressed files.num_gdrive_retries (
int
) – An integer specifying the number of attempts to download file from Google Drive. Default value is 1.
- Returns
A list of paths to the downloaded files.
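A minimal usage sketch; the URL, destination directory, and file name below are hypothetical:
from forte.data.data_utils import maybe_download

local_paths = maybe_download(
    urls=["https://example.com/corpus.zip"],  # hypothetical source
    path="downloads/",                        # destination directory
    filenames=["corpus.zip"],
    extract=True,                             # unpack the archive after download
)
print(local_paths)  # list of local file paths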