Data
Ontology
base
core
class forte.data.ontology.core.Entry(pack)

    The base class inherited by all NLP entries. This is the main data type for all in-text NLP analysis results. The main sub-types are Annotation, Link and Group.

    An forte.data.ontology.top.Annotation object represents a span in text.

    A forte.data.ontology.top.Link object represents a binary link relation between two entries.

    A forte.data.ontology.top.Group object represents a collection of multiple entries.

    self.embedding
        The embedding vectors (numpy array of floats) of this entry.

    Parameters
        pack – Each entry should be associated with one pack upon creation.
    property embedding
        Get the embedding vectors (numpy array of floats) of the entry.

    property tid
        Get the id of this entry.

    property pack_id
        Get the id of the pack that contains this entry.
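The pieces above can be seen together in a minimal sketch (not part of the original API text). It assumes the Token annotation type from ft.onto.base_ontology that ships with Forte:

    from forte.data.data_pack import DataPack
    from ft.onto.base_ontology import Token

    pack = DataPack()
    pack.set_text("Forte structures NLP data.")

    token = Token(pack, 0, 5)   # an Annotation: the span [0, 5) of the text
    pack.add_entry(token)

    print(token.text)     # "Forte" -- the text covered by the span
    print(token.tid)      # the id of this entry
    print(token.pack_id)  # the id of the pack containing this entry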
class forte.data.ontology.core.BaseLink(pack, parent=None, child=None)

    abstract set_parent(parent)
        Sets the parent of the current instance to the given Entry. The parent is saved internally by its pack-specific index key.

        Parameters
            parent – The parent entry.

    abstract set_child(child)
        Sets the child of the current instance to the given Entry. The child is saved internally by its pack-specific index key.

        Parameters
            child – The child entry.
class forte.data.ontology.core.BaseGroup(pack, members=None)

    A group is an entry that represents a collection of other entries; for example, a "coreference group" is a group of coreferential entities. Each group stores a set of members, and no duplicates are allowed.

    This is the BaseGroup interface. Specific member constraints are defined in the inherited classes.

    abstract add_member(member)
        Add one entry to the group.

        Parameters
            member – One member to be added to the group.
top
class forte.data.ontology.top.Annotation(pack, begin, end)

    Annotation type entries, such as "token", "entity mention" and "sentence". Each annotation has a Span corresponding to its offset in the text.

    Parameters
        pack – The container that this annotation will be added to.
        begin – The offset of the first character in the annotation.
        end – The offset of the last character in the annotation, plus one.
class forte.data.ontology.top.Link(pack, parent=None, child=None)

    Link type entries, such as "predicate link". Each link has a parent node and a child node.

    Parameters
        pack – The container that this link will be added to.
        parent – The parent entry of the link.
        child – The child entry of the link.

    ParentType
        alias of forte.data.ontology.core.Entry

    ChildType
        alias of forte.data.ontology.core.Entry

    set_parent(parent)
        Sets the parent of the current instance to the given Entry. The parent is saved internally by its pack-specific index key.

        Parameters
            parent – The parent entry.

    set_child(child)
        Sets the child of the current instance to the given Entry. The child is saved internally by its pack-specific index key.

        Parameters
            child – The child entry.

    property parent
        Get the tid of the parent node. To get the object of the parent node, call get_parent().

    property child
        Get the tid of the child node. To get the object of the child node, call get_child().
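A short, hedged sketch of the parent/child API, assuming the EntityMention and RelationLink types from ft.onto.base_ontology:

    from forte.data.data_pack import DataPack
    from ft.onto.base_ontology import EntityMention, RelationLink

    pack = DataPack()
    pack.set_text("People moved into downtown.")

    e1 = EntityMention(pack, 0, 6)    # "People"
    e2 = EntityMention(pack, 18, 26)  # "downtown"
    pack.add_entry(e1)
    pack.add_entry(e2)

    link = RelationLink(pack, parent=e1, child=e2)
    pack.add_entry(link)

    print(link.parent)             # tid of the parent node
    print(link.get_parent().text)  # "People" -- resolve the tid to the entry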
class forte.data.ontology.top.Group(pack, members=None)

    A group is an entry that represents a collection of other entries; for example, a "coreference group" is a group of coreferential entities. Each group stores a set of members, and no duplicates are allowed.

    MemberType
        alias of forte.data.ontology.core.Entry
class forte.data.ontology.top.MultiPackGroup(pack, members=None)

    Group type entries, such as "coreference group". Each group has a set of members.

    MemberType
        alias of forte.data.ontology.core.Entry
class forte.data.ontology.top.MultiPackLink(pack, parent=None, child=None)

    This is used to link entries in a MultiPack, which is designed to support cross-pack linking; this can support applications such as sentence alignment and cross-document coreference. Each link should have a parent node and a child node. Note that the nodes are indexed by two integers: one additional index indicates which pack the node comes from.

    ParentType
        alias of forte.data.ontology.core.Entry

    ChildType
        alias of forte.data.ontology.core.Entry

    parent_pack_id()
        Return the pack_id of the parent pack.

    child_pack_id()
        Return the pack_id of the child pack.

    set_parent(parent)
        Sets the parent of the current instance to the given Entry. The parent is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

        Parameters
            parent – The parent of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.

    set_child(child)
        Sets the child of the current instance to the given Entry. The child is saved internally as a tuple: pack index and entry.tid. The pack index is the index of the data pack in the multi-pack.

        Parameters
            child – The child of the link, which is an Entry from a data pack; it has access to the pack index and its own tid in the pack.
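A hedged sketch of cross-pack linking follows. It assumes MultiPack.add_pack(name) returns a DataPack; the exact pack-creation API may differ slightly between Forte versions:

    from forte.data.multi_pack import MultiPack
    from forte.data.ontology.top import MultiPackLink
    from ft.onto.base_ontology import Sentence

    multi = MultiPack()
    src = multi.add_pack("input_src")   # assumed API, see note above
    tgt = multi.add_pack("output_tgt")
    src.set_text("Hello world.")
    tgt.set_text("Bonjour le monde.")

    s_src = Sentence(src, 0, len(src.text))
    src.add_entry(s_src)
    s_tgt = Sentence(tgt, 0, len(tgt.text))
    tgt.add_entry(s_tgt)

    # The link stores (pack index, entry tid) for each end.
    link = MultiPackLink(multi, s_src, s_tgt)
    multi.add_entry(link)
    print(link.parent_pack_id(), link.child_pack_id())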
class forte.data.ontology.top.Query(pack)

    An entry type representing queries for information retrieval tasks.

    Parameters
        pack (DataPack) – The data pack reference to which this query will be added.
Packs
BasePack
class forte.data.base_pack.BasePack(pack_name=None)

    The base class of DataPack and MultiPack.

    Parameters
        pack_name (str, optional) – A string name of the pack.

    abstract delete_entry(entry)
        Remove the entry from the pack.

        Parameters
            entry – The entry to be removed.
    add_entry(entry, component_name=None)
        Add an Entry object to the BasePack object. Duplicate entries are allowed in a pack.
    add_all_remaining_entries(component=None)
        Calling this function will add the entries that were not manually added to the pack.

        Parameters
            component (str) – Overwrite the component record with this.
    set_control_component(component)
        Record the current component that is taking control of this pack.

        Parameters
            component – The component that is going to take control.
    record_field(entry_id, field_name)
        Record who modifies the entry; will be called in Entry.

        Parameters
            entry_id – The id of the entry.
            field_name – The name of the field modified.
    on_entry_creation(entry, component_name=None)
        Call this when adding a new entry; will be called in Entry when its __init__ function is called.

        Parameters
            entry – The entry to be added.
            component_name – A name to record the component that creates the entry.
    regret_creation(entry)
        Remove the entry from the pending entries internal state of the pack.

        Parameters
            entry – The entry that we would no longer add to the pack.
    get_entry(tid)
        Look up the entry_index with key tid. The specific implementation depends on the actual class.
    get_single(entry_type)
        Take a single entry of type entry_type from this data pack. This is useful when the target entry type appears only once in the DataPack, e.g., a Document entry, or when you just want to take the first one.

        Parameters
            entry_type – The entry type to be retrieved.

        Returns
            A single data entry.
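A small usage sketch (illustrative, assuming the Document type from ft.onto.base_ontology):

    from forte.data.data_pack import DataPack
    from ft.onto.base_ontology import Document

    pack = DataPack()
    pack.set_text("A short document.")
    pack.add_entry(Document(pack, 0, len(pack.text)))

    doc = pack.get_single(Document)  # the one Document in the pack
    print(doc.text)                  # "A short document."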
    get_entries_by_creator(component)
        Return all entries created by the particular component, as an unordered set.

        Parameters
            component – The component whose entries to get.
DataPack

class forte.data.data_pack.DataPack(pack_name=None)

    A DataPack contains a piece of natural language text and a collection of NLP entries (annotations, links, and groups). The natural language text could be a document, a paragraph, or any other granularity.

    Parameters
        pack_name (str, optional) – A name for this data pack.
    validate(entry)
        Validate whether this entry type can be added. This method is called by the __init__() method when an instance of Entry is being added to the pack.

        Parameters
            entry – The entry itself.
    property text
        Return the text of the data pack.
    property all_annotations
        An iterator over all annotations in this data pack.

        Returns
            Iterator of all annotations, of type Annotation.
    property num_annotations
        Number of annotations in this data pack.

        Returns
            (int) Number of annotations.
    property all_links
        An iterator over all links in this data pack.

        Returns
            Iterator of all links, of type Link.
    property num_links
        Number of links in this data pack.

        Returns
            (int) Number of links.
    property all_groups
        An iterator over all groups in this data pack.

        Returns
            Iterator of all groups, of type Group.
    property num_groups
        Number of groups in this data pack.

        Returns
            (int) Number of groups.
    property all_generic_entries
        An iterator over all generic entries in this data pack.

        Returns
            Iterator of generic entries.
    property num_generics_entries
        Number of generics entries in this data pack.

        Returns
            (int) Number of generics entries.
    get_span_text(span)
        Get the text in the data pack contained in the span.

        Parameters
            span (Span) – A Span object containing a begin and an end index.

        Returns
            The text within this span.
    get_original_text()
        Get the original, unmodified text from the DataPack object.

        Returns
            The original text, obtained by applying the replace_back_operations of the DataPack object to the modified text.
    get_original_span(input_processed_span, align_mode='relaxed')
        Obtain the span of the original text that aligns with the given span of the processed text.

        Parameters
            input_processed_span – Span of the processed text for which the corresponding span of the original text is desired.
            align_mode – The strictness criteria for alignment in ambiguous cases, that is, if a part of input_processed_span spans a part of an inserted span, then align_mode controls whether to use that span fully or ignore it completely, according to the following possible values:
                "strict" – do not allow ambiguous input; raise a ValueError.
                "relaxed" – consider spans on both sides.
                "forward" – align looking forward, that is, ignore the span towards the left, but consider the span towards the right.
                "backward" – align looking backward, that is, ignore the span towards the right, but consider the span towards the left.

        Returns
            Span of the original text that aligns with input_processed_span.
        Example

            Let o-up1, o-up2, ... and m-up1, m-up2, ... denote the unprocessed spans of the original and modified string respectively. Note that each o-up has a corresponding m-up of the same size.

            Let o-pr1, o-pr2, ... and m-pr1, m-pr2, ... denote the processed spans of the original and modified string respectively. Note that each o-pr is modified to a corresponding m-pr that may be of a different size than o-pr.

            Original string: <--o-up1--> <-o-pr1-> <----o-up2----> <----o-pr2----> <-o-up3->

            Modified string: <--m-up1--> <----m-pr1----> <----m-up2----> <-m-pr2-> <-m-up3->

            Note that self.inverse_original_spans, which contains modified processed spans and their corresponding original spans, would look like [(o-pr1, m-pr1), (o-pr2, m-pr2)].

            >>> data_pack = DataPack()
            >>> original_text = "He plays in the park"
            >>> data_pack.set_text(original_text,
            ...                    lambda _: [(Span(0, 2), "She")])
            >>> data_pack.text
            'She plays in the park'
            >>> input_processed_span = Span(0, len("She plays"))
            >>> orig_span = data_pack.get_original_span(input_processed_span)
            >>> data_pack.get_original_text()[orig_span.begin: orig_span.end]
            'He plays'
    classmethod deserialize(string)
        Deserialize a DataPack from a string. This internally calls the internal _deserialize() function from BasePack.

        Parameters
            string – The serialized string of a data pack to be deserialized.

        Returns
            A data pack object deserialized from the string.
    delete_entry(entry)
        Delete an Entry object from the DataPack. This finds the entry in the index and removes it from the index. Note that entries will only appear in the index if add_entry (or _add_entry_with_check) is called.

        Please note that deleting an entry does not guarantee the deletion of the related entries.

        Parameters
            entry (Entry) – An Entry object to be deleted from the pack.
    get_data(context_type, request=None, skip_k=0)
        Fetch entries from the data_pack of type context_type.

        Currently, Groups and Generics are not supported in the request.

        Example

            requests = {
                base_ontology.Sentence: {
                    "component": ["dummy"],
                    "fields": ["speaker"],
                },
                base_ontology.Token: ["pos", "sense"],
                base_ontology.EntityMention: {
                    "unit": "Token",
                },
            }
            pack.get_data(base_ontology.Sentence, requests)

        Parameters
            context_type (str) – The granularity of the data context, which could be any Annotation type.
            request (dict) – The entry types and fields required. The keys of the requests dict are the required entry types, and each value should be either:
                a list of field names, or
                a dict which accepts three keys: "fields", "component", and "unit".
                By setting "fields" (list), users specify the requested fields of the entry. If "fields" is not specified, only the default fields will be returned.
                By setting "component" (list), users can specify the components by which the entries were generated. If "component" is not specified, entries generated by all components will be returned.
                By setting "unit" (string), users can specify a unit by which the annotations are indexed.
                Note that for all annotation types, "text" and "span" fields are returned by default; for all link types, "child" and "parent" fields are returned by default.
            skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

        Returns
            A data generator, which generates one piece of data (a dict containing the required entries, fields, and context).
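As an illustration, a hedged sketch of consuming the generator, continuing the requests example above (the key layout of the yielded dict is assumed from the default fields described here):

    # Each yielded item is a dict keyed by "context" plus the
    # requested entry types; the field layout is assumed as above.
    for data in pack.get_data(base_ontology.Sentence, requests):
        print(data["context"])        # text of the current sentence
        print(data["Token"]["text"])  # "text" field of tokens in the sentence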
    build_coverage_for(context_type, covered_type)
        Users can call this function to build a coverage index for specific types. The index provides an in-memory mapping from entries of context_type to the entries "covered" by it; see the sketch below. See forte.data.data_pack.DataIndex for more details.

        Parameters
            context_type – The context/covering type.
            covered_type – The entry type to find under the context type.
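For illustration, a hedged sketch of the intended usage, assuming pack is an already populated DataPack and the Sentence/Token types from ft.onto.base_ontology:

    from ft.onto.base_ontology import Sentence, Token

    # Precompute the Sentence -> Token mapping once, so that the
    # repeated range queries below do not re-scan spans.
    pack.build_coverage_for(context_type=Sentence, covered_type=Token)
    for sentence in pack.get(Sentence):
        tokens = list(pack.get(Token, range_annotation=sentence))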
    get(entry_type, range_annotation=None, components=None)
        This function is used to get data from a data pack with various methods.

        Example

            for sentence in input_pack.get(Sentence):
                token_entries = input_pack.get(
                    entry_type=Token,
                    range_annotation=sentence,
                    components=token_component)
                ...

        In the above code snippet, we get entries of type Token within each sentence which were generated by token_component.

        Parameters
            entry_type (type) – The type of entries requested.
            range_annotation (Annotation, optional) – The range of entries requested. If None, valid entries in the range of the whole data pack will be returned.
            components (str or list, optional) – The component (creator) generating the entries requested. If None, valid entries generated by any component will be returned.
BaseMeta
Meta

class forte.data.data_pack.Meta(pack_name=None, language='eng', span_unit='character')

    Basic meta information associated with each instance of DataPack.

    Parameters
        pack_name – A name to identify the data pack, which is helpful in situations like serialization. It is suggested that the packs should have different doc ids.
        language – The language used by this data pack; default is English.
        span_unit – The unit used for interpreting the Span object of this data pack; default is character.
BaseIndex

class forte.data.base_pack.BaseIndex

    A set of indexes used in BasePack:

        entry_index, the index from each tid to the corresponding entry;
        type_index, the index from each type to the entries of that type;
        link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links;
        group_index, the index from group members to groups.
    update_basic_index(entries)
        Build or update the basic indexes, including:

        (1) entry_index, the index from each tid to the corresponding entry;
        (2) type_index, the index from each type to the entries of that type;
        (3) component_index, the index from each component to the entries generated by that component.

        Parameters
            entries (list) – A list of entries to be added into the basic index.
    build_link_index(links)
        Build the link_index, the index from child and parent nodes to links, using the links in the dataset. link_index consists of two sub-indexes: "child_index" is the index from child nodes to their corresponding links, and "parent_index" is the index from parent nodes to their corresponding links.

        Parameters
            links (list) – A list of links to be added into the index.
    build_group_index(groups)
        Build group_index, the index from group members to groups.

        Parameters
            groups (list) – A list of groups to be added into the index.
    link_index(tid, as_parent=True)
        Look up the link_index with key tid. If the link index is not built, this will raise a PackIndexError.
    group_index(tid)
        Look up the group_index with key tid. If the index is not built, this will raise a PackIndexError.
    update_link_index(links)
        Update link_index with the provided links, the index from child and parent nodes to links. link_index consists of two sub-indexes: "child_index" is the index from child nodes to their corresponding links, and "parent_index" is the index from parent nodes to their corresponding links.

        Parameters
            links (list) – A list of links to be added into the index.
    update_group_index(groups)
        Build or update group_index, the index from group members to groups.

        Parameters
            groups (list) – A list of groups to be added into the index.
DataIndex

class forte.data.data_pack.DataIndex

    A set of indexes used in DataPack. Note that this class is used by the DataPack internally.

        entry_index, the index from each tid to the corresponding entry;
        type_index, the index from each type to the entries of that type;
        component_index, the index from each component to the entries generated by that component;
        link_index, the index from child (link_index["child_index"]) and parent (link_index["parent_index"]) nodes to links;
        group_index, the index from group members to groups;
        _coverage_index, the index that maps from an annotation to the entries it covers. _coverage_index is a dict of dicts, where the key is a tuple of the outer entry type and the inner entry type. The outer entry type should be an annotation type. The value is a dict, where the key is the tid of the outer entry, and the value is a set of tids that are covered by the outer entry. We say an Annotation A covers an entry E if one of the following conditions is met:
            1. E is of Annotation type, with E.begin >= A.begin and E.end <= A.end;
            2. E is of Link type, and both E's parent and child nodes are Annotations that are covered by A.
    coverage_index(outer_type, inner_type)
        Get the coverage index from outer_type to inner_type.
    build_coverage_index(data_pack, outer_type, inner_type)
        Build the coverage index from outer_type to inner_type.
    have_overlap(entry1, entry2)
        Check whether the two annotations have overlapping spans.

        Parameters
            entry1 (str or Annotation) – An Annotation object to be checked, or the tid of the Annotation.
            entry2 (str or Annotation) – Another Annotation object to be checked, or the tid of the Annotation.
Readers
BaseReader
class forte.data.readers.base_reader.BaseReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    The basic data reader class, to be inherited by all data readers.

    Parameters
        from_cache (bool, optional) – Decides whether to read from cache if a cache file exists. By default (False), the reader will only read from the original file and use the cache file path for caching; it will not read from the cache_directory. If True, the reader will try to read a datapack from the caching file.
        cache_directory (str, optional) – The base directory to place the caching files. Each collection is contained in one cached file under this directory. The cached location for each collection is computed by _cache_key_function(). Note: a collection is the data returned by _collect().
        append_to_cache (bool, optional) – Decides whether to append if the cache file already exists. By default (False), the existing caching file will be overwritten. If True, the datapack will be appended to the end of the caching file.
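To make the extension points concrete, here is a hedged skeleton of a custom reader. It uses only the hooks named on this page (_collect(), _parse_pack(), _cache_key_function()); the one-DataPack-per-line behavior is purely illustrative:

    from typing import Iterator
    from forte.data.data_pack import DataPack
    from forte.data.readers.base_reader import PackReader

    class LinePackReader(PackReader):
        """Yields one DataPack per non-empty line of a text file."""

        def _collect(self, file_path: str) -> Iterator[str]:
            # Each "collection" is one line of the input file.
            with open(file_path, encoding="utf-8") as f:
                for line in f:
                    if line.strip():
                        yield line.strip()

        def _parse_pack(self, line: str) -> Iterator[DataPack]:
            pack = DataPack()
            pack.set_text(line)
            yield pack

        def _cache_key_function(self, line: str) -> str:
            # Determines the cache file location for a collection.
            return str(abs(hash(line)))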
    initialize(resources, configs)
        The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and register global resources into resources. The implementation should set up the states of the component.

        Parameters
            resources (Resources) – A global resource registry. Users can register shareable resources here, for example, the vocabulary.
            configs (Config) – The configuration passed in to set up this component.
    classmethod default_configs()
        Returns a dict of configurations of the reader with default values, used to replace the missing values of input configs during pipeline construction:

            {
                "name": "reader"
            }
    parse_pack(collection)
        Calls _parse_pack() to create packs from the collection. This internally sets up the component metadata. Users should implement the _parse_pack() method.
    text_replace_operation(text)
        Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

        Parameters
            text – The original data text to be cleaned.

        Returns
            (List[Tuple[Tuple[int, int], str]]) The replacement operations.
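As a hedged illustration of the return shape, a hypothetical reader subclass that strips a leading "http://" from every document:

    from typing import List, Tuple
    from forte.data.readers.plaintext_reader import PlainTextReader

    class StrippingReader(PlainTextReader):
        def text_replace_operation(
                self, text: str) -> List[Tuple[Tuple[int, int], str]]:
            # Replace the span covering "http://" with the empty string.
            if text.startswith("http://"):
                return [((0, len("http://")), "")]
            return []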
    iter(*args, **kwargs)
        An iterator over the entire dataset, yielding all the Packs read from the data source(s). If not reading from cache, it should call collect().

        Parameters
            args – One or more input data sources; for example, most DataPack readers accept a data_source as a file/folder path.
            kwargs – Additional keyword arguments for the data source.

        Returns
            Iterator of DataPacks.
    cache_data(collection, pack, append)
        Specify the path to the cache directory.

        After you call this method, the dataset reader will use its cache_directory to store a cache of BasePack read from every document passed to read(), serialized as one string-formatted BasePack. If the cache file for a given file_path exists, we read the BasePack from the cache. If the cache file does not exist, we will create it on our first pass through the data.

        Parameters
            collection – The collection is a piece of data from the _collect() function, to be read to produce DataPack(s). During caching, a cache key is computed based on the data in this collection.
            pack – The data pack to be cached.
            append – Whether to allow appending to the cache.
    read_from_cache(cache_filename)
        Reads one or more Packs from cache_filename, and yields Pack(s) from the cache file.

        Parameters
            cache_filename – Path to the cache file.

        Returns
            List of cached data packs.
    finish(resources)
        The pipeline will call this function at the end of the pipeline to notify all the components. The user can implement this function to release resources used by this component. The component can also add objects to the resources.

        Parameters
            resources (Resources) – A global resource registry.
PackReader
MultiPackReader
CoNLL03Reader
class forte.data.readers.conll03_reader.CoNLL03Reader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    CoNLL03Reader is designed to read in the CoNLL03 dataset.

    The dataset is from the following paper: Sang, Erik F., and Fien De Meulder. "Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition." arXiv preprint cs/0306050 (2003).

    Data can be downloaded from https://deepai.org/dataset/conll-2003-english

    Data format: Data files contain one line "-DOCSTART- -X- -X- O" to represent the start of a document. After that, each line contains one word, and an empty line represents the start of a new sentence. Each line contains four fields: the word, its part-of-speech tag, its chunk tag, and its named entity tag.

    Example

        EU       NNP  B-NP  B-ORG
        rejects  VBZ  B-VP  O
        German   JJ   B-NP  B-MISC
        call     NN   I-NP  O
        to       TO   B-VP  O
        boycott  VB   I-VP  O
        British  JJ   B-NP  B-MISC
        lamb     NN   I-NP  O
        .        .    O     O
ConllUDReader
BaseDeserializeReader
RawDataDeserializeReader
RecursiveDirectoryDeserializeReader
HTMLReader
class forte.data.readers.html_reader.HTMLReader(*args, **kwargs)

    HTMLReader is designed to read in a list of HTML strings. It takes in a list of HTML strings, cleans the HTML tags, and stores the cleaned text in a pack.
MSMarcoPassageReader
MultiPackSentenceReader
class forte.data.readers.multipack_sentence_reader.MultiPackSentenceReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    MultiPackSentenceReader is designed to read a directory of files and convert each file's contents into a data pack. This class yields a multipack with a pack named input_pack_name containing the file's contents. It additionally packs an empty pack with the name output_pack_name into the multipack.
    classmethod default_configs()
        Returns a dictionary of hyperparameters with default values:

            {
                "name": "multipack_sentence_reader",
                "input_pack_name": "input_src",
                "output_pack_name": "output_tgt"
            }

        Here:

            "name": str – Name of the reader.
            "input_pack_name": str – Name of the input pack. This name can be used to retrieve the input pack from the multipack.
            "output_pack_name": str – Name of the output pack. This name can be used to retrieve the output pack from the multipack.
MultiPackTerminalReader
OntonotesReader
class forte.data.readers.ontonotes_reader.OntonotesReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    OntonotesReader is designed to read in the English OntoNotes v5.0 data in the format used by the CoNLL 2011/2012 shared tasks. To use this Reader, you must follow the instructions provided here (v12 release), which will allow you to download the CoNLL-style annotations for the OntoNotes v5.0 release – LDC2013T19.tgz obtained from LDC.

    Parameters
        column_format – A list of strings indicating which field each column in a line corresponds to. The length of the list should be equal to the number of columns in the files to be read. Available field types include:

            "document_id"
            "part_number"
            "word"
            "pos_tag"
            "lemmatised_word"
            "framenet_id"
            "word_sense"
            "speaker"
            "entity_label"
            "coreference"
            "*predicate_labels"

        Field types marked with * indicate a variable-column field: it could span multiple columns. Only one such field is allowed in the format specification. If a column should be ignored, fill in None at the corresponding position.
    class ParsedFields(word, predicate_labels, document_id, part_number, pos_tag, lemmatised_word, framenet_id, word_sense, speaker, entity_label, coreference)

        property word
            Alias for field number 0
        property predicate_labels
            Alias for field number 1
        property document_id
            Alias for field number 2
        property part_number
            Alias for field number 3
        property pos_tag
            Alias for field number 4
        property lemmatised_word
            Alias for field number 5
        property framenet_id
            Alias for field number 6
        property word_sense
            Alias for field number 7
        property speaker
            Alias for field number 8
        property entity_label
            Alias for field number 9
        property coreference
            Alias for field number 10
    initialize(resources, configs)
        The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and register global resources into resources. The implementation should set up the states of the component.

        Parameters
            resources (Resources) – A global resource registry. Users can register shareable resources here, for example, the vocabulary.
            configs (Config) – The configuration passed in to set up this component.
    classmethod default_configs()
        Returns a dictionary of default hyperparameters:

            {
                "name": "reader",
                "column_format": [
                    "document_id",
                    "part_number",
                    None,
                    "word",
                    "pos_tag",
                    None,
                    "lemmatised_word",
                    "framenet_id",
                    "word_sense",
                    "speaker",
                    "entity_label",
                    "*predicate_labels",
                    "coreference",
                ]
            }

        Here:

            "column_format": list – A list of default column types.

        Note: A None field means that the corresponding column in the dataset file will be ignored during parsing.
PlainTextReader

class forte.data.readers.plaintext_reader.PlainTextReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    PlainTextReader is designed to read in plain text datasets.

    text_replace_operation(text)
        Given the possibly noisy text, compute and return the replacement operations in the form of a list of (span, str) pairs, where the content in the span will be replaced by the corresponding str.

        Parameters
            text – The original data text to be cleaned.

        Returns
            (List[Tuple[Tuple[int, int], str]]) The replacement operations.
ProdigyReader

class forte.data.readers.prodigy_reader.ProdigyReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    ProdigyReader is designed to read in Prodigy output text.
RACEMultiChoiceQAReader

class forte.data.readers.race_multi_choice_qa_reader.RACEMultiChoiceQAReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    RACEMultiChoiceQAReader is designed to read in the RACE multi-choice QA dataset.
StringReader

class forte.data.readers.string_reader.StringReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    StringReader is designed to read in a list of string variables.
SemEvalTask8Reader

class forte.data.readers.sem_eval_task8_reader.SemEvalTask8Reader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    SemEvalTask8Reader is designed to read in the SemEval Task-8 dataset. The data can be obtained here: http://www.kozareva.com/downloads.html

    Hendrickx, Iris, et al. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. https://www.aclweb.org/anthology/S10-1006.pdf

    An example from the dataset:

        8 "<e1>People</e1> have been moving back into <e2>downtown</e2>."
        Entity-Destination(e1,e2)
        Comment:

    This example will be converted into one Sentence, "People have been moving back into downtown.", and one RelationLink,

        link = RelationLink(parent=People, child=downtown)
        link.rel_type = Entity-Destination

    in the DataPack.
OpenIEReader

class forte.data.readers.openie_reader.OpenIEReader(from_cache=False, cache_directory=None, append_to_cache=False, cache_in_memory=False)

    OpenIEReader is designed to read in the Open IE dataset used by the Open Information Extraction task. The related paper can be found here. The related source code for generating this dataset can be found here. To use this Reader, you must follow the dataset format. Each line in the dataset should contain the following fields:

        <sentence>\t<predicate_head>\t<full_predicate>\t<arg1>\t<arg2>....

    You can also find the dataset format here.
    initialize(resources, configs)
        The pipeline will call the initialize method at the start of processing. The processor and reader will be initialized with configs, and register global resources into resources. The implementation should set up the states of the component.

        Parameters
            resources (Resources) – A global resource registry. Users can register shareable resources here, for example, the vocabulary.
            configs (Config) – The configuration passed in to set up this component.
DataPack Dataset
DataPackIterator

class forte.data.data_pack_dataset.DataPackIterator(pack_iterator, context_type, request=None, skip_k=0)

    An iterator over single data examples from multiple data packs.

    Parameters
        pack_iterator (Iterator[DataPack]) – An iterator of DataPack.
        context_type – The granularity of a single example, which could be any Annotation type. For example, it can be Sentence, in which case each training example will represent the information of a sentence.
        request – The request of type Dict sent to DataPack to query specific data.
        skip_k (int) – Will skip the first skip_k instances and generate data from the (skip_k + 1)th instance.

    Returns
        An iterator that each time produces a tuple of a tid (of type int) and a data pack (of type DataPack).

    Here is an example usage:

        file_path: str = "data_samples/data_pack_dataset_test"
        reader = CoNLL03Reader()
        context_type = Sentence
        request = {Sentence: []}
        skip_k = 0

        train_pl: Pipeline = Pipeline()
        train_pl.set_reader(reader)
        train_pl.initialize()
        pack_iterator: Iterator[PackType] = train_pl.process_dataset(file_path)

        iterator: DataPackIterator = DataPackIterator(pack_iterator,
                                                      context_type,
                                                      request,
                                                      skip_k)

        for tid, data_pack in iterator:
            # process tid and data_pack
            ...

    Note: For the parameters context_type, request, and skip_k, please refer to get_data() in DataPack.
DataPackDataset

class forte.data.data_pack_dataset.DataPackDataset(data_source, feature_schemes, hparams=None, device=None)

    A dataset representing data packs. Calling a DataIterator over this DataPackDataset will produce an iterator over batches of examples parsed by a reader from the given data packs.

    Parameters
        data_source – A data source of type DataPackDataSource.
        feature_schemes (dict) – A dict containing all the information needed for data pre-processing. This is exactly the same as the schemes in feature_resource. Please refer to feature_resource() in TrainPreprocessor for details.
        hparams – A dict or instance of ~texar.torch.HParams containing hyperparameters. See default_hparams() in DatasetBase for the defaults.
        device – The device of the produced batches. For GPU training, set to the current CUDA device.
    process(raw_example)
        Given an input which is a single data example, extract features from it.

        Parameters
            raw_example (tuple(dict, DataPack)) – A tuple where the first element is a dict produced by get_data() in DataPack, and the second element is an instance of type DataPack.

        Returns
            A dict mapping from user-specified tags to the Feature extracted.

        Note: Please refer to feature_resource() in TrainPreprocessor for details about user-specified tags.
    collate(examples)
        Given a batch of outputs from process(), produce pre-processed data as well as masks and features.

        Parameters
            examples – A list of results from process().

        Returns
            A texar Batch. It can be treated as a dict with the following structure:

                {
                    "tag_a": {
                        "data": <tensor>,
                        "masks": [<tensor1>, <tensor2>, ...],
                        "features": [<feature1>, <feature2>, ...]
                    },
                    "tag_b": {
                        "data": <tensor>,
                        "masks": [<tensor1>, <tensor2>, ...],
                        "features": [<feature1>, <feature2>, ...]
                    }
                }

            "data": List, np.ndarray, or torch.Tensor – The pre-processed data. Please refer to Converter for details.
            "masks": np.ndarray or torch.Tensor – All the masks for pre-processed data. Please refer to Converter for details.
            "features": List[Feature] – A list of Feature. This is useful when users want to do customized pre-processing. Please refer to Feature for details.

        Note: The first-level keys in the returned batch are the user-specified tags. Please refer to feature_resource() in TrainPreprocessor for details about user-specified tags.
Batchers
ProcessingBatcher

class forte.data.batchers.ProcessingBatcher(cross_pack=True)

    This defines the basic interface of the batcher used in BatchProcessor. This batcher only batches data sequentially. It receives new packs dynamically and caches the current packs, so that the processors can pack prediction results into the data packs.

    Parameters
        cross_pack (bool, optional) – Whether to allow batches to go across data packs when there is not enough data at the end.

    initialize(_)
        The implementation should initialize the batcher and set up the internal states of this batcher. This batcher will be called at the pipeline initialize stage.
Data Utilities
maybe_download

forte.data.data_utils.maybe_download(urls, path, filenames=None, extract=False)

    Downloads a set of files.

    Parameters
        urls – A (list of) URL(s) to download files from.
        path – The destination path to save the files.
        filenames – A (list of) strings of the file names. If given, must have the same length as urls. If None, filenames are extracted from urls.
        extract – Whether to extract compressed files.

    Returns
        A list of paths to the downloaded files.
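A hedged usage sketch; the URL is a placeholder and the destination directory is assumed writable:

    from forte.data.data_utils import maybe_download

    paths = maybe_download(
        urls=["https://example.com/dataset.zip"],  # placeholder URL
        path="./downloads",
        extract=True,  # also unpack the archive after downloading
    )
    print(paths)  # local paths of the downloaded files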