Indexers

Elasticsearch Indexer

class forte.indexers.elastic_indexer.ElasticSearchIndexer(config=None)[source]

Indexer class for Elasticsearch.

index(document, index_name=None, refresh=False)[source]

Index a document in the index specified by index_name. If index_name is None, it will be picked from the processor configs.

Parameters
  • document (Dict) – Document to be indexed into Elasticsearch.

  • index_name (str) – Name of the index where this document will be saved. If None, the value will be picked from the processor configs.

  • refresh (bool, str) – Refresh setting that controls when changes made by this request become visible to search. Available values are “True”, “wait_for”, and “False”.

Note: The “refresh” setting can greatly affect Elasticsearch performance. Please refer to https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html for more information on “refresh”.
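The fallback described for index_name can be illustrated with a minimal sketch. The helper resolve_index_name and the CONFIGS dict are hypothetical names used only for illustration; they are not part of the Forte API, and the indexer's actual lookup may differ.

```python
from typing import Optional

# Hypothetical stand-in for the processor configs; the value mirrors the
# documented default for "index_name".
CONFIGS = {"index_name": "elastic_indexer"}


def resolve_index_name(index_name: Optional[str], configs: dict) -> str:
    """Return the explicit index name, or the configured one when it is None."""
    return configs["index_name"] if index_name is None else index_name


explicit = resolve_index_name("my_index", CONFIGS)  # caller-supplied name wins
fallback = resolve_index_name(None, CONFIGS)        # falls back to the configs
```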

add(document, index_name=None, refresh=False)[source]

Add a document to the index specified by index_name. If index_name is None, it will be picked from the processor configs.

Parameters
  • document (Dict) – Document to be indexed into Elasticsearch.

  • index_name (str) – Name of the index where this document will be saved. If None, the value will be picked from the processor configs.

  • refresh (bool, str) – Refresh setting that controls when changes made by this request become visible to search. Available values are “True”, “wait_for”, and “False”.

Note: The “refresh” setting can greatly affect Elasticsearch performance. Please refer to https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html for more information on “refresh”.

add_bulk(documents, index_name=None, **kwargs)[source]

Add a batch of documents to the index specified by index_name. If index_name is None, it will be picked from the processor configs.

Parameters
  • documents (Iterable) – An iterable of documents to be indexed.

  • index_name (optional, str) – Name of the index where these documents will be saved. If None, the value will be picked from the processor configs.

  • kwargs (optional, dict) – Optional keyword arguments like “refresh”, “request_timeout” etc. that are passed to Elasticsearch’s bulk API. Please refer to https://elasticsearch-py.readthedocs.io/en/master/helpers.html#bulk-helpers for the complete list of arguments.

Note: The “refresh” setting can greatly affect Elasticsearch performance. Please refer to https://www.elastic.co/guide/en/elasticsearch/reference/master/docs-refresh.html for more information on “refresh”.
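As a rough illustration of what a bulk request consumes, the sketch below turns an iterable of documents into action dicts with the "_index" and "_source" keys that the elasticsearch-py bulk helpers accept. The function to_bulk_actions is a hypothetical helper, and how add_bulk builds its actions internally is an assumption, not something this page confirms.

```python
from typing import Any, Dict, Iterable, Iterator


def to_bulk_actions(
    documents: Iterable[Dict[str, Any]], index_name: str
) -> Iterator[Dict[str, Any]]:
    """Lazily wrap each document in a bulk action targeting index_name."""
    for document in documents:
        # "_index" names the target index, "_source" carries the document body.
        yield {"_index": index_name, "_source": document}


docs = [{"content": "first document"}, {"content": "second document"}]
actions = list(to_bulk_actions(docs, "elastic_indexer"))
```

Because the helper is a generator, a large document collection does not have to be held in memory before it is handed to a bulk call.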

search(query, index_name=None, **kwargs)[source]

Search the index specified by index_name for documents that match the query.

Parameters
Returns

A dict containing the documents that match the query, along with metadata about the search.

static default_configs()[source]

Returns a dictionary of default hyperparameters.

{
    "index_name": "elastic_indexer",
    "hosts": "localhost:9200",
    "algorithm": "bm25"
}

Here:

“index_name”: str

A string representing the index to which the documents will be added.

“hosts”: list, str

A host, or a list of hosts, that the Elasticsearch client will connect to.

Embedding Based Indexer

class forte.indexers.embedding_based_indexer.EmbeddingBasedIndexer(config=None)[source]

This class is used for indexing documents represented as vectors. For example, each document can be passed through a neural embedding model, and the resulting vectors are indexed using this class.

Parameters

config (Config) – Optional hyperparameters. Missing hyperparameters will be set to their default values. See default_configs() for the hyperparameter structure and default values.

static default_configs()[source]

Returns a dictionary of default configs.

{
    "index_type": "IndexFlatIP",
    "dim": 768,
    "device": "cpu"
}

Here:

“index_type”: str or class name

A string or class name representing the index type.


“dim”: int

The dimensionality of the vectors that will be indexed.
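The statement that missing hyperparameters fall back to their defaults can be sketched as a simple dictionary merge. The helper with_defaults is hypothetical, and Forte's own config handling may work differently (for example, for nested values); only the default values are taken from this page.

```python
from typing import Any, Dict, Optional

# Defaults as documented by default_configs().
DEFAULTS = {"index_type": "IndexFlatIP", "dim": 768, "device": "cpu"}


def with_defaults(
    user_config: Optional[Dict[str, Any]], defaults: Dict[str, Any]
) -> Dict[str, Any]:
    """Return a copy of defaults, overridden by any user-supplied values."""
    merged = dict(defaults)
    merged.update(user_config or {})
    return merged


# Only "dim" is supplied; "index_type" and "device" keep their defaults.
config = with_defaults({"dim": 128}, DEFAULTS)
```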

add(vectors, meta_data)[source]

Add vectors along with their meta_data into the index data structure.

Parameters
  • vectors (np.ndarray or torch.Tensor) – A PyTorch tensor or a NumPy array of shape [batch_size, *].

  • meta_data (optional dict) – Metadata associated with the vectors to be added. The metadata can include the document contents behind the vectors.

search(query, k)[source]

Search for the k nearest vectors to the query in the index.

Parameters
  • query (numpy array) – A 2-dimensional numpy array of shape [batch_size, dim] where each row corresponds to a query.

  • k (int) – The number of nearest vectors to return from the index.

Returns

A list of length batch_size, where each element is a list of length k of 2-tuples (id, meta_data[id]) holding the id and metadata associated with a retrieved vector.

results = index.search(query, k=2)

# For a single query (batch_size = 1), results has the form
# [[(id1, txt1), (id2, txt2)]]
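To make the shape of the return value concrete, here is a minimal pure-Python sketch of an exact inner-product search, which is the kind of search a flat inner-product index such as Faiss's IndexFlatIP performs. The function brute_force_search and the sample data are hypothetical; this is a stand-in for illustration, not the indexer's actual implementation.

```python
from typing import Dict, List, Sequence, Tuple


def brute_force_search(
    queries: Sequence[Sequence[float]],
    vectors: Sequence[Sequence[float]],
    meta_data: Dict[int, str],
    k: int,
) -> List[List[Tuple[int, str]]]:
    """Return, for each query, the k vectors with the largest inner product."""
    results = []
    for query in queries:
        # Score every stored vector by its inner product with the query.
        scores = [sum(q * v for q, v in zip(query, vector)) for vector in vectors]
        # Keep the ids of the k highest-scoring vectors, best first.
        top_ids = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
        results.append([(i, meta_data[i]) for i in top_ids])
    return results


vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
meta_data = {0: "doc about x", 1: "doc about y", 2: "doc about both"}
results = brute_force_search([[1.0, 0.2]], vectors, meta_data, k=2)
# results == [[(0, "doc about x"), (2, "doc about both")]]
```

The outer list has one entry per query and each inner list has k entries, matching the documented return shape.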

save(path)[source]

Save the index and metadata in the path directory. They are saved as index.faiss and index.meta_data, respectively, inside that directory.

Parameters

path (str) – A path to the directory where the index will be saved.

load(path, device=None)[source]

Load the index and metadata from the path directory.

Parameters
  • path (str) – A path to the directory to load the index from.

  • device (optional str) – Device to load the index into. If None, value will be picked from hyperparameters.
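The metadata half of a save and load round trip might look like the sketch below. Only the file name index.meta_data comes from this page; the use of pickle as the serialization format and the helper names save_meta_data and load_meta_data are assumptions made for illustration, and the Faiss index file (index.faiss) is left out entirely.

```python
import os
import pickle
import tempfile
from typing import Any, Dict


def save_meta_data(meta_data: Dict[int, Any], path: str) -> None:
    """Write meta_data to <path>/index.meta_data (pickle is an assumption)."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "index.meta_data"), "wb") as f:
        pickle.dump(meta_data, f)


def load_meta_data(path: str) -> Dict[int, Any]:
    """Read meta_data back from <path>/index.meta_data."""
    with open(os.path.join(path, "index.meta_data"), "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    index_dir = os.path.join(tmp_dir, "my_index")
    save_meta_data({0: "doc about x"}, index_dir)
    restored = load_meta_data(index_dir)
```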