Loading Data As Needed
Sometimes it is preferable to load data only when it is required. For example, when creating a pipeline that handles a large amount of image data, a naive way would be to load the data at the beginning (i.e. through a reader), and pass all the data along the pipeline.
Yet this approach could be inefficient, since the actual images are passed along the pipeline, potentially through a network. If not all the processors in the pipeline need to access the image data, a better alternative is to lazily load the data when needed, while all the data stays at an online location (such as an NFS location or a hyperlink).
Payload classes provide options for you to do exactly that.
    from typing import Optional
    from dataclasses import dataclass

    from forte.data.data_pack import DataPack
    from forte.data.ontology.top import ImagePayload


    @dataclass
    class JpegPayload(ImagePayload):
        """
        Attributes:
            extensions (Optional[str]):
            mime (Optional[str]):
            type_code (Optional[str]):
            version (Optional[int]):
            source_type (Optional[int]):
        """

        extensions: Optional[str]
        mime: Optional[str]
        type_code: Optional[str]
        version: Optional[int]
        source_type: Optional[int]

        def __init__(self, pack: DataPack):
            super().__init__(pack)
            self.extensions: Optional[str] = None
            self.mime: Optional[str] = None
            self.type_code: Optional[str] = None
            self.version: Optional[int] = None
            self.source_type: Optional[int] = None
The class above is an example Payload class inheriting the Forte built-in ImagePayload class (note that this class is generated through the ontology generator; you should be able to find its definitions in the Forte repository).
Payload classes, as their name suggests, are used to store data. A Payload class has certain default members, such as a uri and a cache, and one can also enrich the class by extending it, as above.
The simplest usage of a Payload class is to access its uri and cache. The uri is defined by you; it could be a URL or a remote file path. The cache is used to store the actual data. In a regular Forte reader implementation, one might want to specify the uri and populate the cache with actual data. Let’s see a quick example.
    # A Payload is just another regular entry object,
    # so we can handle this in the same way.
    datapack = DataPack()
    sp = JpegPayload(datapack)
    sp.uri = "http://some/path/"
    print(datapack.get_single(JpegPayload).uri)
We have set the uri for this particular payload, which is lightweight since we only added a string to it. While one can load the actual data into sp.cache by reading the uri now, let’s study the “lazy loading” option.
Forte allows one to do this by associating a load function with the Payload class using a simple decorator, as below:
    from forte.data.ontology.top import load_func


    @load_func(JpegPayload)
    def load(payload: JpegPayload):
        def read_uri(input_uri):
            # The function to read the URI.
            # to be implemented
            pass

        return read_uri(payload.uri)  # Returns the payload content.
What happens here is that we decorate the load function with the Forte built-in load_func decorator, which associates the JpegPayload type with the load function. Note that the inner read_uri helper takes an input_uri as input; the load function passes the payload's uri to it.
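To make the registration step more concrete, here is a minimal sketch, using only the standard library, of how a decorator of this kind could attach a loading function to a class. It is not Forte's implementation: SketchPayload, register_loader and fake_load are hypothetical names, and the only assumption carried over from the text is that the decorated function receives the payload and returns its content.

```python
from typing import Any, Callable, Optional, Type


class SketchPayload:
    """Hypothetical stand-in for a payload: a URI plus a cache slot."""

    def __init__(self, uri: Optional[str] = None):
        self.uri = uri
        self.cache: Any = None

    def load(self) -> None:
        # Default behaviour when nothing has been registered.
        raise NotImplementedError("No loading function registered.")


def register_loader(payload_type: Type[SketchPayload]) -> Callable:
    """Return a decorator that installs `func` as `payload_type.load`."""

    def decorator(func: Callable[[SketchPayload], Any]) -> Callable:
        def load(self: SketchPayload) -> None:
            # Store whatever the registered function returns in the cache.
            self.cache = func(self)

        payload_type.load = load
        return func

    return decorator


@register_loader(SketchPayload)
def fake_load(payload: SketchPayload) -> str:
    # A real loader would fetch data from payload.uri; this one only
    # echoes the URI so the sketch runs without network access.
    return f"content of {payload.uri}"


p = SketchPayload("http://some/path/")
p.load()
print(p.cache)  # content of http://some/path/
```

The decorator returns the original function unchanged, so it can still be called directly; only the class gains a new load method.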
Now when you call the load function in the JpegPayload class, it will try to populate the cache with the return value of the registered load function.
Let’s see a full implementation of this function.
    @load_func(JpegPayload)
    def load(payload: JpegPayload):
        """
        A function that reads the image referenced by the payload's uri.

        This function is not stored in the data store but will be used
        for registering in PayloadFactory.

        Returns:
            the image data read from the url, as a numpy array.
        """
        try:
            from PIL import Image
            import requests
            import numpy as np
        except ModuleNotFoundError as e:
            raise ModuleNotFoundError(
                "ImagePayload reading web file requires `PIL`, "
                "`requests` and `numpy` packages to be installed."
            ) from e

        def read_uri(input_uri):
            # customize this function to read data from uri
            uri_obj = requests.get(input_uri, stream=True)
            pil_image = Image.open(uri_obj.raw)
            return np.asarray(pil_image)

        return read_uri(payload.uri)
This load implementation uses the PIL library, which supports JPEG, to read images.
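Since a uri may be a URL or a file path, the reading function can be customized to handle both. The following is a sketch under that assumption, using only the standard library; read_bytes is a hypothetical helper, it treats anything that is not an http(s) URL as a local path, and decoding the bytes into an image (for example with PIL) is left out.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def read_bytes(input_uri: str) -> bytes:
    """Read raw bytes from an http(s) URL or from a local file path."""
    scheme = urlparse(input_uri).scheme
    if scheme in ("http", "https"):
        # Remote resource: fetch it over the network.
        with urlopen(input_uri) as response:
            return response.read()
    # Anything else is treated as a path on the local file system.
    return Path(input_uri).read_bytes()


# Demonstrate the local-path branch with a temporary file,
# so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    local_file = Path(tmp) / "demo.bin"
    local_file.write_bytes(b"\xff\xd8\xff")  # first bytes of a JPEG file
    data = read_bytes(str(local_file))

print(len(data))  # 3
```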
Now we have registered the load function to the JpegPayload class. Let’s have a try.
    datapack = DataPack("image")
    payload = JpegPayload(datapack)
    datapack.add_entry(payload)
    payload.uri = "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"
We have successfully set the data URL, so we can now load the payload content at any time. Calling the load function on the payload and printing the shape of its cache gives:
(539, 810, 3)
Note that here we explicitly called the load function for illustration purposes. Forte actually allows you to directly access the cache, and it will attempt to load the data without an explicit call to load.
    datapack_lazy = DataPack("image")
    pl = JpegPayload(datapack_lazy)
    datapack_lazy.add_entry(pl)
    pl.uri = "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"
    print(pl.cache.shape)
(539, 810, 3)
In this way, we achieve the “lazy loading” idea with a registered function, without users having to worry about when to load the content.
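The behaviour described above, where reading the cache triggers loading, can be pictured as a property that calls the loader on first access. This is an illustrative sketch only, not Forte's code; LazyPayload, its loader argument and the load_count counter are hypothetical.

```python
from typing import Any, Callable, Optional


class LazyPayload:
    """Hypothetical payload whose cache is filled on first access."""

    def __init__(self, uri: str, loader: Callable[[str], Any]):
        self.uri = uri
        self._loader = loader
        self._cache: Optional[Any] = None
        self.load_count = 0  # only here to show when loading happens

    def load(self) -> None:
        self.load_count += 1
        self._cache = self._loader(self.uri)

    @property
    def cache(self) -> Any:
        # Load on first access; later accesses reuse the stored value.
        if self._cache is None:
            self.load()
        return self._cache


# The lambda stands in for a real reader such as an image decoder.
pl = LazyPayload("http://some/path/", loader=lambda uri: [uri.upper()])
print(pl.load_count)  # 0
first = pl.cache
second = pl.cache
print(pl.load_count)  # 1
```

Creating the payload costs nothing beyond storing a string; the loader runs once, when the cache is first read.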
Finally, there are a few usage tips:
1. Once the data is loaded into the cache, it will stay with the data pack (which means it will be transferred through the pipeline). Currently Forte does not have a mechanism to automatically clean the cache. One can call the clear_cache function manually.
2. To use the lazy loading mechanism in Payload, it is preferable to register a function for a dedicated type. This will help you organize the loading methods of different types of data. Under the hood, Forte simply assigns the loading method to the corresponding Payload class. This means method overriding will work as expected: if a different load function is assigned to a child class, then the load function registered to the child class will be used.
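Both tips can be illustrated with a small stand-alone sketch. It assumes, as stated above, that registration assigns a method to the class, so ordinary attribute lookup decides which loader runs. All names (BasePayload, ChildPayload, register_loader) are hypothetical, and clear_cache here simply resets the stored value; Forte's own method may do more.

```python
from typing import Any, Callable, Type


class BasePayload:
    """Hypothetical parent payload type."""

    def __init__(self, uri: str):
        self.uri = uri
        self.cache: Any = None

    def clear_cache(self) -> None:
        # Drop the loaded data so it no longer travels with the object.
        self.cache = None


class ChildPayload(BasePayload):
    """Hypothetical child type that gets its own loader."""


def register_loader(payload_type: Type[BasePayload]) -> Callable:
    def decorator(func: Callable[[BasePayload], Any]) -> Callable:
        def load(self: BasePayload) -> None:
            self.cache = func(self)

        # Assigning on the class means subclasses inherit this method
        # unless they register a loader of their own.
        payload_type.load = load
        return func

    return decorator


@register_loader(BasePayload)
def load_base(payload: BasePayload) -> str:
    return "base:" + payload.uri


@register_loader(ChildPayload)
def load_child(payload: BasePayload) -> str:
    return "child:" + payload.uri


base, child = BasePayload("a"), ChildPayload("b")
base.load()
child.load()
print(base.cache, child.cache)  # base:a child:b

child.clear_cache()
print(child.cache)  # None
```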