Loading Data As Needed
Sometimes it is preferable to load data only when it is required. For example, when creating a pipeline that handles a large amount of image data, a naive approach would be to load the data at the beginning (for example, in a reader) and pass all of it along the pipeline.
Yet this approach can be inefficient, since the actual images travel along the pipeline, potentially over a network. If not all the processors in the pipeline need to access the image data, a better alternative is to lazily load the data when needed, while the data itself stays at a remote location (such as an NFS location or a hyperlink).
Forte’s Payload classes provide options for you to do exactly that.
[9]:
from typing import Optional
from dataclasses import dataclass

from forte.data.data_pack import DataPack
from forte.data.ontology.top import ImagePayload


@dataclass
class JpegPayload(ImagePayload):
    """
    Attributes:
        extensions (Optional[str]):
        mime (Optional[str]):
        type_code (Optional[str]):
        version (Optional[int]):
        source_type (Optional[int]):
    """

    extensions: Optional[str]
    mime: Optional[str]
    type_code: Optional[str]
    version: Optional[int]
    source_type: Optional[int]

    def __init__(self, pack: DataPack):
        super().__init__(pack)
        self.extensions: Optional[str] = None
        self.mime: Optional[str] = None
        self.type_code: Optional[str] = None
        self.version: Optional[int] = None
        self.source_type: Optional[int] = None
The class above is an example Payload class inheriting the Forte built-in ImagePayload class (note that this class is generated through the ontology generator; you should be able to find the definitions here).
The Payload classes, as their name suggests, are used to store data. A Payload class has certain default members, such as a uri and a cache, and one can also enrich the class by extending it, as above.
The simplest usage of a Payload class is to access its uri and cache. The uri is defined by you; it could be a URL or a remote file path. The cache is used to store the actual data. In a regular Forte reader implementation, one might want to specify the uri and populate the cache with actual data. Let’s see a quick example.
[10]:
# A Payload is just another regular entry object,
# so we can handle this in the same way.
datapack = DataPack()
sp = JpegPayload(datapack)
sp.uri = "http://some/path/"
print(datapack.get_single(JpegPayload).uri)
http://some/path/
We have set the uri for this particular payload, which is lightweight since we only added a string to it. While one can load the actual data into sp.cache by reading the uri now, let’s study the “lazy loading” option.
Forte allows one to do this by associating a load function with the Payload class using a simple decorator, as shown below:
[11]:
from forte.data.ontology.top import load_func


@load_func(JpegPayload)
def load(payload: JpegPayload):
    def read_uri(input_uri):  # The function to read the URI.
        # to be implemented
        pass

    return read_uri(payload.uri)  # Returns the payload content.
What happens here is that we decorate the load function with the Forte built-in load_func decorator, which associates the JpegPayload type with the load function. Note that the decorated function receives the payload itself, and its inner read_uri helper takes an input_uri, to which payload.uri is passed.
Now, when you call load on a JpegPayload object, Forte tries to populate its cache with the return value of the registered load function, which reads from the payload’s uri.
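To make the registration idea more concrete, the following is a minimal, self-contained sketch of how a decorator could attach a loading function to a class. It does not use Forte and is not Forte’s actual implementation: SketchPayload, register_loader, and load_text are hypothetical names introduced only for illustration, and the "reader" simply returns a string instead of fetching anything.

```python
from typing import Any, Callable, Optional, Type


class SketchPayload:
    """Hypothetical stand-in for a payload entry; not Forte's Payload class."""

    def __init__(self, uri: Optional[str] = None):
        self.uri = uri
        self.cache: Any = None

    def load(self) -> None:
        # Replaced per class by `register_loader` below.
        raise NotImplementedError("No load function registered for this type.")


def register_loader(payload_type: Type[SketchPayload]) -> Callable:
    """Hypothetical decorator that attaches a loading function to a class."""

    def decorator(func: Callable[[SketchPayload], Any]) -> Callable:
        def load(self: SketchPayload) -> None:
            # Store whatever the registered function returns in the cache.
            self.cache = func(self)

        payload_type.load = load
        return func

    return decorator


@register_loader(SketchPayload)
def load_text(payload: SketchPayload) -> str:
    # Placeholder "reader": a real one would fetch the data behind the URI.
    return f"content of {payload.uri}"


p = SketchPayload("http://some/path/")
p.load()
print(p.cache)
```

In this sketch the decorator returns the original function unchanged and only has the side effect of installing a load method on the class, which is one simple way to keep the loading logic next to the type it serves.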
Let’s see a full implementation of the load function for JpegPayload.
[12]:
@load_func(JpegPayload)
def load(payload: JpegPayload):
    """
    A function that reads the image data referenced by the payload.
    This function is not stored in data store but will be used
    for registering in PayloadFactory.

    Returns:
        the image data read from the payload's URI, as a NumPy array.
    """
    try:
        from PIL import Image
        import requests
        import numpy as np
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "ImagePayload reading web file requires `PIL`, "
            "`requests` and `numpy` packages to be installed."
        ) from e

    def read_uri(input_uri):
        # customize this function to read data from uri
        uri_obj = requests.get(input_uri, stream=True)
        pil_image = Image.open(uri_obj.raw)
        return np.asarray(pil_image)

    return read_uri(payload.uri)
This load implementation uses the PIL library to read images, which supports JPEG.
Now we have registered the load function to the JpegPayload class. Let’s give it a try.
[13]:
datapack = DataPack("image")
payload = JpegPayload(datapack)
datapack.add_entry(payload)
payload.uri = "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"
We have set the data URL on the payload, so we can now load the payload content at any time.
[14]:
payload.load()
print(payload.cache.shape)
(539, 810, 3)
Note that here we explicitly called the load function for illustration purposes. Forte actually allows you to directly access the cache, and it will attempt to load the data without the explicit load call.
[15]:
datapack_lazy = DataPack("image")
pl = JpegPayload(datapack_lazy)
datapack_lazy.add_entry(pl)
pl.uri = "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"
print(pl.cache.shape)
(539, 810, 3)
In this way, we achieve the “lazy loading” idea with a registered function, and users do not have to worry about when to load the content.
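The load-on-first-access behavior can be pictured with a plain Python property. The sketch below is an assumption about the general pattern, not a description of Forte’s internals: LazyCacheSketch, its loader argument, and the load_calls counter are hypothetical and exist only to show that the loader runs once, on the first access to cache.

```python
from typing import Any, Callable, Optional


class LazyCacheSketch:
    """Hypothetical holder with a lazily populated cache; not a Forte class."""

    def __init__(self, uri: str, loader: Callable[[str], Any]):
        self.uri = uri
        self._loader = loader
        self._cache: Optional[Any] = None
        self.load_calls = 0  # Counts loader invocations, for illustration.

    @property
    def cache(self) -> Any:
        # Load on first access only; later accesses reuse the stored value.
        if self._cache is None:
            self.load_calls += 1
            self._cache = self._loader(self.uri)
        return self._cache

    def clear_cache(self) -> None:
        # Drop the stored data; the next access to `cache` reloads it.
        self._cache = None


# The lambda stands in for a real reader and just measures the URI length.
holder = LazyCacheSketch("http://some/path/", lambda uri: [len(uri)])
first = holder.cache   # Triggers the loader.
second = holder.cache  # Served from the stored value.
print(holder.load_calls, first is second)
```

A design consequence worth noting is that the cost of loading moves to whichever component first touches cache, so components that never read the data never pay for it.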
Finally, there are a few usage tips:
1. Once the data is loaded into cache, it stays with the data pack (which means it will be transferred through the pipeline). Currently Forte does not have a mechanism to clean the cache automatically, but one can call the clear_cache function manually.
2. To use the lazy loading mechanism in Payload, it is preferable to register a function for a dedicated type. This helps you organize the loading methods of different types of data. Under the hood, Forte simply assigns the loading method to the corresponding Payload class. This means method overriding works as expected: if a different load function is assigned to a child class, then the load function registered to the child class will be used.
[16]:
pl.clear_cache()
print(pl._cache)
None
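The method-overriding behavior mentioned in the second tip follows from ordinary Python attribute lookup, as the sketch below illustrates. It is independent of Forte: ImageSketch, JpegSketch, and attach_loader are hypothetical names, and the loaders return strings rather than real data.

```python
from typing import Callable


class ImageSketch:
    """Hypothetical parent payload type; not a Forte class."""

    def __init__(self, uri: str):
        self.uri = uri


class JpegSketch(ImageSketch):
    """Hypothetical child payload type."""


def attach_loader(cls: type, func: Callable) -> None:
    # Assign the function as the `load` method of the given class.
    def load(self):
        return func(self)

    cls.load = load


attach_loader(ImageSketch, lambda payload: f"generic loader for {payload.uri}")

child = JpegSketch("http://some/path/")
inherited = child.load()  # Falls back to the loader on the parent class.

attach_loader(JpegSketch, lambda payload: f"jpeg loader for {payload.uri}")
overridden = child.load()  # The child class loader now takes precedence.
parent_result = ImageSketch("http://some/path/").load()  # Parent is unchanged.

print(inherited)
print(overridden)
print(parent_result)
```

Because the loader lives on the class rather than on each instance, registering a more specific loader on a child type affects only that type and its descendants.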