Loading Data As Needed

Sometimes it is preferable to load data only when it is required. For example, when creating a pipeline that handles a large amount of image data, a naive approach would be to load the data at the beginning (i.e., through a reader) and pass all of it along the pipeline.

Yet this approach can be inefficient because the actual images are passed along the pipeline, potentially over a network. If not all the processors in the pipeline need to access the image data, a better alternative is to lazily load the data when needed, while the data itself stays at an online location (such as an NFS location or a hyperlink).

Forte’s Payload classes provide options for you to do exactly that.

[9]:
from typing import Optional
from dataclasses import dataclass

from forte.data.data_pack import DataPack
from forte.data.ontology.top import ImagePayload

@dataclass
class JpegPayload(ImagePayload):
    """
    Attributes:
        extensions (Optional[str]):
        mime (Optional[str]):
        type_code (Optional[str]):
        version (Optional[int]):
        source_type (Optional[int]):
    """

    extensions: Optional[str]
    mime: Optional[str]
    type_code: Optional[str]
    version: Optional[int]
    source_type: Optional[int]

    def __init__(self, pack: DataPack):
        super().__init__(pack)
        self.extensions: Optional[str] = None
        self.mime: Optional[str] = None
        self.type_code: Optional[str] = None
        self.version: Optional[int] = None
        self.source_type: Optional[int] = None

The class above is an example Payload class that inherits from Forte’s built-in ImagePayload class. (Note that this class is generated through the ontology generator; you should be able to find the definitions here.)

The Payload classes, as their name suggests, are used to store data. A Payload class has certain default members, such as a uri and a cache, and one can also enrich the class by extending it, as shown above.

The simplest use of a Payload class is to access its uri and cache. You define the uri, which could be a URL or a remote file path, and the cache stores the actual data. In a regular Forte reader implementation, one might specify the uri and populate the cache with the actual data. Let’s see a quick example.

[10]:
# A Payload is just another regular entry object,
# so we can handle this in the same way.
datapack = DataPack()
sp = JpegPayload(datapack)
sp.uri = "http://some/path/"

print(datapack.get_single(JpegPayload).uri)
http://some/path/

We have set the uri for this particular payload, which is lightweight since we only added a string to it. While one can load the actual data into sp.cache by reading the uri now, let’s study the “lazy loading” option.
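To get a rough feel for why a uri-only payload is lightweight, the sketch below compares the serialized size of two plain dictionaries. The dictionaries are hypothetical stand-ins for a payload (they are not Forte objects), and the 1 MB of zero bytes is made-up data that only plays the role of a loaded image.

```python
import pickle

# Hypothetical stand-in for a payload that only carries its URI.
light = {"uri": "http://some/path/", "cache": None}

# The same stand-in after roughly 1 MB of (fake) image bytes was loaded.
heavy = {"uri": "http://some/path/", "cache": bytes(1_000_000)}

# Serialized size approximates what would travel between pipeline stages.
light_size = len(pickle.dumps(light))
heavy_size = len(pickle.dumps(heavy))

print(f"uri only:   {light_size} bytes")
print(f"with cache: {heavy_size} bytes")
```

The exact numbers depend on the serialization format, but the uri-only version stays tiny no matter how large the referenced image is.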

Forte supports lazy loading by associating a load function with the Payload class through a simple decorator, like below:

[11]:
from forte.data.ontology.top import load_func


@load_func(JpegPayload)
def load(payload: JpegPayload):
    def read_uri(input_uri): # The function to read the URI.
        # to be implemented
        pass
    return read_uri(payload.uri) # Returns the payload content.

What happens here is that we decorate the load function with the Forte built-in load_func decorator, which associates the JpegPayload type with the load function. Note that the inner read_uri function takes an input_uri as input; the load function passes the payload’s uri to it.

Now when you call the load method of a JpegPayload, it will try to populate the cache with the return value of the registered load function, which reads from the uri.
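To make the registration idea concrete, here is a minimal standalone sketch of how a decorator of this kind could attach a loader to a class. It does not use Forte and is not Forte’s implementation: ToyPayload, toy_load_func, and load_text are hypothetical names introduced only for illustration.

```python
from typing import Any, Callable, Optional, Type


class ToyPayload:
    """Hypothetical stand-in for a payload class: holds a URI and a cache."""

    def __init__(self, uri: Optional[str] = None):
        self.uri = uri
        self.cache: Any = None

    def load(self) -> None:
        raise NotImplementedError("No load function registered for this type.")


def toy_load_func(payload_type: Type[ToyPayload]) -> Callable:
    """Hypothetical decorator that attaches the decorated function to payload_type."""

    def decorator(fn: Callable[[ToyPayload], Any]) -> Callable[[ToyPayload], Any]:
        def load(self: ToyPayload) -> None:
            # Store whatever the registered function returns in the cache.
            self.cache = fn(self)

        # Replace the class-level load method with the wrapper above.
        payload_type.load = load
        return fn

    return decorator


@toy_load_func(ToyPayload)
def load_text(payload: ToyPayload) -> str:
    # A fake reader; a real one would fetch the data behind the URI.
    return f"content of {payload.uri}"


p = ToyPayload(uri="http://some/path/")
p.load()
print(p.cache)
```

The point of the sketch is only the flow: the decorator receives the payload type, wraps the user function, and installs it as the type’s load method, so calling load fills the cache.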

Let’s see a full implementation of this function.

[12]:
@load_func(JpegPayload)
def load(payload: JpegPayload):
    """
    A function that reads the payload's uri and returns the image content.

    This function is not stored in the data store but is registered
    in the PayloadFactory.

    Returns:
        The image data read from the URL, as a numpy array.
    """
    try:
        from PIL import Image
        import requests
        import numpy as np
    except ModuleNotFoundError as e:
        raise ModuleNotFoundError(
            "ImagePayload reading web file requires `PIL`, "
            "`requests` and `numpy` packages to be installed."
        ) from e

    def read_uri(input_uri):
        # customize this function to read data from uri
        uri_obj = requests.get(input_uri, stream=True)
        pil_image = Image.open(uri_obj.raw)
        return np.asarray(pil_image)

    return read_uri(payload.uri)

This load implementation uses the PIL library to read images, which supports JPEG.

Now we have registered the load function to the JpegPayload class. Let’s try it.

[13]:
datapack = DataPack("image")
payload = JpegPayload(datapack)
datapack.add_entry(payload)
payload.uri = "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"

We have set the data URL; now we can load the payload content at any time.

[14]:
payload.load()
print(payload.cache.shape)
(539, 810, 3)

Note that here we explicitly called the load function for illustration purposes. Forte actually allows you to directly access the cache, and it will attempt to load the data without the explicit load call.

[15]:
datapack_lazy = DataPack("image")
pl = JpegPayload(datapack_lazy)
datapack_lazy.add_entry(pl)
pl.uri = "https://raw.githubusercontent.com/asyml/forte/assets/ocr_tutorial/ocr.jpg"
print(pl.cache.shape)
(539, 810, 3)

In this way, we achieve the “lazy loading” idea with a registered function, and users do not have to worry about when to load the content.
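One common way to get this load-on-access behavior in plain Python is a property that fills the cache the first time it is read. The sketch below illustrates that pattern under stated assumptions: LazyPayload and fake_reader are hypothetical names, and this is not a description of how Forte implements its cache.

```python
from typing import Any, Callable, Optional


class LazyPayload:
    """Hypothetical payload whose cache is filled on first access."""

    def __init__(self, uri: str, loader: Callable[[str], Any]):
        self.uri = uri
        self._loader = loader
        self._cache: Optional[Any] = None
        self.load_calls = 0  # Counts how often the loader actually ran.

    def load(self) -> None:
        self.load_calls += 1
        self._cache = self._loader(self.uri)

    @property
    def cache(self) -> Any:
        # Load only if nothing has been cached yet.
        if self._cache is None:
            self.load()
        return self._cache


def fake_reader(uri: str) -> list:
    # Stand-in for a real reader that would fetch data from the URI.
    return [0, 1, 2]


lazy = LazyPayload("http://some/path/", fake_reader)
calls_before = lazy.load_calls  # Nothing has been loaded yet.
first = lazy.cache              # Triggers the loader.
second = lazy.cache             # Served from the cache, no second load.
print(calls_before, lazy.load_calls)
```

Constructing the object costs nothing beyond storing the uri; the reader runs only when the cache is first touched, and repeated accesses reuse the stored result.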

Finally, there are a few usage tips:

1. Once the data is loaded into the cache, it stays with the data pack (which means it will be transferred through the pipeline). Currently, Forte does not have a mechanism to automatically clean the cache. One can call the clear_cache function manually.
2. To use the lazy loading mechanism in Payload, it is preferable to register a function for a dedicated type. This will help you organize the loading methods of different types of data. Under the hood, Forte simply assigns the loading method to the corresponding Payload class. This means method overriding will work as expected: if a different load function is assigned to a child class, then the load function registered to the child class will be used.

[16]:
pl.clear_cache()
print(pl._cache)
None
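The overriding behavior described in the tips follows from ordinary Python method resolution. The standalone sketch below mimics it by assigning loader functions as class attributes; BasePayload, ChildPayload, load_parent, and load_child are hypothetical names, and the assignment only approximates what registration does.

```python
from typing import Any, Optional


class BasePayload:
    """Hypothetical parent payload type."""

    def __init__(self, uri: str):
        self.uri = uri
        self._cache: Optional[Any] = None

    def load(self) -> None:
        raise NotImplementedError

    def clear_cache(self) -> None:
        self._cache = None


class ChildPayload(BasePayload):
    """Hypothetical child payload type."""


def load_parent(self: BasePayload) -> None:
    self._cache = f"parent loader read {self.uri}"


def load_child(self: BasePayload) -> None:
    self._cache = f"child loader read {self.uri}"


# Assigning the functions as class attributes mimics registration.
BasePayload.load = load_parent
ChildPayload.load = load_child

parent = BasePayload("a.jpg")
child = ChildPayload("b.jpg")
parent.load()
child.load()
child_loaded = child._cache
print(parent._cache)
print(child_loaded)

# Clearing is manual: the cache stays populated until this call.
child.clear_cache()
print(child._cache)
```

Because attribute lookup checks the child class first, the child’s loader wins for ChildPayload instances while BasePayload instances keep using the parent’s loader.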