Audio Processing


Audio DataPack

DataPack includes a payload for audio data and a metadata for sample rate. You can set them by calling the set_audio method:

from import DataPack

pack: DataPack = DataPack()
pack.set_audio(audio, sample_rate)

The input parameter audio should be a numpy array of raw waveform and sample_rate should be an integer the specifies the sample rate. Now you can access these data using and DataPack.sample_rate.

Audio Reader

AudioReader supports reading in the audio data from files under a specific directory. You can set it as the reader of your forte pipeline whenever you need to process audio files:

from forte.pipeline import Pipeline
from import AudioReader

    config={"file_ext": ".wav"}

The example above builds a simple pipeline that can walk through the specified directory and load all the files with extension of .wav. AudioReader will create a DataPack for each file with the corresponding audio payload and the sample rate.

Automatic Speech Recognition

Automatic Speech Recognition (ASR) develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. Here we using a simple example to show how to build a forte pipeline to perform speech transcription tasks. This example is based on a pretrained wav2vec2 model and you can check out the details here.

Speaker Segmentation

Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. It is the process of partitioning an input audio stream into acoustically homogeneous segments according to the speaker identity. A typical speaker segmentation system finds potential speaker change points using the audio characteristics. In this example, the speaker segmentation is backed by a pretrained Hugging Face model where you can find details in this link.

Example script

For full example script, please refer to this audio example