Text Classification Pipeline

Packages

[ ]:
import os
from termcolor import colored
from forte.data.readers import ClassificationDatasetReader
from fortex.huggingface import ZeroShotClassifier
from forte.pipeline import Pipeline
from fortex.nltk import NLTKSentenceSegmenter
from ft.onto.base_ontology import Sentence

Background

This notebook tutorial is derived from a classification example. Given a table-like CSV file in which some columns hold the input text and one column holds the label, we set up a text classification pipeline below. It is also a good example of wrapping external library classes and methods in a PipelineComponent.

Inference Workflow

Pipeline

  • Pipeline setup

  • The pipeline has one reader, ClassificationDatasetReader, and two processors, NLTKSentenceSegmenter and ZeroShotClassifier.

Reader

  • ClassificationDatasetReader

    • set_up(): checks whether the configuration is valid. For example, skip_k_starting_lines should not be negative, since skipping a negative number of lines doesn’t make sense. It also maps each distinct value in the label column to an integer index.

    • _collect(): reads rows from the CSV file and returns an iterator that yields the line id and the line data.

    • _cache_key_function(): uses the line id as the cache key.

    • _parse_pack(): parses the data from the iterator returned by _collect() and loads it into the data pack.
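
The reader steps above can be sketched with the standard library alone. This is a minimal illustration, not the ClassificationDatasetReader implementation: the function names collect_rows and parse_row, the three-column layout (label, title, content), and the sample rows are all assumptions made for the example.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO, Tuple


def collect_rows(
    csv_file: TextIO, skip_k_starting_lines: int = 1
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (line id, row) pairs, skipping the first k lines (e.g. a header)."""
    # Assumed validation: a negative number of skipped lines is rejected.
    if skip_k_starting_lines < 0:
        raise ValueError("skip_k_starting_lines must not be negative")
    for line_id, row in enumerate(csv.reader(csv_file)):
        if line_id < skip_k_starting_lines:
            continue
        yield line_id, row


def parse_row(row: List[str], class2index: Dict[str, int]) -> dict:
    """Turn one row into text plus an integer label (hypothetical column layout)."""
    label, *text_fields = row
    return {"text": "\n".join(text_fields), "label": class2index[label]}


# Same label setup as the notebook: index -> class name, inverted for lookup.
class_names = ["negative", "positive"]
index2class = dict(enumerate(class_names))
class2index = {name: index for index, name in index2class.items()}

# Made-up sample data standing in for the CSV file.
sample = io.StringIO(
    "label,title,content\n"
    "negative,Bad,It broke.\n"
    "positive,Good,Works well.\n"
)
for line_id, row in collect_rows(sample, skip_k_starting_lines=1):
    print(line_id, parse_row(row, class2index))
```

The line id yielded next to each row is what a cache key function could use to identify a row uniquely.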

Processor

In this example, we want to classify the data sentence by sentence, so we wrap nltk.PunktSentenceTokenizer in NLTKSentenceSegmenter to segment sentences.

  • _process(): splits the data pack text into sentence spans.
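
To make "sentence spans" concrete, here is a rough sketch that returns (begin, end) character offsets into the original text. It swaps in a simple regular expression for the Punkt tokenizer, so it only approximates what NLTKSentenceSegmenter produces; the function name sentence_spans is hypothetical.

```python
import re
from typing import List, Tuple

# Stand-in rule (not Punkt): a sentence ends at ., ! or ? followed by whitespace.
_SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def sentence_spans(text: str) -> List[Tuple[int, int]]:
    """Return (begin, end) character offsets of each sentence in text."""
    spans = []
    start = 0
    for match in _SENTENCE_BREAK.finditer(text):
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    # Whatever remains after the last break is the final sentence.
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "The design is nice. However, it broke!  Bad."
for begin, end in sentence_spans(text):
    print((begin, end), repr(text[begin:end]))
```

Storing offsets rather than copied strings lets each sentence annotation point back into the text of the data pack.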

Then we need a model to do the classification. We wrap transformers.pipeline in the Hugging Face ZeroShotClassifier.

  • _process(): runs the classifier over the data pack and writes the prediction results back to the data pack.
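
The classification step can be sketched as follows. A Hugging Face zero-shot classification pipeline returns a dict with "sequence", "labels", and "scores" keys for each input; the sketch turns that into a label-to-score mapping like the predictions printed later in this notebook. The helper names, the rounding to four digits, and the stub classifier with fixed scores are assumptions for illustration, so the example runs without downloading a model.

```python
from typing import Callable, Dict, List


def scores_to_dict(result: dict, ndigits: int = 4) -> Dict[str, float]:
    """Map each label to its rounded score."""
    return {
        label: round(score, ndigits)
        for label, score in zip(result["labels"], result["scores"])
    }


def classify_sentences(
    sentences: List[str],
    classifier: Callable[[str, List[str]], dict],
    candidate_labels: List[str],
) -> List[Dict[str, float]]:
    """Run the classifier on each sentence and collect one prediction dict each."""
    return [scores_to_dict(classifier(s, candidate_labels)) for s in sentences]


def stub_classifier(sequence: str, candidate_labels: List[str]) -> dict:
    """Hypothetical stand-in for a real model; returns fixed, made-up scores."""
    return {
        "sequence": sequence,
        "labels": list(candidate_labels),
        "scores": [0.93441, 0.04152],
    }


predictions = classify_sentences(
    ["Batteries died within a year."], stub_classifier, ["negative", "positive"]
)
print(predictions)
```

Passing the classifier in as a callable keeps the conversion logic independent of the model, so a real pipeline object could replace the stub.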

ZeroShotClassifier and NLTKSentenceSegmenter both inherit from PackProcessor because each processes one DataPack at a time. If we were to process one MultiPack at a time, we would inherit from MultiPackProcessor instead.

[2]:

csv_path = os.path.abspath(
    os.path.join("data_samples", "amazon_review_polarity_csv/sample.csv")
)  # notebook should be running from project root folder
pl = Pipeline()
# initialize labels
class_names = ["negative", "positive"]
index2class = dict(enumerate(class_names))
pl.set_reader(
    ClassificationDatasetReader(), config={"index2class": index2class}
)
pl.add(NLTKSentenceSegmenter())
pl.add(ZeroShotClassifier(), config={"candidate_labels": class_names})
pl.initialize()
for pack in pl.process_dataset(csv_path):
    for sent in pack.get(Sentence):
        sent_text = sent.text
        print(colored("Sentence:", "red"), sent_text, "\n")
        print(colored("Prediction:", "blue"), sent.classification)
WARNING:root:Re-declared a new class named [ConstituentNode], which is probably used in import.
[nltk_data] Downloading package punkt to /home/murphy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Sentence: One of the best game music soundtracks - for a game I didn't really play
Despite the fact that I have only played a small portion of the game, the music I heard (plus the connection to Chrono Trigger which was great as well) led me to purchase the soundtrack, and it remains one of my favorite albums.

Prediction: {'positive': 0.954, 'negative': 0.0054}
Sentence: There is an incredible mix of fun, epic, and emotional songs.

Prediction: {'positive': 0.0115, 'negative': 0.0001}
Sentence: Those sad and beautiful tracks I especially like, as there's not too many of those kinds of songs in my other video game soundtracks.

Prediction: {'negative': 0.0002, 'positive': 0.0001}
Sentence: I must admit that one of the songs (Life-A Distant Promise) has brought tears to my eyes on many occasions.My one complaint about this soundtrack is that they use guitar fretting effects in many of the songs, which I find distracting.

Prediction: {'positive': 0.0365, 'negative': 0.0028}
Sentence: But even if those weren't included I would still consider the collection worth it.

Prediction: {'positive': 0.4291, 'negative': 0.0134}
Sentence: Batteries died within a year ...
I bought this charger in Jul 2003 and it worked OK for a while.

Prediction: {'negative': 0.9344, 'positive': 0.0415}
Sentence: The design is nice and convenient.

Prediction: {'positive': 0.9992, 'negative': 0.0004}
Sentence: However, after about a year, the batteries would not hold a charge.

Prediction: {'negative': 0.8903, 'positive': 0.0202}
Sentence: Might as well just get alkaline disposables, or look elsewhere for a charger that comes with batteries that have better staying power.

Prediction: {'negative': 0.1055, 'positive': 0.0037}
Sentence: works fine, but Maha Energy is better
Check out Maha Energy's website.

Prediction: {'positive': 0.68, 'negative': 0.1884}
Sentence: Their Powerex MH-C204F charger works in 100 minutes for rapid charge, with option for slower charge (better for batteries).

Prediction: {'positive': 0.3361, 'negative': 0.001}
Sentence: And they have 2200 mAh batteries.

Prediction: {'positive': 0.1198, 'negative': 0.0082}
Sentence: Great for the non-audiophile
Reviewed quite a bit of the combo players and was hesitant due to unfavorable reviews and size of machines.

Prediction: {'positive': 0.743, 'negative': 0.0118}
Sentence: I am weaning off my VHS collection, but don't want to replace them with DVD's.

Prediction: {'negative': 0.0839, 'positive': 0.0537}
Sentence: This unit is well built, easy to setup and resolution and special effects (no progressive scan for HDTV owners) suitable for many people looking for a versatile product.Cons- No universal remote.

Prediction: {'positive': 0.9867, 'negative': 0.0007}
Sentence: DVD Player crapped out after one year
I also began having the incorrect disc problems that I've read about on here.

Prediction: {'negative': 0.7764, 'positive': 0.0021}
Sentence: The VCR still works, but hte DVD side is useless.

Prediction: {'negative': 0.8537, 'positive': 0.0012}
Sentence: I understand that DVD players sometimes just quit on you, but after not even one year?

Prediction: {'negative': 0.9735, 'positive': 0.0008}
Sentence: To me that's a sign on bad quality.

Prediction: {'negative': 0.9988, 'positive': 0.0002}
Sentence: I'm giving up JVC after this as well.

Prediction: {'negative': 0.7842, 'positive': 0.1313}
Sentence: I'm sticking to Sony or giving another brand a shot.

Prediction: {'positive': 0.0407, 'negative': 0.0198}
Sentence: Incorrect Disc
I love the style of this, but after a couple years, the DVD is giving me problems.

Prediction: {'negative': 0.6986, 'positive': 0.0186}
Sentence: It doesn't even work anymore and I use my broken PS2 Now.

Prediction: {'negative': 0.8031, 'positive': 0.0014}
Sentence: I wouldn't recommend this, I'm just going to upgrade to a recorder now.

Prediction: {'negative': 0.9072, 'positive': 0.001}
Sentence: I wish it would work but I guess i'm giving up on JVC.

Prediction: {'negative': 0.8825, 'positive': 0.0086}
Sentence: I really did like this one... before it stopped working.

Prediction: {'positive': 0.7896, 'negative': 0.2465}
Sentence: The dvd player gave me problems probably after a year of having it.

Prediction: {'negative': 0.927, 'positive': 0.0089}
Sentence: DVD menu select problems
I cannot scroll through a DVD menu that is set up vertically.

Prediction: {'negative': 0.3931, 'positive': 0.0014}
Sentence: The triangle keys will only select horizontally.

Prediction: {'negative': 0.0112, 'positive': 0.0048}
Sentence: So I cannot select anything on most DVD's besides play.

Prediction: {'negative': 0.082, 'positive': 0.0067}
Sentence: No special features, no language select, nothing, just play.

Prediction: {'positive': 0.0209, 'negative': 0.0071}
Sentence: Unique Weird Orientalia from the 1930's
Exotic tales of the Orient from the 1930's.

Prediction: {'negative': 0.0067, 'positive': 0.004}
Sentence: "Dr Shen Fu", a Weird Tales magazine reprint, is about the elixir of life that grants immortality at a price.

Prediction: {'positive': 0.0179, 'negative': 0.0019}
Sentence: If you're tired of modern authors who all sound alike, this is the antidote for you.

Prediction: {'negative': 0.0348, 'positive': 0.0202}
Sentence: Owen's palette is loaded with splashes of Chinese and Japanese colours.

Prediction: {'positive': 0.0509, 'negative': 0.0005}
Sentence: Marvelous.

Prediction: {'positive': 0.9897, 'negative': 0.0001}
Sentence: Not an "ultimate guide"
Firstly,I enjoyed the format and tone of the book (how the author addressed the reader).

Prediction: {'positive': 0.286, 'negative': 0.0108}
Sentence: However, I did not feel that she imparted any insider secrets that the book promised to reveal.

Prediction: {'negative': 0.3161, 'positive': 0.0225}
Sentence: If you are just starting to research law school, and do not know all the requirements of admission, then this book may be a tremendous help.

Prediction: {'positive': 0.2075, 'negative': 0.0036}
Sentence: If you have done your homework and are looking for an edge when it comes to admissions, I recommend some more topic-specific books.

Prediction: {'positive': 0.1159, 'negative': 0.0393}
Sentence: For example, books on how to write your personal statment, books geared specifically towards LSAT preparation (Powerscore books were the most helpful for me), and there are some websites with great advice geared towards aiding the individuals whom you are asking to write letters of recommendation.

Prediction: {'positive': 0.1327, 'negative': 0.001}
Sentence: Yet, for those new to the entire affair, this book can definitely clarify the requirements for you.

Prediction: {'positive': 0.0372, 'negative': 0.0077}