Natural Language Processing Topics: Simple Text Classifier

Here we will walk through the code for a simple text classifier using scikit-learn in Python.

We will use the IMDB movie review dataset, which can be found on Kaggle. I have already downloaded the dataset to my local machine as a .csv file.

Read the dataset

Let us first import some libraries, and read the dataset into a dataframe.

import pandas as pd
import numpy as np

df = pd.read_csv("IMDBDataset.csv")
print(df.shape)
>>> (50000, 2)

Let us get some basic info about the dataset.

print(df.columns)
>>> Index(['review', 'sentiment'], dtype='object')

The output shows that there are two columns: the review column holds the text, whereas the sentiment column holds the classification, either 'positive' or 'negative'.

Clean the Dataset

Here we will do a very simple cleaning in two steps:

Delete any rows that have empty review text.
Delete any rows whose review text is only whitespace characters.

# check shape of df before and after dropping NA rows
print(df.shape)
df.dropna(inplace=True)
print(df.shape)

# check shape of df after dropping rows whose review is only whitespace
df = df[~df['review'].str.isspace()].copy()
print(df.shape)
>>> (50000, 2)
>>> (50000, 2)
>>> (50000, 2)

The shape is unchanged after both steps, so it looks like the data needed no cleaning.
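Since the IMDB data turned out to be clean, the two steps never remove anything here. To see them actually do something, here is a minimal sketch on a tiny made-up dataframe (the `toy` name and its rows are hypothetical, not part of the IMDB data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing review and one whitespace-only review.
toy = pd.DataFrame({
    "review": ["Great movie", np.nan, "   ", "Terrible plot"],
    "sentiment": ["positive", "negative", "negative", "negative"],
})

# Step 1: drop rows with missing values.
toy = toy.dropna()

# Step 2: drop rows whose review is only whitespace.
toy = toy[~toy["review"].str.isspace()].copy()

print(toy.shape)
```

Only the two rows with real text survive. Note that `str.isspace()` returns False for an empty string, so completely empty reviews would need a separate check.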

Classifier Creation: TF-IDF Vectorizer and LinearSVC

Here we will create a very simple pipeline with a TF-IDF vectorizer and LinearSVC. The TF-IDF vectorizer converts our text documents into numerical features, while LinearSVC is the actual classifier.
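To get a feel for what the vectorizer produces, here is a small sketch on a made-up three-document corpus (the `docs` list is hypothetical). Each document becomes one row and each distinct word becomes one column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus, just to inspect the vectorizer output.
docs = ["good movie", "bad movie", "good good plot"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each learned word to its column index.
print(sorted(vectorizer.vocabulary_))
print(matrix.shape)
```

The corpus has four distinct words, so the result is a 3 x 4 sparse matrix of TF-IDF weights that a classifier such as LinearSVC can consume.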

Data Selection: Split

Let us split the data into train and test sets. Of the two columns, the review column is our input and the sentiment column is the output.

from sklearn.model_selection import train_test_split

X = df['review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=56, test_size=0.3
)
print(X_train.shape, X_test.shape)
>>> (35000,) (15000,)

If we want to use more than one column as input, we can select them with a list of column names while creating X above. Keep in mind that TfidfVectorizer expects a single sequence of strings, so each text column would then need its own vectorizer.
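As a rough sketch of the multi-column case, one option is scikit-learn's ColumnTransformer, which applies a separate vectorizer to each text column. This is not part of the original walkthrough: the `title` column, the toy rows, and the step names below are all made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical data with two text columns; IMDB only has 'review'.
toy = pd.DataFrame({
    "title": ["Loved it", "Awful", "Great fun", "Boring mess"],
    "review": ["A wonderful film", "A terrible film",
               "Really enjoyable", "Dull and slow"],
    "sentiment": ["positive", "negative", "positive", "negative"],
})

X_multi = toy[["title", "review"]]  # list of columns -> dataframe input
y_multi = toy["sentiment"]

# A plain string selector hands each vectorizer a 1-D column of text.
features = ColumnTransformer([
    ("title_tfidf", TfidfVectorizer(), "title"),
    ("review_tfidf", TfidfVectorizer(), "review"),
])

multi_classifier = Pipeline([("features", features), ("svm", LinearSVC())])
multi_classifier.fit(X_multi, y_multi)
preds = multi_classifier.predict(X_multi)
print(preds)
```

The TF-IDF features from both columns are concatenated side by side before they reach the classifier.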

Create a Pipeline: Classifier

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

my_classifier = Pipeline(
    [('my_tfidf', TfidfVectorizer()),
     ('my_svm', LinearSVC())]
)

Now let us fit the model with our data, i.e. train the model.

my_classifier.fit(X_train,y_train)

That's it! It's just a single line of code.

Now let us predict using the model.

predictions = my_classifier.predict(X_test)

Model Evaluation: Metrics of Model Output

Printing Metrics

Let us print the classification report. The classification report generally has all the metrics needed to evaluate a model's performance.

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))
>>>
              precision    recall  f1-score   support

    negative       0.90      0.89      0.90      7436
    positive       0.89      0.90      0.90      7564

    accuracy                           0.90     15000
   macro avg       0.90      0.90      0.90     15000
weighted avg       0.90      0.90      0.90     15000

This shows the model is about 90% accurate on the 15,000 test reviews.
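To make the accuracy figure less abstract, here is a small sketch with made-up labels (the `y_true` and `y_pred` lists are hypothetical, not the IMDB results) showing how accuracy relates to the counts of correct and incorrect predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 3 of the 5 predictions match the truth.
y_true = ["negative", "negative", "positive", "positive", "positive"]
y_pred = ["negative", "positive", "positive", "positive", "negative"]

# Accuracy is the fraction of predictions that match the true labels.
acc = accuracy_score(y_true, y_pred)

# Rows are true labels, columns are predicted labels, in the given order.
cm = confusion_matrix(y_true, y_pred, labels=["negative", "positive"])

print(acc)
print(cm)
```

The diagonal of the confusion matrix holds the correct predictions, so accuracy is simply the diagonal sum divided by the total count.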

Looking at Predictions Side by Side

We can combine X_test and y_test with the predictions into a single dataframe as follows, to go through the true labels and the predictions together.

df_temp = pd.concat(
    [X_test.reset_index(drop=True),
     y_test.reset_index(drop=True),
     pd.Series(predictions)],
    axis=1,
)
print(df_temp.head(4))
>>>
                                              review sentiment         0
0  THE AFFAIR is a very bad TV movie from the 197...  negative  negative
1  Based on Ray Russell's dark bestseller, this J...  negative  negative
2  I saw this movie with hopes of a good laugh bu...  positive  negative
3  (Spoilers more than likely... nothing really i...  negative  negative

Simple Testing

Let us provide some simple text input to the model to see if it can predict our sentiment.

my_classifier.predict([
    "I did not like this movie",
    "It was alright movie.",
    "It was above average movie.",
    "I plan to take my son to watch this movie next week.",
])
>>> array(['negative', 'negative', 'negative', 'positive'], dtype=object)
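Some of these predictions look borderline, so it can help to see how far each input sits from the decision boundary. The sketch below uses `decision_function`, which LinearSVC provides and a Pipeline exposes for its final step. To stay self-contained it trains a tiny stand-in model on made-up sentences (all names and texts here are hypothetical); with the real model you would call `my_classifier.decision_function(...)` in the same way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical training sentences standing in for the IMDB reviews.
train_texts = ["I loved this movie", "A wonderful film",
               "I hated this movie", "A terrible film"]
train_labels = ["positive", "positive", "negative", "negative"]

toy_classifier = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
toy_classifier.fit(train_texts, train_labels)

samples = ["I loved it", "terrible"]
labels = toy_classifier.predict(samples)
# Signed distance from the separating hyperplane, one value per sample.
scores = toy_classifier.decision_function(samples)

for text, label, score in zip(samples, labels, scores):
    print(f"{text!r}: {label} (score {score:+.2f})")
```

A score above zero maps to the second class in sorted order ('positive' here), and scores close to zero mark inputs the model is less sure about.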