Objective

Build a Sentiment Classifier using Logistic Regression:

  • Load Data
  • Vectorize using Scikit-Learn
  • Build a Logisitc Regression Model
  • Evaluate the Model
  • Update our Kaggle Submission

In [1]:
from __future__ import print_function  # Python 2/3 compatibility
import numpy as np
import pandas as pd

from IPython.display import Image

Load Data


In [2]:
train_df = pd.read_csv("data/train.tsv", sep="\t")

In [3]:
train_df.sample(10)


Out[3]:
document_id sentiment review
14903 14903 1 i would never have thought that it would be po...
17196 17196 0 I give this movie a ONE, for it is truly an aw...
23684 23684 0 Visually disjointed and full of itself, the di...
19603 19603 0 I've seen about 820 movies released between 19...
9412 9412 1 The movie begins a with mentally-challenged gi...
6905 6905 0 This movie should be nominated for a new genre...
8637 8637 0 i guess its possible that I've seen worse movi...
5177 5177 1 Of course you could never go into a theatre an...
15110 15110 0 I absolutely adore the 'Toxic Avenger' series,...
4798 4798 0 Where the hell are all these uncharted islands...

Training process

  • Split the Overall Training examples into Training and Validation
  • Build the Models on Training Data
  • Score on Validation data
  • Choose the best model and submit to Kaggle

Caution: If you do this enough times, you will be overfitting to the Validation data. To avoid that it might be advisable to split into three ways like Train-Validation-Test and generate the final score on Test Data.


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_df["review"], train_df["sentiment"], test_size=0.2)

In [5]:
print("Training Data: {}, Validation: {}".format(len(X_train), len(X_valid)))


Training Data: 20000, Validation: 5000

Vectorize Data (a.k.a. covert text to numbers)

Computers don't understand Texts, so we need to convert texts to numbers before we could do any math on it and see if we can build a system to classify a review as Positive or Negative.

Ways to vectorize data:

  • Bag of Words
  • TF-IDF
  • Word Embeddings (Word2Vec)

Scikit-Learn has nice APIs for preprocessing and feature extraction modules. In fact, these can be used even if you build your own models or use another libriary for model building process.


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [8]:
# The API is very similar to model building process.
# Step 1: Instantiate the Vectorizer or more generally called Transformer

vect = CountVectorizer(max_features=5000, binary=True, stop_words="english")

In [10]:
# Fit your Training Data
vect.fit(X_train)

# Transform your training and validation data
X_train_vect = vect.transform(X_train)
X_valid_vect = vect.transform(X_valid)

In [11]:
X_train.head()


Out[11]:
1181     The Order starts in Rome where the head of a s...
6139     While "Santa Claus Conquers the Martians" is u...
7145     This film is horribly acted, written, directed...
9375     "In the world of old-school kung fu movies, wh...
24185    A charming boy and his mother move to a middle...
Name: review, dtype: object

In [12]:
# Creates a Sparse Matrix
X_train_vect


Out[12]:
<20000x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 1426615 stored elements in Compressed Sparse Row format>

In [13]:
# Understand the Vectorizer
vect


Out[13]:
CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [14]:
# Does similar things to what we did manually in our bag of words model

# vect.vocabulary_

In [15]:
# Does similar things to what we did manually in our bag of words model
from itertools import islice

list(islice(vect.vocabulary_.items(), 10))


Out[15]:
[('kiss', 2514),
 ('blatant', 503),
 ('elizabeth', 1483),
 ('screenwriter', 3881),
 ('santa', 3832),
 ('ruined', 3801),
 ('nazi', 2987),
 ('length', 2599),
 ('cable', 640),
 ('sexuality', 3956)]

In [16]:
pd.DataFrame(X_train_vect.todense(), columns=vect.vocabulary_.keys()).head()


Out[16]:
kiss blatant elizabeth screenwriter santa ruined nazi length cable sexuality ... adults pick underground distance seemingly drunken soundtrack technical appeared confess
0 0 1 0 1 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 5000 columns

Model - Logistic Regression


In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
model = LogisticRegression()

In [19]:
model.fit(X_train_vect, y_train)


Out[19]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [20]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model.score(X_train_vect, y_train)))


Training Accuracy: 0.964

In [21]:
## Validation Accuracy
print("Validation Accuracy: {:.3f}".format(model.score(X_valid_vect, y_valid)))


Validation Accuracy: 0.854

Model Tuning

Model seems to be Overfitting. Try, Regularization to bring Training Accuracy closer to Validation Accuracy

  • What options are available in Logisitc Regression

In [22]:
model = LogisticRegression(C=0.1)
model.fit(X_train_vect, y_train)


Out[22]:
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [23]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model.score(X_valid_vect, y_valid)))


Training Accuracy: 0.934
Validation Accuracy: 0.875

Feeling Good? - Let's Update Kaggle Submission

Steps:

  • Load Test Dataset
  • Vectorize the Features (Review)
  • Predict the sentiment
  • Create the CSV file and update the submission

In [24]:
# Read in the Test Dataset
# Note that it's missing the Sentiment Column.  That's what we need to Predict
#
test_df = pd.read_csv("data/test.tsv", sep="\t")
test_df.head()


Out[24]:
document_id review
0 0 This is one of those movies that has everythin...
1 1 I don't know what some people were thinking wh...
2 2 Here is a rundown of a typical Rachael Ray Sho...
3 3 "Speck" was apparently intended to be a biopic...
4 4 Let's get it clear from the start: I am an ass...

In [25]:
# Vectorize the Review Text

X_test = test_df.review
X_test_vect = vect.transform(X_test)

In [26]:
y_test_pred = model.predict(X_test_vect)

In [27]:
df = pd.DataFrame({
    "document_id": test_df.document_id,
    "sentiment": y_test_pred
})

In [28]:
df.to_csv("data/logistic_reg_submission1.csv", index=False)

In [29]:
!head data/logistic_reg_submission1.csv


document_id,sentiment
0,1
1,1
2,0
3,0
4,0
5,0
6,0
7,0
8,0

The End

  • Now your turn, Open the 04-compete notebook and try different Classifiers and see if you can improve the predictions