Objective

Build a Sentiment Classifier using Logistic Regression:

Load Data
Vectorize using Scikit-Learn
Build a Logisitc Regression Model
Evaluate the Model
Update our Kaggle Submission



In [1]:

    
from __future__ import print_function  # Python 2/3 compatibility
import numpy as np
import pandas as pd

from IPython.display import Image

Load Data



In [2]:

    
train_df = pd.read_csv("data/train.tsv", sep="\t")



In [3]:

    
train_df.sample(10)









    Out[3]:






  
    
      
      document_id
      sentiment
      review
    
  
  
    
      14903
      14903
      1
      i would never have thought that it would be po...
    
    
      17196
      17196
      0
      I give this movie a ONE, for it is truly an aw...
    
    
      23684
      23684
      0
      Visually disjointed and full of itself, the di...
    
    
      19603
      19603
      0
      I've seen about 820 movies released between 19...
    
    
      9412
      9412
      1
      The movie begins a with mentally-challenged gi...
    
    
      6905
      6905
      0
      This movie should be nominated for a new genre...
    
    
      8637
      8637
      0
      i guess its possible that I've seen worse movi...
    
    
      5177
      5177
      1
      Of course you could never go into a theatre an...
    
    
      15110
      15110
      0
      I absolutely adore the 'Toxic Avenger' series,...
    
    
      4798
      4798
      0
      Where the hell are all these uncharted islands...

Training process

Split the Overall Training examples into Training and Validation
Build the Models on Training Data
Score on Validation data
Choose the best model and submit to Kaggle

Caution: If you do this enough times, you will be overfitting to the Validation data. To avoid that it might be advisable to split into three ways like Train-Validation-Test and generate the final score on Test Data.



In [4]:

    
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_df["review"], train_df["sentiment"], test_size=0.2)



In [5]:

    
print("Training Data: {}, Validation: {}".format(len(X_train), len(X_valid)))









    



Training Data: 20000, Validation: 5000

Vectorize Data (a.k.a. covert text to numbers)

Computers don't understand Texts, so we need to convert texts to numbers before we could do any math on it and see if we can build a system to classify a review as Positive or Negative.

Ways to vectorize data:

Bag of Words
TF-IDF
Word Embeddings (Word2Vec)

Scikit-Learn has nice APIs for preprocessing and feature extraction modules. In fact, these can be used even if you build your own models or use another libriary for model building process.



In [6]:

    
from sklearn.feature_extraction.text import CountVectorizer



In [8]:

    
# The API is very similar to model building process.
# Step 1: Instantiate the Vectorizer or more generally called Transformer

vect = CountVectorizer(max_features=5000, binary=True, stop_words="english")



In [10]:

    
# Fit your Training Data
vect.fit(X_train)

# Transform your training and validation data
X_train_vect = vect.transform(X_train)
X_valid_vect = vect.transform(X_valid)



In [11]:

    
X_train.head()









    Out[11]:





1181     The Order starts in Rome where the head of a s...
6139     While "Santa Claus Conquers the Martians" is u...
7145     This film is horribly acted, written, directed...
9375     "In the world of old-school kung fu movies, wh...
24185    A charming boy and his mother move to a middle...
Name: review, dtype: object



In [12]:

    
# Creates a Sparse Matrix
X_train_vect









    Out[12]:





<20000x5000 sparse matrix of type '<class 'numpy.int64'>'
	with 1426615 stored elements in Compressed Sparse Row format>



In [13]:

    
# Understand the Vectorizer
vect









    Out[13]:





CountVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)



In [14]:

    
# Does similar things to what we did manually in our bag of words model

# vect.vocabulary_



In [15]:

    
# Does similar things to what we did manually in our bag of words model
from itertools import islice

list(islice(vect.vocabulary_.items(), 10))









    Out[15]:





[('kiss', 2514),
 ('blatant', 503),
 ('elizabeth', 1483),
 ('screenwriter', 3881),
 ('santa', 3832),
 ('ruined', 3801),
 ('nazi', 2987),
 ('length', 2599),
 ('cable', 640),
 ('sexuality', 3956)]



In [16]:

    
pd.DataFrame(X_train_vect.todense(), columns=vect.vocabulary_.keys()).head()









    Out[16]:






  
    
      
      kiss
      blatant
      elizabeth
      screenwriter
      santa
      ruined
      nazi
      length
      cable
      sexuality
      ...
      adults
      pick
      underground
      distance
      seemingly
      drunken
      soundtrack
      technical
      appeared
      confess
    
  
  
    
      0
      0
      1
      0
      1
      0
      0
      0
      0
      0
      0
      ...
      0
      1
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      1
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      2
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      3
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
    
      4
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
      ...
      0
      0
      0
      0
      0
      0
      0
      0
      0
      0
    
  

5 rows × 5000 columns

Model - Logistic Regression



In [17]:

    
from sklearn.linear_model import LogisticRegression



In [18]:

    
model = LogisticRegression()



In [19]:

    
model.fit(X_train_vect, y_train)









    Out[19]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)



In [20]:

    
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model.score(X_train_vect, y_train)))









    



Training Accuracy: 0.964



In [21]:

    
## Validation Accuracy
print("Validation Accuracy: {:.3f}".format(model.score(X_valid_vect, y_valid)))









    



Validation Accuracy: 0.854

Model Tuning

Model seems to be Overfitting. Try, Regularization to bring Training Accuracy closer to Validation Accuracy

What options are available in Logisitc Regression



In [22]:

    
model = LogisticRegression(C=0.1)
model.fit(X_train_vect, y_train)









    Out[22]:





LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)



In [23]:

    
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model.score(X_valid_vect, y_valid)))









    



Training Accuracy: 0.934
Validation Accuracy: 0.875

Feeling Good? - Let's Update Kaggle Submission

Steps:

Load Test Dataset
Vectorize the Features (Review)
Predict the sentiment
Create the CSV file and update the submission



In [24]:

    
# Read in the Test Dataset
# Note that it's missing the Sentiment Column.  That's what we need to Predict
#
test_df = pd.read_csv("data/test.tsv", sep="\t")
test_df.head()









    Out[24]:






  
    
      
      document_id
      review
    
  
  
    
      0
      0
      This is one of those movies that has everythin...
    
    
      1
      1
      I don't know what some people were thinking wh...
    
    
      2
      2
      Here is a rundown of a typical Rachael Ray Sho...
    
    
      3
      3
      "Speck" was apparently intended to be a biopic...
    
    
      4
      4
      Let's get it clear from the start: I am an ass...



In [25]:

    
# Vectorize the Review Text

X_test = test_df.review
X_test_vect = vect.transform(X_test)



In [26]:

    
y_test_pred = model.predict(X_test_vect)



In [27]:

    
df = pd.DataFrame({
    "document_id": test_df.document_id,
    "sentiment": y_test_pred
})



In [28]:

    
df.to_csv("data/logistic_reg_submission1.csv", index=False)



In [29]:

    
!head data/logistic_reg_submission1.csv









    



document_id,sentiment
0,1
1,1
2,0
3,0
4,0
5,0
6,0
7,0
8,0

The End

Now your turn, Open the 04-compete notebook and try different Classifiers and see if you can improve the predictions

	document_id	sentiment	review
14903	14903	1	i would never have thought that it would be po...
17196	17196	0	I give this movie a ONE, for it is truly an aw...
23684	23684	0	Visually disjointed and full of itself, the di...
19603	19603	0	I've seen about 820 movies released between 19...
9412	9412	1	The movie begins a with mentally-challenged gi...
6905	6905	0	This movie should be nominated for a new genre...
8637	8637	0	i guess its possible that I've seen worse movi...
5177	5177	1	Of course you could never go into a theatre an...
15110	15110	0	I absolutely adore the 'Toxic Avenger' series,...
4798	4798	0	Where the hell are all these uncharted islands...

	blatant	screenwriter	...	pick
0	1	1	...	1
1	0	0	...	0
2	0	0	...	0
3	0	0	...	0
4	0	0	...	0

	document_id	review
0	0	This is one of those movies that has everythin...
1	1	I don't know what some people were thinking wh...
2	2	Here is a rundown of a typical Rachael Ray Sho...
3	3	"Speck" was apparently intended to be a biopic...
4	4	Let's get it clear from the start: I am an ass...

	blatant	screenwriter	...	pick
0	1	1	...	1
1	0	0	...	0
2	0	0	...	0
3	0	0	...	0
4	0	0	...	0

	blatant	screenwriter	...	pick
0	1	1	...	1
1	0	0	...	0
2	0	0	...	0
3	0	0	...	0
4	0	0	...	0