Objective

Build Multiple Models and Emsemble the Results



In [1]:

    
from __future__ import print_function  # Python 2/3 compatibility
import numpy as np
import pandas as pd

from IPython.display import Image

Load Data



In [2]:

    
train_df = pd.read_csv("data/train.tsv", sep="\t")



In [3]:

    
train_df.sample(10)









    Out[3]:






  
    
      
      document_id
      sentiment
      review
    
  
  
    
      2060
      2060
      0
      Daphne Zuniga is the only light that shines in...
    
    
      7142
      7142
      1
      And that's why historic/biographic movies are ...
    
    
      5907
      5907
      0
      I tend to like character-driven films. I also ...
    
    
      24544
      24544
      0
      Is Miike like Chabrol, alternating art with dr...
    
    
      11448
      11448
      1
      All I can say is, first movie this season that...
    
    
      22770
      22770
      0
      An MTV-style film crew consisting of American ...
    
    
      13059
      13059
      0
      Someone here actually compared this movie in s...
    
    
      15380
      15380
      1
      This isn't the best romantic comedy ever made,...
    
    
      2022
      2022
      1
      I watched the un-aired episodes online and I w...
    
    
      4465
      4465
      0
      OK, the story - a simpleminded loony enters a ...

Training process

Split the Overall Training examples into Training and Validation
Build the Models on Training Data
Score on Validation data
Choose the best model and submit to Kaggle

Caution: If you do this enough times, you will be overfitting to the Validation data. To avoid that it might be advisable to split into three ways like Train-Validation-Test and generate the final score on Test Data.



In [4]:

    
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_df["review"], train_df["sentiment"], test_size=0.2)



In [5]:

    
print("Training Data: {}, Validation: {}".format(len(X_train), len(X_valid)))









    



Training Data: 20000, Validation: 5000

Vectorize Data (a.k.a. covert text to numbers)

Computers don't understand Texts, so we need to convert texts to numbers before we could do any math on it and see if we can build a system to classify a review as Positive or Negative.

Ways to vectorize data:

Bag of Words
TF-IDF
Word Embeddings (Word2Vec)

Scikit-Learn has nice APIs for preprocessing and feature extraction modules. In fact, these can be used even if you build your own models or use another libriary for model building process.



In [6]:

    
from sklearn.feature_extraction.text import CountVectorizer



In [7]:

    
# The API is very similar to model building process.
# Step 1: Instantiate the Vectorizer or more generally called Transformer

vect = CountVectorizer(max_features=5000, binary=True, stop_words="english")



In [8]:

    
# Fit your Training Data
vect.fit(X_train)

# Transform your training and validation data
X_train_vect = vect.transform(X_train)
X_valid_vect = vect.transform(X_valid)

Model - Logistic Regression

Logistic Regression
Naive Bayes
Neural Nets
Random Forest
Gradient Boosted Trees
...

Model 1 - Logisitc Regression



In [22]:

    
from sklearn.linear_model import LogisticRegression



In [17]:

    
model_1 = LogisticRegression()
model_1.fit(X_train_vect, y_train)









    Out[17]:





LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)



In [20]:

    
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_1.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_1.score(X_valid_vect, y_valid)))









    



Training Accuracy: 0.967
Validation Accuracy: 0.854

Model 2 - Naive Bayes



In [23]:

    
from sklearn.naive_bayes import MultinomialNB



In [24]:

    
model_2 = MultinomialNB()
model_2.fit(X_train_vect, y_train)









    Out[24]:





MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)



In [25]:

    
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_2.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_2.score(X_valid_vect, y_valid)))









    



Training Accuracy: 0.866
Validation Accuracy: 0.845

Model 3 - Random Forest



In [26]:

    
from sklearn.ensemble import RandomForestClassifier



In [30]:

    
model_3 = RandomForestClassifier(min_samples_leaf=3, n_estimators=25, n_jobs=-1)
model_3.fit(X_train_vect, y_train)









    Out[30]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)



In [31]:

    
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_3.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_3.score(X_valid_vect, y_valid)))









    



Training Accuracy: 0.949
Validation Accuracy: 0.820

Model 4 - Gradient Boosted Trees



In [33]:

    
from sklearn.ensemble import GradientBoostingClassifier



In [34]:

    
model_4 = RandomForestClassifier(min_samples_leaf=3, n_estimators=25, n_jobs=-1)
model_4.fit(X_train_vect, y_train)









    Out[34]:





RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)



In [35]:

    
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_4.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_4.score(X_valid_vect, y_valid)))









    



Training Accuracy: 0.946
Validation Accuracy: 0.822

Model 5 - Neural Networks (CPU Only)



In [36]:

    
from sklearn.neural_network import MLPClassifier



In [38]:

    
model_5 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=100)
model_5.fit(X_train_vect, y_train)









    Out[38]:





MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(32,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=100, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)



In [39]:

    
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_5.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_5.score(X_valid_vect, y_valid)))









    



Training Accuracy: 1.000
Validation Accuracy: 0.839

Neural Nets - Textbook Case of Overfitting. Maybe the model is too powerful :-D

Model Tuning

All these models have tons of Parameters that could be tweaked to reduce overfitting



In [40]:

    
## Pass

Finding it difficult to pick the Winning Model - Why not Average the Results

After all we collectively make the right decisions, on average.



In [42]:

    
from sklearn.ensemble import VotingClassifier



In [47]:

    
classifiers = [("Logistic Regression", model_1), 
               ("Naive Bayes", model_2), 
               ("Random Forest", model_3), 
               ("Gradient Boosted", model_4), 
               ("Neural Nets", model_5)]



In [48]:

    
classifiers









    Out[48]:





[('Logistic Regression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False)),
 ('Naive Bayes', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)),
 ('Random Forest',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
              max_depth=None, max_features='auto', max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=3,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
              verbose=0, warm_start=False)),
 ('Gradient Boosted',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
              max_depth=None, max_features='auto', max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=3,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
              verbose=0, warm_start=False)),
 ('Neural Nets',
  MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
         beta_2=0.999, early_stopping=False, epsilon=1e-08,
         hidden_layer_sizes=(32,), learning_rate='constant',
         learning_rate_init=0.001, max_iter=100, momentum=0.9,
         nesterovs_momentum=True, power_t=0.5, random_state=None,
         shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
         verbose=False, warm_start=False))]



In [49]:

    
final_model = VotingClassifier(classifiers, n_jobs=-1)



In [51]:

    
# Unfortuately, have to run Fit Again on the ensembled model before using it
# Wish there was an option to not have to fit again
final_model.fit(X_train_vect, y_train)









    Out[51]:





VotingClassifier(estimators=[('Logistic Regression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)...=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False))],
         n_jobs=-1, voting='hard', weights=None)



In [52]:

    
# Drum Rolls - Accuracy on the final Model
print("Training Accuracy: {:.3f}".format(final_model.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(final_model.score(X_valid_vect, y_valid)))









    



Training Accuracy: 0.965
Validation Accuracy: 0.866

Let's Update Kaggle Submission

Steps:

Load Test Dataset
Vectorize the Features (Review)
Predict the sentiment
Create the CSV file and update the submission



In [53]:

    
# Read in the Test Dataset
# Note that it's missing the Sentiment Column.  That's what we need to Predict
#
test_df = pd.read_csv("data/test.tsv", sep="\t")
test_df.head()









    Out[53]:






  
    
      
      document_id
      review
    
  
  
    
      0
      0
      This is one of those movies that has everythin...
    
    
      1
      1
      I don't know what some people were thinking wh...
    
    
      2
      2
      Here is a rundown of a typical Rachael Ray Sho...
    
    
      3
      3
      "Speck" was apparently intended to be a biopic...
    
    
      4
      4
      Let's get it clear from the start: I am an ass...



In [54]:

    
# Vectorize the Review Text

X_test = test_df.review
X_test_vect = vect.transform(X_test)



In [55]:

    
y_test_pred = final_model.predict(X_test_vect)



In [58]:

    
df = pd.DataFrame({
    "document_id": test_df.document_id,
    "sentiment": y_test_pred
})



In [59]:

    
df.to_csv("data/ensemble_submission1.csv", index=False)

Other Ideas

Try Different Vectorizers
Hyper Parameter Tuning of Models

	document_id	sentiment	review
2060	2060	0	Daphne Zuniga is the only light that shines in...
7142	7142	1	And that's why historic/biographic movies are ...
5907	5907	0	I tend to like character-driven films. I also ...
24544	24544	0	Is Miike like Chabrol, alternating art with dr...
11448	11448	1	All I can say is, first movie this season that...
22770	22770	0	An MTV-style film crew consisting of American ...
13059	13059	0	Someone here actually compared this movie in s...
15380	15380	1	This isn't the best romantic comedy ever made,...
2022	2022	1	I watched the un-aired episodes online and I w...
4465	4465	0	OK, the story - a simpleminded loony enters a ...

	document_id	review
0	0	This is one of those movies that has everythin...
1	1	I don't know what some people were thinking wh...
2	2	Here is a rundown of a typical Rachael Ray Sho...
3	3	"Speck" was apparently intended to be a biopic...
4	4	Let's get it clear from the start: I am an ass...