Objective

Build Multiple Models and Emsemble the Results


In [1]:
from __future__ import print_function  # Python 2/3 compatibility
import numpy as np
import pandas as pd

from IPython.display import Image

Load Data


In [2]:
train_df = pd.read_csv("data/train.tsv", sep="\t")

In [3]:
train_df.sample(10)


Out[3]:
document_id sentiment review
2060 2060 0 Daphne Zuniga is the only light that shines in...
7142 7142 1 And that's why historic/biographic movies are ...
5907 5907 0 I tend to like character-driven films. I also ...
24544 24544 0 Is Miike like Chabrol, alternating art with dr...
11448 11448 1 All I can say is, first movie this season that...
22770 22770 0 An MTV-style film crew consisting of American ...
13059 13059 0 Someone here actually compared this movie in s...
15380 15380 1 This isn't the best romantic comedy ever made,...
2022 2022 1 I watched the un-aired episodes online and I w...
4465 4465 0 OK, the story - a simpleminded loony enters a ...

Training process

  • Split the Overall Training examples into Training and Validation
  • Build the Models on Training Data
  • Score on Validation data
  • Choose the best model and submit to Kaggle

Caution: If you do this enough times, you will be overfitting to the Validation data. To avoid that it might be advisable to split into three ways like Train-Validation-Test and generate the final score on Test Data.


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(train_df["review"], train_df["sentiment"], test_size=0.2)

In [5]:
print("Training Data: {}, Validation: {}".format(len(X_train), len(X_valid)))


Training Data: 20000, Validation: 5000

Vectorize Data (a.k.a. covert text to numbers)

Computers don't understand Texts, so we need to convert texts to numbers before we could do any math on it and see if we can build a system to classify a review as Positive or Negative.

Ways to vectorize data:

  • Bag of Words
  • TF-IDF
  • Word Embeddings (Word2Vec)

Scikit-Learn has nice APIs for preprocessing and feature extraction modules. In fact, these can be used even if you build your own models or use another libriary for model building process.


In [6]:
from sklearn.feature_extraction.text import CountVectorizer

In [7]:
# The API is very similar to model building process.
# Step 1: Instantiate the Vectorizer or more generally called Transformer

vect = CountVectorizer(max_features=5000, binary=True, stop_words="english")

In [8]:
# Fit your Training Data
vect.fit(X_train)

# Transform your training and validation data
X_train_vect = vect.transform(X_train)
X_valid_vect = vect.transform(X_valid)

Model - Logistic Regression

  • Logistic Regression
  • Naive Bayes
  • Neural Nets
  • Random Forest
  • Gradient Boosted Trees
  • ...

Model 1 - Logisitc Regression


In [22]:
from sklearn.linear_model import LogisticRegression

In [17]:
model_1 = LogisticRegression()
model_1.fit(X_train_vect, y_train)


Out[17]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [20]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_1.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_1.score(X_valid_vect, y_valid)))


Training Accuracy: 0.967
Validation Accuracy: 0.854

Model 2 - Naive Bayes


In [23]:
from sklearn.naive_bayes import MultinomialNB

In [24]:
model_2 = MultinomialNB()
model_2.fit(X_train_vect, y_train)


Out[24]:
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [25]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_2.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_2.score(X_valid_vect, y_valid)))


Training Accuracy: 0.866
Validation Accuracy: 0.845

Model 3 - Random Forest


In [26]:
from sklearn.ensemble import RandomForestClassifier

In [30]:
model_3 = RandomForestClassifier(min_samples_leaf=3, n_estimators=25, n_jobs=-1)
model_3.fit(X_train_vect, y_train)


Out[30]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [31]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_3.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_3.score(X_valid_vect, y_valid)))


Training Accuracy: 0.949
Validation Accuracy: 0.820

Model 4 - Gradient Boosted Trees


In [33]:
from sklearn.ensemble import GradientBoostingClassifier

In [34]:
model_4 = RandomForestClassifier(min_samples_leaf=3, n_estimators=25, n_jobs=-1)
model_4.fit(X_train_vect, y_train)


Out[34]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=3,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [35]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_4.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_4.score(X_valid_vect, y_valid)))


Training Accuracy: 0.946
Validation Accuracy: 0.822

Model 5 - Neural Networks (CPU Only)


In [36]:
from sklearn.neural_network import MLPClassifier

In [38]:
model_5 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=100)
model_5.fit(X_train_vect, y_train)


Out[38]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(32,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=100, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

In [39]:
# Training Accuracy
print("Training Accuracy: {:.3f}".format(model_5.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(model_5.score(X_valid_vect, y_valid)))


Training Accuracy: 1.000
Validation Accuracy: 0.839

Neural Nets - Textbook Case of Overfitting. Maybe the model is too powerful :-D

Model Tuning

  • All these models have tons of Parameters that could be tweaked to reduce overfitting

In [40]:
## Pass

Finding it difficult to pick the Winning Model - Why not Average the Results

  • After all we collectively make the right decisions, on average.

In [42]:
from sklearn.ensemble import VotingClassifier

In [47]:
classifiers = [("Logistic Regression", model_1), 
               ("Naive Bayes", model_2), 
               ("Random Forest", model_3), 
               ("Gradient Boosted", model_4), 
               ("Neural Nets", model_5)]

In [48]:
classifiers


Out[48]:
[('Logistic Regression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
            intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
            penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
            verbose=0, warm_start=False)),
 ('Naive Bayes', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)),
 ('Random Forest',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
              max_depth=None, max_features='auto', max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=3,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
              verbose=0, warm_start=False)),
 ('Gradient Boosted',
  RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
              max_depth=None, max_features='auto', max_leaf_nodes=None,
              min_impurity_split=1e-07, min_samples_leaf=3,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
              n_estimators=25, n_jobs=-1, oob_score=False, random_state=None,
              verbose=0, warm_start=False)),
 ('Neural Nets',
  MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
         beta_2=0.999, early_stopping=False, epsilon=1e-08,
         hidden_layer_sizes=(32,), learning_rate='constant',
         learning_rate_init=0.001, max_iter=100, momentum=0.9,
         nesterovs_momentum=True, power_t=0.5, random_state=None,
         shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
         verbose=False, warm_start=False))]

In [49]:
final_model = VotingClassifier(classifiers, n_jobs=-1)

In [51]:
# Unfortuately, have to run Fit Again on the ensembled model before using it
# Wish there was an option to not have to fit again
final_model.fit(X_train_vect, y_train)


Out[51]:
VotingClassifier(estimators=[('Logistic Regression', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)...=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False))],
         n_jobs=-1, voting='hard', weights=None)

In [52]:
# Drum Rolls - Accuracy on the final Model
print("Training Accuracy: {:.3f}".format(final_model.score(X_train_vect, y_train)))
print("Validation Accuracy: {:.3f}".format(final_model.score(X_valid_vect, y_valid)))


Training Accuracy: 0.965
Validation Accuracy: 0.866

Let's Update Kaggle Submission

Steps:

  • Load Test Dataset
  • Vectorize the Features (Review)
  • Predict the sentiment
  • Create the CSV file and update the submission

In [53]:
# Read in the Test Dataset
# Note that it's missing the Sentiment Column.  That's what we need to Predict
#
test_df = pd.read_csv("data/test.tsv", sep="\t")
test_df.head()


Out[53]:
document_id review
0 0 This is one of those movies that has everythin...
1 1 I don't know what some people were thinking wh...
2 2 Here is a rundown of a typical Rachael Ray Sho...
3 3 "Speck" was apparently intended to be a biopic...
4 4 Let's get it clear from the start: I am an ass...

In [54]:
# Vectorize the Review Text

X_test = test_df.review
X_test_vect = vect.transform(X_test)

In [55]:
y_test_pred = final_model.predict(X_test_vect)

In [58]:
df = pd.DataFrame({
    "document_id": test_df.document_id,
    "sentiment": y_test_pred
})

In [59]:
df.to_csv("data/ensemble_submission1.csv", index=False)

Other Ideas

  • Try Different Vectorizers
  • Hyper Parameter Tuning of Models