Improving the Text Classifier

Goals

  1. Try Random Forest on our Sentiment Data
  2. Try Support Vector Machines on our Sentiment Data
  3. Introduction to Hyperparameters

Introduction

There are three ways to improve a machine learning model

  1. Improve the training data
  2. Improve the feature extraction
  3. Improve the learning algorithm

Typically the easiest way is to get more data or cleaner data if you can do it. If that's not possible adding more features is the next easiest. The toughest way is to try to improve the learning algorithm. Not that we have a baseline in place we will try improving the learning algorithm and adding a few features.

Support Vector Machines

Support vector machines or SVMs are a popular way to model data when you have many more features that records. In our case we have about the same number of features as records so they might might sense to try. We actually used it in Lesson 3 but now we can try it out with Cross Validation. These take a little longer to train.


In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score

df = pd.read_csv('../scikit/tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']

# We need to remove the empty rows from the text before we pass into CountVectorizer
fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]

# Do the feature extraction
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()              # initialize the count vectorizer
count_vect.fit(fixed_text)                  # set up the columns for the feature matrix
counts = count_vect.transform(fixed_text)   # counts is the feature matrix

from sklearn.svm import LinearSVC
# Build a classifier using the LinearSVC algorithm
clf = LinearSVC()                           # initialize our classifier
clf.fit(counts, fixed_target)               # fit our classifier to the training data

scores = cross_val_score(clf, counts, fixed_target, cv=10)
print(scores)
print(scores.mean())


[ 0.67362637  0.65054945  0.65274725  0.65384615  0.66043956  0.67802198
  0.70847085  0.70077008  0.66482911  0.63947078]
0.668277158307

This SVM is 67% accurate.

Recall that the Naive Bayes Classifier was 65% accurate.
This does seem slightly better although there is a fair amount of variation in the folds and it probably wouldn't hold up to a statistical test.

There are other versions of SVMs we an try, but most of the others have a fit time that is quadratic in the number of records so they will be really slow.

Random Forests

Random forest is a really useful algorithm. It starts with a very simple and effective algorithm called a decision tree which looks at a single feature at a time and branches based on whether or not that feature is above some threshold. It then builds hundreds or thousands of these trees in parallel, each on different subsets of the data and gives each tree a single vote on which class the data is in. Random Forests take a while to train, but will probably run faster in production than an SVM.

I introduce Random Forests herebecause they are incredibly robust - if I only had one algorithm to use in any scenario I would probably use Random Forests or their close cousin Boosted Decision Trees.

Let's try Random Forests on our dataset - this may take a while.


In [4]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50) # n_estimators is the number of trees

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, counts, fixed_target, cv=10)
print(scores)
print(scores.mean())


[ 0.67472527  0.68351648  0.65494505  0.65274725  0.66813187  0.66043956
  0.70077008  0.69416942  0.67585447  0.65159868]
0.671689813068

The Random Forest is also 67% accurate but takes significantly longer to train.

Hyperparameters

Scikit has an endless list of different machine learning models. What you will usually find is that most of them perform similarly. Using the scikit flowchart we introduced in Lesson 3 will give you a great starting place.

If you look in the documentation for any of the classifiers, you will probably find a bunch of confusing optional parameters - often with greek letters. For example the Random Forest Classifier documentation has a parameter called n_estimators which is the number of trees and some options to control the size of the trees and other things. These are typically called hyperparameters. Finding the best hyperparameters is tricky - you often just have to try a bunch of different settings and see which one works the best.

For example, try running the Random Forest Classifier with 10 estimators or 1 estimator - what do you expect to happen?

One great feature of scikit is that it tends to give very good robust defaults for all of the machine learning algorithms.


In [ ]: