In lesson two we transformed our Twitter sentiment data into a feature matrix. Now we can apply virtually any machine learning algorithm to our data, and the Python scikit-learn package makes it easy to try whichever technique we want. The algorithms have subtle tradeoffs, and their names can be confusing. Many people new to machine learning spend too much time searching for the perfect algorithm for their data. In reality, training data and feature extraction are almost always more important; still, choosing the wrong algorithm can cause problems. In this notebook, we're going to go over how to pick an algorithm and evaluate whether it's working well.
Scikit-learn has a great flowchart for choosing an algorithm at scikit-learn.org/stable/tutorial/machine_learning_map/.
Let's walk through this flowchart on our data starting at the "START" circle in the upper right:
If you are looking at the flowchart on the scikit-learn webpage, you can click on the green box to go to the LinearSVC documentation.
In [4]:
import pandas as pd
import numpy as np
df = pd.read_csv('../scikit/tweets.csv')
target = df['is_there_an_emotion_directed_at_a_brand_or_product']
text = df['tweet_text']
# We need to remove the empty rows from the text before we pass into CountVectorizer
fixed_text = text[pd.notnull(text)]
fixed_target = target[pd.notnull(text)]
# Do the feature extraction
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer() # initialize the count vectorizer
count_vect.fit(fixed_text) # set up the columns for the feature matrix
counts = count_vect.transform(fixed_text) # counts is the feature matrix
from sklearn.svm import LinearSVC
# Build a classifier using the LinearSVC algorithm
clf = LinearSVC() # initialize our classifier
clf.fit(counts, fixed_target) # fit our classifier to the training data
print(clf.predict(count_vect.transform(['i love my iphone']))) # try making a prediction
All classification algorithms in scikit-learn have three important functions: fit() to train the classifier on a feature matrix and its labels, predict() to label new feature vectors, and score() to measure accuracy on labeled data.
Remember that the classifier only works on feature vectors. We use our count_vect object to turn our training data into features, and then we use it again to turn new data into features. Together, our count_vect object and our clf object work as a classifier that decides whether tweets are positive or negative.
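To see all three functions in one place, here is a minimal sketch using made-up toy sentences and labels (not the tweets dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data, just to show the interface
train_text = ['i love this phone', 'great camera',
              'i hate this phone', 'terrible battery']
train_labels = ['Positive', 'Positive', 'Negative', 'Negative']

vect = CountVectorizer()
features = vect.fit_transform(train_text)  # learn vocabulary and build the feature matrix

clf = LinearSVC()
clf.fit(features, train_labels)                       # 1. fit(): learn from training data
preds = clf.predict(vect.transform(['great phone']))  # 2. predict(): label new examples
accuracy = clf.score(features, train_labels)          # 3. score(): accuracy on labeled data
print(preds, accuracy)
```

Note that new text must go through the same vectorizer's transform() before predict() can use it.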
Let's try some more examples!
In [6]:
print('I hate my iphone', clf.predict(count_vect.transform(['I hate my iphone'])))
print('my iphone is great', clf.predict(count_vect.transform(['my iphone is great'])))
print('my iphone sucks', clf.predict(count_vect.transform(['my iphone sucks'])))
print('I do not love my iphone', clf.predict(count_vect.transform(['I do not love my iphone'])))
Hm, this all looks promising, except for the last one. Take a second to think about why our classifier might have gotten the last one wrong (think about the feature extraction process).
Try some of your own examples. How well do you think the classifier is working?
Since all of these machine learning algorithms take the same type of feature input, it's easy to try a different classifier. If we go back to the diagram at the top and follow the "Not Working" line coming out of the LinearSVC box, it takes us to the question "Text Data?" We are working with text data, so we find ourselves at the "Naive Bayes" node.
Let's switch our classifier to Naive Bayes. This is another common type of classifier which is extremely fast and easy to deploy.
In [8]:
# Build a classifier using the Naive Bayes algorithm
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(counts, fixed_target)
print(nb.predict(count_vect.transform(['i love my iphone']))) # try making a prediction
In [9]:
print('I hate my iphone', nb.predict(count_vect.transform(['I hate my iphone'])))
print('my iphone is great', nb.predict(count_vect.transform(['my iphone is great'])))
print('my iphone sucks', nb.predict(count_vect.transform(['my iphone sucks'])))
print('I do not love my iphone', nb.predict(count_vect.transform(['I do not love my iphone'])))
One of the most popular machine learning websites, Kaggle, ran a survey in 2017 asking data scientists which algorithms they used.
The technique we used first was a type of SVM and the technique we used second was a type of Bayesian algorithm. These are especially good algorithms for text data.
But before we get too fancy, we need to put in place a framework to evaluate our algorithms.
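One standard piece of such a framework is cross-validation: split the labeled data into folds, train on some folds, and score on the held-out fold. Here is a minimal sketch with hypothetical toy data; in the notebook you would pass in `counts` and `fixed_target` instead:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# Made-up toy sentences and labels for illustration only
texts = ['love it', 'great phone', 'awesome screen', 'very nice',
         'hate it', 'awful phone', 'terrible screen', 'very bad']
labels = ['pos', 'pos', 'pos', 'pos', 'neg', 'neg', 'neg', 'neg']

features = CountVectorizer().fit_transform(texts)

# cross_val_score trains and scores the model on each of 4 folds,
# so every example is used for testing exactly once
scores = cross_val_score(MultinomialNB(), features, labels, cv=4)
print(scores, scores.mean())
```

Scoring on held-out folds matters because accuracy on the training data itself is misleadingly optimistic.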