Basic Text Classification with Naive Bayes

In the mini-project, you'll learn the basics of text analysis using a subset of movie reviews from the rotten tomatoes database. You'll also use a fundamental technique in Bayesian inference, called Naive Bayes. This mini-project is based on Lab 10 of Harvard's CS109 class. Please free to go to the original lab for additional exercises and solutions.



In [1]:

    
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from six.moves import range

# Setup Pandas
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

Rotten Tomatoes Dataset
- Explore
The Vector Space Model and a Search Engine
- In Code
Naive Bayes
- Multinomial Naive Bayes and Other Likelihood Functions
- Picking Hyperparameters for Naive Bayes and Text Maintenance
Interpretation

Rotten Tomatoes Dataset



In [2]:

    
critics = pd.read_csv('./critics.csv')
#let's drop rows with missing quotes
critics = critics[~critics.quote.isnull()]
critics.head()









    Out[2]:






  
    
      
      critic
      fresh
      imdb
      publication
      quote
      review_date
      rtid
      title
    
  
  
    
      1
      Derek Adams
      fresh
      114709
      Time Out
      So ingenious in concept, design and execution ...
      2009-10-04
      9559
      Toy story
    
    
      2
      Richard Corliss
      fresh
      114709
      TIME Magazine
      The year's most inventive comedy.
      2008-08-31
      9559
      Toy story
    
    
      3
      David Ansen
      fresh
      114709
      Newsweek
      A winning animated feature that has something ...
      2008-08-18
      9559
      Toy story
    
    
      4
      Leonard Klady
      fresh
      114709
      Variety
      The film sports a provocative and appealing st...
      2008-06-09
      9559
      Toy story
    
    
      5
      Jonathan Rosenbaum
      fresh
      114709
      Chicago Reader
      An entertaining computer-generated, hyperreali...
      2008-03-10
      9559
      Toy story

Explore



In [3]:

    
n_reviews = len(critics)
n_movies = critics.rtid.unique().size
n_critics = critics.critic.unique().size


print("Number of reviews: {:d}".format(n_reviews))
print("Number of critics: {:d}".format(n_critics))
print("Number of movies:  {:d}".format(n_movies))









    



Number of reviews: 15561
Number of critics: 623
Number of movies:  1921



In [4]:

    
df = critics.copy()
df['fresh'] = df.fresh == 'fresh'
grp = df.groupby('critic')
counts = grp.critic.count()  # number of reviews by each critic
means = grp.fresh.mean()     # average freshness for each critic

means[counts > 100].hist(bins=10, edgecolor='w', lw=1)
plt.xlabel("Average Rating per critic")
plt.ylabel("Number of Critics")
plt.yticks([0, 2, 4, 6, 8, 10]);

Exercise Set I

Exercise: Look at the histogram above. Tell a story about the average ratings per critic. What shape does the distribution look like? What is interesting about the distribution? What might explain these interesting things?

Answer: The histogram suggests a binomial distribution. What's interesting about it is its uneven nature, where both the minimum and maximum points can be found in the center of the curve.

The Vector Space Model and a Search Engine

In Code



In [5]:

    
from sklearn.feature_extraction.text import CountVectorizer

# We'll instantiate a CountVectorizer and then call its instance method fit_transform, 
# which does two things: it learns the vocabulary of the corpus and extracts word count features.
# In a way, what we're doing is converting a collection of text documents to a matrix of
# token counts.

text = ['Hop on pop', 'Hop off pop', 'Hop Hop hop']
print("Original text is\n{}".format('\n'.join(text)))

vectorizer = CountVectorizer(min_df=0)

# call `fit` to build the vocabulary
vectorizer.fit(text)

# call `transform` to convert text to a bag of words
x = vectorizer.transform(text)

# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to 
# convert back to a "normal" numpy array
x = x.toarray()

print("")
print("Transformed text vector is \n{}".format(x))

# `get_feature_names` tracks which word is associated with each column of the transformed x
print("")
print("Words for each feature:")
print(vectorizer.get_feature_names())

# Notice that the bag of words treatment doesn't preserve information about the *order* of words, 
# just their frequency









    



Original text is
Hop on pop
Hop off pop
Hop Hop hop

Transformed text vector is 
[[1 0 1 1]
 [1 1 0 1]
 [3 0 0 0]]

Words for each feature:
['hop', 'off', 'on', 'pop']



In [6]:

    
def make_xy(critics, vectorizer=None):
    #Your code here    
    if vectorizer is None:
        vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(critics.quote)
    X = X.tocsc()  # some versions of sklearn return COO format
    y = (critics.fresh == 'fresh').values.astype(np.int)
    return X, y
X, y = make_xy(critics)

Naive Bayes

Exercise Set II

Exercise: Implement a simple Naive Bayes classifier:

split the data set into a training and test set
Use `scikit-learn`'s `MultinomialNB()` classifier with default parameters.
train the classifier over the training set and test on the test set
print the accuracy scores for both the training and the test sets

What do you notice? Is this a good classifier? If not, why not?



In [7]:

    
#your turn

# With these counts as features, we can go to the next steps: training a classifier.
# Naïve Bayes classifier applies the Bayes theorem with naïve independence assumptions. 
# That is, each feature (in this case word counts) is independent from every other one 
# and each one contributes to the probability that an example belongs to a particular class.

# Now, split the data set into a training and test set.
 
import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

classifier = MultinomialNB()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=40)

classifier.fit(X_train, y_train)

prediction = classifier.predict(X_test)

print("Classifier's accuracy score on training data is %2.2F%%" % (accuracy_score(y_train,classifier.predict(X_train))*100))
print("Classifier's accuracy score on test data is %2.2F%%" % (accuracy_score(y_test,prediction)*100))

# Answer:
#    The classifier did better on the training set than the test set suggesting an overfitting
#    of the model.









    



Classifier's accuracy score on training data is 91.94%
Classifier's accuracy score on test data is 77.76%

Picking Hyperparameters for Naive Bayes and Text Maintenance

We need to know what value to use for $\alpha$, and we also need to know which words to include in the vocabulary. As mentioned earlier, some words are obvious stopwords. Other words appear so infrequently that they serve as noise, and other words in addition to stopwords appear so frequently that they may also serve as noise.

First, let's find an appropriate value for min_df for the CountVectorizer. min_df can be either an integer or a float/decimal. If it is an integer, min_df represents the minimum number of documents a word must appear in for it to be included in the vocabulary. If it is a float, it represents the minimum percentage of documents a word must appear in to be included in the vocabulary. From the documentation:

min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

Exercise Set III

Exercise: Construct the cumulative distribution of document frequencies (df). The $x$-axis is a document count $x_i$ and the $y$-axis is the percentage of words that appear less than $x_i$ times. For example, at $x=5$, plot a point representing the percentage or number of words that appear in 5 or fewer documents.

Exercise: Look for the point at which the curve begins climbing steeply. This may be a good value for `min_df`. If we were interested in also picking `max_df`, we would likely pick the value where the curve starts to plateau. What value did you choose?



In [8]:

    
# Construct the cumulative distribution of document frequencies (df). The 
# x-axis is a document count xi and the y-axis is the percentage of words
# that appear less than xi times.

# We first want to sum the documents.
cumulative = X.sum(axis=0)

# Now, sum up the number of times the words appear in document using a dictionary
words_in_doc_freq = {}
for i in range(cumulative.max() +1): 
    words_in_doc_freq[i] = np.sum(cumulative < i)

# Plot the dictionary items
for key, value in words_in_doc_freq.items():
    plt.scatter(x=int(key),y=int(value))
    
plt.xlabel('Document Frequencies', fontsize=12)
plt.ylabel('Percent WOrds that Appear Less Than Document Frequencies', fontsize=12)









    Out[8]:





<matplotlib.text.Text at 0x21738830358>

The parameter $\alpha$ is chosen to be a small value that simply avoids having zeros in the probability computations. This value can sometimes be chosen arbitrarily with domain expertise, but we will use K-fold cross validation. In K-fold cross-validation, we divide the data into $K$ non-overlapping parts. We train on $K-1$ of the folds and test on the remaining fold. We then iterate, so that each fold serves as the test fold exactly once. The function cv_score performs the K-fold cross-validation algorithm for us, but we need to pass a function that measures the performance of the algorithm on each fold.



In [9]:

    
from sklearn.cross_validation import KFold
def cv_score(clf, X, y, scorefunc):
    result = 0.
    nfold = 5
    for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
        clf.fit(X[train], y[train]) # fit the classifier, passed is as clf.
        result += scorefunc(clf, X[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average

We use the log-likelihood as the score here in scorefunc. The higher the log-likelihood, the better. Indeed, what we do in cv_score above is to implement the cross-validation part of GridSearchCV.

The custom scoring function scorefunc allows us to use different metrics depending on the decision risk we care about (precision, accuracy, profit etc.) directly on the validation set. You will often find people using roc_auc, precision, recall, or F1-score as the scoring function.



In [10]:

    
def log_likelihood(clf, x, y):
    prob = clf.predict_log_proba(x)
    rotten = y == 0
    fresh = ~rotten
    return prob[rotten, 0].sum() + prob[fresh, 1].sum()

We'll cross-validate over the regularization parameter $\alpha$.

Let's set up the train and test masks first, and then we can run the cross-validation procedure.



In [11]:

    
from sklearn.cross_validation import train_test_split
_, itest = train_test_split(range(critics.shape[0]), train_size=0.7)
mask = np.zeros(critics.shape[0], dtype=np.bool)
mask[itest] = True

Exercise Set IV

Exercise: What does using the function `log_likelihood` as the score mean? What are we trying to optimize for?

Exercise: Without writing any code, what do you think would happen if you choose a value of $\alpha$ that is too high?

Exercise: Using the skeleton code below, find the best values of the parameter `alpha`, and use the value of `min_df` you chose in the previous exercise set. Use the `cv_score` function above with the `log_likelihood` function for scoring.



In [12]:

    
# 1) Using the log_likelihood as the score means that the size of the dataset is taken
#    into account to measure fit robustness.  It returns the sum of the logs of the
#    predicted probabilities, the higher the score the better.
# 2) Using n as the vocabulary size, the probability estimate (N+alpha)/(N+alpha+n). 
#    Choosing an alpha too high means that both the probability estimates and 
#    log_likelihood function will converge around the middle.  This implies a poor
#    prediction.



In [13]:

    
from sklearn.naive_bayes import MultinomialNB

#the grid of parameters to search over
alphas = [.1, 1, 5, 10, 50]
best_min_df = 0.001 # YOUR TURN: put your value of min_df here.

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
maxscore=-np.inf
for alpha in alphas:        
    vectorizer = CountVectorizer(min_df=best_min_df)       
    Xthis, ythis = make_xy(critics, vectorizer)
    Xtrainthis = Xthis[mask]
    ytrainthis = ythis[mask]
    # your turn
    # Train and test a MultinomialNB classifier
    classifier = MultinomialNB()
    scr = cv_score(classifier, Xtrainthis, ytrainthis, log_likelihood)
    if scr > maxscore:
        maxscore = scr
        best_alpha = alpha
        best_min_df = best_min_df
        
print("best alpha: {}".format(best_alpha))
print("maxscore: {}".format(maxscore))
print("best min df: {}".format(best_min_df))









    



best alpha: 0.1
maxscore: -594.2225555510283
best min df: 0.001

Exercise Set V: Working with the Best Parameters

Exercise: Using the best value of `alpha` you just found, calculate the accuracy on the training and test sets. Is this classifier better? Why (not)?



In [14]:

    
vectorizer = CountVectorizer(min_df=best_min_df)
X, y = make_xy(critics, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

#your turn. Print the accuracy on the test and training dataset
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: {:2f}".format(training_accuracy))
print("Accuracy on test data:     {:2f}".format(test_accuracy))

# Based on the results, the accuracy of this classifier isn't any
# better than before because the accuracy on test data is still
# less than the accuracy of the training data.









    



Accuracy on training data: 0.832084
Accuracy on test data:     0.723467



In [15]:

    
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, clf.predict(xtest)))









    



[[2584 1696]
 [1316 5296]]

Interpretation

What are the strongly predictive features?

We use a neat trick to identify strongly predictive features (i.e. words).

first, create a data set such that each row has exactly one feature. This is represented by the identity matrix.
use the trained classifier to make predictions on this matrix
sort the rows by predicted probabilities, and pick the top and bottom $K$ rows



In [16]:

    
words = np.array(vectorizer.get_feature_names())

x = np.eye(xtest.shape[1])
probs = clf.predict_log_proba(x)[:, 0]
ind = np.argsort(probs)

good_words = words[ind[:10]]
bad_words = words[ind[-10:]]

good_prob = probs[ind[:10]]
bad_prob = probs[ind[-10:]]

print("Good words\t     P(fresh | word)")
for w, p in zip(good_words, good_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))
    
print("Bad words\t     P(fresh | word)")
for w, p in zip(bad_words, bad_prob):
    print("{:>20}".format(w), "{:.2f}".format(1 - np.exp(p)))









    



Good words	     P(fresh | word)
        storytelling 0.99
              finest 0.99
            touching 0.99
             delight 0.99
          remarkable 0.99
            ensemble 0.99
                ride 0.99
           tradition 0.99
            captures 0.99
           energetic 0.99
Bad words	     P(fresh | word)
            overlong 0.02
      disappointment 0.01
            annoying 0.01
              unless 0.01
     disappointingly 0.01
             million 0.01
         uninvolving 0.01
           pointless 0.01
               banal 0.01
                lame 0.01

Exercise Set VI

Exercise: Why does this method work? What does the probability for each row in the identity matrix represent

Answer: This method can be considered to be very detailed because it computes the probability for every word. The probability for each row in the matrix represent a positive review being given only if the word is present.

The above exercise is an example of feature selection. There are many other feature selection methods. A list of feature selection methods available in sklearn is here. The most common feature selection technique for text mining is the chi-squared $\left( \chi^2 \right)$ method.

Prediction Errors

We can see mis-predictions as well.



In [17]:

    
x, y = make_xy(critics, vectorizer)

prob = clf.predict_proba(x)[:, 0]
predict = clf.predict(x)

bad_rotten = np.argsort(prob[y == 0])[:5]
bad_fresh = np.argsort(prob[y == 1])[-5:]

print("Mis-predicted Rotten quotes")
print('---------------------------')
for row in bad_rotten:
    print(critics[y == 0].quote.iloc[row])
    print("")

print("Mis-predicted Fresh quotes")
print('--------------------------')
for row in bad_fresh:
    print(critics[y == 1].quote.iloc[row])
    print("")









    



Mis-predicted Rotten quotes
---------------------------
This pacifist spirit of brotherhood echoes the heroics in Princess Mononoke and other anime titles, but the artistic gap between the Miyazaki masterpiece and this project is huge.

Three female friends grow up in a small town in Ireland in the mid-50s and attend college in Dublin in this nostalgic soap opera that's vaguely evocative of Peyton Place, though generally less memorable.

Highly stylized fashion-wise but awkwardly unfocused in its plotlines, it aims for the western iconography of Sam Peckinpah and Sergio Leone but never gets past its own directorial hurdles.

If anything, it's a Hanks 'little boy lost' movie, more in the Big tradition than The Big Tradition.

It's not as dispensable as Singleton's sophomore effort, Poetic Justice, but it's a long way from the assured freshman storytelling of Boyz N the Hood.

Mis-predicted Fresh quotes
--------------------------
Essentially a $30 million version of Abbott and Costello Meet the Mummy but not at all a bad time, thanks mainly to Bill Murray's incredibly dry line readings and director Ivan Reitman's maintenance of a moderately coherent tone and plotline.

Some of the gags don't work, but fewer than in any previous Brooks film that I've seen, and when the jokes are meant to be bad, they are riotously poor. What more can one ask of Mel Brooks?

Next Friday is an extremely funny movie, and this is coming from someone who barely cracked a smile during Friday, the first installment of this franchise.

Wonder Boys digresses so entertainingly, you forget how quickly Grady got into the mess he's in, and can't imagine where we might be headed.

A movie like The Lady Eve is so hard to make that you can't make it at all unless you find a way to make it seem effortless. Preston Sturges does a kind of breathless balancing act here, involving romance, deception and physical comedy.

Exercise Set VII: Predicting the Freshness for a New Review

Exercise:

Using your best trained classifier, predict the freshness of the following sentence: *'This movie is not remarkable, touching, or superb in any way'*
Is the result what you'd expect? Why (not)?



In [18]:

    
#your turn
text = pd.DataFrame([['CNN','rotten',0,'CNN','This movie is not remarkable, touching, or superb in any way','CNN',222,'CNN']],
             columns=['critic','fresh','imdb','publication','quote','review_date','rtid','title'])

all_dataset = pd.concat([critics,text])

vectorizer = CountVectorizer(min_df=best_min_df)
X, y = make_xy(all_dataset, vectorizer)

all_dataset.tail()

prob = classifier.predict(X)[-1]
print(prob)

# Since I used the best_min_df for my classifier, I did expect
# the probability to be greatly improved to be closer to 100%

Aside: TF-IDF Weighting for Term Importance

TF-IDF stands for

Term-Frequency X Inverse Document Frequency.

In the standard CountVectorizer model above, we used just the term frequency in a document of words in our vocabulary. In TF-IDF, we weight this term frequency by the inverse of its popularity in all documents. For example, if the word "movie" showed up in all the documents, it would not have much predictive value. It could actually be considered a stopword. By weighing its counts by 1 divided by its overall frequency, we downweight it. We can then use this TF-IDF weighted features as inputs to any classifier. TF-IDF is essentially a measure of term importance, and of how discriminative a word is in a corpus. There are a variety of nuances involved in computing TF-IDF, mainly involving where to add the smoothing term to avoid division by 0, or log of 0 errors. The formula for TF-IDF in scikit-learn differs from that of most textbooks:

$$\mbox{TF-IDF}(t, d) = \mbox{TF}(t, d)\times \mbox{IDF}(t) = n_{td} \log{\left( \frac{\vert D \vert}{\vert d : t \in d \vert} + 1 \right)}$$

where $n_{td}$ is the number of times term $t$ occurs in document $d$, $\vert D \vert$ is the number of documents, and $\vert d : t \in d \vert$ is the number of documents that contain $t$



In [19]:

    
# http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
# http://scikit-learn.org/dev/modules/classes.html#text-feature-extraction-ref
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(min_df=1, stop_words='english')
Xtfidf=tfidfvectorizer.fit_transform(critics.quote)

Exercise Set VIII: Enrichment

There are several additional things we could try. Try some of these as exercises:

Build a Naive Bayes model where the features are n-grams instead of words. N-grams are phrases containing n words next to each other: a bigram contains 2 words, a trigram contains 3 words, and 6-gram contains 6 words. This is useful because "not good" and "so good" mean very different things. On the other hand, as n increases, the model does not scale well since the feature set becomes more sparse.
Try a model besides Naive Bayes, one that would allow for interactions between words -- for example, a Random Forest classifier.
Try adding supplemental features -- information about genre, director, cast, etc.
Use word2vec or [Latent Dirichlet Allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) to group words into topics and use those topics for prediction.
Use TF-IDF weighting instead of word counts.

Exercise: Try a few of these ideas to improve the model (or any other ideas of your own). Implement here and report on the result.



In [20]:

    
# Your turn

# Build a Naive Bayes model that uses TF-IDF weighting instead of word counts. 
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(min_df=1, stop_words='english')
Xtfidf=tfidfvectorizer.fit_transform(critics.quote)

X, y = make_xy(critics, tfidfvectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clsf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

# your turn. 
# Print the accuracy on the test and training dataset
training_accuracy = clsf.score(xtrain, ytrain)
test_accuracy = clsf.score(xtest, ytest)

print("Accuracy on training data: %0.2f" % (training_accuracy))
print("Accuracy on test data:     %0.2f" % (test_accuracy))









    



Accuracy on training data: 0.97
Accuracy on test data:     0.72

	critic	fresh	imdb	publication	quote	review_date	rtid	title
1	Derek Adams	fresh	114709	Time Out	So ingenious in concept, design and execution ...	2009-10-04	9559	Toy story
2	Richard Corliss	fresh	114709	TIME Magazine	The year's most inventive comedy.	2008-08-31	9559	Toy story
3	David Ansen	fresh	114709	Newsweek	A winning animated feature that has something ...	2008-08-18	9559	Toy story
4	Leonard Klady	fresh	114709	Variety	The film sports a provocative and appealing st...	2008-06-09	9559	Toy story
5	Jonathan Rosenbaum	fresh	114709	Chicago Reader	An entertaining computer-generated, hyperreali...	2008-03-10	9559	Toy story