Basic text classification with Naive Bayes


In this mini-project, you'll learn the basics of text analysis using a subset of movie reviews from the Rotten Tomatoes database. You'll also use a fundamental technique in Bayesian inference called Naive Bayes. This mini-project is based on Lab 10 of Harvard's CS109 class. Please feel free to go to the original lab for additional exercises and solutions.


In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Rotten Tomatoes data set


In [2]:
critics = pd.read_csv('./critics.csv')
#let's drop rows with missing quotes
critics = critics[~critics.quote.isnull()]
critics.head()


Out[2]:
critic fresh imdb publication quote review_date rtid title
1 Derek Adams fresh 114709 Time Out So ingenious in concept, design and execution ... 2009-10-04 9559 Toy story
2 Richard Corliss fresh 114709 TIME Magazine The year's most inventive comedy. 2008-08-31 9559 Toy story
3 David Ansen fresh 114709 Newsweek A winning animated feature that has something ... 2008-08-18 9559 Toy story
4 Leonard Klady fresh 114709 Variety The film sports a provocative and appealing st... 2008-06-09 9559 Toy story
5 Jonathan Rosenbaum fresh 114709 Chicago Reader An entertaining computer-generated, hyperreali... 2008-03-10 9559 Toy story

Explore


In [4]:
n_reviews = len(critics)
n_movies = critics.rtid.unique().size
n_critics = critics.critic.unique().size


print("Number of reviews: %i" % n_reviews)
print("Number of critics: %i" % n_critics)
print("Number of movies:  %i" % n_movies)


Number of reviews: 15561
Number of critics: 623
Number of movies:  1921

In [5]:
df = critics.copy()
df['fresh'] = df.fresh == 'fresh'
grp = df.groupby('critic')
counts = grp.critic.count()  # number of reviews by each critic
means = grp.fresh.mean()     # average freshness for each critic

means[counts > 100].hist(bins=10, edgecolor='w', lw=1)
plt.xlabel("Average rating per critic")
plt.ylabel("N")
plt.yticks([0, 2, 4, 6, 8, 10]);


The vector space model and a search engine

All the diagrams here are snipped from http://nlp.stanford.edu/IR-book/, which is a great resource on text processing.

Also check out Python packages nltk, spacy, and pattern, and their associated resources.

Let us define the vector derived from document d by $\bar V(d)$. What does this mean? Each document is considered to be a vector made up from a vocabulary, where there is one axis for each term in the vocabulary.

To define the vocabulary, we take a union of all words we have seen in all documents. We then just associate an array index with them. So "hello" may be at index 5 and "world" at index 99.

Then the document

"hello world world"

would be indexed as

[(5,1),(99,2)]

along with a dictionary

5: Hello 99: World

so that you can see that our representation is one of a sparse array.
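
As a minimal illustration (not part of the original lab, and with the index numbers made up), this is how such a sparse encoding could be built by hand:

In [ ]:
# A toy sparse bag-of-words encoding, assuming we assign the axis indices ourselves.
from collections import Counter

vocabulary = {"hello": 5, "world": 99}            # word -> axis index (made up)
doc = "hello world world"

counts = Counter(doc.split())                     # term frequencies
sparse_doc = sorted((vocabulary[w], c) for w, c in counts.items())
print(sparse_doc)                                 # [(5, 1), (99, 2)]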

Then, a set of documents becomes, in the usual sklearn style, a sparse matrix with rows being sparse arrays and columns "being" the features, i.e. the vocabulary. I put "being" in quotes because the layout in memory is not that of a matrix with many 0's; rather, we use the sparse representation we talked about above.

Notice that this representation loses the relative ordering of the terms in the document. That is, "cat ate rat" and "rat ate cat" are the same. Thus, this representation is also known as the Bag-Of-Words representation.

Here is another example, from the book quoted above, although the matrix is transposed here so that documents are columns:

Such a matrix is also called a Term-Document Matrix. Here, the terms being indexed could be stemmed before indexing; for instance, jealous and jealousy after stemming are the same feature. One could also make use of other "Natural Language Processing" transformations in constructing the vocabulary. We could use Lemmatization, which reduces words to lemmas: work, working, worked would all reduce to work. We could remove "stopwords" from our vocabulary, such as common words like "the". We could look for particular parts of speech, such as adjectives. This is often done in Sentiment Analysis. And so on. It all depends on our application.

From the book:

The standard way of quantifying the similarity between two documents $d_1$ and $d_2$ is to compute the cosine similarity of their vector representations $\bar V(d_1)$ and $\bar V(d_2)$:

$$S_{12} = \frac{\bar V(d_1) \cdot \bar V(d_2)}{|\bar V(d_1)| \times |\bar V(d_2)|}$$

There is a far more compelling reason to represent documents as vectors: we can also view a query as a vector. Consider the query q = jealous gossip. This query turns into the unit vector $\bar V(q) = (0, 0.707, 0.707)$ on the three coordinates of the book's example.

The key idea now: to assign to each document d a score equal to the dot product:

$$\bar V(q) \cdot \bar V(d)$$

Thus we can use this simple vector model as a search engine.
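
Here is a rough sketch of that ranking idea on a made-up three-document corpus (the documents and query are invented for illustration; CountVectorizer is introduced properly in the next section):

In [ ]:
# Minimal vector-space "search engine": rank documents by cosine similarity to a query.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["jealous gossip hurts", "gossip about movies", "a quiet drama"]  # made-up corpus
query = ["jealous gossip"]

vec = CountVectorizer()
D = vec.fit_transform(docs).toarray().astype(float)   # documents as rows
q = vec.transform(query).toarray().astype(float)[0]   # query in the same vocabulary

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

scores = [cosine(q, d) for d in D]
print(sorted(zip(scores, docs), reverse=True))        # best match first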

In Code


In [50]:
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hop on pop', 'Hop off pop', 'Hop Hop hop']
print("Original text is\n", '\n'.join(text))

vectorizer = CountVectorizer(min_df=0)

# call `fit` to build the vocabulary
vectorizer.fit(text)

# call `transform` to convert text to a bag of words
x = vectorizer.transform(text)

# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to 
# convert back to a "normal" numpy array
x = x.toarray()

print()
print("Transformed text vector is \n", x)

# `get_feature_names` tracks which word is associated with each column of the transformed x
print()
print("Words for each feature:")
print(vectorizer.get_feature_names())

# Notice that the bag of words treatment doesn't preserve information about the *order* of words, 
# just their frequency


Original text is
 Hop on pop
Hop off pop
Hop Hop hop
Transformed text vector is 
 [[1 0 1 1]
 [1 1 0 1]
 [3 0 0 0]]
Words for each feature:
['hop', 'off', 'on', 'pop']

In [51]:
def make_xy(critics, vectorizer=None):
    #Your code here    
    if vectorizer is None:
        vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(critics.quote)
    X = X.tocsc()  # some versions of sklearn return COO format
    y = (critics.fresh == 'fresh').values.astype(int)
    return X, y
X, y = make_xy(critics)

Naive Bayes

$$P(c|d) \propto P(d|c)\, P(c)$$

and, by the conditional independence assumption,

$$P(d|c) = \prod_k P(t_k | c)$$

We then pick the class $c$ for which $P(c|d)$ is higher.

To avoid floating-point underflow, we turn the product into a sum by working in log space:

$$log(P(d|c)) = \sum_k log (P(t_k | c)) $$

But we must also handle terms that never appear with a given class; we can't assign them zero probability, so we smooth with a parameter $\alpha$:

$$P(t_k|c) = \frac{N_{kc}+\alpha}{N_c+\alpha N_{feat}}$$
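
To make these formulas concrete, here is a hand-rolled sketch of the smoothed log-probability computation (for illustration only; sklearn's MultinomialNB, used below, handles all of this for us, and X here is assumed to be a dense count matrix with a 0/1 label vector y):

In [ ]:
# Hand-rolled multinomial Naive Bayes scoring, mirroring the smoothing formula above.
import numpy as np

def class_log_score(X, y, x_new, c, alpha=1.0):
    Xc = X[y == c]
    N_kc = Xc.sum(axis=0)                        # count of each term in class c
    N_c = N_kc.sum()                             # total term count in class c
    n_feat = X.shape[1]
    log_p_t_given_c = np.log((N_kc + alpha) / (N_c + alpha * n_feat))
    log_prior = np.log((y == c).mean())
    # score for class c, up to terms that don't depend on c:
    # log prior + sum_k count_k * log P(t_k|c)
    return log_prior + x_new @ log_p_t_given_c

# Predict the class with the higher score, e.g.:
# pred = int(class_log_score(X, y, x_new, 1) > class_log_score(X, y, x_new, 0))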

Your turn: Implement a simple Naive Bayes classifier

  • Use scikit-learn's MultinomialNB() classifier with default parameters.
  • Split the data set into a training set and a test set.
  • Train the classifier on the training set and test it on the test set.
  • Print the accuracy scores for both the training and the test sets.

What do you notice? Is this a good classifier? If not, why not?


In [52]:
#your turn
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

naive_model = MultinomialNB().fit(X_train,y_train)
test_score = naive_model.score(X_test,y_test)
train_score = naive_model.score(X_train,y_train)
print("Multinomial test accuracy: %0.2f%%" % (100*test_score))
print("Multinomial train accuracy: %0.2f%%" % (100*train_score))


Multinomial test accuracy: 76.77%
Multinomial train accuracy: 92.24%

The accuracy on the test set is decent, but not great. When we see how the classifier performs on the training set, however, it becomes clear that it is overfitting: there is a $\approx 16\%$ gap between the two scores.

Cross-Validation and hyper-parameter fitting

We use KFold instead of GridSearchCV here as we will want to also set parameters in the CountVectorizer.


In [53]:
from sklearn.model_selection import KFold
def cv_score(clf, X, y, scorefunc):
    result = 0.
    nfold = 5
    kf = KFold(n_splits = nfold)
    for train, test in kf.split(X): # split data into train/test groups, 5 times
        clf.fit(X[train], y[train]) # fit
        result += scorefunc(clf, X[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average

We use the log-likelihood as the score here in scorefunc. Indeed, what we do in cv_score above is to implement the cross-validation part of GridSearchCV.

Since Naive Bayes classifiers are often used in asymmetric situations, it might help to actually maximize probability on the validation folds rather than just accuracy.

Notice something else about using a custom score function: it allows us to optimize directly for the decision risk we care about (negative profit, for example) on the validation set. You will often find people using roc_auc, precision, recall, or F1-score as risks or scores.
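
For instance, any callable with the signature (clf, X, y) -> float can be dropped into cv_score above. A hypothetical F1-based score function (not part of the original lab) might look like this:

In [ ]:
# A hypothetical alternative score function for cv_score: F1 on the held-out fold.
from sklearn.metrics import f1_score

def f1_scorefunc(clf, X, y):
    return f1_score(y, clf.predict(X))

# e.g. cv_score(MultinomialNB(), X_train, y_train, f1_scorefunc)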


In [54]:
def log_likelihood(clf, x, y):
    prob = clf.predict_log_proba(x)
    rotten = y == 0
    fresh = ~rotten
    return prob[rotten, 0].sum() + prob[fresh, 1].sum()

Your turn: What does using this function as the score mean? What are we trying to optimize for?

We are scoring by summing, for each review, its contribution to the log probability of its true class (rotten or fresh). Thus we are optimizing the certainty of our predictions: large scores indicate greater confidence in the correct classification.

A downside of this is that hard-to-classify movies (such as the infamous Napoleon Dynamite) would score high in both categories. The assumption is that, in general, these two components of the score shouldn't be closely correlated, and so taking the largest score means optimizing for some classification into one of the two groups.

We'll cross-validate over the regularization parameter $\alpha$ and the min_df of the CountVectorizer.

min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
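
As a quick, optional check (not part of the original lab), you can see how min_df shrinks the vocabulary on this corpus; the exact sizes depend on the data, so none are shown here:

In [ ]:
# How min_df shrinks the vocabulary: terms appearing in fewer documents than the
# threshold are dropped. (Illustrative only; sizes depend on the corpus.)
for mdf in [1, 0.001, 0.01]:
    v = CountVectorizer(min_df=mdf)
    v.fit(critics.quote)
    print("min_df=%s -> vocabulary size: %d" % (mdf, len(v.vocabulary_)))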

Let's set up the train and test masks first:


In [55]:
itrain, itest = train_test_split(range(critics.shape[0]), train_size=0.7)
# boolean mask: True for training rows, False for test rows
mask = np.zeros(critics.shape[0], dtype=bool)
mask[itrain] = True

Your turn:

Using the skeleton code below, find the best values of the parameters alpha and min_df. Use the cv_score function above with the log_likelihood function for scoring.


In [56]:
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
maxscore=-np.inf
for alpha in alphas:
    for min_df in min_dfs:         
        vectorizer = CountVectorizer(min_df = min_df)       
        Xthis, ythis = make_xy(critics, vectorizer)
        Xtrainthis=Xthis[mask]
        ytrainthis=ythis[mask]
        #your turn
        naive_bayes = MultinomialNB(alpha=alpha)
        crossval_score = cv_score(naive_bayes,Xtrainthis,ytrainthis, log_likelihood)
        
        if crossval_score > maxscore:
            maxscore = crossval_score
            best_alpha,best_min_df = alpha,min_df

In [57]:
print("alpha: %f" % best_alpha)
print("min_df: %f" % best_min_df)


alpha: 5.000000
min_df: 0.001000

Work with the best params

Your turn: Using the best values of alpha and min_df you just found, calculate the accuracy on the training and test sets. Is this classifier better? Why (not)?


In [61]:
vectorizer = CountVectorizer(min_df=best_min_df)
X, y = make_xy(critics, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]

clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)

#your turn. Print the accuracy on the test and training dataset
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)

print("Accuracy on training data: %0.2f%%" % (training_accuracy))
print("Accuracy on test data:     %0.2f%%" % (test_accuracy))


Accuracy on training data: 0.79%
Accuracy on test data:     0.74%

In [40]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, clf.predict(xtest)))


[[1111  738]
 [ 455 2365]]

The classifier performs slightly worse on the test data, but the closeness of the two scores suggests that we are no longer overfitting. One would need to test against new, unseen data to be sure, but the initial impression is that this classifier will generalize better.

Interpretation

What are the strongly predictive features?

We use a neat trick to identify strongly predictive features (i.e. words).

  • first, create a data set such that each row has exactly one feature. This is represented by the identity matrix.
  • use the trained classifier to make predictions on this matrix
  • sort the rows by predicted probabilities, and pick the top and bottom $K$ rows

In [65]:
words = np.array(vectorizer.get_feature_names())

x = np.eye(xtest.shape[1])
probs = clf.predict_log_proba(x)[:, 0]
ind = np.argsort(probs)

good_words = words[ind[:10]]
bad_words = words[ind[-10:]]

good_prob = probs[ind[:10]]
bad_prob = probs[ind[-10:]]

print("Good words\t     P(fresh | word)")
for w, p in zip(good_words, good_prob):
    print("%20s" % w, "%0.2f" % (1 - np.exp(p)))
    
print("Bad words\t     P(fresh | word)")
for w, p in zip(bad_words, bad_prob):
    print("%20s" % w, "%0.2f" % (1 - np.exp(p)))


Good words	     P(fresh | word)
            touching 0.89
             delight 0.89
             perfect 0.88
          delightful 0.88
         masterpiece 0.88
              moving 0.87
              superb 0.87
         intelligent 0.87
               witty 0.86
          remarkable 0.86
Bad words	     P(fresh | word)
             muddled 0.21
                dull 0.21
            tiresome 0.20
               worst 0.20
             unfunny 0.19
               bland 0.19
          uninspired 0.18
           pointless 0.18
                lame 0.16
       unfortunately 0.14

Your turn: Why does this method work? What does the probability for each row in the identity matrix represent?

This method works because each row in our matrix contains exactly one word (feature). The predicted probability for such a row is therefore the probability of freshness given that single word alone. Words with a very high probability correlate strongly with freshness on their own, and words with very low probabilities correlate strongly with rottenness.

Mis-predictions

We can see mis-predictions as well.


In [74]:
x, y = make_xy(critics, vectorizer)

prob = clf.predict_proba(x)[:, 0]
predict = clf.predict(x)

bad_rotten = np.argsort(prob[y == 0])[:5]
bad_fresh = np.argsort(prob[y == 1])[-5:]

print("Mis-predicted Rotten quotes")
print ('---------------------------')
for row in bad_rotten:
    print (critics[y == 0].quote.iat[row])
    print()

print("Mis-predicted Fresh quotes")
print('--------------------------')
for row in bad_fresh:
    print(critics[y == 1].quote.iat[row])
    print()


Mis-predicted Rotten quotes
---------------------------
With its feints at horror and pathos, the third Star Wars film is the most Disney-esque in its emotional outline, yet that outline is buried beneath an obnoxiously hyped-up pace that reduces the emotions to rubble.

It survives today only as an unusually pure example of a typical 50s art-film strategy: the attempt to make the most modern and most popular of art forms acceptable to the intelligentsia by forcing it into an arcane, antique mold.

Herzog offers some evidence of Kinski's great human warmth, somewhat more of his rage of unimaginable proportions, and a good demonstration of Kinski's uncanny capacity to corkscrew his way into the frame.

Walken is one of the few undeniably charismatic male villains of recent years; he can generate a snakelike charm that makes his worst characters the most memorable, and here he operates on pure style.

I know that Platoon is being acclaimed for its realism, and I expect to be chastened for being a woman finding fault with a war film. But I've probably seen as much combat as most of the men saying, 'This is how war is.'

Mis-predicted Fresh quotes
--------------------------
A kind of insane logic seems to connect the sketches, if you look hard enough, but mostly the movie seems to exist in the present and be willing to try anything for a laugh.

There's too much talent and too strong a story to mess it up. There was potential for more here, but this incarnation is nothing to be ashamed of, and some of the actors answer the bell.

The gangland plot is flimsy (bad guy Peter Greene wears too much eyeliner), and the jokes are erratic, but it's a far better showcase for Carrey's comic-from-Uranus talent than Ace Ventura.

Though it's a good half hour too long, this overblown 1993 spin-off of the 60s TV show otherwise adds up to a pretty good suspense thriller.

Some of the gags don't work, but fewer than in any previous Brooks film that I've seen, and when the jokes are meant to be bad, they are riotously poor. What more can one ask of Mel Brooks?

Predicting the freshness for a new review

Your turn:

  • Using your best trained classifier, predict the freshness of the following sentence: 'This movie is not remarkable, touching, or superb in any way'
  • Is the result what you'd expect? Why (not)?

In [76]:
#your turn
clf.predict_proba(vectorizer.transform(['This movie is not remarkable, touching, or superb in any way']))


Out[76]:
array([[ 0.01359199,  0.98640801]])

This classifier gives a 98.6% probability of being fresh, despite the fact that the sentence is clearly negative. The word 'not' should negate all the positive adjectives that follow, but our simple bag-of-words approach doesn't have any way of dealing with this and simply takes the positive features as is. Thus, a completely naive approach fails when confronted with the subtleties of language.

Fun things to try and improve this model:

There are many things worth trying. Some examples:

  • You could try to build an NB model where the features are word pairs instead of words. This would be smart enough to realize that "not good" and "so good" mean very different things. This technique doesn't scale very well, since these features are much more sparse (and hence it's harder to detect repeatable patterns within them). See the sketch after this list for a starting point; it also covers the next idea.
  • You could try a model besides NB, that would allow for interactions between words -- for example, a Random Forest classifier.
  • You could consider adding supplemental features -- information about genre, director, cast, etc.
  • You could build a visualization that prints reviews and visually encodes each word with size or color to indicate how that word contributes to P(Fresh). For example, really bad words could show up as big and red, good words as big and green, common words as small and grey, etc.
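
Here is a rough, untuned sketch of the first two ideas (bigram features and a Random Forest); the parameter values are illustrative only, not a definitive solution:

In [ ]:
# Word pairs (bigrams) via ngram_range, and a Random Forest as an alternative model.
# Parameter values here are illustrative, not tuned.
from sklearn.ensemble import RandomForestClassifier

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=best_min_df)
Xb, yb = make_xy(critics, bigram_vectorizer)

nb_bigram = MultinomialNB(alpha=best_alpha).fit(Xb[mask], yb[mask])
print("bigram NB test accuracy:", nb_bigram.score(Xb[~mask], yb[~mask]))

rf = RandomForestClassifier(n_estimators=100).fit(Xb[mask], yb[mask])
print("random forest test accuracy:", rf.score(Xb[~mask], yb[~mask]))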

Better features

We could use TF-IDF instead. What is this? It stands for

Term-Frequency X Inverse Document Frequency.

In the standard CountVectorizer model above, we used just the frequency of each vocabulary word within a document. In TF-IDF, we weight this term frequency by the inverse of the word's popularity across all documents. For example, if the word "movie" showed up in every document, it would not have much predictive value; by weighting its counts by one over its document frequency, we down-weight it. We can then use these tf-idf weighted features as inputs to any classifier.
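
In the classic formulation, the weight of a term in a document is its term frequency times log(N/df), where N is the number of documents and df is the number of documents containing the term. sklearn's TfidfVectorizer (used below) applies a smoothed, normalized variant, but a small numpy sketch (not from the original lab) illustrates the idea:

In [ ]:
# Classic tf-idf weighting on a toy count matrix (sklearn uses a smoothed,
# normalized variant, so its numbers will differ slightly).
import numpy as np

tf = np.array([[2, 0, 1],
               [1, 1, 0],
               [1, 1, 1]], dtype=float)   # rows: documents, columns: terms
N = tf.shape[0]                            # number of documents
df = (tf > 0).sum(axis=0)                  # documents containing each term
idf = np.log(N / df)
print(tf * idf)                            # the first term appears everywhere -> weight 0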


In [77]:
#http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
#http://scikit-learn.org/dev/modules/classes.html#text-feature-extraction-ref
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(min_df=1, stop_words='english')
Xtfidf=tfidfvectorizer.fit_transform(critics.quote)

Your turn (extra credit): Try a few of these ideas to improve the model (or any other ideas of your own). Implement here and report on the result.


In [78]:
Xtfidf[0].toarray()


Out[78]:
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [ ]: