In this mini-project, you'll learn the basics of text analysis using a subset of movie reviews from the Rotten Tomatoes database. You'll also use a fundamental technique in Bayesian inference called Naive Bayes. This mini-project is based on Lab 10 of Harvard's CS109 class. Please feel free to go to the original lab for additional exercises and solutions.
In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
In [2]:
critics = pd.read_csv('./critics.csv')
#let's drop rows with missing quotes
critics = critics[~critics.quote.isnull()]
critics.head()
Out[2]:
In [4]:
n_reviews = len(critics)
n_movies = critics.rtid.unique().size
n_critics = critics.critic.unique().size
print("Number of reviews: %i" % n_reviews)
print("Number of critics: %i" % n_critics)
print("Number of movies: %i" % n_movies)
In [5]:
df = critics.copy()
df['fresh'] = df.fresh == 'fresh'
grp = df.groupby('critic')
counts = grp.critic.count() # number of reviews by each critic
means = grp.fresh.mean() # average freshness for each critic
means[counts > 100].hist(bins=10, edgecolor='w', lw=1)
plt.xlabel("Average rating per critic")
plt.ylabel("N")
plt.yticks([0, 2, 4, 6, 8, 10]);
All the diagrams here are snipped from http://nlp.stanford.edu/IR-book/, which is a great resource on text processing.
Also check out Python packages nltk, spacy, and pattern, and their associated resources.
Let us denote the vector derived from document d by $\bar V(d)$. What does this mean? Each document is considered to be a vector built from a vocabulary, with one axis for each term in the vocabulary.
To define the vocabulary, we take a union of all words we have seen in all documents. We then just associate an array index with them. So "hello" may be at index 5 and "world" at index 99.
Then the document
"hello world world"
would be indexed as
[(5,1),(99,2)]
along with a dictionary
5: hello
99: world
so that you can see that our representation is one of a sparse array.
Then, a set of documents becomes, in the usual sklearn style, a sparse matrix with rows being sparse arrays and columns "being" the features, i.e. the vocabulary. I put "being" in quotes as the layout in memory is not that of a matrix with many 0's; rather, we use the sparse representation we talked about above.
Notice that this representation loses the relative ordering of the terms in the document. That is "cat ate rat" and "rat ate cat" are the same. Thus, this representation is also known as the Bag-Of-Words representation.
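As an illustration of this sparse bag-of-words idea, here is a minimal sketch (using only the Python standard library, with made-up toy documents) that builds a vocabulary from a few documents and stores each one as a sparse list of (index, count) pairs:
from collections import Counter

docs = ["hello world world", "cat ate rat", "rat ate cat"]

# vocabulary: the union of all words seen, each assigned an array index
vocab = {word: idx for idx, word in enumerate(sorted(set(w for d in docs for w in d.split())))}

# sparse representation: for each document, a list of (index, count) pairs
sparse_docs = [sorted((vocab[w], n) for w, n in Counter(d.split()).items()) for d in docs]

print(vocab)
print(sparse_docs)  # note: "cat ate rat" and "rat ate cat" get the same representation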
Here is another example, from the book quoted above, although the matrix is transposed here so that documents are columns:
Such a matrix is also called a Term-Document Matrix. Here, the terms being indexed could be stemmed before indexing; for instance, jealous and jealousy after stemming are the same feature. One could also make use of other "Natural Language Processing" transformations in constructing the vocabulary. We could use Lemmatization, which reduces words to lemmas: work, working, worked would all reduce to work. We could remove "stopwords" from our vocabulary, such as common words like "the". We could look for particular parts of speech, such as adjectives. This is often done in Sentiment Analysis. And so on. It all depends on our application.
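As a rough illustration of such vocabulary transformations, here is a minimal sketch (assuming nltk is installed; the exact stems produced depend on the stemmer) that removes English stop words and stems the remaining tokens before they enter the vocabulary:
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
# the default analyzer handles tokenization and stop-word removal
base_analyzer = CountVectorizer(stop_words='english').build_analyzer()

def stemmed_analyzer(doc):
    # stem each token the base analyzer produces
    return [stemmer.stem(token) for token in base_analyzer(doc)]

vec = CountVectorizer(analyzer=stemmed_analyzer)
vec.fit(["He is jealous of the gossip", "the jealousy behind the gossiping"])
print(vec.get_feature_names())  # related word forms collapse onto shared stems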
From the book:
The standard way of quantifying the similarity between two documents $d_1$ and $d_2$ is to compute the cosine similarity of their vector representations $\bar V(d_1)$ and $\bar V(d_2)$:
$$S_{12} = \frac{\bar V(d_1) \cdot \bar V(d_2)}{|\bar V(d_1)| \times |\bar V(d_2)|}$$
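As a quick worked example, here is a minimal numpy sketch of this cosine similarity, using made-up term-count vectors:
import numpy as np

def cosine_similarity(v1, v2):
    # dot product of the two vectors, normalized by their lengths
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

d1 = np.array([2, 1, 0, 3])  # hypothetical term counts for document 1
d2 = np.array([1, 0, 1, 2])  # hypothetical term counts for document 2
print(cosine_similarity(d1, d2))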
There is a far more compelling reason to represent documents as vectors: we can also view a query as a vector. Consider the query q = jealous gossip. This query turns into the unit vector $\bar V(q)$ = (0, 0.707, 0.707) on the three coordinates of the term-document example from the book.
The key idea now: assign to each document $d$ a score equal to the dot product
$$\bar V(q) \cdot \bar V(d)$$
Thus we can use this simple vector model as a search engine.
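To make that concrete, here is a minimal sketch (a toy example with hypothetical documents, not part of the assignment) that scores a query against a few documents using CountVectorizer and cosine-normalized dot products:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["jealous gossip spreads fast",
        "a gossip column",
        "an affectionate family film"]
query = "jealous gossip"

vectorizer = CountVectorizer()
D = vectorizer.fit_transform(docs).toarray().astype(float)
q = vectorizer.transform([query]).toarray().ravel().astype(float)

# normalize document and query vectors to unit length, then score by dot product
D_unit = D / np.linalg.norm(D, axis=1, keepdims=True)
q_unit = q / np.linalg.norm(q)
scores = D_unit @ q_unit

# highest score = most similar document to the query
print(sorted(zip(scores, docs), reverse=True))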
In [50]:
from sklearn.feature_extraction.text import CountVectorizer
text = ['Hop on pop', 'Hop off pop', 'Hop Hop hop']
print("Original text is\n", '\n'.join(text))
vectorizer = CountVectorizer(min_df=1)  # keep every term that appears in at least one document
# call `fit` to build the vocabulary
vectorizer.fit(text)
# call `transform` to convert text to a bag of words
x = vectorizer.transform(text)
# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to
# convert back to a "normal" numpy array
x = x.toarray()
print()
print("Transformed text vector is \n", x)
# `get_feature_names` tracks which word is associated with each column of the transformed x
print()
print("Words for each feature:")
print(vectorizer.get_feature_names())
# Notice that the bag of words treatment doesn't preserve information about the *order* of words,
# just their frequency
In [51]:
def make_xy(critics, vectorizer=None):
    #Your code here
    if vectorizer is None:
        vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(critics.quote)
    X = X.tocsc()  # some versions of sklearn return COO format
    y = (critics.fresh == 'fresh').values.astype(int)
    return X, y
X, y = make_xy(critics)
Naive Bayes models the likelihood of a document $d$ under a class $c$ as a product over its terms, $P(d|c) = \prod_k P(t_k|c)$; this is the conditional independence assumption.
Then we see for which class $c$ the posterior $P(c|d)$ is higher.
To avoid floating-point underflow we change the product into a sum by going into log space:
$$\log P(d|c) = \sum_k \log P(t_k | c)$$
But we must also handle terms that never occur with a class; we can't have 0's for them, so we smooth the estimates:
$$P(t_k|c) = \frac{N_{kc}+\alpha}{N_c+\alpha N_{feat}}$$
Your turn: Implement a simple Naive Bayes classifier using scikit-learn's MultinomialNB() classifier with default parameters. What do you notice? Is this a good classifier? If not, why not?
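To make the formulas above concrete, here is a minimal from-scratch sketch of the smoothed, log-space scoring (a toy illustration, not the classifier used below; it assumes a dense count matrix X, 0/1 labels y, and counts x_new for one new document):
import numpy as np

def naive_bayes_log_scores(X, y, x_new, alpha=1.0):
    scores = []
    for c in np.unique(y):
        Nkc = X[y == c].sum(axis=0)           # count of each term within class c
        Nc = Nkc.sum()                         # total term count within class c
        log_ptk = np.log((Nkc + alpha) / (Nc + alpha * X.shape[1]))  # smoothed P(t_k|c)
        log_prior = np.log((y == c).mean())    # P(c)
        scores.append(log_prior + (x_new * log_ptk).sum())  # log P(c) + sum_k n_k log P(t_k|c)
    return scores  # pick the class with the higher score

# toy check with made-up counts
X_toy = np.array([[2, 0, 1], [0, 3, 1], [1, 0, 2]])
y_toy = np.array([1, 0, 1])
print(naive_bayes_log_scores(X_toy, y_toy, np.array([1, 1, 0])))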
In [52]:
#your turn
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)
naive_model = MultinomialNB().fit(X_train,y_train)
test_score = naive_model.score(X_test,y_test)
train_score = naive_model.score(X_train,y_train)
print("Multinomial test accuracy: %0.2f%%" % (100*test_score))
print("Multinomial train accuracy: %0.2f%%" % (100*train_score))
The accuracy score is good for the test set, but not great. When we see how it performs on the training set, however, it becomes clear that the classifier is overfit: there is a $\approx 16\%$ difference in score.
We use KFold instead of GridSearchCV here, as we will also want to set parameters in the CountVectorizer.
In [53]:
from sklearn.model_selection import KFold
def cv_score(clf, X, y, scorefunc):
    result = 0.
    nfold = 5
    kf = KFold(n_splits=nfold)
    for train, test in kf.split(X):  # split data into train/test groups, 5 times
        clf.fit(X[train], y[train])  # fit
        result += scorefunc(clf, X[test], y[test])  # evaluate score function on held-out data
    return result / nfold  # average
We use the log-likelihood as the score here, in scorefunc. Indeed, what we do in cv_score above is to implement the cross-validation part of GridSearchCV.
Since Naive Bayes classifiers are often used in asymmetric situations, it might help to actually maximize probability on the validation folds rather than just accuracy.
Notice something else about using a custom score function: it allows us to make choices based on the decision risk we care about (negative profit, for example) directly on the validation set. You will often find people using roc_auc, precision, recall, or F1-score as risks or scores.
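For instance, here is a minimal sketch (an illustration, not part of the assignment) of how the same cv_score helper could be reused with an accuracy-based or F1-based score function instead of the log-likelihood:
from sklearn.metrics import f1_score

def accuracy_scorefunc(clf, x, y):
    # plain accuracy on the held-out fold
    return clf.score(x, y)

def f1_scorefunc(clf, x, y):
    # F1-score on the held-out fold, treating "fresh" (1) as the positive class
    return f1_score(y, clf.predict(x))

# e.g. cv_score(MultinomialNB(), X, y, f1_scorefunc)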
In [54]:
def log_likelihood(clf, x, y):
    prob = clf.predict_log_proba(x)
    rotten = y == 0
    fresh = ~rotten
    return prob[rotten, 0].sum() + prob[fresh, 1].sum()
Your turn: What does using this function as the score mean? What are we trying to optimize for?
We are scoring by taking the amount that each X in the data contributes to the log probability of being in the rotten class or the fresh class. Thus we are optimizing the certainty of our predictions: large scores indicate a greater certainty in classification.
A downside of this is that hard-to-classify movies (such as the infamous Napoleon Dynamite) would score high in both categories. The assumption is that, in general, these two components of the score shouldn't be closely correlated, so taking the largest score means optimizing for some classification into one of the two groups.
We'll cross-validate over the regularization parameter $\alpha$ and the min_df of the CountVectorizer.
min_df: When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
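As a quick illustration of what min_df does (a sketch, assuming the critics DataFrame loaded above), compare the vocabulary sizes at a few thresholds:
for min_df in [1, 0.001, 0.01]:
    vec = CountVectorizer(min_df=min_df)
    vec.fit(critics.quote)
    # an int is an absolute document count; a float is a fraction of documents
    print(min_df, "->", len(vec.vocabulary_), "terms in vocabulary")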
Let's set up the train and test masks first:
In [55]:
itrain, itest = train_test_split(range(critics.shape[0]), train_size=0.7)
mask=np.ones(critics.shape[0], dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)
Your turn: Using the skeleton code below, find the best values of the parameters alpha and min_df. Use the cv_score function above with the log_likelihood function for scoring.
In [56]:
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
maxscore=-np.inf
for alpha in alphas:
    for min_df in min_dfs:
        vectorizer = CountVectorizer(min_df=min_df)
        Xthis, ythis = make_xy(critics, vectorizer)
        Xtrainthis = Xthis[mask]
        ytrainthis = ythis[mask]
        #your turn
        naive_bayes = MultinomialNB(alpha=alpha)
        crossval_score = cv_score(naive_bayes, Xtrainthis, ytrainthis, log_likelihood)
        if crossval_score > maxscore:
            maxscore = crossval_score
            best_alpha, best_min_df = alpha, min_df
In [57]:
print("alpha: %f" % best_alpha)
print("min_df: %f" % best_min_df)
Your turn: Using the best values of alpha and min_df you just found, calculate the accuracy on the training and test sets. Is this classifier better? Why (not)?
In [61]:
vectorizer = CountVectorizer(min_df=best_min_df)
X, y = make_xy(critics, vectorizer)
xtrain=X[mask]
ytrain=y[mask]
xtest=X[~mask]
ytest=y[~mask]
clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)
#your turn. Print the accuracy on the test and training dataset
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print("Accuracy on training data: %0.2f%%" % (training_accuracy))
print("Accuracy on test data: %0.2f%%" % (test_accuracy))
In [40]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, clf.predict(xtest)))
The classifier performs slightly worse on the test data, but the closeness of the scores suggests that we are no longer overfitting. One would need to test against new, novel data to be sure, but the initial impression is that this classifier will generalize better across a greater variety of datasets.
We use a neat trick to identify strongly predictive features (i.e. words).
In [65]:
words = np.array(vectorizer.get_feature_names())
x = np.eye(xtest.shape[1])
probs = clf.predict_log_proba(x)[:, 0]
ind = np.argsort(probs)
good_words = words[ind[:10]]
bad_words = words[ind[-10:]]
good_prob = probs[ind[:10]]
bad_prob = probs[ind[-10:]]
print("Good words\t P(fresh | word)")
for w, p in zip(good_words, good_prob):
    print("%20s" % w, "%0.2f" % (1 - np.exp(p)))
print("Bad words\t P(fresh | word)")
for w, p in zip(bad_words, bad_prob):
    print("%20s" % w, "%0.2f" % (1 - np.exp(p)))
Your turn: Why does this method work? What does the probability for each row in the identity matrix represent?
This method works because we have made it so that each row in our matrix contains exactly one word (feature). The predicted probability for a row is thus the probability of freshness given that single word. Words with a very high probability correlate very strongly with freshness on their own, and words with very low probabilities indicate a strong correlation with rottenness.
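As a cross-check on that explanation, here is a minimal sketch (assuming the fitted clf and the words array from the cells above, and a scipy recent enough to provide logsumexp) that recomputes the same per-word posteriors directly from MultinomialNB's fitted parameters, i.e. log P(c|word) proportional to log P(c) + log P(word|c):
from scipy.special import logsumexp

# joint log probability of (class, word) for a document consisting of that single word
joint = clf.class_log_prior_[:, np.newaxis] + clf.feature_log_prob_
# normalize over classes to get log P(class | word); row 1 is the "fresh" class
log_post_fresh = joint[1] - logsumexp(joint, axis=0)

top = np.argsort(log_post_fresh)[-10:]
for w, lp in zip(words[top], log_post_fresh[top]):
    print("%20s %0.2f" % (w, np.exp(lp)))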
In [74]:
x, y = make_xy(critics, vectorizer)
prob = clf.predict_proba(x)[:, 0]
predict = clf.predict(x)
bad_rotten = np.argsort(prob[y == 0])[:5]
bad_fresh = np.argsort(prob[y == 1])[-5:]
print("Mis-predicted Rotten quotes")
print('---------------------------')
for row in bad_rotten:
    print(critics[y == 0].quote.iat[row])
    print()
print("Mis-predicted Fresh quotes")
print('--------------------------')
for row in bad_fresh:
    print(critics[y == 1].quote.iat[row])
    print()
In [76]:
#your turn
clf.predict_proba(vectorizer.transform(['This movie is not remarkable, touching, or superb in any way']))
Out[76]:
This classifier gives a 98.6% probability of being fresh, despite the fact that the sentence is clearly negative. The word 'not' should negate all the positive adjectives which follow, but our simple bag-of-words approach doesn't have any way of dealing with this, and simply takes the positive features as is. Thus, a completely naive approach fails when confronted with the subtleties of language.
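One rough way to capture some negation (a sketch, not a full fix; it reuses the best_min_df, best_alpha, and mask found above) is to include bigrams in the vocabulary, so that features like "not remarkable" can be learned separately from "remarkable":
bigram_vectorizer = CountVectorizer(min_df=best_min_df, ngram_range=(1, 2))
Xb, yb = make_xy(critics, bigram_vectorizer)
clf_bigram = MultinomialNB(alpha=best_alpha).fit(Xb[mask], yb[mask])
print("Bigram test accuracy: %0.2f%%" % (100 * clf_bigram.score(Xb[~mask], yb[~mask])))
print(clf_bigram.predict_proba(
    bigram_vectorizer.transform(['This movie is not remarkable, touching, or superb in any way'])))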
There are many things worth trying. Some examples:
We could use TF-IDF instead. What is this? It stands for Term-Frequency x Inverse-Document-Frequency.
In the standard CountVectorizer model above, we used just the term frequency in a document of words in our vocabulary. In TF-IDF, we weigh this term frequency by the inverse of its popularity across all documents. For example, if the word "movie" showed up in all the documents, it would not have much predictive value. By weighing its counts by 1 divided by its overall frequency, we down-weight it. We can then use these tf-idf weighted features as inputs to any classifier.
In [77]:
#http://scikit-learn.org/dev/modules/feature_extraction.html#text-feature-extraction
#http://scikit-learn.org/dev/modules/classes.html#text-feature-extraction-ref
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(min_df=1, stop_words='english')
Xtfidf=tfidfvectorizer.fit_transform(critics.quote)
Your turn (extra credit): Try a few of these ideas to improve the model (or any other ideas of your own). Implement here and report on the result.
In [78]:
Xtfidf[0].toarray()
Out[78]:
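As one possible starting point for the extra credit (a sketch that reuses the train/test mask, best_min_df, and best_alpha from above), feed tf-idf features into the same MultinomialNB and compare accuracies:
Xt, yt = make_xy(critics, TfidfVectorizer(min_df=best_min_df, stop_words='english'))
clf_tfidf = MultinomialNB(alpha=best_alpha).fit(Xt[mask], yt[mask])
print("TF-IDF train accuracy: %0.2f%%" % (100 * clf_tfidf.score(Xt[mask], yt[mask])))
print("TF-IDF test accuracy: %0.2f%%" % (100 * clf_tfidf.score(Xt[~mask], yt[~mask])))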
In [ ]: