Homework 3. Bayesian Tomatoes

Due Thursday, October 17, 11:59pm

In this assignment, you'll be analyzing movie reviews from Rotten Tomatoes. This assignment will cover:

  • Working with web APIs
  • Making and interpreting predictions from a Bayesian perspective
  • Using the Naive Bayes algorithm to predict whether a movie review is positive or negative
  • Using cross validation to optimize models

Useful libraries for this assignment


In [3]:
%matplotlib inline

import json

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup as bs

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)

# set some nicer defaults for matplotlib
from matplotlib import rcParams
from cycler import cycler

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.prop_cycle'] = cycler('color', dark2_colors)  # color_cycle is deprecated in matplotlib >= 1.5
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'


def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()



Introduction

Rotten Tomatoes gathers movie reviews from critics. An entry on the website typically consists of a short quote, a link to the full review, and a Fresh/Rotten classification which summarizes whether the critic liked/disliked the movie.

When critics give quantitative ratings (say 3/4 stars, Thumbs up, etc.), determining the Fresh/Rotten classification is easy. However, publications like the New York Times don't assign numerical ratings to movies, and thus the Fresh/Rotten classification must be inferred from the text of the review itself.

This basic task of categorizing text has many applications. All of the following questions boil down to text classification:

  • Is a movie review positive or negative?
  • Is an email spam, or not?
  • Is a comment on a blog discussion board appropriate, or not?
  • Is a tweet about your company positive, or not?

Language is incredibly nuanced, and there is an entire field of computer science dedicated to the topic (Natural Language Processing). Nevertheless, we can construct basic language models using fairly straightforward techniques.

The Data

You will be starting with a database of Movies, derived from the MovieLens dataset. This dataset includes information for about 10,000 movies, including the IMDB id for each movie.

Your first task is to download Rotten Tomatoes reviews from 3000 of these movies, using the Rotten Tomatoes API (Application Programming Interface).

Working with Web APIs

Web APIs give programs a more convenient way to interact with websites than scraping raw HTML. Rotten Tomatoes has a nice API that exposes its data in JSON format.

To use this, you will first need to register for an API key. For "application URL", you can use anything -- it doesn't matter.

After you have a key, the documentation page shows the various data you can fetch from Rotten Tomatoes -- each type of data lives at a different web address. The basic pattern for fetching this data with Python is as follows (compare this to the Movie Reviews tab on the documentation page). Note that the Rotten Tomatoes API is no longer available, which is why the example below is kept as a non-executed string and the solution in Part 1 scrapes the review pages instead:


In [2]:
""""
api_key = 'PUT YOUR KEY HERE'
movie_id = '770672122'  # toy story 3
url = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json' % movie_id

#these are "get parameters"
options = {'review_type': 'top_critic', 'page_limit': 20, 'page': 1, 'apikey': api_key}
data = requests.get(url, params=options).text
data = json.loads(data)  # load a json string into a collection of lists and dicts

print json.dumps(data['reviews'][0], indent=2)  # dump an object into a json string
"""



Part 1: Get the data

Here's a chunk of the MovieLens Dataset:


In [3]:
from io import StringIO  
movie_txt = requests.get('https://raw.github.com/cs109/cs109_data/master/movies.dat').text
movie_file = StringIO(movie_txt) # treat a string like a file
movies = pd.read_csv(movie_file, delimiter='\t')

#the first row
movies[['id', 'title', 'imdbID', 'year']].iloc[0]
print movies.shape
movies = movies.dropna()
movies = movies.reset_index(drop = True)
print movies.shape
movies.head()
movies.dtypes


(9423, 21)
(8975, 21)
Out[3]:
id                         int64
title                     object
imdbID                     int64
spanishTitle              object
imdbPictureURL            object
year                       int64
rtID                      object
rtAllCriticsRating        object
rtAllCriticsNumReviews    object
rtAllCriticsNumFresh      object
rtAllCriticsNumRotten     object
rtAllCriticsScore         object
rtTopCriticsRating        object
rtTopCriticsNumReviews    object
rtTopCriticsNumFresh      object
rtTopCriticsNumRotten     object
rtTopCriticsScore         object
rtAudienceRating          object
rtAudienceNumRatings      object
rtAudienceScore           object
rtPictureURL              object
dtype: object

P1.1

We'd like you to write a function that looks up the first 20 Top Critic Rotten Tomatoes reviews for a movie in the movies dataframe. This involves two steps:

  1. Use the Movie Alias API to look up the Rotten Tomatoes movie id from the IMDB id
  2. Use the Movie Reviews API to fetch the first 20 top-critic reviews for this movie

Not all movies have Rotten Tomatoes IDs. In these cases, your function should return None. The detailed spec is below. We are giving you some freedom with how you implement this, but you'll probably want to break this task up into several small functions.

Hint In some situations, the leading 0s in front of IMDB ids are important. IMDB ids have 7 digits
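Because imdbID is stored as an integer in the movies dataframe, any leading zeros are lost; step 1 needs to restore them before calling the Movie Alias API. Below is a minimal sketch of what that lookup would have looked like against the (now unavailable) API -- the endpoint, parameter names, and the imdb_to_rtid helper are assumptions based on the old documentation, not part of the scraping solution used below:

import json
import requests

API_KEY = 'PUT YOUR KEY HERE'  # hypothetical key from the registration step above
ALIAS_URL = 'http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json'

def imdb_to_rtid(imdb_id):
    """Look up a Rotten Tomatoes id from an IMDB id, or return None if there is no match."""
    imdb_str = str(imdb_id).zfill(7)  # IMDB ids have 7 digits; restore dropped leading zeros
    options = {'apikey': API_KEY, 'type': 'imdb', 'id': imdb_str}
    data = json.loads(requests.get(ALIAS_URL, params=options).text)
    return data.get('id')  # a missing key means no Rotten Tomatoes match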


In [4]:
"""
Function
--------
fetch_reviews(movies, row)

Use the Rotten Tomatoes web API to fetch reviews for a particular movie

Parameters
----------
movies : DataFrame 
  The movies data above
row : int
  The row of the movies DataFrame to use
  
Returns
-------
If you can match the IMDB id to a Rotten Tomatoes ID:
  A DataFrame, containing the first 20 Top Critic reviews 
  for the movie. If a movie has less than 20 total reviews, return them all.
  This should have the following columns:
    critic : Name of the critic
    fresh  : 'fresh' or 'rotten'
    imdb   : IMDB id for the movie
    publication: Publication that the critic writes for
    quote  : string containing the movie review quote
    review_date: Date of review
    rtid   : Rotten Tomatoes ID for the movie
    title  : Name of the movie
    
If you cannot match the IMDB id to a Rotten Tomatoes ID, return None

Examples
--------
>>> reviews = fetch_reviews(movies, 0)
>>> print len(reviews)
20
>>> print reviews.irow(1)
critic                                               Derek Adams
fresh                                                      fresh
imdb                                                      114709
publication                                             Time Out
quote          So ingenious in concept, design and execution ...
review_date                                           2009-10-04
rtid                                                        9559
title                                                  Toy story
Name: 1, dtype: object
"""
#The RT API is no longer available, so we scrape the review pages instead, using the rtID from the movies data frame
def fetch_reviews(movies, row):
    url  = 'https://www.rottentomatoes.com/m/'+movies.rtID[row]+'/reviews/?type=top_critics'
    
    #fetch the review page; if the connection is refused, give up on this movie
    try:
        data = requests.get(url)
    except requests.exceptions.ConnectionError:
        return None
    
    soup = bs(data.content, 'lxml')
    critic = []
    fresh_rating =[]
    imdb =[]
    publ =[]
    quote = []
    review_data = []
    title =[]
    rt = []
    

    reviews = soup.find_all('div',{'class':'row review_table_row'}, limit =20)
    for review in reviews:
        title.append(movies.title[row])
        imdb.append(movies.imdbID[row])
        rt.append(movies.rtID[row])
        try:
            #critic
            critic.append(review.contents[1].find_all('a', {'class':'unstyled bold articleLink'})[0].text)
        except IndexError:
            critic.append(None)
        try:
            rating = str(review.contents[3].find_all('div', {'class':'review_icon'}))
            if 'fresh' in rating:
                fresh = 'fresh'
            elif 'rotten' in rating:
                fresh = 'rotten'
            else: 
                fresh = None
            fresh_rating.append(fresh)
        except IndexError:
            fresh_rating.append(None)
        try:
            publ.append(review.contents[1].find_all('em', {'class':'subtle'})[0].text)
        except IndexError:
            publ.append(None)
        try:
            quote.append(review.contents[3].find_all('div', {'class':'the_review'})[0].text.strip())
        except IndexError:
            quote.append(None)
        try:
            review_data.append(review.contents[3].find_all('div', {'class':'review_date'})[0].text.strip())
        except IndexError:
            review_data.append(None)

    return pd.DataFrame(dict(critic = critic, fresh = fresh_rating, imdb = imdb, publication = publ, quote = quote, review_date = review_data,rtid = rt,title = title))


reviews = fetch_reviews(movies, 2931)
reviews


Out[4]:
critic fresh imdb publication quote review_date rtid title
0 Walter Goodman fresh 89994 New York Times By and large, the script by Tony Geiss and Jud... May 21, 2003 sesame_street_presents_follow_that_bird Sesame Street Presents: Follow that Bird

P1.2

Use the function you wrote to retrieve reviews for the first 3,000 movies in the movies dataframe.

Hints
  • Rotten Tomatoes limits you to 10,000 API requests a day. Be careful about this limit! Test your code on smaller inputs before scaling. You are responsible if you hit the limit the day the assignment is due :)
  • This will take a while to download. If you don't want to re-run this function every time you restart the notebook, you can save and re-load this data as a CSV file. However, please don't submit this file

In [5]:
"""
Function
--------
build_table

Parameters
----------
movies : DataFrame
  The movies data above
rows : int
  The number of rows to extract reviews for
  
Returns
--------
A dataframe
  The data obtained by repeatedly calling `fetch_reviews` on the first `rows`
  of `movies`, discarding the `None`s,
  and concatenating the results into a single DataFrame
"""
#your code here
def build_table(movies, rows):
    frames = []
    for i in range(rows):
        result = fetch_reviews(movies, i)
        if result is not None:  #discard movies with no Rotten Tomatoes match
            frames.append(result)
    return pd.concat(frames, ignore_index=True)
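Since Rotten Tomatoes may start refusing connections after many consecutive requests (see the comment in the next cell), one option is to fetch the reviews in smaller chunks and checkpoint each chunk to disk. A minimal sketch, assuming the movies dataframe and fetch_reviews above; the chunk size and file name are arbitrary:

import pandas as pd

def build_table_chunked(movies, rows, chunk_size=500, path='critics_checkpoint.csv'):
    """Fetch reviews in chunks, appending each chunk to a CSV checkpoint."""
    wrote_header = False
    for start in range(0, rows, chunk_size):
        frames = []
        for i in range(start, min(start + chunk_size, rows)):
            result = fetch_reviews(movies, i)
            if result is not None:  # discard movies with no Rotten Tomatoes match
                frames.append(result)
        if not frames:
            continue
        chunk = pd.concat(frames, ignore_index=True)
        # write the header only for the first chunk, then append
        chunk.to_csv(path, mode='a', header=not wrote_header, index=False, encoding='utf-8')
        wrote_header = True
    return pd.read_csv(path)

(If the checkpoint file already exists from an earlier run, delete it first, since each call appends to it.)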

In [2]:
#you can toggle which lines are commented, if you
#want to re-load your results to avoid repeatedly calling this function

#critics = build_table(movies, 3000) # may hit connection errors: RT starts refusing requests after
                                     # roughly 1,500 in a row, so try retrieving the data in smaller chunks instead
#critics.to_csv('critics.csv', index=False, encoding='utf-8')
critics = pd.read_csv('critics.csv')
critics[critics.fresh == 'fresh'].count()



In [1]:
#for this assignment, let's drop rows with missing data
critics = critics[~critics.quote.isnull()]
#critics = critics[critics.fresh != None]
critics = critics[critics.fresh.notnull()]
critics = critics[critics.quote.str.len() > 0]
critics = critics.reset_index(drop = True)
critics.shape



A quick sanity check that everything looks ok at this point


In [8]:
assert set(critics.columns) == set('critic fresh imdb publication quote review_date rtid title'.split())
assert len(critics) > 10000

Part 2: Explore

Before delving into analysis, get a sense of what these data look like. Answer the following questions. Include your code!

2.1 How many reviews, critics, and movies are in this dataset?


In [9]:
#your code here
print 'No of reviews: %s'%critics.quote.size
print 'No of critics: %s'%critics.critic.nunique()
print 'No of movies: %s'%critics.title.nunique()


No of reviews: 15924
No of critics: 682
No of movies: 1590

2.2 What does the distribution of number of reviews per reviewer look like? Make a histogram


In [10]:
#Your code here

def histogram_style():
    remove_border(left=False)
    plt.grid(False)
    plt.grid(axis='y', color='w', linestyle='-', lw=1)

#critics.groupby('critic').size().hist(log=True, bins=range(20), edgecolor='white')
plt.hist(critics.groupby('critic').rtid.count(),log = True, bins=range(20), edgecolor='white')
plt.xlabel("Number of reviews per critic")
plt.ylabel("N")
histogram_style()


2.3 List the 5 critics with the most reviews, along with the publication they write for


In [11]:
#Your code here
counts  =critics.groupby(['critic','publication']).critic.count()
counts = counts.sort_values(ascending = False)
counts[:5]


Out[11]:
critic              publication      
Roger Ebert         Chicago Sun-Times    1051
James Berardinelli  ReelViews             805
Janet Maslin        New York Times        549
Desson Thomson      Washington Post       486
Jonathan Rosenbaum  Chicago Reader        423
Name: critic, dtype: int64

2.4 Of the critics with > 100 reviews, plot the distribution of average "freshness" rating per critic


In [12]:
#Your code here
df = critics.copy()
df['fresh'] = df.fresh == 'fresh'

grp = df.groupby('critic')
counts = grp.rtid.count()
means = grp.fresh.mean()
means[counts > 100].hist(bins=10, edgecolor='w', lw=1)
plt.xlabel("Average rating per critic")
plt.ylabel("N")
plt.yticks([0, 2, 4, 6, 8, 10])
histogram_style()


2.5 Using the original movies dataframe, plot the rotten tomatoes Top Critics Rating as a function of year. Overplot the average for each year, ignoring the score=0 examples (some of these are missing data). Comment on the result -- is there a trend? What do you think it means?


In [13]:
#Your code here
data = movies[['year','rtTopCriticsRating']]
#data = data.convert_objects(convert_numeric=True) #deprecated
data = data.apply(pd.to_numeric)
data = data[(data.rtTopCriticsRating > 0.00)]
means  = data.groupby('year').mean()

plt.plot(data['year'], data['rtTopCriticsRating'], 'o', mec = 'none', alpha = .5, label = 'Data' )
plt.plot(means.index, means['rtTopCriticsRating'], '-',  label = 'Yearly Average' )
plt.legend(loc ='lower left', frameon = False)
plt.xlabel("Year")
plt.ylabel("Average Score")
remove_border()


The average rating appears to be on a downward trend. The plot also shows that far more ratings were given in the last 20 years or so than before. Questions worth exploring: does the downward trend mean that

  • the quality of movies has dropped?
  • top critics have become more stringent in their ratings?
  • more top critics with stringent standards have entered the scene?
  • there is a selection bias? Rotten Tomatoes probably doesn't archive reviews for all movies, especially ones that came out before the website existed. Thus, reviews of old movies are more often "the classics". Mediocre old movies have been partially forgotten, and are underrepresented in the data.

Part 3: Sentiment Analysis

You will now use a Naive Bayes classifier to build a prediction model for whether a review is fresh or rotten, depending on the text of the review. See Lecture 9 for a discussion of Naive Bayes.

Most models work with numerical data, so we need to convert the textual collection of reviews to something numerical. A common strategy for text classification is to represent each review as a "bag of words" vector -- a long vector of numbers encoding how many times a particular word appears in a blurb.

Scikit-learn has an object called a CountVectorizer that turns text into a bag of words. Here's a quick tutorial:


In [14]:
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hop on pop', 'Hop off pop', 'Hop Hop hop']
print "Original text is\n", '\n'.join(text)

vectorizer = CountVectorizer(min_df=0)

# call `fit` to build the vocabulary
vectorizer.fit(text)

# call `transform` to convert text to a bag of words
x = vectorizer.transform(text)

# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to 
# convert back to a "normal" numpy array
x = x.toarray()

print
print "Transformed text vector is \n", x

# `get_feature_names` tracks which word is associated with each column of the transformed x
print
print "Words for each feature:"
print vectorizer.get_feature_names()

# Notice that the bag of words treatment doesn't preserve information about the *order* of words, 
# just their frequency


Original text is
Hop on pop
Hop off pop
Hop Hop hop

Transformed text vector is 
[[1 0 1 1]
 [1 1 0 1]
 [3 0 0 0]]

Words for each feature:
[u'hop', u'off', u'on', u'pop']

3.1

Using the critics dataframe, compute a pair of numerical X, Y arrays where:

  • X is a (nreview, nwords) array. Each row corresponds to a bag-of-words representation for a single review. This will be the input to your model.
  • Y is a nreview-element 1/0 array, encoding whether a review is Fresh (1) or Rotten (0). This is the desired output from your model.

In [15]:
#hint: Consult the scikit-learn documentation to
#      learn about what these classes do
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB

"""
Function
--------
make_xy

Build a bag-of-words training set for the review data

Parameters
-----------
critics : Pandas DataFrame
    The review data from above
    
vectorizer : CountVectorizer object (optional)
    A CountVectorizer object to use. If None,
    then create and fit a new CountVectorizer.
    Otherwise, re-fit the provided CountVectorizer
    using the critics data
    
Returns
-------
X : numpy array (dims: nreview, nwords)
    Bag-of-words representation for each review.
Y : numpy array (dims: nreview)
    1/0 array. 1 = fresh review, 0 = rotten review

Examples
--------
X, Y = make_xy(critics)
"""
def make_xy(critics, vectorizer=None):
    #Your code here    
    reviews=  critics['quote'] 
    
    if vectorizer is None:
        vectorizer = CountVectorizer(min_df=0)
    vectorizer.fit(reviews)
    X = vectorizer.transform(reviews)
    #X = X.toarray()  
    X = X.tocsc()
    Y = (critics.fresh == 'fresh').values.astype(np.int)
    return X, Y

In [16]:
X, Y = make_xy(critics)

3.2 Next, randomly split the data into two groups: a training set and a validation set.

Use the training set to train a MultinomialNB classifier, and print the accuracy of this model on the validation set

Hint You can use train_test_split to split up the training data


In [17]:
#Your code here

xtrain, xtest,ytrain,ytest = train_test_split(X, Y)

clf = MultinomialNB()
clf.fit(xtrain, ytrain)

print "Accuracy: %0.2f%%"% (100*clf.score(xtest, ytest))


Accuracy: 79.30%

3.3:

We say a model is overfit if it performs better on the training data than on the test data. Is this model overfit? If so, how much more accurate is the model on the training data compared to the test data?


In [18]:
# Your code here. Print the accuracy on the test and training dataset
print "Accuracy on training data: %0.2f%%"% (100*clf.score(xtrain, ytrain))
print "Accuracy on test data: %0.2f%%"% (100*clf.score(xtest, ytest))


Accuracy on training data: 92.90%
Accuracy on test data: 79.30%

Some overfitting seems to be happening here, as the error rate on the test data (~20%) is more than twice the error rate on the training data (~7%). It is possible that the accuracy difference is due to random chance rather than overfitting, but that is unlikely. This can be tested by cross-validation, i.e. repeatedly evaluating the model on different train/test splits. If the training accuracy is consistently higher than the test accuracy, then overfitting has occurred.
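One quick way to check this before the full cross-validation section below is to repeat the random train/test split a few times and see whether the gap persists. A minimal sketch, assuming the X, Y arrays from 3.1 (the number of repeats is arbitrary):

from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB

# a gap that persists across several random splits points to genuine overfitting,
# rather than an unlucky split
for seed in range(5):
    xtr, xte, ytr, yte = train_test_split(X, Y, random_state=seed)
    model = MultinomialNB().fit(xtr, ytr)
    print "seed %d  train: %0.2f%%  test: %0.2f%%" % (seed, 100 * model.score(xtr, ytr), 100 * model.score(xte, yte))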

3.4: Model Calibration

Bayesian models like the Naive Bayes classifier have the nice property that they compute probabilities of a particular classification -- the predict_proba and predict_log_proba methods of MultinomialNB compute these probabilities.

Being the respectable Bayesian that you are, you should always assess whether these probabilities are calibrated -- that is, whether a prediction made with a confidence of x% is correct approximately x% of the time. We care about calibration because it tells us whether we can trust the probabilities computed by a model. If we can trust model probabilities, we can make better decisions using them (for example, we can calculate how much we should bet or invest in a given prediction).

Let's make a plot to assess model calibration. Schematically, we want something like this:

In words, we want to:

  • Take a collection of examples, and compute the freshness probability for each using clf.predict_proba
  • Gather examples into bins of similar freshness probability (the diagram shows 5 groups -- you should use something closer to 20)
  • For each bin, count the number of examples in that bin, and compute the fraction of examples in the bin which are fresh
  • In the upper plot, graph the expected P(Fresh) (x axis) and observed freshness fraction (Y axis). Estimate the uncertainty in observed freshness fraction $F$ via the equation $\sigma = \sqrt{F (1-F) / N}$
  • Overplot the line y=x. This is the trend we would expect if the model is calibrated
  • In the lower plot, show the number of examples in each bin

Hints

The output of clf.predict_proba(X) is a (N example, 2) array. The first column gives the probability $P(Y=0)$ or $P(Rotten)$, and the second gives $P(Y=1)$ or $P(Fresh)$.

The above image is just a guideline -- feel free to explore other options!


In [19]:
"""
Function
--------
calibration_plot

Builds a plot like the one above, from a classifier and review data

Inputs
-------
clf : Classifier object
    A MultinomialNB classifier
X : (Nexample, Nfeature) array
    The bag-of-words data
Y : (Nexample) integer array
    1 if a review is Fresh
"""    
def calibration_plot(clf, xtest, ytest):

    pfresh = clf.predict_proba(xtest)[:,1]
    data = pd.DataFrame(dict(prob = pfresh, outcome = ytest))


    # group in to bins of similar probabilities
    bins = np.linspace(0,1,20)
    prob_bins = pd.cut(data.prob, bins)
    binwidth = bins[1] - bins[0]

    #freshness ratio and number of examples in each bin
    cal = data.groupby(prob_bins).outcome.agg(['mean', 'count'])
    cal['pmid'] = (bins[:-1] + bins[1:]) / 2
    cal['sig'] = np.sqrt(cal.pmid * (1 - cal.pmid) / cal['count'])

    #the calibration plot
    ax = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
    p = plt.errorbar(cal.pmid, cal['mean'], cal['sig'])
    plt.plot(cal.pmid, cal.pmid, linestyle='--', lw=1, color='k')
    plt.ylabel("Empirical P(Fresh)")
    remove_border(ax)

    #the distribution of P(fresh)
    ax = plt.subplot2grid((3, 1), (2, 0), sharex=ax)

    plt.bar(left=cal.pmid - binwidth / 2, height=cal['count'],
        width=.95 * (bins[1] - bins[0]),
        fc=p[0].get_color())

    plt.xlabel("Predicted P(Fresh)")
    remove_border()
    plt.ylabel("Number")

In [20]:
calibration_plot(clf, xtest, ytest)



3.5 We might say a model is over-confident if the freshness fraction is usually closer to 0.5 than expected (that is, there is more uncertainty than the model predicted). Likewise, a model is under-confident if the probabilities are usually further away from 0.5. Is this model generally over- or under-confident?

Your Answer Here

Cross Validation

Our classifier has a few free parameters. The two most important are:

  1. The min_df keyword in CountVectorizer, which ignores words that appear in less than a min_df fraction of reviews. Words that appear only once or twice can lead to overfitting, since words which occur only a few times might correlate very well with Fresh/Rotten reviews by chance in the training dataset.

  2. The alpha keyword in the Bayesian classifier is a "smoothing parameter" -- increasing the value decreases the sensitivity to any single feature, and tends to pull prediction probabilities closer to 50%.
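To make the effect of alpha concrete: MultinomialNB estimates each word's conditional probability as a smoothed frequency, roughly $P(w \mid c) = (N_{wc} + \alpha) / (N_c + \alpha V)$, where $N_{wc}$ is the word's count in class $c$, $N_c$ the total word count in that class, and $V$ the vocabulary size; larger alpha pulls rare words toward the uniform rate $1/V$. A tiny illustration with made-up counts (the numbers are arbitrary, not taken from the review data):

# a word seen 2 times among 1000 fresh-review tokens, with a 20,000-word vocabulary
word_count, class_total, vocab_size = 2.0, 1000.0, 20000.0

for alpha in [0.1, 1.0, 10.0]:
    p = (word_count + alpha) / (class_total + alpha * vocab_size)
    print "alpha = %5.1f   P(word | fresh) = %.6f" % (alpha, p)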

As discussed in lecture and HW2, a common technique for choosing appropriate values for these parameters is cross-validation. Let's choose good parameters by maximizing the cross-validated log-likelihood.

3.6 Using clf.predict_log_proba, write a function that computes the log-likelihood of a dataset


In [21]:
"""
Function
--------
log_likelihood

Compute the log likelihood of a dataset according to a bayesian classifier. 
The Log Likelihood is defined by

L = Sum_fresh(logP(fresh)) + Sum_rotten(logP(rotten))

Where Sum_fresh indicates a sum over all fresh reviews, 
and Sum_rotten indicates a sum over rotten reviews
    
Parameters
----------
clf : Bayesian classifier
x : (nexample, nfeature) array
    The input data
y : (nexample) integer array
    Whether each review is Fresh
"""
#your code here

def log_likelihood(clf,x,y):
    prob = clf.predict_log_proba(x)
    fresh = y == 1 # generates an array with True when fresh and False when rotten
    rotten = ~fresh #fresh inverted
    return  prob[rotten, 0].sum() + prob[fresh, 1].sum()
    
log_likelihood(clf,xtest,ytest)


Out[21]:
-2098.5180722595032

Here's a function to estimate the cross-validated value of a scoring function, given a classifier and data


In [22]:
from sklearn.cross_validation import KFold
from sklearn.cross_validation import cross_val_score

def cv_score(clf, x, y, score_func):
    """
    Uses 5-fold cross validation to estimate a score of a classifier
    
    Inputs
    ------
    clf : Classifier object
    x : Input feature vector
    y : Input class labels
    score_func : Function like log_likelihood, that takes (clf, x, y) as input,
                 and returns a score
                 
    Returns
    -------
    The average score obtained by randomly splitting (x, y) into training and 
    test sets, fitting on the training set, and evaluating score_func on the test set
    
    Examples
    cv_score(clf, x, y, log_likelihood)
    """
    result = 0
    nfold = 5
    for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
        clf.fit(x[train], y[train]) # fit
        result += score_func(clf, x[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average

print cv_score(clf, xtrain, ytrain, log_likelihood)
# as a side note, this function is built in to newer versions of sklearn. We could just write
# sklearn.cross_validation.cross_val_score(clf, x, y, scoring=log_likelihood).

#print cross_val_score(clf, xtrain, ytrain, cv =5,scoring = log_likelihood).sum()/5


-1227.21589049

3.7

Fill in the remaining code in this block, to loop over many values of alpha and min_df to determine which settings are "best" in the sense of maximizing the cross-validated log-likelihood


In [38]:
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
max_loglike = -np.inf

for alpha in alphas:
    for min_df in min_dfs:         
        vectorizer = CountVectorizer(min_df = min_df)       
        X, Y = make_xy(critics, vectorizer)

        clf = MultinomialNB(alpha = alpha)
        clf.fit(X, Y)
        loglike = cv_score(clf, X, Y, log_likelihood)
        #print "Loglike: %.4f"%loglike, "alpha: %.6f"%alpha, "min_df: %.6f"%min_df
        if loglike > max_loglike:
            max_loglike = loglike
            best_alpha, best_min_df = alpha, min_df

## At first glance it may look like the above could be implemented with GridSearchCV, but that doesn't work directly
## here, since the training data itself is generated using one of the parameters (min_df) that we are optimizing over
## (wrapping the vectorizer and classifier in a Pipeline would be one way to make GridSearchCV applicable).

In [24]:
print "alpha: %f" % best_alpha
print "min_df: %f" % best_min_df


alpha: 1.000000
min_df: 0.000100

3.8 Now that you've determined values for alpha and min_df that optimize the cross-validated log-likelihood, repeat the steps in 3.1, 3.2, and 3.4 to train a final classifier with these parameters, re-evaluate the accuracy, and draw a new calibration plot.


In [41]:
#Your code here

vectorizer = CountVectorizer(min_df = best_min_df)  
X, Y = make_xy(critics, vectorizer)
xtrain, xtest,ytrain,ytest = train_test_split(X, Y)

clf = MultinomialNB(alpha = best_alpha)
clf.fit(xtrain, ytrain)

calibration_plot(clf, xtest, ytest)


# Your code here. Print the accuracy on the test and training dataset
print "Accuracy on training data: %0.2f%%"% (100*clf.score(xtrain, ytrain))
print "Accuracy on test data: %0.2f%%"% (100*clf.score(xtest, ytest))


Accuracy on training data: 90.99%
Accuracy on test data: 79.93%

3.9 Discuss the various ways in which Cross-Validation has affected the model. Is the new model more or less accurate? Is overfitting better or worse? Is the model more or less calibrated?

The differences are modest. Test accuracy is essentially unchanged (79.93% vs. 79.30% before), while training accuracy dropped from 92.90% to 90.99%, so the train/test gap -- and hence the overfitting -- is somewhat smaller. Whether the calibration improved is hard to judge from the plots alone.

To think about/play with, but not to hand in: What would happen if you tried this again using a function besides the log-likelihood -- for example, the classification accuracy?
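If you do try it, the only change needed is the scoring function handed to cv_score; a minimal sketch:

# classification accuracy with the same (clf, x, y) signature as log_likelihood
def accuracy(clf, x, y):
    return clf.score(x, y)

# e.g. cv_score(clf, X, Y, accuracy), then rerun the grid search in 3.7
# maximizing this score instead of the log-likelihood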

Part 4: Interpretation. What words best predict a fresh or rotten review?

4.1 Using your classifier and the vectorizer.get_feature_names method, determine which words best predict a positive or negative review. Print the 10 words that best predict a "fresh" review, and the 10 words that best predict a "rotten" review. For each word, what is the model's probability of freshness if the word appears one time?

Hints

  • Try computing the classification probability for a feature vector which consists of all 0s, except for a single 1. What does this probability refer to?

  • np.eye generates a matrix where the ith row is all 0s, except for the ith column which is 1.


In [73]:
# Your code here
words = np.array(vectorizer.get_feature_names())

x = np.eye(xtest.shape[1])
probs = clf.predict_proba(x)[:,0] 
ind = np.argsort(probs) # returns sorted indices


good_words = words[ind[:10]]
bad_words = words[ind[-10:]] # slice format: [start : end] --> in this case, last 10 elements

good_probs = probs[ind[:10]]
bad_probs = probs[ind[-10:]]

print "Good Words \t    P(fresh | word)"
for w, p in zip(good_words,good_probs):
    #print "%20s" %w, "%0.2f"% (1- np.exp(p)) # if clf.predict_log_ proba was used
    print "%20s" %w, "%0.2f"% (1-p) #%20s --> right aligned 20 spaces
    
print "Bad Words \t    P(fresh | word)"
for w, p in zip (bad_words,bad_probs): 
    print "%20s" %w, "%0.2f"% (1-p) # 1-p since prob of rotten is chosen. rotten is chosen for the 
                                                #descending order listing


Good Words 	    P(fresh | word)
         brilliantly 0.96
            captures 0.96
             rousing 0.96
              gentle 0.95
              superb 0.95
               equal 0.95
               whale 0.95
              deftly 0.95
            riveting 0.95
             delight 0.95
Bad Words 	    P(fresh | word)
              intent 0.07
               blame 0.07
               tepid 0.07
             witless 0.07
             trailer 0.06
               weren 0.06
            tiresome 0.06
               sadly 0.06
           pointless 0.05
                lame 0.04

4.2

One of the best sources for inspiration when trying to improve a model is to look at examples where the model performs poorly.

Find 5 fresh and rotten reviews where your model performs particularly poorly. Print each review.


In [90]:
#Your code here
x, y = make_xy(critics, vectorizer)
prob  = clf.predict_proba(x)[:,0]

bad_rotten = np.argsort(prob[y == 0])[:5]
bad_fresh = np.argsort(prob[y == 1]) [-5:]

print "5 Mis-predicted rotten reviews (Actually rotten, predicted fresh)"
print "---------------------------"
for row in bad_rotten:
    print critics[y == 0].quote.iloc[row], "(",critics[y == 0].title.iloc[row],")","\n"

print "5 Mis-predicted fresh reviews (Actually fresh, predicted rotten)"
print "---------------------------"
for row in bad_fresh:
    print critics[y == 1].quote.iloc[row] , "(",critics[y == 1].title.iloc[row],")","\n"


5 Mis-predicted rotten reviews (Actually rotten, predicted fresh)
---------------------------
A Star Wars that has not only lost much of its humor and charm but more important a good deal of its innocence, traveling in the process light years away from the shiny first magnitude of its original world. ( Star Wars: Episode V - The Empire Strikes Back ) 

At the center of every swirling storm is a place of placid inertia, safe and still -- and not very exciting. And it's where Affleck and Bullock spend most of their time, floating amiably but never doing enough to truly connect. ( Forces of Nature ) 

Benefits from a lively lead performance by the miscast Denzel Washington but doesn't come within light years of the book, one of the greatest American autobiographies. ( Malcolm X ) 

It's probably safe to say that the British director Peter Greenaway holds the ugliest view of mankind ever put forth by a maker of feature films. ( The Cook the Thief His Wife & Her Lover ) 

With the exception of Miss Streep's performance, the pleasures of Out of Africa are all peripheral -- David Watkin's photography, the landscapes, the shots of animal life -all of which would fit neatly into a National Geographic layout. ( Out of Africa ) 

5 Mis-predicted fresh reviews (Actually fresh, predicted rotten)
---------------------------
Is In the Army Now funny? Yes, Drill Sergeant, Sir! Is it stupid? Yes, sir! Does it kill brain cells? Yes, sirree! ( In the Army Now ) 

This tough-to-peg whodunit keeps you going for two hours, despite a few James Bond-ish (or Jane Bond-ish) turns that play less preposterously than you might assume were they to be divulged. ( Smilla's Sense of Snow ) 

Ultimately, Crash succeeds in spite of itself. Its color war starts to feel obvious and schematic. Its coincidences and cliches become like a pileup on the 405 freeway, but there it is -- you find yourself rubbernecking and can't manage to look away. ( Crash ) 

Consider this the big-screen equivalent of a beach read: Just turn off your brain and wallow in whatever turn-ons -- Whoopi and whoopee -- Stella offers. ( How Stella Got Her Groove Back ) 

En route to this cynical fade-out, Conspiracy Theory does show off enough glossy style ... and incidental cleverness to keep viewers largely hooked. ( Conspiracy Theory ) 

4.3 What do you notice about these mis-predictions? Naive Bayes classifiers assume that every word affects the probability independently of other words. In what way is this a bad assumption? In your answer, report your classifier's Freshness probability for the review "This movie is not remarkable, touching, or superb in any way".


In [91]:
clf.predict_proba(vectorizer.transform(['This movie is not remarkable, touching, or superb in any way']))


Out[91]:
array([[ 0.00123338,  0.99876662]])

Answer Many mis-predictions seem to stem from quotes that use ambivalent language -- quotes along the lines of "this should have been a good movie, but it wasn't". Words like "but", "not", etc. act to negate the sentiment of other words. However, because Naive Bayes treats each word separately, it isn't able to capture these kinds of word interactions. Because the quote "this movie is not remarkable, touching, or superb in any way" contains typically positive words like remarkable/touching/superb, the classifier gives it a P(Fresh) of about 0.999.

4.4 If this was your final project, what are 3 things you would try in order to build a more effective review classifier? What other exploratory or explanatory visualizations do you think might be helpful?

There are many things worth trying. Some examples:

  1. You could try to build a NB model where the features are word pairs instead of single words (a minimal sketch follows this list). This would be smart enough to realize that "not good" and "so good" mean very different things. This technique doesn't scale very well, since these features are much sparser, which makes it harder to detect repeatable patterns.
  2. You could try a model besides NB, that would allow for interactions between words -- for example, a Random Forest classifier.
  3. You could consider adding supplemental features -- information about genre, director, cast, etc.
  4. You could build a visualization that displays the text of reviews and visually encodes each word with size or color to indicate how much that word contributes to P(Fresh). For example, really bad words could show up as big and red, good words as big and green, common words as small and grey, etc.
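As a concrete starting point for suggestion 1, CountVectorizer can emit word pairs directly through its ngram_range parameter. A minimal sketch (the parameter values are illustrative, not tuned):

from sklearn.feature_extraction.text import CountVectorizer

# unigrams plus bigrams, so "not good" and "so good" become distinct features
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=best_min_df)
Xb, Yb = make_xy(critics, bigram_vectorizer)
# then repeat the train/test split, fit, and calibration plot from 3.8 on Xb, Yb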

How to Submit

Restart and run your notebook one last time, to make sure the output from each cell is up to date. To submit your homework, create a folder named lastname_firstinitial_hw3 and place your solutions in the folder. Double check that the file is still called HW3.ipynb, and that it contains your code. Please do not include the critics.csv data file, if you created one. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work!

