Homework 3. Bayesian Tomatoes

Due Thursday, October 17, 11:59pm

In this assignment, you'll be analyzing movie reviews from Rotten Tomatoes. This assignment will cover:

  • Working with web APIs
  • Making and interpreting predictions from a Bayesian perspective
  • Using the Naive Bayes algorithm to predict whether a movie review is positive or negative
  • Using cross validation to optimize models

Useful libraries for this assignment


In [207]:
%matplotlib inline

import json

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)

# set some nicer defaults for matplotlib
from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'


def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()

In [208]:
pd.version.version


Out[208]:
'0.14.0'

Introduction

Rotten Tomatoes gathers movie reviews from critics. An entry on the website typically consists of a short quote, a link to the full review, and a Fresh/Rotten classification which summarizes whether the critic liked/disliked the movie.

When critics give quantitative ratings (say 3/4 stars, Thumbs up, etc.), determining the Fresh/Rotten classification is easy. However, publications like the New York Times don't assign numerical ratings to movies, and thus the Fresh/Rotten classification must be inferred from the text of the review itself.

This basic task of categorizing text has many applications. All of the following questions boil down to text classification:

  • Is a movie review positive or negative?
  • Is an email spam, or not?
  • Is a comment on a blog discussion board appropriate, or not?
  • Is a tweet about your company positive, or not?

Language is incredibly nuanced, and there is an entire field of computer science dedicated to the topic (Natural Language Processing). Nevertheless, we can construct basic language models using fairly straightforward techniques.

The Data

You will be starting with a database of Movies, derived from the MovieLens dataset. This dataset includes information for about 10,000 movies, including the IMDB id for each movie.

Your first task is to download Rotten Tomatoes reviews from 3000 of these movies, using the Rotten Tomatoes API (Application Programming Interface).

Working with Web APIs

Web APIs are a more convenient way for programs to interact with websites. Rotten Tomatoes has a nice API that gives access to its data in JSON format.

To use this, you will first need to register for an API key. For "application URL", you can use anything -- it doesn't matter.

After you have a key, the documentation page shows the various data you can fetch from Rotten Tomatoes -- each type of data lives at a different web address. The basic pattern for fetching this data with Python is as follows (compare this to the Movie Reviews tab on the documentation page):


In [209]:
api_key = 'en3yzpn423n4q9ppmysy49yq'
movie_id = '770672122'  # toy story 3
url = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json' % movie_id

#these are "get parameters"
options = {'review_type': 'top_critic', 'page_limit': 20, 'page': 1, 'apikey': api_key}
data = requests.get(url, params=options).text
data = json.loads(data)  # load a json string into a collection of lists and dicts

print json.dumps(data['reviews'][0], indent=2)  # dump an object into a json string
#data


{
  "publication": "Village Voice", 
  "links": {
    "review": "http://www.villagevoice.com/2010-06-15/film/toys-are-us-in-toy-story-3/full/"
  }, 
  "quote": "When teenaged Andy plops down on the grass to share his old toys with a shy little girl, the film spikes with sadness and layered pleasure -- a concise, deeply wise expression of the ephemeral that feels real and yet utterly transporting.", 
  "freshness": "fresh", 
  "critic": "Eric Hynes", 
  "date": "2013-08-04"
}

Part 1: Get the data

Here's a chunk of the MovieLens Dataset:


In [227]:
from io import StringIO  
movie_txt = requests.get('https://raw.github.com/cs109/cs109_data/master/movies.dat').text
movie_file = StringIO(movie_txt) # treat a string like a file
movies = pd.read_csv(movie_file, delimiter='\t')

#print the first row
movies[['id', 'title', 'imdbID', 'year']].irow(0)
movies.head()


Out[227]:
id title imdbID spanishTitle imdbPictureURL year rtID rtAllCriticsRating rtAllCriticsNumReviews rtAllCriticsNumFresh rtAllCriticsNumRotten rtAllCriticsScore rtTopCriticsRating rtTopCriticsNumReviews rtTopCriticsNumFresh rtTopCriticsNumRotten rtTopCriticsScore rtAudienceRating rtAudienceNumRatings rtAudienceScore rtPictureURL
0 1 Toy story 114709 Toy story (juguetes) http://ia.media-imdb.com/images/M/MV5BMTMwNDU0... 1995 toy_story 9 73 73 0 100 8.5 17 17 0 100 3.7 102338 81 http://content7.flixster.com/movie/10/93/63/10...
1 2 Jumanji 113497 Jumanji http://ia.media-imdb.com/images/M/MV5BMzM5NjE1... 1995 1068044-jumanji 5.6 28 13 15 46 5.8 5 2 3 40 3.2 44587 61 http://content8.flixster.com/movie/56/79/73/56...
2 3 Grumpy Old Men 107050 Dos viejos gruñones http://ia.media-imdb.com/images/M/MV5BMTI5MTgy... 1993 grumpy_old_men 5.9 36 24 12 66 7 6 5 1 83 3.2 10489 66 http://content6.flixster.com/movie/25/60/25602...
3 4 Waiting to Exhale 114885 Esperando un respiro http://ia.media-imdb.com/images/M/MV5BMTczMTMy... 1995 waiting_to_exhale 5.6 25 14 11 56 5.5 11 5 6 45 3.3 5666 79 http://content9.flixster.com/movie/10/94/17/10...
4 5 Father of the Bride Part II 113041 Vuelve el padre de la novia (Ahora también abu... http://ia.media-imdb.com/images/M/MV5BMTg1NDc2... 1995 father_of_the_bride_part_ii 5.3 19 9 10 47 5.4 5 1 4 20 3 13761 64 http://content8.flixster.com/movie/25/54/25542...

In [211]:
movies[['id', 'title', 'imdbID', 'year']].irow(0)


Out[211]:
id                1
title     Toy story
imdbID       114709
year           1995
Name: 0, dtype: object

In [212]:
movies.irow(0)['id']


Out[212]:
1

P1.1

We'd like you to write a function that looks up the first 20 Top Critic Rotten Tomatoes reviews for a movie in the movies dataframe. This involves two steps:

  1. Use the Movie Alias API to look up the Rotten Tomatoes movie id from the IMDB id
  2. Use the Movie Reviews API to fetch the first 20 top-critic reviews for this movie

Not all movies have Rotten Tomatoes IDs. In these cases, your function should return None. The detailed spec is below. We are giving you some freedom with how you implement this, but you'll probably want to break this task up into several small functions.

Hint In some situations, the leading 0s in front of IMDB ids are important. IMDB ids have 7 digits


In [142]:
"""
Function
--------
fetch_reviews(movies, row)

Use the Rotten Tomatoes web API to fetch reviews for a particular movie

Parameters
----------
movies : DataFrame 
  The movies data above
row : int
  The row of the movies DataFrame to use
  
Returns
-------
If you can match the IMDB id to a Rotten Tomatoes ID:
  A DataFrame, containing the first 20 Top Critic reviews 
  for the movie. If a movie has less than 20 total reviews, return them all.
  This should have the following columns:
    critic : Name of the critic
    fresh  : 'fresh' or 'rotten'
    imdb   : IMDB id for the movie
    publication: Publication that the critic writes for
    quote  : string containing the movie review quote
    review_data: Date of review
    rtid   : Rotten Tomatoes ID for the movie
    title  : Name of the movie
    
If you cannot match the IMDB id to a Rotten Tomatoes ID, return None

Examples
--------
>>> reviews = fetch_reviews(movies, 0)
>>> print len(reviews)
20
>>> print reviews.irow(1)
critic                                               Derek Adams
fresh                                                      fresh
imdb                                                      114709
publication                                             Time Out
quote          So ingenious in concept, design and execution ...
review_date                                           2009-10-04
rtid                                                        9559
title                                                  Toy story
Name: 1, dtype: object
"""

from pandas.io.json import json_normalize

def fetch_reviews(movies, row):
    movie_id = str(movies.irow(row)['imdbID']).zfill(7)
    url = 'http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id=%s&type=imdb&apikey=en3yzpn423n4q9ppmysy49yq' % movie_id
    data = requests.get(url).text
    data = json.loads(data)
    
    if not data.has_key('error'):
        rt_id = data['id']
        url2 = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json?review_type=top_critic&page_limit=20&page=1&country=us&apikey=en3yzpn423n4q9ppmysy49yq' %rt_id
        data2 = requests.get(url2).text
        data2 = json.loads(data2)

        if data2.has_key('reviews'):
            if len(data2['reviews']) > 0:
                df = json_normalize(data2,'reviews')
                df['title'] = data['title']
                df['rtid'] = data['id']
                df['imdb_title'] = movies.irow(row)['title']
                df['imdb'] = movies.irow(row)['imdbID']
                if 'original_score' in df.columns:
                    df.drop(['original_score'],inplace=True,axis=1)
                if 'links' in df.columns:
                    df.drop(['links'],inplace=True,axis=1)
                df.rename(columns={'date': 'review_date', 'freshness': 'fresh'}, inplace=True)
                df = df.reindex_axis(sorted(df.columns), axis=1)
                return df
    else:
        return None

In [289]:
#fetch_reviews(movies, 108)

P1.2

Use the function you wrote to retrieve reviews for the first 3,000 movies in the movies dataframe.

Hints
  • Rotten Tomatoes limits you to 10,000 API requests a day. Be careful about this limit! Test your code on smaller inputs before scaling. You are responsible if you hit the limit the day the assignment is due :)
  • This will take a while to download. If you don't want to re-run this function every time you restart the notebook, you can save and re-load this data as a CSV file. However, please don't submit this file

In [230]:
"""
Function
--------
build_table

Parameters
----------
movies : DataFrame
  The movies data above
rows : int
  The number of rows to extract reviews for
  
Returns
--------
A dataframe
  The data obtained by repeatedly calling `fetch_reviews` on the first `rows`
  of `movies`, discarding the `None`s,
  and concatenating the results into a single DataFrame
"""

def build_table(movies, rows):
    
    l = list(range(rows))
    pd_init = fetch_reviews(movies, 0)
    l.pop(0)

    for num in l:
        print ("Checking Index: " + str(num))
        pd_init = pd.concat([pd_init, fetch_reviews(movies, num)])

    return pd_init

#build_table(movies, 3)

In [145]:
# SCRATCH SPACE

#pd_init.tail()
#pd_init.to_csv("/Users/xbsd/python/rt_movies.csv", index=False)
#pd_init.tail()

movies[movies.title=="The Closer You Get"]
movies[movies.index==2999]
z = fetch_reviews(movies, 2999)
print z


None

In [238]:
# SCRATCH SPACE

row = 108
movie_id = str(movies.irow(row)['imdbID']).zfill(7)
url = 'http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id=%s&type=imdb&apikey=en3yzpn423n4q9ppmysy49yq' % movie_id
data = requests.get(url).text
data = json.loads(data)

rt_id = data['id']

url2 = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json?review_type=top_critic&page_limit=20&page=1&country=us&apikey=en3yzpn423n4q9ppmysy49yq' %rt_id
data2 = requests.get(url2).text
data2 = json.loads(data2)

df = json_normalize(data2,'reviews')
data['id']

df['title'] = "test"
df['title'] = data['title']
df['rtid'] = data['id']
df['imdb'] = movies.irow(row)['title']

df.drop(['links','original_score'],inplace=True,axis=1)
df.rename(columns={'date': 'review_date', 'freshness': 'fresh'}, inplace=True)
#df = df.reindex_axis(sorted(df.columns), axis=1)

#df

In [290]:
#you can toggle which lines are commented, if you
#want to re-load your results to avoid repeatedly calling this function

#critics = build_table(movies, 3000)
#critics.to_csv('critics.csv', index=False)
critics = pd.read_csv('/Users/xbsd/python/rt_movies.csv')

critics.drop(['imdb_title'],inplace=True,axis=1)

#for this assignment, let's drop rows with missing data
critics = critics[-critics.quote.isnull()]
#critics = critics[critics.fresh != 'none']
#critics = critics[critics.quote.str.len() > 0]

In [291]:
#critics.dropna()
#critics[-critics.quote.isnull()].shape

A quick sanity check that everything looks ok at this point


In [239]:
assert set(critics.columns) == set('critic fresh imdb publication '
                                   'quote review_date rtid title'.split())
assert len(critics) > 10000

Part 2: Explore

Before delving into analysis, get a sense of what these data look like. Answer the following questions. Include your code!

2.1 How many reviews, critics, and movies are in this dataset?


In [292]:
#your code here
num_reviews = len(critics)
num_critics = len(critics.critic.unique())
num_movies = len(critics.rtid.unique())

print ("Num Reviews: " + str(num_reviews) + " Num Critics: " + str(num_critics) + " Num Movies: " + str(num_movies))


Num Reviews: 15807 Num Critics: 639 Num Movies: 1910

2.2 What does the distribution of number of reviews per reviewer look like? Make a histogram


In [293]:
#Your code here
def histogram_style():
    remove_border(left=False)
    plt.grid(False)
    plt.grid(axis='y', color='w', linestyle='-', lw=1)

critics.groupby('critic').rtid.count().hist(log=True, bins=range(20), edgecolor='white')
plt.xlabel("Number of reviews per critic")
plt.ylabel("N")
histogram_style()



In [241]:
#Your code here
#critics.critic.hist(figsize=(10,10))
critics.critic.value_counts().plot(kind='bar')


Out[241]:
<matplotlib.axes.AxesSubplot at 0x1d07cf590>

2.3 List the 5 critics with the most reviews, along with the publication they write for


In [242]:
#Your code here
gb = critics.groupby(["critic"])

z = gb.agg({'critic':np.count_nonzero,'publication':np.unique})
top_critics = z.sort(columns="critic", ascending=False)[0:5]

z2 = gb.agg({'critic':np.unique,'publication':np.unique})
top_critics['criticname']=top_critics.index.values

print z2[z2['critic'].isin(list(top_critics.criticname))]
top_critics


                                critic                                        publication
critic                                                                                   
James Berardinelli  James Berardinelli                                          ReelViews
Janet Maslin              Janet Maslin                                     New York Times
Jonathan Rosenbaum  Jonathan Rosenbaum                                     Chicago Reader
Roger Ebert                Roger Ebert  [At the Movies, Chicago Sun-Times, RogerEbert....
Variety Staff            Variety Staff                                            Variety
Out[242]:
critic publication criticname
critic
Roger Ebert 1129 [At the Movies, Chicago Sun-Times, RogerEbert.... Roger Ebert
James Berardinelli 800 ReelViews James Berardinelli
Janet Maslin 525 New York Times Janet Maslin
Variety Staff 446 Variety Variety Staff
Jonathan Rosenbaum 411 Chicago Reader Jonathan Rosenbaum

2.4 Of the critics with > 100 reviews, plot the distribution of average "freshness" rating per critic


In [243]:
#Your code here
gb = critics.groupby(["critic"])
z = gb.agg({'critic':np.count_nonzero,'publication':np.unique, \
            'fresh': lambda x: sum(x=="fresh")})
z['criticname'] = z.index.values
z['freshness'] = z.fresh/z.critic

top_critics = z.sort(columns="critic", ascending=False)
top_critics['criticname'] = top_critics.index.values


li = top_critics.query('critic > 100').index.values

res = z[z['criticname'].isin(list(li))]
#res.sort(columns="freshness",inplace=True)
res.plot(x="criticname", y="freshness", rot=90)


Out[243]:
<matplotlib.axes.AxesSubplot at 0x1d0fe5dd0>

2.5 Using the original movies dataframe, plot the rotten tomatoes Top Critics Rating as a function of year. Overplot the average for each year, ignoring the score=0 examples (some of these are missing data). Comment on the result -- is there a trend? What do you think it means?


In [244]:
sub = movies[['rtTopCriticsRating', 'year']]
sub = sub[(sub.rtTopCriticsRating.values != "\N")]
sub[['rtTopCriticsRating']] = sub[['rtTopCriticsRating']].astype('float')
gb2 = sub.groupby("year")
sub2 = gb2.agg({'rtTopCriticsRating':np.mean})

plt.scatter(x=sub.year, y=sub.rtTopCriticsRating.values, c='r', alpha=0.5)
plt.plot(sub2.index, sub2.rtTopCriticsRating,'r-', c='b', alpha=0.8)


Out[244]:
[<matplotlib.lines.Line2D at 0x1d0f1dcd0>]

Your Comment Here

Part 3: Sentiment Analysis

You will now use a Naive Bayes classifier to build a prediction model for whether a review is fresh or rotten, depending on the text of the review. See Lecture 9 for a discussion of Naive Bayes.

Most models work with numerical data, so we need to convert the textual collection of reviews to something numerical. A common strategy for text classification is to represent each review as a "bag of words" vector -- a long vector of numbers encoding how many times a particular word appears in a blurb.

Scikit-learn has an object called a CountVectorizer that turns text into a bag of words. Here's a quick tutorial:


In [245]:
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hop on pop', 'Hop off pop', 'Hop Hop hop']
print "Original text is\n", '\n'.join(text)

vectorizer = CountVectorizer(min_df=0)

# call `fit` to build the vocabulary
vectorizer.fit(text)

# call `transform` to convert text to a bag of words
x = vectorizer.transform(text)

# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to 
# convert back to a "normal" numpy array
x = x.toarray()

print
print "Transformed text vector is \n", x

# `get_feature_names` tracks which word is associated with each column of the transformed x
print
print "Words for each feature:"
print vectorizer.get_feature_names()

# Notice that the bag of words treatment doesn't preserve information about the *order* of words, 
# just their frequency


Original text is
Hop on pop
Hop off pop
Hop Hop hop

Transformed text vector is 
[[1 0 1 1]
 [1 1 0 1]
 [3 0 0 0]]

Words for each feature:
[u'hop', u'off', u'on', u'pop']

3.1

Using the critics dataframe, compute a pair of numerical X, Y arrays where:

  • X is a (nreview, nwords) array. Each row corresponds to a bag-of-words representation for a single review. This will be the input to your model.
  • Y is a nreview-element 1/0 array, encoding whether a review is Fresh (1) or Rotten (0). This is the desired output from your model.

In [246]:
#hint: Consult the scikit-learn documentation to
#      learn about what these classes do do
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB

"""
Function
--------
make_xy

Build a bag-of-words training set for the review data

Parameters
-----------
critics : Pandas DataFrame
    The review data from above
    
vectorizer : CountVectorizer object (optional)
    A CountVectorizer object to use. If None,
    then create and fit a new CountVectorizer.
    Otherwise, re-fit the provided CountVectorizer
    using the critics data
    
Returns
-------
X : numpy array (dims: nreview, nwords)
    Bag-of-words representation for each review.
Y : numpy array (dims: nreview)
    1/0 array. 1 = fresh review, 0 = rotten review

Examples
--------
X, Y = make_xy(critics)
"""
def make_xy(critics, vectorizer=None):
    #Your code here

    if vectorizer == None:
        vectorizer = CountVectorizer(min_df=0)
    text = critics.quote.ravel()
    vectorizer.fit(text)
    X = vectorizer.transform(text)
    X = X.toarray()
    Y = np.array(1 * (critics.fresh=="fresh"))
    return (X, Y)

In [247]:
X, Y = make_xy(critics, vectorizer = CountVectorizer(min_df = best_min_df))

In [248]:
np.sum(Y - (1 * (critics.fresh=="fresh")))


Out[248]:
0

3.2 Next, randomly split the data into two groups: a training set and a validation set.

Use the training set to train a MultinomialNB classifier, and print the accuracy of this model on the validation set

Hint You can use train_test_split to split up the training data


In [250]:
#Your code here
X_train, X_test, y_train, y_test = train_test_split(X, Y)

clf = MultinomialNB()
clf.fit(X_train, y_train)
predicted_train = clf.predict(X_train)
predicted_test = clf.predict(X_test)
trains=X_train.reshape(1,-1).flatten()
tests=X_test.reshape(1,-1).flatten()

In [251]:
X_test


Out[251]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [252]:
X.shape


Out[252]:
(15048, 2199)

In [253]:
'''
>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2
'''
from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(y_train, predicted_train)
accuracy_test  = accuracy_score(y_test, predicted_test)

print ("Train Accuracy: " + str(accuracy_train) + " Test Accuracy: " + str(accuracy_test))


Train Accuracy: 0.792929292929 Test Accuracy: 0.750664540138

3.3:

We say a model is overfit if it performs better on the training data than on the test data. Is this model overfit? If so, how much more accurate is the model on the training data compared to the test data?


In [254]:
# Your code here. Print the accuracy on the test and training dataset
print ("Train Accuracy: " + str(accuracy_train) + " Test Accuracy: " + str(accuracy_test))
print ("Training Accuracy is better than on the Test Set by " + str(accuracy_train - accuracy_test))


Train Accuracy: 0.792929292929 Test Accuracy: 0.750664540138
Training Accuracy is better than on the Test Set by 0.0422647527911

In [255]:
'''
In [120]: rows = random.sample(df.index, 10)
In [121]: df_10 = df.ix[rows]
'''

# Select Random Samples

fraction = 0.75 # Change this for your examples !!!

num = np.round(len(X)*fraction).astype(int)
z = range(len(X))
ind = np.random.choice(z, num, replace=False)

samp_X = X[ind]
samp_Y = Y[ind]

samp_X_proba = clf.predict_proba(samp_X)

In [255]:

Interpret these numbers in a few sentences here

3.4: Model Calibration

Bayesian models like the Naive Bayes classifier have the nice property that they compute probabilities of a particular classification -- the predict_proba and predict_log_proba methods of MultinomialNB compute these probabilities.

Being the respectable Bayesian that you are, you should always assess whether these probabilities are calibrated -- that is, whether a prediction made with a confidence of x% is correct approximately x% of the time. We care about calibration because it tells us whether we can trust the probabilities computed by a model. If we can trust model probabilities, we can make better decisions using them (for example, we can calculate how much we should bet or invest in a given prediction).

Let's make a plot to assess model calibration. Schematically, we want something like this:

In words, we want to:

  • Take a collection of examples, and compute the freshness probability for each using clf.predict_proba
  • Gather examples into bins of similar freshness probability (the diagram shows 5 groups -- you should use something closer to 20)
  • For each bin, count the number of examples in that bin, and compute the fraction of examples in the bin which are fresh
  • In the upper plot, graph the expected P(Fresh) (x axis) and observed freshness fraction (Y axis). Estimate the uncertainty in observed freshness fraction $F$ via the equation $\sigma = \sqrt{F (1-F) / N}$
  • Overplot the line y=x. This is the trend we would expect if the model is calibrated
  • In the lower plot, show the number of examples in each bin

Hints

The output of clf.predict_proba(X) is a (N example, 2) array. The first column gives the probability $P(Y=0)$ or $P(Rotten)$, and the second gives $P(Y=1)$ or $P(Fresh)$.

The above image is just a guideline -- feel free to explore other options!


In [256]:
df = pd.DataFrame({'samp_X_proba_1':samp_X_proba[:,1], 'samp_Y':samp_Y})
#df.sort(columns="samp_X_proba_1",inplace=True)

bin1 = np.arange(0, 101, 5)/100.

df['bin_num'] = np.digitize(df.samp_X_proba_1, bin1)
gb = df.groupby('bin_num')
gb_agg = gb.agg({'samp_Y':lambda x: np.sum(x == 1), 'bin_num':np.count_nonzero})
gb_agg['fresh_pct'] = gb_agg.samp_Y/gb_agg.bin_num
gb_agg['uncertainty'] = np.sqrt(gb_agg.fresh_pct * (1 - gb_agg.fresh_pct) / gb_agg.bin_num)
gb_agg
plt.plot(gb_agg.index.values, gb_agg.fresh_pct,'ro-')

plt.plot(np.arange(21), np.arange(21)/20.)


Out[256]:
[<matplotlib.lines.Line2D at 0x1d0ff7750>]

In [257]:
plt.hist(df.bin_num, bins=20,rwidth=0.9)


Out[257]:
(array([  958.,   495.,   421.,   376.,   330.,   348.,   330.,   316.,
          372.,   301.,   337.,   340.,   405.,   426.,   444.,   486.,
          541.,   694.,   944.,  2422.]),
 array([  1.  ,   1.95,   2.9 ,   3.85,   4.8 ,   5.75,   6.7 ,   7.65,
          8.6 ,   9.55,  10.5 ,  11.45,  12.4 ,  13.35,  14.3 ,  15.25,
         16.2 ,  17.15,  18.1 ,  19.05,  20.  ]),
 <a list of 20 Patch objects>)

In [258]:
X


Out[258]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [259]:
"""
Function
--------
calibration_plot

Builds a plot like the one above, from a classifier and review data

Inputs
-------
clf : Classifier object
    A MultinomialNB classifier
X : (Nexample, Nfeature) array
    The bag-of-words data
Y : (Nexample) integer array
    1 if a review is Fresh
"""    
#your code here

def calibration_plot(clf, X, Y):
    
    # Prepare the Data

    fraction = 1.0 # Change this for your examples !!!
    num = np.round(len(X)*fraction).astype(int)
    z = range(len(X))
    ind = np.random.choice(z, num, replace=False)

    samp_X = X[ind]
    samp_Y = Y[ind]
    samp_X_proba = clf.predict_proba(samp_X)


    # Prepare the DataFrame

    df = pd.DataFrame({'samp_X_proba_1':samp_X_proba[:,1], 'samp_Y':samp_Y})
    #df.sort(columns="samp_X_proba_1",inplace=True)
    bin1 = np.arange(0, 101, 5)/100.
    df['bin_num'] = np.digitize(df.samp_X_proba_1, bin1)
    gb = df.groupby('bin_num')
    gb_agg = gb.agg({'samp_Y':lambda x: np.sum(x == 1), 'bin_num':np.count_nonzero})
    gb_agg['fresh_pct'] = gb_agg.samp_Y/gb_agg.bin_num
    gb_agg['uncertainty'] = np.sqrt(gb_agg.fresh_pct * (1 - gb_agg.fresh_pct) / gb_agg.bin_num)
    # gb_agg

    # Plot Figures
    
    plt.figure(0)
    plt.plot(gb_agg.index.values, gb_agg.fresh_pct,'ro-')
    plt.plot(np.arange(21), np.arange(21)/20.)

    plt.figure(1)
    plt.hist(df.bin_num, bins=20,rwidth=0.9)

    plt.show
    return None

In [260]:
calibration_plot(clf, X_test, y_test)


3.5 We might say a model is over-confident if the freshness fraction is usually closer to 0.5 than expected (that is, there is more uncertainty than the model predicted). Likewise, a model is under-confident if the probabilities are usually further away from 0.5. Is this model generally over- or under-confident?

Your Answer Here

Cross Validation

Our classifier has a few free parameters. The two most important are:

  1. The min_df keyword in CountVectorizer, which will ignore words which appear in fewer than min_df fraction of reviews. Words that appear only once or twice can lead to overfitting, since words which occur only a few times might correlate very well with Fresh/Rotten reviews by chance in the training dataset.

  2. The alpha keyword in the Bayesian classifier is a "smoothing parameter" -- increasing the value decreases the sensitivity to any single feature, and tends to pull prediction probabilities closer to 50%.

As discussed in lecture and HW2, a common technique for choosing appropriate values for these parameters is cross-validation. Let's choose good parameters by maximizing the cross-validated log-likelihood.

3.6 Using clf.predict_log_proba, write a function that computes the log-likelihood of a dataset


In [261]:
"""
Function
--------
log_likelihood

Compute the log likelihood of a dataset according to a bayesian classifier. 
The Log Likelihood is defined by

L = Sum_fresh(logP(fresh)) + Sum_rotten(logP(rotten))

Where Sum_fresh indicates a sum over all fresh reviews, 
and Sum_rotten indicates a sum over rotten reviews
    
Parameters
----------
clf : Bayesian classifier
x : (nexample, nfeature) array
    The input data
y : (nexample) integer array
    Whether each review is Fresh
"""
#your code here

def log_likelihood(clf, x, y):
    x_proba = clf.predict_log_proba(x)

    prob_rotten = x_proba[:,0]
    prob_fresh = x_proba[:,1]

    df = pd.DataFrame({'y':y, 'logP_fresh':prob_fresh, 'logP_rotten':prob_rotten})
    df['y']=df.y.astype("str")
    gb = df.groupby('y')
    res = gb.agg({'y':np.count_nonzero, 'logP_fresh': np.sum, 'logP_rotten': np.sum})

    logP_fresh  = res[res.index=="1"][['logP_fresh']].values
    logP_rotten = res[res.index=="0"][['logP_rotten']].values

    L = (logP_fresh + logP_rotten).ravel()
    #print ("LogP Fresh: " + str(logP_fresh) + " LogP Rotten: " + \
    #       str(logP_rotten) + " L: " + str(L))

    return L

#x = log_likelihood(clf, X_test, y_test)
#x

In [262]:
x < 10


Out[262]:
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

Here's a function to estimate the cross-validated value of a scoring function, given a classifier and data


In [263]:
from sklearn.cross_validation import KFold

def cv_score(clf, x, y, score_func):
    """
    Uses 5-fold cross validation to estimate a score of a classifier
    
    Inputs
    ------
    clf : Classifier object
    x : Input feature vector
    y : Input class labels
    score_func : Function like log_likelihood, that takes (clf, x, y) as input,
                 and returns a score
                 
    Returns
    -------
    The average score obtained by randomly splitting (x, y) into training and 
    test sets, fitting on the training set, and evaluating score_func on the test set
    
    Examples
    cv_score(clf, x, y, log_likelihood)
    """
    result = 0
    nfold = 5
    for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
        clf.fit(x[train], y[train]) # fit
        result += score_func(clf, x[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average

# --- Very Important -- Existing Functionality ---
# as a side note, this function is builtin to the newest version of sklearn. We could just write
# sklearn.cross_validation.cross_val_score(clf, x, y, scorer=log_likelihood).
# ------------------------------------------------


# cv_score(clf, X, Y, log_likelihood)

3.7

Fill in the remaining code in this block, to loop over many values of alpha and min_df to determine which settings are "best" in the sense of maximizing the cross-validated log-likelihood


In [264]:
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

# alphas = [.1]    # For Testing
# min_dfs = [1e-5] # For Testing

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
max_loglike = -np.inf

i = 0
iterations = len(alphas) * len(min_dfs)

for alpha in alphas:
    for min_df in min_dfs:
        print ("Starting Iteration: " + str(i) + " of " + str(iterations))
        print ("Alpha: " + str(alpha) + " Min DF: " + str(min_df))
        vectorizer = CountVectorizer(min_df = min_df)
        X, Y = make_xy(critics, vectorizer)
        #your code here
        clf = MultinomialNB(alpha=alpha)
        new_score = cv_score(clf, X, Y, log_likelihood)
        if new_score > max_loglike:
            max_loglike = new_score
            best_alpha = alpha
            best_min_df = min_df
        print ("LogLik Score = " + str(new_score))
        print (" ---")
        i = i + 1


Starting Iteration: 0 of 30
Alpha: 0 Min DF: 1e-05
LogLik Score = [ nan]
 ---
Starting Iteration: 1 of 30
Alpha: 0 Min DF: 0.0001
LogLik Score = [ nan]
 ---
Starting Iteration: 2 of 30
Alpha: 0 Min DF: 0.001
LogLik Score = [ nan]
 ---
Starting Iteration: 3 of 30
Alpha: 0 Min DF: 0.01
LogLik Score = [-1898.11491671]
 ---
Starting Iteration: 4 of 30
Alpha: 0 Min DF: 0.1
LogLik Score = [-1989.59368911]
 ---
Starting Iteration: 5 of 30
Alpha: 0.1 Min DF: 1e-05
LogLik Score = [-2493.18492826]
 ---
Starting Iteration: 6 of 30
Alpha: 0.1 Min DF: 0.0001
LogLik Score = [-2492.37906681]
 ---
Starting Iteration: 7 of 30
Alpha: 0.1 Min DF: 0.001
LogLik Score = [-1762.94184335]
 ---
Starting Iteration: 8 of 30
Alpha: 0.1 Min DF: 0.01
LogLik Score = [-1898.02270029]
 ---
Starting Iteration: 9 of 30
Alpha: 0.1 Min DF: 0.1
LogLik Score = [-1989.59327999]
 ---
Starting Iteration: 10 of 30
Alpha: 1 Min DF: 1e-05
LogLik Score = [-1728.4714172]
 ---
Starting Iteration: 11 of 30
Alpha: 1 Min DF: 0.0001
LogLik Score = [-1718.73721745]
 ---
Starting Iteration: 12 of 30
Alpha: 1 Min DF: 0.001
LogLik Score = [-1699.31137285]
 ---
Starting Iteration: 13 of 30
Alpha: 1 Min DF: 0.01
LogLik Score = [-1897.22622562]
 ---
Starting Iteration: 14 of 30
Alpha: 1 Min DF: 0.1
LogLik Score = [-1989.58968387]
 ---
Starting Iteration: 15 of 30
Alpha: 5 Min DF: 1e-05
LogLik Score = [-2496.67053748]
 ---
Starting Iteration: 16 of 30
Alpha: 5 Min DF: 0.0001
LogLik Score = [-1865.42188979]
 ---
Starting Iteration: 17 of 30
Alpha: 5 Min DF: 0.001
LogLik Score = [-1635.75018493]
 ---
Starting Iteration: 18 of 30
Alpha: 5 Min DF: 0.01
LogLik Score = [-1894.33497364]
 ---
Starting Iteration: 19 of 30
Alpha: 5 Min DF: 0.1
LogLik Score = [-1989.57554692]
 ---
Starting Iteration: 20 of 30
Alpha: 10 Min DF: 1e-05
LogLik Score = [-3467.63016835]
 ---
Starting Iteration: 21 of 30
Alpha: 10 Min DF: 0.0001
LogLik Score = [-2550.7832299]
 ---
Starting Iteration: 22 of 30
Alpha: 10 Min DF: 0.001
LogLik Score = [-1640.77581971]
 ---
Starting Iteration: 23 of 30
Alpha: 10 Min DF: 0.01
LogLik Score = [-1891.93110673]
 ---
Starting Iteration: 24 of 30
Alpha: 10 Min DF: 0.1
LogLik Score = [-1989.56199785]
 ---
Starting Iteration: 25 of 30
Alpha: 50 Min DF: 1e-05
LogLik Score = [-4743.14970851]
 ---
Starting Iteration: 26 of 30
Alpha: 50 Min DF: 0.0001
LogLik Score = [-4248.99709581]
 ---
Starting Iteration: 27 of 30
Alpha: 50 Min DF: 0.001
LogLik Score = [-2370.95425129]
 ---
Starting Iteration: 28 of 30
Alpha: 50 Min DF: 0.01
LogLik Score = [-1899.2963795]
 ---
Starting Iteration: 29 of 30
Alpha: 50 Min DF: 0.1
LogLik Score = [-1989.59809826]
 ---

In [265]:
print "alpha: %f" % best_alpha
print "min_df: %f" % best_min_df


alpha: 5.000000
min_df: 0.001000

3.8 Now that you've determined values for alpha and min_df that optimize the cross-validated log-likelihood, repeat the steps in 3.1, 3.2, and 3.4 to train a final classifier with these parameters, re-evaluate the accuracy, and draw a new calibration plot.


In [266]:
#Your code here
X, Y = make_xy(critics, vectorizer = CountVectorizer(min_df = best_min_df))

X_train, X_test, y_train, y_test = train_test_split(X, Y)

clf = MultinomialNB(alpha = best_alpha)
clf.fit(X_train, y_train)
predicted_train = clf.predict(X_train)
predicted_test = clf.predict(X_test)
trains=X_train.reshape(1,-1).flatten()
tests=X_test.reshape(1,-1).flatten()

from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(y_train, predicted_train)
accuracy_test  = accuracy_score(y_test, predicted_test)

print ("Train Accuracy: " + str(accuracy_train) + " Test Accuracy: " + str(accuracy_test))

calibration_plot(clf, X_test, y_test)
log_likelihood(clf, X_test, y_test)
cv_score(clf, X, Y, log_likelihood)


Train Accuracy: 0.786461102251 Test Accuracy: 0.736310473153
Out[266]:
array([-1635.75018493])

In [267]:
log_likelihood(clf, X_test, y_test)


Out[267]:
array([-1807.26436408])

In [268]:
cv_score(clf, X, Y, log_likelihood)


Out[268]:
array([-1635.75018493])

3.9 Discuss the various ways in which Cross-Validation has affected the model. Is the new model more or less accurate? Is overfitting better or worse? Is the model more or less calibrated?

Your Answer Here

To think about/play with, but not to hand in: What would happen if you tried this again using a function besides the log-likelihood -- for example, the classification accuracy?

Part 4: Interpretation. What words best predict a fresh or rotten review?

4.1 Using your classifier and the vectorizer.get_feature_names method, determine which words best predict a positive or negative review. Print the 10 words that best predict a "fresh" review, and the 10 words that best predict a "rotten" review. For each word, what is the model's probability of freshness if the word appears one time?

Hints

  • Try computing the classification probability for a feature vector which consists of all 0s, except for a single 1. What does this probability refer to?

  • np.eye generates a matrix where the ith row is all 0s, except for the ith column which is 1.


In [288]:
# Your code here

def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print "\t%.4f \t%-15s \t\t %.4f \t% -15s" % (coef_1, fn_1, coef_2, fn_2)

n = 10
feature_names = vectorizer.get_feature_names()
coefs_with_fns = sorted(zip(clf.coef_, feature_names))
top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])

show_most_informative_features(vectorizer, clf, n=20)
#coefs_with_fns[:]
top


words = np.array(vectorizer.get_feature_names())

x = np.eye(X_test.shape[1])
probs = clf.predict_log_proba(x)[:, 0]
ind = np.argsort(probs)

[ind[:10]]
vectorizer.get_feature_names()


	-9.5976 	it              		 -5.5612 	that           
	-9.3745 	as              		 -8.1806 	the            
	-9.2792 	and             		 -8.3449 	to             
	-9.2792 	in              		 -8.4190 	of             
	-9.2792 	its             		 -8.5416 	this           
	-9.2792 	with            		 -8.6326 	an             
	-9.1121 	for             		 -8.6814 	but            
	-9.1121 	is              		 -8.8439 	movie          
	-8.9045 	film            		 -8.9045 	film           
	-8.8439 	movie           		 -9.1121 	is             
	-8.6814 	but             		 -9.1121 	for            
	-8.6326 	an              		 -9.2792 	with           
	-8.5416 	this            		 -9.2792 	its            
	-8.4190 	of              		 -9.2792 	in             
	-8.3449 	to              		 -9.2792 	and            
	-8.1806 	the             		 -9.3745 	as             
	-5.5612 	that            		 -9.5976 	it             
Out[288]:
[u'an',
 u'and',
 u'as',
 u'but',
 u'film',
 u'for',
 u'in',
 u'is',
 u'it',
 u'its',
 u'movie',
 u'of',
 u'that',
 u'the',
 u'this',
 u'to',
 u'with']

In [270]:
def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

(clf.coef_.shape)


Out[270]:
(1, 2199)

4.2

One of the best sources for inspiration when trying to improve a model is to look at examples where the model performs poorly.

Find 5 fresh and rotten reviews where your model performs particularly poorly. Print each review.


In [271]:
make_xy(critics)


Out[271]:
(array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]), array([1, 1, 1, ..., 1, 1, 1]))

In [272]:
critics[critics.imdb==110955]
#Y[5039:5052]


Out[272]:
critic fresh imdb publication quote review_date rtid title
5039 Jeff Shannon fresh 110955 Seattle Times It's miraculous casting, and the Australian Da... 2013-12-06 13147 The Ref
5040 Kenneth Turan fresh 110955 Los Angeles Times The Ref benefits from having actor's actors li... 2013-12-06 13147 The Ref
5041 Michael Wilmington rotten 110955 Chicago Tribune It's not a bad idea, but it's not a good movie... 2013-12-06 13147 The Ref
5042 Steven Rea rotten 110955 Philadelphia Inquirer Whether it's a function of sloppy editing or s... 2013-12-06 13147 The Ref
5043 Owen Gleiberman rotten 110955 Entertainment Weekly A foulmouthed sitcom of a film. 2011-09-07 13147 The Ref
5044 Variety Staff rotten 110955 Variety The Ref works virtually none of the miracles o... 2009-03-26 13147 The Ref
5045 Jonathan Rosenbaum fresh 110955 Chicago Reader What makes most of this work is the brio of th... 2007-11-27 13147 The Ref
5046 Geoff Andrew fresh 110955 Time Out In his first starring role, comedian Leary mak... 2006-02-09 13147 The Ref
5047 Caryn James fresh 110955 New York Times Staying clear of any mean-spirited attitudes, ... 2003-05-20 13147 The Ref
5048 Peter Travers fresh 110955 Rolling Stone Demme brings out the comic ease in Leary. 2001-05-12 13147 The Ref
5049 Hal Hinson rotten 110955 Washington Post The Ref is one of those rare movies that seem ... 2000-01-01 13147 The Ref
5050 James Berardinelli rotten 110955 ReelViews This is not a seamlessly constructed movie, bu... 2000-01-01 13147 The Ref
5051 Desson Thomson rotten 110955 Washington Post This is one holiday party you'll want to miss. 2000-01-01 13147 The Ref
5052 Roger Ebert fresh 110955 Chicago Sun-Times Material like this is only as good as the acti... 2000-01-01 13147 The Ref

In [275]:
#Your code here
#Your code here
X, Y = make_xy(critics, vectorizer = CountVectorizer(min_df = best_min_df))

X_train, X_test, y_train, y_test = train_test_split(X, Y)

clf = MultinomialNB(alpha = best_alpha)
clf.fit(X_train, y_train)
all_pred = clf.predict(X)
all_act  = Y
alldata  = critics

yzero = clf.predict_proba(X)[:,0]
yone  = clf.predict_proba(X)[:,1]

temp = pd.DataFrame({'actual':all_act, 'pred': all_pred, 'proba0': yzero, 'proba1':yone})
temp


Out[275]:
actual pred proba0 proba1
0 1 1 0.197385 0.802615
1 1 1 0.154380 0.845620
2 1 1 0.135735 0.864265
3 1 1 0.029535 0.970465
4 1 1 0.013373 0.986627
5 1 1 0.001520 0.998480
6 1 1 0.105968 0.894032
7 1 1 0.044276 0.955724
8 1 1 0.023551 0.976449
9 1 1 0.019731 0.980269
10 1 1 0.291099 0.708901
11 1 1 0.008019 0.991981
12 1 1 0.133066 0.866934
13 1 1 0.339067 0.660933
14 1 1 0.135005 0.864995
15 1 1 0.154506 0.845494
16 1 1 0.027510 0.972490
17 1 1 0.075791 0.924209
18 0 1 0.116994 0.883006
19 1 0 0.719362 0.280638
20 0 0 0.678488 0.321512
21 1 0 0.660961 0.339039
22 1 1 0.035450 0.964550
23 0 0 0.982669 0.017331
24 1 0 0.881747 0.118253
25 0 1 0.333123 0.666877
26 1 1 0.381607 0.618393
27 0 0 0.632378 0.367622
28 1 1 0.080019 0.919981
29 0 0 0.932979 0.067021
... ... ... ... ...
15018 1 1 0.117484 0.882516
15019 1 1 0.328170 0.671830
15020 1 1 0.010874 0.989126
15021 1 1 0.013037 0.986963
15022 1 1 0.220809 0.779191
15023 1 0 0.552798 0.447202
15024 0 1 0.406322 0.593678
15025 1 1 0.267228 0.732772
15026 1 0 0.562191 0.437809
15027 1 0 0.667167 0.332833
15028 0 0 0.502714 0.497286
15029 1 1 0.128733 0.871267
15030 0 1 0.389252 0.610748
15031 0 1 0.243758 0.756242
15032 0 0 0.955301 0.044699
15033 0 0 0.949083 0.050917
15034 0 0 0.956573 0.043427
15035 0 1 0.469129 0.530871
15036 0 1 0.064388 0.935612
15037 0 1 0.431358 0.568642
15038 1 1 0.025647 0.974353
15039 0 0 0.590650 0.409350
15040 1 1 0.120616 0.879384
15041 1 1 0.139223 0.860777
15042 0 1 0.491241 0.508759
15043 1 1 0.198314 0.801686
15044 1 1 0.213662 0.786338
15045 1 1 0.429152 0.570848
15046 1 1 0.047312 0.952688
15047 1 0 0.717020 0.282980

15048 rows × 4 columns


In [279]:
comb = pd.concat([temp, critics], axis=1)
fresh = comb[comb.actual == 1]
#fresh.sort(columns="proba1", inplace=True)
print "Actual Was Fresh, But Predicted Rotten"
fresh[fresh.quote.notnull()].head()
comb


Actual Was Fresh, But Predicted Rotten
Out[279]:
actual pred proba0 proba1 critic fresh imdb publication quote review_date rtid title
0 1 1 0.197385 0.802615 NaN NaN NaN NaN NaN NaN NaN NaN
1 1 1 0.154380 0.845620 Derek Adams fresh 114709 Time Out So ingenious in concept, design and execution ... 2009-10-04 9559 Toy Story
2 1 1 0.135735 0.864265 Richard Corliss fresh 114709 TIME Magazine The year's most inventive comedy. 2008-08-31 9559 Toy Story
3 1 1 0.029535 0.970465 David Ansen fresh 114709 Newsweek A winning animated feature that has something ... 2008-08-18 9559 Toy Story
4 1 1 0.013373 0.986627 Leonard Klady fresh 114709 Variety The film sports a provocative and appealing st... 2008-06-09 9559 Toy Story
5 1 1 0.001520 0.998480 Jonathan Rosenbaum fresh 114709 Chicago Reader An entertaining computer-generated, hyperreali... 2008-03-10 9559 Toy Story
6 1 1 0.105968 0.894032 Michael Booth fresh 114709 Denver Post As Lion King did before it, Toy Story revived ... 2007-05-03 9559 Toy Story
7 1 1 0.044276 0.955724 Geoff Andrew fresh 114709 Time Out The film will probably be more fully appreciat... 2006-06-24 9559 Toy Story
8 1 1 0.023551 0.976449 Janet Maslin fresh 114709 New York Times Children will enjoy a new take on the irresist... 2003-05-20 9559 Toy Story
9 1 1 0.019731 0.980269 Kenneth Turan fresh 114709 Los Angeles Times Although its computer-generated imagery is imp... 2001-02-13 9559 Toy Story
10 1 1 0.291099 0.708901 Susan Wloszczyna fresh 114709 USA Today How perfect that two of the most popular funny... 2000-01-01 9559 Toy Story
11 1 1 0.008019 0.991981 Roger Ebert fresh 114709 Chicago Sun-Times The result is a visionary roller-coaster ride ... 2000-01-01 9559 Toy Story
12 1 1 0.133066 0.866934 John Hartl fresh 114709 Film.com Disney's witty, wondrously imaginative, all-co... 2000-01-01 9559 Toy Story
13 1 1 0.339067 0.660933 Susan Stark fresh 114709 Detroit News Disney's first computer-made animated feature ... 2000-01-01 9559 Toy Story
14 1 1 0.135005 0.864995 Peter Stack fresh 114709 San Francisco Chronicle The script, by Lasseter, Pete Docter, Andrew S... 2000-01-01 9559 Toy Story
15 1 1 0.154506 0.845494 James Berardinelli fresh 114709 ReelViews The one big negative about Toy Story involves ... 2000-01-01 9559 Toy Story
16 1 1 0.027510 0.972490 Sean Means fresh 114709 Film.com Technically, Toy Story is nearly flawless. 2000-01-01 9559 Toy Story
17 1 1 0.075791 0.924209 Rita Kempley fresh 114709 Washington Post It's a nice change of pace to see the studio d... 2000-01-01 9559 Toy Story
18 0 1 0.116994 0.883006 NaN NaN NaN NaN NaN NaN NaN NaN
19 1 0 0.719362 0.280638 Roger Moore fresh 114709 Orlando Sentinel The great voice acting, the visual puns, all a... 1995-11-22 9559 Toy Story
20 0 0 0.678488 0.321512 NaN NaN NaN NaN NaN NaN NaN NaN
21 1 0 0.660961 0.339039 NaN NaN NaN NaN NaN NaN NaN NaN
22 1 1 0.035450 0.964550 NaN NaN NaN NaN NaN NaN NaN NaN
23 0 0 0.982669 0.017331 NaN NaN NaN NaN NaN NaN NaN NaN
24 1 0 0.881747 0.118253 NaN NaN NaN NaN NaN NaN NaN NaN
25 0 1 0.333123 0.666877 NaN NaN NaN NaN NaN NaN NaN NaN
26 1 1 0.381607 0.618393 NaN NaN NaN NaN NaN NaN NaN NaN
27 0 0 0.632378 0.367622 Roger Ebert rotten 113497 Chicago Sun-Times A gloomy special-effects extravaganza filled w... 2000-01-01 12436 Jumanji
28 1 1 0.080019 0.919981 NaN NaN NaN NaN NaN NaN NaN NaN
29 0 0 0.932979 0.067021 NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
26947 NaN NaN NaN NaN Michael Wilmington fresh 165798 Chicago Tribune Ghost Dog is... a delight for those who know h... 2000-01-01 13267 Ghost Dog - The Way of the Samurai
26948 NaN NaN NaN NaN Roger Ebert fresh 165798 Chicago Sun-Times By the end, Whitaker's character has generated... 2000-01-01 13267 Ghost Dog - The Way of the Samurai
26949 NaN NaN NaN NaN Variety Staff fresh 94347 Variety The characters are memorable ones, and beautif... 2007-12-18 770674061 The Year My Voice Broke
26951 NaN NaN NaN NaN Caryn James fresh 94347 New York Times It is so pleasant and unpretentious that we ca... 2004-08-30 770674061 The Year My Voice Broke
26952 NaN NaN NaN NaN Jonathan Rosenbaum fresh 94347 Chicago Reader Although most of this is rather familiar stuff... 2000-01-01 770674061 The Year My Voice Broke
26953 NaN NaN NaN NaN Hal Hinson fresh 94347 Washington Post This isn't an adolescent wish-fulfillment fant... 2000-01-01 770674061 The Year My Voice Broke
26958 NaN NaN NaN NaN Pat Graham rotten 63185 Chicago Reader Robert Aldrich's "daring" 1968 mating of lesbi... 2009-04-24 747403171 The Killing of Sister George
26962 NaN NaN NaN NaN Dave Kehr fresh 40506 Chicago Reader A little windy and rhetorical for my taste, bu... 2008-04-08 18375 Key Largo
26963 NaN NaN NaN NaN Variety Staff fresh 40506 Variety Emphasis is on tension in the telling, and eff... 2008-04-08 18375 Key Largo
26964 NaN NaN NaN NaN Tom Milne fresh 40506 Time Out Although the characters are basically stereoty... 2006-06-24 18375 Key Largo
26965 NaN NaN NaN NaN Bosley Crowther rotten 40506 New York Times The script prepared by Mr. Huston and Richard ... 2006-03-25 18375 Key Largo
26966 NaN NaN NaN NaN Bob Longino fresh 364435 Atlanta Journal-Constitution Disturbing and affecting. 2006-08-31 748169836 Jailbait
26967 NaN NaN NaN NaN Carrie Rickey rotten 364435 Philadelphia Inquirer Claustrophobic and overwrought, Jailbait is an... 2006-08-18 748169836 Jailbait
26968 NaN NaN NaN NaN Frank Scheck rotten 364435 Hollywood Reporter While the stars deliver highly committed perfo... 2006-08-17 748169836 Jailbait
26969 NaN NaN NaN NaN Laura Kern rotten 364435 New York Times A stagy, only mildly compelling prison drama t... 2006-08-04 748169836 Jailbait
26970 NaN NaN NaN NaN Lou Lumenick rotten 364435 New York Post I wouldn't have thought it was possible to mak... 2006-08-04 748169836 Jailbait
26971 NaN NaN NaN NaN Jack Mathews rotten 364435 New York Daily News The cruelty of the law has been better demonst... 2006-08-04 748169836 Jailbait
26972 NaN NaN NaN NaN Jim Ridley rotten 364435 Village Voice ... the umpteenth prison drama to focus on the... 2006-08-02 748169836 Jailbait
26985 NaN NaN NaN NaN Vincent Canby rotten 74695 New York Times Mr. Peckinpah's least interesting, least perso... 2005-05-09 13694 Cross of Iron
26987 NaN NaN NaN NaN Dave Kehr rotten 42276 Chicago Reader George Cukor directed, a little impersonally f... 2008-01-11 18781 Born Yesterday
26989 NaN NaN NaN NaN Bosley Crowther fresh 42276 New York Times More firm in its social implications than ever... 2003-05-20 18781 Born Yesterday
26990 NaN NaN NaN NaN Variety Staff rotten 86969 Variety Belying the lightheartedness of its title, Bir... 2008-09-16 15889 Birdy
26992 NaN NaN NaN NaN Roger Ebert fresh 86969 Chicago Sun-Times A very strange and beautiful movie. 2004-10-23 15889 Birdy
26993 NaN NaN NaN NaN Janet Maslin fresh 86969 New York Times Most of Birdy is enchanting. 2003-05-20 15889 Birdy
26995 NaN NaN NaN NaN Bosley Crowther rotten 49189 New York Times We can't recommend this little item as a sampl... 2006-10-30 11854 ...And God Created Woman
26997 NaN NaN NaN NaN Richard Schickel fresh 86005 TIME Magazine Ballard and his masterly crew of film makers h... 2009-03-09 12606 Never Cry Wolf
26998 NaN NaN NaN NaN Ronald Holloway fresh 86005 Variety Measures up to the promise Ballard amply provi... 2008-07-23 12606 Never Cry Wolf
27000 NaN NaN NaN NaN Vincent Canby fresh 86005 New York Times Perhaps the best thing about the film is that ... 2004-08-30 12606 Never Cry Wolf
27001 NaN NaN NaN NaN Dave Kehr fresh 86005 Chicago Reader The film is still memorable for its compassion... 2000-01-01 12606 Never Cry Wolf
27008 NaN NaN NaN NaN Don Druker fresh 55353 Chicago Reader It does have enough gritty insights and (for t... 2007-11-13 18541 A Raisin in the Sun

22007 rows × 12 columns


In [199]:
Y[1000:1010]
critics[1000:1010]
df2 = pd.DataFrame({'Y':Y})


Out[199]:
Y critic fresh imdb publication quote review_date rtid title
0 NaN Derek Adams fresh 114709 Time Out So ingenious in concept, design and execution ... 2009-10-04 9559 Toy Story
1 NaN Richard Corliss fresh 114709 TIME Magazine The year's most inventive comedy. 2008-08-31 9559 Toy Story
2 NaN David Ansen fresh 114709 Newsweek A winning animated feature that has something ... 2008-08-18 9559 Toy Story
3 NaN Leonard Klady fresh 114709 Variety The film sports a provocative and appealing st... 2008-06-09 9559 Toy Story
4 NaN Jonathan Rosenbaum fresh 114709 Chicago Reader An entertaining computer-generated, hyperreali... 2008-03-10 9559 Toy Story
5 NaN Michael Booth fresh 114709 Denver Post As Lion King did before it, Toy Story revived ... 2007-05-03 9559 Toy Story
6 NaN Geoff Andrew fresh 114709 Time Out The film will probably be more fully appreciat... 2006-06-24 9559 Toy Story
7 NaN Janet Maslin fresh 114709 New York Times Children will enjoy a new take on the irresist... 2003-05-20 9559 Toy Story
8 NaN Kenneth Turan fresh 114709 Los Angeles Times Although its computer-generated imagery is imp... 2001-02-13 9559 Toy Story
9 NaN Susan Wloszczyna fresh 114709 USA Today How perfect that two of the most popular funny... 2000-01-01 9559 Toy Story
10 NaN Roger Ebert fresh 114709 Chicago Sun-Times The result is a visionary roller-coaster ride ... 2000-01-01 9559 Toy Story
11 NaN John Hartl fresh 114709 Film.com Disney's witty, wondrously imaginative, all-co... 2000-01-01 9559 Toy Story
12 NaN Susan Stark fresh 114709 Detroit News Disney's first computer-made animated feature ... 2000-01-01 9559 Toy Story
13 NaN Peter Stack fresh 114709 San Francisco Chronicle The script, by Lasseter, Pete Docter, Andrew S... 2000-01-01 9559 Toy Story
14 NaN James Berardinelli fresh 114709 ReelViews The one big negative about Toy Story involves ... 2000-01-01 9559 Toy Story
15 NaN Sean Means fresh 114709 Film.com Technically, Toy Story is nearly flawless. 2000-01-01 9559 Toy Story
16 NaN Rita Kempley fresh 114709 Washington Post It's a nice change of pace to see the studio d... 2000-01-01 9559 Toy Story
17 NaN NaN fresh 114709 Entertainment Weekly I can hardly imagine having more fun at the mo... 1995-11-22 9559 Toy Story
18 NaN Roger Moore fresh 114709 Orlando Sentinel The great voice acting, the visual puns, all a... 1995-11-22 9559 Toy Story
19 NaN Roger Ebert rotten 113497 Chicago Sun-Times A gloomy special-effects extravaganza filled w... 2000-01-01 12436 Jumanji
20 NaN NaN fresh 113497 USA Today A calculated but very entertaining special eff... 2000-01-01 12436 Jumanji
21 NaN Richard Schickel fresh 107050 TIME Magazine Walter Matthau and Jack Lemmon are awfully goo... 2008-08-24 10498 Grumpy Old Men
22 NaN Derek Adams rotten 107050 Time Out Mediocre, regrettably. 2006-06-24 10498 Grumpy Old Men
23 NaN Caryn James fresh 107050 New York Times Just don't expect their bickering to be on the... 2003-05-20 10498 Grumpy Old Men
24 NaN James Berardinelli fresh 107050 ReelViews While it won't come close to my top 10 best li... 2000-01-01 10498 Grumpy Old Men
25 NaN Roger Ebert rotten 107050 Chicago Sun-Times The movie is too pat and practiced to really b... 2000-01-01 10498 Grumpy Old Men
26 NaN Desson Thomson fresh 107050 Washington Post If you poke through Grumpy's cheap sentimental... 2000-01-01 10498 Grumpy Old Men
27 NaN Liam Lacey rotten 114885 Globe and Mail Never escapes the queasy aura of Melrose Place... 2002-04-12 16697 Waiting to Exhale
28 NaN Kenneth Turan fresh 114885 Los Angeles Times A pleasant if undemanding piece of work that i... 2001-02-13 16697 Waiting to Exhale
29 NaN Edward Guthmann rotten 114885 San Francisco Chronicle You want the movie to stomp and rejoice and cr... 2000-01-01 16697 Waiting to Exhale
... ... ... ... ... ... ... ... ... ...
31528 1 NaN NaN NaN NaN NaN NaN NaN NaN
31529 1 NaN NaN NaN NaN NaN NaN NaN NaN
31530 0 NaN NaN NaN NaN NaN NaN NaN NaN
31531 0 NaN NaN NaN NaN NaN NaN NaN NaN
31532 1 NaN NaN NaN NaN NaN NaN NaN NaN
31533 1 NaN NaN NaN NaN NaN NaN NaN NaN
31534 1 NaN NaN NaN NaN NaN NaN NaN NaN
31535 0 NaN NaN NaN NaN NaN NaN NaN NaN
31536 1 NaN NaN NaN NaN NaN NaN NaN NaN
31537 0 NaN NaN NaN NaN NaN NaN NaN NaN
31538 0 NaN NaN NaN NaN NaN NaN NaN NaN
31539 0 NaN NaN NaN NaN NaN NaN NaN NaN
31540 0 NaN NaN NaN NaN NaN NaN NaN NaN
31541 0 NaN NaN NaN NaN NaN NaN NaN NaN
31542 0 NaN NaN NaN NaN NaN NaN NaN NaN
31543 0 NaN NaN NaN NaN NaN NaN NaN NaN
31544 0 NaN NaN NaN NaN NaN NaN NaN NaN
31545 1 NaN NaN NaN NaN NaN NaN NaN NaN
31546 1 NaN NaN NaN NaN NaN NaN NaN NaN
31547 0 NaN NaN NaN NaN NaN NaN NaN NaN
31548 1 NaN NaN NaN NaN NaN NaN NaN NaN
31549 1 NaN NaN NaN NaN NaN NaN NaN NaN
31550 1 NaN NaN NaN NaN NaN NaN NaN NaN
31551 0 NaN NaN NaN NaN NaN NaN NaN NaN
31552 1 NaN NaN NaN NaN NaN NaN NaN NaN
31553 1 NaN NaN NaN NaN NaN NaN NaN NaN
31554 1 NaN NaN NaN NaN NaN NaN NaN NaN
31555 1 NaN NaN NaN NaN NaN NaN NaN NaN
31556 1 NaN NaN NaN NaN NaN NaN NaN NaN
31557 1 NaN NaN NaN NaN NaN NaN NaN NaN

31558 rows × 9 columns


In [ ]:


In [135]:
rotten = comb[comb.actual == 0]
rotten.sort(columns="proba0", inplace=True)
print "Actual Was Rotten, But Predicted Fresh"
rotten[rotten.quote.notnull()].head()


Actual Was Rotten, But Predicted Fresh
Out[135]:
actual pred proba0 proba1 critic fresh imdb publication quote review_date rtid title
8290 0 1 0.004051 0.995949 Lawrence Van Gelder fresh 117011 New York Times From start to finish, Maximum Risk presents sp... 2000-01-01 15020 Maximum Risk
4723 0 1 0.004158 0.995842 Owen Gleiberman fresh 108551 Entertainment Weekly A splashy, volatile, crowd- pleasing rock-star... 2011-09-07 13236 What's Love Got To Do With It?
15171 0 1 0.006613 0.993387 Dennis Harvey rotten 138563 Variety In the end, too, we've learned very little abo... 2009-03-26 13804 Kurt & Courtney
14177 0 1 0.008705 0.991295 Todd McCarthy fresh 120347 Variety A solid but somewhat by-the-numbers entry in t... 2008-07-28 11715 Tomorrow Never Dies
2886 0 1 0.011203 0.988797 Kevin Crust fresh 331933 Los Angeles Times Herek keeps things moving and throws in some l... 2005-02-28 136262883 Man of the House

4.3 What do you notice about these mis-predictions? Naive Bayes classifiers assume that every word affects the probability independently of other words. In what way is this a bad assumption? In your answer, report your classifier's Freshness probability for the review "This movie is not remarkable, touching, or superb in any way".

Your answer here

4.4 If this was your final project, what are 3 things you would try in order to build a more effective review classifier? What other exploratory or explanatory visualizations do you think might be helpful?

Your answer here

How to Submit

Restart and run your notebook one last time, to make sure the output from each cell is up to date. To submit your homework, create a folder named lastname_firstinitial_hw3 and place your solutions in the folder. Double check that the file is still called HW3.ipynb, and that it contains your code. Please do not include the critics.csv data file, if you created one. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work!


css tweaks in this cell