Homework 3. Bayesian Tomatoes

Due Thursday, October 17, 11:59pm

In this assignment, you'll be analyzing movie reviews from Rotten Tomatoes. This assignment will cover:

Working with web APIs
Making and interpreting predictions from a Bayesian perspective
Using the Naive Bayes algorithm to predict whether a movie review is positive or negative
Using cross validation to optimize models

Useful libraries for this assignment

numpy, for arrays
scikit-learn, for machine learning
json for parsing JSON data from the web.
pandas, for data frames
matplotlib, for plotting
requests, for downloading web content



In [207]:

    
%matplotlib inline

import json

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 30)

# set some nicer defaults for matplotlib
from matplotlib import rcParams

#these colors come from colorbrewer2.org. Each is an RGB triplet
dark2_colors = [(0.10588235294117647, 0.6196078431372549, 0.4666666666666667),
                (0.8509803921568627, 0.37254901960784315, 0.00784313725490196),
                (0.4588235294117647, 0.4392156862745098, 0.7019607843137254),
                (0.9058823529411765, 0.1607843137254902, 0.5411764705882353),
                (0.4, 0.6509803921568628, 0.11764705882352941),
                (0.9019607843137255, 0.6705882352941176, 0.00784313725490196),
                (0.6509803921568628, 0.4627450980392157, 0.11372549019607843),
                (0.4, 0.4, 0.4)]

rcParams['figure.figsize'] = (10, 6)
rcParams['figure.dpi'] = 150
rcParams['axes.color_cycle'] = dark2_colors
rcParams['lines.linewidth'] = 2
rcParams['axes.grid'] = False
rcParams['axes.facecolor'] = 'white'
rcParams['font.size'] = 14
rcParams['patch.edgecolor'] = 'none'


def remove_border(axes=None, top=False, right=False, left=True, bottom=True):
    """
    Minimize chartjunk by stripping out unnecessary plot borders and axis ticks
    
    The top/right/left/bottom keywords toggle whether the corresponding plot border is drawn
    """
    ax = axes or plt.gca()
    ax.spines['top'].set_visible(top)
    ax.spines['right'].set_visible(right)
    ax.spines['left'].set_visible(left)
    ax.spines['bottom'].set_visible(bottom)
    
    #turn off all ticks
    ax.yaxis.set_ticks_position('none')
    ax.xaxis.set_ticks_position('none')
    
    #now re-enable visibles
    if top:
        ax.xaxis.tick_top()
    if bottom:
        ax.xaxis.tick_bottom()
    if left:
        ax.yaxis.tick_left()
    if right:
        ax.yaxis.tick_right()



In [208]:

    
pd.version.version









    Out[208]:





'0.14.0'

Introduction

Rotten Tomatoes gathers movie reviews from critics. An entry on the website typically consists of a short quote, a link to the full review, and a Fresh/Rotten classification which summarizes whether the critic liked/disliked the movie.

When critics give quantitative ratings (say 3/4 stars, Thumbs up, etc.), determining the Fresh/Rotten classification is easy. However, publications like the New York Times don't assign numerical ratings to movies, and thus the Fresh/Rotten classification must be inferred from the text of the review itself.

This basic task of categorizing text has many applications. All of the following questions boil down to text classification:

Is a movie review positive or negative?
Is an email spam, or not?
Is a comment on a blog discussion board appropriate, or not?
Is a tweet about your company positive, or not?

Language is incredibly nuanced, and there is an entire field of computer science dedicated to the topic (Natural Language Processing). Nevertheless, we can construct basic language models using fairly straightforward techniques.

The Data

You will be starting with a database of Movies, derived from the MovieLens dataset. This dataset includes information for about 10,000 movies, including the IMDB id for each movie.

Your first task is to download Rotten Tomatoes reviews from 3000 of these movies, using the Rotten Tomatoes API (Application Programming Interface).

Working with Web APIs

Web APIs are a more convenient way for programs to interact with websites. Rotten Tomatoes has a nice API that gives access to its data in JSON format.

To use this, you will first need to register for an API key. For "application URL", you can use anything -- it doesn't matter.

After you have a key, the documentation page shows the various data you can fetch from Rotten Tomatoes -- each type of data lives at a different web address. The basic pattern for fetching this data with Python is as follows (compare this to the Movie Reviews tab on the documentation page):



In [209]:

    
api_key = 'en3yzpn423n4q9ppmysy49yq'
movie_id = '770672122'  # toy story 3
url = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json' % movie_id

#these are "get parameters"
options = {'review_type': 'top_critic', 'page_limit': 20, 'page': 1, 'apikey': api_key}
data = requests.get(url, params=options).text
data = json.loads(data)  # load a json string into a collection of lists and dicts

print json.dumps(data['reviews'][0], indent=2)  # dump an object into a json string
#data









    



{
  "publication": "Village Voice", 
  "links": {
    "review": "http://www.villagevoice.com/2010-06-15/film/toys-are-us-in-toy-story-3/full/"
  }, 
  "quote": "When teenaged Andy plops down on the grass to share his old toys with a shy little girl, the film spikes with sadness and layered pleasure -- a concise, deeply wise expression of the ephemeral that feels real and yet utterly transporting.", 
  "freshness": "fresh", 
  "critic": "Eric Hynes", 
  "date": "2013-08-04"
}

Part 1: Get the data

Here's a chunk of the MovieLens Dataset:



In [227]:

    
from io import StringIO  
movie_txt = requests.get('https://raw.github.com/cs109/cs109_data/master/movies.dat').text
movie_file = StringIO(movie_txt) # treat a string like a file
movies = pd.read_csv(movie_file, delimiter='\t')

#print the first row
movies[['id', 'title', 'imdbID', 'year']].irow(0)
movies.head()









    Out[227]:






  
    
      
      id
      title
      imdbID
      spanishTitle
      imdbPictureURL
      year
      rtID
      rtAllCriticsRating
      rtAllCriticsNumReviews
      rtAllCriticsNumFresh
      rtAllCriticsNumRotten
      rtAllCriticsScore
      rtTopCriticsRating
      rtTopCriticsNumReviews
      rtTopCriticsNumFresh
      rtTopCriticsNumRotten
      rtTopCriticsScore
      rtAudienceRating
      rtAudienceNumRatings
      rtAudienceScore
      rtPictureURL
    
  
  
    
      0
       1
                         Toy story
       114709
                                    Toy story (juguetes)
       http://ia.media-imdb.com/images/M/MV5BMTMwNDU0...
       1995
                         toy_story
         9
       73
       73
        0
       100
       8.5
       17
       17
       0
       100
       3.7
       102338
       81
       http://content7.flixster.com/movie/10/93/63/10...
    
    
      1
       2
                           Jumanji
       113497
                                                 Jumanji
       http://ia.media-imdb.com/images/M/MV5BMzM5NjE1...
       1995
                   1068044-jumanji
       5.6
       28
       13
       15
        46
       5.8
        5
        2
       3
        40
       3.2
        44587
       61
       http://content8.flixster.com/movie/56/79/73/56...
    
    
      2
       3
                    Grumpy Old Men
       107050
                                     Dos viejos gruñones
       http://ia.media-imdb.com/images/M/MV5BMTI5MTgy...
       1993
                    grumpy_old_men
       5.9
       36
       24
       12
        66
         7
        6
        5
       1
        83
       3.2
        10489
       66
       http://content6.flixster.com/movie/25/60/25602...
    
    
      3
       4
                 Waiting to Exhale
       114885
                                    Esperando un respiro
       http://ia.media-imdb.com/images/M/MV5BMTczMTMy...
       1995
                 waiting_to_exhale
       5.6
       25
       14
       11
        56
       5.5
       11
        5
       6
        45
       3.3
         5666
       79
       http://content9.flixster.com/movie/10/94/17/10...
    
    
      4
       5
       Father of the Bride Part II
       113041
       Vuelve el padre de la novia (Ahora también abu...
       http://ia.media-imdb.com/images/M/MV5BMTg1NDc2...
       1995
       father_of_the_bride_part_ii
       5.3
       19
        9
       10
        47
       5.4
        5
        1
       4
        20
         3
        13761
       64
       http://content8.flixster.com/movie/25/54/25542...



In [211]:

    
movies[['id', 'title', 'imdbID', 'year']].irow(0)









    Out[211]:





id                1
title     Toy story
imdbID       114709
year           1995
Name: 0, dtype: object



In [212]:

    
movies.irow(0)['id']









    Out[212]:





1

P1.1

We'd like you to write a function that looks up the first 20 Top Critic Rotten Tomatoes reviews for a movie in the movies dataframe. This involves two steps:

Use the Movie Alias API to look up the Rotten Tomatoes movie id from the IMDB id
Use the Movie Reviews API to fetch the first 20 top-critic reviews for this movie

Not all movies have Rotten Tomatoes IDs. In these cases, your function should return None. The detailed spec is below. We are giving you some freedom with how you implement this, but you'll probably want to break this task up into several small functions.

Hint In some situations, the leading 0s in front of IMDB ids are important. IMDB ids have 7 digits



In [142]:

    
"""
Function
--------
fetch_reviews(movies, row)

Use the Rotten Tomatoes web API to fetch reviews for a particular movie

Parameters
----------
movies : DataFrame 
  The movies data above
row : int
  The row of the movies DataFrame to use
  
Returns
-------
If you can match the IMDB id to a Rotten Tomatoes ID:
  A DataFrame, containing the first 20 Top Critic reviews 
  for the movie. If a movie has less than 20 total reviews, return them all.
  This should have the following columns:
    critic : Name of the critic
    fresh  : 'fresh' or 'rotten'
    imdb   : IMDB id for the movie
    publication: Publication that the critic writes for
    quote  : string containing the movie review quote
    review_data: Date of review
    rtid   : Rotten Tomatoes ID for the movie
    title  : Name of the movie
    
If you cannot match the IMDB id to a Rotten Tomatoes ID, return None

Examples
--------
>>> reviews = fetch_reviews(movies, 0)
>>> print len(reviews)
20
>>> print reviews.irow(1)
critic                                               Derek Adams
fresh                                                      fresh
imdb                                                      114709
publication                                             Time Out
quote          So ingenious in concept, design and execution ...
review_date                                           2009-10-04
rtid                                                        9559
title                                                  Toy story
Name: 1, dtype: object
"""

from pandas.io.json import json_normalize

def fetch_reviews(movies, row):
    movie_id = str(movies.irow(row)['imdbID']).zfill(7)
    url = 'http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id=%s&type=imdb&apikey=en3yzpn423n4q9ppmysy49yq' % movie_id
    data = requests.get(url).text
    data = json.loads(data)
    
    if not data.has_key('error'):
        rt_id = data['id']
        url2 = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json?review_type=top_critic&page_limit=20&page=1&country=us&apikey=en3yzpn423n4q9ppmysy49yq' %rt_id
        data2 = requests.get(url2).text
        data2 = json.loads(data2)

        if data2.has_key('reviews'):
            if len(data2['reviews']) > 0:
                df = json_normalize(data2,'reviews')
                df['title'] = data['title']
                df['rtid'] = data['id']
                df['imdb_title'] = movies.irow(row)['title']
                df['imdb'] = movies.irow(row)['imdbID']
                if 'original_score' in df.columns:
                    df.drop(['original_score'],inplace=True,axis=1)
                if 'links' in df.columns:
                    df.drop(['links'],inplace=True,axis=1)
                df.rename(columns={'date': 'review_date', 'freshness': 'fresh'}, inplace=True)
                df = df.reindex_axis(sorted(df.columns), axis=1)
                return df
    else:
        return None



In [289]:

    
#fetch_reviews(movies, 108)

P1.2

Use the function you wrote to retrieve reviews for the first 3,000 movies in the movies dataframe.

Hints

Rotten Tomatoes limits you to 10,000 API requests a day. Be careful about this limit! Test your code on smaller inputs before scaling. You are responsible if you hit the limit the day the assignment is due :)
This will take a while to download. If you don't want to re-run this function every time you restart the notebook, you can save and re-load this data as a CSV file. However, please don't submit this file



In [230]:

    
"""
Function
--------
build_table

Parameters
----------
movies : DataFrame
  The movies data above
rows : int
  The number of rows to extract reviews for
  
Returns
--------
A dataframe
  The data obtained by repeatedly calling `fetch_reviews` on the first `rows`
  of `movies`, discarding the `None`s,
  and concatenating the results into a single DataFrame
"""

def build_table(movies, rows):
    
    l = list(range(rows))
    pd_init = fetch_reviews(movies, 0)
    l.pop(0)

    for num in l:
        print ("Checking Index: " + str(num))
        pd_init = pd.concat([pd_init, fetch_reviews(movies, num)])

    return pd_init

#build_table(movies, 3)



In [145]:

    
# SCRATCH SPACE

#pd_init.tail()
#pd_init.to_csv("/Users/xbsd/python/rt_movies.csv", index=False)
#pd_init.tail()

movies[movies.title=="The Closer You Get"]
movies[movies.index==2999]
z = fetch_reviews(movies, 2999)
print z









    



None



In [238]:

    
# SCRATCH SPACE

row = 108
movie_id = str(movies.irow(row)['imdbID']).zfill(7)
url = 'http://api.rottentomatoes.com/api/public/v1.0/movie_alias.json?id=%s&type=imdb&apikey=en3yzpn423n4q9ppmysy49yq' % movie_id
data = requests.get(url).text
data = json.loads(data)

rt_id = data['id']

url2 = 'http://api.rottentomatoes.com/api/public/v1.0/movies/%s/reviews.json?review_type=top_critic&page_limit=20&page=1&country=us&apikey=en3yzpn423n4q9ppmysy49yq' %rt_id
data2 = requests.get(url2).text
data2 = json.loads(data2)

df = json_normalize(data2,'reviews')
data['id']

df['title'] = "test"
df['title'] = data['title']
df['rtid'] = data['id']
df['imdb'] = movies.irow(row)['title']

df.drop(['links','original_score'],inplace=True,axis=1)
df.rename(columns={'date': 'review_date', 'freshness': 'fresh'}, inplace=True)
#df = df.reindex_axis(sorted(df.columns), axis=1)

#df



In [290]:

    
#you can toggle which lines are commented, if you
#want to re-load your results to avoid repeatedly calling this function

#critics = build_table(movies, 3000)
#critics.to_csv('critics.csv', index=False)
critics = pd.read_csv('/Users/xbsd/python/rt_movies.csv')

critics.drop(['imdb_title'],inplace=True,axis=1)

#for this assignment, let's drop rows with missing data
critics = critics[-critics.quote.isnull()]
#critics = critics[critics.fresh != 'none']
#critics = critics[critics.quote.str.len() > 0]



In [291]:

    
#critics.dropna()
#critics[-critics.quote.isnull()].shape

A quick sanity check that everything looks ok at this point



In [239]:

    
assert set(critics.columns) == set('critic fresh imdb publication '
                                   'quote review_date rtid title'.split())
assert len(critics) > 10000

Part 2: Explore

Before delving into analysis, get a sense of what these data look like. Answer the following questions. Include your code!

2.1 How many reviews, critics, and movies are in this dataset?



In [292]:

    
#your code here
num_reviews = len(critics)
num_critics = len(critics.critic.unique())
num_movies = len(critics.rtid.unique())

print ("Num Reviews: " + str(num_reviews) + " Num Critics: " + str(num_critics) + " Num Movies: " + str(num_movies))









    



Num Reviews: 15807 Num Critics: 639 Num Movies: 1910

2.2 What does the distribution of number of reviews per reviewer look like? Make a histogram



In [293]:

    
#Your code here
def histogram_style():
    remove_border(left=False)
    plt.grid(False)
    plt.grid(axis='y', color='w', linestyle='-', lw=1)

critics.groupby('critic').rtid.count().hist(log=True, bins=range(20), edgecolor='white')
plt.xlabel("Number of reviews per critic")
plt.ylabel("N")
histogram_style()



In [241]:

    
#Your code here
#critics.critic.hist(figsize=(10,10))
critics.critic.value_counts().plot(kind='bar')









    Out[241]:





<matplotlib.axes.AxesSubplot at 0x1d07cf590>

2.3 List the 5 critics with the most reviews, along with the publication they write for



In [242]:

    
#Your code here
gb = critics.groupby(["critic"])

z = gb.agg({'critic':np.count_nonzero,'publication':np.unique})
top_critics = z.sort(columns="critic", ascending=False)[0:5]

z2 = gb.agg({'critic':np.unique,'publication':np.unique})
top_critics['criticname']=top_critics.index.values

print z2[z2['critic'].isin(list(top_critics.criticname))]
top_critics









    



                                critic                                        publication
critic                                                                                   
James Berardinelli  James Berardinelli                                          ReelViews
Janet Maslin              Janet Maslin                                     New York Times
Jonathan Rosenbaum  Jonathan Rosenbaum                                     Chicago Reader
Roger Ebert                Roger Ebert  [At the Movies, Chicago Sun-Times, RogerEbert....
Variety Staff            Variety Staff                                            Variety






    Out[242]:






  
    
      
      critic
      publication
      criticname
    
    
      critic
      
      
      
    
  
  
    
      Roger Ebert
       1129
       [At the Movies, Chicago Sun-Times, RogerEbert....
              Roger Ebert
    
    
      James Berardinelli
        800
                                               ReelViews
       James Berardinelli
    
    
      Janet Maslin
        525
                                          New York Times
             Janet Maslin
    
    
      Variety Staff
        446
                                                 Variety
            Variety Staff
    
    
      Jonathan Rosenbaum
        411
                                          Chicago Reader
       Jonathan Rosenbaum

2.4 Of the critics with > 100 reviews, plot the distribution of average "freshness" rating per critic



In [243]:

    
#Your code here
gb = critics.groupby(["critic"])
z = gb.agg({'critic':np.count_nonzero,'publication':np.unique, \
            'fresh': lambda x: sum(x=="fresh")})
z['criticname'] = z.index.values
z['freshness'] = z.fresh/z.critic

top_critics = z.sort(columns="critic", ascending=False)
top_critics['criticname'] = top_critics.index.values


li = top_critics.query('critic > 100').index.values

res = z[z['criticname'].isin(list(li))]
#res.sort(columns="freshness",inplace=True)
res.plot(x="criticname", y="freshness", rot=90)









    Out[243]:





<matplotlib.axes.AxesSubplot at 0x1d0fe5dd0>

2.5 Using the original movies dataframe, plot the rotten tomatoes Top Critics Rating as a function of year. Overplot the average for each year, ignoring the score=0 examples (some of these are missing data). Comment on the result -- is there a trend? What do you think it means?



In [244]:

    
sub = movies[['rtTopCriticsRating', 'year']]
sub = sub[(sub.rtTopCriticsRating.values != "\N")]
sub[['rtTopCriticsRating']] = sub[['rtTopCriticsRating']].astype('float')
gb2 = sub.groupby("year")
sub2 = gb2.agg({'rtTopCriticsRating':np.mean})

plt.scatter(x=sub.year, y=sub.rtTopCriticsRating.values, c='r', alpha=0.5)
plt.plot(sub2.index, sub2.rtTopCriticsRating,'r-', c='b', alpha=0.8)









    Out[244]:





[<matplotlib.lines.Line2D at 0x1d0f1dcd0>]

Your Comment Here

Part 3: Sentiment Analysis

You will now use a Naive Bayes classifier to build a prediction model for whether a review is fresh or rotten, depending on the text of the review. See Lecture 9 for a discussion of Naive Bayes.

Most models work with numerical data, so we need to convert the textual collection of reviews to something numerical. A common strategy for text classification is to represent each review as a "bag of words" vector -- a long vector of numbers encoding how many times a particular word appears in a blurb.

Scikit-learn has an object called a CountVectorizer that turns text into a bag of words. Here's a quick tutorial:



In [245]:

    
from sklearn.feature_extraction.text import CountVectorizer

text = ['Hop on pop', 'Hop off pop', 'Hop Hop hop']
print "Original text is\n", '\n'.join(text)

vectorizer = CountVectorizer(min_df=0)

# call `fit` to build the vocabulary
vectorizer.fit(text)

# call `transform` to convert text to a bag of words
x = vectorizer.transform(text)

# CountVectorizer uses a sparse array to save memory, but it's easier in this assignment to 
# convert back to a "normal" numpy array
x = x.toarray()

print
print "Transformed text vector is \n", x

# `get_feature_names` tracks which word is associated with each column of the transformed x
print
print "Words for each feature:"
print vectorizer.get_feature_names()

# Notice that the bag of words treatment doesn't preserve information about the *order* of words, 
# just their frequency









    



Original text is
Hop on pop
Hop off pop
Hop Hop hop

Transformed text vector is 
[[1 0 1 1]
 [1 1 0 1]
 [3 0 0 0]]

Words for each feature:
[u'hop', u'off', u'on', u'pop']

3.1

Using the critics dataframe, compute a pair of numerical X, Y arrays where:

X is a (nreview, nwords) array. Each row corresponds to a bag-of-words representation for a single review. This will be the input to your model.
Y is a nreview-element 1/0 array, encoding whether a review is Fresh (1) or Rotten (0). This is the desired output from your model.



In [246]:

    
#hint: Consult the scikit-learn documentation to
#      learn about what these classes do do
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB

"""
Function
--------
make_xy

Build a bag-of-words training set for the review data

Parameters
-----------
critics : Pandas DataFrame
    The review data from above
    
vectorizer : CountVectorizer object (optional)
    A CountVectorizer object to use. If None,
    then create and fit a new CountVectorizer.
    Otherwise, re-fit the provided CountVectorizer
    using the critics data
    
Returns
-------
X : numpy array (dims: nreview, nwords)
    Bag-of-words representation for each review.
Y : numpy array (dims: nreview)
    1/0 array. 1 = fresh review, 0 = rotten review

Examples
--------
X, Y = make_xy(critics)
"""
def make_xy(critics, vectorizer=None):
    #Your code here

    if vectorizer == None:
        vectorizer = CountVectorizer(min_df=0)
    text = critics.quote.ravel()
    vectorizer.fit(text)
    X = vectorizer.transform(text)
    X = X.toarray()
    Y = np.array(1 * (critics.fresh=="fresh"))
    return (X, Y)



In [247]:

    
X, Y = make_xy(critics, vectorizer = CountVectorizer(min_df = best_min_df))



In [248]:

    
np.sum(Y - (1 * (critics.fresh=="fresh")))









    Out[248]:





0

3.2 Next, randomly split the data into two groups: a training set and a validation set.

Use the training set to train a MultinomialNB classifier, and print the accuracy of this model on the validation set

Hint You can use train_test_split to split up the training data



In [250]:

    
#Your code here
X_train, X_test, y_train, y_test = train_test_split(X, Y)

clf = MultinomialNB()
clf.fit(X_train, y_train)
predicted_train = clf.predict(X_train)
predicted_test = clf.predict(X_test)
trains=X_train.reshape(1,-1).flatten()
tests=X_test.reshape(1,-1).flatten()



In [251]:

    
X_test









    Out[251]:





array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])



In [252]:

    
X.shape









    Out[252]:





(15048, 2199)



In [253]:

    
'''
>>> import numpy as np
>>> from sklearn.metrics import accuracy_score
>>> y_pred = [0, 2, 1, 3]
>>> y_true = [0, 1, 2, 3]
>>> accuracy_score(y_true, y_pred)
0.5
>>> accuracy_score(y_true, y_pred, normalize=False)
2
'''
from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(y_train, predicted_train)
accuracy_test  = accuracy_score(y_test, predicted_test)

print ("Train Accuracy: " + str(accuracy_train) + " Test Accuracy: " + str(accuracy_test))









    



Train Accuracy: 0.792929292929 Test Accuracy: 0.750664540138

3.3:

We say a model is overfit if it performs better on the training data than on the test data. Is this model overfit? If so, how much more accurate is the model on the training data compared to the test data?



In [254]:

    
# Your code here. Print the accuracy on the test and training dataset
print ("Train Accuracy: " + str(accuracy_train) + " Test Accuracy: " + str(accuracy_test))
print ("Training Accuracy is better than on the Test Set by " + str(accuracy_train - accuracy_test))









    



Train Accuracy: 0.792929292929 Test Accuracy: 0.750664540138
Training Accuracy is better than on the Test Set by 0.0422647527911



In [255]:

    
'''
In [120]: rows = random.sample(df.index, 10)
In [121]: df_10 = df.ix[rows]
'''

# Select Random Samples

fraction = 0.75 # Change this for your examples !!!

num = np.round(len(X)*fraction).astype(int)
z = range(len(X))
ind = np.random.choice(z, num, replace=False)

samp_X = X[ind]
samp_Y = Y[ind]

samp_X_proba = clf.predict_proba(samp_X)



In [255]:

Interpret these numbers in a few sentences here

3.4: Model Calibration

Bayesian models like the Naive Bayes classifier have the nice property that they compute probabilities of a particular classification -- the predict_proba and predict_log_proba methods of MultinomialNB compute these probabilities.

Being the respectable Bayesian that you are, you should always assess whether these probabilities are calibrated -- that is, whether a prediction made with a confidence of x% is correct approximately x% of the time. We care about calibration because it tells us whether we can trust the probabilities computed by a model. If we can trust model probabilities, we can make better decisions using them (for example, we can calculate how much we should bet or invest in a given prediction).

Let's make a plot to assess model calibration. Schematically, we want something like this:

In words, we want to:

Take a collection of examples, and compute the freshness probability for each using clf.predict_proba
Gather examples into bins of similar freshness probability (the diagram shows 5 groups -- you should use something closer to 20)
For each bin, count the number of examples in that bin, and compute the fraction of examples in the bin which are fresh
In the upper plot, graph the expected P(Fresh) (x axis) and observed freshness fraction (Y axis). Estimate the uncertainty in observed freshness fraction $F$ via the equation $\sigma = \sqrt{F (1-F) / N}$
Overplot the line y=x. This is the trend we would expect if the model is calibrated
In the lower plot, show the number of examples in each bin

Hints

The output of clf.predict_proba(X) is a (N example, 2) array. The first column gives the probability $P(Y=0)$ or $P(Rotten)$, and the second gives $P(Y=1)$ or $P(Fresh)$.

The above image is just a guideline -- feel free to explore other options!



In [256]:

    
df = pd.DataFrame({'samp_X_proba_1':samp_X_proba[:,1], 'samp_Y':samp_Y})
#df.sort(columns="samp_X_proba_1",inplace=True)

bin1 = np.arange(0, 101, 5)/100.

df['bin_num'] = np.digitize(df.samp_X_proba_1, bin1)
gb = df.groupby('bin_num')
gb_agg = gb.agg({'samp_Y':lambda x: np.sum(x == 1), 'bin_num':np.count_nonzero})
gb_agg['fresh_pct'] = gb_agg.samp_Y/gb_agg.bin_num
gb_agg['uncertainty'] = np.sqrt(gb_agg.fresh_pct * (1 - gb_agg.fresh_pct) / gb_agg.bin_num)
gb_agg
plt.plot(gb_agg.index.values, gb_agg.fresh_pct,'ro-')

plt.plot(np.arange(21), np.arange(21)/20.)









    Out[256]:





[<matplotlib.lines.Line2D at 0x1d0ff7750>]



In [257]:

    
plt.hist(df.bin_num, bins=20,rwidth=0.9)









    Out[257]:





(array([  958.,   495.,   421.,   376.,   330.,   348.,   330.,   316.,
          372.,   301.,   337.,   340.,   405.,   426.,   444.,   486.,
          541.,   694.,   944.,  2422.]),
 array([  1.  ,   1.95,   2.9 ,   3.85,   4.8 ,   5.75,   6.7 ,   7.65,
          8.6 ,   9.55,  10.5 ,  11.45,  12.4 ,  13.35,  14.3 ,  15.25,
         16.2 ,  17.15,  18.1 ,  19.05,  20.  ]),
 <a list of 20 Patch objects>)



In [258]:

    
X









    Out[258]:





array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])



In [259]:

    
"""
Function
--------
calibration_plot

Builds a plot like the one above, from a classifier and review data

Inputs
-------
clf : Classifier object
    A MultinomialNB classifier
X : (Nexample, Nfeature) array
    The bag-of-words data
Y : (Nexample) integer array
    1 if a review is Fresh
"""    
#your code here

def calibration_plot(clf, X, Y):
    
    # Prepare the Data

    fraction = 1.0 # Change this for your examples !!!
    num = np.round(len(X)*fraction).astype(int)
    z = range(len(X))
    ind = np.random.choice(z, num, replace=False)

    samp_X = X[ind]
    samp_Y = Y[ind]
    samp_X_proba = clf.predict_proba(samp_X)


    # Prepare the DataFrame

    df = pd.DataFrame({'samp_X_proba_1':samp_X_proba[:,1], 'samp_Y':samp_Y})
    #df.sort(columns="samp_X_proba_1",inplace=True)
    bin1 = np.arange(0, 101, 5)/100.
    df['bin_num'] = np.digitize(df.samp_X_proba_1, bin1)
    gb = df.groupby('bin_num')
    gb_agg = gb.agg({'samp_Y':lambda x: np.sum(x == 1), 'bin_num':np.count_nonzero})
    gb_agg['fresh_pct'] = gb_agg.samp_Y/gb_agg.bin_num
    gb_agg['uncertainty'] = np.sqrt(gb_agg.fresh_pct * (1 - gb_agg.fresh_pct) / gb_agg.bin_num)
    # gb_agg

    # Plot Figures
    
    plt.figure(0)
    plt.plot(gb_agg.index.values, gb_agg.fresh_pct,'ro-')
    plt.plot(np.arange(21), np.arange(21)/20.)

    plt.figure(1)
    plt.hist(df.bin_num, bins=20,rwidth=0.9)

    plt.show
    return None



In [260]:

    
calibration_plot(clf, X_test, y_test)

3.5 We might say a model is over-confident if the freshness fraction is usually closer to 0.5 than expected (that is, there is more uncertainty than the model predicted). Likewise, a model is under-confident if the probabilities are usually further away from 0.5. Is this model generally over- or under-confident?

Your Answer Here

Cross Validation

Our classifier has a few free parameters. The two most important are:

The min_df keyword in CountVectorizer, which will ignore words which appear in fewer than min_df fraction of reviews. Words that appear only once or twice can lead to overfitting, since words which occur only a few times might correlate very well with Fresh/Rotten reviews by chance in the training dataset.
The alpha keyword in the Bayesian classifier is a "smoothing parameter" -- increasing the value decreases the sensitivity to any single feature, and tends to pull prediction probabilities closer to 50%.

As discussed in lecture and HW2, a common technique for choosing appropriate values for these parameters is cross-validation. Let's choose good parameters by maximizing the cross-validated log-likelihood.

3.6 Using clf.predict_log_proba, write a function that computes the log-likelihood of a dataset



In [261]:

    
"""
Function
--------
log_likelihood

Compute the log likelihood of a dataset according to a bayesian classifier. 
The Log Likelihood is defined by

L = Sum_fresh(logP(fresh)) + Sum_rotten(logP(rotten))

Where Sum_fresh indicates a sum over all fresh reviews, 
and Sum_rotten indicates a sum over rotten reviews
    
Parameters
----------
clf : Bayesian classifier
x : (nexample, nfeature) array
    The input data
y : (nexample) integer array
    Whether each review is Fresh
"""
#your code here

def log_likelihood(clf, x, y):
    x_proba = clf.predict_log_proba(x)

    prob_rotten = x_proba[:,0]
    prob_fresh = x_proba[:,1]

    df = pd.DataFrame({'y':y, 'logP_fresh':prob_fresh, 'logP_rotten':prob_rotten})
    df['y']=df.y.astype("str")
    gb = df.groupby('y')
    res = gb.agg({'y':np.count_nonzero, 'logP_fresh': np.sum, 'logP_rotten': np.sum})

    logP_fresh  = res[res.index=="1"][['logP_fresh']].values
    logP_rotten = res[res.index=="0"][['logP_rotten']].values

    L = (logP_fresh + logP_rotten).ravel()
    #print ("LogP Fresh: " + str(logP_fresh) + " LogP Rotten: " + \
    #       str(logP_rotten) + " L: " + str(L))

    return L

#x = log_likelihood(clf, X_test, y_test)
#x



In [262]:

    
x < 10









    Out[262]:





array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

Here's a function to estimate the cross-validated value of a scoring function, given a classifier and data



In [263]:

    
from sklearn.cross_validation import KFold

def cv_score(clf, x, y, score_func):
    """
    Uses 5-fold cross validation to estimate a score of a classifier
    
    Inputs
    ------
    clf : Classifier object
    x : Input feature vector
    y : Input class labels
    score_func : Function like log_likelihood, that takes (clf, x, y) as input,
                 and returns a score
                 
    Returns
    -------
    The average score obtained by randomly splitting (x, y) into training and 
    test sets, fitting on the training set, and evaluating score_func on the test set
    
    Examples
    cv_score(clf, x, y, log_likelihood)
    """
    result = 0
    nfold = 5
    for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
        clf.fit(x[train], y[train]) # fit
        result += score_func(clf, x[test], y[test]) # evaluate score function on held-out data
    return result / nfold # average

# --- Very Important -- Existing Functionality ---
# as a side note, this function is builtin to the newest version of sklearn. We could just write
# sklearn.cross_validation.cross_val_score(clf, x, y, scorer=log_likelihood).
# ------------------------------------------------


# cv_score(clf, X, Y, log_likelihood)

3.7

Fill in the remaining code in this block, to loop over many values of alpha and min_df to determine which settings are "best" in the sense of maximizing the cross-validated log-likelihood



In [264]:

    
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
min_dfs = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]

# alphas = [.1]    # For Testing
# min_dfs = [1e-5] # For Testing

#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
best_min_df = None
max_loglike = -np.inf

i = 0
iterations = len(alphas) * len(min_dfs)

for alpha in alphas:
    for min_df in min_dfs:
        print ("Starting Iteration: " + str(i) + " of " + str(iterations))
        print ("Alpha: " + str(alpha) + " Min DF: " + str(min_df))
        vectorizer = CountVectorizer(min_df = min_df)
        X, Y = make_xy(critics, vectorizer)
        #your code here
        clf = MultinomialNB(alpha=alpha)
        new_score = cv_score(clf, X, Y, log_likelihood)
        if new_score > max_loglike:
            max_loglike = new_score
            best_alpha = alpha
            best_min_df = min_df
        print ("LogLik Score = " + str(new_score))
        print (" ---")
        i = i + 1









    



Starting Iteration: 0 of 30
Alpha: 0 Min DF: 1e-05
LogLik Score = [ nan]
 ---
Starting Iteration: 1 of 30
Alpha: 0 Min DF: 0.0001
LogLik Score = [ nan]
 ---
Starting Iteration: 2 of 30
Alpha: 0 Min DF: 0.001
LogLik Score = [ nan]
 ---
Starting Iteration: 3 of 30
Alpha: 0 Min DF: 0.01
LogLik Score = [-1898.11491671]
 ---
Starting Iteration: 4 of 30
Alpha: 0 Min DF: 0.1
LogLik Score = [-1989.59368911]
 ---
Starting Iteration: 5 of 30
Alpha: 0.1 Min DF: 1e-05
LogLik Score = [-2493.18492826]
 ---
Starting Iteration: 6 of 30
Alpha: 0.1 Min DF: 0.0001
LogLik Score = [-2492.37906681]
 ---
Starting Iteration: 7 of 30
Alpha: 0.1 Min DF: 0.001
LogLik Score = [-1762.94184335]
 ---
Starting Iteration: 8 of 30
Alpha: 0.1 Min DF: 0.01
LogLik Score = [-1898.02270029]
 ---
Starting Iteration: 9 of 30
Alpha: 0.1 Min DF: 0.1
LogLik Score = [-1989.59327999]
 ---
Starting Iteration: 10 of 30
Alpha: 1 Min DF: 1e-05
LogLik Score = [-1728.4714172]
 ---
Starting Iteration: 11 of 30
Alpha: 1 Min DF: 0.0001
LogLik Score = [-1718.73721745]
 ---
Starting Iteration: 12 of 30
Alpha: 1 Min DF: 0.001
LogLik Score = [-1699.31137285]
 ---
Starting Iteration: 13 of 30
Alpha: 1 Min DF: 0.01
LogLik Score = [-1897.22622562]
 ---
Starting Iteration: 14 of 30
Alpha: 1 Min DF: 0.1
LogLik Score = [-1989.58968387]
 ---
Starting Iteration: 15 of 30
Alpha: 5 Min DF: 1e-05
LogLik Score = [-2496.67053748]
 ---
Starting Iteration: 16 of 30
Alpha: 5 Min DF: 0.0001
LogLik Score = [-1865.42188979]
 ---
Starting Iteration: 17 of 30
Alpha: 5 Min DF: 0.001
LogLik Score = [-1635.75018493]
 ---
Starting Iteration: 18 of 30
Alpha: 5 Min DF: 0.01
LogLik Score = [-1894.33497364]
 ---
Starting Iteration: 19 of 30
Alpha: 5 Min DF: 0.1
LogLik Score = [-1989.57554692]
 ---
Starting Iteration: 20 of 30
Alpha: 10 Min DF: 1e-05
LogLik Score = [-3467.63016835]
 ---
Starting Iteration: 21 of 30
Alpha: 10 Min DF: 0.0001
LogLik Score = [-2550.7832299]
 ---
Starting Iteration: 22 of 30
Alpha: 10 Min DF: 0.001
LogLik Score = [-1640.77581971]
 ---
Starting Iteration: 23 of 30
Alpha: 10 Min DF: 0.01
LogLik Score = [-1891.93110673]
 ---
Starting Iteration: 24 of 30
Alpha: 10 Min DF: 0.1
LogLik Score = [-1989.56199785]
 ---
Starting Iteration: 25 of 30
Alpha: 50 Min DF: 1e-05
LogLik Score = [-4743.14970851]
 ---
Starting Iteration: 26 of 30
Alpha: 50 Min DF: 0.0001
LogLik Score = [-4248.99709581]
 ---
Starting Iteration: 27 of 30
Alpha: 50 Min DF: 0.001
LogLik Score = [-2370.95425129]
 ---
Starting Iteration: 28 of 30
Alpha: 50 Min DF: 0.01
LogLik Score = [-1899.2963795]
 ---
Starting Iteration: 29 of 30
Alpha: 50 Min DF: 0.1
LogLik Score = [-1989.59809826]
 ---



In [265]:

    
print "alpha: %f" % best_alpha
print "min_df: %f" % best_min_df









    



alpha: 5.000000
min_df: 0.001000

3.8 Now that you've determined values for alpha and min_df that optimize the cross-validated log-likelihood, repeat the steps in 3.1, 3.2, and 3.4 to train a final classifier with these parameters, re-evaluate the accuracy, and draw a new calibration plot.



In [266]:

    
#Your code here
X, Y = make_xy(critics, vectorizer = CountVectorizer(min_df = best_min_df))

X_train, X_test, y_train, y_test = train_test_split(X, Y)

clf = MultinomialNB(alpha = best_alpha)
clf.fit(X_train, y_train)
predicted_train = clf.predict(X_train)
predicted_test = clf.predict(X_test)
trains=X_train.reshape(1,-1).flatten()
tests=X_test.reshape(1,-1).flatten()

from sklearn.metrics import accuracy_score
accuracy_train = accuracy_score(y_train, predicted_train)
accuracy_test  = accuracy_score(y_test, predicted_test)

print ("Train Accuracy: " + str(accuracy_train) + " Test Accuracy: " + str(accuracy_test))

calibration_plot(clf, X_test, y_test)
log_likelihood(clf, X_test, y_test)
cv_score(clf, X, Y, log_likelihood)









    



Train Accuracy: 0.786461102251 Test Accuracy: 0.736310473153






    Out[266]:





array([-1635.75018493])



In [267]:

    
log_likelihood(clf, X_test, y_test)









    Out[267]:





array([-1807.26436408])



In [268]:

    
cv_score(clf, X, Y, log_likelihood)









    Out[268]:





array([-1635.75018493])

3.9 Discuss the various ways in which Cross-Validation has affected the model. Is the new model more or less accurate? Is overfitting better or worse? Is the model more or less calibrated?

Your Answer Here

To think about/play with, but not to hand in: What would happen if you tried this again using a function besides the log-likelihood -- for example, the classification accuracy?

Part 4: Interpretation. What words best predict a fresh or rotten review?

4.1 Using your classifier and the vectorizer.get_feature_names method, determine which words best predict a positive or negative review. Print the 10 words that best predict a "fresh" review, and the 10 words that best predict a "rotten" review. For each word, what is the model's probability of freshness if the word appears one time?

Hints

Try computing the classification probability for a feature vector which consists of all 0s, except for a single 1. What does this probability refer to?
np.eye generates a matrix where the ith row is all 0s, except for the ith column which is 1.



In [288]:

    
# Your code here

def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print "\t%.4f \t%-15s \t\t %.4f \t% -15s" % (coef_1, fn_1, coef_2, fn_2)

n = 10
feature_names = vectorizer.get_feature_names()
coefs_with_fns = sorted(zip(clf.coef_, feature_names))
top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])

show_most_informative_features(vectorizer, clf, n=20)
#coefs_with_fns[:]
top


words = np.array(vectorizer.get_feature_names())

x = np.eye(X_test.shape[1])
probs = clf.predict_log_proba(x)[:, 0]
ind = np.argsort(probs)

[ind[:10]]
vectorizer.get_feature_names()









    



	-9.5976 	it              		 -5.5612 	that           
	-9.3745 	as              		 -8.1806 	the            
	-9.2792 	and             		 -8.3449 	to             
	-9.2792 	in              		 -8.4190 	of             
	-9.2792 	its             		 -8.5416 	this           
	-9.2792 	with            		 -8.6326 	an             
	-9.1121 	for             		 -8.6814 	but            
	-9.1121 	is              		 -8.8439 	movie          
	-8.9045 	film            		 -8.9045 	film           
	-8.8439 	movie           		 -9.1121 	is             
	-8.6814 	but             		 -9.1121 	for            
	-8.6326 	an              		 -9.2792 	with           
	-8.5416 	this            		 -9.2792 	its            
	-8.4190 	of              		 -9.2792 	in             
	-8.3449 	to              		 -9.2792 	and            
	-8.1806 	the             		 -9.3745 	as             
	-5.5612 	that            		 -9.5976 	it             






    Out[288]:





[u'an',
 u'and',
 u'as',
 u'but',
 u'film',
 u'for',
 u'in',
 u'is',
 u'it',
 u'its',
 u'movie',
 u'of',
 u'that',
 u'the',
 u'this',
 u'to',
 u'with']



In [270]:

    
def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " ".join(feature_names[j] for j in top10)))

(clf.coef_.shape)









    Out[270]:





(1, 2199)

4.2

One of the best sources for inspiration when trying to improve a model is to look at examples where the model performs poorly.

Find 5 fresh and rotten reviews where your model performs particularly poorly. Print each review.



In [271]:

    
make_xy(critics)









    Out[271]:





(array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]]), array([1, 1, 1, ..., 1, 1, 1]))



In [272]:

    
critics[critics.imdb==110955]
#Y[5039:5052]









    Out[272]:






  
    
      
      critic
      fresh
      imdb
      publication
      quote
      review_date
      rtid
      title
    
  
  
    
      5039
             Jeff Shannon
        fresh
       110955
               Seattle Times
       It's miraculous casting, and the Australian Da...
       2013-12-06
       13147
       The Ref
    
    
      5040
            Kenneth Turan
        fresh
       110955
           Los Angeles Times
       The Ref benefits from having actor's actors li...
       2013-12-06
       13147
       The Ref
    
    
      5041
       Michael Wilmington
       rotten
       110955
             Chicago Tribune
       It's not a bad idea, but it's not a good movie...
       2013-12-06
       13147
       The Ref
    
    
      5042
               Steven Rea
       rotten
       110955
       Philadelphia Inquirer
       Whether it's a function of sloppy editing or s...
       2013-12-06
       13147
       The Ref
    
    
      5043
          Owen Gleiberman
       rotten
       110955
        Entertainment Weekly
                         A foulmouthed sitcom of a film.
       2011-09-07
       13147
       The Ref
    
    
      5044
            Variety Staff
       rotten
       110955
                     Variety
       The Ref works virtually none of the miracles o...
       2009-03-26
       13147
       The Ref
    
    
      5045
       Jonathan Rosenbaum
        fresh
       110955
              Chicago Reader
       What makes most of this work is the brio of th...
       2007-11-27
       13147
       The Ref
    
    
      5046
             Geoff Andrew
        fresh
       110955
                    Time Out
       In his first starring role, comedian Leary mak...
       2006-02-09
       13147
       The Ref
    
    
      5047
              Caryn James
        fresh
       110955
              New York Times
       Staying clear of any mean-spirited attitudes, ...
       2003-05-20
       13147
       The Ref
    
    
      5048
            Peter Travers
        fresh
       110955
               Rolling Stone
               Demme brings out the comic ease in Leary.
       2001-05-12
       13147
       The Ref
    
    
      5049
               Hal Hinson
       rotten
       110955
             Washington Post
       The Ref is one of those rare movies that seem ...
       2000-01-01
       13147
       The Ref
    
    
      5050
       James Berardinelli
       rotten
       110955
                   ReelViews
       This is not a seamlessly constructed movie, bu...
       2000-01-01
       13147
       The Ref
    
    
      5051
           Desson Thomson
       rotten
       110955
             Washington Post
          This is one holiday party you'll want to miss.
       2000-01-01
       13147
       The Ref
    
    
      5052
              Roger Ebert
        fresh
       110955
           Chicago Sun-Times
       Material like this is only as good as the acti...
       2000-01-01
       13147
       The Ref



In [275]:

    
#Your code here
#Your code here
X, Y = make_xy(critics, vectorizer = CountVectorizer(min_df = best_min_df))

X_train, X_test, y_train, y_test = train_test_split(X, Y)

clf = MultinomialNB(alpha = best_alpha)
clf.fit(X_train, y_train)
all_pred = clf.predict(X)
all_act  = Y
alldata  = critics

yzero = clf.predict_proba(X)[:,0]
yone  = clf.predict_proba(X)[:,1]

temp = pd.DataFrame({'actual':all_act, 'pred': all_pred, 'proba0': yzero, 'proba1':yone})
temp









    Out[275]:






  
    
      
      actual
      pred
      proba0
      proba1
    
  
  
    
      0    
       1
       1
       0.197385
       0.802615
    
    
      1    
       1
       1
       0.154380
       0.845620
    
    
      2    
       1
       1
       0.135735
       0.864265
    
    
      3    
       1
       1
       0.029535
       0.970465
    
    
      4    
       1
       1
       0.013373
       0.986627
    
    
      5    
       1
       1
       0.001520
       0.998480
    
    
      6    
       1
       1
       0.105968
       0.894032
    
    
      7    
       1
       1
       0.044276
       0.955724
    
    
      8    
       1
       1
       0.023551
       0.976449
    
    
      9    
       1
       1
       0.019731
       0.980269
    
    
      10   
       1
       1
       0.291099
       0.708901
    
    
      11   
       1
       1
       0.008019
       0.991981
    
    
      12   
       1
       1
       0.133066
       0.866934
    
    
      13   
       1
       1
       0.339067
       0.660933
    
    
      14   
       1
       1
       0.135005
       0.864995
    
    
      15   
       1
       1
       0.154506
       0.845494
    
    
      16   
       1
       1
       0.027510
       0.972490
    
    
      17   
       1
       1
       0.075791
       0.924209
    
    
      18   
       0
       1
       0.116994
       0.883006
    
    
      19   
       1
       0
       0.719362
       0.280638
    
    
      20   
       0
       0
       0.678488
       0.321512
    
    
      21   
       1
       0
       0.660961
       0.339039
    
    
      22   
       1
       1
       0.035450
       0.964550
    
    
      23   
       0
       0
       0.982669
       0.017331
    
    
      24   
       1
       0
       0.881747
       0.118253
    
    
      25   
       0
       1
       0.333123
       0.666877
    
    
      26   
       1
       1
       0.381607
       0.618393
    
    
      27   
       0
       0
       0.632378
       0.367622
    
    
      28   
       1
       1
       0.080019
       0.919981
    
    
      29   
       0
       0
       0.932979
       0.067021
    
    
      ...
      ...
      ...
      ...
      ...
    
    
      15018
       1
       1
       0.117484
       0.882516
    
    
      15019
       1
       1
       0.328170
       0.671830
    
    
      15020
       1
       1
       0.010874
       0.989126
    
    
      15021
       1
       1
       0.013037
       0.986963
    
    
      15022
       1
       1
       0.220809
       0.779191
    
    
      15023
       1
       0
       0.552798
       0.447202
    
    
      15024
       0
       1
       0.406322
       0.593678
    
    
      15025
       1
       1
       0.267228
       0.732772
    
    
      15026
       1
       0
       0.562191
       0.437809
    
    
      15027
       1
       0
       0.667167
       0.332833
    
    
      15028
       0
       0
       0.502714
       0.497286
    
    
      15029
       1
       1
       0.128733
       0.871267
    
    
      15030
       0
       1
       0.389252
       0.610748
    
    
      15031
       0
       1
       0.243758
       0.756242
    
    
      15032
       0
       0
       0.955301
       0.044699
    
    
      15033
       0
       0
       0.949083
       0.050917
    
    
      15034
       0
       0
       0.956573
       0.043427
    
    
      15035
       0
       1
       0.469129
       0.530871
    
    
      15036
       0
       1
       0.064388
       0.935612
    
    
      15037
       0
       1
       0.431358
       0.568642
    
    
      15038
       1
       1
       0.025647
       0.974353
    
    
      15039
       0
       0
       0.590650
       0.409350
    
    
      15040
       1
       1
       0.120616
       0.879384
    
    
      15041
       1
       1
       0.139223
       0.860777
    
    
      15042
       0
       1
       0.491241
       0.508759
    
    
      15043
       1
       1
       0.198314
       0.801686
    
    
      15044
       1
       1
       0.213662
       0.786338
    
    
      15045
       1
       1
       0.429152
       0.570848
    
    
      15046
       1
       1
       0.047312
       0.952688
    
    
      15047
       1
       0
       0.717020
       0.282980
    
  

15048 rows × 4 columns



In [279]:

    
comb = pd.concat([temp, critics], axis=1)
fresh = comb[comb.actual == 1]
#fresh.sort(columns="proba1", inplace=True)
print "Actual Was Fresh, But Predicted Rotten"
fresh[fresh.quote.notnull()].head()
comb









    



Actual Was Fresh, But Predicted Rotten






    Out[279]:






  
    
      
      actual
      pred
      proba0
      proba1
      critic
      fresh
      imdb
      publication
      quote
      review_date
      rtid
      title
    
  
  
    
      0    
        1
        1
       0.197385
       0.802615
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      1    
        1
        1
       0.154380
       0.845620
              Derek Adams
        fresh
       114709
                           Time Out
       So ingenious in concept, design and execution ...
       2009-10-04
            9559
                                Toy Story
    
    
      2    
        1
        1
       0.135735
       0.864265
          Richard Corliss
        fresh
       114709
                      TIME Magazine
                       The year's most inventive comedy.
       2008-08-31
            9559
                                Toy Story
    
    
      3    
        1
        1
       0.029535
       0.970465
              David Ansen
        fresh
       114709
                           Newsweek
       A winning animated feature that has something ...
       2008-08-18
            9559
                                Toy Story
    
    
      4    
        1
        1
       0.013373
       0.986627
            Leonard Klady
        fresh
       114709
                            Variety
       The film sports a provocative and appealing st...
       2008-06-09
            9559
                                Toy Story
    
    
      5    
        1
        1
       0.001520
       0.998480
       Jonathan Rosenbaum
        fresh
       114709
                     Chicago Reader
       An entertaining computer-generated, hyperreali...
       2008-03-10
            9559
                                Toy Story
    
    
      6    
        1
        1
       0.105968
       0.894032
            Michael Booth
        fresh
       114709
                        Denver Post
       As Lion King did before it, Toy Story revived ...
       2007-05-03
            9559
                                Toy Story
    
    
      7    
        1
        1
       0.044276
       0.955724
             Geoff Andrew
        fresh
       114709
                           Time Out
       The film will probably be more fully appreciat...
       2006-06-24
            9559
                                Toy Story
    
    
      8    
        1
        1
       0.023551
       0.976449
             Janet Maslin
        fresh
       114709
                     New York Times
       Children will enjoy a new take on the irresist...
       2003-05-20
            9559
                                Toy Story
    
    
      9    
        1
        1
       0.019731
       0.980269
            Kenneth Turan
        fresh
       114709
                  Los Angeles Times
       Although its computer-generated imagery is imp...
       2001-02-13
            9559
                                Toy Story
    
    
      10   
        1
        1
       0.291099
       0.708901
         Susan Wloszczyna
        fresh
       114709
                          USA Today
       How perfect that two of the most popular funny...
       2000-01-01
            9559
                                Toy Story
    
    
      11   
        1
        1
       0.008019
       0.991981
              Roger Ebert
        fresh
       114709
                  Chicago Sun-Times
       The result is a visionary roller-coaster ride ...
       2000-01-01
            9559
                                Toy Story
    
    
      12   
        1
        1
       0.133066
       0.866934
               John Hartl
        fresh
       114709
                           Film.com
       Disney's witty, wondrously imaginative, all-co...
       2000-01-01
            9559
                                Toy Story
    
    
      13   
        1
        1
       0.339067
       0.660933
              Susan Stark
        fresh
       114709
                       Detroit News
       Disney's first computer-made animated feature ...
       2000-01-01
            9559
                                Toy Story
    
    
      14   
        1
        1
       0.135005
       0.864995
              Peter Stack
        fresh
       114709
            San Francisco Chronicle
       The script, by Lasseter, Pete Docter, Andrew S...
       2000-01-01
            9559
                                Toy Story
    
    
      15   
        1
        1
       0.154506
       0.845494
       James Berardinelli
        fresh
       114709
                          ReelViews
       The one big negative about Toy Story involves ...
       2000-01-01
            9559
                                Toy Story
    
    
      16   
        1
        1
       0.027510
       0.972490
               Sean Means
        fresh
       114709
                           Film.com
              Technically, Toy Story is nearly flawless.
       2000-01-01
            9559
                                Toy Story
    
    
      17   
        1
        1
       0.075791
       0.924209
             Rita Kempley
        fresh
       114709
                    Washington Post
       It's a nice change of pace to see the studio d...
       2000-01-01
            9559
                                Toy Story
    
    
      18   
        0
        1
       0.116994
       0.883006
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      19   
        1
        0
       0.719362
       0.280638
              Roger Moore
        fresh
       114709
                   Orlando Sentinel
       The great voice acting, the visual puns, all a...
       1995-11-22
            9559
                                Toy Story
    
    
      20   
        0
        0
       0.678488
       0.321512
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      21   
        1
        0
       0.660961
       0.339039
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      22   
        1
        1
       0.035450
       0.964550
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      23   
        0
        0
       0.982669
       0.017331
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      24   
        1
        0
       0.881747
       0.118253
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      25   
        0
        1
       0.333123
       0.666877
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      26   
        1
        1
       0.381607
       0.618393
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      27   
        0
        0
       0.632378
       0.367622
              Roger Ebert
       rotten
       113497
                  Chicago Sun-Times
       A gloomy special-effects extravaganza filled w...
       2000-01-01
           12436
                                  Jumanji
    
    
      28   
        1
        1
       0.080019
       0.919981
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      29   
        0
        0
       0.932979
       0.067021
                      NaN
          NaN
          NaN
                                NaN
                                                     NaN
              NaN
             NaN
                                      NaN
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      26947
      NaN
      NaN
            NaN
            NaN
       Michael Wilmington
        fresh
       165798
                    Chicago Tribune
       Ghost Dog is... a delight for those who know h...
       2000-01-01
           13267
       Ghost Dog - The Way of the Samurai
    
    
      26948
      NaN
      NaN
            NaN
            NaN
              Roger Ebert
        fresh
       165798
                  Chicago Sun-Times
       By the end, Whitaker's character has generated...
       2000-01-01
           13267
       Ghost Dog - The Way of the Samurai
    
    
      26949
      NaN
      NaN
            NaN
            NaN
            Variety Staff
        fresh
        94347
                            Variety
       The characters are memorable ones, and beautif...
       2007-12-18
       770674061
                  The Year My Voice Broke
    
    
      26951
      NaN
      NaN
            NaN
            NaN
              Caryn James
        fresh
        94347
                     New York Times
       It is so pleasant and unpretentious that we ca...
       2004-08-30
       770674061
                  The Year My Voice Broke
    
    
      26952
      NaN
      NaN
            NaN
            NaN
       Jonathan Rosenbaum
        fresh
        94347
                     Chicago Reader
       Although most of this is rather familiar stuff...
       2000-01-01
       770674061
                  The Year My Voice Broke
    
    
      26953
      NaN
      NaN
            NaN
            NaN
               Hal Hinson
        fresh
        94347
                    Washington Post
       This isn't an adolescent wish-fulfillment fant...
       2000-01-01
       770674061
                  The Year My Voice Broke
    
    
      26958
      NaN
      NaN
            NaN
            NaN
               Pat Graham
       rotten
        63185
                     Chicago Reader
       Robert Aldrich's "daring" 1968 mating of lesbi...
       2009-04-24
       747403171
             The Killing of Sister George
    
    
      26962
      NaN
      NaN
            NaN
            NaN
                Dave Kehr
        fresh
        40506
                     Chicago Reader
       A little windy and rhetorical for my taste, bu...
       2008-04-08
           18375
                                Key Largo
    
    
      26963
      NaN
      NaN
            NaN
            NaN
            Variety Staff
        fresh
        40506
                            Variety
       Emphasis is on tension in the telling, and eff...
       2008-04-08
           18375
                                Key Largo
    
    
      26964
      NaN
      NaN
            NaN
            NaN
                Tom Milne
        fresh
        40506
                           Time Out
       Although the characters are basically stereoty...
       2006-06-24
           18375
                                Key Largo
    
    
      26965
      NaN
      NaN
            NaN
            NaN
          Bosley Crowther
       rotten
        40506
                     New York Times
       The script prepared by Mr. Huston and Richard ...
       2006-03-25
           18375
                                Key Largo
    
    
      26966
      NaN
      NaN
            NaN
            NaN
              Bob Longino
        fresh
       364435
       Atlanta Journal-Constitution
                               Disturbing and affecting.
       2006-08-31
       748169836
                                 Jailbait
    
    
      26967
      NaN
      NaN
            NaN
            NaN
            Carrie Rickey
       rotten
       364435
              Philadelphia Inquirer
       Claustrophobic and overwrought, Jailbait is an...
       2006-08-18
       748169836
                                 Jailbait
    
    
      26968
      NaN
      NaN
            NaN
            NaN
             Frank Scheck
       rotten
       364435
                 Hollywood Reporter
       While the stars deliver highly committed perfo...
       2006-08-17
       748169836
                                 Jailbait
    
    
      26969
      NaN
      NaN
            NaN
            NaN
               Laura Kern
       rotten
       364435
                     New York Times
       A stagy, only mildly compelling prison drama t...
       2006-08-04
       748169836
                                 Jailbait
    
    
      26970
      NaN
      NaN
            NaN
            NaN
             Lou Lumenick
       rotten
       364435
                      New York Post
       I wouldn't have thought it was possible to mak...
       2006-08-04
       748169836
                                 Jailbait
    
    
      26971
      NaN
      NaN
            NaN
            NaN
             Jack Mathews
       rotten
       364435
                New York Daily News
       The cruelty of the law has been better demonst...
       2006-08-04
       748169836
                                 Jailbait
    
    
      26972
      NaN
      NaN
            NaN
            NaN
               Jim Ridley
       rotten
       364435
                      Village Voice
       ... the umpteenth prison drama to focus on the...
       2006-08-02
       748169836
                                 Jailbait
    
    
      26985
      NaN
      NaN
            NaN
            NaN
            Vincent Canby
       rotten
        74695
                     New York Times
       Mr. Peckinpah's least interesting, least perso...
       2005-05-09
           13694
                            Cross of Iron
    
    
      26987
      NaN
      NaN
            NaN
            NaN
                Dave Kehr
       rotten
        42276
                     Chicago Reader
       George Cukor directed, a little impersonally f...
       2008-01-11
           18781
                           Born Yesterday
    
    
      26989
      NaN
      NaN
            NaN
            NaN
          Bosley Crowther
        fresh
        42276
                     New York Times
       More firm in its social implications than ever...
       2003-05-20
           18781
                           Born Yesterday
    
    
      26990
      NaN
      NaN
            NaN
            NaN
            Variety Staff
       rotten
        86969
                            Variety
       Belying the lightheartedness of its title, Bir...
       2008-09-16
           15889
                                    Birdy
    
    
      26992
      NaN
      NaN
            NaN
            NaN
              Roger Ebert
        fresh
        86969
                  Chicago Sun-Times
                     A very strange and beautiful movie.
       2004-10-23
           15889
                                    Birdy
    
    
      26993
      NaN
      NaN
            NaN
            NaN
             Janet Maslin
        fresh
        86969
                     New York Times
                            Most of Birdy is enchanting.
       2003-05-20
           15889
                                    Birdy
    
    
      26995
      NaN
      NaN
            NaN
            NaN
          Bosley Crowther
       rotten
        49189
                     New York Times
       We can't recommend this little item as a sampl...
       2006-10-30
           11854
                 ...And God Created Woman
    
    
      26997
      NaN
      NaN
            NaN
            NaN
         Richard Schickel
        fresh
        86005
                      TIME Magazine
       Ballard and his masterly crew of film makers h...
       2009-03-09
           12606
                           Never Cry Wolf
    
    
      26998
      NaN
      NaN
            NaN
            NaN
          Ronald Holloway
        fresh
        86005
                            Variety
       Measures up to the promise Ballard amply provi...
       2008-07-23
           12606
                           Never Cry Wolf
    
    
      27000
      NaN
      NaN
            NaN
            NaN
            Vincent Canby
        fresh
        86005
                     New York Times
       Perhaps the best thing about the film is that ...
       2004-08-30
           12606
                           Never Cry Wolf
    
    
      27001
      NaN
      NaN
            NaN
            NaN
                Dave Kehr
        fresh
        86005
                     Chicago Reader
       The film is still memorable for its compassion...
       2000-01-01
           12606
                           Never Cry Wolf
    
    
      27008
      NaN
      NaN
            NaN
            NaN
               Don Druker
        fresh
        55353
                     Chicago Reader
       It does have enough gritty insights and (for t...
       2007-11-13
           18541
                      A Raisin in the Sun
    
  

22007 rows × 12 columns



In [199]:

    
Y[1000:1010]
critics[1000:1010]
df2 = pd.DataFrame({'Y':Y})









    Out[199]:






  
    
      
      Y
      critic
      fresh
      imdb
      publication
      quote
      review_date
      rtid
      title
    
  
  
    
      0    
      NaN
              Derek Adams
        fresh
       114709
                      Time Out
       So ingenious in concept, design and execution ...
       2009-10-04
        9559
               Toy Story
    
    
      1    
      NaN
          Richard Corliss
        fresh
       114709
                 TIME Magazine
                       The year's most inventive comedy.
       2008-08-31
        9559
               Toy Story
    
    
      2    
      NaN
              David Ansen
        fresh
       114709
                      Newsweek
       A winning animated feature that has something ...
       2008-08-18
        9559
               Toy Story
    
    
      3    
      NaN
            Leonard Klady
        fresh
       114709
                       Variety
       The film sports a provocative and appealing st...
       2008-06-09
        9559
               Toy Story
    
    
      4    
      NaN
       Jonathan Rosenbaum
        fresh
       114709
                Chicago Reader
       An entertaining computer-generated, hyperreali...
       2008-03-10
        9559
               Toy Story
    
    
      5    
      NaN
            Michael Booth
        fresh
       114709
                   Denver Post
       As Lion King did before it, Toy Story revived ...
       2007-05-03
        9559
               Toy Story
    
    
      6    
      NaN
             Geoff Andrew
        fresh
       114709
                      Time Out
       The film will probably be more fully appreciat...
       2006-06-24
        9559
               Toy Story
    
    
      7    
      NaN
             Janet Maslin
        fresh
       114709
                New York Times
       Children will enjoy a new take on the irresist...
       2003-05-20
        9559
               Toy Story
    
    
      8    
      NaN
            Kenneth Turan
        fresh
       114709
             Los Angeles Times
       Although its computer-generated imagery is imp...
       2001-02-13
        9559
               Toy Story
    
    
      9    
      NaN
         Susan Wloszczyna
        fresh
       114709
                     USA Today
       How perfect that two of the most popular funny...
       2000-01-01
        9559
               Toy Story
    
    
      10   
      NaN
              Roger Ebert
        fresh
       114709
             Chicago Sun-Times
       The result is a visionary roller-coaster ride ...
       2000-01-01
        9559
               Toy Story
    
    
      11   
      NaN
               John Hartl
        fresh
       114709
                      Film.com
       Disney's witty, wondrously imaginative, all-co...
       2000-01-01
        9559
               Toy Story
    
    
      12   
      NaN
              Susan Stark
        fresh
       114709
                  Detroit News
       Disney's first computer-made animated feature ...
       2000-01-01
        9559
               Toy Story
    
    
      13   
      NaN
              Peter Stack
        fresh
       114709
       San Francisco Chronicle
       The script, by Lasseter, Pete Docter, Andrew S...
       2000-01-01
        9559
               Toy Story
    
    
      14   
      NaN
       James Berardinelli
        fresh
       114709
                     ReelViews
       The one big negative about Toy Story involves ...
       2000-01-01
        9559
               Toy Story
    
    
      15   
      NaN
               Sean Means
        fresh
       114709
                      Film.com
              Technically, Toy Story is nearly flawless.
       2000-01-01
        9559
               Toy Story
    
    
      16   
      NaN
             Rita Kempley
        fresh
       114709
               Washington Post
       It's a nice change of pace to see the studio d...
       2000-01-01
        9559
               Toy Story
    
    
      17   
      NaN
                      NaN
        fresh
       114709
          Entertainment Weekly
       I can hardly imagine having more fun at the mo...
       1995-11-22
        9559
               Toy Story
    
    
      18   
      NaN
              Roger Moore
        fresh
       114709
              Orlando Sentinel
       The great voice acting, the visual puns, all a...
       1995-11-22
        9559
               Toy Story
    
    
      19   
      NaN
              Roger Ebert
       rotten
       113497
             Chicago Sun-Times
       A gloomy special-effects extravaganza filled w...
       2000-01-01
       12436
                 Jumanji
    
    
      20   
      NaN
                      NaN
        fresh
       113497
                     USA Today
       A calculated but very entertaining special eff...
       2000-01-01
       12436
                 Jumanji
    
    
      21   
      NaN
         Richard Schickel
        fresh
       107050
                 TIME Magazine
       Walter Matthau and Jack Lemmon are awfully goo...
       2008-08-24
       10498
          Grumpy Old Men
    
    
      22   
      NaN
              Derek Adams
       rotten
       107050
                      Time Out
                                  Mediocre, regrettably.
       2006-06-24
       10498
          Grumpy Old Men
    
    
      23   
      NaN
              Caryn James
        fresh
       107050
                New York Times
       Just don't expect their bickering to be on the...
       2003-05-20
       10498
          Grumpy Old Men
    
    
      24   
      NaN
       James Berardinelli
        fresh
       107050
                     ReelViews
       While it won't come close to my top 10 best li...
       2000-01-01
       10498
          Grumpy Old Men
    
    
      25   
      NaN
              Roger Ebert
       rotten
       107050
             Chicago Sun-Times
       The movie is too pat and practiced to really b...
       2000-01-01
       10498
          Grumpy Old Men
    
    
      26   
      NaN
           Desson Thomson
        fresh
       107050
               Washington Post
       If you poke through Grumpy's cheap sentimental...
       2000-01-01
       10498
          Grumpy Old Men
    
    
      27   
      NaN
               Liam Lacey
       rotten
       114885
                Globe and Mail
       Never escapes the queasy aura of Melrose Place...
       2002-04-12
       16697
       Waiting to Exhale
    
    
      28   
      NaN
            Kenneth Turan
        fresh
       114885
             Los Angeles Times
       A pleasant if undemanding piece of work that i...
       2001-02-13
       16697
       Waiting to Exhale
    
    
      29   
      NaN
          Edward Guthmann
       rotten
       114885
       San Francisco Chronicle
       You want the movie to stomp and rejoice and cr...
       2000-01-01
       16697
       Waiting to Exhale
    
    
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
      ...
    
    
      31528
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31529
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31530
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31531
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31532
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31533
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31534
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31535
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31536
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31537
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31538
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31539
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31540
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31541
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31542
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31543
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31544
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31545
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31546
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31547
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31548
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31549
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31550
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31551
        0
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31552
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31553
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31554
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31555
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31556
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
    
      31557
        1
                      NaN
          NaN
          NaN
                           NaN
                                                     NaN
              NaN
         NaN
                     NaN
    
  

31558 rows × 9 columns



In [ ]:



In [135]:

    
rotten = comb[comb.actual == 0]
rotten.sort(columns="proba0", inplace=True)
print "Actual Was Rotten, But Predicted Fresh"
rotten[rotten.quote.notnull()].head()









    



Actual Was Rotten, But Predicted Fresh






    Out[135]:






  
    
      
      actual
      pred
      proba0
      proba1
      critic
      fresh
      imdb
      publication
      quote
      review_date
      rtid
      title
    
  
  
    
      8290 
       0
       1
       0.004051
       0.995949
       Lawrence Van Gelder
        fresh
       117011
             New York Times
       From start to finish, Maximum Risk presents sp...
       2000-01-01
           15020
                         Maximum Risk
    
    
      4723 
       0
       1
       0.004158
       0.995842
           Owen Gleiberman
        fresh
       108551
       Entertainment Weekly
       A splashy, volatile, crowd- pleasing rock-star...
       2011-09-07
           13236
       What's Love Got To Do With It?
    
    
      15171
       0
       1
       0.006613
       0.993387
             Dennis Harvey
       rotten
       138563
                    Variety
       In the end, too, we've learned very little abo...
       2009-03-26
           13804
                      Kurt & Courtney
    
    
      14177
       0
       1
       0.008705
       0.991295
             Todd McCarthy
        fresh
       120347
                    Variety
       A solid but somewhat by-the-numbers entry in t...
       2008-07-28
           11715
                  Tomorrow Never Dies
    
    
      2886 
       0
       1
       0.011203
       0.988797
               Kevin Crust
        fresh
       331933
          Los Angeles Times
       Herek keeps things moving and throws in some l...
       2005-02-28
       136262883
                     Man of the House

4.3 What do you notice about these mis-predictions? Naive Bayes classifiers assume that every word affects the probability independently of other words. In what way is this a bad assumption? In your answer, report your classifier's Freshness probability for the review "This movie is not remarkable, touching, or superb in any way".

Your answer here

4.4 If this was your final project, what are 3 things you would try in order to build a more effective review classifier? What other exploratory or explanatory visualizations do you think might be helpful?

Your answer here

How to Submit

Restart and run your notebook one last time, to make sure the output from each cell is up to date. To submit your homework, create a folder named lastname_firstinitial_hw3 and place your solutions in the folder. Double check that the file is still called HW3.ipynb, and that it contains your code. Please do not include the critics.csv data file, if you created one. Compress the folder (please use .zip compression) and submit to the CS109 dropbox in the appropriate folder. If we cannot access your work because these directions are not followed correctly, we will not grade your work!

css tweaks in this cell

	id	title	imdbID	spanishTitle	imdbPictureURL	year	rtID	rtAllCriticsRating	rtAllCriticsNumReviews	rtAllCriticsNumFresh	rtAllCriticsNumRotten	rtAllCriticsScore	rtTopCriticsRating	rtTopCriticsNumReviews	rtTopCriticsNumFresh	rtTopCriticsNumRotten	rtTopCriticsScore	rtAudienceRating	rtAudienceNumRatings	rtAudienceScore	rtPictureURL
0	1	Toy story	114709	Toy story (juguetes)	http://ia.media-imdb.com/images/M/MV5BMTMwNDU0...	1995	toy_story	9	73	73	0	100	8.5	17	17	0	100	3.7	102338	81	http://content7.flixster.com/movie/10/93/63/10...
1	2	Jumanji	113497	Jumanji	http://ia.media-imdb.com/images/M/MV5BMzM5NjE1...	1995	1068044-jumanji	5.6	28	13	15	46	5.8	5	2	3	40	3.2	44587	61	http://content8.flixster.com/movie/56/79/73/56...
2	3	Grumpy Old Men	107050	Dos viejos gruñones	http://ia.media-imdb.com/images/M/MV5BMTI5MTgy...	1993	grumpy_old_men	5.9	36	24	12	66	7	6	5	1	83	3.2	10489	66	http://content6.flixster.com/movie/25/60/25602...
3	4	Waiting to Exhale	114885	Esperando un respiro	http://ia.media-imdb.com/images/M/MV5BMTczMTMy...	1995	waiting_to_exhale	5.6	25	14	11	56	5.5	11	5	6	45	3.3	5666	79	http://content9.flixster.com/movie/10/94/17/10...
4	5	Father of the Bride Part II	113041	Vuelve el padre de la novia (Ahora también abu...	http://ia.media-imdb.com/images/M/MV5BMTg1NDc2...	1995	father_of_the_bride_part_ii	5.3	19	9	10	47	5.4	5	1	4	20	3	13761	64	http://content8.flixster.com/movie/25/54/25542...

	critic	publication	criticname
critic
Roger Ebert	1129	[At the Movies, Chicago Sun-Times, RogerEbert....	Roger Ebert
James Berardinelli	800	ReelViews	James Berardinelli
Janet Maslin	525	New York Times	Janet Maslin
Variety Staff	446	Variety	Variety Staff
Jonathan Rosenbaum	411	Chicago Reader	Jonathan Rosenbaum

	critic	fresh	imdb	publication	quote	review_date	rtid	title
5039	Jeff Shannon	fresh	110955	Seattle Times	It's miraculous casting, and the Australian Da...	2013-12-06	13147	The Ref
5040	Kenneth Turan	fresh	110955	Los Angeles Times	The Ref benefits from having actor's actors li...	2013-12-06	13147	The Ref
5041	Michael Wilmington	rotten	110955	Chicago Tribune	It's not a bad idea, but it's not a good movie...	2013-12-06	13147	The Ref
5042	Steven Rea	rotten	110955	Philadelphia Inquirer	Whether it's a function of sloppy editing or s...	2013-12-06	13147	The Ref
5043	Owen Gleiberman	rotten	110955	Entertainment Weekly	A foulmouthed sitcom of a film.	2011-09-07	13147	The Ref
5044	Variety Staff	rotten	110955	Variety	The Ref works virtually none of the miracles o...	2009-03-26	13147	The Ref
5045	Jonathan Rosenbaum	fresh	110955	Chicago Reader	What makes most of this work is the brio of th...	2007-11-27	13147	The Ref
5046	Geoff Andrew	fresh	110955	Time Out	In his first starring role, comedian Leary mak...	2006-02-09	13147	The Ref
5047	Caryn James	fresh	110955	New York Times	Staying clear of any mean-spirited attitudes, ...	2003-05-20	13147	The Ref
5048	Peter Travers	fresh	110955	Rolling Stone	Demme brings out the comic ease in Leary.	2001-05-12	13147	The Ref
5049	Hal Hinson	rotten	110955	Washington Post	The Ref is one of those rare movies that seem ...	2000-01-01	13147	The Ref
5050	James Berardinelli	rotten	110955	ReelViews	This is not a seamlessly constructed movie, bu...	2000-01-01	13147	The Ref
5051	Desson Thomson	rotten	110955	Washington Post	This is one holiday party you'll want to miss.	2000-01-01	13147	The Ref
5052	Roger Ebert	fresh	110955	Chicago Sun-Times	Material like this is only as good as the acti...	2000-01-01	13147	The Ref

	actual	pred	proba0	proba1
0	1	1	0.197385	0.802615
1	1	1	0.154380	0.845620
2	1	1	0.135735	0.864265
3	1	1	0.029535	0.970465
4	1	1	0.013373	0.986627
5	1	1	0.001520	0.998480
6	1	1	0.105968	0.894032
7	1	1	0.044276	0.955724
8	1	1	0.023551	0.976449
9	1	1	0.019731	0.980269
10	1	1	0.291099	0.708901
11	1	1	0.008019	0.991981
12	1	1	0.133066	0.866934
13	1	1	0.339067	0.660933
14	1	1	0.135005	0.864995
15	1	1	0.154506	0.845494
16	1	1	0.027510	0.972490
17	1	1	0.075791	0.924209
18	0	1	0.116994	0.883006
19	1	0	0.719362	0.280638
20	0	0	0.678488	0.321512
21	1	0	0.660961	0.339039
22	1	1	0.035450	0.964550
23	0	0	0.982669	0.017331
24	1	0	0.881747	0.118253
25	0	1	0.333123	0.666877
26	1	1	0.381607	0.618393
27	0	0	0.632378	0.367622
28	1	1	0.080019	0.919981
29	0	0	0.932979	0.067021
...	...	...	...	...
15018	1	1	0.117484	0.882516
15019	1	1	0.328170	0.671830
15020	1	1	0.010874	0.989126
15021	1	1	0.013037	0.986963
15022	1	1	0.220809	0.779191
15023	1	0	0.552798	0.447202
15024	0	1	0.406322	0.593678
15025	1	1	0.267228	0.732772
15026	1	0	0.562191	0.437809
15027	1	0	0.667167	0.332833
15028	0	0	0.502714	0.497286
15029	1	1	0.128733	0.871267
15030	0	1	0.389252	0.610748
15031	0	1	0.243758	0.756242
15032	0	0	0.955301	0.044699
15033	0	0	0.949083	0.050917
15034	0	0	0.956573	0.043427
15035	0	1	0.469129	0.530871
15036	0	1	0.064388	0.935612
15037	0	1	0.431358	0.568642
15038	1	1	0.025647	0.974353
15039	0	0	0.590650	0.409350
15040	1	1	0.120616	0.879384
15041	1	1	0.139223	0.860777
15042	0	1	0.491241	0.508759
15043	1	1	0.198314	0.801686
15044	1	1	0.213662	0.786338
15045	1	1	0.429152	0.570848
15046	1	1	0.047312	0.952688
15047	1	0	0.717020	0.282980

	Y	critic	fresh	imdb	publication	quote	review_date	rtid	title
0	NaN	Derek Adams	fresh	114709	Time Out	So ingenious in concept, design and execution ...	2009-10-04	9559	Toy Story
1	NaN	Richard Corliss	fresh	114709	TIME Magazine	The year's most inventive comedy.	2008-08-31	9559	Toy Story
2	NaN	David Ansen	fresh	114709	Newsweek	A winning animated feature that has something ...	2008-08-18	9559	Toy Story
3	NaN	Leonard Klady	fresh	114709	Variety	The film sports a provocative and appealing st...	2008-06-09	9559	Toy Story
4	NaN	Jonathan Rosenbaum	fresh	114709	Chicago Reader	An entertaining computer-generated, hyperreali...	2008-03-10	9559	Toy Story
5	NaN	Michael Booth	fresh	114709	Denver Post	As Lion King did before it, Toy Story revived ...	2007-05-03	9559	Toy Story
6	NaN	Geoff Andrew	fresh	114709	Time Out	The film will probably be more fully appreciat...	2006-06-24	9559	Toy Story
7	NaN	Janet Maslin	fresh	114709	New York Times	Children will enjoy a new take on the irresist...	2003-05-20	9559	Toy Story
8	NaN	Kenneth Turan	fresh	114709	Los Angeles Times	Although its computer-generated imagery is imp...	2001-02-13	9559	Toy Story
9	NaN	Susan Wloszczyna	fresh	114709	USA Today	How perfect that two of the most popular funny...	2000-01-01	9559	Toy Story
10	NaN	Roger Ebert	fresh	114709	Chicago Sun-Times	The result is a visionary roller-coaster ride ...	2000-01-01	9559	Toy Story
11	NaN	John Hartl	fresh	114709	Film.com	Disney's witty, wondrously imaginative, all-co...	2000-01-01	9559	Toy Story
12	NaN	Susan Stark	fresh	114709	Detroit News	Disney's first computer-made animated feature ...	2000-01-01	9559	Toy Story
13	NaN	Peter Stack	fresh	114709	San Francisco Chronicle	The script, by Lasseter, Pete Docter, Andrew S...	2000-01-01	9559	Toy Story
14	NaN	James Berardinelli	fresh	114709	ReelViews	The one big negative about Toy Story involves ...	2000-01-01	9559	Toy Story
15	NaN	Sean Means	fresh	114709	Film.com	Technically, Toy Story is nearly flawless.	2000-01-01	9559	Toy Story
16	NaN	Rita Kempley	fresh	114709	Washington Post	It's a nice change of pace to see the studio d...	2000-01-01	9559	Toy Story
17	NaN	NaN	fresh	114709	Entertainment Weekly	I can hardly imagine having more fun at the mo...	1995-11-22	9559	Toy Story
18	NaN	Roger Moore	fresh	114709	Orlando Sentinel	The great voice acting, the visual puns, all a...	1995-11-22	9559	Toy Story
19	NaN	Roger Ebert	rotten	113497	Chicago Sun-Times	A gloomy special-effects extravaganza filled w...	2000-01-01	12436	Jumanji
20	NaN	NaN	fresh	113497	USA Today	A calculated but very entertaining special eff...	2000-01-01	12436	Jumanji
21	NaN	Richard Schickel	fresh	107050	TIME Magazine	Walter Matthau and Jack Lemmon are awfully goo...	2008-08-24	10498	Grumpy Old Men
22	NaN	Derek Adams	rotten	107050	Time Out	Mediocre, regrettably.	2006-06-24	10498	Grumpy Old Men
23	NaN	Caryn James	fresh	107050	New York Times	Just don't expect their bickering to be on the...	2003-05-20	10498	Grumpy Old Men
24	NaN	James Berardinelli	fresh	107050	ReelViews	While it won't come close to my top 10 best li...	2000-01-01	10498	Grumpy Old Men
25	NaN	Roger Ebert	rotten	107050	Chicago Sun-Times	The movie is too pat and practiced to really b...	2000-01-01	10498	Grumpy Old Men
26	NaN	Desson Thomson	fresh	107050	Washington Post	If you poke through Grumpy's cheap sentimental...	2000-01-01	10498	Grumpy Old Men
27	NaN	Liam Lacey	rotten	114885	Globe and Mail	Never escapes the queasy aura of Melrose Place...	2002-04-12	16697	Waiting to Exhale
28	NaN	Kenneth Turan	fresh	114885	Los Angeles Times	A pleasant if undemanding piece of work that i...	2001-02-13	16697	Waiting to Exhale
29	NaN	Edward Guthmann	rotten	114885	San Francisco Chronicle	You want the movie to stomp and rejoice and cr...	2000-01-01	16697	Waiting to Exhale
...	...	...	...	...	...	...	...	...	...
31528	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31529	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31530	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31531	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31532	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31533	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31534	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31535	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31536	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31537	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31538	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31539	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31540	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31541	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31542	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31543	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31544	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31545	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31546	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31547	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31548	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31549	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31550	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31551	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31552	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31553	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31554	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31555	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31556	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
31557	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	pred	proba0	proba1	critic	fresh	imdb	publication	quote	review_date	rtid	title
8290	1	0.004051	0.995949	Lawrence Van Gelder	fresh	117011	New York Times	From start to finish, Maximum Risk presents sp...	2000-01-01	15020	Maximum Risk
4723	1	0.004158	0.995842	Owen Gleiberman	fresh	108551	Entertainment Weekly	A splashy, volatile, crowd- pleasing rock-star...	2011-09-07	13236	What's Love Got To Do With It?
15171	1	0.006613	0.993387	Dennis Harvey	rotten	138563	Variety	In the end, too, we've learned very little abo...	2009-03-26	13804	Kurt & Courtney
14177	1	0.008705	0.991295	Todd McCarthy	fresh	120347	Variety	A solid but somewhat by-the-numbers entry in t...	2008-07-28	11715	Tomorrow Never Dies
2886	1	0.011203	0.988797	Kevin Crust	fresh	331933	Los Angeles Times	Herek keeps things moving and throws in some l...	2005-02-28	136262883	Man of the House