Our first attempt can be read about and accessed at this notebook: LINK HERE
About halfway into our project, we realized that our data was less than ideal for a few reasons. First, our Box Office Mojo scrape originally covered only the top 100 ranked movies for each year, i.e. only the upper echelon of movie incomes. The low range of incomes, we presume, had less power to pick up a trend or correlation with valence statistics. We realized that Box Office Mojo actually includes many more movies: rather than just the top 100, it provides grossing information on all of the major movies for the year at hand, usually more than 500. Second, the overlap between our Stanford IMDB data set and our Box Office Mojo data was just over 300 movies, representing 4,111 overlapping reviews. We decided that we needed more data to better determine predictive power and the presence of trends. Third, our IMDB data contained only polarized reviews, i.e. those with stars below 4 or above 7. We decided that this would hamper our ability to test one particular hypothesis: that polarizing movies earn more money. As a result, we needed to scrape a balanced sample of reviews from IMDB for each movie, without discriminating on the number of stars, so that reviews in the 4-7 star range are also represented.
Our new approach, as charted below, was to scrape all data from Box Office Mojo from 1990-2015, resulting in about 11,000 movies. For each of those movies, we scraped a maximum of 100 reviews from IMDB's website. This resulted in a final data set of over half a million reviews, no longer only polarized ones. We feel the new data is better suited to reveal the presence or absence of trends and the predictive relationship between review sentiment and gross income.
In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
import json
import statsmodels.api as sm
from statsmodels.formula.api import logit, glm, ols
from random import random
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import math
from sklearn.svm import LinearSVC
As described above, we scraped all Box Office Mojo data from 1990-2015 (about 11,000 movies) and up to 100 IMDB reviews per movie, giving a final data set of over half a million reviews that are no longer only polarized.
In [2]:
from bs4 import BeautifulSoup
import requests
Let's create a data frame bom_df to contain the scraped Box Office Mojo data.
In [3]:
bom_df = pd.DataFrame(columns=['close_date', 'gross', 'open_date', 'opening_gross', 'opening_theaters','ranking','title','total_theaters','year'])
The function rowInfoGrabber retrieves relevant values from each row in a Box Office Mojo table.
In [4]:
def rowInfoGrabber(r):
info = []
# Ranking
info.append(int(r.find("font").get_text()))
# Title
info.append(r.find("a").get_text())
# Gross
info.append(int(r.find("td", attrs={"align":"right"}).find("b").get_text().strip("$").replace(",","")))
'''
For the next 3 categories, we need to deal with the 2000 Anomaly "Fantasia" where there are missing numbers.
In this case I have chosen to replace the missing values 'N/A' with the values from 'Final Destination', which
is right above it in the movie table and differs in gross income by about $1 million, which is a small
difference. See the picture below for a snapshot of the anomaly in the movie table from 2000.
'''
# Total number of theaters
if r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","") == 'N/A':
info.append(2587)
else:
info.append(int(r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","")))
# Opening Gross
if r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","") == 'N/A':
info.append(10015822)
else:
info.append(int(r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","")))
# Opening Number of Theaters
if r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","") == 'N/A':
info.append(2587)
else:
info.append(int(r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","")))
# Date of Opening
if not r.find_all("td", attrs={"align":"right"})[4].find("a"):
info.append("-")
else:
info.append(r.find_all("td", attrs={"align":"right"})[4].find("a").get_text())
# Date of Closing: Before 2002 they didn't have a "closing" date in their tables. We must account for this.
if (len(r.find_all("td", attrs={"align":"right"})) <= 5):
info.append('-')
else:
info.append(r.find_all("td", attrs={"align":"right"})[5].find("font").get_text())
return info
Let's scrape all movies from Box Office Mojo from 1990-2015. This takes about 4 minutes.
In [5]:
%%time
fields = ["ranking", "title", "gross", "total_theaters", "opening_gross", "opening_theaters", "open_date", "close_date"]
years = [1990 + i for i in range(26)]
for year in years:
print year
pageText = requests.get("http://www.boxofficemojo.com/yearly/chart/?yr=%(yr)d&p=.htm" % {'yr':year})
soup = BeautifulSoup(pageText.text, "html.parser")
movieTable = soup.find("td", attrs={"colspan":"3"})
movieRows = movieTable.find("table").find_all("tr")[2:102]
movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
year_df = pd.DataFrame(movie_dicts)
year_df['year'] = year
bom_df = bom_df.append(year_df, ignore_index=True)
# See how many pages of movies there are for this particular year
extraPageLinks = soup.find("center").find("font", attrs={"face":"Verdana"}).find("b").find_all("a")
time.sleep(1)
for p in extraPageLinks:
pageText = requests.get('http://www.boxofficemojo.com' + p['href'])
soup = BeautifulSoup(pageText.text, "html.parser")
movieTable = soup.find("td", attrs={"colspan":"3"})
movieRows = movieTable.find("table").find_all("tr")
movieRows = movieRows[2:len(movieRows)-4]
movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
year_df = pd.DataFrame(movie_dicts)
year_df['year'] = year
bom_df = bom_df.append(year_df, ignore_index=True)
time.sleep(1)
Let's see how it looks.
In [6]:
bom_df.head()
Out[6]:
Now we'll save and reload so we don't have to do that again.
In [7]:
with open('newBOM.json','r') as fp:
bom_dict = json.load(fp)
bom_df = pd.DataFrame(bom_dict)
We want a less polarized review set in order to more accurately determine the effect of review polarity on gross income. For this we'll have to part with Andrew Maas's Large Movie Review Data Set and scrape IMDB ourselves for a mixture of polarized and non-polarized reviews for each movie. Unfortunately, IMDB's URLs contain a movie ID rather than a movie title, so we cannot request each page's HTML using the title alone. Fortunately, the OMDb API allows us to obtain a movie's IMDB ID based on its title by returning a JSON object.
Let's create a helper function that returns the relevant url used to query the OMDB API given the movie's title.
An example url is http://www.omdbapi.com/?t=Jurassic%20World.
In [8]:
'''
Name: urlOMDB
Input: Movie name
Output: url used to Query OMDB API
'''
def urlOMDB(title):
base = "http://www.omdbapi.com/?t="
base += title.split()[0]
for word in title.split()[1:]:
base += "%20" + word
return base
Let's test our function by getting the IMDB ID for "Jurassic World".
In [9]:
requests.get(urlOMDB("Jurassic World")).json()['imdbID']
Out[9]:
Perfect. Now we'll use these IDs to scrape IMDB for each movie in our Box Office Mojo data frame. First, let's write a function reviewsGrabber which returns a list of (title of review, stars, text of review) tuples for a given IMDB webpage. The function cleans the review texts as it collects them by removing punctuation and converting to lower case.
In [10]:
'''
Name: reviewsGrabber
Input: url (webpage)
Output: [(titleOfReview1, stars1, review1), (titleOfReview2, stars2, review2), ... (titleofReviewn, starsn, reviewn)]
'''
punctuation = set('.,;:!?()[]{}`''\"@#$%^&*+-|-=~_')
def reviewsGrabber(url):
pageText = requests.get(url)
soup = BeautifulSoup(pageText.text, "html.parser")
reviewTable = soup.find("div", attrs={"id":"tn15content"})
divs = []
for rev in reviewTable.find_all("div", attrs={"class": None}):
if len(rev.find_all("img")) > 1:
divs.append((rev.find("h2").get_text(), int(rev.find_all("img")[1]['alt'].split("/")[0])))
else:
divs.append((rev.find("h2").get_text(), 'N/A'))
reviews = []
for rev in reviewTable.find_all("p")[:len(reviewTable.find_all("p"))-1]:
if len(rev.find_all("b")) == 0:
text = rev.get_text()
text = ''.join(ch for ch in text if ch not in punctuation).replace("\n", " ").lower()
reviews.append(text)
return [(title,stars,text) for ((title,stars), text) in zip(divs,reviews)]
Now let's scrape each web page based on the movies in bom_df. We'll take a maximum of 100 reviews for each movie; some movies have fewer than 100 reviews. IMDB lists reviews by default from best to worst, which we don't want. Fortunately, IMDB offers the option to sort reviews in chronological order. This works particularly well for our project, because we want to see whether the earliest reviews are a good indicator of box office performance over the opening weekend.
I would frequently see an error, account for it by tweaking the code (maybe throwing in a try/except here and there), then continue where the code left off by changing the variable "count". (You can see that the most recent stopping point was 11,011.) Every once in a while, the code spits out my progress as a percentage. Babysitting my computer was quite the experience.
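For reference, the chronologically sorted review pages for a title follow a simple pattern: page p starts at review 10*(p-1), which the loop below encodes with ;start=10, ;start=20, and so on. Here is a small standalone sketch (not part of the original scraping code) that just builds those URLs for a hypothetical IMDB ID:
In [ ]:
def reviewPageURLs(imdbID, n_reviews, per_page=10):
    # first page of reviews, sorted chronologically (oldest first)
    base = 'http://www.imdb.com/title/%(imdbID)s/reviews?filter=chrono' % {'imdbID': imdbID}
    urls = [base]
    # later pages are addressed with ;start=10, ;start=20, ... i.e. 10*(p-1) for page p
    start = per_page
    while start < n_reviews:
        urls.append(base + ';start=%d' % start)
        start += per_page
    return urls

# e.g. the three pages needed for a movie with 25 reviews ('tt0000000' is just a placeholder ID)
print reviewPageURLs('tt0000000', 25)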
In [11]:
# fields = ['title','text','stars','close_date','open_date','opening_gross','gross','rank','total_theaters','year']
# count = 11011
# for index, row in bom_df[count:].iterrows():
# close, gross, op, op_gross, open_th, r, title, thea, yr = tuple([row[i] for i in range(9)])
# if random() < 0.02:
# print str(count*100.0 / 11858) + "%"
# try:
# omdb = requests.get(urlOMDB(title)).json()
# except ValueError:
# continue
# omdbJSON = dict(omdb)
# if 'Error' in omdbJSON.keys():
# continue
# else:
# imdbID = str(omdb['imdbID'])
# time.sleep(0.1)
# reviews = []
# baseURL = 'http://www.imdb.com/title/%(imdbID)s/reviews?filter=chrono' % {'imdbID':imdbID}
# pageText = requests.get(baseURL)
# soup = BeautifulSoup(pageText.text, "html.parser")
# if soup.find("div", attrs={'class':'error_code_404'}):
# continue
# try:
# reviews = reviews + reviewsGrabber(baseURL)
# except AttributeError:
# continue
# # Let's scrape the number of reviews for this given movie.
# table = soup.find("div", attrs={"id":"tn15content"}).find_all("table", attrs={'border':'0'})
# if table == []:
# continue
# td = table[1].find("td", attrs = {'align':'right'})
# if td == None:
# continue
# else:
# text = td.get_text()
# n_reviews = int(text.split("reviews")[0].split("\n")[1])
# # This number represents the page: page p is represented in the url as the integer 10(p-1)
# p = 10
# while len(reviews) < min(n_reviews, 100):
# url = baseURL + ";filter=chrono" + str(";start=%(page)d" % {'page': p})
# reviews = list(reviews + reviewsGrabber(url))
# p += 10
# reviews = reviews[:100]
# # review_dict = [dict(zip(fields, [[] for i in range(len(fields))]))]
# for review in reviews:
# review_dict = {'title':[title], 'text': [review[2]], 'review_title': [review[0]], 'stars': [review[1]],
# 'close_date': [close],'gross': [gross], 'open_date': [op], 'opening_gross':[op_gross],
# 'opening_theaters': [open_th], 'rank':[r], 'theaters': [thea], 'year': [yr]}
# combined_df = combined_df.append(pd.DataFrame(review_dict), ignore_index=True)
# count += 1
Let's save our freshly scraped movie reviews.
In [ ]:
# fp = open("complete.json","w")
# json.dump(combined_df.to_dict(), fp)
# fp.close()
with open("complete.json", "r") as fd:
combined_dict = json.load(fd)
combined_df = pd.DataFrame(combined_dict)
In [ ]:
print "Length of dataframe: ", len(combined_df.gross.values)
Because some films do not have a close date, we will have to be careful with the close date! Next, we combine the close_date, open_date and year columns into two time-series columns, close_date and open_date, which will make it easier for us to work with the data in the future. Since Box Office Mojo only gives a month and day, the close year is inferred by comparing the close month with the open month: if the close month is earlier than the open month, the film closed in the following calendar year (for example, a movie opening 11/20 and closing 2/15 in year 1998 gets a close_date in 1999).
In [ ]:
def lam(x, month=False):
if x == '-' or x =='':
return '1'
try:
dummy = int(x)
except ValueError:
return '1'
else:
return x[:x.find('/')]
# splitting the close_date and open_date into the respective month and day
combined_df['close_month'] = combined_df['close_date'].map(lam)
combined_df['close_day'] = combined_df['close_date'].map(lambda x: '0' if x=='-' else x[x.find('/')+1:len(x)])
combined_df['open_month'] = combined_df['open_date'].map(lam)
combined_df['open_day'] = combined_df['open_date'].map(lambda x: x[x.find('/')+1:len(x)])
# dropping the old close_date and open_date
combined_df = combined_df.drop('close_date', 1)
combined_df = combined_df.drop('open_date', 1)
# creating an open_year by turning the year column into a string and getting rid of trailing bits
combined_df['open_year'] = combined_df.year.astype(str)
combined_df['open_year'] = combined_df.open_year.map(lambda x: x[:x.find('.')])
# creating a close_year column, by looking at whether the close month is earlier/later than the open month in the year
close_month = combined_df.close_month.astype(int)
open_month = combined_df.open_month.astype(int)
year = combined_df['year'].astype(int)
close_year=[]
for i in range (0, len(year)):
if close_month[i] >= open_month[i]:
close_year.append(year[i])
else:
close_year.append(year[i]+1)
combined_df['close_year'] = close_year
combined_df['close_year'] = combined_df['close_year'].astype(str)
In [ ]:
# making close_date and open_date by concatenating the year, month and day
import datetime
close_date = []
for index, row in combined_df.iterrows():
if row.close_day != '0':
close_date.append(datetime.datetime(int(row.close_year), int(row.close_month), int(row.close_day)))
else:
close_date.append(None)
combined_df['close_date'] = close_date
combined_df['open_date']=combined_df.open_year + '-' + combined_df.open_month + '-' + combined_df.open_day
def lam2(x):
if x[7] == '-' or x[7:10] == 'Jan':
return x[:7] + "1"
else:
return x
combined_df['open_date']=combined_df['open_date'].map(lam2).apply(pd.datetools.parse)
Let's drop obsolete columns.
In [ ]:
combined_df = combined_df.drop('close_day', 1)
combined_df = combined_df.drop('close_month', 1)
combined_df = combined_df.drop('open_day', 1)
combined_df = combined_df.drop('open_month', 1)
And now let's save our results again.
In [ ]:
# import pymongo
# from bson import json_util
# fp = open("complete_dates.json","w")
# json.dump(combined_df.to_dict(), fp, default=json_util.default)
# fp.close()
with open("complete_dates.json", "r") as fd:
complete_dates = json.load(fd)
combined_df = pd.DataFrame(complete_dates)
Datetime objects are not JSON serializable; we had to convert them in order to save, and now we need to map them back to normal.
In [ ]:
def getdate(dic):
if dic != None:
return int(dic["$date"])
else:
return None
combined_df.open_date = combined_df.open_date.map(getdate)
combined_df.close_date = combined_df.close_date.map(getdate)
It's now time to collect valence statistics on our reviews in order to do sentiment analysis. We'll be using labMT (language assessment by Mechanical Turk), a word score list for sentiment analysis containing over 10,000 words available in the qdap Dictionaries package in R (http://www.inside-r.org/packages/cran/qdapDictionaries/docs/labMT). The file contains a "happiness" value, and ranks words by their happiness. It also includes mean and standard deviation, Twitter rank and Google rank.
In [ ]:
url = 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0026752.s001'
labmt = pd.read_csv(url, skiprows=2, sep='\t', index_col=0)
In [ ]:
labmt.head()
Let's make a happiness dictionary consisting of the word's happiness minus the average of all words.
In [ ]:
average = labmt.happiness_average.mean()
happiness = (labmt.happiness_average - average).to_dict()
Let's look at the happiness values to find a narrow middle range. We'll want to disregard all words in this middle range so they do not overpower the polarized words.
In [ ]:
plt.hist(happiness.values())
plt.axvline(np.percentile(happiness.values(),35), color='r')
plt.axvline(np.percentile(happiness.values(),65), color='r')
We'll consider words below the 35th and above the 65th percentiles.
In [ ]:
lower_bound = np.percentile(happiness.values(),35)
upper_bound = np.percentile(happiness.values(),65)
print lower_bound, upper_bound
Now we'll write a function to remove stop words, words that have little to do with sentiment analysis. We use the sklearn package to obtain these words.
In [ ]:
from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS
punctuation = list('.,;:!?()[]{}`''\"@#$%^&*+-|-=~_')
def removeStopWords(text, stopwords = stopwords):
new_text = ""
for word in text.split():
if word not in stopwords:
while len(word) != 0 and word[-1] in punctuation:
word = word[:len(word)-1]
new_text += word + ' '
return new_text
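A quick check on a made-up string shows both behaviors: stop words are dropped and trailing punctuation is stripped.
In [ ]:
print removeStopWords("the plot was not very good, honestly!")
# -> 'plot good honestly '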
Finally we're ready to analyze a particular review given our valence dictionary. So let's put it in a function.
In [10]:
'''
Name: getValenceInfo()
Inputs: review text, dictionary of happiness
Returns: a 4-tuple of (happiness total, happiness average, total # of scorable words, % of scorable words)
'''
def getValenceInfo(text, valenceDict):
total_words = len(text.split())
happiness_total, count_relevant = 0, 0
for word in text.split():
try:
# Only consider words below the 35th and above the 65th percentile
if valenceDict[word] <= lower_bound or valenceDict[word] >= upper_bound:
happiness_total += valenceDict[word]
count_relevant += 1
except KeyError:
continue
if count_relevant != 0:
avg_valence = 1.*happiness_total/count_relevant
else:
avg_valence = 0
return happiness_total, avg_valence, total_words, 1.*count_relevant / total_words
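As a quick sanity check, here it is on an invented review (the exact numbers depend on the labMT scores, so none are asserted here):
In [ ]:
# clean an invented review, then score it with the happiness dictionary
sample = removeStopWords("the plot was wonderful but the ending felt terrible", stopwords)
print getValenceInfo(sample, happiness)
# -> (happiness_total, avg_valence, total_words, fraction_of_scorable_words)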
Now let's write a function getAllInfo to run this over every review in our data frame.
In [ ]:
'''
Name: getAllInfo
Input: data frame, happiness dictionary, list of stop words
Returns: a new data frame with 4 new columns: valence_sum, valence_avg, n_scorables, pct_scorables
'''
def getAllInfo(df, valenceDict, stopwords):
valence_suml, valence_avgl, review_lenl, review_fractionl = [], [], [], []
for i, row in df.iterrows():
if random() < 0.00001:
print i
cleaned_review = removeStopWords(row['text'], stopwords)
valence_sum, valence_avg, review_len, review_fraction = getValenceInfo(cleaned_review, valenceDict)
valence_suml.append(valence_sum)
valence_avgl.append(valence_avg)
review_lenl.append(review_len)
review_fractionl.append(review_fraction)
conc = pd.DataFrame({'valence_sum': valence_suml, 'valence_avg':valence_avgl ,'n_scorables': review_lenl,
'pct_scorables': review_fractionl})
return pd.concat([df, conc], axis=1)
The code takes about 5 minutes to run. We save the results below.
In [ ]:
%%time
valence_df = getAllInfo(combined_df, happiness, stopwords)
Let's take a look, then save again. The file is almost 1GB in size, so it takes a minute to save.
In [ ]:
valence_df.head()
In [ ]:
# import pymongo
# from bson import json_util
# fp = open("complete_valence.json","w")
# json.dump(valence_df.to_dict(), fp)
# fp.close()
In [40]:
#Loading the latest complete dataset
with open("complete_valence.json", "r") as fp:
valence_df_dict = json.load(fp)
dftouse = pd.DataFrame(valence_df_dict)
In [41]:
dftouse = dftouse.reset_index().drop('index',1)
In [42]:
inflation = pd.read_csv("inf.csv")
In [43]:
years_90 = range(1990,2015)
infdict = {}
infindex = 0
infvalue = 1
testlist = []
for row in inflation.values:
currentval = 1 + (row[1]/100)
cuminf = infvalue*currentval
infdict[years_90[infindex]] = cuminf
infindex += 1
infvalue = cuminf
testlist.append(cuminf)
inframe = pd.DataFrame(data=testlist, index=range(1990,2015))
In [44]:
infdict
Out[44]:
In [45]:
#Creating two new lists of adjusted gross values.
#If the open year of the movie was 2015, we use the 2014 inflation data, since 2015 data is not released until 2016
newgross, newopengross = [], []
for gross, opengross, openyr in zip(dftouse.gross, dftouse.opening_gross, dftouse.open_year):
if int(openyr) == 2015:
openyr = 2014
newgross.append(infdict[int(openyr)]*gross)
newopengross.append(infdict[int(openyr)]*opengross)
In [46]:
#dftouse = flattened_df
dftouse['adj_gross'] = newgross
dftouse['adj_opening_gross'] = newopengross
In [47]:
# creating binary variables of adjusted grossing for classifications
dftouse['adj_gross_bin'] = 1*(dftouse.adj_gross>np.mean(dftouse.adj_gross))
dftouse['adj_opening_gross_bin'] = 1*(dftouse.adj_opening_gross>np.mean(dftouse.adj_opening_gross))
In [48]:
dftouse.head()
Out[48]:
Now that we have adjusted for inflation, we can remove the previous gross and all columns pertaining to date (except perhaps year, since it's interesting).
In [49]:
keep = ['close_date','close_year', 'adj_gross','adj_gross_bin','adj_opening_gross_bin', 'n_scorables','open_date','open_year',
'adj_opening_gross','opening_theaters','pct_scorables', 'rank', 'review_title','stars',
'text','theaters','title','valence_avg','valence_sum','year']
dftouse = dftouse[keep]
Let's save
In [50]:
fp = open("dftouse.json","w")
json.dump(dftouse.to_dict(), fp)
fp.close()
# with open("dftouse.json","r") as fp:
# dftouse_dict = json.load(fp)
# dftouse = pd.DataFrame(dftouse_dict)
Now that we have a dataframe of all the reviews, we want to see if we can use sentiment analysis to better describe the text. Namely, we want to emulate the method used in HW5 to see what each review has to say about different topics related to movies.
In [51]:
with open("dftouse.json", "r") as fp:
dftouse = json.load(fp)
maas_IMDB = pd.DataFrame(dftouse)
In [ ]:
maas_IMDB=maas_IMDB.reset_index()
maas_IMDB.drop('index',1)
In [ ]:
maas_IMDB.shape
First we do some natural language processing; this will take the form of a few steps.
In [ ]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')
from sklearn.feature_extraction import text
stopwords=text.ENGLISH_STOP_WORDS
In [ ]:
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")
We define a function get_parts which takes an input review and returns a tuple of nouns and adjectives. Each member of the tuple is a list of lists. In the next steps we will use the nouns to find the topic (using LDA) and the adjectives to do sentiment analysis (via Naive Bayes).
In [ ]:
def get_parts(thetext):
thetext=re.sub(regex1, ' ', thetext)
thetext=re.sub(regex2, ' ', thetext)
nouns=[]
descriptives=[]
for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
nouns.append([])
descriptives.append([])
for token in sentence:
#print token
if len(token[4]) >0:
if token[1] in ['JJ', 'JJR', 'JJS']:
if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
continue
descriptives[i].append(token[4])
elif token[1] in ['NN', 'NNS']:
if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
continue
nouns[i].append(token[4])
out=zip(nouns, descriptives)
nouns2=[]
descriptives2=[]
for n,d in out:
if len(n)!=0 and len(d)!=0:
nouns2.append(n)
descriptives2.append(d)
return nouns2, descriptives2
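A quick illustration on a made-up review; the exact lemmas depend on pattern's tagger, so the commented output is only indicative.
In [ ]:
nouns, adjectives = get_parts("The actor was superb. The plot felt dull and predictable.")
print nouns        # something like [[u'actor'], [u'plot']]
print adjectives   # something like [[u'superb'], [u'dull', u'predictable']]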
In [ ]:
# %%time
# # review_parts = maas_IMDB.map(lambda x: get_parts(x.text))
# review_parts = []
# for index, row in maas_IMDB.iterrows():
# review_parts.append(get_parts(row.text))
# if random() < 0.0001:
# print str(int(index)*100./537550) + "%"
Save it so we do not need to run this again:
In [14]:
# fp = open("review_parts.json","w")
# json.dump(review_parts, fp)
# fp.close()
with open("review_parts.json","r") as fp:
review_parts = json.load(fp)
We will now look specifically at the nouns in each sentence to see if that sentence is talking about actor quality, directing, scenery, music etc. The nouns are the first elements of all the tuples we created, so we take the first element in each tuple, remembering that this will be a list of lists of lists.
In [21]:
ldadata = [ele[0] for ele in review_parts]
First we want to create a vocabulary of all the nouns that are used in our database. To assist us we define a function flatten_unique which flattens any list of lists into one list of unique values. The function can be called twice in order to flatten a list of lists of lists into one single list.
In [5]:
import itertools as IT
import collections
def flatten_unique(outer_list):
flattened_list = []
for inner_list in outer_list: # into a review
for element in inner_list:
if element not in flattened_list:
flattened_list.append(element)
return flattened_list
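For example, applying it twice turns a list of lists of lists into a single flat list of unique words:
In [ ]:
print flatten_unique(flatten_unique([[['plot', 'actor'], ['actor']], [['scene']]]))
# -> ['plot', 'actor', 'scene']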
Next we define a function save_vocab which calls flatten_unique on the data, but partitions the data into n buckets so that we can track our progress.
In [365]:
# def save_vocab(data, n):
# ln = len(data)
# count = 0
# vocab = {}
# while count < n:
# lower = int(count*ln/n)
# upper = int((count+1)*ln/n - 1)
# if upper < ln-1:
# data_segment = data[lower:upper]
# data_raw = flatten_unique(flatten_unique(data_segment))
# vocab[count] = data_raw
# else:
# data_segment = data[lower: ln]
# data_raw = flatten_unique(flatten_unique(data_segment))
# vocab[count] = data_raw
# print str(int(count+1)*100./n) + "%"
# count += 1
# return vocab
In [366]:
# %%time
# vocab_unique = save_vocab(ldadata, 20)
We actually decided to run the body of save_vocab outside of a function so that we could save results into the dictionary vocab_unique as the computer ran; that way, if the kernel died we could continue where it left off rather than having to start again.
In [ ]:
%%time
n = 20
ln = len(ldadata)
count = 0
vocab_unique = {}
while count < n:
lower = int(count*ln/n)
upper = int((count+1)*ln/n - 1)
if upper < ln-1:
data_segment = ldadata[lower:upper+1]  # include the upper index so no reviews fall between buckets
data_raw = flatten_unique(flatten_unique(data_segment))
vocab_unique[count] = data_raw
else:
data_segment = ldadata[lower: ln]
data_raw = flatten_unique(flatten_unique(data_segment))
vocab_unique[count] = data_raw
print str(int(count+1)*100./n) + "%", count
count += 1
Saving it for use later on
In [23]:
# fp = open("vocab_unique.json","w")
# json.dump(vocab_unique, fp)
# fp.close()
with open("vocab_unique.json","r") as fp:
vocab_unique_dict = json.load(fp)
We want to take the lists out of the dictionary and save it into a list. While the elements within each list are unique, because they ran independently of each other, the aggregate list vocab_raw
may not be unique because a word may have appeared in the first list and second list. Therefore we also need to remove any duplicates.
In [ ]:
vocab_raw = []
for i in range(len(vocab_unique_dict)):  # one entry per bucket saved above
vocab_raw = vocab_raw + vocab_unique_dict[str(i)]
In [ ]:
# len(vocab_raw)
In [ ]:
# %%time
# vocab_unique = []
# counter = 0
# for i in vocab_raw:
# if i not in vocab_unique:
# vocab_unique.append(i)
# if random() < 0.00001:
# print str(int(counter)*100./1807254) + "%"
# counter += 1
In [ ]:
# fp = open("vocab_unique_2.json","w")
# json.dump(vocab_unique, fp)
# fp.close()
with open("vocab_unique_2.json","r") as fp:
vocab_unique = json.load(fp)
In [ ]:
len(vocab_unique)
We apply the flatten_unique function to ldadata and then comb through the output, vocab_raw, to delete any duplicates and produce vocab_unique.
We make a dictionary vocab which has as its keys the word, and the value an index. Then we also make a complementary dictionary id2word which has as its keys the index, and the value as the word.
In [ ]:
vocab = {}
for i in range(len(vocab_unique)):
vocab[vocab_unique[i]] = i
id2word = {}
for i in range(len(vocab_unique)):
id2word[i] = vocab_unique[i]
In [ ]:
# len(id2word), len(vocab)
The next thing is to create the bag-of-words corpus. We learnt from the PSET that we should run LDA at the granularity of sentences, because we need data with very clear and strong cluster membership so that we can clearly delineate the clusters (topics). Thus we consider things at the level of sentences, where some sentences will very clearly talk about only one topic. (Whole reviews are too mixed and may not provide enough samples that exhibit a strong preference for one membership.)
We write a helper function auxillary_function that can be called on each sentence to create a bag of words, which is a list of (word id, count) tuples.
In [ ]:
from collections import defaultdict
def auxillary_function(ldadata, vocab):
d = defaultdict(int)
for i in ldadata:
index = vocab[i]
d[index] += 1
return d.items()
In [ ]:
tester_sentence = ldadata[0][0]   # e.g. the nouns from the first sentence of the first review
BOW_test = auxillary_function(tester_sentence, vocab)
In order to call auxillary_function on sentences, we need to flatten ldadata once (recall that it is currently a list of lists of lists), so that it is just a list of lists in which each element represents a sentence. Once we flatten ldadata, we can call auxillary_function to make corpus.
In [ ]:
%%time
all_sentences = []
for review in ldadata: # into a review
for sentence in review:
this_sentence = []
for word in sentence:
if word not in this_sentence:
this_sentence.append(word)
all_sentences.append(this_sentence)
In [ ]:
%%time
corpus = []
counter = 0
for sentence in all_sentences:
corpus.append(auxillary_function(sentence, vocab))
if random() < 0.0001:
print str(int(counter)*100./537550) + "%"
counter += 1
In [ ]:
# fp = open("corpus.json","w")
# json.dump(corpus, fp)
# fp.close()
with open("corpus.json","r") as fp:
corpus = json.load(fp)
In [ ]:
import gensim
lda2 = gensim.models.ldamodel.LdaModel(corpus, num_topics=2,id2word = id2word,update_every=1,chunksize=20000,passes=1)
In [ ]:
lda2.print_topics()
In [ ]:
for bow in corpus[0:900:15]:
print bow
print lda2.get_document_topics(bow)
print " ".join([id2word[e[0]] for e in bow])
print "=========================================="
It looks like topic 2 is about the external factors of a movie -- who the actors are and how production went, and topic 1 is about the intrinsic factors of a movie -- the character development and plot. Examples:
* words like "movie" "actor" "film" or "budget" "film" have higher topic 2 ratings.
* words like "relationship" "scenery" "personality" have a higher topic 1 rating
Now we turn to running the sentiment analysis on the adjectives. First we need to get all the adjectives from review_parts.
In [ ]:
nbdata = [ele[1] for ele in review_parts]
Similar to our treatment of the nouns, we create a dictionary adjvocab with the adjectives as keys and an index as values.
In [367]:
# %%time
# adj_unique = save_vocab(nbdata, 10000)
In [ ]:
# fp = open("adj_unique.json","w")
# json.dump(adj_unique, fp)
# fp.close()
with open("adj_unique.json","r") as fp:
adj_unique_dict = json.load(fp)
In [ ]:
adj_aggregate = []
for i in range(len(adj_unique_dict)):
adj_aggregate = adj_aggregate + adj_unique_dict[str(i)]
In [ ]:
%%time
adj_unique = []
for word in adj_aggregate:
if word not in adj_unique:
adj_unique.append(word)
In [ ]:
len(adj_unique)
In [ ]:
# fp = open("adj_unique_2.json","w")
# json.dump(adj_unique, fp)
# fp.close()
with open("adj_unique_2.json","r") as fp:
adj_unique = json.load(fp)
In [ ]:
adjvocab = {}
for i in range(len(adj_unique)):
adjvocab[adj_unique[i]] = i
In [ ]:
len(adjvocab)
We want to flatten the list of adjectives within a review into a single list of all adjectives for the whole review, over all the sentences.
In [ ]:
import itertools
Xarray = []
for i in nbdata: # into a review
Xarraypre = " ".join(list(itertools.chain.from_iterable(i)))
Xarray.append(Xarraypre)
In [ ]:
# fp = open("Xarray.json","w")
# json.dump(Xarray, fp)
# fp.close()
with open("Xarray.json","r") as fp:
Xarray = json.load(fp)
Unlike in HW5, where we used a binary response column based on the star ratings in the dataset, here we used an external sentiment dictionary (labMT) to calculate a valence score for each review, and we use that in lieu of the response array. (Details on how we calculated the valence score are above.)
In [ ]:
# making the response array based off of the sentiment dictionary 'happiness' there is valence_avg and valence_sum
resparray = []
for index, row in maas_IMDB.iterrows():
if row.valence_avg > 0: # need to make it binomial because that's how the calibration/cv score works
resparray.append(1)
else:
resparray.append(0)
In [ ]:
# fp = open("resparray.json","w")
# json.dump(resparray, fp)
# fp.close()
with open("resparray.json","r") as fp:
resparray = json.load(fp)
In [ ]:
print len(Xarray), len(resparray)
print Xarray[0]
Now we create a mask to split things into a test and train set.
In [ ]:
from sklearn.cross_validation import train_test_split
itrain, itest = train_test_split(xrange(len(Xarray)), train_size=0.7)
mask=np.ones(len(Xarray), dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)
We ensure that X and y are numpy arrays for sklearn.
In [ ]:
X=np.array(Xarray)
y=np.array(resparray)
We define a function make_xy which takes an X_col of word-based documents and uses the vectorizer to transform them into a feature matrix whose features are the vocabulary. Essentially the vectorizer turns the space-separated adjectives we made above into a bag-of-words representation.
In [ ]:
def make_xy(X_col, y_col, vectorizer):
X = vectorizer.fit_transform(X_col)
y = y_col
return X, y
We write a log_likelihood function which computes the log likelihood of a dataset according to a Naive Bayes classifier.
In [ ]:
"""
Function
--------
log_likelihood
Compute the log likelihood of a dataset according to
a Naive Bayes classifier.
The Log Likelihood is defined by
L = Sum_positive(logP(positive)) + Sum_negative(logP(negative))
Where Sum_positive indicates a sum over all positive reviews,
and Sum_negative indicates a sum over negative reviews
Parameters
----------
clf : Naive Bayes classifier
x : (nexample, nfeature) array
The input data
y : (nexample) integer array
Whether each review is positive
"""
def log_likelihood(clf, x, y):
prob = clf.predict_log_proba(x)
neg = y == 0
pos = ~neg
return prob[neg, 0].sum() + prob[pos, 1].sum()
We define a function cv_score to estimate the cross-validated value of a scoring function, given a classifier and data.
In [ ]:
from sklearn.cross_validation import KFold
def cv_score(clf, x, y, score_func, nfold=5):
"""
Uses 5-fold cross validation to estimate a score of a classifier
Inputs
------
clf : Classifier object
x : Input feature vector
y : Input class labels
score_func : Function like log_likelihood, that takes (clf, x, y) as input,
and returns a score
Returns
-------
The average score obtained by splitting (x, y) into 5 folds of training and
test sets, fitting on the training set, and evaluating score_func on the test set
Examples
cv_score(clf, x, y, log_likelihood)
"""
result = 0
for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
clf.fit(x[train], y[train]) # fit
result += score_func(clf, x[test], y[test]) # evaluate score function on held-out data
return result / nfold # average
We define a function calibration_plot which makes a calibration plot.
In [ ]:
def calibration_plot(clf, xtest, ytest):
prob = clf.predict_proba(xtest)[:, 1]
outcome = ytest
data = pd.DataFrame(dict(prob=prob, outcome=outcome))
#group outcomes into bins of similar probability
bins = np.linspace(0, 1, 20)
cuts = pd.cut(prob, bins)
binwidth = bins[1] - bins[0]
#freshness ratio and number of examples in each bin
cal = data.groupby(cuts).outcome.agg(['mean', 'count'])
cal['pmid'] = (bins[:-1] + bins[1:]) / 2
cal['sig'] = np.sqrt(cal.pmid * (1 - cal.pmid) / cal['count'])
#the calibration plot
ax = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
p = plt.errorbar(cal.pmid, cal['mean'], cal['sig'])
plt.plot(cal.pmid, cal.pmid, linestyle='--', lw=1, color='k')
plt.ylabel("Empirical P(+)")
#the distribution of P(+)
ax = plt.subplot2grid((3, 1), (2, 0), sharex=ax)
plt.bar(left=cal.pmid - binwidth / 2, height=cal['count'],
width=.95 * (bins[1] - bins[0]),
fc=p[0].get_color())
plt.xlabel("Predicted P(+)")
plt.ylabel("Number")
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
We run the cross-validation loop over the alphas for MultinomialNB. We do not need to search over min_df for CountVectorizer: because we feed a fixed vocabulary into CountVectorizer, min_df has no effect on the features at all.
In [ ]:
%%time
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
maxscore = -np.inf
for alpha in alphas:
vectorizer = CountVectorizer(vocabulary =adjvocab)
Xthis, Ythis = make_xy(X[mask], y[mask], vectorizer)
clf = MultinomialNB(alpha=alpha)
cvscore = cv_score(clf, Xthis, Ythis, log_likelihood)
if cvscore > maxscore:
maxscore = cvscore
best_alpha = alpha
Now that we've determined the value of alpha that optimizes the cross-validated log-likelihood, we repeat the CountVectorization and fitting to train a final classifier with this parameter, and draw a calibration plot on the test set.
In [ ]:
vectorizer = CountVectorizer(vocabulary=adjvocab)  # min_df is irrelevant here because we pass a fixed vocabulary
Xnew, Ynew = make_xy(X,y, vectorizer)
xtrain, xtest, ytrain, ytest = train_test_split(Xnew, Ynew)
clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data: %0.2f" % (test_accuracy)
calibration_plot(clf, xtest, ytest)
Unfortunately this calibration plot looks terrible, reflecting the fact that the priors are heavily skewed towards positive reviews.
Because our naive Bayes classifier assumes conditional independence, even though we trained the sentiment analysis on whole reviews, it is fairly straightforward to apply it to each sentence instead.
In order to do this we calculate 2 dictionaries, logpositives and lognegatives. These dictionaries have the words as the keys and the log probabilities as the values.
In [ ]:
vectorizer_alladj = CountVectorizer(vocabulary=adjvocab)  # again, min_df is irrelevant with a fixed vocabulary
X_alladj, Y_alladj = make_xy(X[mask],y[mask], vectorizer_alladj)
clf_alladj = MultinomialNB(alpha=best_alpha).fit(X_alladj, Y_alladj)
logpositives = dict(zip(vectorizer_alladj.get_feature_names(),clf_alladj.feature_log_prob_[1]))  # class 1 = positive
lognegatives = dict(zip(vectorizer_alladj.get_feature_names(),clf_alladj.feature_log_prob_[0]))  # class 0 = negative
In [ ]:
# fp = open("logpositives.json","w")
# json.dump(logpositives, fp)
# fp.close()
with open("logpositives.json","r") as fp:
logpositives = json.load(fp)
In [ ]:
# fp = open("lognegatives.json","w")
# json.dump(lognegatives, fp)
# fp.close()
with open("lognegatives.json","r") as fp:
lognegatives = json.load(fp)
In the two dictionaries, we have, for each individual word, the log probability of that word given a positive or a negative review. Now we write a function calc_pplus which calculates the probability that a sentence is positive. Please refer to HW5 for more information on the maths behind this.
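Concretely, for a sentence $s$ with adjectives $a_1, \dots, a_k$, calc_pplus computes the usual naive Bayes posterior (adjectives not in the training vocabulary are simply skipped):
$$P(+ \mid s) = \frac{P(+)\prod_i P(a_i \mid +)}{P(+)\prod_i P(a_i \mid +) + P(-)\prod_i P(a_i \mid -)}$$
where the per-word log probabilities come from logpositives and lognegatives, and the priors $P(+)$ and $P(-)$ are estimated just below.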
In [ ]:
def calc_pplus(adjlist, lp, ln, pp,pn):
sgivenpos = 0
sgivenneg = 0
for adj in adjlist:
if lp.get(adj) != None:
sgivenpos += lp[adj]
sgivenneg += ln[adj]
return(np.exp(sgivenpos)*pp)/(np.exp(sgivenpos)*pp + np.exp(sgivenneg)*pn)
In [ ]:
reviews=[]
for index, row in maas_IMDB.iterrows():
reviews.append(row.movie_id)
In [ ]:
# fp = open("reviews.json","w")
# json.dump(reviews, fp)
# fp.close()
with open("reviews.json","r") as fp:
reviews = json.load(fp)
In [ ]:
len(review_parts),len(reviews)
We write a function choose_topic which chooses which of the two LDA topics a sentence belongs to, given the bag of words for the sentence.
In [ ]:
def choose_topic(ldamodel, bow):
tee = lda2.get_document_topics(bow)
if len(tee)==2:
t1,t2=tee
if t2[1] >= t1[1]:#get higher probability topic
topicis=t2[0]
else:
topicis=t1[0]
elif len(tee)==1:#if only one was provided its very high probability. Take it
teetuple=tee[0]
topicis=teetuple[0]
return topicis
Before we go any further, it's important to look at the prior probability that a review is positive or negative.
In [ ]:
priorp = np.mean(resparray)
priorn = 1 - priorp
priorp, priorn
It's clear that there are a lot more positive reviews than there are negative reviews.
Now we combine the two functions we just wrote to populate a dictionary reviewdict, keyed by review id, whose value for each review is a list of dictionaries, one per sentence within that review.
In [ ]:
counter=0
reviewdict={}
for i, rid in enumerate(reviews):
rlist=[]
nlist, alist = review_parts[i]
ln=len(alist)
localbow=corpus[counter:counter+ln]
for bow, adj, noun in zip(localbow, alist, nlist):
# doc=" ".join([id2word[e[0]] for e in bow])
pplus=calc_pplus(adj, logpositives, lognegatives, priorp, priorn)
topicis=choose_topic(lda2, bow)
ldict={"topic": topicis, 'pplus':pplus}
rlist.append(ldict)
reviewdict[str(int(rid))]=rlist
counter=counter+ln
print i, rid, ln, counter
In [ ]:
# fp = open("reviewdict.json","w")
# json.dump(reviewdict, fp)
# fp.close()
with open("reviewdict.json","r") as fp:
reviewdict = json.load(fp)
In [ ]:
%%time
list_of_dicts=[]
counter = 0
sentence_counter = 0
for index, row in maas_IMDB.iterrows():
counter += 1
revs=reviewdict[str(int(index))]
# print index, len(revs)
sentence_counter += len(revs)
for r in revs:
r2=r.copy()
r2['movie_id']=row.title
r2['']=row.title
r2['review_id']=index
r2['stars']=row.stars
r2['valence_avg']=row.valence_avg
r2['valence_sum']=row.valence_sum
list_of_dicts.append(r2)
if random() < 0.0001:
print str(int(index)*100./537550) + "%"
Finally we are ready to save this as a pandas dataframe.
In [ ]:
completedf=pd.DataFrame(list_of_dicts)
completedf.head()
In [ ]:
completedf[completedf.stars == "N/A"].shape
In [ ]:
# fp = open("completedf_intermediate.json","w")
# json.dump(completedf.to_dict(), fp)
# fp.close()
with open("completedf_intermediate.json","r") as fp:
completedf = json.load(fp)
completedf = pd.DataFrame(completedf)
Note that this dataframe has one row for every sentence, and we want to group by review and topic. Therefore we define the function get_stats which takes the minimum, maximum, mean, count and variance of the probability that the review is positive, while preserving the valence numbers.
In [ ]:
def get_stats(group):
min_pplus = group.pplus.min()
max_pplus = group.pplus.max()
rid = group.movie_id.unique()
valence_avg = group.valence_avg.mean()
valence_sum = group.valence_sum.mean()
count = group.topic.count()
if count == 1:
varj = 0
else:
varj = group.pplus.var(ddof=1)
mean_pplus = group.pplus.mean()
stars = group.stars.mean()
return pd.DataFrame({'min': min_pplus, 'max':max_pplus ,'valence_avg': valence_avg,'valence_sum':valence_sum,
'count': count,'var': varj, 'mean': mean_pplus, 'stars': stars}, index=rid)
We apply our new function to the groupby object of completedf.
In [ ]:
%%time
dftouse=completedf.groupby(['review_id', 'topic']).apply(get_stats).reset_index()
In [ ]:
print dftouse.shape
dftouse = dftouse.rename(columns={'level_2': 'movie_id'})
dftouse.head()
Finally we save our dataset so we don't have to run this anymore!!!!!!!
In [54]:
fp = open("complete_byreviews.json","w")
json.dump(dftouse.to_dict(), fp)
fp.close()
with open("complete_byreviews.json","r") as fp:
complete_byreviews = json.load(fp)
complete_byreviews = pd.DataFrame(complete_byreviews)
In [74]:
complete_byreviews = complete_byreviews.drop(['valence_avg','valence_sum','var'], 1)
complete_byreviews.head()
Out[74]:
In [181]:
with open("dftouse.json", "r") as fp:
imdb_dict = json.load(fp)
imdb_inflat_adj = pd.DataFrame(imdb_dict)
In [ ]:
imdb_inflat_adj = complete_byreviews.drop(['adj_gross_bin','adj_opening_gross_bin','var'], 1)
In [369]:
reviews_df = pd.concat([imdb_inflat_adj, complete_byreviews], axis=1, join_axes=[complete_byreviews.index])
In [58]:
def isNA(input):
if input=='N/A':
return True
else:
return False
In [368]:
reviews_df = reviews_df[~reviews_df['adj_opening_gross'].map(np.isnan)].fillna(0)
reviews_df = reviews_df[~reviews_df.stars.map(isNA)]
reviews_df['stars'] = reviews_df['stars'].astype(int)
In [172]:
fp = open("reviews_df_jason.json","w")
json.dump(reviews_df.to_dict(), fp)
fp.close()
In [263]:
groupby_topic_mean = reviews_df.groupby(['title','topic'])
topic_vals_mean = groupby_topic_mean['max','mean','min'].mean()
# topic_vals_var = groupby_topic_mean['max','mean','min'].var()
# topic_vals = pd.concat([topic_vals_mean,topic_vals_var], axis=1, join_axes=[topic_vals_var.index])
# topic_vals_var.rename(columns={'max':'max_var','mean':'mean_var','min':'min_var','count':'count_var'})
topic01 = topic_vals_mean.reset_index().set_index('title')
topic1 = topic01[topic01.topic==1].rename(columns={'max':'max_t1','mean':'mean_t1','min':'min_t1','count':'count_t1'})
topic0 = topic01[topic01.topic==0].rename(columns={'max':'max_t0','mean':'mean_t0','min':'min_t0','count':'count_t0'})
In [264]:
bymovie = reviews_df.groupby("title")
adj_gross = bymovie.adj_gross.mean()
adj_gross_bin = bymovie.adj_gross_bin.max()
adj_opening_gross = bymovie.adj_opening_gross.mean()
adj_opening_gross_bin = bymovie.adj_opening_gross_bin.max()
n_scorables = bymovie.n_scorables.mean()
open_theaters = bymovie.opening_theaters.mean()
year = bymovie.open_year.max()
pct_scorables = bymovie.pct_scorables.mean()
star_avg = bymovie.stars.mean()
theaters = bymovie.theaters.mean()
review_count = bymovie.title.count()
valence_avg_var = bymovie.valence_avg.var()
valence_sum_var = bymovie.valence_sum.var()
# max_bow_var = bymovie['max'].var()
# min_bow_var = bymovie['min'].var()
# mean_bow_var = bymovie['mean'].var()
valence_avg = bymovie.valence_avg.mean()
valence_sum = bymovie.valence_sum.mean()
# max_bow = bymovie['max'].mean()
# min_bow = bymovie['min'].mean()
# mean_bow = bymovie['mean'].mean()
In [265]:
dftouse = pd.concat([adj_gross,adj_gross_bin,adj_opening_gross,adj_opening_gross_bin,n_scorables,open_theaters,year
,pct_scorables,star_avg,theaters,review_count,valence_avg_var,valence_sum_var,]
, axis=1, join_axes=[adj_gross.index])
dftouse = dftouse.rename(columns={'valence_avg':'valence_avg_var','valence_sum':'valence_sum_var',})
dftouse = pd.concat([dftouse,valence_avg,valence_sum], axis=1, join_axes=[dftouse.index])
dftouse = pd.concat([dftouse,topic1,topic0],axis=1, join_axes=[dftouse.index])
dftouse['abs_valence_avg'] = np.abs(dftouse['valence_avg']-np.mean(dftouse['valence_avg']))
blockbuster_threshold = 100000000
dftouse['adj_opening_gross_bin'] = 1*(dftouse.adj_gross>blockbuster_threshold)
# we only keep movies with more than 20 reviews
dftouse = dftouse[dftouse.title>20]
for i in dftouse:
dftouse[i] = dftouse[i].fillna(0)
In [267]:
fp = open("dftouse_final.json","w")
json.dump(dftouse.to_dict(), fp)
fp.close()
First by review:
In [370]:
graph_val_avg = plt.hist(dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==1],bins=40,alpha=0.9,label="Blockbuster reviews",color="red")
graph_val_avg = plt.hist(dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==0],bins=40,alpha=0.5,label="Non-Blockbuster reviews")
graph_val_avg = plt.axvline(x=dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==1].mean(), color='red',alpha=0.9,label="Blockbuster mean")
graph_val_avg = plt.axvline(x=dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==0].mean(), color='blue',alpha=0.5,label="Non-Blockbuster mean")
graph_val_avg = plt.title("Percentage of scorable words for Blockbuster and non-Blockbuster movies")
graph_val_avg = plt.xlabel("Percentage of scorable words")
graph_val_avg = plt.ylabel("Frequency")
graph_val_avg = plt.legend(loc='upper left')
In [363]:
opening_gross = reviews_df.groupby("movie_id").adj_opening_gross.mean()
stars = reviews_df.groupby("movie_id").stars.mean()
In [364]:
plt.hist(opening_gross,bins=10000,alpha=0.5,label="Opening gross")
plt.axvline(x=opening_gross.mean(), label = "Mean", color = "r")
plt.ylim(0, 50)
plt.title("Distribution of movie's opening gross (adjusted for inflation) 1990 - 2015")
plt.xlabel("Opening gross (adjusted for inflation)")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[364]:
Now by movie
In [371]:
year_movies = reviews_df.groupby("movie_id").year.mean()
In [372]:
plt.hist(year_movies, bins=26,alpha=0.5,label="Movies")
# plt.axhline(y=avg_number, label = "Mean", color = "b")
plt.title("Number of movies released from 1990 - 2015")
plt.xlabel("Year")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[372]:
The number of movies released increased over time, peaking in 2005.
In [373]:
opening_gross_2000s = reviews_df[reviews_df.year >=2000].groupby("movie_id").adj_opening_gross.mean()
opening_gross_1990s = reviews_df[reviews_df.year < 2000].groupby("movie_id").adj_opening_gross.mean()
In [374]:
plt.hist(opening_gross_2000s,bins=1000, alpha=0.5, color = "r", label="Opening gross for movies post 2000s")
plt.hist(opening_gross_1990s,bins=1000,alpha=0.5, color = "k", label="Opening gross for movies pre 2000s")
# plt.axvline(x=opening_gross.mean(), label = "Mean")
plt.axvline(x=31217400.17, label = "Opening gross for Jaws (1975)")
plt.ylim(0, 100)
plt.title("Distribution of movie's opening gross (adjusted for inflation) 1990 - 2015")
plt.xlabel("Opening gross (adjusted for inflation)")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[374]:
In [376]:
opening_gross_00s = reviews_df[reviews_df.year >=2000][reviews_df.year <2010].groupby("movie_id").adj_opening_gross.mean()
opening_gross_10s = reviews_df[reviews_df.year >=2010].groupby("movie_id").adj_opening_gross.mean()
In [377]:
plt.hist(opening_gross_10s,bins=1000, alpha=0.5, color = "g", label="Opening gross for movies post 2010s")
plt.hist(opening_gross_00s,bins=1000, alpha=0.5, color = "r", label="Opening gross for movies post 2000 - 2010")
plt.hist(opening_gross_1990s,bins=1000,alpha=0.5, color = "k", label="Opening gross for movies pre 2000s")
plt.axvline(x=31217400.17, label = "Opening gross for Jaws (1975)")
plt.ylim(0, 30)
plt.xlim(31217400.17,)
plt.title("Distribution of movie's opening gross (adjusted for inflation) 1990 - 2015")
plt.xlabel("Opening gross (adjusted for inflation)")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[377]:
Even after adjusting for inflation, we learned that there have been more blockbusters in recent years (after 2000) than from 1990-2000. We defined a blockbuster as a movie that made more money, after inflation, than Jaws (1975).
In [378]:
review_count = reviews_df.groupby("movie_id").review_id.count()
In [379]:
plt.hist(review_count, bins = 50, alpha = 0.5, color = "b", label="Number of reviews")
plt.axhline(y=review_count.mean(), color = "r", label = "Average number of reviews")
plt.xlim(0,100)
plt.title("Movie reviews")
plt.xlabel("Number of reviews")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[379]:
In [380]:
plt.scatter(reviews_df.n_scorables, reviews_df.valence_avg)
plt.title("Valence average vs. number of words within the review that were scored")
plt.xlabel("Number of words scorable")
plt.ylabel("Valence average for review")
Out[380]:
It looks like the more scorable words a review has, the more likely the review is to score as neutral. This makes sense, and we'd hope that for reviews with the same valence score our sentiment analysis will produce more differentiated probabilities -- suggesting that there is more nuance.
In [381]:
plt.scatter(reviews_df["mean"], reviews_df["valence_avg"], color = "r")
plt.title("Sentiment analysis gives more nuance than a valence average.")
plt.xlabel("Probability of positive from sentiment analysis")
plt.ylabel("Valence_Avg from simple sentiment dictionary")
Out[381]:
This confirms that our sentiment analysis indicators are more sensitive, which we hope will make them better regressors!
In [383]:
plt.scatter(opening_gross, stars)
plt.axvline(x=31217400.17)
Out[383]:
The blue vertical line marks the threshold (Jaws's inflation-adjusted opening gross) separating Blockbusters from non-Blockbusters.
In [395]:
year_theaters = dftouse.groupby('open_year')
mean_year_theaters = year_theaters['opening_theaters'].mean()
std_year_theaters=year_theaters['opening_theaters'].std()
decadesplot = plt.plot(dftouse.open_year,dftouse.opening_theaters,'.')
decadesplot= plt.xlabel("Year")
decadesplot= plt.ylabel("Number of opening theaters")
decadesplot= plt.title("Number of opening theaters for movies over time")
In [403]:
graph_val_avg = plt.hist(dftouse.valence_avg[dftouse.adj_opening_gross_bin ==1],bins=40,alpha=0.9,label="Blockbuster reviews",color="red")
graph_val_avg = plt.hist(dftouse.valence_avg[dftouse.adj_opening_gross_bin ==0],bins=40,alpha=0.5,label="Non-Blockbuster reviews")
graph_val_avg = plt.axvline(x=dftouse.valence_avg[dftouse.adj_opening_gross_bin ==1].mean(), color='red',alpha=0.9,label="Blockbuster mean")
graph_val_avg = plt.axvline(x=dftouse.valence_avg[dftouse.adj_opening_gross_bin ==0].mean(), color='blue',alpha=0.5,label="Non-Blockbuster mean")
graph_val_avg = plt.title("Valence average for Blockbuster and non-Blockbuster")
graph_val_avg = plt.xlabel("Average of valence")
graph_val_avg = plt.ylabel("Frequency")
graph_val_avg = plt.legend(loc='upper left')
The above shows that average valence as calculated by LabMT is similar across Blockbuster and non-Blockbuster movies.
Looking at the regression results above, we can see our valence data is not strongly correlated with gross box office numbers. However, this does not imply that there is not a relationship between movie reviews and box office success. It is certainly possible that there are different underlying structures for various groups of movies in the world.
A neat way to discover these structures is to implement a clustering algorithm. Using sklearn.mixture.GMM, we can determine if there are underlying Gaussian distributions in the population that are creating the patches of data we see in our sample. Thus, we can determine which subsets of our data we should be running our regressions on.
In [317]:
Xall=dftouse[['valence_avg', 'adj_opening_gross']].values
from sklearn.mixture import GMM
n_clusters=2
clfgmm = GMM(n_components=n_clusters, covariance_type="tied")
clfgmm.fit(Xall)
print clfgmm
gmm_means=clfgmm.means_
gmm_covar=clfgmm.covars_
print gmm_means, gmm_covar
In [318]:
from scipy import linalg
def plot_ellipse(splot, mean, cov, color):
v, w = linalg.eigh(cov)
u = w[0] / linalg.norm(w[0])
angle = np.arctan(u[1] / u[0])
angle = 180 * angle / np.pi # convert to degrees
# filled Gaussian at 2 standard deviation
ell = mpl.patches.Ellipse(mean, 2 * v[0] ** 0.5, 2 * v[1] ** 0.5,
180 + angle, color=color, lw=3, fill=False)
ell.set_clip_box(splot.bbox)
ell1 = mpl.patches.Ellipse(mean, 1 * v[0] ** 0.5, 1 * v[1] ** 0.5,
180 + angle, color=color, lw=3, fill=False)
ell1.set_clip_box(splot.bbox)
ell3 = mpl.patches.Ellipse(mean, 3 * v[0] ** 0.5, 3 * v[1] ** 0.5,
180 + angle, color=color, lw=3, fill=False)
ell3.set_clip_box(splot.bbox)
#ell.set_alpha(0.2)
splot.add_artist(ell)
splot.add_artist(ell1)
splot.add_artist(ell3)
In [334]:
plt.figure()
ax=plt.gca()
plot_ellipse(ax, gmm_means[0], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[1], gmm_covar, 'k')
gmm_labels=clfgmm.predict(Xall)
for k, col in zip(range(n_clusters), ['blue','red']):
my_members = gmm_labels == k
ax.plot(Xall[my_members, 0], Xall[my_members, 1], 'w',
markerfacecolor=col, marker='.', alpha=0.5)
plt.xlabel('Average Valence Score')
plt.ylabel('Opening Weekend Box Office ($)')
plt.title('Discovering Hidden Distributions')
Out[334]:
This gives us a decent approximation of two distinct clusters in our data. Basically, the movies that make more than 700 million dollars in total adjusted gross belong to the red cluster, and the rest (the significant majority of the movies) belong to the blue cluster.
However, the shapes of the clusters still don't look very intuitive, and it looks like the current clustering is just trying to average out extreme values. Let's see whether we get a better result with 3 clusters:
In [333]:
n_clusters=3
clfgmm3 = GMM(n_components=n_clusters, covariance_type="tied")
clfgmm3.fit(Xall)
print clfgmm3
gmm_means=clfgmm3.means_
gmm_covar=clfgmm3.covars_
print gmm_means, gmm_covar
plt.figure()
ax=plt.gca()
plot_ellipse(ax, gmm_means[0], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[1], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[2], gmm_covar, 'k')
gmm_labels=clfgmm3.predict(Xall)
for k, col in zip(range(n_clusters), ['blue','red', 'green']):
    my_members = gmm_labels == k
    ax.plot(Xall[my_members, 0], Xall[my_members, 1], 'w',
            markerfacecolor=col, marker='.', alpha=0.5)
plt.xlabel('Average Valence Score')
plt.ylabel('Opening Weekend Box Office ($)')
plt.title('Discovering Hidden Distributions')
Out[333]:
Brilliant! Our data clearly comes from 3 distinct clusters of distributions: movies that make more than $150 million in their opening weekends ("the blockbusters"), those that make between $50 million and $100 million ("the successful movies"), and those that make less than $50 million ("the ordinary movies"). We will keep this in mind for our future analyses.
Now, let's try 4 clusters:
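To back up the claim that the lowest cluster holds the large majority of the movies, a quick count of the cluster assignments (a sketch we did not include in the original run) looks like this:
# Count how many movies the 3-component GMM assigns to each cluster
cluster_sizes = np.bincount(clfgmm3.predict(Xall))
for k, size in enumerate(cluster_sizes):
    print "cluster %d: %d movies" % (k, size)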
In [331]:
n_clusters=4
clfgmm4 = GMM(n_components=n_clusters, covariance_type="tied")
clfgmm4.fit(Xall)
print clfgmm4
gmm_means=clfgmm4.means_
gmm_covar=clfgmm4.covars_
print gmm_means, gmm_covar
plt.figure()
ax=plt.gca()
plot_ellipse(ax, gmm_means[0], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[1], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[2], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[3], gmm_covar, 'k')
gmm_labels=clfgmm4.predict(Xall)
for k, col in zip(range(n_clusters), ['blue','red', 'green', 'yellow']):
    my_members = gmm_labels == k
    ax.plot(Xall[my_members, 0], Xall[my_members, 1], 'w',
            markerfacecolor=col, marker='.', alpha=0.5)
plt.xlabel('Average Valence Score')
plt.ylabel('Opening Weekend Box Office ($)')
plt.title('Discovering Hidden Distributions')
Out[331]:
Since there is so much overlap between the lower two clusters this time, we see that clustering into 3 groups is the better choice for our data. (We also tried more clusters, but they similarly give worse results than the 3-cluster version.)
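The eyeball comparison of 2, 3 and 4 clusters can also be made more systematic with an information criterion. Below is a sketch (not part of our original analysis) that scans the Bayesian Information Criterion over candidate cluster counts; a lower BIC indicates a better fit-versus-complexity trade-off.
# Compare GMM fits by BIC across candidate numbers of components
for k in range(1, 7):
    gmm_k = GMM(n_components=k, covariance_type="tied")
    gmm_k.fit(Xall)
    print "k = %d, BIC = %0.1f" % (k, gmm_k.bic(Xall))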
Now we will begin our analysis of how well we can predict the opening gross of movies using the features we have, namely the following:
(all variables and explanation of variables)
We are going to use several regression methods, namely: 1) Linear regression 2) Support vector regression (SVR) 3) Decision tree regression 4) Random forest regression 5) AdaBoost regression 6) Gradient boosting regression
For methods 2-6, we will train on these data and optimize each regressor over a small parameter grid. We will then compare the test scores of all the regressors and find the best one.
We begin by defining a function do_regress, which takes a regressor, feeds it into the function cv_optimize for parameter tuning, and then prints the training and testing scores. The function cv_optimize optimizes the regressor over the parameter grid it is given.
In [250]:
# GridSearchCV and train_test_split are imported again further below; included here so this cell runs on its own
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split

def do_regress(clf, parameters, indf, featurenames, targetname, mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    subdf=indf[featurenames]
    X=subdf.values
    y=indf[targetname]
    if mask is not None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split is not None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data: %0.2f" % (test_accuracy)
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest

def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    # print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_
    best = gs.best_estimator_
    return best
# create training and testing sets
itrain, itest = train_test_split(xrange(dftouse.shape[0]), train_size=0.6)
mask_classify=np.ones(dftouse.shape[0], dtype='int')
mask_classify[itrain]=1
mask_classify[itest]=0
mask_classify = (mask_classify==1)
# define list of features to be trained over
Xnames = ['opening_theaters','pct_scorables','theaters','valence_avg_var','valence_sum','abs_valence_avg'
,'valence_avg','mean_t1','mean_t0','max_t1','max_t0']
In [204]:
# linear regression
from sklearn.linear_model import LinearRegression
clflinear_reg = LinearRegression()
Xlin = dftouse[Xnames].values
ylin = dftouse['adj_opening_gross']
clflinear_reg = clflinear_reg.fit(Xlin[mask_classify], ylin[mask_classify])
training_accuracy = clflinear_reg.score(Xlin[mask_classify], ylin[mask_classify])
test_accuracy = clflinear_reg.score(Xlin[~mask_classify], ylin[~mask_classify])
print "############# based on standard predict ################"
print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data: %0.2f" % (test_accuracy)
print "########################################################"
In [140]:
%%time
# SVC regression
from sklearn.svm import SVR
parameters_svr = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
clfsvr_reg = SVR(kernel="linear")
clfsvr_reg, Xtrain, ytrain, Xtest, ytest = do_regress(clfsvr_reg, parameters_svr,
dftouse, Xnames, 'adj_opening_gross', mask=mask_classify)
In [141]:
# Tree regressor
from sklearn.tree import DecisionTreeRegressor
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clftree_reg = DecisionTreeRegressor()
clftree_reg, Xtrain, ytrain, Xtest, ytest = do_regress(clftree_reg, parameters_tree,
dftouse, Xnames, 'adj_opening_gross', mask=mask_classify)
In [142]:
# Random forest regression
from sklearn.ensemble import RandomForestRegressor
parameters_forest = {"n_estimators": range(1, len(Xnames))}
clfforest_1 = RandomForestRegressor()
clfforest_1, Xtrain, ytrain, Xtest, ytest = do_regress(clfforest_1, parameters_forest,
dftouse, Xnames, 'adj_opening_gross', mask=mask_classify)
In [143]:
# AdaBoost Regression
from sklearn.ensemble import AdaBoostRegressor
clfAda_regress = AdaBoostRegressor()
parameters_adaboost = {"n_estimators": range(20, 60)}
clfAda_regress, Xtrain, ytrain, Xtest, ytest = do_regress(clfAda_regress, parameters_adaboost,
                                                          dftouse, Xnames, 'adj_opening_gross',
                                                          mask=mask_classify)
In [144]:
from sklearn.ensemble import GradientBoostingRegressor
clfGB_regress = GradientBoostingRegressor()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB_regress, Xtrain, ytrain, Xtest, ytest = do_regress(clfGB_regress, parameters_GB,
                                                         dftouse, Xnames, 'adj_opening_gross',
                                                         mask=mask_classify)
From this we can see that random forest regression is the best regressor, as it has the highest score (R²) on the test set.
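Since the random forest performs best, it is worth peeking at which features it leans on. The sketch below (not in the original run) ranks the fitted regressor's feature_importances_, assuming clfforest_1 is the tuned model returned by do_regress above:
# Rank features by importance in the fitted random forest regressor
for name, importance in sorted(zip(Xnames, clfforest_1.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print "%s: %0.3f" % (name, importance)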
We are also interested in classifying whether or not movies are blockbusters, defined as having a gross of over $100M. We would like to see to what extent our sentiment analysis can predict whether a given movie has been a blockbuster.
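We assume here that adj_opening_gross_bin is the 0/1 indicator encoding this $100M cutoff on adjusted opening gross. The column was built earlier in the notebook, so the snippet below is only a hypothetical sanity check of that assumption, not its definition.
# Hypothetical check: does a $100M cutoff on adjusted opening gross reproduce the existing label?
hypothetical_bin = (dftouse['adj_opening_gross'] > 100e6).astype(int)
matches = (hypothetical_bin == dftouse['adj_opening_gross_bin']).sum()
print "labels matching the $100M-cutoff reconstruction: %d of %d" % (matches, len(dftouse))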
We will first train classifiers on two variables: our average valence scores for topic 0 and topic 1. This is because we are interested in seeing to what extent our bag-of-words analysis has produced review valences with predictive power. We will then train classifiers on two other variables, namely the average valence from LabMT and the number of theaters showing the film, and compare. The classifiers we will be using are the following:
1) Logistic regression 2) Support Vector Machine 3) Decision tree classifier 4) AdaBoost classifier 5) Gradient boosting classifier
With these classifiers we will visualize performance through contour plots, and we will compare the classifiers to one another using ROC curves.
We will also split between training and testing sets and choose the classifier that performs best on the testing set.
Next, we will train classifiers on a larger list of variables, similar to what we did for regression. For example, this will include the other valence scores we've calculated using the LabMT dictionary, the variance of valence scores across reviews for a given movie, etc. In addition, we will try one more classifier that we were not able to use when we trained on two variables:
6) Random forest classifier
Again, we will split between training and testing sets and select the classifier that performs best on the testing set. We will visualize how well the classifiers perform using an ROC curve.
Let's define all the functions needed to do classification. We write one main function called do_classify that takes a classifier and the data to train on, and prints the classifier's accuracy on the training and test data as well as the confusion matrix. The function cv_optimize, defined above, tunes a classifier over the parameters fed into it. We also create functions to make ROC curves, contour plots and tree contours.
In [145]:
from sklearn.svm import LinearSVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression
def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    subdf=indf[featurenames]
    X=subdf.values
    y=(indf[targetname].values==target1val)*1
    print clf.__class__
    if mask is not None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split is not None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data: %0.2f" % (test_accuracy)
    print confusion_matrix(ytest, clf.predict(Xtest))
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest
In [146]:
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
def points_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.3, psize=10, zfunc=False):
    h = .02
    X=np.concatenate((Xtr, Xte))
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    #plt.figure(figsize=(10,6))
    if mesh:
        if zfunc:
            p0 = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 0]
            p1 = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
            Z=zfunc(p0, p1)
        else:
            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=alpha, axes=ax)
    ax.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cmap_bold, s=psize, alpha=alpha, edgecolor="k")
    # and testing points
    yact=clf.predict(Xte)
    ax.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cmap_bold, alpha=alpha, marker="s", s=psize+10)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    return ax,xx,yy
In [147]:
def points_plot_prob(ax, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold, ccolor=cm, psize=10, alpha=0.1, prob=True):
    ax,xx,yy = points_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=False, colorscale=colorscale, cdiscrete=cdiscrete, psize=psize, alpha=alpha)
    if prob:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    else:
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=ccolor, alpha=.2, axes=ax)
    cs2 = plt.contour(xx, yy, Z, cmap=ccolor, alpha=.6, axes=ax)
    plt.clabel(cs2, fmt = '%2.1f', colors = 'k', fontsize=14, axes=ax)
    plt.title(clf.__class__)
    return ax
In [355]:
from sklearn.metrics import roc_curve, auc
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:
        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    for k in xrange(0, fpr.shape[0], labe):
        #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
        threshold = str(np.round(thresholds[k], 2))
        ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC using many features')
    ax.legend(loc="lower right")
    return ax
In [149]:
def plot_2tree(ax, Xtr, Xte, ytr, yte, clf, plot_train = True, plot_test = True, lab = ['Feature 1', 'Feature 2'], mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.3, psize=10, zfunc=False):
    # Create a meshgrid as our test data
    plt.figure(figsize=(15,10))
    plot_step= 0.05
    xmin, xmax= Xtr[:,0].min(), Xtr[:,0].max()
    ymin, ymax= Xtr[:,1].min(), Xtr[:,1].max()
    xx, yy = np.meshgrid(np.arange(xmin, xmax, plot_step), np.arange(ymin, ymax, plot_step))
    # Re-cast every coordinate in the meshgrid as a 2D point
    Xplot= np.c_[xx.ravel(), yy.ravel()]
    # Predict the class using the classifier passed in (rather than a hard-coded global)
    Z = clf.predict(Xplot)
    # Re-shape the results
    Z= Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap= cmap_light, alpha=0.3)
    # Overlay training samples
    if (plot_train == True):
        plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cmap_bold, alpha=alpha, edgecolor="k")
    # and testing points
    if (plot_test == True):
        plt.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cmap_bold, alpha=alpha, marker="s")
    plt.xlabel(lab[0])
    plt.ylabel(lab[1])
    plt.title("Boundary for decision tree classifier", fontsize=20)
Now let's train all the classifiers, first on the average valences for topic 0 and topic 1.
In [340]:
%%time
two_features_bow = ['mean_t1','mean_t0']
clflog1 = LogisticRegression()
parameters_log = {"C": [0.000000001, 0.00000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100,]}
clflog1, Xtrain, ytrain, Xtest, ytest=do_classify(clflog1, parameters_log, dftouse,
two_features_bow,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.svm import SVC
clfsvm1 = SVC(kernel="linear")
parameters_svc = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clfsvm1, Xtrain, ytrain, Xtest, ytest=do_classify(clfsvm1, parameters_svc, dftouse,
two_features_bow,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn import tree
clfTree1 = tree.DecisionTreeClassifier()
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clfTree1, Xtrain, ytrain, Xtest, ytest=do_classify(clfTree1, parameters_tree, dftouse,
two_features_bow,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.ensemble import AdaBoostClassifier
clfAda1 = AdaBoostClassifier()
parameters_adaboost = {"n_estimators": range(10, 60)}
clfAda1, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda1, parameters_adaboost,
dftouse, two_features_bow, 'adj_opening_gross_bin',
1, mask=mask_classify)
print ""
from sklearn.ensemble import GradientBoostingClassifier
clfGB1 = GradientBoostingClassifier()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB1, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB1, parameters_GB,
dftouse, two_features_bow, 'adj_opening_gross_bin', 1
, mask=mask_classify)
Now let's print out the contour plots, accuracy scores and ROC curves to see which classifier performed the best.
In [271]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clflog1, prob=False);
In [272]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clfsvm1, prob=False);
In [273]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clfAda1, prob=False);
In [274]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clfGB1, prob=False);
In [275]:
plot_2tree(plt, Xtrain, Xtest, ytrain, ytest, clfTree1,
           lab = ['mean_t1','mean_t0'], alpha = 1, plot_test = False)
Let's try plotting the ROC and see which classifier performs better.
In [346]:
ax1=make_roc("log", clflog1, ytest, Xtest, proba=True,labe=50)
ax1=make_roc("svm", clfsvm1, ytest, Xtest,proba=False,labe=50)
ax1=make_roc("AdaBoost", clfAda1, ytest, Xtest, proba=False,labe=50)
ax1=make_roc("GB", clfGB1, ytest, Xtest,proba=False,labe=50)
It seems as though the average valences for topic 0 and topic 1 are very poor features to train classifiers on, as all classifiers predict only non-blockbusters and no blockbusters. The ROC curves also give us inconclusive results, but the GB classifier seems marginally better than the rest.
Now we classify on another set of two variables: the average valence from LabMT and the number of theaters showing the movie.
In [350]:
two_features2 = ['valence_avg','theaters']
clflog12 = LogisticRegression()
parameters_log = {"C": [0.000000001, 0.00000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100,]}
clflog12, Xtrain, ytrain, Xtest, ytest=do_classify(clflog12, parameters_log, dftouse,
two_features2,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.svm import SVC
clfsvm12 = SVC(kernel="linear")
parameters_svc = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clfsvm12, Xtrain, ytrain, Xtest, ytest=do_classify(clfsvm12, parameters_svc, dftouse,
two_features2,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn import tree
clfTree12 = tree.DecisionTreeClassifier()
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clfTree12, Xtrain, ytrain, Xtest, ytest=do_classify(clfTree12, parameters_tree, dftouse,
two_features2,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.ensemble import AdaBoostClassifier
clfAda12 = AdaBoostClassifier()
parameters_adaboost = {"n_estimators": range(10, 60)}
clfAda12, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda12, parameters_adaboost,
dftouse, two_features2, 'adj_opening_gross_bin',
1, mask=mask_classify)
print ""
from sklearn.ensemble import GradientBoostingClassifier
clfGB12 = GradientBoostingClassifier()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB12, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB12, parameters_GB,
dftouse, two_features2, 'adj_opening_gross_bin', 1
, mask=mask_classify)
In [353]:
classifiers2 = [clfsvm12,clflog12,clfAda12,clfGB12]
fig, axes = plt.subplots(nrows = len(classifiers2), ncols = 1, figsize=(40, 40))
for (ax, x) in zip(axes.ravel(), classifiers2):
    Xtr=np.concatenate((Xtrain, Xtest))
    plt.figure()
    ax=plt.gca()
    points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, x, prob=False);
In [354]:
ax1=make_roc("log", clflog12, ytest, Xtest, proba=False,labe=1000)
ax1=make_roc("svm", clfsvm12, ytest, Xtest,proba=False,labe=1000)
ax1=make_roc("AdaBoost", clfAda12, ytest, Xtest, proba=False,skip=50,labe=1000)
ax1=make_roc("GB", clfGB12, ytest, Xtest,proba=False,skip=50,labe=1000)
We see that all the classifiers have very similar accuracy scores, so it is difficult to say which classifier performs best. These two features have more predictive power than the previous two, and that power comes mostly from the number of theaters; the LabMT valence score lends very little predictive power.
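To check that the number of theaters really carries most of this signal, one could refit the same logistic regression on theaters alone and compare test accuracy. This is an extra sketch, not part of the original analysis:
# Hypothetical single-feature fit to isolate the contribution of 'theaters'
clflog_theaters = LogisticRegression()
clflog_theaters, Xtr_t, ytr_t, Xte_t, yte_t = do_classify(clflog_theaters, parameters_log,
                                                          dftouse, ['theaters'],
                                                          'adj_opening_gross_bin', 1,
                                                          mask=mask_classify)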
Now let's try training the classifiers on the list of features that we used in the regression earlier. We will also introduce the random forest classifier here and see how it performs.
In [302]:
Xnames = ['opening_theaters','pct_scorables','theaters','valence_avg_var','valence_sum','abs_valence_avg'
,'valence_avg','mean_t1','mean_t0','max_t1','max_t0']
clflog2 = LogisticRegression()
parameters_log = {"C": [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
clflog2, Xtrain, ytrain, Xtest, ytest=do_classify(clflog2, parameters_log, dftouse,
Xnames,'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.svm import SVC
clfsvm2 = SVC(kernel="linear")
parameters_svc = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clfsvm2, Xtrain, ytrain, Xtest, ytest=do_classify(clfsvm2, parameters_svc, dftouse,
Xnames,'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn import tree
clfTree2 = tree.DecisionTreeClassifier()
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clfTree2, Xtrain, ytrain, Xtest, ytest=do_classify(clfTree2, parameters_tree, dftouse,
Xnames,'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.ensemble import AdaBoostClassifier
clfAda2 = AdaBoostClassifier()
parameters_adaboost = {"n_estimators": range(10, 60)}
clfAda2, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda2, parameters_adaboost,
dftouse, Xnames, 'adj_opening_gross_bin',
1, mask=mask_classify)
print ""
from sklearn.ensemble import GradientBoostingClassifier
clfGB2 = GradientBoostingClassifier()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB2, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB2, parameters_GB, dftouse,
Xnames, 'adj_opening_gross_bin', 1, mask=mask_classify)
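Note that the random forest classifier promised above does not actually appear in this cell. A minimal sketch of adding it with the same do_classify helper (the parameter grid here is our own guess) would be:
# Hypothetical random forest classifier fit on the full feature list
from sklearn.ensemble import RandomForestClassifier
clfforest2 = RandomForestClassifier()
parameters_forest_clf = {"n_estimators": range(5, 40, 5)}
clfforest2, Xtrain, ytrain, Xtest, ytest = do_classify(clfforest2, parameters_forest_clf,
                                                       dftouse, Xnames, 'adj_opening_gross_bin',
                                                       1, mask=mask_classify)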
In [361]:
ax1=make_roc("log", clflog2, ytest, Xtest, proba=False,labe=1000)
ax1=make_roc("svm", clfsvm2, ytest, Xtest,proba=False,labe=1000)
ax1=make_roc("AdaBoost", clfAda2, ytest, Xtest, proba=False,labe=1000)
ax1=make_roc("GB", clfGB2, ytest, Xtest,proba=False,labe=1000)
Here, we again do not see a significant difference between the classifiers, as their accuracies on the test data are roughly similar.
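A compact way to compare them beyond the overlaid curves is to tabulate the test-set AUCs directly. The sketch below (using sklearn's roc_auc_score; not run in the original notebook) does this for the four fitted classifiers:
# Summarize test-set AUC for each classifier trained on the full feature list
from sklearn.metrics import roc_auc_score
for name, clf in [("log", clflog2), ("svm", clfsvm2), ("AdaBoost", clfAda2), ("GB", clfGB2)]:
    # ravel() guards against decision_function returning a column vector in older sklearn
    print "%s AUC: %0.3f" % (name, roc_auc_score(ytest, np.ravel(clf.decision_function(Xtest))))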