Our first attempt can be read about and accessed at this notebook: LINK HERE
About halfway into our project, we realized that our data was less than ideal for a few reasons. First, our Box Office Mojo scrape originally covered only the top 100 ranked movies for each year, i.e. only the upper echelon of movie incomes. The low range of incomes, we presume, had less power to pick up a trend or correlation with valence statistics. We realized that Box Office Mojo actually includes many more movies: rather than just the top 100, it provides grossing information on all of the major movies for the year at hand, usually more than 500. Second, the overlap between our Stanford IMDB data set and our Box Office Mojo data was just over 300 movies, representing 4,111 overlapping reviews. We decided that we needed more data to better determine predictive power and the presence of trends. Third, our IMDB data contained only polarized reviews, i.e. those with stars below 4 or above 7. We decided that this would hamper our ability to test one particular hypothesis: that polarizing movies earn more money. As a result, we needed to scrape a balanced sample of reviews from IMDB for each movie, without discriminating on the number of stars, so that reviews in the 4-7 star range are also represented.
Our new approach, as charted below, was to scrape all data from Box Office Mojo from 1990-2015, resulting in about 11,000 movies. For each of those movies, we scraped a maximum of 100 reviews from IMDB's website. This resulted in a final data set of over half a million reviews, no longer only polarized ones. We feel the new data is better suited to reveal the presence or absence of trends and the predictive relationship between review sentiment and gross income.
In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
import json
import statsmodels.api as sm
from statsmodels.formula.api import logit, glm, ols
from random import random
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import math
from sklearn.svm import LinearSVC
As described above, we scraped all Box Office Mojo data from 1990-2015 (about 11,000 movies) and up to 100 IMDB reviews per movie, giving a final data set of over half a million reviews that are no longer only polarized.
In [2]:
from bs4 import BeautifulSoup
import requests
Let's create a data frame bom_df to contain the scraped Box Office Mojo data.
In [3]:
bom_df = pd.DataFrame(columns=['close_date', 'gross', 'open_date', 'opening_gross', 'opening_theaters','ranking','title','total_theaters','year'])
The function rowInfoGrabber retrieves relevant values from each row in a Box Office Mojo table.
In [4]:
def rowInfoGrabber(r):
info = []
# Ranking
info.append(int(r.find("font").get_text()))
# Title
info.append(r.find("a").get_text())
# Gross
info.append(int(r.find("td", attrs={"align":"right"}).find("b").get_text().strip("$").replace(",","")))
'''
For the next 3 categories, we need to deal with the 2000 Anomaly "Fantasia" where there are missing numbers.
In this case I have chosen to replace the missing values 'N/A' with the values from 'Final Destination', which
is right above it in the movie table and differs in gross income by about $1 million, which is a small
difference. See the picture below for a snapshot of the anomaly in the movie table from 2000.
'''
# Total number of theaters
if r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","") == 'N/A':
info.append(2587)
else:
info.append(int(r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","")))
# Opening Gross
if r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","") == 'N/A':
info.append(10015822)
else:
info.append(int(r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","")))
# Opening Number of Theaters
if r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","") == 'N/A':
info.append(2587)
else:
info.append(int(r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","")))
# Date of Opening
if not r.find_all("td", attrs={"align":"right"})[4].find("a"):
info.append("-")
else:
info.append(r.find_all("td", attrs={"align":"right"})[4].find("a").get_text())
# Date of Closing: Before 2002 they didn't have a "closing" date in their tables. We must account for this.
if (len(r.find_all("td", attrs={"align":"right"})) <= 5):
info.append('-')
else:
info.append(r.find_all("td", attrs={"align":"right"})[5].find("font").get_text())
return info
Let's scrape all movies from Box Office Mojo from 1990-2015. This takes about 4 minutes.
In [5]:
%%time
fields = ["ranking", "title", "gross", "total_theaters", "opening_gross", "opening_theaters", "open_date", "close_date"]
years = [1990 + i for i in range(26)]
for year in years:
print year
pageText = requests.get("http://www.boxofficemojo.com/yearly/chart/?yr=%(yr)d&p=.htm" % {'yr':year})
soup = BeautifulSoup(pageText.text, "html.parser")
movieTable = soup.find("td", attrs={"colspan":"3"})
movieRows = movieTable.find("table").find_all("tr")[2:102]
movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
year_df = pd.DataFrame(movie_dicts)
year_df['year'] = year
bom_df = bom_df.append(year_df, ignore_index=True)
# See how many pages of movies there are for this particular year
extraPageLinks = soup.find("center").find("font", attrs={"face":"Verdana"}).find("b").find_all("a")
time.sleep(1)
for p in extraPageLinks:
pageText = requests.get('http://www.boxofficemojo.com' + p['href'])
soup = BeautifulSoup(pageText.text, "html.parser")
movieTable = soup.find("td", attrs={"colspan":"3"})
movieRows = movieTable.find("table").find_all("tr")
movieRows = movieRows[2:len(movieRows)-4]
movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
year_df = pd.DataFrame(movie_dicts)
year_df['year'] = year
bom_df = bom_df.append(year_df, ignore_index=True)
time.sleep(1)
Let's see how it looks.
In [6]:
bom_df.head()
Out[6]:
Now we'll save and reload so we don't have to do that again.
In [7]:
with open('newBOM.json','r') as fp:
bom_dict = json.load(fp)
bom_df = pd.DataFrame(bom_dict)
We want a less polarized review set in order to more accurately determine the effect of review polarity on gross income. For this we'll have to part with Andrew Maas's Large Movie Review Data Set and scrape IMDB ourselves for a mixture of polarized and non-polarized reviews for each movie. Unfortunately, IMDB's URLs contain a movie ID rather than a movie title, so we cannot request each page's HTML using the title alone. Fortunately, the OMDb API allows us to obtain a movie's IMDB ID based on its title by returning a JSON object.
Let's create a helper function that returns the relevant url used to query the OMDB API given the movie's title.
An example url is http://www.omdbapi.com/?t=Jurassic%20World.
In [8]:
'''
Name: urlOMDB
Input: Movie name
Output: url used to Query OMDB API
'''
def urlOMDB(title):
base = "http://www.omdbapi.com/?t="
base += title.split()[0]
for word in title.split()[1:]:
base += "%20" + word
return base
Let's test our function by getting the IMDB ID for "Jurassic World".
In [9]:
requests.get(urlOMDB("Jurassic World")).json()['imdbID']
Out[9]:
Perfect. Now we'll use these IDs to scrape IMDB for each movie in our Box Office Mojo data frame. First, let's write a function reviewsGrabber which returns a list of (title of review, stars, text of review) tuples for a given IMDB webpage. The function cleans the review texts as it collects them by removing punctuation and converting to lower case.
In [10]:
'''
Name: reviewsGrabber
Input: url (webpage)
Output: [(titleOfReview1, stars1, review1), (titleOfReview2, stars2, review2), ... (titleofReviewn, starsn, reviewn)]
'''
punctuation = set('.,;:!?()[]{}`''\"@#$%^&*+-|-=~_')
def reviewsGrabber(url):
pageText = requests.get(url)
soup = BeautifulSoup(pageText.text, "html.parser")
reviewTable = soup.find("div", attrs={"id":"tn15content"})
divs = []
for rev in reviewTable.find_all("div", attrs={"class": None}):
if len(rev.find_all("img")) > 1:
divs.append((rev.find("h2").get_text(), int(rev.find_all("img")[1]['alt'].split("/")[0])))
else:
divs.append((rev.find("h2").get_text(), 'N/A'))
reviews = []
for rev in reviewTable.find_all("p")[:len(reviewTable.find_all("p"))-1]:
if len(rev.find_all("b")) == 0:
text = rev.get_text()
text = ''.join(ch for ch in text if ch not in punctuation).replace("\n", " ").lower()
reviews.append(text)
return [(title,stars,text) for ((title,stars), text) in zip(divs,reviews)]
Now let's scrape each web page based on the movies in bom_df. We'll take a maximum of 100 reviews for each movie; some movies have fewer than 100 reviews. IMDB lists reviews by default from best to worst, which we don't want. Fortunately, IMDB offers the option to sort reviews in chronological order. This works particularly well for our project, because we want to see whether the earliest reviews are a good indicator of box office performance over the opening weekend.
I would frequently see an error, account for it by tweaking the code (maybe throwing in a try/except here and there), then continue where the code left off by changing the variable "count". (You can see that the most recent stopping point was 11,011.) Every once in a while, the code spits out my progress as a percentage. Babysitting my computer was quite the experience.
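For reference, the chronologically sorted review pages for a title follow a simple pattern: page p starts at review 10*(p-1), which the loop below encodes with ;start=10, ;start=20, and so on. Here is a small standalone sketch (not part of the original scraping code) that just builds those URLs for a hypothetical IMDB ID:
In [ ]:
def reviewPageURLs(imdbID, n_reviews, per_page=10):
    # first page of reviews, sorted chronologically (oldest first)
    base = 'http://www.imdb.com/title/%(imdbID)s/reviews?filter=chrono' % {'imdbID': imdbID}
    urls = [base]
    # later pages are addressed with ;start=10, ;start=20, ... i.e. 10*(p-1) for page p
    start = per_page
    while start < n_reviews:
        urls.append(base + ';start=%d' % start)
        start += per_page
    return urls

# e.g. the three pages needed for a movie with 25 reviews ('tt0000000' is just a placeholder ID)
print reviewPageURLs('tt0000000', 25)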
In [11]:
# fields = ['title','text','stars','close_date','open_date','opening_gross','gross','rank','total_theaters','year']
# count = 11011
# for index, row in bom_df[count:].iterrows():
# close, gross, op, op_gross, open_th, r, title, thea, yr = tuple([row[i] for i in range(9)])
# if random() < 0.02:
# print str(count*100.0 / 11858) + "%"
# try:
# omdb = requests.get(urlOMDB(title)).json()
# except ValueError:
# continue
# omdbJSON = dict(omdb)
# if 'Error' in omdbJSON.keys():
# continue
# else:
# imdbID = str(omdb['imdbID'])
# time.sleep(0.1)
# reviews = []
# baseURL = 'http://www.imdb.com/title/%(imdbID)s/reviews?filter=chrono' % {'imdbID':imdbID}
# pageText = requests.get(baseURL)
# soup = BeautifulSoup(pageText.text, "html.parser")
# if soup.find("div", attrs={'class':'error_code_404'}):
# continue
# try:
# reviews = reviews + reviewsGrabber(baseURL)
# except AttributeError:
# continue
# # Let's scrape the number of reviews for this given movie.
# table = soup.find("div", attrs={"id":"tn15content"}).find_all("table", attrs={'border':'0'})
# if table == []:
# continue
# td = table[1].find("td", attrs = {'align':'right'})
# if td == None:
# continue
# else:
# text = td.get_text()
# n_reviews = int(text.split("reviews")[0].split("\n")[1])
# # This number represents the page: page p is represented in the url as the integer 10(p-1)
# p = 10
# while len(reviews) < min(n_reviews, 100):
# url = baseURL + ";filter=chrono" + str(";start=%(page)d" % {'page': p})
# reviews = list(reviews + reviewsGrabber(url))
# p += 10
# reviews = reviews[:100]
# # review_dict = [dict(zip(fields, [[] for i in range(len(fields))]))]
# for review in reviews:
# review_dict = {'title':[title], 'text': [review[2]], 'review_title': [review[0]], 'stars': [review[1]],
# 'close_date': [close],'gross': [gross], 'open_date': [op], 'opening_gross':[op_gross],
# 'opening_theaters': [open_th], 'rank':[r], 'theaters': [thea], 'year': [yr]}
# combined_df = combined_df.append(pd.DataFrame(review_dict), ignore_index=True)
# count += 1
Let's save our freshly scraped movie reviews.
In [ ]:
# fp = open("complete.json","w")
# json.dump(combined_df.to_dict(), fp)
# fp.close()
with open("complete.json", "r") as fd:
combined_dict = json.load(fd)
combined_df = pd.DataFrame(combined_dict)
In [ ]:
print "Length of dataframe: ", len(combined_df.gross.values)
Because some films do not have a close date, we will have to be careful with the close date! Next, we combine the close_date, open_date and year columns into two time-series columns, close_date and open_date, which will make it easier for us to work with the data in the future. Since Box Office Mojo only gives a month and day, the close year is inferred by comparing the close month with the open month: if the close month is earlier than the open month, the film closed in the following calendar year (for example, a movie opening 11/20 and closing 2/15 in year 1998 gets a close_date in 1999).
In [ ]:
def lam(x, month=False):
if x == '-' or x =='':
return '1'
try:
dummy = int(x)
except ValueError:
return '1'
else:
return x[:x.find('/')]
# splitting the close_date and open_date into the respective month and day
combined_df['close_month'] = combined_df['close_date'].map(lam)
combined_df['close_day'] = combined_df['close_date'].map(lambda x: '0' if x=='-' else x[x.find('/')+1:len(x)])
combined_df['open_month'] = combined_df['open_date'].map(lam)
combined_df['open_day'] = combined_df['open_date'].map(lambda x: x[x.find('/')+1:len(x)])
# dropping the old close_date and open_date
combined_df = combined_df.drop('close_date', 1)
combined_df = combined_df.drop('open_date', 1)
# creating an open_year by turning the year column into a string and getting rid of trailing bits
combined_df['open_year'] = combined_df.year.astype(str)
combined_df['open_year'] = combined_df.open_year.map(lambda x: x[:x.find('.')])
# creating a close_year column, by looking at whether the close month is earlier/later than the open month in the year
close_month = combined_df.close_month.astype(int)
open_month = combined_df.open_month.astype(int)
year = combined_df['year'].astype(int)
close_year=[]
for i in range (0, len(year)):
if close_month[i] >= open_month[i]:
close_year.append(year[i])
else:
close_year.append(year[i]+1)
combined_df['close_year'] = close_year
combined_df['close_year'] = combined_df['close_year'].astype(str)
In [ ]:
# making close_date and open_date by concatenating the year, month and day
import datetime
close_date = []
for index, row in combined_df.iterrows():
if row.close_day != '0':
close_date.append(datetime.datetime(int(row.close_year), int(row.close_month), int(row.close_day)))
else:
close_date.append(None)
combined_df['close_date'] = close_date
combined_df['open_date']=combined_df.open_year + '-' + combined_df.open_month + '-' + combined_df.open_day
def lam2(x):
if x[7] == '-' or x[7:10] == 'Jan':
return x[:7] + "1"
else:
return x
combined_df['open_date']=combined_df['open_date'].map(lam2).apply(pd.datetools.parse)
Let's drop obsolete columns.
In [ ]:
combined_df = combined_df.drop('close_day', 1)
combined_df = combined_df.drop('close_month', 1)
combined_df = combined_df.drop('open_day', 1)
combined_df = combined_df.drop('open_month', 1)
And now let's save our results again.
In [ ]:
# import pymongo
# from bson import json_util
# fp = open("complete_dates.json","w")
# json.dump(combined_df.to_dict(), fp, default=json_util.default)
# fp.close()
with open("complete_dates.json", "r") as fd:
complete_dates = json.load(fd)
combined_df = pd.DataFrame(complete_dates)
Datetime objects are not JSON serializable; we had to convert them in order to save, and now we need to map them back to normal.
In [ ]:
def getdate(dic):
if dic != None:
return int(dic["$date"])
else:
return None
combined_df.open_date = combined_df.open_date.map(getdate)
combined_df.close_date = combined_df.close_date.map(getdate)
It's now time to collect valence statistics on our reviews in order to do sentiment analysis. We'll be using labMT (language assessment by Mechanical Turk), a word score list for sentiment analysis containing over 10,000 words available in the qdap Dictionaries package in R (http://www.inside-r.org/packages/cran/qdapDictionaries/docs/labMT). The file contains a "happiness" value, and ranks words by their happiness. It also includes mean and standard deviation, Twitter rank and Google rank.
In [ ]:
url = 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0026752.s001'
labmt = pd.read_csv(url, skiprows=2, sep='\t', index_col=0)
In [ ]:
labmt.head()
Let's make a happiness dictionary consisting of the word's happiness minus the average of all words.
In [ ]:
average = labmt.happiness_average.mean()
happiness = (labmt.happiness_average - average).to_dict()
Let's look at the happiness values to find a narrow middle range. We'll want to disregard all words in this middle range so they do not overpower the polarized words.
In [ ]:
plt.hist(happiness.values())
plt.axvline(np.percentile(happiness.values(),35), color='r')
plt.axvline(np.percentile(happiness.values(),65), color='r')
We'll consider words below the 35th and above the 65th percentiles.
In [ ]:
lower_bound = np.percentile(happiness.values(),35)
upper_bound = np.percentile(happiness.values(),65)
print lower_bound, upper_bound
Now we'll write a function to remove stop words, words that have little to do with sentiment analysis. We use the sklearn package to obtain these words.
In [ ]:
from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS
punctuation = list('.,;:!?()[]{}`''\"@#$%^&*+-|-=~_')
def removeStopWords(text, stopwords = stopwords):
new_text = ""
for word in text.split():
if word not in stopwords:
while len(word) != 0 and word[-1] in punctuation:
word = word[:len(word)-1]
new_text += word + ' '
return new_text
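A quick check on a made-up string shows both behaviors: stop words are dropped and trailing punctuation is stripped.
In [ ]:
print removeStopWords("the plot was not very good, honestly!")
# -> 'plot good honestly '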
Finally we're ready to analyze a particular review given our valence dictionary. So let's put it in a function.
In [10]:
'''
Name: getValenceInfo()
Inputs: review text, dictionary of happiness
Returns: a 4-tuple of (happiness total, happiness average, total # of scorable words, % of scorable words)
'''
def getValenceInfo(text, valenceDict):
total_words = len(text.split())
happiness_total, count_relevant = 0, 0
for word in text.split():
try:
# Only consider words below the 35th and above the 65th percentile
if valenceDict[word] <= lower_bound or valenceDict[word] >= upper_bound:
happiness_total += valenceDict[word]
count_relevant += 1
except KeyError:
continue
if count_relevant != 0:
avg_valence = 1.*happiness_total/count_relevant
else:
avg_valence = 0
return happiness_total, avg_valence, total_words, 1.*count_relevant / total_words
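As a quick sanity check, here it is on an invented review (the exact numbers depend on the labMT scores, so none are asserted here):
In [ ]:
# clean an invented review, then score it with the happiness dictionary
sample = removeStopWords("the plot was wonderful but the ending felt terrible", stopwords)
print getValenceInfo(sample, happiness)
# -> (happiness_total, avg_valence, total_words, fraction_of_scorable_words)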
Now let's write a function getAllInfo to run this over every review in our data frame.
In [ ]:
'''
Name: getAllInfo
Input: data frame, happiness dictionary, list of stop words
Returns: a new data frame with 4 new columns: valence_sum, valence_avg, n_scorables, pct_scorables
'''
def getAllInfo(df, valenceDict, stopwords):
valence_suml, valence_avgl, review_lenl, review_fractionl = [], [], [], []
for i, row in df.iterrows():
if random() < 0.00001:
print i
cleaned_review = removeStopWords(row['text'], stopwords)
valence_sum, valence_avg, review_len, review_fraction = getValenceInfo(cleaned_review, valenceDict)
valence_suml.append(valence_sum)
valence_avgl.append(valence_avg)
review_lenl.append(review_len)
review_fractionl.append(review_fraction)
conc = pd.DataFrame({'valence_sum': valence_suml, 'valence_avg':valence_avgl ,'n_scorables': review_lenl,
'pct_scorables': review_fractionl})
return pd.concat([df, conc], axis=1)
The code takes about 5 minutes to run. We save the results below.
In [ ]:
%%time
valence_df = getAllInfo(combined_df, happiness, stopwords)
Let's take a look, then save again. The file is almost 1GB in size, so it takes a minute to save.
In [ ]:
valence_df.head()
In [ ]:
# import pymongo
# from bson import json_util
# fp = open("complete_valence.json","w")
# json.dump(valence_df.to_dict(), fp)
# fp.close()
In [40]:
#Loading the latest complete dataset
with open("complete_valence.json", "r") as fp:
valence_df_dict = json.load(fp)
dftouse = pd.DataFrame(valence_df_dict)
In [41]:
dftouse = dftouse.reset_index().drop('index',1)
In [42]:
inflation = pd.read_csv("inf.csv")
In [43]:
years_90 = range(1990,2015)
infdict = {}
infindex = 0
infvalue = 1
testlist = []
for row in inflation.values:
currentval = 1 + (row[1]/100)
cuminf = infvalue*currentval
infdict[years_90[infindex]] = cuminf
infindex += 1
infvalue = cuminf
testlist.append(cuminf)
inframe = pd.DataFrame(data=testlist, index=range(1990,2015))
In [44]:
infdict
Out[44]:
In [45]:
#Creating two new lists of adjusted gross values.
#If the open year of the movie was 2015, we use the 2014 inflation data, since 2015 data is not released until 2016
newgross, newopengross = [], []
for gross, opengross, openyr in zip(dftouse.gross, dftouse.opening_gross, dftouse.open_year):
if int(openyr) == 2015:
openyr = 2014
newgross.append(infdict[int(openyr)]*gross)
newopengross.append(infdict[int(openyr)]*opengross)
In [46]:
#dftouse = flattened_df
dftouse['adj_gross'] = newgross
dftouse['adj_opening_gross'] = newopengross
In [47]:
# creating binary variables of adjusted grossing for classifications
dftouse['adj_gross_bin'] = 1*(dftouse.adj_gross>np.mean(dftouse.adj_gross))
dftouse['adj_opening_gross_bin'] = 1*(dftouse.adj_opening_gross>np.mean(dftouse.adj_opening_gross))
In [48]:
dftouse.head()
Out[48]:
Now that we have adjusted for inflation, we can remove the previous gross and all columns pertaining to date (except perhaps year, since it's interesting).
In [49]:
keep = ['close_date','close_year', 'adj_gross','adj_gross_bin','adj_opening_gross_bin', 'n_scorables','open_date','open_year',
'adj_opening_gross','opening_theaters','pct_scorables', 'rank', 'review_title','stars',
'text','theaters','title','valence_avg','valence_sum','year']
dftouse = dftouse[keep]
Let's save
In [50]:
fp = open("dftouse.json","w")
json.dump(dftouse.to_dict(), fp)
fp.close()
# with open("dftouse.json","r") as fp:
# dftouse_dict = json.load(fp)
# dftouse = pd.DataFrame(dftouse_dict)
Now that we have a dataframe of all the reviews, we want to see if we can use sentiment analysis to better describe the text. Namely, we want to emulate the method used in HW5 to see what each review has to say about different topics related to movies.
In [51]:
with open("dftouse.json", "r") as fp:
dftouse = json.load(fp)
maas_IMDB = pd.DataFrame(dftouse)
In [ ]:
maas_IMDB=maas_IMDB.reset_index()
maas_IMDB.drop('index',1)
In [ ]:
maas_IMDB.shape
First we do some natural language processing; this will take the form of a few steps.
In [ ]:
from pattern.en import parse
from pattern.en import pprint
from pattern.vector import stem, PORTER, LEMMA
punctuation = list('.,;:!?()[]{}`''\"@#$^&*+-|=~_')
from sklearn.feature_extraction import text
stopwords=text.ENGLISH_STOP_WORDS
In [ ]:
import re
regex1=re.compile(r"\.{2,}")
regex2=re.compile(r"\-{2,}")
We define a function get_parts which takes an input review and returns a tuple of nouns and adjectives. Each member of the tuple is a list of lists. In the next steps we will use the nouns to find the topic (using LDA) and the adjectives to do sentiment analysis (via Naive Bayes).
In [ ]:
def get_parts(thetext):
thetext=re.sub(regex1, ' ', thetext)
thetext=re.sub(regex2, ' ', thetext)
nouns=[]
descriptives=[]
for i,sentence in enumerate(parse(thetext, tokenize=True, lemmata=True).split()):
nouns.append([])
descriptives.append([])
for token in sentence:
#print token
if len(token[4]) >0:
if token[1] in ['JJ', 'JJR', 'JJS']:
if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
continue
descriptives[i].append(token[4])
elif token[1] in ['NN', 'NNS']:
if token[4] in stopwords or token[4][0] in punctuation or token[4][-1] in punctuation or len(token[4])==1:
continue
nouns[i].append(token[4])
out=zip(nouns, descriptives)
nouns2=[]
descriptives2=[]
for n,d in out:
if len(n)!=0 and len(d)!=0:
nouns2.append(n)
descriptives2.append(d)
return nouns2, descriptives2
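A quick illustration on a made-up review; the exact lemmas depend on pattern's tagger, so the commented output is only indicative.
In [ ]:
nouns, adjectives = get_parts("The actor was superb. The plot felt dull and predictable.")
print nouns        # something like [[u'actor'], [u'plot']]
print adjectives   # something like [[u'superb'], [u'dull', u'predictable']]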
In [ ]:
# %%time
# # review_parts = maas_IMDB.map(lambda x: get_parts(x.text))
# review_parts = []
# for index, row in maas_IMDB.iterrows():
# review_parts.append(get_parts(row.text))
# if random() < 0.0001:
# print str(int(index)*100./537550) + "%"
Save it so we do not need to run this again:
In [14]:
# fp = open("review_parts.json","w")
# json.dump(review_parts, fp)
# fp.close()
with open("review_parts.json","r") as fp:
review_parts = json.load(fp)
We will now look specifically at the nouns in each sentence to see if that sentence is talking about actor quality, directing, scenery, music etc. The nouns are the first elements of all the tuples we created, so we take the first element in each tuple, remembering that this will be a list of lists of lists.
In [21]:
ldadata = [ele[0] for ele in review_parts]
First we want to create a vocabulary of all the nouns that are used in our database. To assist us we define a function flatten_unique which flattens any list of lists into one list of unique values. The function can be called twice in order to flatten a list of lists of lists into one single list.
In [5]:
import itertools as IT
import collections
def flatten_unique(outer_list):
flattened_list = []
for inner_list in outer_list: # into a review
for element in inner_list:
if element not in flattened_list:
flattened_list.append(element)
return flattened_list
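For example, applying it twice turns a list of lists of lists into a single flat list of unique words:
In [ ]:
print flatten_unique(flatten_unique([[['plot', 'actor'], ['actor']], [['scene']]]))
# -> ['plot', 'actor', 'scene']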
Next we define a function save_vocab which calls flatten_unique on the data, but partitions the data into n buckets so that we can track our progress.
In [365]:
# def save_vocab(data, n):
# ln = len(data)
# count = 0
# vocab = {}
# while count < n:
# lower = int(count*ln/n)
# upper = int((count+1)*ln/n - 1)
# if upper < ln-1:
# data_segment = data[lower:upper]
# data_raw = flatten_unique(flatten_unique(data_segment))
# vocab[count] = data_raw
# else:
# data_segment = data[lower: ln]
# data_raw = flatten_unique(flatten_unique(data_segment))
# vocab[count] = data_raw
# print str(int(count+1)*100./n) + "%"
# count += 1
# return vocab
In [366]:
# %%time
# vocab_unique = save_vocab(ldadata, 20)
We actually decided to run the body of save_vocab outside of a function so that we could save results into the dictionary vocab_unique as the computer ran; that way, if the kernel died we could continue where it left off rather than having to start again.
In [ ]:
%%time
n = 20
ln = len(ldadata)
count = 0
vocab_unique = {}
while count < n:
lower = int(count*ln/n)
upper = int((count+1)*ln/n - 1)
if upper < ln-1:
data_segment = ldadata[lower:upper+1]  # include the upper index so no reviews fall between buckets
data_raw = flatten_unique(flatten_unique(data_segment))
vocab_unique[count] = data_raw
else:
data_segment = ldadata[lower: ln]
data_raw = flatten_unique(flatten_unique(data_segment))
vocab_unique[count] = data_raw
print str(int(count+1)*100./n) + "%", count
count += 1
Saving it for use later on
In [23]:
# fp = open("vocab_unique.json","w")
# json.dump(vocab_unique, fp)
# fp.close()
with open("vocab_unique.json","r") as fp:
vocab_unique_dict = json.load(fp)
We want to take the lists out of the dictionary and save it into a list. While the elements within each list are unique, because they ran independently of each other, the aggregate list vocab_raw
may not be unique because a word may have appeared in the first list and second list. Therefore we also need to remove any duplicates.
In [ ]:
vocab_raw = []
for i in range(len(vocab_unique_dict)):  # one entry per bucket saved above
vocab_raw = vocab_raw + vocab_unique_dict[str(i)]
In [ ]:
# len(vocab_raw)
In [ ]:
# %%time
# vocab_unique = []
# counter = 0
# for i in vocab_raw:
# if i not in vocab_unique:
# vocab_unique.append(i)
# if random() < 0.00001:
# print str(int(counter)*100./1807254) + "%"
# counter += 1
In [ ]:
# fp = open("vocab_unique_2.json","w")
# json.dump(vocab_unique, fp)
# fp.close()
with open("vocab_unique_2.json","r") as fp:
vocab_unique = json.load(fp)
In [ ]:
len(vocab_unique)
We apply the flatten_unique function to ldadata and then comb through the output, vocab_raw, to delete any duplicates and produce vocab_unique.
We make a dictionary vocab which has as its keys the word, and the value an index. Then we also make a complementary dictionary id2word which has as its keys the index, and the value as the word.
In [ ]:
vocab = {}
for i in range(len(vocab_unique)):
vocab[vocab_unique[i]] = i
id2word = {}
for i in range(len(vocab_unique)):
id2word[i] = vocab_unique[i]
In [ ]:
# len(id2word), len(vocab)
The next thing is to create the bag-of-words corpus. We learnt from the PSET that we should run LDA at the granularity of sentences, because we need data with very clear and strong cluster membership so that we can clearly delineate the clusters (topics). Thus we consider things at the level of sentences, where some sentences will very clearly talk about only one topic. (Whole reviews are too mixed and may not provide enough samples that exhibit a strong preference for one membership.)
We write a helper function auxillary_function that can be called on each sentence to create a bag of words, which is a list of (word id, count) tuples.
In [ ]:
from collections import defaultdict
def auxillary_function(ldadata, vocab):
d = defaultdict(int)
for i in ldadata:
index = vocab[i]
d[index] += 1
return d.items()
In [ ]:
tester_sentence = ldadata[0][0]   # e.g. the nouns from the first sentence of the first review
BOW_test = auxillary_function(tester_sentence, vocab)
In order to call auxillary_function on sentences, we need to flatten ldadata once (recall that it is currently a list of lists of lists), so that it is just a list of lists in which each element represents a sentence. Once we flatten ldadata, we can call auxillary_function to make corpus.
In [ ]:
%%time
all_sentences = []
for review in ldadata: # into a review
for sentence in review:
this_sentence = []
for word in sentence:
if word not in this_sentence:
this_sentence.append(word)
all_sentences.append(this_sentence)
In [ ]:
%%time
corpus = []
counter = 0
for sentence in all_sentences:
corpus.append(auxillary_function(sentence, vocab))
if random() < 0.0001:
print str(int(counter)*100./537550) + "%"
counter += 1
In [ ]:
# fp = open("corpus.json","w")
# json.dump(corpus, fp)
# fp.close()
with open("corpus.json","r") as fp:
corpus = json.load(fp)
In [ ]:
import gensim
lda2 = gensim.models.ldamodel.LdaModel(corpus, num_topics=2,id2word = id2word,update_every=1,chunksize=20000,passes=1)
In [ ]:
lda2.print_topics()
In [ ]:
for bow in corpus[0:900:15]:
print bow
print lda2.get_document_topics(bow)
print " ".join([id2word[e[0]] for e in bow])
print "=========================================="
It looks like topic 2 is about the external factors of a movie -- who the actors are and how production went, and topic 1 is about the intrinsic factors of a movie -- the character development and plot. Examples:
* words like "movie" "actor" "film" or "budget" "film" have higher topic 2 ratings.
* words like "relationship" "scenery" "personality" have a higher topic 1 rating
Now we turn to running the sentiment analysis on the adjectives. First we need to get all the adjectives from review_parts.
In [ ]:
nbdata = [ele[1] for ele in review_parts]
Similar to our treatment of the nouns, we create a dictionary adjvocab with the adjectives as keys and an index as values.
In [367]:
# %%time
# adj_unique = save_vocab(nbdata, 10000)
In [ ]:
# fp = open("adj_unique.json","w")
# json.dump(adj_unique, fp)
# fp.close()
with open("adj_unique.json","r") as fp:
adj_unique_dict = json.load(fp)
In [ ]:
adj_aggregate = []
for i in range(len(adj_unique_dict)):
adj_aggregate = adj_aggregate + adj_unique_dict[str(i)]
In [ ]:
%%time
adj_unique = []
for word in adj_aggregate:
if word not in adj_unique:
adj_unique.append(word)
In [ ]:
len(adj_unique)
In [ ]:
# fp = open("adj_unique_2.json","w")
# json.dump(adj_unique, fp)
# fp.close()
with open("adj_unique_2.json","r") as fp:
adj_unique = json.load(fp)
In [ ]:
adjvocab = {}
for i in range(len(adj_unique)):
adjvocab[adj_unique[i]] = i
In [ ]:
len(adjvocab)
We want to flatten the list of adjectives within a review into a single list of all adjectives for the whole review, over all the sentences.
In [ ]:
import itertools
Xarray = []
for i in nbdata: # into a review
Xarraypre = " ".join(list(itertools.chain.from_iterable(i)))
Xarray.append(Xarraypre)
In [ ]:
# fp = open("Xarray.json","w")
# json.dump(Xarray, fp)
# fp.close()
with open("Xarray.json","r") as fp:
Xarray = json.load(fp)
Unlike in HW5, where we used a binary response column based on the star ratings in the dataset, here we used an external sentiment dictionary (labMT) to calculate a valence score for each review, and we use that in lieu of the response array. (Details on how we calculated the valence score are above.)
In [ ]:
# making the response array based off of the sentiment dictionary 'happiness' there is valence_avg and valence_sum
resparray = []
for index, row in maas_IMDB.iterrows():
if row.valence_avg > 0: # need to make it binomial because that's how the calibration/cv score works
resparray.append(1)
else:
resparray.append(0)
In [ ]:
# fp = open("resparray.json","w")
# json.dump(resparray, fp)
# fp.close()
with open("resparray.json","r") as fp:
resparray = json.load(fp)
In [ ]:
print len(Xarray), len(resparray)
print Xarray[0]
Now we create a mask to split things into a test and train set.
In [ ]:
from sklearn.cross_validation import train_test_split
itrain, itest = train_test_split(xrange(len(Xarray)), train_size=0.7)
mask=np.ones(len(Xarray), dtype='int')
mask[itrain]=1
mask[itest]=0
mask = (mask==1)
We ensure that X and y are numpy arrays for sklearn.
In [ ]:
X=np.array(Xarray)
y=np.array(resparray)
We define a function make_xy which takes an X_col of word-based documents and uses the vectorizer to transform them into a feature matrix whose features are the vocabulary. Essentially the vectorizer turns the space-separated adjectives we made above into a bag-of-words representation.
In [ ]:
def make_xy(X_col, y_col, vectorizer):
X = vectorizer.fit_transform(X_col)
y = y_col
return X, y
We write a log_likelihood function which computes the log likelihood of a dataset according to a Naive Bayes classifier.
In [ ]:
"""
Function
--------
log_likelihood
Compute the log likelihood of a dataset according to
a Naive Bayes classifier.
The Log Likelihood is defined by
L = Sum_positive(logP(positive)) + Sum_negative(logP(negative))
Where Sum_positive indicates a sum over all positive reviews,
and Sum_negative indicates a sum over negative reviews
Parameters
----------
clf : Naive Bayes classifier
x : (nexample, nfeature) array
The input data
y : (nexample) integer array
Whether each review is positive
"""
def log_likelihood(clf, x, y):
prob = clf.predict_log_proba(x)
neg = y == 0
pos = ~neg
return prob[neg, 0].sum() + prob[pos, 1].sum()
We define a function cv_score to estimate the cross-validated value of a scoring function, given a classifier and data.
In [ ]:
from sklearn.cross_validation import KFold
def cv_score(clf, x, y, score_func, nfold=5):
"""
Uses 5-fold cross validation to estimate a score of a classifier
Inputs
------
clf : Classifier object
x : Input feature vector
y : Input class labels
score_func : Function like log_likelihood, that takes (clf, x, y) as input,
and returns a score
Returns
-------
The average score obtained by splitting (x, y) into 5 folds of training and
test sets, fitting on the training set, and evaluating score_func on the test set
Examples
cv_score(clf, x, y, log_likelihood)
"""
result = 0
for train, test in KFold(y.size, nfold): # split data into train/test groups, 5 times
clf.fit(x[train], y[train]) # fit
result += score_func(clf, x[test], y[test]) # evaluate score function on held-out data
return result / nfold # average
We define a function calibration_plot which makes a calibration plot.
In [ ]:
def calibration_plot(clf, xtest, ytest):
prob = clf.predict_proba(xtest)[:, 1]
outcome = ytest
data = pd.DataFrame(dict(prob=prob, outcome=outcome))
#group outcomes into bins of similar probability
bins = np.linspace(0, 1, 20)
cuts = pd.cut(prob, bins)
binwidth = bins[1] - bins[0]
#freshness ratio and number of examples in each bin
cal = data.groupby(cuts).outcome.agg(['mean', 'count'])
cal['pmid'] = (bins[:-1] + bins[1:]) / 2
cal['sig'] = np.sqrt(cal.pmid * (1 - cal.pmid) / cal['count'])
#the calibration plot
ax = plt.subplot2grid((3, 1), (0, 0), rowspan=2)
p = plt.errorbar(cal.pmid, cal['mean'], cal['sig'])
plt.plot(cal.pmid, cal.pmid, linestyle='--', lw=1, color='k')
plt.ylabel("Empirical P(+)")
#the distribution of P(+)
ax = plt.subplot2grid((3, 1), (2, 0), sharex=ax)
plt.bar(left=cal.pmid - binwidth / 2, height=cal['count'],
width=.95 * (bins[1] - bins[0]),
fc=p[0].get_color())
plt.xlabel("Predicted P(+)")
plt.ylabel("Number")
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
We run the cross-validation loop over the alphas for MultinomialNB. We do not need to search over min_df for CountVectorizer: because we feed a fixed vocabulary into CountVectorizer, min_df has no effect on the features at all.
In [ ]:
%%time
#the grid of parameters to search over
alphas = [0, .1, 1, 5, 10, 50]
#Find the best value for alpha and min_df, and the best classifier
best_alpha = None
maxscore = -np.inf
for alpha in alphas:
vectorizer = CountVectorizer(vocabulary =adjvocab)
Xthis, Ythis = make_xy(X[mask], y[mask], vectorizer)
clf = MultinomialNB(alpha=alpha)
cvscore = cv_score(clf, Xthis, Ythis, log_likelihood)
if cvscore > maxscore:
maxscore = cvscore
best_alpha = alpha
Now that we've determined the value of alpha that optimizes the cross-validated log-likelihood, we repeat the CountVectorization and fitting to train a final classifier with this parameter, and draw a calibration plot on the test set.
In [ ]:
vectorizer = CountVectorizer(vocabulary=adjvocab)  # min_df is irrelevant here because we pass a fixed vocabulary
Xnew, Ynew = make_xy(X,y, vectorizer)
xtrain, xtest, ytrain, ytest = train_test_split(Xnew, Ynew)
clf = MultinomialNB(alpha=best_alpha).fit(xtrain, ytrain)
training_accuracy = clf.score(xtrain, ytrain)
test_accuracy = clf.score(xtest, ytest)
print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data: %0.2f" % (test_accuracy)
calibration_plot(clf, xtest, ytest)
Unfortunately this calibration plot looks terrible, reflecting the fact that the priors are heavily skewed towards positive reviews.
Because our naive Bayes classifier assumes conditional independence, even though we trained the sentiment analysis on whole reviews, it is fairly straightforward to apply it to each sentence instead.
In order to do this we calculate 2 dictionaries, logpositives and lognegatives. These dictionaries have the words as the keys and the log probabilities as the values.
In [ ]:
vectorizer_alladj = CountVectorizer(vocabulary=adjvocab)  # again, min_df is irrelevant with a fixed vocabulary
X_alladj, Y_alladj = make_xy(X[mask],y[mask], vectorizer_alladj)
clf_alladj = MultinomialNB(alpha=best_alpha).fit(X_alladj, Y_alladj)
logpositives = dict(zip(vectorizer_alladj.get_feature_names(),clf_alladj.feature_log_prob_[1]))  # class 1 = positive
lognegatives = dict(zip(vectorizer_alladj.get_feature_names(),clf_alladj.feature_log_prob_[0]))  # class 0 = negative
In [ ]:
# fp = open("logpositives.json","w")
# json.dump(logpositives, fp)
# fp.close()
with open("logpositives.json","r") as fp:
logpositives = json.load(fp)
In [ ]:
# fp = open("lognegatives.json","w")
# json.dump(lognegatives, fp)
# fp.close()
with open("lognegatives.json","r") as fp:
lognegatives = json.load(fp)
In the two dictionaries, we have, for each individual word, the log probability of that word given a positive or a negative review. Now we write a function calc_pplus which calculates the probability that a sentence is positive. Please refer to HW5 for more information on the maths behind this.
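Concretely, for a sentence $s$ with adjectives $a_1, \dots, a_k$, calc_pplus computes the usual naive Bayes posterior (adjectives not in the training vocabulary are simply skipped):
$$P(+ \mid s) = \frac{P(+)\prod_i P(a_i \mid +)}{P(+)\prod_i P(a_i \mid +) + P(-)\prod_i P(a_i \mid -)}$$
where the per-word log probabilities come from logpositives and lognegatives, and the priors $P(+)$ and $P(-)$ are estimated just below.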
In [ ]:
def calc_pplus(adjlist, lp, ln, pp,pn):
sgivenpos = 0
sgivenneg = 0
for adj in adjlist:
if lp.get(adj) != None:
sgivenpos += lp[adj]
sgivenneg += ln[adj]
return(np.exp(sgivenpos)*pp)/(np.exp(sgivenpos)*pp + np.exp(sgivenneg)*pn)
In [ ]:
reviews=[]
for index, row in maas_IMDB.iterrows():
reviews.append(row.movie_id)
In [ ]:
# fp = open("reviews.json","w")
# json.dump(reviews, fp)
# fp.close()
with open("reviews.json","r") as fp:
reviews = json.load(fp)
In [ ]:
len(review_parts),len(reviews)
We write a function choose_topic which chooses which of the two LDA topics a sentence belongs to, given the bag of words for the sentence.
In [ ]:
def choose_topic(ldamodel, bow):
tee = lda2.get_document_topics(bow)
if len(tee)==2:
t1,t2=tee
if t2[1] >= t1[1]:#get higher probability topic
topicis=t2[0]
else:
topicis=t1[0]
elif len(tee)==1:#if only one was provided its very high probability. Take it
teetuple=tee[0]
topicis=teetuple[0]
return topicis
Before we go any further, it's important to look at the prior probability that a review is positive or negative.
In [ ]:
priorp = np.mean(resparray)
priorn = 1 - priorp
priorp, priorn
It's clear that there are a lot more positive reviews than there are negative reviews.
Now we combine the two functions we just wrote to populate a dictionary reviewdict, keyed by review id, whose value for each review is a list of dictionaries, one per sentence within that review.
In [ ]:
counter=0
reviewdict={}
for i, rid in enumerate(reviews):
rlist=[]
nlist, alist = review_parts[i]
ln=len(alist)
localbow=corpus[counter:counter+ln]
for bow, adj, noun in zip(localbow, alist, nlist):
# doc=" ".join([id2word[e[0]] for e in bow])
pplus=calc_pplus(adj, logpositives, lognegatives, priorp, priorn)
topicis=choose_topic(lda2, bow)
ldict={"topic": topicis, 'pplus':pplus}
rlist.append(ldict)
reviewdict[str(int(rid))]=rlist
counter=counter+ln
print i, rid, ln, counter
In [ ]:
# fp = open("reviewdict.json","w")
# json.dump(reviewdict, fp)
# fp.close()
with open("reviewdict.json","r") as fp:
reviewdict = json.load(fp)
In [ ]:
%%time
list_of_dicts=[]
counter = 0
sentence_counter = 0
for index, row in maas_IMDB.iterrows():
counter += 1
revs=reviewdict[str(int(index))]
# print index, len(revs)
sentence_counter += len(revs)
for r in revs:
r2=r.copy()
r2['movie_id']=row.title
r2['']=row.title
r2['review_id']=index
r2['stars']=row.stars
r2['valence_avg']=row.valence_avg
r2['valence_sum']=row.valence_sum
list_of_dicts.append(r2)
if random() < 0.0001:
print str(int(index)*100./537550) + "%"
Finally we are ready to save this as a pandas dataframe.
In [ ]:
completedf=pd.DataFrame(list_of_dicts)
completedf.head()
In [ ]:
completedf[completedf.stars == "N/A"].shape
In [ ]:
# fp = open("completedf_intermediate.json","w")
# json.dump(completedf.to_dict(), fp)
# fp.close()
with open("completedf_intermediate.json","r") as fp:
completedf = json.load(fp)
completedf = pd.DataFrame(completedf)
Note that this dataframe has one row for every sentence, and we want to group by review and topic. Therefore we define the function get_stats which takes the minimum, maximum, mean, count and variance of the probability that the review is positive, while preserving the valence numbers.
In [ ]:
def get_stats(group):
min_pplus = group.pplus.min()
max_pplus = group.pplus.max()
rid = group.movie_id.unique()
valence_avg = group.valence_avg.mean()
valence_sum = group.valence_sum.mean()
count = group.topic.count()
if count == 1:
varj = 0
else:
varj = group.pplus.var(ddof=1)
mean_pplus = group.pplus.mean()
stars = group.stars.mean()
return pd.DataFrame({'min': min_pplus, 'max':max_pplus ,'valence_avg': valence_avg,'valence_sum':valence_sum,
'count': count,'var': varj, 'mean': mean_pplus, 'stars': stars}, index=rid)
We apply our new function to the groupby object of completedf.
In [ ]:
%%time
dftouse=completedf.groupby(['review_id', 'topic']).apply(get_stats).reset_index()
In [ ]:
print dftouse.shape
dftouse = dftouse.rename(columns={'level_2': 'movie_id'})
dftouse.head()
Finally we save our dataset so we don't have to run this anymore!!!!!!!
In [54]:
fp = open("complete_byreviews.json","w")
json.dump(dftouse.to_dict(), fp)
fp.close()
with open("complete_byreviews.json","r") as fp:
complete_byreviews = json.load(fp)
complete_byreviews = pd.DataFrame(complete_byreviews)
In [74]:
complete_byreviews = complete_byreviews.drop(['valence_avg','valence_sum','var'], 1)
complete_byreviews.head()
Out[74]:
In [181]:
with open("dftouse.json", "r") as fp:
imdb_dict = json.load(fp)
imdb_inflat_adj = pd.DataFrame(imdb_dict)
In [ ]:
imdb_inflat_adj = complete_byreviews.drop(['adj_gross_bin','adj_opening_gross_bin','var'], 1)
In [369]:
reviews_df = pd.concat([imdb_inflat_adj, complete_byreviews], axis=1, join_axes=[complete_byreviews.index])
In [58]:
def isNA(input):
if input=='N/A':
return True
else:
return False
In [368]:
reviews_df = reviews_df[~reviews_df['adj_opening_gross'].map(np.isnan)].fillna(0)
reviews_df = reviews_df[~reviews_df.stars.map(isNA)]
reviews_df['stars'] = reviews_df['stars'].astype(int)
In [172]:
fp = open("reviews_df_jason.json","w")
json.dump(reviews_df.to_dict(), fp)
fp.close()
In [263]:
groupby_topic_mean = reviews_df.groupby(['title','topic'])
topic_vals_mean = groupby_topic_mean['max','mean','min'].mean()
# topic_vals_var = groupby_topic_mean['max','mean','min'].var()
# topic_vals = pd.concat([topic_vals_mean,topic_vals_var], axis=1, join_axes=[topic_vals_var.index])
# topic_vals_var.rename(columns={'max':'max_var','mean':'mean_var','min':'min_var','count':'count_var'})
topic01 = topic_vals_mean.reset_index().set_index('title')
topic1 = topic01[topic01.topic==1].rename(columns={'max':'max_t1','mean':'mean_t1','min':'min_t1','count':'count_t1'})
topic0 = topic01[topic01.topic==0].rename(columns={'max':'max_t0','mean':'mean_t0','min':'min_t0','count':'count_t0'})
In [264]:
bymovie = reviews_df.groupby("title")
adj_gross = bymovie.adj_gross.mean()
adj_gross_bin = bymovie.adj_gross_bin.max()
adj_opening_gross = bymovie.adj_opening_gross.mean()
adj_opening_gross_bin = bymovie.adj_opening_gross_bin.max()
n_scorables = bymovie.n_scorables.mean()
open_theaters = bymovie.opening_theaters.mean()
year = bymovie.open_year.max()
pct_scorables = bymovie.pct_scorables.mean()
star_avg = bymovie.stars.mean()
theaters = bymovie.theaters.mean()
review_count = bymovie.title.count()
valence_avg_var = bymovie.valence_avg.var()
valence_sum_var = bymovie.valence_sum.var()
# max_bow_var = bymovie['max'].var()
# min_bow_var = bymovie['min'].var()
# mean_bow_var = bymovie['mean'].var()
valence_avg = bymovie.valence_avg.mean()
valence_sum = bymovie.valence_sum.mean()
# max_bow = bymovie['max'].mean()
# min_bow = bymovie['min'].mean()
# mean_bow = bymovie['mean'].mean()
In [265]:
dftouse = pd.concat([adj_gross,adj_gross_bin,adj_opening_gross,adj_opening_gross_bin,n_scorables,open_theaters,year
,pct_scorables,star_avg,theaters,review_count,valence_avg_var,valence_sum_var,]
, axis=1, join_axes=[adj_gross.index])
dftouse = dftouse.rename(columns={'valence_avg':'valence_avg_var','valence_sum':'valence_sum_var',})
dftouse = pd.concat([dftouse,valence_avg,valence_sum], axis=1, join_axes=[dftouse.index])
dftouse = pd.concat([dftouse,topic1,topic0],axis=1, join_axes=[dftouse.index])
dftouse['abs_valence_avg'] = np.abs(dftouse['valence_avg']-np.mean(dftouse['valence_avg']))
blockbuster_threshold = 100000000
dftouse['adj_opening_gross_bin'] = 1*(dftouse.adj_gross>blockbuster_threshold)
# we only keep movies with more than 20 reviews
dftouse = dftouse[dftouse.title>20]
for i in dftouse:
dftouse[i] = dftouse[i].fillna(0)
In [267]:
fp = open("dftouse_final.json","w")
json.dump(dftouse.to_dict(), fp)
fp.close()
First by review:
In [370]:
graph_val_avg = plt.hist(dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==1],bins=40,alpha=0.9,label="Blockbuster reviews",color="red")
graph_val_avg = plt.hist(dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==0],bins=40,alpha=0.5,label="Non-Blockbuster reviews")
graph_val_avg = plt.axvline(x=dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==1].mean(), color='red',alpha=0.9,label="Blockbuster mean")
graph_val_avg = plt.axvline(x=dftouse.pct_scorables[dftouse.adj_opening_gross_bin ==0].mean(), color='blue',alpha=0.5,label="Non-Blockbuster mean")
graph_val_avg = plt.title("Percentage of scorable words for Blockbuster and non-Blockbuster movies")
graph_val_avg = plt.xlabel("Percentage of scorable words")
graph_val_avg = plt.ylabel("Frequency")
graph_val_avg = plt.legend(loc='upper left')
In [363]:
opening_gross = reviews_df.groupby("movie_id").adj_opening_gross.mean()
stars = reviews_df.groupby("movie_id").stars.mean()
In [364]:
plt.hist(opening_gross,bins=10000,alpha=0.5,label="Opening gross")
plt.axvline(x=opening_gross.mean(), label = "Mean", color = "r")
plt.ylim(0, 50)
plt.title("Distribution of movie's opening gross (adjusted for inflation) 1990 - 2015")
plt.xlabel("Opening gross (adjusted for inflation)")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[364]:
Now by movie
In [371]:
year_movies = reviews_df.groupby("movie_id").year.mean()
In [372]:
plt.hist(year_movies, bins=26,alpha=0.5,label="Movies")
# plt.axhline(y=avg_number, label = "Mean", color = "b")
plt.title("Number of movies released from 1990 - 2015")
plt.xlabel("Year")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[372]:
The number of movies released increased over time, peaking in 2005.
In [373]:
opening_gross_2000s = reviews_df[reviews_df.year >=2000].groupby("movie_id").adj_opening_gross.mean()
opening_gross_1990s = reviews_df[reviews_df.year < 2000].groupby("movie_id").adj_opening_gross.mean()
In [374]:
plt.hist(opening_gross_2000s,bins=1000, alpha=0.5, color = "r", label="Opening gross for movies post 2000s")
plt.hist(opening_gross_1990s,bins=1000,alpha=0.5, color = "k", label="Opening gross for movies pre 2000s")
# plt.axvline(x=opening_gross.mean(), label = "Mean")
plt.axvline(x=31217400.17, label = "Opening gross for Jaws (1975)")
plt.ylim(0, 100)
plt.title("Distribution of movie's opening gross (adjusted for inflation) 1990 - 2015")
plt.xlabel("Opening gross (adjusted for inflation)")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[374]:
In [376]:
opening_gross_00s = reviews_df[reviews_df.year >=2000][reviews_df.year <2010].groupby("movie_id").adj_opening_gross.mean()
opening_gross_10s = reviews_df[reviews_df.year >=2010].groupby("movie_id").adj_opening_gross.mean()
In [377]:
plt.hist(opening_gross_10s,bins=1000, alpha=0.5, color = "g", label="Opening gross for movies post 2010s")
plt.hist(opening_gross_00s,bins=1000, alpha=0.5, color = "r", label="Opening gross for movies post 2000 - 2010")
plt.hist(opening_gross_1990s,bins=1000,alpha=0.5, color = "k", label="Opening gross for movies pre 2000s")
plt.axvline(x=31217400.17, label = "Opening gross for Jaws (1975)")
plt.ylim(0, 30)
plt.xlim(31217400.17,)
plt.title("Distribution of movie's opening gross (adjusted for inflation) 1990 - 2015")
plt.xlabel("Opening gross (adjusted for inflation)")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[377]:
Even after adjusting for inflation, we learned that there have been more blockbusters in recent years (after 2000) than from 1990-2000. We defined a blockbuster as a movie that made more money, after inflation, than Jaws (1975).
In [378]:
review_count = reviews_df.groupby("movie_id").review_id.count()
In [379]:
plt.hist(review_count, bins = 50, alpha = 0.5, color = "b", label="Number of reviews")
plt.axhline(y=review_count.mean(), color = "r", label = "Average number of reviews")
plt.xlim(0,100)
plt.title("Movie reviews")
plt.xlabel("Number of reviews")
plt.ylabel("Frequency")
plt.legend(loc='upper right')
Out[379]:
In [380]:
plt.scatter(reviews_df.n_scorables, reviews_df.valence_avg)
plt.title("Valence average vs. number of words within the review that were scored")
plt.xlabel("Number of words scorable")
plt.ylabel("Valence average for review")
Out[380]:
It looks like the more scorable words a review has, the more likely the review is to score as neutral. This makes sense, and we'd hope that for reviews with the same valence score our sentiment analysis will produce more differentiated probabilities -- suggesting that there is more nuance.
In [381]:
plt.scatter(reviews_df["mean"], reviews_df["valence_avg"], color = "r")
plt.title("Sentiment analysis gives more nuance than a valence average.")
plt.xlabel("Probability of positive from sentiment analysis")
plt.ylabel("Valence_Avg from simple sentiment dictionary")
Out[381]:
This confirms that our sentiment analysis indicators are more sensitive, which we hope will make them better regressors!
In [383]:
plt.scatter(opening_gross, stars)
plt.axvline(x=31217400.17)
Out[383]:
The blue vertical line marks the threshold (Jaws's inflation-adjusted opening gross) separating Blockbusters from non-Blockbusters.
In [395]:
year_theaters = dftouse.groupby('open_year')
mean_year_theaters = year_theaters['opening_theaters'].mean()
std_year_theaters=year_theaters['opening_theaters'].std()
decadesplot = plt.plot(dftouse.open_year,dftouse.opening_theaters,'.')
decadesplot= plt.xlabel("Year")
decadesplot= plt.ylabel("Number of opening theaters")
decadesplot= plt.title("Number of opening theaters for movies over time")
In [403]:
graph_val_avg = plt.hist(dftouse.valence_avg[dftouse.adj_opening_gross_bin ==1],bins=40,alpha=0.9,label="Blockbuster reviews",color="red")
graph_val_avg = plt.hist(dftouse.valence_avg[dftouse.adj_opening_gross_bin ==0],bins=40,alpha=0.5,label="Non-Blockbuster reviews")
graph_val_avg = plt.axvline(x=dftouse.valence_avg[dftouse.adj_opening_gross_bin ==1].mean(), color='red',alpha=0.9,label="Blockbuster mean")
graph_val_avg = plt.axvline(x=dftouse.valence_avg[dftouse.adj_opening_gross_bin ==0].mean(), color='blue',alpha=0.5,label="Non-Blockbuster mean")
graph_val_avg = plt.title("Valence average for Blockbuster and non-Blockbuster")
graph_val_avg = plt.xlabel("Average of valence")
graph_val_avg = plt.ylabel("Frequency")
graph_val_avg = plt.legend(loc='upper left')
The above shows that average valence as calculated by LabMT is similar across Blockbuster and non-Blockbuster movies.
Looking at the regression results above, we can see our valence data is not strongly correlated with gross box office numbers. However, this does not imply that there is not a relationship between movie reviews and box office success. It is certainly possible that there are different underlying structures for various groups of movies in the world.
A neat way to discover these structures is to implement a clustering algorithm. Using sklearn.mixture.GMM, we can determine if there are underlying Gaussian distributions in the population that are creating the patches of data we see in our sample. Thus, we can determine which subsets of our data we should be running our regressions on.
In [317]:
Xall=dftouse[['valence_avg', 'adj_opening_gross']].values
from sklearn.mixture import GMM
n_clusters=2
clfgmm = GMM(n_components=n_clusters, covariance_type="tied")
clfgmm.fit(Xall)
print clfgmm
gmm_means=clfgmm.means_
gmm_covar=clfgmm.covars_
print gmm_means, gmm_covar
In [318]:
from scipy import linalg
def plot_ellipse(splot, mean, cov, color):
v, w = linalg.eigh(cov)
u = w[0] / linalg.norm(w[0])
angle = np.arctan(u[1] / u[0])
angle = 180 * angle / np.pi # convert to degrees
# filled Gaussian at 2 standard deviation
ell = mpl.patches.Ellipse(mean, 2 * v[0] ** 0.5, 2 * v[1] ** 0.5,
180 + angle, color=color, lw=3, fill=False)
ell.set_clip_box(splot.bbox)
ell1 = mpl.patches.Ellipse(mean, 1 * v[0] ** 0.5, 1 * v[1] ** 0.5,
180 + angle, color=color, lw=3, fill=False)
ell1.set_clip_box(splot.bbox)
ell3 = mpl.patches.Ellipse(mean, 3 * v[0] ** 0.5, 3 * v[1] ** 0.5,
180 + angle, color=color, lw=3, fill=False)
ell3.set_clip_box(splot.bbox)
#ell.set_alpha(0.2)
splot.add_artist(ell)
splot.add_artist(ell1)
splot.add_artist(ell3)
In [334]:
plt.figure()
ax=plt.gca()
plot_ellipse(ax, gmm_means[0], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[1], gmm_covar, 'k')
gmm_labels=clfgmm.predict(Xall)
for k, col in zip(range(n_clusters), ['blue','red']):
my_members = gmm_labels == k
ax.plot(Xall[my_members, 0], Xall[my_members, 1], 'w',
markerfacecolor=col, marker='.', alpha=0.5)
plt.xlabel('Average Valence Score')
plt.ylabel('Opening Weekend Box Office ($)')
plt.title('Discovering Hidden Distributions')
Out[334]:
This gives us a decent approximation of two distinct clusters in our data. Basically, the movies that make more than 700 million dollars in total adjusted gross belong to the red cluster, and the rest (the significant majority of the movies) belong to the blue cluster.
However, the shapes of the clusters still don't look very intuitive, and it looks like the current clustering is just trying to average out extreme values. Let's see whether we get a better result with 3 clusters:
In [333]:
n_clusters=3
clfgmm3 = GMM(n_components=n_clusters, covariance_type="tied")
clfgmm3.fit(Xall)
print clfgmm3
gmm_means=clfgmm3.means_
gmm_covar=clfgmm3.covars_
print gmm_means, gmm_covar
plt.figure()
ax=plt.gca()
plot_ellipse(ax, gmm_means[0], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[1], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[2], gmm_covar, 'k')
gmm_labels=clfgmm3.predict(Xall)
for k, col in zip(range(n_clusters), ['blue','red', 'green']):
    my_members = gmm_labels == k
    ax.plot(Xall[my_members, 0], Xall[my_members, 1], 'w',
            markerfacecolor=col, marker='.', alpha=0.5)
plt.xlabel('Average Valence Score')
plt.ylabel('Opening Weekend Box Office ($)')
plt.title('Discovering Hidden Distributions')
Out[333]:
Brilliant! Our data clearly comes from 3 distinct clusters of distributions: movies that make more than $150 million in their opening weekends ("the blockbusters"), those that make between $50 million and $100 million ("the successful movies"), and those that make less than $50 million ("the ordinary movies"). We will keep this in mind for our future analyses.
Now, let's try 4 clusters:
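To back up the claim that the lowest cluster holds the large majority of the movies, a quick count of the cluster assignments (a sketch we did not include in the original run) looks like this:
# Count how many movies the 3-component GMM assigns to each cluster
cluster_sizes = np.bincount(clfgmm3.predict(Xall))
for k, size in enumerate(cluster_sizes):
    print "cluster %d: %d movies" % (k, size)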
In [331]:
n_clusters=4
clfgmm4 = GMM(n_components=n_clusters, covariance_type="tied")
clfgmm4.fit(Xall)
print clfgmm4
gmm_means=clfgmm4.means_
gmm_covar=clfgmm4.covars_
print gmm_means, gmm_covar
plt.figure()
ax=plt.gca()
plot_ellipse(ax, gmm_means[0], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[1], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[2], gmm_covar, 'k')
plot_ellipse(ax, gmm_means[3], gmm_covar, 'k')
gmm_labels=clfgmm4.predict(Xall)
for k, col in zip(range(n_clusters), ['blue','red', 'green', 'yellow']):
    my_members = gmm_labels == k
    ax.plot(Xall[my_members, 0], Xall[my_members, 1], 'w',
            markerfacecolor=col, marker='.', alpha=0.5)
plt.xlabel('Average Valence Score')
plt.ylabel('Opening Weekend Box Office ($)')
plt.title('Discovering Hidden Distributions')
Out[331]:
Since there is so much overlap between the lower two clusters this time, we see that clustering into 3 groups is the better choice for our data. (We also tried more clusters, but they similarly give worse results than the 3-cluster version.)
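The eyeball comparison of 2, 3 and 4 clusters can also be made more systematic with an information criterion. Below is a sketch (not part of our original analysis) that scans the Bayesian Information Criterion over candidate cluster counts; a lower BIC indicates a better fit-versus-complexity trade-off.
# Compare GMM fits by BIC across candidate numbers of components
for k in range(1, 7):
    gmm_k = GMM(n_components=k, covariance_type="tied")
    gmm_k.fit(Xall)
    print "k = %d, BIC = %0.1f" % (k, gmm_k.bic(Xall))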
Now we will begin our analysis of how well we can predict the opening gross of movies using the features we have, namely the following:
(all variables and explanation of variables)
We are going to use several regression methods, namely: 1) Linear regression 2) Support vector regression (SVR) 3) Decision tree regression 4) Random forest regression 5) AdaBoost regression 6) Gradient boosting regression
For methods 2-6, we will train on these data and optimize each regressor over a small parameter grid. We will then compare the test scores of all the regressors and find the best one.
We begin by defining a function do_regress, which takes a regressor, feeds it into the function cv_optimize for parameter tuning, and then prints the training and testing scores. The function cv_optimize optimizes the regressor over the parameter grid it is given.
In [250]:
# GridSearchCV and train_test_split are imported again further below; included here so this cell runs on its own
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split

def do_regress(clf, parameters, indf, featurenames, targetname, mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    subdf=indf[featurenames]
    X=subdf.values
    y=indf[targetname]
    if mask is not None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split is not None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data: %0.2f" % (test_accuracy)
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest

def cv_optimize(clf, parameters, X, y, n_jobs=1, n_folds=5, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    # print "BEST", gs.best_params_, gs.best_score_, gs.grid_scores_
    best = gs.best_estimator_
    return best
# create training and testing sets
itrain, itest = train_test_split(xrange(dftouse.shape[0]), train_size=0.6)
mask_classify=np.ones(dftouse.shape[0], dtype='int')
mask_classify[itrain]=1
mask_classify[itest]=0
mask_classify = (mask_classify==1)
# define list of features to be trained over
Xnames = ['opening_theaters','pct_scorables','theaters','valence_avg_var','valence_sum','abs_valence_avg'
,'valence_avg','mean_t1','mean_t0','max_t1','max_t0']
In [204]:
# linear regression
from sklearn.linear_model import LinearRegression
clflinear_reg = LinearRegression()
Xlin = dftouse[Xnames].values
ylin = dftouse['adj_opening_gross']
clflinear_reg = clflinear_reg.fit(Xlin[mask_classify], ylin[mask_classify])
training_accuracy = clflinear_reg.score(Xlin[mask_classify], ylin[mask_classify])
test_accuracy = clflinear_reg.score(Xlin[~mask_classify], ylin[~mask_classify])
print "############# based on standard predict ################"
print "Accuracy on training data: %0.2f" % (training_accuracy)
print "Accuracy on test data: %0.2f" % (test_accuracy)
print "########################################################"
In [140]:
%%time
# SVC regression
from sklearn.svm import SVR
parameters_svr = {"C": [0.001, 0.01, 0.1, 1, 10, 100]}
clfsvr_reg = SVR(kernel="linear")
clfsvr_reg, Xtrain, ytrain, Xtest, ytest = do_regress(clfsvr_reg, parameters_svr,
dftouse, Xnames, 'adj_opening_gross', mask=mask_classify)
In [141]:
# Tree regressor
from sklearn.tree import DecisionTreeRegressor
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clftree_reg = DecisionTreeRegressor()
clftree_reg, Xtrain, ytrain, Xtest, ytest = do_regress(clftree_reg, parameters_tree,
dftouse, Xnames, 'adj_opening_gross', mask=mask_classify)
In [142]:
# Random forest regression
from sklearn.ensemble import RandomForestRegressor
parameters_forest = {"n_estimators": range(1, len(Xnames))}
clfforest_1 = RandomForestRegressor()
clfforest_1, Xtrain, ytrain, Xtest, ytest = do_regress(clfforest_1, parameters_forest,
dftouse, Xnames, 'adj_opening_gross', mask=mask_classify)
In [143]:
# AdaBoost Regression
from sklearn.ensemble import AdaBoostRegressor
clfAda_regress = AdaBoostRegressor()
parameters_adaboost = {"n_estimators": range(20, 60)}
clfAda_regress, Xtrain, ytrain, Xtest, ytest = do_regress(clfAda_regress, parameters_adaboost,
                                                          dftouse, Xnames, 'adj_opening_gross',
                                                          mask=mask_classify)
In [144]:
from sklearn.ensemble import GradientBoostingRegressor
clfGB_regress = GradientBoostingRegressor()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB_regress, Xtrain, ytrain, Xtest, ytest = do_regress(clfGB_regress, parameters_GB,
                                                         dftouse, Xnames, 'adj_opening_gross',
                                                         mask=mask_classify)
From this we can see that random forest regression is the best regressor, as it has the highest score (R²) on the test set.
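Since the random forest performs best, it is worth peeking at which features it leans on. The sketch below (not in the original run) ranks the fitted regressor's feature_importances_, assuming clfforest_1 is the tuned model returned by do_regress above:
# Rank features by importance in the fitted random forest regressor
for name, importance in sorted(zip(Xnames, clfforest_1.feature_importances_),
                               key=lambda pair: pair[1], reverse=True):
    print "%s: %0.3f" % (name, importance)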
We are also interested in classifying whether or not movies are blockbusters, defined as having a gross of over $100M. We would like to see to what extent our sentiment analysis can predict whether a given movie has been a blockbuster.
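We assume here that adj_opening_gross_bin is the 0/1 indicator encoding this $100M cutoff on adjusted opening gross. The column was built earlier in the notebook, so the snippet below is only a hypothetical sanity check of that assumption, not its definition.
# Hypothetical check: does a $100M cutoff on adjusted opening gross reproduce the existing label?
hypothetical_bin = (dftouse['adj_opening_gross'] > 100e6).astype(int)
matches = (hypothetical_bin == dftouse['adj_opening_gross_bin']).sum()
print "labels matching the $100M-cutoff reconstruction: %d of %d" % (matches, len(dftouse))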
We will first train classifiers on two variables: our average valence scores for topic 0 and topic 1. This is because we are interested in seeing to what extent our bag-of-words analysis has produced review valences with predictive power. We will then train classifiers on two other variables, namely the average valence from LabMT and the number of theaters showing the film, and compare. The classifiers we will be using are the following:
1) Logistic regression 2) Support Vector Machine 3) Decision tree classifier 4) AdaBoost classifier 5) Gradient boosting classifier
With these classifiers we will visualize performance through contour plots, and we will compare the classifiers to one another using ROC curves.
We will also split between training and testing sets and choose the classifier that performs best on the testing set.
Next, we will train classifiers on a larger list of variables, similar to what we did for regression. For example, this will include the other valence scores we've calculated using the LabMT dictionary, the variance of valence scores across reviews for a given movie, etc. In addition, we will try one more classifier that we were not able to use when we trained on two variables:
6) Random forest classifier
Again, we will split between training and testing sets and select the classifier that performs best on the testing set. We will visualize how well the classifiers perform using an ROC curve.
Let's define all the functions needed to do classification. We write one main function called do_classify that takes a classifier and the data to train on, and prints the classifier's accuracy on the training and test data as well as the confusion matrix. The function cv_optimize, defined above, tunes a classifier over the parameters fed into it. We also create functions to make ROC curves, contour plots and tree contours.
In [145]:
from sklearn.svm import LinearSVC
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap
from sklearn.linear_model import LogisticRegression
def do_classify(clf, parameters, indf, featurenames, targetname, target1val, mask=None, reuse_split=None, score_func=None, n_folds=5, n_jobs=1):
    subdf=indf[featurenames]
    X=subdf.values
    y=(indf[targetname].values==target1val)*1
    print clf.__class__
    if mask is not None:
        print "using mask"
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if reuse_split is not None:
        print "using reuse split"
        Xtrain, Xtest, ytrain, ytest = reuse_split['Xtrain'], reuse_split['Xtest'], reuse_split['ytrain'], reuse_split['ytest']
    if parameters:
        clf = cv_optimize(clf, parameters, Xtrain, ytrain, n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print "############# based on standard predict ################"
    print "Accuracy on training data: %0.2f" % (training_accuracy)
    print "Accuracy on test data: %0.2f" % (test_accuracy)
    print confusion_matrix(ytest, clf.predict(Xtest))
    print "########################################################"
    return clf, Xtrain, ytrain, Xtest, ytest
In [146]:
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])
def points_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.3, psize=10, zfunc=False):
    h = .02
    X=np.concatenate((Xtr, Xte))
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    #plt.figure(figsize=(10,6))
    if mesh:
        if zfunc:
            p0 = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 0]
            p1 = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
            Z=zfunc(p0, p1)
        else:
            Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=alpha, axes=ax)
    ax.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cmap_bold, s=psize, alpha=alpha, edgecolor="k")
    # and testing points
    yact=clf.predict(Xte)
    ax.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cmap_bold, alpha=alpha, marker="s", s=psize+10)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    return ax,xx,yy
In [147]:
def points_plot_prob(ax, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold, ccolor=cm, psize=10, alpha=0.1, prob=True):
    ax,xx,yy = points_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=False, colorscale=colorscale, cdiscrete=cdiscrete, psize=psize, alpha=alpha)
    if prob:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    else:
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=ccolor, alpha=.2, axes=ax)
    cs2 = plt.contour(xx, yy, Z, cmap=ccolor, alpha=.6, axes=ax)
    plt.clabel(cs2, fmt = '%2.1f', colors = 'k', fontsize=14, axes=ax)
    plt.title(clf.__class__)
    return ax
In [355]:
from sklearn.metrics import roc_curve, auc
def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:
        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    for k in xrange(0, fpr.shape[0], labe):
        #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
        threshold = str(np.round(thresholds[k], 2))
        ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC using many features')
    ax.legend(loc="lower right")
    return ax
In [149]:
def plot_2tree(ax, Xtr, Xte, ytr, yte, clf, plot_train = True, plot_test = True, lab = ['Feature 1', 'Feature 2'], mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.3, psize=10, zfunc=False):
    # Create a meshgrid as our test data
    plt.figure(figsize=(15,10))
    plot_step= 0.05
    xmin, xmax= Xtr[:,0].min(), Xtr[:,0].max()
    ymin, ymax= Xtr[:,1].min(), Xtr[:,1].max()
    xx, yy = np.meshgrid(np.arange(xmin, xmax, plot_step), np.arange(ymin, ymax, plot_step))
    # Re-cast every coordinate in the meshgrid as a 2D point
    Xplot= np.c_[xx.ravel(), yy.ravel()]
    # Predict the class using the classifier passed in (rather than a hard-coded global)
    Z = clf.predict(Xplot)
    # Re-shape the results
    Z= Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap= cmap_light, alpha=0.3)
    # Overlay training samples
    if (plot_train == True):
        plt.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cmap_bold, alpha=alpha, edgecolor="k")
    # and testing points
    if (plot_test == True):
        plt.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cmap_bold, alpha=alpha, marker="s")
    plt.xlabel(lab[0])
    plt.ylabel(lab[1])
    plt.title("Boundary for decision tree classifier", fontsize=20)
Now let's train all the classifiers, first on the average valences for topic 0 and topic 1.
In [340]:
%%time
two_features_bow = ['mean_t1','mean_t0']
clflog1 = LogisticRegression()
parameters_log = {"C": [0.000000001, 0.00000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100,]}
clflog1, Xtrain, ytrain, Xtest, ytest=do_classify(clflog1, parameters_log, dftouse,
two_features_bow,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.svm import SVC
clfsvm1 = SVC(kernel="linear")
parameters_svc = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clfsvm1, Xtrain, ytrain, Xtest, ytest=do_classify(clfsvm1, parameters_svc, dftouse,
two_features_bow,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn import tree
clfTree1 = tree.DecisionTreeClassifier()
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clfTree1, Xtrain, ytrain, Xtest, ytest=do_classify(clfTree1, parameters_tree, dftouse,
two_features_bow,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.ensemble import AdaBoostClassifier
clfAda1 = AdaBoostClassifier()
parameters_adaboost = {"n_estimators": range(10, 60)}
clfAda1, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda1, parameters_adaboost,
dftouse, two_features_bow, 'adj_opening_gross_bin',
1, mask=mask_classify)
print ""
from sklearn.ensemble import GradientBoostingClassifier
clfGB1 = GradientBoostingClassifier()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB1, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB1, parameters_GB,
dftouse, two_features_bow, 'adj_opening_gross_bin', 1
, mask=mask_classify)
Now let's print out the contour plots, accuracy scores and ROC curves to see which classifier performed the best.
In [271]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clflog1, prob=False);
In [272]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clfsvm1, prob=False);
In [273]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clfAda1, prob=False);
In [274]:
Xtr=np.concatenate((Xtrain, Xtest))
plt.figure()
ax=plt.gca()
points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, clfGB1, prob=False);
In [275]:
plot_2tree(plt, Xtrain, Xtest, ytrain, ytest, clfTree1,
           lab = ['mean_t1','mean_t0'], alpha = 1, plot_test = False)
Let's try plotting the ROC and see which classifier performs better.
In [346]:
ax1=make_roc("log", clflog1, ytest, Xtest, proba=True,labe=50)
ax1=make_roc("svm", clfsvm1, ytest, Xtest,proba=False,labe=50)
ax1=make_roc("AdaBoost", clfAda1, ytest, Xtest, proba=False,labe=50)
ax1=make_roc("GB", clfGB1, ytest, Xtest,proba=False,labe=50)
It seems as though the average valences for topic 0 and topic 1 are very poor features to train classifiers on, as all classifiers predict only non-blockbusters and no blockbusters. The ROC curves also give us inconclusive results, but the GB classifier seems marginally better than the rest.
Now we classify on another set of two variables: the average valence from LabMT and the number of theaters showing the movie.
In [350]:
two_features2 = ['valence_avg','theaters']
clflog12 = LogisticRegression()
parameters_log = {"C": [0.000000001, 0.00000001, 0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100,]}
clflog12, Xtrain, ytrain, Xtest, ytest=do_classify(clflog12, parameters_log, dftouse,
two_features2,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.svm import SVC
clfsvm12 = SVC(kernel="linear")
parameters_svc = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clfsvm12, Xtrain, ytrain, Xtest, ytest=do_classify(clfsvm12, parameters_svc, dftouse,
two_features2,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn import tree
clfTree12 = tree.DecisionTreeClassifier()
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clfTree12, Xtrain, ytrain, Xtest, ytest=do_classify(clfTree12, parameters_tree, dftouse,
two_features2,
'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.ensemble import AdaBoostClassifier
clfAda12 = AdaBoostClassifier()
parameters_adaboost = {"n_estimators": range(10, 60)}
clfAda12, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda12, parameters_adaboost,
dftouse, two_features2, 'adj_opening_gross_bin',
1, mask=mask_classify)
print ""
from sklearn.ensemble import GradientBoostingClassifier
clfGB12 = GradientBoostingClassifier()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB12, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB12, parameters_GB,
dftouse, two_features2, 'adj_opening_gross_bin', 1
, mask=mask_classify)
In [353]:
classifiers2 = [clfsvm12,clflog12,clfAda12,clfGB12]
fig, axes = plt.subplots(nrows = len(classifiers2), ncols = 1, figsize=(40, 40))
for (ax, x) in zip(axes.ravel(), classifiers2):
    Xtr=np.concatenate((Xtrain, Xtest))
    plt.figure()
    ax=plt.gca()
    points_plot_prob(ax, Xtrain, Xtest, ytrain, ytest, x, prob=False);
In [354]:
ax1=make_roc("log", clflog12, ytest, Xtest, proba=False,labe=1000)
ax1=make_roc("svm", clfsvm12, ytest, Xtest,proba=False,labe=1000)
ax1=make_roc("AdaBoost", clfAda12, ytest, Xtest, proba=False,skip=50,labe=1000)
ax1=make_roc("GB", clfGB12, ytest, Xtest,proba=False,skip=50,labe=1000)
We see that all the classifiers have very similar accuracy scores, so it is difficult to say which classifier performs best. These two features have more predictive power than the previous two, and that power comes mostly from the number of theaters; the LabMT valence score lends very little predictive power.
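To check that the number of theaters really carries most of this signal, one could refit the same logistic regression on theaters alone and compare test accuracy. This is an extra sketch, not part of the original analysis:
# Hypothetical single-feature fit to isolate the contribution of 'theaters'
clflog_theaters = LogisticRegression()
clflog_theaters, Xtr_t, ytr_t, Xte_t, yte_t = do_classify(clflog_theaters, parameters_log,
                                                          dftouse, ['theaters'],
                                                          'adj_opening_gross_bin', 1,
                                                          mask=mask_classify)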
Now let's try training the classifiers on the list of features that we used in the regression earlier. We will also introduce the random forest classifier here and see how it performs.
In [302]:
Xnames = ['opening_theaters','pct_scorables','theaters','valence_avg_var','valence_sum','abs_valence_avg'
,'valence_avg','mean_t1','mean_t0','max_t1','max_t0']
clflog2 = LogisticRegression()
parameters_log = {"C": [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100]}
clflog2, Xtrain, ytrain, Xtest, ytest=do_classify(clflog2, parameters_log, dftouse,
Xnames,'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.svm import SVC
clfsvm2 = SVC(kernel="linear")
parameters_svc = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
clfsvm2, Xtrain, ytrain, Xtest, ytest=do_classify(clfsvm2, parameters_svc, dftouse,
Xnames,'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn import tree
clfTree2 = tree.DecisionTreeClassifier()
parameters_tree = {"max_depth": [1, 2, 3, 4, 5, 6, 7], 'min_samples_leaf': [1, 2, 3, 4, 5, 6]}
clfTree2, Xtrain, ytrain, Xtest, ytest=do_classify(clfTree2, parameters_tree, dftouse,
Xnames,'adj_opening_gross_bin', 1, mask=mask_classify)
print ""
from sklearn.ensemble import AdaBoostClassifier
clfAda2 = AdaBoostClassifier()
parameters_adaboost = {"n_estimators": range(10, 60)}
clfAda2, Xtrain, ytrain, Xtest, ytest = do_classify(clfAda2, parameters_adaboost,
dftouse, Xnames, 'adj_opening_gross_bin',
1, mask=mask_classify)
print ""
from sklearn.ensemble import GradientBoostingClassifier
clfGB2 = GradientBoostingClassifier()
parameters_GB = {"n_estimators": range(30, 60), "max_depth": [1, 2, 3, 4, 5]}
clfGB2, Xtrain, ytrain, Xtest, ytest = do_classify(clfGB2, parameters_GB, dftouse,
Xnames, 'adj_opening_gross_bin', 1, mask=mask_classify)
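Note that the random forest classifier promised above does not actually appear in this cell. A minimal sketch of adding it with the same do_classify helper (the parameter grid here is our own guess) would be:
# Hypothetical random forest classifier fit on the full feature list
from sklearn.ensemble import RandomForestClassifier
clfforest2 = RandomForestClassifier()
parameters_forest_clf = {"n_estimators": range(5, 40, 5)}
clfforest2, Xtrain, ytrain, Xtest, ytest = do_classify(clfforest2, parameters_forest_clf,
                                                       dftouse, Xnames, 'adj_opening_gross_bin',
                                                       1, mask=mask_classify)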
In [361]:
ax1=make_roc("log", clflog2, ytest, Xtest, proba=False,labe=1000)
ax1=make_roc("svm", clfsvm2, ytest, Xtest,proba=False,labe=1000)
ax1=make_roc("AdaBoost", clfAda2, ytest, Xtest, proba=False,labe=1000)
ax1=make_roc("GB", clfGB2, ytest, Xtest,proba=False,labe=1000)
Here, we again do not see a significant difference between the classifiers, as their accuracies on the test data are roughly similar.
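A compact way to compare them beyond the overlaid curves is to tabulate the test-set AUCs directly. The sketch below (using sklearn's roc_auc_score; not run in the original notebook) does this for the four fitted classifiers:
# Summarize test-set AUC for each classifier trained on the full feature list
from sklearn.metrics import roc_auc_score
for name, clf in [("log", clflog2), ("svm", clfsvm2), ("AdaBoost", clfAda2), ("GB", clfGB2)]:
    # ravel() guards against decision_function returning a column vector in older sklearn
    print "%s AUC: %0.3f" % (name, roc_auc_score(ytest, np.ravel(clf.decision_function(Xtest))))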