CS109 Final Project Process Book

Background & Motivation

Social media and entertainment are pervasive parts of millennials' lives, and we want to study their intersection. Is it possible to predict a film's box office success from the sentiment expressed on social media? Stay tuned for more!


In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
import json
import statsmodels.api as sm
from statsmodels.formula.api import glm, ols

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import math

Milestone 1: Scrape and prepare data before Thanksgiving

For our project we will be using data from three different sources:

    Box Office Mojo (BOM) (http://www.boxofficemojo.com) is a website that aggregates, in a table, a list of all movies released in a year, along with attributes such as how much each movie grossed in its opening weekend, how much it grossed in total, and how long it aired
    Large Movie Review Dataset (http://ai.stanford.edu/~amaas/data/sentiment/) is a polarized dataset of movie reviews from IMDB prepared by Maas et al. at Stanford. The dataset contains 25,000 entries in the training set and 25,000 entries in the test set.
    AFINN-111 Dictionary (http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=60100) is a list of 2,477 English words and phrases rated for valence with an integer from -5 to 5, originally prepared by Finn Årup Nielsen (see the loading sketch below)
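
For reference, the AFINN-111 file itself is just a tab-separated list of word/score pairs, so it can be read into a Python dictionary in a few lines. A minimal sketch, assuming the file has been downloaded locally as AFINN-111.txt:


In [ ]:
# Sketch: read AFINN-111 into a {word: valence} dictionary.
# Assumes the file was downloaded locally as "AFINN-111.txt"
# (tab-separated, one "word<TAB>score" pair per line).
afinn = {}
with open("AFINN-111.txt") as f:
    for line in f:
        word, score = line.strip().split("\t")
        afinn[word] = int(score)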

In this first milestone, we will get all the data into a format that we can start running analysis on!

Scraping and cleaning Box Office Mojo

First we import the requests and BeautifulSoup libraries, which make it easy to issue HTTP requests and to parse the returned HTML into Python data structures.


In [2]:
from bs4 import BeautifulSoup
# The "requests" library makes working with HTTP requests easier
# than the built-in urllib libraries.
import requests

Secondly, we prepare the data frame movie_df to store the data that we will scrape from BOM. We give this dataframe 9 columns:

* ranking: the ranking of the movie in its release year by gross
* title: movie title
* gross: how much the movie grossed while in theatres
* total_theaters: the total number of theaters that showed this movie
* opening_gross: how much the movie grossed in the opening weekend (Fri-Sun)
* opening_theaters: the total number of theaters that showed this movie in the opening weekend (Fri-Sun)
* open_date: date of opening
* close_date: date of closing
* year: year of release

In [3]:
movie_df = pd.DataFrame(columns=['close_date', 'gross', 'open_date', 'opening_gross', 'opening_theaters','ranking','title','total_theaters','year'])

Now we write a function rowInfoGrabber that we will call in a loop over the table on the BOM webpage to grab the attributes and save them into the corresponding columns in movie_df.


In [4]:
def rowInfoGrabber(r):
    info = []
    # Ranking
    info.append(int(r.find("font").get_text()))
    # Title
    info.append(r.find("a").get_text())
    # Gross
    info.append(int(r.find("td", attrs={"align":"right"}).find("b").get_text().strip("$").replace(",","")))
    '''
    For the next 3 categories, we need to deal with the 2000 anomaly "Fantasia", which has missing values ('N/A').
    In this case we have chosen to replace the missing values with the values from 'Final Destination', which
    is right above it in the movie table and differs in gross income by only about $1 million, a small
    difference. See the snapshot below of the anomaly in the 2000 movie table.
    '''
    # Total number of theaters
    if r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","") == 'N/A':
        info.append(2587)
    else:
        info.append(int(r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","")))
    # Opening Gross
    if r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","") == 'N/A':
        info.append(10015822)
    else: 
        info.append(int(r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","")))
    # Opening Number of Theaters
    if r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","") == 'N/A':
        info.append(2587)
    else:
        info.append(int(r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","")))
    # Date of Opening
    info.append(r.find_all("td", attrs={"align":"right"})[4].find("a").get_text())
    # Date of Closing: Before 2002 they didn't have a "closing" date in their tables. We must account for this.
    if (len(r.find_all("td", attrs={"align":"right"})) <= 5):
        info.append('-')
    else:
        info.append(r.find_all("td", attrs={"align":"right"})[5].find("font").get_text())
    return info

[Image: snapshot of the 2000 "Fantasia" anomaly in the Box Office Mojo table]


In [5]:
fields = ["ranking", "title", "gross", "total_theaters", "opening_gross", "opening_theaters", "open_date", "close_date"]

Finally, we're ready to scrape!
Because IMDB was created in 1990, we will scrape BOM back to that year, covering the past 26 years (1990 - 2015). Also note that because the page's HTML changed starting in 2001, our scraping has to be a little different before and after that year.


In [6]:
%%time
years = [1990 + i for i in range(26)]
for year in years:
    pageText = requests.get("http://www.boxofficemojo.com/yearly/chart/?yr=%(yr)d&p=.htm" % {'yr':year})
    soup = BeautifulSoup(pageText.text, "html.parser")
    movieTable = soup.find("td", attrs={"colspan":"3"})
    movieRows = movieTable.find("table").find_all("tr")[2:102]
    print year
    movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
    year_df = pd.DataFrame(movie_dicts)
    year_df['year'] = year
    movie_df = movie_df.append(year_df, ignore_index=True)
    time.sleep(1)


1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
CPU times: user 13.1 s, sys: 240 ms, total: 13.4 s
Wall time: 44.2 s

In [7]:
movie_df.shape


Out[7]:
(2600, 9)

In [8]:
movie_df.head()


Out[8]:
close_date gross open_date opening_gross opening_theaters ranking title total_theaters year
0 - 285761243 11/16 17081997 1202 1 Home Alone 2173 1990
1 - 217631306 7/13 12191540 1101 2 Ghost 1766 1990
2 - 184208848 11/9 598257 14 3 Dances with Wolves 1636 1990
3 - 178406268 3/23 11280591 1325 4 Pretty Woman 1811 1990
4 - 135265915 3/30 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 1990

Because some films do not have a close date, we will have to handle the close_date column carefully!


In [9]:
# if we decide it's worth just dumping the movies with no close_date, we can use the code below
# movie_df=movie_df[movie_df.close_date != '-'].reset_index(drop=True)

Next, we split the close_date and open_date strings into month and day columns and derive open_year and close_year columns. This will make it easy to rebuild proper datetime values when we work with the data later.


In [9]:
# splitting the close_date and open_date into the respective month and day
movie_df['close_month'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[:x.find('/')])
movie_df['close_day'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[x.find('/')+1:len(x)])
movie_df['open_month'] = movie_df['open_date'].map(lambda x: x[:x.find('/')])
movie_df['open_day'] = movie_df['open_date'].map(lambda x: x[x.find('/')+1:len(x)])

# dropping the old close_date and open_date
movie_df = movie_df.drop('close_date', 1)
movie_df = movie_df.drop('open_date', 1)

# creating an open_year column by turning the year column into a string and stripping any trailing '.0'
movie_df['open_year'] = movie_df.year.astype(str)
movie_df['open_year'] = movie_df.open_year.map(lambda x: x[:x.find('.')] if '.' in x else x)

# creating a close_year column, by looking at whether the close month is earlier/later than the open month in the year
close_month = movie_df['close_month'].astype(int)
open_month = movie_df['open_month'].astype(int)
year = movie_df['year'].astype(int)
close_year=[]
for i in range (0, len(year)):
    if close_month[i] >= open_month[i]:
        close_year.append(year[i])
    else:
        close_year.append(year[i]+1) 
movie_df['close_year'] = close_year
movie_df['close_year'] = movie_df['close_year'].astype(str)

In [10]:
movie_df.head()


Out[10]:
gross opening_gross opening_theaters ranking title total_theaters year close_month close_day open_month open_day open_year close_year
0 285761243 17081997 1202 1 Home Alone 2173 1990 0 0 11 16 1990 1991
1 217631306 12191540 1101 2 Ghost 1766 1990 0 0 7 13 1990 1991
2 184208848 598257 14 3 Dances with Wolves 1636 1990 0 0 11 9 1990 1991
3 178406268 11280591 1325 4 Pretty Woman 1811 1990 0 0 3 23 1990 1991
4 135265915 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 1990 0 0 3 30 1990 1991

Let's take a look at the data now!


In [11]:
movie_df.head()


Out[11]:
gross opening_gross opening_theaters ranking title total_theaters year close_month close_day open_month open_day open_year close_year
0 285761243 17081997 1202 1 Home Alone 2173 1990 0 0 11 16 1990 1991
1 217631306 12191540 1101 2 Ghost 1766 1990 0 0 7 13 1990 1991
2 184208848 598257 14 3 Dances with Wolves 1636 1990 0 0 11 9 1990 1991
3 178406268 11280591 1325 4 Pretty Woman 1811 1990 0 0 3 23 1990 1991
4 135265915 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 1990 0 0 3 30 1990 1991

Let's see if we can get the run time for each movie!


In [14]:
run_time=[]
for index, row in movie_df.iterrows():
    if row.close_date != None:
        run_time.append(row['close_date']-row['open_date'])
    else: 
        run_time.append('N/A')


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-90b495e926a8> in <module>()
      1 run_time=[]
      2 for index, row in movie_df.iterrows():
----> 3     if row.close_date != None:
      4         run_time.append(row['close_date']-row['open_date'])
      5     else:

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in __getattr__(self, name)
   2148                 return self[name]
   2149             raise AttributeError("'%s' object has no attribute '%s'" %
-> 2150                                  (type(self).__name__, name))
   2151 
   2152     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'close_date'
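
The error above happens because we dropped the original close_date and open_date string columns a few cells earlier, so only the split month/day/year columns exist at this point. Once we rebuild proper datetime columns (as we do in Milestone 2 below), the run time is just a date difference. A minimal sketch, assuming close_date and open_date are datetime columns and that close_date may be missing for films still in theaters:


In [ ]:
# Sketch: run time in days, once close_date and open_date are datetime columns.
# Rows without a close date get NaN instead of a run time.
run_time = []
for index, row in movie_df.iterrows():
    if pd.notnull(row['close_date']):
        run_time.append((row['close_date'] - row['open_date']).days)
    else:
        run_time.append(np.nan)
movie_df['run_time'] = run_time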

In [13]:
movie_df.head()


Out[13]:
gross opening_gross opening_theaters ranking title total_theaters year close_month close_day open_month open_day open_year close_year
0 285761243 17081997 1202 1 Home Alone 2173 1990 0 0 11 16 1990 1991
1 217631306 12191540 1101 2 Ghost 1766 1990 0 0 7 13 1990 1991
2 184208848 598257 14 3 Dances with Wolves 1636 1990 0 0 11 9 1990 1991
3 178406268 11280591 1325 4 Pretty Woman 1811 1990 0 0 3 23 1990 1991
4 135265915 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 1990 0 0 3 30 1990 1991

Looks like the data is ready for us to use! Let's save this data so we're ready to use it next time.


In [17]:
! pip install pymongo


Requirement already satisfied (use --upgrade to upgrade): pymongo in /anaconda/lib/python2.7/site-packages

In [15]:
# Save the movie Dictionaries corresponding to each row of the BoxOfficeMojo table.
import json # (dong)
# import pymongo
# from bson import json_util

# Make a dictionary out of the dataset for storage in JSON format.
movieSaved = {feature: movie_df[feature].values.tolist() for feature in movie_df.columns.values}
fp = open("allMovies_new.json","w")
json.dump(movieSaved, fp)
fp.close()

Loading and preparing IMDB review dataset

We cleaned up our IMDB review dataset in the IPython notebook New.ipynb. From that notebook we saved dictionaries of our data, which we will now load.


In [16]:
with open("train_df_dict.json", "r") as fd:
    train_df_dict = json.load(fd)
with open("test_df_dict.json", "r") as fd:
    test_df_dict = json.load(fd)

In [17]:
train_df = pd.DataFrame(train_df_dict)
test_df = pd.DataFrame(test_df_dict)

The Stanford group distinguishes between train and test because this was relevant for their project. This may prove to be useful later, so we will keep two separate dataframes. However, for our purposes at the moment, we can combine them since the BOM data will serve as our true data set.


In [18]:
IMDB_df = train_df.append(test_df)

We want to figure out which movies from IMDB_df are also present in our BOM dataframe. So let's get all the movie titles in our BOM table.


In [19]:
BOM_movie_list = movie_df.title.values.tolist()

Now let's create a mask over IMDB_df, the boolean values of which indicate whether or not a movie is in the BOM list.


In [20]:
movie_mask = [(movie in BOM_movie_list) for movie in IMDB_df.movie_name]

In [21]:
sum(movie_mask)


Out[21]:
4111

We can now create our final, relevant IMDB data frame, containing only those movies that also appear in the BOM tables from 1990 - 2015.


In [22]:
IMDB_dftouse=IMDB_df[movie_mask]

Finally we want to save our dictionary of IMDB_dftouse into a JSON file for storage.


In [23]:
IMDB_dftouse_dict = {feature: IMDB_dftouse[feature].values.tolist() for feature in IMDB_dftouse.columns.values}
fp = open("IMDB_dftouse_dict.json","w")
json.dump(IMDB_dftouse_dict, fp)
fp.close()

In [24]:
# Reopen
with open("IMDB_dftouse_dict.json", "r") as fd:
    IMDB_dftouse_dict = json.load(fd)

Analyzing and Saving Review Attributes Using labMT Happiness Dictionary

Now let's download labMT, a word score list for sentiment analysis containing over 10,000 words. For each word the file gives a "happiness" value and a happiness rank, along with the standard deviation of the happiness scores and the word's rank on Twitter, Google, the New York Times, and song lyrics.


In [25]:
url = 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0026752.s001'
labmt = pd.read_csv(url, skiprows=2, sep='\t', index_col=0)

In [26]:
labmt.head()


Out[26]:
happiness_rank happiness_average happiness_standard_deviation twitter_rank google_rank nyt_rank lyrics_rank
word
laughter 1 8.50 0.9313 3600 -- -- 1728
happiness 2 8.44 0.9723 1853 2458 -- 1230
love 3 8.42 1.1082 25 317 328 23
happy 4 8.30 0.9949 65 1372 1313 375
laughed 5 8.26 1.1572 3334 3542 -- 2332

Now let's create a happiness dictionary of (word, valence) pairs, where each valence is that word's happiness score minus the mean happiness across all words, so the scores are centered at zero.


In [27]:
average = labmt.happiness_average.mean()
happiness = (labmt.happiness_average - average).to_dict()

In [28]:
print "Score(happy): ", happiness['happy']
print "Score(miserable): ", happiness['miserable']
print "Best score: ", max(happiness.values())
print "Worst score: ", min(happiness.values())


Score(happy):  2.92476032088
Score(miserable):  -2.83523967912
Best score:  3.12476032088
Worst score:  -4.07523967912

In [29]:
# Save to disc
# fp = open("happiness.json","w")
# json.dump(happiness, fp)
# fp.close()

In [30]:
# Reopen
with open("happiness.json", "r") as fp:
    happiness = json.load(fp)

Now let's collect several attributes from a given review's text body and save all the valuable information into a new data frame. First we define a function that removes stop words (words that are unimportant from a valence perspective) from a text body.


In [31]:
from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS
punctuation = list('.,;:!?()[]{}`''\"@#$%^&*+-|-=~_')

def removeStopWords(text, stopwords = stopwords):
    new_text = ""
    for word in text.split():
        if word not in stopwords:
            while len(word) != 0 and word[-1] in punctuation:
                word = word[:len(word)-1]
            new_text += word + ' '
    return new_text

Now we'll write a function that returns total happiness, average happiness, total scorable words, and percentage of scorable words in a given review text.


In [32]:
'''
Name: getValenceInfo()
Inputs: review text, dictionary of happiness
Returns: a 4-tuple of (happiness total, happiness average, total # of scorable words, % of scorable words)
'''
def getValenceInfo(text, valenceDict):
    total_words = len(text.split())
    happiness_total, count_relevant = 0, 0
    for word in text.split():
        if word in valenceDict:
            count_relevant += 1
            happiness_total += valenceDict[word]
    if count_relevant != 0: 
        avg_valence = 1.*happiness_total/count_relevant
    else: 
        avg_valence = 0
    return happiness_total, avg_valence, total_words, 1.*count_relevant / total_words

Now we'll write a function that, given a data frame, returns a new data frame with the concatenation of valence (happiness) info in 4 new columns: valence sum, valence average, # of scorable words, % of scorable words.


In [33]:
'''
Name: getAllInfo
Input: data frame, happiness dictionary, list of stop words
Returns: a new data frame with 4 new columns: valence_sum, valence_avg, n_scorables, pct_scorables
'''
def getAllInfo(df, valenceDict, stopwords): 
    valence_suml, valence_avgl, review_lenl, review_fractionl = [], [], [], []
    for i, row in df.iterrows():
        cleaned_review = removeStopWords(row['text'], stopwords)
        valence_sum, valence_avg, review_len, review_fraction = getValenceInfo(cleaned_review, valenceDict)
        valence_suml.append(valence_sum)
        valence_avgl.append(valence_avg)
        review_lenl.append(review_len)
        review_fractionl.append(review_fraction)
    conc = pd.DataFrame({'valence_sum': valence_suml, 'valence_avg':valence_avgl ,'n_scorables': review_lenl, 
                         'pct_scorables': review_fractionl})
    return pd.concat([df, conc], axis=1)

Now let's create a new dataframe valence_df with the valence statistics run on our IMDB_df. This code takes a few minutes to run.


In [34]:
%%time
valence_df = getAllInfo(IMDB_df, happiness, stopwords)


---------------------------------------------------------------------------
InvalidIndexError                         Traceback (most recent call last)
<ipython-input-34-5d56700d21b1> in <module>()
----> 1 get_ipython().run_cell_magic(u'time', u'', u'valence_df = getAllInfo(IMDB_df, happiness, stopwords)')

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_cell_magic(self, magic_name, line, cell)
   2262             magic_arg_s = self.var_expand(line, stack_depth)
   2263             with self.builtin_trap:
-> 2264                 result = fn(magic_arg_s, cell)
   2265             return result
   2266 

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/IPython/core/magic.pyc in <lambda>(f, *a, **k)
    191     # but it's overkill for just that one bit of state.
    192     def magic_deco(arg):
--> 193         call = lambda f, *a, **k: f(*a, **k)
    194 
    195         if callable(arg):

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
   1164         else:
   1165             st = clock2()
-> 1166             exec(code, glob, local_ns)
   1167             end = clock2()
   1168             out = None

<timed exec> in <module>()

<ipython-input-33-b22b5564e667> in getAllInfo(df, valenceDict, stopwords)
     15     conc = pd.DataFrame({'valence_sum': valence_suml, 'valence_avg':valence_avgl ,'n_scorables': review_lenl, 
     16                          'pct_scorables': review_fractionl})
---> 17     return pd.concat([df, conc], axis=1)

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    752                        keys=keys, levels=levels, names=names,
    753                        verify_integrity=verify_integrity,
--> 754                        copy=copy)
    755     return op.get_result()
    756 

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy)
    884         self.copy = copy
    885 
--> 886         self.new_axes = self._get_new_axes()
    887 
    888     def get_result(self):

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_new_axes(self)
    944                 if i == self.axis:
    945                     continue
--> 946                 new_axes[i] = self._get_comb_axis(i)
    947         else:
    948             if len(self.join_axes) != ndim - 1:

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/tools/merge.pyc in _get_comb_axis(self, i)
    970                 raise TypeError("Cannot concatenate list of %s" % types)
    971 
--> 972         return _get_combined_index(all_indexes, intersect=self.intersect)
    973 
    974     def _get_concat_axis(self):

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in _get_combined_index(indexes, intersect)
   5730             index = index.intersection(other)
   5731         return index
-> 5732     union = _union_indexes(indexes)
   5733     return _ensure_index(union)
   5734 

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in _union_indexes(indexes)
   5762         else:
   5763             for other in indexes[1:]:
-> 5764                 result = result.union(other)
   5765             return result
   5766     elif kind == 'array':

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in union(self, other)
   1400                 result.extend([x for x in other.values if x not in value_set])
   1401         else:
-> 1402             indexer = self.get_indexer(other)
   1403             indexer, = (indexer == -1).nonzero()
   1404 

/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/pandas/core/index.pyc in get_indexer(self, target, method, limit)
   1685 
   1686         if not self.is_unique:
-> 1687             raise InvalidIndexError('Reindexing only valid with uniquely'
   1688                                     ' valued Index objects')
   1689 

InvalidIndexError: Reindexing only valid with uniquely valued Index objects
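
The InvalidIndexError above is most likely caused by duplicate index labels in IMDB_df: train_df.append(test_df) keeps both frames' original 0..N-1 indices, so the pd.concat(..., axis=1) inside getAllInfo cannot align the new valence columns. One possible fix (a sketch, assuming nothing else relies on the original index) is to reset the index before computing the valence statistics:


In [ ]:
# Possible fix (sketch): give IMDB_df a unique index before computing valence stats.
IMDB_df = IMDB_df.reset_index(drop=True)
valence_df = getAllInfo(IMDB_df, happiness, stopwords)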

In [36]:
valence_df.head()


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-36-efa49c793d2a> in <module>()
----> 1 valence_df.head()

NameError: name 'valence_df' is not defined

In [ ]:
# Convert True/False to 1/0: needed to make valence_df JSON serializable, also better practice
valence_df.positive = 1.0*valence_df.positive

In [ ]:
# Save to disc
# fp = open("valence_df_dict.json","w")
# json.dump(valence_df.to_dict(), fp)
# fp.close()

In [37]:
# Reopen
with open("valence_df_dict.json", "r") as fp:
    valence_df_dict = json.load(fp)
valence_df = pd.DataFrame(valence_df_dict)

Milestone 2: Analyzing and visualizing the data

Descriptive statistics

We're going to reimport our prepared datasets: dictionary.csv (which contains the dictionary from the AFINN dataset) and IMDB_dftouse_dict.json (which contains all the reviews that cross over with our BOM data). We will import these as pandas dataframes dictionary and IMDB_df respectively.


In [38]:
dictionary=pd.read_csv("dictionary.csv")

with open("IMDB_dftouse_dict.json", "r") as fd:
    IMDB = json.load(fd)
IMDB_df = pd.DataFrame(IMDB)

Now we'll import allMovies_new.json as movie_df. Remember that we still need to convert some of the columns into the datetime type, so we'll do that afterwards.


In [39]:
with open("allMovies_new.json", "r") as fd:
    movie_df = json.load(fd)
    movie_df = pd.DataFrame(movie_df)

In [40]:
# making close_date and open_date by concatenating the year, month and day
import datetime
close_date = []
for index, row in movie_df.iterrows():
    if row.close_day != '0':
        close_date.append(datetime.datetime(int(row.close_year), int(row.close_month), int(row.close_day)))
    else: 
        close_date.append(None)
movie_df['close_date'] = close_date

movie_df['open_date']=movie_df.open_year + '-' + movie_df.open_month + '-' + movie_df.open_day
movie_df['open_date']=movie_df['open_date'].apply(pd.datetools.parse)

# dropping unnecessary columns
movie_df = movie_df.drop('close_day', 1)
movie_df = movie_df.drop('close_month', 1)
movie_df = movie_df.drop('open_day', 1)
movie_df = movie_df.drop('open_month', 1)

Great! Let's take a look at all the dataframes we've created.


In [41]:
movie_df


Out[41]:
close_year gross open_year opening_gross opening_theaters ranking title total_theaters year close_date open_date
0 1991 285761243 1990 17081997 1202 1 Home Alone 2173 1990 None 1990-11-16
1 1991 217631306 1990 12191540 1101 2 Ghost 1766 1990 None 1990-07-13
2 1991 184208848 1990 598257 14 3 Dances with Wolves 1636 1990 None 1990-11-09
3 1991 178406268 1990 11280591 1325 4 Pretty Woman 1811 1990 None 1990-03-23
4 1991 135265915 1990 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 1990 None 1990-03-30
5 1991 122012643 1990 17161835 1225 6 The Hunt for Red October 1817 1990 None 1990-03-02
6 1991 119394840 1990 25533700 2060 7 Total Recall 2131 1990 None 1990-06-01
7 1991 117540947 1990 21744661 2507 8 Die Hard 2: Die Harder 2507 1990 None 1990-07-06
8 1991 103738726 1990 22543911 2332 9 Dick Tracy 2332 1990 None 1990-06-15
9 1991 91457688 1990 7918560 1833 10 Kindergarten Cop 1937 1990 None 1990-12-22
10 1991 87727583 1990 19089645 2019 11 Back to the Future Part III 2070 1990 None 1990-05-25
11 1991 86303188 1990 11718981 1349 12 Presumed Innocent 1451 1990 None 1990-07-27
12 1991 82670733 1990 15490445 2307 13 Days of Thunder 2307 1990 None 1990-06-29
13 1991 80818974 1990 19475559 2721 14 Another 48 HRS. 2721 1990 None 1990-06-08
14 1991 71609321 1990 13774642 1281 15 Three Men and a Little Lady 1614 1990 None 1990-11-21
15 1991 70978012 1990 15338160 1944 16 Bird on a Wire 2008 1990 None 1990-05-18
16 1991 66666062 1990 19558558 1901 17 The Godfather Part III 1922 1990 None 1990-12-25
17 1991 61489265 1990 10034685 1319 18 Flatliners 1483 1990 None 1990-08-10
18 1991 61276872 1990 10076834 1244 19 Misery 1370 1990 None 1990-11-30
19 1991 56362352 1990 159622 2 20 Edward Scissorhands 1372 1990 None 1990-12-07
20 1991 53470891 1990 10026900 1714 21 Problem Child 1769 1990 None 1990-07-27
21 1991 53208180 1990 8045760 1479 22 Arachnophobia 2005 1990 None 1990-07-20
22 1991 52096475 1990 417076 12 23 Awakenings 1330 1990 None 1990-12-22
23 1991 47789074 1990 8100640 1576 24 Look Who's Talking Too 1647 1990 None 1990-12-14
24 1991 47410827 1990 9213631 1301 25 Hard to Kill 1508 1990 None 1990-02-09
25 1991 46836214 1990 6368901 1070 26 Goodfellas 1328 1990 None 1990-09-19
26 1991 46044396 1990 11790047 1968 27 Marked for Death 1974 1990 None 1990-10-05
27 1991 45681173 1990 14145411 1768 28 Robocop 2 1806 1990 None 1990-06-22
28 1991 44645619 1990 7708029 1901 29 The Jungle Book (re-issue) (1990) 1923 1990 None 1990-07-13
29 1991 44143410 1990 8017438 1770 30 Young Guns II 1770 1990 None 1990-08-03
... ... ... ... ... ... ... ... ... ... ... ...
2570 2016 29467855 2015 8246267 1960 71 A Walk in the Woods 2158 2015 None 2015-09-02
2571 2015 27740955 2015 10542116 2766 72 Sinister 2 2799 2015 2015-10-29 00:00:00 2015-08-21
2572 2015 27288872 2015 8111264 3355 73 No Escape 3415 2015 2015-11-26 00:00:00 2015-08-26
2573 2016 26997690 2015 10812861 3082 74 The Last Witch Hunter 3082 2015 None 2015-10-23
2574 2016 26822144 2015 6610961 1603 75 Ricki and the Flash 2064 2015 None 2015-08-07
2575 2015 26501323 2015 15027415 2602 76 The Woman in Black 2: Angel of Death 2602 2015 2015-03-19 00:00:00 2015-01-02
2576 2015 26461644 2015 11012305 3171 77 Run All Night 3171 2015 2015-05-07 00:00:00 2015-03-13
2577 2016 26439912 2015 9880536 2960 78 The Night Before 2960 2015 None 2015-11-20
2578 2015 25801047 2015 10203437 2666 79 The Lazarus Effect 2666 2015 2015-06-11 00:00:00 2015-02-27
2579 2015 25442958 2015 237264 4 80 Ex Machina 2004 2015 2015-09-03 00:00:00 2015-04-10
2580 2015 22764410 2015 9808463 2720 81 The Gallows 2720 2015 2015-09-17 00:00:00 2015-07-10
2581 2015 22467450 2015 8326530 3261 82 Hitman: Agent 47 3273 2015 2015-10-15 00:00:00 2015-08-21
2582 2015 22348241 2015 8310252 2893 83 Project Almanac 2900 2015 2015-03-26 00:00:00 2015-01-30
2583 2015 21571189 2015 6213362 1823 84 Black or White 1823 2015 2015-05-14 00:00:00 2015-01-30
2584 2015 21067116 2015 9670235 2815 85 Aloha 2815 2015 2015-07-30 00:00:00 2015-05-29
2585 2016 21028459 2015 8317545 2603 86 Love the Coopers 2603 2015 None 2015-11-13
2586 2015 19375982 2015 4038962 2320 87 Shaun the Sheep Movie 2360 2015 2015-10-22 00:00:00 2015-08-05
2587 2015 18754371 2015 197000 12 88 Still Alice 1318 2015 2015-05-21 00:00:00 2015-01-16
2588 2016 18300124 2015 8070493 1656 89 Paranormal Activity: The Ghost Dimension 1656 2015 None 2015-10-23
2589 2015 17737646 2015 2434908 361 90 Mr. Holmes 898 2015 2015-11-05 00:00:00 2015-07-17
2590 2016 17717593 2015 521522 4 91 Steve Jobs 2493 2015 None 2015-10-09
2591 2015 17506470 2015 6100010 2002 92 Dope 2002 2015 2015-09-17 00:00:00 2015-06-19
2592 2015 17223265 2015 7217640 2875 93 Seventh Son 2875 2015 2015-03-19 00:00:00 2015-02-06
2593 2015 16432322 2015 4577861 2012 94 Monkey Kingdom 2012 2015 2015-07-23 00:00:00 2015-04-17
2594 2015 16029670 2015 7355622 3434 95 The Transporter Refueled 3434 2015 2015-11-19 00:00:00 2015-09-04
2595 2016 15012671 2015 6652996 2392 96 The Secret in their Eyes (2015) 2392 2015 None 2015-11-20
2596 2015 14674076 2015 160089 4 97 It Follows 1655 2015 2015-06-25 00:00:00 2015-03-13
2597 2015 14440985 2015 5454284 2778 98 American Ultra 2778 2015 2015-10-08 00:00:00 2015-08-21
2598 2016 14218236 2015 4002226 1553 99 Woodlawn 1553 2015 None 2015-10-16
2599 2016 13550619 2015 5002521 3003 100 Burnt 3003 2015 None 2015-10-30

2600 rows × 11 columns


In [42]:
IMDB_df.head()


Out[42]:
movie_id movie_name positive stars text url
0 10027 Titanic True 7 Sure, Titanic was a good movie, the first time... http://www.imdb.com/title/tt0120338/usercommen...
1 10028 Titanic True 10 When I saw this movie I was stunned by what a ... http://www.imdb.com/title/tt0120338/usercommen...
2 10029 Titanic True 10 Why do people bitch about this movie and not a... http://www.imdb.com/title/tt0120338/usercommen...
3 10030 Titanic True 10 What's inexplicable? Firstly, the hatred towar... http://www.imdb.com/title/tt0120338/usercommen...
4 10031 Titanic True 10 Previously, I wrote that I loved "Titanic", cr... http://www.imdb.com/title/tt0120338/usercommen...

Now we want to make a new dataframe flattened_df that we can use to run our regressions on. This dataframe will include all the columns in movie_df plus three extra columns:

  • review_count: the number of reviews for the movie in IMDB_df
  • star_avg: the average number of stars across those reviews
  • positive: the fraction of those reviews that are positive

In [43]:
# set index to title
indexed_df = movie_df.set_index("title")

In [44]:
# use groupby to get the review_count and the star_avg
gold = IMDB_df.groupby("movie_name")
review_count = gold.movie_name.count()
star_avg = gold.stars.mean()
positive = gold.positive.mean()

In [45]:
# concatenate the two series into our final dataframe flattened_df
flattened_df = pd.concat([indexed_df, review_count], axis=1, join_axes=[indexed_df.index])
flattened_df.rename(columns={'movie_name': 'review_count'}, inplace=True)
flattened_df = pd.concat([flattened_df, star_avg], axis=1, join_axes=[indexed_df.index])
flattened_df.rename(columns={'stars': 'star_avg'}, inplace=True)
flattened_df = pd.concat([flattened_df, positive], axis=1, join_axes=[indexed_df.index])

Let's take a look at what our dataframe looks like now!


In [46]:
flattened_df.head()


Out[46]:
close_year gross open_year opening_gross opening_theaters ranking total_theaters year close_date open_date review_count star_avg positive
title
Home Alone 1991 285761243 1990 17081997 1202 1 2173 1990 None 1990-11-16 NaN NaN NaN
Ghost 1991 217631306 1990 12191540 1101 2 1766 1990 None 1990-07-13 NaN NaN NaN
Dances with Wolves 1991 184208848 1990 598257 14 3 1636 1990 None 1990-11-09 NaN NaN NaN
Pretty Woman 1991 178406268 1990 11280591 1325 4 1811 1990 None 1990-03-23 2 1 0
Teenage Mutant Ninja Turtles 1991 135265915 1990 25398367 2006 5 2377 1990 None 1990-03-30 NaN NaN NaN

In [47]:
dftouse = flattened_df[~flattened_df['review_count'].map(np.isnan)]
dftouse.shape


Out[47]:
(334, 13)

In [48]:
dftouse.head()


Out[48]:
close_year gross open_year opening_gross opening_theaters ranking total_theaters year close_date open_date review_count star_avg positive
title
Pretty Woman 1991 178406268 1990 11280591 1325 4 1811 1990 None 1990-03-23 2 1.000000 0.0
Dick Tracy 1991 103738726 1990 22543911 2332 9 2332 1990 None 1990-06-15 24 8.875000 1.0
Flatliners 1991 61489265 1990 10034685 1319 18 1483 1990 None 1990-08-10 30 6.966667 0.8
Problem Child 1991 53470891 1990 10026900 1714 21 1769 1990 None 1990-07-27 8 2.500000 0.0
Marked for Death 1991 46044396 1990 11790047 1968 27 1974 1990 None 1990-10-05 3 4.000000 0.0

In [49]:
plt.scatter(y=dftouse.opening_gross, x=dftouse.star_avg)


Out[49]:
<matplotlib.collections.PathCollection at 0x120a7c710>
/Users/alpkaancelik/anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  if self._edgecolors == str('face'):

In [50]:
plt.scatter(y=dftouse_four.opening_gross, x=dftouse_four.star_avg,s=dftouse_four.review_count)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-50-113626d5200e> in <module>()
----> 1 plt.scatter(y=dftouse_four.opening_gross, x=dftouse_four.star_avg,s=dftouse_four.review_count)

NameError: name 'dftouse_four' is not defined

In [160]:
plt.scatter(y=dftouse_seven.opening_gross, x=dftouse_seven.star_avg,s=dftouse_seven.review_count)


Out[160]:
<matplotlib.collections.PathCollection at 0x111fa4990>

In [51]:
dftouse_four = flattened_df[flattened_df['star_avg'] <= 4]
dftouse_seven = flattened_df[flattened_df['star_avg'] >= 7]

In [52]:
dftouse_four.head()


Out[52]:
close_year gross open_year opening_gross opening_theaters ranking total_theaters year close_date open_date review_count star_avg positive
title
Pretty Woman 1991 178406268 1990 11280591 1325 4 1811 1990 None 1990-03-23 2 1.0 0
Problem Child 1991 53470891 1990 10026900 1714 21 1769 1990 None 1990-07-27 8 2.5 0
Marked for Death 1991 46044396 1990 11790047 1968 27 1974 1990 None 1990-10-05 3 4.0 0
Air America 1991 31053601 1990 8064480 1902 37 1902 1990 None 1990-08-10 5 2.8 0
Internal Affairs 1991 27734391 1990 5043516 975 43 986 1990 None 1990-01-12 2 3.5 0

In [53]:
bobby_ols = ols('opening_gross ~ star_avg',dftouse).fit()
bobby_ols.summary()


Out[53]:
OLS Regression Results
Dep. Variable: opening_gross R-squared: 0.004
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 1.465
Date: Thu, 03 Dec 2015 Prob (F-statistic): 0.227
Time: 19:13:11 Log-Likelihood: -6007.5
No. Observations: 334 AIC: 1.202e+04
Df Residuals: 332 BIC: 1.203e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.769e+07 1.58e+06 11.170 0.000 1.46e+07 2.08e+07
star_avg -3.764e+05 3.11e+05 -1.210 0.227 -9.88e+05 2.35e+05
Omnibus: 203.186 Durbin-Watson: 1.180
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1582.013
Skew: 2.499 Prob(JB): 0.00
Kurtosis: 12.418 Cond. No. 9.64

In [162]:
bobby_ols = ols('opening_gross ~ star_avg',dftouse_four).fit()
bobby_ols.summary()


Out[162]:
OLS Regression Results
Dep. Variable: opening_gross R-squared: 0.005
Model: OLS Adj. R-squared: 0.000
Method: Least Squares F-statistic: 1.002
Date: Mon, 30 Nov 2015 Prob (F-statistic): 0.318
Time: 22:37:19 Log-Likelihood: -3954.6
No. Observations: 220 AIC: 7913.
Df Residuals: 218 BIC: 7920.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 1.973e+07 3.31e+06 5.959 0.000 1.32e+07 2.63e+07
star_avg -1.297e+06 1.3e+06 -1.001 0.318 -3.85e+06 1.26e+06
Omnibus: 126.033 Durbin-Watson: 1.069
Prob(Omnibus): 0.000 Jarque-Bera (JB): 716.861
Skew: 2.270 Prob(JB): 2.17e-156
Kurtosis: 10.589 Cond. No. 9.19

In [159]:
bobby_ols = ols('opening_gross ~ star_avg',dftouse_seven).fit()
bobby_ols.summary()


Out[159]:
OLS Regression Results
Dep. Variable: opening_gross R-squared: 0.009
Model: OLS Adj. R-squared: -0.003
Method: Least Squares F-statistic: 0.7432
Date: Mon, 30 Nov 2015 Prob (F-statistic): 0.391
Time: 22:34:49 Log-Likelihood: -1553.5
No. Observations: 86 AIC: 3111.
Df Residuals: 84 BIC: 3116.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [95.0% Conf. Int.]
Intercept 3.719e+07 2.64e+07 1.408 0.163 -1.53e+07 8.97e+07
star_avg -2.671e+06 3.1e+06 -0.862 0.391 -8.83e+06 3.49e+06
Omnibus: 81.009 Durbin-Watson: 1.855
Prob(Omnibus): 0.000 Jarque-Bera (JB): 666.603
Skew: 2.982 Prob(JB): 1.77e-145
Kurtosis: 15.266 Cond. No. 123.

ASK ANDREW WHETHER WE NEED TEST/TRAIN FOR OUR REGRESSION

WORK FLOW 0 GOAL: DESCRIPTIVE STATISTICS

  • How many unique movies we have in our IMDB_df
  • how many reviews they have each
  • which years do they represent

WORK FLOW 1 GOAL: MAKE A VALENCE SCORE THAT CLOSELY MATCHES STARS

  1. Create a valence score for each review
    • summation
    • average
    • product
  1a. Check how good our valence score is by testing it against the stars (see the sketch below)
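
As a rough illustration of step 1a, one simple check is the correlation between each candidate valence score and the star ratings. A sketch, assuming valence_df has been built successfully and has valence_sum, valence_avg and stars columns:


In [ ]:
# Sketch: compare candidate valence scores against the star ratings.
# Assumes valence_df has 'valence_sum', 'valence_avg' and 'stars' columns.
for score in ['valence_sum', 'valence_avg']:
    corr = valence_df[score].astype(float).corr(valence_df['stars'].astype(float))
    print "Correlation between %s and stars: %.3f" % (score, corr)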


WORK FLOW 2 GOAL: FIND A REGRESSION MODEL THAT BEST PREDICTS THE FOLLOWING Y VARIABLES

  1. Regress on (see the sketch below):
    • opening_gross (controlled for opening weekend)
    • opening_gross / opening_theaters
    • total_gross
    • total_gross / total_theaters
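
A sketch of what this could look like, looping a simple OLS over each candidate response variable. The per-theater columns are derived here and their names are our own (hypothetical):


In [ ]:
# Sketch: fit a simple OLS of each candidate response variable on star_avg.
# 'opening_per_theater' and 'gross_per_theater' are hypothetical derived columns.
dftouse = dftouse.copy()
dftouse['opening_per_theater'] = 1. * dftouse.opening_gross / dftouse.opening_theaters
dftouse['gross_per_theater'] = 1. * dftouse.gross / dftouse.total_theaters
for target in ['opening_gross', 'opening_per_theater', 'gross', 'gross_per_theater']:
    model = ols('%s ~ star_avg' % target, dftouse).fit()
    print "%s: R-squared = %.4f" % (target, model.rsquared)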

3.

Just realised that I was meant to take notes.

  • try making branches on GitHub: each person should work on their own branch and not push to origin until they are sure what they're doing is correct
  • share the GitHub repo with Andrew

Re: Pretty Woman question

  • it is useful to consider whether polarizing reviews actually predict box office performance

Re: regression

  • if at the end of the day we simply do box office gross vs. sentiment, that is just a correlation, and it is more on the simpler side of things
  • the difficulty should be more commensurate with the pset difficulty
  • we should take it to the next level; a few possibilities (see the sketch after this list):
  • measures of dispersion on the sentiment analysis

  • different types of distributions you have in a given review
  • how frequently are certain words used
  • average length of words
  • average document lengths
  • ratio between high valence words and low valence words
  • we can do different sentiment scores based on different sentiment dictionaries
  • could use a BOW model (have the corpus of words in the reviews) & shrink down for regularization
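
A rough sketch of the dispersion and high/low valence ratio ideas, reusing the happiness dictionary and removeStopWords from earlier (the +1/-1 thresholds here are arbitrary choices for illustration):


In [ ]:
# Sketch: per-review valence dispersion and high/low valence word ratio.
# The +1 / -1 cutoffs for "high" and "low" valence words are arbitrary.
def valenceDispersion(text, valenceDict):
    scores = [valenceDict[w] for w in removeStopWords(text).split() if w in valenceDict]
    if len(scores) == 0:
        return 0.0, np.nan
    high = sum(1 for s in scores if s > 1)
    low = sum(1 for s in scores if s < -1)
    ratio = 1. * high / low if low > 0 else np.nan
    return np.std(scores), ratio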

We can adopt the hypothesis that stars predict box office gross, or the hypothesis that very polarized responses predict box office gross. It's okay to have a dud hypothesis.

We are graded on the effort we put in; low, unimpressive results should not be equated with not having put in much work. We can make up for it via lit reviews or extra research.

BY NEXT MONDAY: ANALYSIS SHOULD BE DONE, AND AT THAT POINT JUST TOUCHING UP THE VISUALIZATIONS

  1. Do not underestimate the time that it takes to make a good screencast, website, and effective visualizations
  • ANDREW IS HUMAN
  • If we put a lot of work into our visualizations, it will go a long way :)
  • put just as much thought into the end result as into the analysis

Analysis


In [55]:
#getting inflation data and creating a new dataframe 
inflation = pd.read_csv("inf.csv")
print inflation


      DATE     VALUE
0   1/1/90  5.397956
1   1/1/91  4.234964
2   1/1/92  3.028820
3   1/1/93  2.951657
4   1/1/94  2.607442
5   1/1/95  2.805420
6   1/1/96  2.931204
7   1/1/97  2.337690
8   1/1/98  1.552279
9   1/1/99  2.188027
10  1/1/00  3.376857
11  1/1/01  2.826171
12  1/1/02  1.586032
13  1/1/03  2.270095
14  1/1/04  2.677237
15  1/1/05  3.392747
16  1/1/06  3.225944
17  1/1/07  2.852672
18  1/1/08  3.839100
19  1/1/09 -0.355546
20  1/1/10  1.640043
21  1/1/11  3.156842
22  1/1/12  2.069337
23  1/1/13  1.464833
24  1/1/14  1.622223

In [90]:
#Creating a dataframe of Cumulative Inflation
years_90 = range(1990,2015)
infdict = {}

infindex = 0
infvalue = 1

testlist = []
for row in inflation.values:
    currentval = 1 + (row[1]/100)
    cuminf = infvalue*currentval
    infdict[years_90[infindex]] = cuminf
    infindex += 1
    infvalue = cuminf
    testlist.append(cuminf)

inframe = pd.DataFrame(data=testlist, index=range(1990,2015))
#infdict exists in case we need it later

In [91]:
inframe


Out[91]:
0
1990 1.053980
1991 1.098615
1992 1.131890
1993 1.165300
1994 1.195684
1995 1.229228
1996 1.265259
1997 1.294837
1998 1.314937
1999 1.343708
2000 1.389083
2001 1.428341
2002 1.450995
2003 1.483934
2004 1.523662
2005 1.575356
2006 1.626176
2007 1.672566
2008 1.736777
2009 1.730602
2010 1.758985
2011 1.814513
2012 1.852062
2013 1.879191
2014 1.909676

In [ ]:
adjframe =
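
The cell above was left unfinished; here is a minimal sketch of how the cumulative inflation factors in inframe could be applied, assuming the goal is to express each movie's gross in 2014 dollars based on its release year (adjframe and the gross_adj column name are our assumption of the intent):


In [ ]:
# Sketch: inflation-adjust gross figures to 2014 dollars using inframe.
# Movies released in 2015 have no factor in inframe and are left unadjusted.
factor_2014 = inframe.loc[2014, 0]

def adjust_gross(row):
    year = int(row['year'])
    if year in inframe.index:
        return row['gross'] * factor_2014 / inframe.loc[year, 0]
    return row['gross']

adjframe = movie_df.copy()
adjframe['gross_adj'] = adjframe.apply(adjust_gross, axis=1)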

Visualization

Milestone 3: Video and finishing touches

Screencast video

Website

Finishing touches


In [ ]: