In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
import json
import statsmodels.api as sm
from statsmodels.formula.api import glm, ols
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import math
For our project we will be using data from 3 different sources.
First we import the requests and BeautifulSoup libraries: requests makes working with HTTP requests easier, and BeautifulSoup makes it easy to turn the returned HTML content into Python data structures.
In [2]:
from bs4 import BeautifulSoup
# The "requests" library makes working with HTTP requests easier
# than the built-in urllib libraries.
import requests
Secondly, we prepare the dataframe movie_df to store the data that we will scrape from BOM (Box Office Mojo). We give this dataframe 9 columns:
* ranking: the ranking of the movie in its release year by gross
* title: movie title
* gross: how much the movie grossed while in theatres
* total_theaters: the total number of theaters that showed this movie
* opening_gross: how much the movie grossed in the opening weekend (Fri-Sun)
* opening_theaters: the total number of theaters that showed this movie in the opening weekend (Fri-Sun)
* open_date: date of opening
* close_date: date of closing
* year: year of release
In [3]:
movie_df = pd.DataFrame(columns=['close_date', 'gross', 'open_date', 'opening_gross', 'opening_theaters','ranking','title','total_theaters','year'])
Now we write a function rowInfoGrabber that we will call in a loop over the table on the BOM webpage to grab each movie's attributes and save them into the corresponding columns of movie_df.
In [4]:
def rowInfoGrabber(r):
    info = []
    # Ranking
    info.append(int(r.find("font").get_text()))
    # Title
    info.append(r.find("a").get_text())
    # Gross
    info.append(int(r.find("td", attrs={"align":"right"}).find("b").get_text().strip("$").replace(",","")))
    '''
    For the next 3 categories we need to deal with the 2000 anomaly "Fantasia", where numbers are missing.
    In this case we have chosen to replace the missing 'N/A' values with the values from 'Final Destination',
    which is right above it in the movie table and differs in gross income by only about $1 million, a small
    difference. See the picture below for a snapshot of the anomaly in the movie table from 2000.
    '''
    # Total number of theaters
    if r.find_all("td", attrs={"align":"right"})[1].find("font").get_text().replace(",","") == 'N/A':
        info.append(2587)
    else:
        info.append(int(r.find_all("td", attrs={"align":"right"})[1].find("font").get_text().replace(",","")))
    # Opening gross
    if r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","") == 'N/A':
        info.append(10015822)
    else:
        info.append(int(r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","")))
    # Opening number of theaters
    if r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","") == 'N/A':
        info.append(2587)
    else:
        info.append(int(r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","")))
    # Date of opening
    info.append(r.find_all("td", attrs={"align":"right"})[4].find("a").get_text())
    # Date of closing: before 2002 the tables did not include a "closing" date. We must account for this.
    if len(r.find_all("td", attrs={"align":"right"})) <= 5:
        info.append('-')
    else:
        info.append(r.find_all("td", attrs={"align":"right"})[5].find("font").get_text())
    return info
This is the image:
In [5]:
fields = ["ranking", "title", "gross", "total_theaters", "opening_gross", "opening_theaters", "open_date", "close_date"]
Finally we're ready to scrape!
Because IMDB was created in 1990, we will scrape BOM back to that year, giving us the past 26 years (1990 - 2015). Also note that because the HTML changed starting in 2001, our scraping will be a little different before and after that year.
In [6]:
%%time
years = [1990 + i for i in range(26)]
for year in years:
    pageText = requests.get("http://www.boxofficemojo.com/yearly/chart/?yr=%(yr)d&p=.htm" % {'yr':year})
    soup = BeautifulSoup(pageText.text, "html.parser")
    movieTable = soup.find("td", attrs={"colspan":"3"})
    movieRows = movieTable.find("table").find_all("tr")[2:102]
    print year
    movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
    year_df = pd.DataFrame(movie_dicts)
    year_df['year'] = year
    movie_df = movie_df.append(year_df, ignore_index=True)
    time.sleep(1)
In [7]:
movie_df.shape
Out[7]:
In [8]:
movie_df.head()
Out[8]:
Because some films do not have a close date, we will have to be careful when handling that column!
In [9]:
# if we decide it's worth just dumping the movies with no close_date, we can use the code below
# movie_df=movie_df[movie_df.close_date != '-'].reset_index(drop=True)
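As a quick check, we can count how many movies still have the '-' placeholder that rowInfoGrabber uses for a missing close date (a small exploratory cell, not part of the original pipeline):
In [ ]:
# number of movies whose close_date is the '-' placeholder
(movie_df.close_date == '-').sum()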
Next, we combine the close_date, open_date and year columns into two time-series columns, close_date and open_date. This will make it easier for us to work with the data in the future.
In [9]:
# splitting the close_date and open_date into their respective month and day
movie_df['close_month'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[:x.find('/')])
movie_df['close_day'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[x.find('/')+1:len(x)])
movie_df['open_month'] = movie_df['open_date'].map(lambda x: x[:x.find('/')])
movie_df['open_day'] = movie_df['open_date'].map(lambda x: x[x.find('/')+1:len(x)])
# dropping the old close_date and open_date
movie_df = movie_df.drop('close_date', 1)
movie_df = movie_df.drop('open_date', 1)
# creating an open_year column as a string (year may be stored as a float after the appends above)
movie_df['open_year'] = movie_df.year.astype(int).astype(str)
# creating a close_year column, by checking whether the close month falls earlier or later than the open month in the year
close_month = movie_df['close_month'].astype(int)
open_month = movie_df['open_month'].astype(int)
year = movie_df['year'].astype(int)
close_year = []
for i in range(0, len(year)):
    if close_month[i] >= open_month[i]:
        close_year.append(year[i])
    else:
        close_year.append(year[i]+1)
movie_df['close_year'] = close_year
movie_df['close_year'] = movie_df['close_year'].astype(str)
In [10]:
movie_df.head()
Out[10]:
Let's take a look at the data now!
In [11]:
movie_df.head()
Out[11]:
Let's see whether we can get the run time for each movie!
In [14]:
# NOTE: close_date/open_date were dropped above and are rebuilt as datetime columns later in the
# notebook; this exploratory cell assumes those datetime columns are present.
run_time = []
for index, row in movie_df.iterrows():
    if row.close_date is not None:
        run_time.append(row['close_date'] - row['open_date'])
    else:
        run_time.append('N/A')
In [13]:
movie_df.head()
Out[13]:
Looks like the data is ready for us to use! Let's save this data so we're ready to use it next time.
In [17]:
! pip install pymongo
In [15]:
# Save the movie Dictionaries corresponding to each row of the BoxOfficeMojo table.
import json
# import pymongo
# from bson import json_util
# Make a dictionary out of the dataset for storage in JSON format.
movieSaved = {feature: movie_df[feature].values.tolist() for feature in movie_df.columns.values}
fp = open("allMovies_new.json","w")
json.dump(movieSaved, fp)
fp.close()
We have cleaned up our IMDB review dataset in the IPython notebook New.ipynb. From that notebook we saved dictionaries of our data, which we will now load.
In [16]:
with open("train_df_dict.json", "r") as fd:
train_df_dict = json.load(fd)
with open("test_df_dict.json", "r") as fd:
test_df_dict = json.load(fd)
In [17]:
train_df = pd.DataFrame(train_df_dict)
test_df = pd.DataFrame(test_df_dict)
The Stanford group distinguishes between train and test because this was relevant for their project. This may prove to be useful later, so we will keep two separate dataframes. However, for our purposes at the moment, we can combine them since the BOM data will serve as our true data set.
In [18]:
IMDB_df = train_df.append(test_df)
We want to figure out which movies from IMDB_df are also present in our BOM dataframe. So let's get all the movie titles in our BOM table.
In [19]:
BOM_movie_list = movie_df.title.values.tolist()
Now let's create a mask over IMDB_df, the boolean values of which indicate whether or not a movie is in the BOM list.
In [20]:
movie_mask = [(movie in BOM_movie_list) for movie in IMDB_df.movie_name]
In [21]:
sum(movie_mask)
Out[21]:
We can now create our final, relevant IMDB data frame, with only those movies that also appear in the BOM tables from 1990 - 2015.
In [22]:
IMDB_dftouse=IMDB_df[movie_mask]
Finally we want to save our dictionary of IMDB_dftouse into a JSON file for storage.
In [23]:
IMDB_dftouse_dict = {feature: IMDB_dftouse[feature].values.tolist() for feature in IMDB_dftouse.columns.values}
fp = open("IMDB_dftouse_dict.json","w")
json.dump(IMDB_dftouse_dict, fp)
fp.close()
In [24]:
# Reopen
with open("IMDB_dftouse_dict.json", "r") as fd:
IMDB_dftouse_dict = json.load(fd)
Now let's download labMT, a word score list for sentiment analysis containing over 10,000 words. For each word the file gives a "happiness" score and a happiness rank, along with the mean and standard deviation of its happiness ratings and Twitter and Google ranks.
In [25]:
url = 'http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0026752.s001'
labmt = pd.read_csv(url, skiprows=2, sep='\t', index_col=0)
In [26]:
labmt.head()
Out[26]:
Now let's create a happiness dictionary of (word, valence) pairs where each valence is that word's original valence minus the average valence.
In [27]:
average = labmt.happiness_average.mean()
happiness = (labmt.happiness_average - average).to_dict()
In [28]:
print "Score(happy): ", happiness['happy']
print "Score(miserable): ", happiness['miserable']
print "Best score: ", max(happiness.values())
print "Worst score: ", min(happiness.values())
In [29]:
# Save to disc
# fp = open("happiness.json","w")
# json.dump(happiness, fp)
# fp.close()
In [30]:
# Reopen
with open("happiness.json", "r") as fp:
happiness = json.load(fp)
Now let's collect several attributes from a given review's text body and save all the valuable information into a new data frame. First we define a function that removes stop words (words that are unimportant from a valence perspective) from a text body.
In [31]:
from sklearn.feature_extraction import text
stopwords = text.ENGLISH_STOP_WORDS
punctuation = list('.,;:!?()[]{}`''\"@#$%^&*+-|-=~_')
def removeStopWords(text, stopwords=stopwords):
    new_text = ""
    for word in text.split():
        if word not in stopwords:
            while len(word) != 0 and word[-1] in punctuation:
                word = word[:len(word)-1]
            new_text += word + ' '
    return new_text
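As a quick sanity check on a made-up sentence (a hypothetical example, not drawn from the review data):
In [ ]:
# stop words should be dropped and trailing punctuation stripped
removeStopWords("the movie was not very good, but the acting was superb!")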
Now we'll write a function that returns total happiness, average happiness, total scorable words, and percentage of scorable words in a given review text.
In [32]:
'''
Name: getValenceInfo()
Inputs: review text, dictionary of happiness
Returns: a 4-tuple of (happiness total, happiness average, total # of scorable words, % of scorable words)
'''
def getValenceInfo(text, valenceDict):
    total_words = len(text.split())
    happiness_total, count_relevant = 0, 0
    for word in text.split():
        if word in valenceDict:
            count_relevant += 1
            happiness_total += valenceDict[word]
    if count_relevant != 0:
        avg_valence = 1.*happiness_total/count_relevant
    else:
        avg_valence = 0
    # guard against an empty review to avoid dividing by zero
    if total_words == 0:
        return 0, 0, 0, 0
    return happiness_total, avg_valence, total_words, 1.*count_relevant/total_words
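For example, on a short cleaned-up snippet (hypothetical input; the numbers returned depend on the labMT scores loaded above):
In [ ]:
sample = removeStopWords("the ending was happy and the acting was wonderful")
getValenceInfo(sample, happiness)  # (happiness total, happiness average, # of words, % scorable)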
Now we'll write a function that, given a data frame, returns a new data frame with the concatenation of valence (happiness) info in 4 new columns: valence sum, valence average, # of scorable words, % of scorable words.
In [33]:
'''
Name: getAllInfo
Input: data frame, happiness dictionary, list of stop words
Returns: a new data frame with 4 new columns: valence_sum, valence_avg, n_scorables, pct_scorables
'''
def getAllInfo(df, valenceDict, stopwords):
    valence_suml, valence_avgl, review_lenl, review_fractionl = [], [], [], []
    for i, row in df.iterrows():
        cleaned_review = removeStopWords(row['text'], stopwords)
        valence_sum, valence_avg, review_len, review_fraction = getValenceInfo(cleaned_review, valenceDict)
        valence_suml.append(valence_sum)
        valence_avgl.append(valence_avg)
        review_lenl.append(review_len)
        review_fractionl.append(review_fraction)
    conc = pd.DataFrame({'valence_sum': valence_suml, 'valence_avg': valence_avgl, 'n_scorables': review_lenl,
                         'pct_scorables': review_fractionl})
    return pd.concat([df, conc], axis=1)
Now let's create a new dataframe valence_df with the valence statistics run on our IMDB_df. This code takes a few minutes to run.
In [34]:
%%time
valence_df = getAllInfo(IMDB_df, happiness, stopwords)
In [36]:
valence_df.head()
In [ ]:
# Convert True/False to 1/0: needed to make valence_df JSON serializable, also better practice
valence_df.positive = 1.0*valence_df.positive
In [ ]:
# Save to disc
# fp = open("valence_df_dict.json","w")
# json.dump(valence_df.to_dict(), fp)
# fp.close()
In [37]:
# Reopen
with open("valence_df_dict.json", "r") as fp:
valence_df_dict = json.load(fp)
valence_df = pd.DataFrame(valence_df_dict)
We're going to reimport our prepared datasets: dictionary.csv (which contains the dictionary from the AFINN dataset) and IMDB_dftouse_dict.json (which contains all the reviews that cross over with our BOM data). We will import these as the pandas dataframes dictionary and IMDB_df respectively.
In [38]:
dictionary = pd.read_csv("dictionary.csv")
with open("IMDB_dftouse_dict.json", "r") as fd:
    IMDB = json.load(fd)
IMDB_df = pd.DataFrame(IMDB)
Now we'll import allMovies_new.json. Remember that we still need to convert some of the columns to the datetime type, so we'll do that afterwards.
In [39]:
with open("allMovies_new.json", "r") as fd:
movie_df = json.load(fd)
movie_df = pd.DataFrame(movie_df)
In [40]:
# making close_date and open_date by concatenating the year, month and day
import datetime
close_date = []
for index, row in movie_df.iterrows():
    if row.close_day != '0':
        close_date.append(datetime.datetime(int(row.close_year), int(row.close_month), int(row.close_day)))
    else:
        close_date.append(None)
movie_df['close_date'] = close_date
movie_df['open_date'] = movie_df.open_year + '-' + movie_df.open_month + '-' + movie_df.open_day
movie_df['open_date'] = pd.to_datetime(movie_df['open_date'])
# dropping unnecessary columns
movie_df = movie_df.drop('close_day', 1)
movie_df = movie_df.drop('close_month', 1)
movie_df = movie_df.drop('open_day', 1)
movie_df = movie_df.drop('open_month', 1)
Great! Let's take a look at all the dataframes we've created.
In [41]:
movie_df
Out[41]:
In [42]:
IMDB_df.head()
Out[42]:
Now we want to make a new dataframe flattened_df that we can use to run our regressions on. This dataframe will include all the columns in movie_df, plus extra columns derived from IMDB_df: the number of reviews (review_count), the average star rating (star_avg), and the mean of the positive label (positive).
In [43]:
# set index to title
indexed_df = movie_df.set_index("title")
In [44]:
# use groupby to get the review_count and the star_avg
gold = IMDB_df.groupby("movie_name")
review_count = gold.movie_name.count()
star_avg = gold.stars.mean()
positive = gold.positive.mean()
In [45]:
# concatenate the two series into our final dataframe flattened_df
flattened_df = pd.concat([indexed_df, review_count], axis=1, join_axes=[indexed_df.index])
flattened_df.rename(columns={'movie_name': 'review_count'}, inplace=True)
flattened_df = pd.concat([flattened_df, star_avg], axis=1, join_axes=[indexed_df.index])
flattened_df.rename(columns={'stars': 'star_avg'}, inplace=True)
flattened_df = pd.concat([flattened_df, positive], axis=1, join_axes=[indexed_df.index])
Let's take a look at what our dataframe looks like now!
In [46]:
flattened_df.head()
Out[46]:
In [47]:
dftouse = flattened_df[~flattened_df['review_count'].map(np.isnan)]
dftouse.shape
Out[47]:
In [48]:
dftouse.head()
Out[48]:
In [49]:
plt.scatter(y=dftouse.opening_gross, x=dftouse.star_avg)
Out[49]:
Now let's look separately at low-rated (star_avg <= 4) and high-rated (star_avg >= 7) movies: we define the two subsets first, then plot opening gross against average stars with marker size proportional to review count.
In [51]:
dftouse_four = flattened_df[flattened_df['star_avg'] <= 4]
dftouse_seven = flattened_df[flattened_df['star_avg'] >= 7]
In [50]:
plt.scatter(y=dftouse_four.opening_gross, x=dftouse_four.star_avg, s=dftouse_four.review_count)
In [160]:
plt.scatter(y=dftouse_seven.opening_gross, x=dftouse_seven.star_avg, s=dftouse_seven.review_count)
Out[160]:
In [52]:
dftouse_four.head()
Out[52]:
In [53]:
bobby_ols = ols('opening_gross ~ star_avg',dftouse).fit()
bobby_ols.summary()
Out[53]:
In [162]:
bobby_ols = ols('opening_gross ~ star_avg',dftouse_four).fit()
bobby_ols.summary()
Out[162]:
In [159]:
bobby_ols = ols('opening_gross ~ star_avg',dftouse_seven).fit()
bobby_ols.summary()
Out[159]:
Notes and next steps:
* Ask Andrew whether we need a test/train split for our regression (a rough hold-out sketch follows this list).
* Work flow 0 goal: descriptive statistics.
* Work flow 1 goal: make a valence score that closely matches stars.
* Stephen's edits: number 2.
* Work flow 2 goal: find a regression model that best predicts the following Y variables.
* 3.
* (Just realised that I was meant to take notes.)
* Re: the Pretty Woman question.
* Re: regression - measures of dispersion on the sentiment analysis.
* We can adopt the hypothesis that stars predict box office gross, or the hypothesis that very polarized responses predict box office gross. It's okay to have a dud hypothesis.
* We are graded on the effort we put in, so low or unimpressive results should not be equated with not having done much work. We can make up for it via lit reviews or extra research.
* By next Monday: the analysis should be done, and at that point we will just be touching up the visualizations.
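The notes above raise the test/train question. Purely as a sketch (not part of the original analysis; the 70/30 split and the random seed are arbitrary choices of ours), one rough way to check how the star_avg regression holds up out of sample:
In [ ]:
# Hypothetical hold-out check of the opening_gross ~ star_avg regression
np.random.seed(0)
mask = np.random.rand(len(dftouse)) < 0.7          # ~70% train / ~30% test
train, test = dftouse[mask], dftouse[~mask]
holdout_ols = ols('opening_gross ~ star_avg', train).fit()
pred = holdout_ols.predict(test)
print "Out-of-sample squared correlation:", np.corrcoef(pred, test.opening_gross)[0, 1]**2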
In [55]:
#getting inflation data and creating a new dataframe
inflation = pd.read_csv("inf.csv")
print inflation
In [90]:
# Creating a dataframe of cumulative inflation
years_90 = range(1990, 2015)
infdict = {}
infindex = 0
infvalue = 1
testlist = []
for row in inflation.values:
    currentval = 1 + (row[1]/100)
    cuminf = infvalue*currentval
    infdict[years_90[infindex]] = cuminf
    infindex += 1
    infvalue = cuminf
    testlist.append(cuminf)
inframe = pd.DataFrame(data=testlist, index=range(1990,2015))
# infdict exists in case we need it later
In [91]:
inframe
Out[91]:
In [ ]:
adjframe =
In [ ]:
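The unfinished adjframe cell above presumably builds an inflation-adjusted version of the movie data. Below is a minimal sketch of one way it might be computed, assuming we want gross figures in constant 2014 dollars; the column names gross_adj and opening_gross_adj and the 2014 base year are our assumptions, not something from the original notebook.
In [ ]:
# Hypothetical sketch: express gross figures in constant 2014 dollars using the
# cumulative inflation factors stored in inframe (indexed by year, single column 0).
adjframe = movie_df.copy()
final_factor = inframe.loc[2014, 0]                          # cumulative inflation through 2014
year_factor = adjframe['year'].astype(int).map(inframe[0])   # factor for each movie's release year
# years not covered by the inflation table (e.g. 2015) will come out as NaN
adjframe['gross_adj'] = adjframe['gross'] * final_factor / year_factor
adjframe['opening_gross_adj'] = adjframe['opening_gross'] * final_factor / year_factor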