In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
For our project we will be using data from three different sources.
First we import the requests and BeautifulSoup libraries, which make working with HTTP requests easier and let us parse HTML content into Python data structures.
In [2]:
from bs4 import BeautifulSoup
# The "requests" library makes working with HTTP requests easier
# than the built-in urllib libraries.
import requests
Second, we prepare the dataframe movie_df to store the data that we will scrape from Box Office Mojo (BOM). We give this dataframe 9 columns:
* ranking: the ranking of the movie in its release year by gross
* title: movie title
* gross: how much the movie grossed while in theaters
* total_theaters: the total number of theaters that showed this movie
* opening_gross: how much the movie grossed in its opening weekend (Fri-Sun)
* opening_theaters: the total number of theaters that showed this movie in its opening weekend (Fri-Sun)
* open_date: date of opening
* close_date: date of closing
* year: year of release
In [3]:
movie_df = pd.DataFrame(columns=['close_date', 'gross', 'open_date', 'opening_gross', 'opening_theaters','ranking','title','total_theaters','year'])
Now we write a function rowInfoGrabber that we will call in a loop over the table on the BOM webpage to grab the attributes and save them into the corresponding columns of movie_df.
In [4]:
def rowInfoGrabber(r):
    info = []
    # Ranking
    info.append(int(r.find("font").get_text()))
    # Title
    info.append(r.find("a").get_text())
    # Gross
    info.append(int(r.find("td", attrs={"align": "right"}).find("b").get_text().strip("$").replace(",", "")))
    '''
    For the next 3 categories, we need to deal with the 2000 anomaly "Fantasia", where numbers are missing.
    In this case I have chosen to replace the missing 'N/A' values with the values from 'Final Destination',
    which is right above it in the movie table and differs in gross income by about $1 million, a small
    difference. See the picture below for a snapshot of the anomaly in the movie table from 2000.
    '''
    # Cache the right-aligned cells so we only traverse the row once.
    cells = r.find_all("td", attrs={"align": "right"})
    # Total number of theaters
    if cells[1].find("font").get_text().replace(",", "") == 'N/A':
        info.append(2587)
    else:
        info.append(int(cells[1].find("font").get_text().replace(",", "")))
    # Opening gross
    if cells[2].find("font").get_text().strip("$").replace(",", "") == 'N/A':
        info.append(10015822)
    else:
        info.append(int(cells[2].find("font").get_text().strip("$").replace(",", "")))
    # Opening number of theaters
    if cells[3].find("font").get_text().replace(",", "") == 'N/A':
        info.append(2587)
    else:
        info.append(int(cells[3].find("font").get_text().replace(",", "")))
    # Date of opening
    info.append(cells[4].find("a").get_text())
    # Date of closing: before 2002 the tables didn't have a "closing" date. We must account for this.
    if len(cells) <= 5:
        info.append('-')
    else:
        info.append(cells[5].find("font").get_text())
    return info
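To sanity-check the parser, here is a minimal sketch that runs rowInfoGrabber on a hypothetical HTML row mimicking the structure of a BOM table row (the tags and values below are made up for illustration):
In [ ]:
# Hypothetical row in the shape rowInfoGrabber expects.
sample_row = """
<tr>
  <td><font>1</font></td>
  <td><a>Example Movie</a></td>
  <td align="right"><font><b>$100,000,000</b></font></td>
  <td align="right"><font>3,000</font></td>
  <td align="right"><font>$25,000,000</font></td>
  <td align="right"><font>3,000</font></td>
  <td align="right"><font><a>6/15</a></font></td>
  <td align="right"><font>9/21</font></td>
</tr>
"""
row = BeautifulSoup(sample_row, "html.parser").find("tr")
print(rowInfoGrabber(row))
# expected: [1, 'Example Movie', 100000000, 3000, 25000000, 3000, '6/15', '9/21']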
Below is the referenced snapshot of the 2000 "Fantasia" anomaly in the BOM table:
In [5]:
fields = ["ranking", "title", "gross", "total_theaters", "opening_gross", "opening_theaters", "open_date", "close_date"]
Finally we're ready to scrape!
Because IMDB was created in 1990, we will scrape BOM back to that year, covering the past 26 years (1990-2015). Also note that because the page HTML changed starting in 2001, our scraping has to handle the tables slightly differently before and after that year.
In [6]:
%%time
years = [1990 + i for i in range(26)]
for year in years:
    pageText = requests.get("http://www.boxofficemojo.com/yearly/chart/?yr=%(yr)d&p=.htm" % {'yr': year})
    soup = BeautifulSoup(pageText.text, "html.parser")
    movieTable = soup.find("td", attrs={"colspan": "3"})
    movieRows = movieTable.find("table").find_all("tr")[2:102]
    print(year)
    movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
    year_df = pd.DataFrame(movie_dicts)
    year_df['year'] = year
    movie_df = movie_df.append(year_df, ignore_index=True)
    time.sleep(1)
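One caveat: the loop above assumes every request succeeds; if BOM returned an error page, the find calls would fail with a confusing AttributeError. A defensive variant (a sketch, not what we ran) would check the response before parsing:
In [ ]:
# Illustrative guard: raise_for_status() throws on an HTTP error
# instead of letting us parse an error page.
resp = requests.get("http://www.boxofficemojo.com/yearly/chart/?yr=1990&p=.htm")
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")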
In [7]:
movie_df.shape
Out[7]:
In [8]:
movie_df.head()
Out[8]:
Because some films do not have a close date, we will have to be careful when working with the close_date column!
In [9]:
# if we decide it's worth just dumping the movies with no close_date, we can use the code below
# movie_df=movie_df[movie_df.close_date != '-'].reset_index(drop=True)
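Before deciding, it helps to quantify how many movies are affected (a quick sketch using the '-' placeholder that rowInfoGrabber inserts for missing close dates):
In [ ]:
# Count movies whose close_date is the '-' placeholder.
(movie_df.close_date == '-').sum()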
Next, we combine the close_date, open_date and year columns into two time-series columns, close_date and open_date. This will make it easier for us to work with the data in the future.
In [10]:
# splitting the close_date and open_date into the respective month and day
movie_df['close_month'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[:x.find('/')])
movie_df['close_day'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[x.find('/')+1:len(x)])
movie_df['open_month'] = movie_df['open_date'].map(lambda x: x[:x.find('/')])
movie_df['open_day'] = movie_df['open_date'].map(lambda x: x[x.find('/')+1:len(x)])
# dropping the old close_date and open_date
movie_df = movie_df.drop(['close_date', 'open_date'], axis=1)
# creating an open_year column by turning the year column into a string; the appends left
# year as a float (e.g. '1990.0'), so we keep only the part before the decimal point
movie_df['open_year'] = movie_df.year.astype(str)
movie_df['open_year'] = movie_df.open_year.map(lambda x: x.split('.')[0])
# creating a close_year column by checking whether the close month falls before or after the open month
close_month = movie_df['close_month'].astype(int)
open_month = movie_df['open_month'].astype(int)
year = movie_df['year'].astype(int)
close_year = []
for i in range(len(year)):
    if close_month[i] >= open_month[i]:
        close_year.append(year[i])
    else:
        close_year.append(year[i] + 1)
movie_df['close_year'] = close_year
movie_df['close_year'] = movie_df['close_year'].astype(str)
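As an aside, the same close-year logic can be written without the loop using numpy (a sketch equivalent to the loop above):
In [ ]:
# Vectorized equivalent: if the close month is on or after the open month the
# movie closed in the same year, otherwise in the next year.
close_year_vec = np.where(close_month >= open_month, year, year + 1)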
In [11]:
movie_df.head()
Out[11]:
In [12]:
# making close_date and open_date by concatenating the year, month and day
import datetime
close_date = []
for index, row in movie_df.iterrows():
    if row.close_day != '0':
        close_date.append(datetime.datetime(int(row.close_year), int(row.close_month), int(row.close_day)))
    else:
        close_date.append(None)
movie_df['close_date'] = close_date
movie_df['open_date'] = movie_df.open_year + '-' + movie_df.open_month + '-' + movie_df.open_day
# pd.to_datetime parses the 'YYYY-M-D' strings (pd.datetools.parse has since been removed from pandas)
movie_df['open_date'] = pd.to_datetime(movie_df['open_date'])
In [13]:
# dropping the intermediate columns now that the dates are assembled
movie_df = movie_df.drop(['close_day', 'close_month', 'close_year',
                          'open_day', 'open_month', 'open_year', 'year'], axis=1)
Let's take a look at the data now!
In [14]:
movie_df.head()
Out[14]:
Let's see whether we can compute the run time for each movie!
In [15]:
run_time = []
for index, row in movie_df.iterrows():
    if row.close_date is not None:
        run_time.append(row['close_date'] - row['open_date'])
    else:
        # None becomes NaT, which keeps the column a proper timedelta type
        run_time.append(None)
movie_df['run_time'] = run_time
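Since the new column holds timedeltas, run times in days are one accessor away (a quick sketch; rows with no close_date come out as NaN):
In [ ]:
# Summarize run times in whole days.
movie_df['run_time'].dt.days.describe()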
In [16]:
movie_df.head()
Out[16]:
Looks like the data is ready for us to use! Let's save this data so we're ready to use it next time.
In [17]:
! pip install pymongo
In [18]:
# Save the movie Dictionaries corresponding to each row of the BoxOfficeMojo table.
import json
import pymongo
from bson import json_util
# Make a dictionary out of the dataset for storage in JSON format.
movieSaved = {feature: movie_df[feature].values.tolist() for feature in movie_df.columns.values}
fp = open("allMovies.json","w")
json.dump(movieSaved, fp, default=json_util.default)
fp.close()
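For completeness, here is a sketch of how this JSON could be loaded back into a dataframe later (json_util's object_hook restores the datetime objects that json_util.default serialized above):
In [ ]:
# Reload the saved movie data and rebuild the dataframe.
with open("allMovies.json", "r") as fp:
    movieLoaded = json.load(fp, object_hook=json_util.object_hook)
movie_df_reloaded = pd.DataFrame(movieLoaded)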
We have cleaned up our IMDB review dataset in the IPython notebook New.ipynb. From that notebook we saved dictionaries of our data, which we now load.
In [25]:
with open("train_df_dict.json", "r") as fd:
train_df_dict = json.load(fd)
with open("test_df_dict.json", "r") as fd:
test_df_dict = json.load(fd)
In [26]:
train_df = pd.DataFrame(train_df_dict)
test_df = pd.DataFrame(test_df_dict)
The Stanford group distinguishes between train and test sets because that split was relevant for their project. It may prove useful later, so we will keep the two dataframes separate. For our purposes at the moment, however, we can combine them, since the BOM data will serve as our ground-truth dataset.
In [27]:
IMDB_df = train_df.append(test_df)
We want to figure out which movies from IMDB_df are also present in our BOM dataframe. So let's get all the movie titles in our BOM table.
In [28]:
BOM_movie_list = movie_df.title.values.tolist()
Now let's create a boolean mask over IMDB_df indicating whether each movie is in the BOM list.
In [29]:
movie_mask = [(movie in BOM_movie_list) for movie in IMDB_df.movie_name]
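As an aside, pandas can build the same mask natively, which also scales better on large frames (a sketch equivalent to the list comprehension above):
In [ ]:
# Equivalent vectorized mask; Series.isin hashes the lookup list internally.
movie_mask_alt = IMDB_df.movie_name.isin(BOM_movie_list)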
In [30]:
sum(movie_mask)
Out[30]:
We can now create our final, relevant IMDB dataframe, with only those movies that also appear in the BOM tables from 1990-2015.
In [31]:
IMDB_dftouse=IMDB_df[movie_mask]
Finally we want to save our dictionary of IMDB_dftouse into a JSON file for storage.
In [32]:
IMDB_dftouse_dict = {feature: IMDB_dftouse[feature].values.tolist() for feature in IMDB_dftouse.columns.values}
fp = open("IMDB_dftouse_dict.json","w")
json.dump(IMDB_dftouse_dict, fp)
fp.close()
In [33]:
with open("IMDB_dftouse_dict.json", "r") as fd:
IMDB_dftouse_dict = json.load(fd)
We will read in the AFINN-111 dictionary from the tab-delimited .txt file and save it into a pandas dataframe called dictionary.
In [19]:
words = []
valences = []
with open('AFINN/AFINN-111.txt', 'r') as source:
    for line in source:
        fields = line.split('\t')
        words.append(fields[0])
        valences.append(fields[1])
dictionary = pd.DataFrame({'words': words, 'valences': valences})
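The same file can also be loaded in one call (a sketch; read_csv parses the valences straight to integers, which would make the cleanup step below unnecessary):
In [ ]:
# One-line alternative: AFINN-111 is a two-column, tab-separated file with no header.
dictionary_alt = pd.read_csv('AFINN/AFINN-111.txt', sep='\t', header=None,
                             names=['words', 'valences'])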
Because the valences are read in as strings, we will clean them up and convert them to integers so they are ready for manipulation.
In [20]:
dictionary['valences'] = dictionary['valences'].map(lambda x: x if x.find('\n')== -1 else x[:x.find('\n')])
dictionary['valences']=dictionary['valences'].astype(int)
Let's take a look at the dictionary!
In [21]:
dictionary.head()
Out[21]:
Looks like the dictionary is ready for us to use! Let's save it to a CSV so we can reload it later without rebuilding it.
In [24]:
# The dictionary was saved once with the line below; afterwards we just reload it.
# dictionary.to_csv("dictionary.csv", index=False)
dictionary = pd.read_csv("dictionary.csv")
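To preview how the dictionary will be used, here is a minimal sketch of scoring a piece of text with AFINN valences (the sentence is made up; real reviews come from IMDB_dftouse):
In [ ]:
# Sum the valences of any words that appear in the AFINN dictionary.
afinn = dict(zip(dictionary.words, dictionary.valences))

def afinn_score(text):
    return sum(afinn.get(word, 0) for word in text.lower().split())

afinn_score("this movie was wonderful but the ending was bad")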
In [ ]: