CS109 Final Project Process Book

Background & Motivation

Social media and entertainment are pervasive parts of millennials' lives, and we want to study their intersection. Is it possible to predict a film's box office success from the sentiment expressed on social media? Stay tuned for more!


In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

Milestone 1: Scrape and prepare data before Thanksgiving

For our project we will be using data from 3 different sources

    Box Office Mojo (BOM) (http://www.boxofficemojo.com) is a website that aggregates, in a table, a list of all movies released in a year, with attributes such as how much each grossed in its opening weekend, how much it grossed in total, and how long it aired.
    Large Movie Review Dataset (http://ai.stanford.edu/~amaas/data/sentiment/) is a polarized dataset of movie reviews from IMDB prepared by Maas et al. at Stanford. The dataset contains 25,000 entries in the training set and 25,000 entries in the test set.
    AFINN-111 Dictionary (http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=60100) is a list of 2,477 English words and phrases rated for valence with an integer from -5 to +5, originally prepared by Finn Årup Nielsen.

In this first milestone, we will get all the data into a format that we can start running analysis on!

Scraping and cleaning Box Office Mojo

First we import the requests and BeautifulSoup libraries, which make it easy to work with HTTP requests and to turn HTML content into Python data structures.


In [2]:
from bs4 import BeautifulSoup
# The "requests" library makes working with HTTP requests easier
# than the built-in urllib libraries.
import requests
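
As a quick illustration of the kind of parsing we will do below (using a toy, hand-written HTML snippet, not actual BOM markup), BeautifulSoup lets us pull text out of tags directly:

```python
from bs4 import BeautifulSoup

# A made-up table row in the spirit of a BOM chart entry.
html = '<tr><td><font>1</font></td><td><a href="/movies/?id=homealone.htm">Home Alone</a></td></tr>'
row = BeautifulSoup(html, "html.parser")

print(row.find("font").get_text())  # 1
print(row.find("a").get_text())     # Home Alone
```

The `find` calls return the first matching tag, which is exactly the pattern our scraping function below relies on.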

Next, we prepare the dataframe movie_df to store the data that we will scrape from BOM. We give this dataframe 9 columns:

* ranking: the ranking of the movie in its release year by gross
* title: movie title
* gross: how much the movie grossed while in theatres
* total_theaters: the total number of theaters that showed this movie
* opening_gross: how much the movie grossed in the opening weekend (Fri-Sun)
* opening_theaters: the total number of theaters that showed this movie in the opening weekend (Fri-Sun)
* open_date: date of opening
* close_date: date of closing
* year: year of release

In [3]:
movie_df = pd.DataFrame(columns=['close_date', 'gross', 'open_date', 'opening_gross', 'opening_theaters','ranking','title','total_theaters','year'])

Now we write a function rowInfoGrabber that we will call in a loop over the table on the BOM webpage to grab the attributes and save them into the corresponding columns in movie_df.


In [4]:
def rowInfoGrabber(r):
    info = []
    # Ranking
    info.append(int(r.find("font").get_text()))
    # Title
    info.append(r.find("a").get_text())
    # Gross
    info.append(int(r.find("td", attrs={"align":"right"}).find("b").get_text().strip("$").replace(",","")))
    '''
    For the next 3 categories we need to deal with the 2000 anomaly "Fantasia", which has missing ('N/A')
    numbers. In this case we have chosen to replace the missing values with the values from 'Final
    Destination', which is right above it in the movie table and differs in gross income by about $1 million,
    a small difference. See the picture below for a snapshot of the anomaly in the 2000 movie table.
    '''
    # Total number of theaters
    if r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","") == 'N/A':
        info.append(2587)
    else:
        info.append(int(r.find_all("td",attrs={"align":"right"})[1].find("font").get_text().replace(",","")))
    # Opening Gross
    if r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","") == 'N/A':
        info.append(10015822)
    else: 
        info.append(int(r.find_all("td", attrs={"align":"right"})[2].find("font").get_text().strip("$").replace(",","")))
    # Opening Number of Theaters
    if r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","") == 'N/A':
        info.append(2587)
    else:
        info.append(int(r.find_all("td", attrs={"align":"right"})[3].find("font").get_text().replace(",","")))
    # Date of Opening
    info.append(r.find_all("td", attrs={"align":"right"})[4].find("a").get_text())
    # Date of Closing: Before 2002 they didn't have a "closing" date in their tables. We must account for this.
    if (len(r.find_all("td", attrs={"align":"right"})) <= 5):
        info.append('-')
    else:
        info.append(r.find_all("td", attrs={"align":"right"})[5].find("font").get_text())
    return info

Here is a snapshot of the anomaly in the 2000 table (image omitted in this export).


In [5]:
fields = ["ranking", "title", "gross", "total_theaters", "opening_gross", "opening_theaters", "open_date", "close_date"]

Finally we're ready to scrape!
Because IMDB was created in 1990, we will scrape BOM back to that year, covering 26 years (1990 - 2015). Also note that the page HTML changed in the early 2000s, so our scraping is a little different before and after that point.


In [6]:
%%time
years = [1990 + i for i in range(26)]
for year in years:
    pageText = requests.get("http://www.boxofficemojo.com/yearly/chart/?yr=%(yr)d&p=.htm" % {'yr':year})
    soup = BeautifulSoup(pageText.text, "html.parser")
    movieTable = soup.find("td", attrs={"colspan":"3"})
    movieRows = movieTable.find("table").find_all("tr")[2:102]
    print(year)
    movie_dicts = [dict(zip(fields, rowInfoGrabber(row))) for row in movieRows]
    year_df = pd.DataFrame(movie_dicts)
    year_df['year'] = year
    movie_df = movie_df.append(year_df, ignore_index=True)
    time.sleep(1)


1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
CPU times: user 10.6 s, sys: 171 ms, total: 10.8 s
Wall time: 41.6 s

In [7]:
movie_df.shape


Out[7]:
(2600, 9)

In [8]:
movie_df.head()


Out[8]:
close_date gross open_date opening_gross opening_theaters ranking title total_theaters year
0 - 285761243 11/16 17081997 1202 1 Home Alone 2173 1990
1 - 217631306 7/13 12191540 1101 2 Ghost 1766 1990
2 - 184208848 11/9 598257 14 3 Dances with Wolves 1636 1990
3 - 178406268 3/23 11280591 1325 4 Pretty Woman 1811 1990
4 - 135265915 3/30 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 1990

Because some films do not have a close date, we will have to handle the close_date column carefully!


In [9]:
# if we decide it's worth just dumping the movies with no close_date, we can use the code below
# movie_df=movie_df[movie_df.close_date != '-'].reset_index(drop=True)

Next, we combine the close_date, open_date and year columns into two datetime columns, close_date and open_date. This will make it easier for us to work with the data later.


In [10]:
# splitting the close_date and open_date into the respective month and day
movie_df['close_month'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[:x.find('/')])
movie_df['close_day'] = movie_df['close_date'].map(lambda x: '0' if x=='-' else x[x.find('/')+1:len(x)])
movie_df['open_month'] = movie_df['open_date'].map(lambda x: x[:x.find('/')])
movie_df['open_day'] = movie_df['open_date'].map(lambda x: x[x.find('/')+1:len(x)])

# dropping the old close_date and open_date
movie_df = movie_df.drop('close_date', 1)
movie_df = movie_df.drop('open_date', 1)

# creating an open_year column by turning the year column into a string and stripping any trailing ".0"
# (the year column may have been upcast to float during the append above)
movie_df['open_year'] = movie_df.year.astype(str)
movie_df['open_year'] = movie_df.open_year.map(lambda x: x.split('.')[0])

# creating a close_year column, by looking at whether the close month is earlier/later than the open month in the year
close_month = movie_df['close_month'].astype(int)
open_month = movie_df['open_month'].astype(int)
year = movie_df['year'].astype(int)
close_year=[]
for i in range(len(year)):
    if close_month[i] >= open_month[i]:
        close_year.append(year[i])
    else:
        close_year.append(year[i]+1) 
movie_df['close_year'] = close_year
movie_df['close_year'] = movie_df['close_year'].astype(str)

In [11]:
movie_df.head()


Out[11]:
gross opening_gross opening_theaters ranking title total_theaters year close_month close_day open_month open_day open_year close_year
0 285761243 17081997 1202 1 Home Alone 2173 1990 0 0 11 16 1990 1991
1 217631306 12191540 1101 2 Ghost 1766 1990 0 0 7 13 1990 1991
2 184208848 598257 14 3 Dances with Wolves 1636 1990 0 0 11 9 1990 1991
3 178406268 11280591 1325 4 Pretty Woman 1811 1990 0 0 3 23 1990 1991
4 135265915 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 1990 0 0 3 30 1990 1991

In [12]:
# making close_date and open_date by concatenating the year, month and day
import datetime
close_date = []
for index, row in movie_df.iterrows():
    if row.close_day != '0':
        close_date.append(datetime.datetime(int(row.close_year), int(row.close_month), int(row.close_day)))
    else: 
        close_date.append(None)
movie_df['close_date'] = close_date

movie_df['open_date']=movie_df.open_year + '-' + movie_df.open_month + '-' + movie_df.open_day
movie_df['open_date'] = pd.to_datetime(movie_df['open_date'])

In [13]:
# dropping unnecessary columns
movie_df = movie_df.drop('close_day', 1)
movie_df = movie_df.drop('close_month', 1)
movie_df = movie_df.drop('close_year', 1)
movie_df = movie_df.drop('open_day', 1)
movie_df = movie_df.drop('open_month', 1)
movie_df = movie_df.drop('open_year', 1)
movie_df = movie_df.drop('year', 1)

Let's take a look at the data, now!


In [14]:
movie_df.head()


Out[14]:
gross opening_gross opening_theaters ranking title total_theaters close_date open_date
0 285761243 17081997 1202 1 Home Alone 2173 None 1990-11-16
1 217631306 12191540 1101 2 Ghost 1766 None 1990-07-13
2 184208848 598257 14 3 Dances with Wolves 1636 None 1990-11-09
3 178406268 11280591 1325 4 Pretty Woman 1811 None 1990-03-23
4 135265915 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 None 1990-03-30

Let's see whether we can compute the run time of each movie!


In [15]:
run_time = []
for index, row in movie_df.iterrows():
    if row.close_date is not None:
        run_time.append(row['close_date'] - row['open_date'])
    else:
        run_time.append(None)  # no close date, so the run time is unknown

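Not run against the real frame here, but as a sketch of how those Timedelta values could later be reduced to integer day counts, with missing close dates propagating as NaT/NaN (the dates below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "open_date":  pd.to_datetime(["1990-11-16", "1990-07-13"]),
    "close_date": pd.to_datetime(["1991-02-01", None]),  # None becomes NaT
})
# Subtracting two datetime columns gives Timedeltas; .dt.days yields
# floats, with NaN wherever the close date was missing.
df["run_time_days"] = (df["close_date"] - df["open_date"]).dt.days
print(df["run_time_days"].tolist())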
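Not run against the real frame here, but as a sketch of how those Timedelta values could later be reduced to integer day counts, with missing close dates propagating as NaT/NaN (the dates below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "open_date":  pd.to_datetime(["1990-11-16", "1990-07-13"]),
    "close_date": pd.to_datetime(["1991-02-01", None]),  # None becomes NaT
})
# Subtracting two datetime columns gives Timedeltas; .dt.days yields
# floats, with NaN wherever the close date was missing.
df["run_time_days"] = (df["close_date"] - df["open_date"]).dt.days
print(df["run_time_days"].tolist())
```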
In [16]:
movie_df.head()


Out[16]:
gross opening_gross opening_theaters ranking title total_theaters close_date open_date
0 285761243 17081997 1202 1 Home Alone 2173 None 1990-11-16
1 217631306 12191540 1101 2 Ghost 1766 None 1990-07-13
2 184208848 598257 14 3 Dances with Wolves 1636 None 1990-11-09
3 178406268 11280591 1325 4 Pretty Woman 1811 None 1990-03-23
4 135265915 25398367 2006 5 Teenage Mutant Ninja Turtles 2377 None 1990-03-30

Looks like the data is ready for us to use! Let's save this data so we're ready to use it next time.


In [17]:
! pip install pymongo


Requirement already satisfied (use --upgrade to upgrade): pymongo in /anaconda/lib/python2.7/site-packages

In [18]:
# Save the movie Dictionaries corresponding to each row of the BoxOfficeMojo table.
import json
import pymongo
from bson import json_util

# Make a dictionary out of the dataset for storage in JSON format.
movieSaved = {feature: movie_df[feature].values.tolist() for feature in movie_df.columns.values}
fp = open("allMovies.json","w")
json.dump(movieSaved, fp, default=json_util.default)
fp.close()
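
To sanity-check the column-wise layout we just wrote, here is a round-trip on a toy frame (plain types only; the real dump needs json_util.default to handle the datetime columns):

```python
import json
import pandas as pd

toy_df = pd.DataFrame({"title": ["Home Alone", "Ghost"],
                       "gross": [285761243, 217631306]})
# Same {column: list-of-values} layout used for allMovies.json above.
saved = {feature: toy_df[feature].values.tolist() for feature in toy_df.columns.values}

restored = pd.DataFrame(json.loads(json.dumps(saved)))
print(restored["gross"].tolist())  # [285761243, 217631306]
```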

Loading and preparing IMDB review dataset

We cleaned up our IMDB review dataset in the IPython notebook New.ipynb. From that notebook we saved dictionaries of our data, which we now load.


In [25]:
with open("train_df_dict.json", "r") as fd:
    train_df_dict = json.load(fd)
with open("test_df_dict.json", "r") as fd:
    test_df_dict = json.load(fd)

In [26]:
train_df = pd.DataFrame(train_df_dict)
test_df = pd.DataFrame(test_df_dict)

The Stanford group distinguishes between train and test because this was relevant for their project. This may prove to be useful later, so we will keep two separate dataframes. However, for our purposes at the moment, we can combine them since the BOM data will serve as our true data set.


In [27]:
IMDB_df = train_df.append(test_df)

We want to figure out which movies from IMDB_df are also present in our BOM dataframe. So let's get all the movie titles from our BOM table.


In [28]:
BOM_movie_list = movie_df.title.values.tolist()

Now let's create a mask over IMDB_df, the boolean values of which indicate whether or not a movie is in the BOM list.


In [29]:
movie_mask = [(movie in BOM_movie_list) for movie in IMDB_df.movie_name]

In [30]:
sum(movie_mask)


Out[30]:
4111

We can now create our final, relevant IMDB dataframe, with only those movies that also appear in the BOM tables from 1990 - 2015.


In [31]:
IMDB_dftouse=IMDB_df[movie_mask]

Finally we want to save our dictionary of IMDB_dftouse into a JSON file for storage.


In [32]:
IMDB_dftouse_dict = {feature: IMDB_dftouse[feature].values.tolist() for feature in IMDB_dftouse.columns.values}

fp = open("IMDB_dftouse_dict.json","w")
json.dump(IMDB_dftouse_dict, fp)
fp.close()

In [33]:
with open("IMDB_dftouse_dict.json", "r") as fd:
    IMDB_dftouse_dict = json.load(fd)

Loading and preparing AFINN dictionary

We will read in the AFINN-111 dictionary from the tab-delimited .txt file and save it into a pandas dataframe called dictionary.


In [19]:
words = []
valences = []

with open('AFINN/AFINN-111.txt','rb') as source:
    for line in source:
        fields = line.split('\t')
        words.append(fields[0])
        valences.append(fields[1])
dictionary = pd.DataFrame({'words': words, 'valences':valences})

Because the valences are read in as strings, we clean them up and convert them to ints so they are ready for manipulation.


In [20]:
dictionary['valences'] = dictionary['valences'].map(lambda x: x if x.find('\n')== -1 else x[:x.find('\n')])
dictionary['valences']=dictionary['valences'].astype(int)

Let's take a look at the dictionary!


In [21]:
dictionary.head()


Out[21]:
valences words
0 -2 abandon
1 -2 abandoned
2 -2 abandons
3 -2 abducted
4 -2 abduction
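
As a preview of how this dictionary might be applied later (a sketch using a tiny stand-in frame and naive whitespace tokenization, not our final scoring method):

```python
import pandas as pd

# Tiny stand-in for the AFINN frame loaded above.
dictionary = pd.DataFrame({"words": ["abandon", "great", "terrible"],
                           "valences": [-2, 3, -3]})
valence_map = dict(zip(dictionary["words"], dictionary["valences"]))

# Score a review by summing the valences of its known words.
review = "a great film with a terrible ending"
sentiment = sum(valence_map.get(w, 0) for w in review.lower().split())
print(sentiment)  # 3 + (-3) = 0
```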

Looks like the dictionary is ready for us to use! Let's save a copy so we have it on hand in case anything happens to the original file.


In [24]:
# dictionary.to_csv("dictionary.csv", index=False)
dictionary=pd.read_csv("dictionary.csv")

Milestone 2: Analysing and visualizing the data

Descriptive statistics

Analysis

Visualization

Milestone 3: Video and finishing touches

Screencast video

Website

Finishing touches


In [ ]: