Description of the project: This project is figuring out the correlation
between the amount of revenue a movie creates and the month it was released in,
and the correlation betwen the amount of revenue a movie creates and the production cost.
For the size of the data, movies released from 1995-2016 are in the dataset.
The data used in this project is imported from 'the-numbers.com.' This site has the data of all movies with the information of their ['title','release_date','production cost', 'domestic gross', 'worldwide gross']. We will look at domestic gross and worldwide gross separately to avoid bias. Also, the site can create different report with certain standards. EX)Movie released within year(1995-2016), production cost(0-1 mil USD), and releaded month(January). With these inbuilt functions, we create the online table, import the table through read_html, play around with them to get more useful data, and then do the analysis/visulaization.
The reason that we import so many separate tables is for two reasons.
In [1]:
import pandas as pd
import sys
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Below is importing the different tables (according to month).
In [2]:
new_table_1 = pd.read_html("http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/1/1/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc")
new_table_2 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/2/2/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_3 = pd.read_html("http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/3/3/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc")
new_table_4 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/4/4/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_5 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/5/5/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_6 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/6/6/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_7 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/7/7/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_8 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/8/8/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_9 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/9/9/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_10 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/10/10/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_11 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/11/11/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_12 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/12/12/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
In [3]:
#Creating list with all the dataframes
all_of_nt = [new_table_1[0], new_table_2[0], new_table_3[0],new_table_4[0], new_table_5[0],new_table_6[0],new_table_7[0],new_table_8[0],new_table_9[0],new_table_10[0],new_table_11[0],new_table_12[0]]
In [4]:
#Stripping unused parts of the table
for nt in all_of_nt:
nt['DomesticBox Office'] = nt['DomesticBox Office'].str.strip('$')
nt['DomesticBox Office'] = nt['DomesticBox Office'].str.replace(',','')
nt['DomesticBox Office'] = nt['DomesticBox Office'].astype(int)
nt['InternationalBox Office'] = nt['InternationalBox Office'].str.strip('$')
nt['InternationalBox Office'] = nt['InternationalBox Office'].str.replace(',','')
nt['InternationalBox Office'] = nt['InternationalBox Office'].astype(int)
nt['WorldwideBox Office'] = nt['WorldwideBox Office'].str.strip('$')
nt['WorldwideBox Office'] = nt['WorldwideBox Office'].str.replace(',','')
nt['WorldwideBox Office'] = nt['WorldwideBox Office'].astype(int)
In [5]:
avg_dom_rev = []
for nt2 in all_of_nt:
avg_dom_rev.append(nt2['DomesticBox Office'].mean())
dom_rev = pd.DataFrame(avg_dom_rev)
dom_rev = dom_rev.transpose()
dom_rev.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dom_rev.rename(index={0: 'Domestic Revenue'})
Out[5]:
In [6]:
row = dom_rev.iloc[0]
row.plot(kind = 'bar',figsize = (10,5), title = 'Domestic BoxOffice Revenue by month (in 100 million USD)')
Out[6]:
In [7]:
avg_int_rev = []
for nt3 in all_of_nt:
avg_int_rev.append(nt3['InternationalBox Office'].mean())
int_rev = pd.DataFrame(avg_int_rev)
int_rev = int_rev.transpose()
int_rev.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
int_rev.rows = ['International Revenue']
int_rev.rename(index={0: 'International BoxOffice Revenue'})
Out[7]:
In [8]:
row = int_rev.iloc[0]
row.plot(kind = 'bar',figsize = (10,5), title = 'International BoxOffice Revenue by month (in 100 million USD)')
Out[8]:
In [9]:
avg_ww_rev = []
for nt4 in all_of_nt:
avg_ww_rev.append(nt4['WorldwideBox Office'].mean())
ww_rev = pd.DataFrame(avg_ww_rev)
ww_rev = ww_rev.transpose()
ww_rev.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
ww_rev.rows = ['Worldwide Revenue']
ww_rev.rename(index={0: 'Worldwide BoxOffice Revenue'})
Out[9]:
In [10]:
row = int_rev.iloc[0]
row.plot(kind = 'bar',figsize = (10,5), title = 'Worldw BoxOffice Revenue by month (in 100 million USD)')
Out[10]:
Now we import the different tables filtered by the different production costs. [0-1,1-10,11-50,51-120,120-] (mil USD) They are divided into 5 different categories.
In [11]:
prod_table_1 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/1/1995/2016/None/None/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
prod_table_2 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/1/10/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')
prod_table_3 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/10/50/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')
prod_table_4 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/50/120/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')
prod_table_5 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/120/None/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')
In [12]:
#Making list with all the dataframes.
prod = [prod_table_1[0], prod_table_2[0], prod_table_3[0], prod_table_4[0], prod_table_5[0]]
In [13]:
for tab in prod:
tab['DomesticBox Office'] = tab['DomesticBox Office'].str.strip('$')
tab['DomesticBox Office'] = tab['DomesticBox Office'].str.replace(',','')
tab['DomesticBox Office'] = tab['DomesticBox Office'].astype(int)
tab['InternationalBox Office'] = tab['InternationalBox Office'].str.strip('$')
tab['InternationalBox Office'] = tab['InternationalBox Office'].str.replace(',','')
tab['InternationalBox Office'] = tab['InternationalBox Office'].astype(int)
In [14]:
#Making the right format of dataframe.
avg_dom_rev2 = []
for tab2 in prod:
avg_dom_rev2.append(tab2['DomesticBox Office'].mean())
dom_rev2 = pd.DataFrame(avg_dom_rev2)
dom_rev2 = dom_rev2.transpose()
dom_rev2.columns = ['0-1', '1-10', '11-50', '51-120','120-']
dom_rev2.rename(index={0: 'Domestic Revenue'})
Out[14]:
In [15]:
row = dom_rev2.iloc[0]
row.plot(kind = 'bar',figsize = (12,5), title = 'Domestic BoxOffice Revenue by production cost (x axis in 1 mil USD, y axis in 100 million USD)')
Out[15]:
In [16]:
#Making the right form of dataframe.
avg_int_rev2 = []
for tab3 in prod:
avg_int_rev2.append(tab3['InternationalBox Office'].mean())
int_rev2 = pd.DataFrame(avg_int_rev2)
int_rev2 = int_rev2.transpose()
int_rev2.columns = ['0-1', '1-10', '11-50', '51-120','120-']
int_rev2.rename(index={0: 'International Revenue'})
Out[16]:
In [17]:
row = int_rev2.iloc[0]
row.plot(kind = 'bar',figsize = (12,5), title = 'International BoxOffice Revenue by production cost (x axis in 1 mil USD, y axis in 100 million USD)')
Out[17]:
Correlation between released month and BoxOffice revenue For Domestic BoxOffice: May and December had the most revenue followed by June, July and November. For International BoxOffice: May had the most revenue, outstanding over any other month. For Worldwide BoxOffice: May was the best month to release a movie. July was second, followed by November. January and September, generally the beginning of the semester turned out to be the worst time to release a movie.
Correlation between production cost and BoxOffice Revenue For Domestic BoxOffice: There was a decrease from the first index to the second index. A big jump from the second to third index. And then, there is a gradual growth. But if we look at the actual index, the revenue doesn’t increase as substantially as the production costs increase. For International BoxOffice: There is a slight increase from first index to second index. And compared to Domestic BoxOffice, the increase of revenue over the last three indexes are larger.