**Data Bootcamp Final Project**

**Success rate of movies: released date and production cost**


Name: Jaehurn Nam
Net ID: jn1402
N#: N10448338

Description of the project: This project is figuring out the correlation
between the amount of revenue a movie creates and the month it was released in,
and the correlation betwen the amount of revenue a movie creates and the production cost.
For the size of the data, movies released from 1995-2016 are in the dataset.

The data used in this project is imported from 'the-numbers.com.' This site has the data of all movies with the information of their ['title','release_date','production cost', 'domestic gross', 'worldwide gross']. We will look at domestic gross and worldwide gross separately to avoid bias. Also, the site can create different report with certain standards. EX)Movie released within year(1995-2016), production cost(0-1 mil USD), and releaded month(January). With these inbuilt functions, we create the online table, import the table through read_html, play around with them to get more useful data, and then do the analysis/visulaization.

The reason that we import so many separate tables is for two reasons.

  1. If we import the whole dataset table, the table size is too large, that jupyter notebook lacks for minutes, and becomes time-inefficient.
  2. The ready-filtered tables have the same sizes (100 movies in each table). So, as we have the same size of data size for each table, there would be less bias.

In [1]:
import pandas as pd
import sys
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

Below is importing the different tables (according to month).


In [2]:
new_table_1 = pd.read_html("http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/1/1/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc")
new_table_2 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/2/2/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_3 = pd.read_html("http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/3/3/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc")
new_table_4 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/4/4/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_5 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/5/5/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_6 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/6/6/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_7 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/7/7/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_8 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/8/8/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_9 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/9/9/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_10 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/10/10/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_11 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/11/11/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
new_table_12 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/None/1995/2016/12/12/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')

In [3]:
#Creating list with all the dataframes
all_of_nt = [new_table_1[0], new_table_2[0], new_table_3[0],new_table_4[0], new_table_5[0],new_table_6[0],new_table_7[0],new_table_8[0],new_table_9[0],new_table_10[0],new_table_11[0],new_table_12[0]]

In [4]:
#Stripping unused parts of the table
for nt in all_of_nt:
    nt['DomesticBox Office'] = nt['DomesticBox Office'].str.strip('$')
    nt['DomesticBox Office'] = nt['DomesticBox Office'].str.replace(',','')
    nt['DomesticBox Office'] = nt['DomesticBox Office'].astype(int)
    
    nt['InternationalBox Office'] = nt['InternationalBox Office'].str.strip('$')
    nt['InternationalBox Office'] = nt['InternationalBox Office'].str.replace(',','')
    nt['InternationalBox Office'] = nt['InternationalBox Office'].astype(int)
    
    nt['WorldwideBox Office'] = nt['WorldwideBox Office'].str.strip('$')
    nt['WorldwideBox Office'] = nt['WorldwideBox Office'].str.replace(',','')
    nt['WorldwideBox Office'] = nt['WorldwideBox Office'].astype(int)

In [5]:
avg_dom_rev = []
for nt2 in all_of_nt:
    avg_dom_rev.append(nt2['DomesticBox Office'].mean())

dom_rev = pd.DataFrame(avg_dom_rev)
dom_rev = dom_rev.transpose()
dom_rev.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
dom_rev.rename(index={0: 'Domestic Revenue'})


Out[5]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Domestic Revenue 54280896.5 80164840.07 1.040786e+08 75980992.46 1.873748e+08 1.693207e+08 1.639872e+08 94970377.32 65019217.97 77963171.19 1.576578e+08 1.683810e+08

In [6]:
row = dom_rev.iloc[0]
row.plot(kind = 'bar',figsize = (10,5), title = 'Domestic BoxOffice Revenue by month (in 100 million USD)')


Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x119501fd0>

In [7]:
avg_int_rev = []
for nt3 in all_of_nt:
    avg_int_rev.append(nt3['InternationalBox Office'].mean())

int_rev = pd.DataFrame(avg_int_rev)
int_rev = int_rev.transpose()
int_rev.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
int_rev.rows = ['International Revenue']
int_rev.rename(index={0: 'International BoxOffice Revenue'})


Out[7]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
International BoxOffice Revenue 42243312.24 73733818.44 1.087151e+08 83268876.47 2.937225e+08 218412599.4 2.315711e+08 86644519.38 64704167.31 85337969.34 2.296659e+08 2.247314e+08

In [8]:
row = int_rev.iloc[0]
row.plot(kind = 'bar',figsize = (10,5), title = 'International BoxOffice Revenue by month (in 100 million USD)')


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1187d4358>

In [9]:
avg_ww_rev = []
for nt4 in all_of_nt:
    avg_ww_rev.append(nt4['WorldwideBox Office'].mean())

ww_rev = pd.DataFrame(avg_ww_rev)
ww_rev = ww_rev.transpose()
ww_rev.columns = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
ww_rev.rows = ['Worldwide Revenue']
ww_rev.rename(index={0: 'Worldwide BoxOffice Revenue'})


Out[9]:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Worldwide BoxOffice Revenue 96524208.74 1.538987e+08 2.127937e+08 1.592499e+08 4.810973e+08 3.877333e+08 3.955583e+08 181614896.7 1.297234e+08 1.633011e+08 3.873237e+08 3.931124e+08

In [10]:
row = int_rev.iloc[0]
row.plot(kind = 'bar',figsize = (10,5), title = 'Worldw BoxOffice Revenue by month (in 100 million USD)')


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1186cb4e0>

Now we import the different tables filtered by the different production costs. [0-1,1-10,11-50,51-120,120-] (mil USD) They are divided into 5 different categories.


In [11]:
prod_table_1 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/None/1/1995/2016/None/None/None/None?show-release-date=On&show-domestic-box-office=On&show-international-box-office=On&show-worldwide-box-office=On&view-order-by=domestic-box-office&view-order-direction=desc')
prod_table_2 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/1/10/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')
prod_table_3 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/10/50/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')
prod_table_4 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/50/120/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')
prod_table_5 = pd.read_html('http://www.the-numbers.com/movies/report/All/All/All/All/All/All/All/All/All/120/None/1995/2016/None/None/None/None?show-release-date=On&view-order-by=domestic-box-office&view-order-direction=desc&show-domestic-box-office=On&show-international-box-office=On')

In [12]:
#Making list with all the dataframes.
prod = [prod_table_1[0], prod_table_2[0], prod_table_3[0], prod_table_4[0], prod_table_5[0]]

In [13]:
for tab in prod:
    tab['DomesticBox Office'] = tab['DomesticBox Office'].str.strip('$')
    tab['DomesticBox Office'] = tab['DomesticBox Office'].str.replace(',','')
    tab['DomesticBox Office'] = tab['DomesticBox Office'].astype(int)
    
    tab['InternationalBox Office'] = tab['InternationalBox Office'].str.strip('$')
    tab['InternationalBox Office'] = tab['InternationalBox Office'].str.replace(',','')
    tab['InternationalBox Office'] = tab['InternationalBox Office'].astype(int)

In [14]:
#Making the right format of dataframe.
avg_dom_rev2 = []
for tab2 in prod:
    avg_dom_rev2.append(tab2['DomesticBox Office'].mean())
    
dom_rev2 = pd.DataFrame(avg_dom_rev2)
dom_rev2 = dom_rev2.transpose()
dom_rev2.columns = ['0-1', '1-10', '11-50', '51-120','120-']
dom_rev2.rename(index={0: 'Domestic Revenue'})


Out[14]:
0-1 1-10 11-50 51-120 120-
Domestic Revenue 55006338.88 53659767.65 1.487027e+08 2.235013e+08 3.083216e+08

In [15]:
row = dom_rev2.iloc[0]
row.plot(kind = 'bar',figsize = (12,5), title = 'Domestic BoxOffice Revenue by production cost (x axis in 1 mil USD, y axis in 100 million USD)')


Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x118aae7f0>

In [16]:
#Making the right form of dataframe.

avg_int_rev2 = []
for tab3 in prod:
    avg_int_rev2.append(tab3['InternationalBox Office'].mean())
    
int_rev2 = pd.DataFrame(avg_int_rev2)
int_rev2 = int_rev2.transpose()
int_rev2.columns = ['0-1', '1-10', '11-50', '51-120','120-']
int_rev2.rename(index={0: 'International Revenue'})


Out[16]:
0-1 1-10 11-50 51-120 120-
International Revenue 34347373.45 35378870.55 1.181624e+08 286068538.1 5.318220e+08

In [17]:
row = int_rev2.iloc[0]
row.plot(kind = 'bar',figsize = (12,5), title = 'International BoxOffice Revenue by production cost (x axis in 1 mil USD, y axis in 100 million USD)')


Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1187d75c0>

Short Concluding Remarks

Correlation between released month and BoxOffice revenue For Domestic BoxOffice: May and December had the most revenue followed by June, July and November. For International BoxOffice: May had the most revenue, outstanding over any other month. For Worldwide BoxOffice: May was the best month to release a movie. July was second, followed by November. January and September, generally the beginning of the semester turned out to be the worst time to release a movie.

Correlation between production cost and BoxOffice Revenue For Domestic BoxOffice: There was a decrease from the first index to the second index. A big jump from the second to third index. And then, there is a gradual growth. But if we look at the actual index, the revenue doesn’t increase as substantially as the production costs increase. For International BoxOffice: There is a slight increase from first index to second index. And compared to Domestic BoxOffice, the increase of revenue over the last three indexes are larger.

Thank you!