What Makes a Movie Profitable? Indicators of Box Office Success

Solana San Pietro and Anthony Eid

Data Bootcamp Spring 2017 UG Section Final Project

Background

In 2016, the global film industry generated 38 billion US dollars, and revenues are projected to increase to nearly 50 billion US dollars by 2020. However, although a highly profitable industry as a whole, every movie produced also requires a high level of investment from studios, producers, etc. with little guarantee that there will be a return on investment. A movie's profitability is increasingly elusive with the shift in consumer behavior towards media consumption and rising marketing costs.

The unpredicatability of a movie's profitability was highlighted recently with the failure of the movie "The Great Wall". Financed through a joint partnership between the US and China, the movie was expected to be a huge success. Nonetheless, the movie only made 34.8 million US dollars in the North American box office, in comparison with a production budget of 150 million US dollars.

On the other hand, Amazon has been investing over 3 billion US dollars annually in original content with various success stories such as their investment in "Manchester in the Sea", which recieved six Oscar nominations in 2017.

The profitability of an individual film is highly unpredictable, but industry investors would be put at ease if there could be some understanding of whether their investments will be worthwhile. Therefore, we want to look at the best indicators to predict a film’s success. We define a film's success by its domestic box office revenue.

Data

We acquired our data from the following website: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

The "IMDB 5000 Movie Set" contains movie data for over 5000 movies made in the past 100 years. The information was scraped from IMDB's database, and we were able to download it as a CSV file.


In [190]:
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import time
import seaborn as sns
from pandas_datareader import wb, data as web

# plotly imports
from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)

%matplotlib inline
plotly.offline.init_notebook_mode(connected=True)

print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Today: ', dt.date.today())

csv = "/Users/AnthonyEid/Desktop/movie_metadata.csv"

df = pd.read_csv(csv)
df.head()


Python version: 3.5.2 |Anaconda custom (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
Pandas version:  0.18.1
Matplotlib version:  1.5.3
Today:  2017-05-12
Out[190]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

The "IMDB 5000" dataset included more information than we thought would be necesary or relevant to understanding what factors impact if a movie generates high revenue so we dropped these columns, only keeping data on the following:

Director
Gross
Genre
Actor
Movie Name
Country
Rating 
Budget
Year
IMBD Score

In [191]:
df.drop(df.columns[[0,2,3,4,5,7,13,15,16,17,18,19,24,26,27]], axis=1, inplace=True)

In [192]:
df.columns = ['Director','Actor 1','Gross','Genre','Actor 2','Movie Name','Number of Votes','Actor 3','Country', 'Rating', 'Budget', 'Year', 'IMDB Score']

In [193]:
df.mean()


Out[193]:
Gross              4.846841e+07
Number of Votes    8.366816e+04
Budget             3.975262e+07
Year               2.002471e+03
IMDB Score         6.442138e+00
dtype: float64

In [194]:
print('Variable dtypes:\n', df.dtypes, sep='')
df.head()


Variable dtypes:
Director            object
Actor 1             object
Gross              float64
Genre               object
Actor 2             object
Movie Name          object
Number of Votes      int64
Actor 3             object
Country             object
Rating              object
Budget             float64
Year               float64
IMDB Score         float64
dtype: object
Out[194]:
Director Actor 1 Gross Genre Actor 2 Movie Name Number of Votes Actor 3 Country Rating Budget Year IMDB Score
0 James Cameron Joel David Moore 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 Wes Studi USA PG-13 237000000.0 2009.0 7.9
1 Gore Verbinski Orlando Bloom 309404152.0 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 Jack Davenport USA PG-13 300000000.0 2007.0 7.1
2 Sam Mendes Rory Kinnear 200074175.0 Action|Adventure|Thriller Christoph Waltz Spectre 275868 Stephanie Sigman UK PG-13 245000000.0 2015.0 6.8
3 Christopher Nolan Christian Bale 448130642.0 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 Joseph Gordon-Levitt USA PG-13 250000000.0 2012.0 8.5
4 Doug Walker Rob Walker NaN Documentary Doug Walker Star Wars: Episode VII - The Force Awakens  ... 8 NaN NaN NaN NaN NaN 7.1

In [195]:
df = df.sort_values('Gross', ascending=False).iloc[:6000]
df.head(5)


Out[195]:
Director Actor 1 Gross Genre Actor 2 Movie Name Number of Votes Actor 3 Country Rating Budget Year IMDB Score
0 James Cameron Joel David Moore 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 Wes Studi USA PG-13 237000000.0 2009.0 7.9
26 James Cameron Kate Winslet 658672302.0 Drama|Romance Leonardo DiCaprio Titanic 793059 Gloria Stuart USA PG-13 200000000.0 1997.0 7.7
29 Colin Trevorrow Judy Greer 652177271.0 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard Jurassic World 418214 Omar Sy USA PG-13 150000000.0 2015.0 7.0
794 Joss Whedon Robert Downey Jr. 623279547.0 Action|Adventure|Sci-Fi Chris Hemsworth The Avengers 995415 Scarlett Johansson USA PG-13 220000000.0 2012.0 8.1
17 Joss Whedon Robert Downey Jr. 623279547.0 Action|Adventure|Sci-Fi Chris Hemsworth The Avengers 995415 Scarlett Johansson USA PG-13 220000000.0 2012.0 8.1

To further clean up our data we set the index and removed duplicates.


In [196]:
df = df.drop_duplicates('Movie Name')
df = df[pd.notnull(df['Gross']) & pd.notnull(df['Budget'])]

df = df.set_index('Movie Name')
df.head()


Out[196]:
Director Actor 1 Gross Genre Actor 2 Number of Votes Actor 3 Country Rating Budget Year IMDB Score
Movie Name
Avatar James Cameron Joel David Moore 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 237000000.0 2009.0 7.9
Titanic James Cameron Kate Winslet 658672302.0 Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 200000000.0 1997.0 7.7
Jurassic World Colin Trevorrow Judy Greer 652177271.0 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 150000000.0 2015.0 7.0
The Avengers Joss Whedon Robert Downey Jr. 623279547.0 Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 220000000.0 2012.0 8.1
The Dark Knight Christopher Nolan Heath Ledger 533316061.0 Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 185000000.0 2008.0 9.0

In [197]:
df['Year'] = df['Year'].astype(int) #change year to integer to get rid of decimal

print('Variable dtypes:\n', df.dtypes, sep='')
df.head()


Variable dtypes:
Director            object
Actor 1             object
Gross              float64
Genre               object
Actor 2             object
Number of Votes      int64
Actor 3             object
Country             object
Rating              object
Budget             float64
Year                 int64
IMDB Score         float64
dtype: object
Out[197]:
Director Actor 1 Gross Genre Actor 2 Number of Votes Actor 3 Country Rating Budget Year IMDB Score
Movie Name
Avatar James Cameron Joel David Moore 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 237000000.0 2009 7.9
Titanic James Cameron Kate Winslet 658672302.0 Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 200000000.0 1997 7.7
Jurassic World Colin Trevorrow Judy Greer 652177271.0 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 150000000.0 2015 7.0
The Avengers Joss Whedon Robert Downey Jr. 623279547.0 Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 220000000.0 2012 8.1
The Dark Knight Christopher Nolan Heath Ledger 533316061.0 Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 185000000.0 2008 9.0

We also noticed the dataset included movies from countries all over the world, but many of these international moives, particularly ones from smaller countries, did not adjust the budget to be in US dollars. For example,the South Korean movie "Lady Vengenace" had a budget of over 4 billion. However, this number is only so large because this is the value of their budget in Korean won.

To adjust for this problem we simply chose to only use data from the US and UK movie industries. Although the UK is a foreign country, many huge Hollywood movies, such as the Bond movies and Harry Potter movies, are listed as UK movies so it is important that we include these films. Furthermore, the difference between dollar and pound tend to not be as dramatic like in the case of South Korea.


In [198]:
df_bybudget = df.sort_values('Budget', ascending=False).iloc[:6000]
df_bybudget.head(10)


Out[198]:
Director Actor 1 Gross Genre Actor 2 Number of Votes Actor 3 Country Rating Budget Year IMDB Score
Movie Name
Lady Vengeance Chan-wook Park Yeong-ae Lee 211667.0 Crime|Drama Min-sik Choi 53508 Hye-jeong Kang South Korea R 4.200000e+09 2005 7.7
Fateless Lajos Koltai Péter Fancsikai 195888.0 Drama|Romance|War Marcell Nagy 5603 Bálint Péntek Hungary R 2.500000e+09 2005 7.1
Princess Mononoke Hayao Miyazaki Jada Pinkett Smith 2298191.0 Adventure|Animation|Fantasy Minnie Driver 221552 Billy Crudup Japan PG-13 2.400000e+09 1997 8.4
Steamboy Katsuhiro Ôtomo Robin Atkin Downes 410388.0 Action|Adventure|Animation|Family|Sci-Fi|Thriller William Hootkins 13727 Rosalind Ayres Japan PG-13 2.127520e+09 2004 6.9
Akira Katsuhiro Ôtomo Takeshi Kusao 439162.0 Action|Animation|Sci-Fi Mitsuo Iwata 106160 Tesshô Genda Japan R 1.100000e+09 1988 8.1
Godzilla 2000 Takao Okawara Naomi Nishida 10037390.0 Action|Adventure|Drama|Sci-Fi|Thriller Hiroshi Abe 5442 Sakae Kimura Japan PG 1.000000e+09 1999 6.0
Kabhi Alvida Naa Kehna Karan Johar John Abraham 3275443.0 Drama Shah Rukh Khan 13998 Preity Zinta India R 7.000000e+08 2006 6.0
Tango Carlos Saura Juan Luis Galiardo 1687311.0 Drama|Musical Mía Maestro 2412 Miguel Ángel Solá Spain PG-13 7.000000e+08 1998 7.2
Kites Anurag Basu Steven Michael Quezada 1602466.0 Action|Drama|Romance|Thriller Bárbara Mori 9673 Kabir Bedi India NaN 6.000000e+08 2010 6.0
Red Cliff John Woo Tony Chiu Wai Leung 626809.0 Action|Adventure|Drama|History|War Takeshi Kaneshiro 36894 Wei Zhao China R 5.536320e+08 2008 7.4

In [199]:
vlist = ['USA', 'UK']
df = df[df['Country'].isin(vlist)]
df.head(5)


Out[199]:
Director Actor 1 Gross Genre Actor 2 Number of Votes Actor 3 Country Rating Budget Year IMDB Score
Movie Name
Avatar James Cameron Joel David Moore 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 237000000.0 2009 7.9
Titanic James Cameron Kate Winslet 658672302.0 Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 200000000.0 1997 7.7
Jurassic World Colin Trevorrow Judy Greer 652177271.0 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 150000000.0 2015 7.0
The Avengers Joss Whedon Robert Downey Jr. 623279547.0 Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 220000000.0 2012 8.1
The Dark Knight Christopher Nolan Heath Ledger 533316061.0 Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 185000000.0 2008 9.0

In [200]:
def millions(number):
    '''show dollar value in millions'''
    return number*(1/1000000)

In [201]:
budgetmillions = millions(df['Budget'])
df = df.assign(Budget_inmillions=budgetmillions)
grossmillions = millions(df['Gross'])
df = df.assign(Gross_inmillions=grossmillions)

In [202]:
df.head(5)


Out[202]:
Director Actor 1 Gross Genre Actor 2 Number of Votes Actor 3 Country Rating Budget Year IMDB Score Budget_inmillions Gross_inmillions
Movie Name
Avatar James Cameron Joel David Moore 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 237000000.0 2009 7.9 237.0 760.505847
Titanic James Cameron Kate Winslet 658672302.0 Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 200000000.0 1997 7.7 200.0 658.672302
Jurassic World Colin Trevorrow Judy Greer 652177271.0 Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 150000000.0 2015 7.0 150.0 652.177271
The Avengers Joss Whedon Robert Downey Jr. 623279547.0 Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 220000000.0 2012 8.1 220.0 623.279547
The Dark Knight Christopher Nolan Heath Ledger 533316061.0 Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 185000000.0 2008 9.0 185.0 533.316061

In [203]:
df = df.drop(['Budget','Gross'], 1)

In [204]:
df.head(5)


Out[204]:
Director Actor 1 Genre Actor 2 Number of Votes Actor 3 Country Rating Year IMDB Score Budget_inmillions Gross_inmillions
Movie Name
Avatar James Cameron Joel David Moore Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 2009 7.9 237.0 760.505847
Titanic James Cameron Kate Winslet Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 1997 7.7 200.0 658.672302
Jurassic World Colin Trevorrow Judy Greer Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 2015 7.0 150.0 652.177271
The Avengers Joss Whedon Robert Downey Jr. Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 2012 8.1 220.0 623.279547
The Dark Knight Christopher Nolan Heath Ledger Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 2008 9.0 185.0 533.316061

In [205]:
df = df.rename(columns={'Budget_inmillions': 'Budget', 'Gross_inmillions': 'Gross'})
df.head(5)


Out[205]:
Director Actor 1 Genre Actor 2 Number of Votes Actor 3 Country Rating Year IMDB Score Budget Gross
Movie Name
Avatar James Cameron Joel David Moore Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 2009 7.9 237.0 760.505847
Titanic James Cameron Kate Winslet Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 1997 7.7 200.0 658.672302
Jurassic World Colin Trevorrow Judy Greer Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 2015 7.0 150.0 652.177271
The Avengers Joss Whedon Robert Downey Jr. Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 2012 8.1 220.0 623.279547
The Dark Knight Christopher Nolan Heath Ledger Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 2008 9.0 185.0 533.316061

In [206]:
print('Variable dtypes:\n', df.dtypes, sep='')
df.head()


Variable dtypes:
Director            object
Actor 1             object
Genre               object
Actor 2             object
Number of Votes      int64
Actor 3             object
Country             object
Rating              object
Year                 int64
IMDB Score         float64
Budget             float64
Gross              float64
dtype: object
Out[206]:
Director Actor 1 Genre Actor 2 Number of Votes Actor 3 Country Rating Year IMDB Score Budget Gross
Movie Name
Avatar James Cameron Joel David Moore Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 2009 7.9 237.0 760.505847
Titanic James Cameron Kate Winslet Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 1997 7.7 200.0 658.672302
Jurassic World Colin Trevorrow Judy Greer Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 2015 7.0 150.0 652.177271
The Avengers Joss Whedon Robert Downey Jr. Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 2012 8.1 220.0 623.279547
The Dark Knight Christopher Nolan Heath Ledger Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 2008 9.0 185.0 533.316061

Analysis

Before we began comparing revenue, which in this case is our measure for a movie's success, to other factors we wanted to get an understanding for the revenue dataset itself.

We created a boxplot to depict the distribution of average box office revenues.


In [269]:
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(11, 6))
ax = sns.boxplot(x=df["Gross"])
ax.axes.set_title('Distribution of Gross Box Office Revenue', fontsize=15,color="r",alpha=0.5)


Out[269]:
<matplotlib.text.Text at 0x1284dbf60>

In [270]:
mean = df['Gross'].mean()
median = df['Gross'].median()
std = df['Gross'].std()
print('The mean is', mean)
print('The median is', median)
print('The standard deviation', std)


The mean is 54.44671578554139
The median is 31.723874
The standard deviation 70.86491298659033

Budget vs. Gross Revenue

We first wanted to look at the relationship between Budget and Gross profit. Big Hollywood studios tend to invest millions of dollars in productions they want to see perform well. It is well known that movies with big budgets often become global blockbusters, given the amount invested in special effects, marketing & advertising, etc.

We began by generating a standard scatterplot to visualize whether there is a relationship between Budget and Gross Revenue.


In [209]:
df_budget = df.drop(['Director','Actor 1','Actor 2','Actor 3','Genre','Country','Rating','IMDB Score','Number of Votes'], 1)
df_budget.head(5)


Out[209]:
Year Budget Gross
Movie Name
Avatar 2009 237.0 760.505847
Titanic 1997 200.0 658.672302
Jurassic World 2015 150.0 652.177271
The Avengers 2012 220.0 623.279547
The Dark Knight 2008 185.0 533.316061

In [210]:
sns.regplot(x = 'Gross', y = 'Budget', data=df_budget)


Out[210]:
<matplotlib.axes._subplots.AxesSubplot at 0x122dcc668>

In [211]:
from pandas.stats.api import ols
regression = ols(y=df_budget['Budget'],x=df_budget['Gross'])
regression


/Users/AnthonyEid/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2881: FutureWarning:

The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html

Out[211]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         3306
Number of Degrees of Freedom:   2

R-squared:         0.4245
Adj R-squared:     0.4243

Rmse:             32.5246

F-stat (1, 3304):  2437.3292, p-value:     0.0000

Degrees of Freedom: model 1, resid 3304

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.3941     0.0080      49.37     0.0000     0.3785     0.4098
     intercept    17.0559     0.7134      23.91     0.0000    15.6577    18.4542
---------------------------------End of Summary---------------------------------

We found that there is a positive correlation between budget and gross (regression line, R-squared = 0.43). This indicates that the higher the average production budget, the higher the movie's box office performance. This can be explained by a number of factors: increased spending on marketing, expensive special effects to attract a wider audience, elaborate action sequences, famous high-paid actors, etc.

However, one issue in our data set is that we didn't account for inflation, and considering it includes movies from the 1930s onwards, it wouldn't make sense to compare old movies to newer ones.

Adjusting For Inflation

To take into account the inflation rate, and the fact that for this reason more recent movies will find it easier to perform better, we adjusted both budget and box office revenue for inflation.

To do so, we used historical CPI data (1913 to 2016) taken from the following website: http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/


In [212]:
url = "/Users/AnthonyEid/Desktop/CPI_data.csv"

df_CPI = pd.read_csv(url)

print('\n', df_CPI)


      Year      Avg
0    1913    9.900
1    1914   10.000
2    1915   10.100
3    1916   10.900
4    1917   12.800
5    1918   15.100
6    1919   17.300
7    1920   20.000
8    1921   17.900
9    1922   16.800
10   1923   17.100
11   1924   17.100
12   1925   17.500
13   1926   17.700
14   1927   17.400
15   1928   17.100
16   1929   17.100
17   1930   16.700
18   1931   15.200
19   1932   13.700
20   1933   13.000
21   1934   13.400
22   1935   13.700
23   1936   13.900
24   1937   14.400
25   1938   14.100
26   1939   13.900
27   1940   14.000
28   1941   14.700
29   1942   16.300
..    ...      ...
74   1987  113.600
75   1988  118.300
76   1989  124.000
77   1990  130.700
78   1991  136.200
79   1992  140.300
80   1993  144.500
81   1994  148.200
82   1995  152.400
83   1996  156.900
84   1997  160.500
85   1998  163.000
86   1999  166.600
87   2000  172.200
88   2001  177.100
89   2002  179.900
90   2003  184.000
91   2004  188.900
92   2005  195.300
93   2006  201.600
94   2007  207.300
95   2008  215.303
96   2009  214.537
97   2010  218.056
98   2011  224.939
99   2012  229.594
100  2013  232.957
101  2014  236.736
102  2015  237.017
103  2016  240.007

[104 rows x 2 columns]

In [213]:
df_CPI.head()


Out[213]:
Year Avg
0 1913 9.9
1 1914 10.0
2 1915 10.1
3 1916 10.9
4 1917 12.8

In [214]:
df_CPI.columns = ['Year','CPI Avg']

In [215]:
df_budget = df_budget.reset_index()
df_inflation = pd.merge(df_budget, df_CPI, on='Year')
df_inflation.head()


Out[215]:
Movie Name Year Budget Gross CPI Avg
0 Avatar 2009 237.0 760.505847 214.537
1 Transformers: Revenge of the Fallen 2009 200.0 402.076689 214.537
2 Harry Potter and the Half-Blood Prince 2009 250.0 301.956980 214.537
3 The Twilight Saga: New Moon 2009 50.0 296.623634 214.537
4 Up 2009 175.0 292.979556 214.537

In [216]:
df_inflation = df_inflation.set_index("Movie Name")

In [217]:
df_inflation = df_inflation.sort_values('Year', ascending=False).iloc[:6000]
df_inflation.head(5)


Out[217]:
Year Budget Gross CPI Avg
Movie Name
Money Monster 2016 27.0 41.008532 240.007
Hail, Caesar! 2016 22.0 29.997095 240.007
Lights Out 2016 4.9 56.536016 240.007
Bad Moms 2016 20.0 55.461307 240.007
Neighbors 2: Sorority Rising 2016 35.0 55.291815 240.007

In [218]:
df_inflation = df_inflation[pd.notnull(df_inflation['Gross'])]
df_inflation = df_inflation[pd.notnull(df_inflation['Budget'])]

In [219]:
df_inflation.tail()


Out[219]:
Year Budget Gross CPI Avg
Movie Name
Modern Times 1936 1.500 0.163245 13.9
Top Hat 1935 0.609 3.000000 13.7
42nd Street 1933 0.439 2.300000 13.0
The Broadway Melody 1929 0.379 2.808000 17.1
Over the Hill to the Poorhouse 1920 0.100 3.000000 20.0

To find the values of Gross and Budget in 2016 dollars, we need to adjust for inflation using the CPI yearly average. The CPI for 2016 is 240.007


In [220]:
df_inflation["Budget In 2016 $"] = (240.007/df_inflation["CPI Avg"])*df_inflation["Budget"]
df_inflation.head()


Out[220]:
Year Budget Gross CPI Avg Budget In 2016 $
Movie Name
Money Monster 2016 27.0 41.008532 240.007 27.0
Hail, Caesar! 2016 22.0 29.997095 240.007 22.0
Lights Out 2016 4.9 56.536016 240.007 4.9
Bad Moms 2016 20.0 55.461307 240.007 20.0
Neighbors 2: Sorority Rising 2016 35.0 55.291815 240.007 35.0

To find box office grosses equivalent to 2016, we adjusted for inflation and population growth


In [221]:
df_inflation["Gross In 2016 $"] = (240.007/df_inflation["CPI Avg"])*df_inflation["Gross"]
df_inflation.tail(15)


Out[221]:
Year Budget Gross CPI Avg Budget In 2016 $ Gross In 2016 $
Movie Name
Annie Get Your Gun 1950 3.768785 8.000000 24.1 37.532564 79.670373
The Pirate 1948 3.700000 2.956000 24.1 36.847548 29.438203
The Lady from Shanghai 1947 2.300000 0.007927 22.3 24.754085 0.085315
Duel in the Sun 1946 8.000000 20.400000 19.5 98.464410 251.084246
The Best Years of Our Lives 1946 2.100000 23.650000 19.5 25.846908 291.085413
Pinocchio 1940 2.600000 84.300000 14.0 44.572729 1445.185007
Fantasia 1940 2.280000 76.400000 14.0 39.086854 1309.752486
Gone with the Wind 1939 3.977000 198.655278 13.9 68.669629 3430.119231
The Wizard of Oz 1939 2.800000 22.202612 13.9 48.346734 383.365633
Snow White and the Seven Dwarfs 1937 2.000000 184.925485 14.4 33.334306 3082.181311
Modern Times 1936 1.500000 0.163245 13.9 25.900036 2.818701
Top Hat 1935 0.609000 3.000000 13.7 10.668924 52.556277
42nd Street 1933 0.439000 2.300000 13.0 8.104852 42.462777
The Broadway Melody 1929 0.379000 2.808000 17.1 5.319453 39.411676
Over the Hill to the Poorhouse 1920 0.100000 3.000000 20.0 1.200035 36.001050

In [222]:
sns.regplot(x = 'Gross In 2016 $', y = 'Budget In 2016 $', data=df_inflation)


Out[222]:
<matplotlib.axes._subplots.AxesSubplot at 0x123ec3f28>

In [223]:
from pandas.stats.api import ols
regression = ols(y=df_inflation['Budget In 2016 $'],x=df_inflation['Gross In 2016 $'])
regression


/Users/AnthonyEid/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2881: FutureWarning:

The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html

Out[223]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         3306
Number of Degrees of Freedom:   2

R-squared:         0.1538
Adj R-squared:     0.1535

Rmse:             45.9424

F-stat (1, 3304):   600.4033, p-value:     0.0000

Degrees of Freedom: model 1, resid 3304

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.1400     0.0057      24.50     0.0000     0.1288     0.1512
     intercept    37.8828     0.9176      41.28     0.0000    36.0843    39.6813
---------------------------------End of Summary---------------------------------

We find that correlation is lower now. Why? Let's try to eliminate outliers.


In [224]:
df_inflation1 = df_inflation[((df_inflation['Gross In 2016 $'] - df_inflation['Gross In 2016 $'].mean()) / df_inflation['Gross In 2016 $'].std()).abs() < 3]

In [225]:
sns.regplot(x = 'Gross In 2016 $', y = 'Budget In 2016 $', data=df_inflation1)


Out[225]:
<matplotlib.axes._subplots.AxesSubplot at 0x123ea3630>

In [226]:
from pandas.stats.api import ols
regression = ols(y=df_inflation1['Budget In 2016 $'],x=df_inflation1['Gross In 2016 $'])
regression


/Users/AnthonyEid/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2881: FutureWarning:

The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html

Out[226]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         3268
Number of Degrees of Freedom:   2

R-squared:         0.3620
Adj R-squared:     0.3618

Rmse:             39.3802

F-stat (1, 3266):  1852.9098, p-value:     0.0000

Degrees of Freedom: model 1, resid 3266

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.3542     0.0082      43.05     0.0000     0.3380     0.3703
     intercept    23.9459     0.8940      26.79     0.0000    22.1937    25.6982
---------------------------------End of Summary---------------------------------

In elimating outliers, our R-squared increased from 0.15 to 0.36, but this is still a lower correlation from the one we found prior to adjusting for inflation (0.43).

Over time, movie tastes would have changed, meaning the characteristics of a profitable movie would have also changed, so it didn't make sense for us to compare movies from significantly different eras.

Focusing on data post-2000

Instead of adjusting for inflation, we chose to simply focus on movies post-2000. This time period restriction was placed so that gross revenue comparisons wouldn’t be significantly impacted by the rates of inflation, and to account for changes in movie-going habits.


In [227]:
df_budget = df_budget.sort_values('Year', ascending=False).iloc[:6000]
df_budget.head()


Out[227]:
Movie Name Year Budget Gross
1517 The Boy 2016 10.0 35.794166
479 The Angry Birds Movie 2016 73.0 107.225164
255 X-Men: Apocalypse 2016 178.0 154.985087
368 Central Intelligence 2016 50.0 126.088877
1678 Gods of Egypt 2016 140.0 31.141074

In [228]:
yearlist = range(2000,2017)
df_newbudget = df_budget[df_budget['Year'].isin(yearlist)]
df_newbudget.tail()


Out[228]:
Movie Name Year Budget Gross
1287 The Replacements 2000 50.0 44.737059
370 Erin Brockovich 2000 52.0 125.548685
2780 Requiem for a Dream 2000 4.5 3.609278
2753 Whipped 2000 3.0 4.142507
1944 Titan A.E. 2000 75.0 22.751979

In [229]:
sns.regplot(x = 'Gross', y = 'Budget', data=df_newbudget)


Out[229]:
<matplotlib.axes._subplots.AxesSubplot at 0x1244e4dd8>

In [230]:
regression = ols(y=df_newbudget['Budget'],x=df_newbudget['Gross'])
regression


/Users/AnthonyEid/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2881: FutureWarning:

The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html

Out[230]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         2364
Number of Degrees of Freedom:   2

R-squared:         0.5098
Adj R-squared:     0.5096

Rmse:             32.9062

F-stat (1, 2362):  2456.5680, p-value:     0.0000

Degrees of Freedom: model 1, resid 2362

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x     0.4561     0.0092      49.56     0.0000     0.4381     0.4742
     intercept    17.7295     0.8485      20.90     0.0000    16.0666    19.3925
---------------------------------End of Summary---------------------------------

We now found that the R-squared increased to 0.51, showing a relatively strong correlation between budget and box office grosses, confirming our initital hypothesis that higher production costs, on average, lead to higher revenues (given the amount invested in star power, marketing, etc.)

Budget Ranges

To better visualize this relationship, we grouped different budget ranges into buckets (0-1, 1-10, 10-50, 50-100, 100-150, and 150-300) and generated a bar chart of gross revenues.


In [231]:
df_budget = df_budget.sort_values('Budget', ascending=False).iloc[:6000]
df_budget.head()


Out[231]:
Movie Name Year Budget Gross
48 Pirates of the Caribbean: At World's End 2007 300.0 309.404152
786 John Carter 2012 263.7 73.058679
147 Tangled 2010 260.0 200.807262
35 Spider-Man 3 2007 258.0 336.530303
17 Captain America: Civil War 2016 250.0 407.197282

In [232]:
df_budget.tail()


Out[232]:
Movie Name Year Budget Gross
3295 Newlyweds 2011 0.009000 0.004584
2896 El Mariachi 1992 0.007000 2.040920
3067 Primer 2004 0.007000 0.424760
3197 My Date with Drew 2004 0.001100 0.085222
3038 Tarnation 2003 0.000218 0.592014

In [233]:
bins = [0,1,10,50,100,150,301]
group_names = ['0-1','1-10','10-50','50-100','100-150','150-300']
categories = pd.cut(df_budget['Budget'], bins, labels=group_names)
df_budget['Avg Budget'] = pd.cut(df_budget['Budget'], bins, labels=group_names)
categories.head(5)


Out[233]:
48     150-300
786    150-300
147    150-300
35     150-300
17     150-300
Name: Budget, dtype: category
Categories (6, object): [0-1 < 1-10 < 10-50 < 50-100 < 100-150 < 150-300]

In [234]:
df_budget.tail()


Out[234]:
Movie Name Year Budget Gross Avg Budget
3295 Newlyweds 2011 0.009000 0.004584 0-1
2896 El Mariachi 1992 0.007000 2.040920 0-1
3067 Primer 2004 0.007000 0.424760 0-1
3197 My Date with Drew 2004 0.001100 0.085222 0-1
3038 Tarnation 2003 0.000218 0.592014 0-1

In [235]:
pd.value_counts(df_budget['Avg Budget'])


Out[235]:
10-50      1606
1-10        690
50-100      551
0-1         195
100-150     165
150-300      99
Name: Avg Budget, dtype: int64

In [236]:
f, ax = plt.subplots(figsize=(10, 5))
Budget = sns.barplot(x="Avg Budget", y="Gross", data=df_budget, palette="Set3")
Budget.axes.set_title('Average Budget v. Gross Revenue', fontsize=15,color="r",alpha=0.5)


Out[236]:
<matplotlib.text.Text at 0x1246a8828>

As can be seen above, the higher the budget range, the more financially successful movies are, on average.

IMDB Rating vs. Gross Revenue

Next, we chose to look at the relationship between the IMDB Score and Gross Revenue. The IMBD score is a weighted average of votes the movie recieved by IMDB users. It is meant to reflect how the general public recieves a movie.

Understanding this relationship is important because it can highlight whether the quality of the movie produced should be considered. Although, scores are tallied after the movie is initially released, many people will look to the rating to decide what film to watch which will likely impact how profitable a movie is after the initial opening weekend.


In [237]:
df.head()


Out[237]:
Director Actor 1 Genre Actor 2 Number of Votes Actor 3 Country Rating Year IMDB Score Budget Gross
Movie Name
Avatar James Cameron Joel David Moore Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 2009 7.9 237.0 760.505847
Titanic James Cameron Kate Winslet Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 1997 7.7 200.0 658.672302
Jurassic World Colin Trevorrow Judy Greer Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 2015 7.0 150.0 652.177271
The Avengers Joss Whedon Robert Downey Jr. Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 2012 8.1 220.0 623.279547
The Dark Knight Christopher Nolan Heath Ledger Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 2008 9.0 185.0 533.316061

In [238]:
df_review = df.drop(['Director','Actor 1','Actor 2','Actor 3','Genre','Rating','Year','Budget','Country'], 1)

In [239]:
df_review.head()


Out[239]:
Number of Votes IMDB Score Gross
Movie Name
Avatar 886204 7.9 760.505847
Titanic 793059 7.7 658.672302
Jurassic World 418214 7.0 652.177271
The Avengers 995415 8.1 623.279547
The Dark Knight 1676169 9.0 533.316061

In [240]:
IMDB = sns.regplot(y="Gross", x="IMDB Score", data=df_review)
IMDB.axes.set_title('Gross Revenue vs. IMDB Score', fontsize=15,color="r",alpha=0.5)


Out[240]:
<matplotlib.text.Text at 0x12491a0f0>

In [241]:
regression_IMBD = ols(x=df_review['IMDB Score'],y=df_review['Gross'])
regression_IMBD


/Users/AnthonyEid/anaconda/lib/python3.5/site-packages/IPython/core/interactiveshell.py:2881: FutureWarning:

The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html

Out[241]:
-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <x> + <intercept>

Number of Observations:         3306
Number of Degrees of Freedom:   2

R-squared:         0.0634
Adj R-squared:     0.0631

Rmse:             68.5911

F-stat (1, 3304):   223.7561, p-value:     0.0000

Degrees of Freedom: model 1, resid 3304

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             x    16.9866     1.1356      14.96     0.0000    14.7609    19.2124
     intercept   -54.7646     7.3978      -7.40     0.0000   -69.2643   -40.2650
---------------------------------End of Summary---------------------------------

In [243]:
bins = [0.5,1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,9.5]
group_names = [1,2,3,4,5,6,7,8,9]
categories2 = pd.cut(df_review['IMDB Score'], bins, labels=group_names)
df_review['Avg_Score'] = pd.cut(df_review['IMDB Score'], bins, labels=group_names)
categories2.head()


Out[243]:
Movie Name
Avatar              8
Titanic             8
Jurassic World      7
The Avengers        8
The Dark Knight     9
Name: IMDB Score, dtype: category
Categories (9, int64): [1 < 2 < 3 < 4 ... 6 < 7 < 8 < 9]

In [272]:
f, ax = plt.subplots(figsize=(10, 5))
AvgScore = sns.boxplot(x="Avg_Score", y="Gross", data=df_review, palette="Set3")
AvgScore.axes.set_title('Average Score v. Gross Revenue', fontsize=15,color="r",alpha=0.5)


Out[272]:
<matplotlib.text.Text at 0x123a2ca90>

The R-Squared indicates that there is very little relationship between IMDB scores and gross revenue. However, the box plot shows that movies that are ranked highly still tend to generate higher revenues. The outliers highlight that the the highest grossing movies don't necessarily have the highest scores.


In [274]:
df_review.head(10)


Out[274]:
Number of Votes IMDB Score Gross Avg_Score
Movie Name
Avatar 886204 7.9 760.505847 8
Titanic 793059 7.7 658.672302 8
Jurassic World 418214 7.0 652.177271 7
The Avengers 995415 8.1 623.279547 8
The Dark Knight 1676169 9.0 533.316061 9
Star Wars: Episode I - The Phantom Menace 534658 6.5 474.544677 6
Star Wars: Episode IV - A New Hope 911097 8.7 460.935665 9
Avengers: Age of Ultron 462669 7.5 458.991599 7
The Dark Knight Rises 1144337 8.5 448.130642 8
Shrek 2 314630 7.2 436.471036 7

This statement is supported by quickly looking at the top 10 grossing movies from our dataset. Of the 10 only two recieved a score in the 9 range.

Regardless, on average, highly rated movies tend to have higher gross revenues.

Rating vs. Gross Revenue


In [58]:
df_rating = df.drop(['Director','Actor 1','Actor 2','Actor 3','Country','Genre','Year','Budget','IMDB Score'],1)
df_rating.head(5)


Out[58]:
Number of Votes Rating Gross
Movie Name
Avatar 886204 PG-13 760.505847
Titanic 793059 PG-13 658.672302
Jurassic World 418214 PG-13 652.177271
The Avengers 995415 PG-13 623.279547
The Dark Knight 1676169 PG-13 533.316061

In [246]:
f, ax = plt.subplots(figsize=(11, 6))
Rating = sns.barplot(x="Rating", y="Gross", data=df,palette="Set3")


Since the 1980s, the MPAA has only used the following ratings, which we are going to focus on:

G – General Audiences
PG – Parental Guidance Suggested
PG-13 – Parents Strongly Cautioned
R – Restricted
NC-17 – Adults Only

In [60]:
value_list = ['PG-13','PG','R','G','NC-17']
df_rating = df_rating[df_rating.Rating.isin(value_list)]
df_rating.head()


Out[60]:
Number of Votes Rating Gross
Movie Name
Avatar 886204 PG-13 760.505847
Titanic 793059 PG-13 658.672302
Jurassic World 418214 PG-13 652.177271
The Avengers 995415 PG-13 623.279547
The Dark Knight 1676169 PG-13 533.316061

In [61]:
f, ax = plt.subplots(figsize=(11, 6))
Rating = sns.barplot(x="Rating", y="Gross", data=df_rating, order=['G','PG','PG-13','R','NC-17'], palette="Set3")
Rating.axes.set_title('Rating vs. Gross Revenue', fontsize=15,color="r",alpha=0.5)


Out[61]:
<matplotlib.text.Text at 0x11c7a4630>

We found that, on average, movies with a G-rating tend to perform best. This can be explained by the fact that movies rated G attract a wider range of audiences, given that they doesn't place any age restrictions. The more restrictive the rating is, the lower the average box office revenues are.


In [251]:
f, ax = plt.subplots(figsize=(11, 6))
Rating = sns.boxplot(x="Rating", y="Gross", data=df_rating, order=['G','PG','PG-13','R','NC-17'], palette="Set3")
Rating.axes.set_title('Rating vs. Gross Revenue', fontsize=15,color="r",alpha=0.5)


Out[251]:
<matplotlib.text.Text at 0x123289cc0>

While age restrictions tend to lower average grosses, Hollywood's highest grossing productions have often been rated PG-13. It’s not a coincidence that most films are rated PG-13. Not just because they’re the most lucrative, but because, according to The Wrap, "life is unrated, and telling a story right frequently requires including sex or violence. Try to find a G-rated war movie, for example."

Genre vs. Gross Revenues

We then tried to look at different movie genres, to see whether they affect box office receipts. One issue, however, is that our data sets categorized the movies under multiple combinations of genres, making it difficult to analyze each one seperately. A little manipulation allowed us to fix this problem.


In [253]:
df['Genre'].head()


Out[253]:
Movie Name
Avatar               Action|Adventure|Fantasy|Sci-Fi
Titanic                                Drama|Romance
Jurassic World      Action|Adventure|Sci-Fi|Thriller
The Avengers                 Action|Adventure|Sci-Fi
The Dark Knight          Action|Crime|Drama|Thriller
Name: Genre, dtype: object

In [254]:
df_genre = df['Genre'].str.split('|', expand=True).astype(str)
df_genre.head()


Out[254]:
0 1 2 3 4 5 6 7
Movie Name
Avatar Action Adventure Fantasy Sci-Fi None None None None
Titanic Drama Romance None None None None None None
Jurassic World Action Adventure Sci-Fi Thriller None None None None
The Avengers Action Adventure Sci-Fi None None None None None
The Dark Knight Action Crime Drama Thriller None None None None

In [255]:
df_genre.columns = ['Genre1','Genre2','Genre3','Genre4','Genre5','Genre6','Genre7','Genre8']

In [256]:
df_gross = df.drop(['Director','Actor 1','Actor 2','Actor 3','Genre','Number of Votes','Country','Rating','Budget','IMDB Score','Year'], 1)
df_gross.head(10)


Out[256]:
Gross
Movie Name
Avatar 760.505847
Titanic 658.672302
Jurassic World 652.177271
The Avengers 623.279547
The Dark Knight 533.316061
Star Wars: Episode I - The Phantom Menace 474.544677
Star Wars: Episode IV - A New Hope 460.935665
Avengers: Age of Ultron 458.991599
The Dark Knight Rises 448.130642
Shrek 2 436.471036

In [257]:
df_genre = pd.concat([df_gross, df_genre], axis=1)
df_genre.head()


Out[257]:
Gross Genre1 Genre2 Genre3 Genre4 Genre5 Genre6 Genre7 Genre8
Movie Name
Avatar 760.505847 Action Adventure Fantasy Sci-Fi None None None None
Titanic 658.672302 Drama Romance None None None None None None
Jurassic World 652.177271 Action Adventure Sci-Fi Thriller None None None None
The Avengers 623.279547 Action Adventure Sci-Fi None None None None None
The Dark Knight 533.316061 Action Crime Drama Thriller None None None None

In [258]:
df_genre = df_genre.reset_index()
df_genre.head()


Out[258]:
Movie Name Gross Genre1 Genre2 Genre3 Genre4 Genre5 Genre6 Genre7 Genre8
0 Avatar 760.505847 Action Adventure Fantasy Sci-Fi None None None None
1 Titanic 658.672302 Drama Romance None None None None None None
2 Jurassic World 652.177271 Action Adventure Sci-Fi Thriller None None None None
3 The Avengers 623.279547 Action Adventure Sci-Fi None None None None None
4 The Dark Knight 533.316061 Action Crime Drama Thriller None None None None

In [259]:
df_genre.columns = ['Movie Name','Gross','Genre1','Genre2','Genre3','Genre4','Genre5','Genre6','Genre7','Genre8']
df_genre.head()


Out[259]:
Movie Name Gross Genre1 Genre2 Genre3 Genre4 Genre5 Genre6 Genre7 Genre8
0 Avatar 760.505847 Action Adventure Fantasy Sci-Fi None None None None
1 Titanic 658.672302 Drama Romance None None None None None None
2 Jurassic World 652.177271 Action Adventure Sci-Fi Thriller None None None None
3 The Avengers 623.279547 Action Adventure Sci-Fi None None None None None
4 The Dark Knight 533.316061 Action Crime Drama Thriller None None None None

In [260]:
df_genre.set_index(['Movie Name','Gross'],inplace=True)

In [261]:
df_genre = df_genre.stack()

In [262]:
df_genre = pd.DataFrame(df_genre, columns = ['Genre'])

df_genre.groupby(df_genre['Genre'])
df_genre.head(10)


Out[262]:
Genre
Movie Name Gross
Avatar 760.505847 Genre1 Action
Genre2 Adventure
Genre3 Fantasy
Genre4 Sci-Fi
Genre5 None
Genre6 None
Genre7 None
Genre8 None
Titanic 658.672302 Genre1 Drama
Genre2 Romance

In [263]:
df_genre.reset_index(inplace=True) 
df_genre.head()


Out[263]:
Movie Name Gross level_2 Genre
0 Avatar 760.505847 Genre1 Action
1 Avatar 760.505847 Genre2 Adventure
2 Avatar 760.505847 Genre3 Fantasy
3 Avatar 760.505847 Genre4 Sci-Fi
4 Avatar 760.505847 Genre5 None

In [264]:
pd.value_counts(df_genre['Genre'])


Out[264]:
None           16677
Drama           1619
Comedy          1354
Thriller         933
Action           802
Romance          763
Adventure        686
Crime            608
Fantasy          444
Sci-Fi           425
Family           400
Horror           327
Mystery          321
Biography        204
Animation        172
Music            146
Sport            136
History          115
War              111
Musical           94
Documentary       57
Western           51
Short              2
Film-Noir          1
Name: Genre, dtype: int64

In [266]:
vlist = ['Action', 'Adventure','Fantasy','Drama','Romance','Thriller','Crime','Animation','Comedy','Family','Musical','Biography','History','War','Mystery','Horror','Sport','Music','Western','Documentary']
df_genre = df_genre[df_genre['Genre'].isin(vlist)]

In [284]:
f, ax = plt.subplots(figsize=(20, 10))
Genre = sns.barplot(x="Genre", y="Gross", data=df_genre, palette="Set3")
Genre.axes.set_title('Genre v. Gross', fontsize=25,color="r",alpha=0.5)


Out[284]:
<matplotlib.text.Text at 0x12819fc88>

We found that Animation movies tend to gross the most, followed by Family, Fantasy, Adventure and Action.

Animation and Family movies attract a wider pool of people given their non-restrictive G rating (animation movies are also geared towards kids who are most often accompanied by guardians, doubling the revenue).

Action, Adventure and Fantasy movies, on the other hand, provide more of a movie-going experience (special effects, elaborate action scenes, etc.), not only attracting more people, but also charging more per ticket (IMAX or 3D movies tend to be more expensive). Further, these types of movies are often big-budget Hollywood productions, which ties back to our first conclusion that big budget films tend to generate higher revenues.

However, outside of the top 5, there is very little variation in the average revenue for all other genres. This indicates that, unless you are in the top 5, the genre of the movie will probably not impact how succesful it is domestically.

Director Popularity vs. Gross Revenues

Our dataset does not contain any indicators of director popularity (the "Facebook Likes" column contains too many wrong or missing values). Therefore, we focused on the ones who directed the highest number of movies. Studios only hire directors to direct again if their movies have been successful in the past or if they are famous, with the hopes that the director will repeat his or her success or the value of their celebrity will attract enough attention to the movie to generate profit.


In [275]:
df.head()


Out[275]:
Director Actor 1 Genre Actor 2 Number of Votes Actor 3 Country Rating Year IMDB Score Budget Gross
Movie Name
Avatar James Cameron Joel David Moore Action|Adventure|Fantasy|Sci-Fi CCH Pounder 886204 Wes Studi USA PG-13 2009 7.9 237.0 760.505847
Titanic James Cameron Kate Winslet Drama|Romance Leonardo DiCaprio 793059 Gloria Stuart USA PG-13 1997 7.7 200.0 658.672302
Jurassic World Colin Trevorrow Judy Greer Action|Adventure|Sci-Fi|Thriller Bryce Dallas Howard 418214 Omar Sy USA PG-13 2015 7.0 150.0 652.177271
The Avengers Joss Whedon Robert Downey Jr. Action|Adventure|Sci-Fi Chris Hemsworth 995415 Scarlett Johansson USA PG-13 2012 8.1 220.0 623.279547
The Dark Knight Christopher Nolan Heath Ledger Action|Crime|Drama|Thriller Christian Bale 1676169 Morgan Freeman USA PG-13 2008 9.0 185.0 533.316061

In [276]:
df_director = df.drop(['Actor 1','Actor 2','Actor 3','Number of Votes','Genre','Country','Rating','IMDB Score','Budget','Year'], 1)
df_director.head()


Out[276]:
Director Gross
Movie Name
Avatar James Cameron 760.505847
Titanic James Cameron 658.672302
Jurassic World Colin Trevorrow 652.177271
The Avengers Joss Whedon 623.279547
The Dark Knight Christopher Nolan 533.316061

In [277]:
df_director['Value Counts'] = df_director.groupby('Director')['Director'].transform('count')
df_director.head()


Out[277]:
Director Gross Value Counts
Movie Name
Avatar James Cameron 760.505847 7
Titanic James Cameron 658.672302 7
Jurassic World Colin Trevorrow 652.177271 2
The Avengers Joss Whedon 623.279547 3
The Dark Knight Christopher Nolan 533.316061 8

In [278]:
df_director['Director'].value_counts()


Out[278]:
Steven Spielberg         24
Clint Eastwood           19
Woody Allen              17
Ridley Scott             16
Martin Scorsese          16
Spike Lee                15
Renny Harlin             14
Tim Burton               14
Steven Soderbergh        14
Robert Zemeckis          13
Barry Levinson           13
Ron Howard               13
Michael Bay              12
Robert Rodriguez         12
Rob Reiner               11
Richard Linklater        11
Shawn Levy               11
Oliver Stone             11
Tony Scott               10
David Fincher            10
Wes Craven               10
Chris Columbus           10
Kevin Smith              10
John Carpenter           10
Joel Schumacher          10
Bobby Farrelly            9
Sam Raimi                 9
Dennis Dugan              9
M. Night Shyamalan        9
Francis Ford Coppola      9
                         ..
Joshua Michael Stern      1
Michael Radford           1
Joshua Tickell            1
Rob Hedden                1
Norman Ferguson           1
David Lynch               1
Chris Roberts             1
Chris Butler              1
Mark Osborne              1
David Moreau              1
Michael Herz              1
Nick Tomnay               1
Lee Toland Krieger        1
Adam Marcus               1
Derrick Borte             1
Thea Sharrock             1
Nathan Greno              1
Bruce Hunt                1
Tony Bill                 1
David Sington             1
Franklin J. Schaffner     1
James Bidgood             1
Eric Nicholas             1
Ric Roman Waugh           1
Barry Cook                1
Charles Shyer             1
Patty Jenkins             1
Darren Stein              1
Willard Huyck             1
Drew Barrymore            1
Name: Director, dtype: int64

In [279]:
df_director[df_director['Director'].str.contains("Steven Spielberg",na=False)].head()


Out[279]:
Director Gross Value Counts
Movie Name
E.T. the Extra-Terrestrial Steven Spielberg 434.949459 24
Jurassic Park Steven Spielberg 356.784000 24
Indiana Jones and the Kingdom of the Crystal Skull Steven Spielberg 317.011114 24
Jaws Steven Spielberg 260.000000 24
Raiders of the Lost Ark Steven Spielberg 242.374454 24

In [280]:
bins = [0,12,30]
group_names = ['Less than 12','More than 12']
categories = pd.cut(df_director['Value Counts'], bins, labels=group_names)
df_director['Popularity of Director'] = pd.cut(df_director['Value Counts'], bins, labels=group_names)
categories.head(5)


Out[280]:
Movie Name
Avatar              Less than 12
Titanic             Less than 12
Jurassic World      Less than 12
The Avengers        Less than 12
The Dark Knight     Less than 12
Name: Value Counts, dtype: category
Categories (2, object): [Less than 12 < More than 12]

In [281]:
pd.value_counts(df_director['Popularity of Director'])


Out[281]:
Less than 12    3118
More than 12     188
Name: Popularity of Director, dtype: int64

In [85]:
f, ax = plt.subplots(figsize=(5, 7))
Fame = sns.barplot(x="Popularity of Director", y="Gross", data=df_director, palette="Set3")
Fame.axes.set_title('Popularity of Director v. Gross', fontsize=15,color="r",alpha=0.5)


Out[85]:
<matplotlib.text.Text at 0x11ddde400>

By comparing the revenue of movies from directors who produce fewer films to directors who produce the most films (our measure for how popular a director is) we see that there is a relationship between the popularity of the director and the sucess of the film. On average, directors who produce a lot of films also tend to produce films with higher revenues.

However, looking at this relationship in isolation can be misleading because it is unclear whether these directors simply make movies that are of higher quality so they generate high revenue, or they generate more revenue because the more recognizable the director is the more likely people are to go see the film.

Star Power vs. Gross Revenues

We implemented the same rationale for actors as we did for director, and found very similar results.


In [87]:
df_stars = df.drop(['Director','Number of Votes','Genre','Country','Rating','IMDB Score','Budget','Year'], 1)
df_stars.head()


Out[87]:
Actor 1 Actor 2 Actor 3 Gross
Movie Name
Avatar Joel David Moore CCH Pounder Wes Studi 760.505847
Titanic Kate Winslet Leonardo DiCaprio Gloria Stuart 658.672302
Jurassic World Judy Greer Bryce Dallas Howard Omar Sy 652.177271
The Avengers Robert Downey Jr. Chris Hemsworth Scarlett Johansson 623.279547
The Dark Knight Heath Ledger Christian Bale Morgan Freeman 533.316061

In [88]:
df_stars = df_stars.reset_index()
df_stars.head()


Out[88]:
Movie Name Actor 1 Actor 2 Actor 3 Gross
0 Avatar Joel David Moore CCH Pounder Wes Studi 760.505847
1 Titanic Kate Winslet Leonardo DiCaprio Gloria Stuart 658.672302
2 Jurassic World Judy Greer Bryce Dallas Howard Omar Sy 652.177271
3 The Avengers Robert Downey Jr. Chris Hemsworth Scarlett Johansson 623.279547
4 The Dark Knight Heath Ledger Christian Bale Morgan Freeman 533.316061

In [89]:
df_stars.set_index(['Movie Name','Gross'],inplace=True)
df_stars.head()


Out[89]:
Actor 1 Actor 2 Actor 3
Movie Name Gross
Avatar 760.505847 Joel David Moore CCH Pounder Wes Studi
Titanic 658.672302 Kate Winslet Leonardo DiCaprio Gloria Stuart
Jurassic World 652.177271 Judy Greer Bryce Dallas Howard Omar Sy
The Avengers 623.279547 Robert Downey Jr. Chris Hemsworth Scarlett Johansson
The Dark Knight 533.316061 Heath Ledger Christian Bale Morgan Freeman

In [90]:
df_stars = df_stars.stack()
df_stars.head(10)


Out[90]:
Movie Name       Gross              
Avatar           760.505847  Actor 1       Joel David Moore
                             Actor 2            CCH Pounder
                             Actor 3              Wes Studi
Titanic          658.672302  Actor 1           Kate Winslet
                             Actor 2      Leonardo DiCaprio
                             Actor 3          Gloria Stuart
Jurassic World   652.177271  Actor 1             Judy Greer
                             Actor 2    Bryce Dallas Howard
                             Actor 3                Omar Sy
The Avengers     623.279547  Actor 1      Robert Downey Jr.
dtype: object

In [91]:
df_stars = pd.DataFrame(df_stars, columns = ['Star'])

df_stars.groupby(df_stars['Star'])
df_stars.head(10)


Out[91]:
Star
Movie Name Gross
Avatar 760.505847 Actor 1 Joel David Moore
Actor 2 CCH Pounder
Actor 3 Wes Studi
Titanic 658.672302 Actor 1 Kate Winslet
Actor 2 Leonardo DiCaprio
Actor 3 Gloria Stuart
Jurassic World 652.177271 Actor 1 Judy Greer
Actor 2 Bryce Dallas Howard
Actor 3 Omar Sy
The Avengers 623.279547 Actor 1 Robert Downey Jr.

In [92]:
df_stars.reset_index(inplace=True) 
df_stars.head()


Out[92]:
Movie Name Gross level_2 Star
0 Avatar 760.505847 Actor 1 Joel David Moore
1 Avatar 760.505847 Actor 2 CCH Pounder
2 Avatar 760.505847 Actor 3 Wes Studi
3 Titanic 658.672302 Actor 1 Kate Winslet
4 Titanic 658.672302 Actor 2 Leonardo DiCaprio

In [93]:
pd.value_counts(df_stars['Star'])


Out[93]:
Robert De Niro         46
Morgan Freeman         36
Bruce Willis           34
Matt Damon             34
Johnny Depp            33
Denzel Washington      31
Bill Murray            30
Steve Buscemi          29
J.K. Simmons           29
Brad Pitt              29
Will Ferrell           29
Robert Downey Jr.      28
Nicolas Cage           28
Tom Hanks              27
Harrison Ford          27
Tom Cruise             26
Julia Roberts          25
Liam Neeson            25
Matthew McConaughey    24
Robin Williams         24
Tom Wilkinson          23
Adam Sandler           23
Jim Broadbent          23
Angelina Jolie Pitt    22
Gerard Butler          22
Keanu Reeves           22
Christian Bale         22
Anthony Hopkins        22
Meryl Streep           22
Judy Greer             22
                       ..
Kevin Breznahan         1
Nina Hartley            1
Alan D. Purwin          1
Jay Leno                1
Patrick Muldoon         1
Alycia Debnam-Carey     1
Jonathan Brandis        1
Tom Sizemore            1
Elise Neal              1
Jossara Jinaro          1
Billy Brown             1
Jean-Louis Blondeau     1
Mink Stole              1
Kim Director            1
Kim De Angelo           1
Ted Cassidy             1
Charles Mesure          1
Luis E. Carazo          1
Elissa Dowling          1
T.C. Stallings          1
Jason Weaver            1
John Hodgman            1
Michael Lonsdale        1
Glenn Plummer           1
Taylor Blackwell        1
Tao Okamoto             1
David Moscow            1
James Urbaniak          1
Phyllis Diller          1
Niketa Calame           1
Name: Star, dtype: int64

In [94]:
df_stars['Star Popularity'] = df_stars.groupby('Star')['Star'].transform('count')
df_stars.head()


Out[94]:
Movie Name Gross level_2 Star Star Popularity
0 Avatar 760.505847 Actor 1 Joel David Moore 4
1 Avatar 760.505847 Actor 2 CCH Pounder 6
2 Avatar 760.505847 Actor 3 Wes Studi 6
3 Titanic 658.672302 Actor 1 Kate Winslet 18
4 Titanic 658.672302 Actor 2 Leonardo DiCaprio 18

In [95]:
df_stars[df_stars['Star'].str.contains("Robert De Niro",na=False)].head()


Out[95]:
Movie Name Gross level_2 Star Star Popularity
211 Meet the Fockers 279.167575 Actor 2 Robert De Niro 46
670 Meet the Parents 166.225040 Actor 2 Robert De Niro 46
730 Shark Tale 160.762022 Actor 2 Robert De Niro 46
814 Little Fockers 148.383780 Actor 2 Robert De Niro 46
1005 Silver Linings Playbook 132.088910 Actor 1 Robert De Niro 46

In [96]:
bins = [0,22,60]
group_names = ['Less than 22','More than 22']
categories = pd.cut(df_stars['Star Popularity'], bins, labels=group_names)
df_stars['Star Popularity'] = pd.cut(df_stars['Star Popularity'], bins, labels=group_names)
categories.head(5)


Out[96]:
0    Less than 22
1    Less than 22
2    Less than 22
3    Less than 22
4    Less than 22
Name: Star Popularity, dtype: category
Categories (2, object): [Less than 22 < More than 22]

In [97]:
pd.value_counts(df_stars['Star Popularity'])


Out[97]:
Less than 22    9246
More than 22     663
Name: Star Popularity, dtype: int64

In [98]:
f, ax = plt.subplots(figsize=(5, 7))
StarPower = sns.barplot(x="Star Popularity", y="Gross", data=df_stars, palette="Set3")
StarPower.axes.set_title('Star Power vs. Gross', fontsize=15,color="r",alpha=0.5)


Out[98]:
<matplotlib.text.Text at 0x11d615080>

Similarly to what we found with directors, famous actors (i.e. actors who have starred in a higher number of filmds) tend to drive up box office revenues.

Conclusions

Upon analyzing all these factors, we found that it is difficult to generalize the film industry as a whole. There are a huge number of variables that can influence whether a movie is successful or not, explaining the industry's highly unpredictable nature. If there was a perfect formula, big Hollywood studios would never produce any flops.

We found, however, that there are some trends in predicting future success:

  • Higher production budgets,
  • Better quality movies,
  • Non-restrictive ratings,
  • Family-friendly or action-based genres,
  • Director popularity and star power.

This, however, is a small proportion of all the variables that influence movie-going habits and box-office performance. It would be valuable to look at different factors such as month of release, macro-economic influences, or even the weather.