Solana San Pietro and Anthony Eid
Data Bootcamp Spring 2017 UG Section Final Project
In 2016, the global film industry generated 38 billion US dollars, and revenues are projected to increase to nearly 50 billion US dollars by 2020. However, although the industry is highly profitable as a whole, every movie produced requires a high level of investment from studios, producers, etc. with little guarantee of a return on that investment. A movie's profitability is increasingly elusive as consumer media habits shift and marketing costs rise.
The unpredictability of a movie's profitability was highlighted recently by the failure of the movie "The Great Wall". Financed through a joint partnership between the US and China, the movie was expected to be a huge success. Nonetheless, it made only 34.8 million US dollars at the North American box office, against a production budget of 150 million US dollars.
On the other hand, Amazon has been investing over 3 billion US dollars annually in original content, with various success stories such as its investment in "Manchester by the Sea", which received six Oscar nominations in 2017.
The profitability of an individual film is highly unpredictable, but industry investors would be put at ease if there could be some understanding of whether their investments will be worthwhile. Therefore, we want to look at the best indicators to predict a film’s success. We define a film's success by its domestic box office revenue.
We acquired our data from the following website: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset
The "IMDB 5000 Movie Set" contains movie data for over 5000 movies made in the past 100 years. The information was scraped from IMDB's database, and we were able to download it as a CSV file.
In [190]:
import sys
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import time
import seaborn as sns
from pandas_datareader import wb, data as web
# plotly imports
from plotly.offline import iplot, iplot_mpl # plotting functions
import plotly.graph_objs as go # ditto
import plotly # just to print version and init notebook
import cufflinks as cf # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)
%matplotlib inline
plotly.offline.init_notebook_mode(connected=True)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Today: ', dt.date.today())
csv = "/Users/AnthonyEid/Desktop/movie_metadata.csv"
df = pd.read_csv(csv)
df.head()
Out[190]:
The "IMDB 5000" dataset included more information than we thought necessary or relevant to understanding what drives a movie's revenue, so we dropped the extra columns and kept data only on the following:
Director
Gross
Genre
Actor
Movie Name
Country
Rating
Budget
Year
IMDB Score
In [191]:
df.drop(df.columns[[0,2,3,4,5,7,13,15,16,17,18,19,24,26,27]], axis=1, inplace=True)
In [192]:
df.columns = ['Director','Actor 1','Gross','Genre','Actor 2','Movie Name','Number of Votes','Actor 3','Country', 'Rating', 'Budget', 'Year', 'IMDB Score']
In [193]:
df.mean(numeric_only=True)  # average only the numeric columns
Out[193]:
In [194]:
print('Variable dtypes:\n', df.dtypes, sep='')
df.head()
Out[194]:
In [195]:
df = df.sort_values('Gross', ascending=False).iloc[:6000]
df.head(5)
Out[195]:
To further clean up our data we set the index and removed duplicates.
In [196]:
df = df.drop_duplicates('Movie Name')
df = df[pd.notnull(df['Gross']) & pd.notnull(df['Budget'])]
df = df.set_index('Movie Name')
df.head()
Out[196]:
In [197]:
df['Year'] = df['Year'].astype(int) #change year to integer to get rid of decimal
print('Variable dtypes:\n', df.dtypes, sep='')
df.head()
Out[197]:
We also noticed the dataset included movies from countries all over the world, but many of these international movies, particularly ones from smaller countries, did not have their budgets converted to US dollars. For example, the South Korean movie "Lady Vengeance" had a budget of over 4 billion, but the number is only that large because it is denominated in Korean won.
To work around this problem we chose to use only data from the US and UK movie industries. Although the UK is a foreign country, many huge Hollywood movies, such as the Bond and Harry Potter movies, are listed as UK movies, so it is important that we include these films. Furthermore, the difference between the dollar and the pound tends to be far less dramatic than in the case of the South Korean won.
In [198]:
df_bybudget = df.sort_values('Budget', ascending=False).iloc[:6000]
df_bybudget.head(10)
Out[198]:
In [199]:
vlist = ['USA', 'UK']
df = df[df['Country'].isin(vlist)]
df.head(5)
Out[199]:
In [200]:
def millions(number):
    '''show dollar value in millions'''
    return number*(1/1000000)
In [201]:
budgetmillions = millions(df['Budget'])
df = df.assign(Budget_inmillions=budgetmillions)
grossmillions = millions(df['Gross'])
df = df.assign(Gross_inmillions=grossmillions)
In [202]:
df.head(5)
Out[202]:
In [203]:
df = df.drop(['Budget','Gross'], axis=1)
In [204]:
df.head(5)
Out[204]:
In [205]:
df = df.rename(columns={'Budget_inmillions': 'Budget', 'Gross_inmillions': 'Gross'})
df.head(5)
Out[205]:
In [206]:
print('Variable dtypes:\n', df.dtypes, sep='')
df.head()
Out[206]:
In [269]:
sns.set_style("whitegrid")
f, ax = plt.subplots(figsize=(11, 6))
ax = sns.boxplot(x=df["Gross"])
ax.axes.set_title('Distribution of Gross Box Office Revenue', fontsize=15,color="r",alpha=0.5)
Out[269]:
In [270]:
mean = df['Gross'].mean()
median = df['Gross'].median()
std = df['Gross'].std()
print('The mean is', mean)
print('The median is', median)
print('The standard deviation is', std)
We first wanted to look at the relationship between Budget and Gross revenue. Big Hollywood studios tend to invest millions of dollars in productions they want to see perform well. It is well known that movies with big budgets often become global blockbusters, given the amount invested in special effects, marketing & advertising, etc.
We began by generating a standard scatterplot to visualize whether there is a relationship between Budget and Gross Revenue.
In [209]:
df_budget = df.drop(['Director','Actor 1','Actor 2','Actor 3','Genre','Country','Rating','IMDB Score','Number of Votes'], axis=1)
df_budget.head(5)
Out[209]:
In [210]:
sns.regplot(x = 'Gross', y = 'Budget', data=df_budget)
Out[210]:
In [211]:
from pandas.stats.api import ols
regression = ols(y=df_budget['Budget'],x=df_budget['Gross'])
regression
Out[211]:
We found that there is a positive correlation between budget and gross (regression line, R-squared = 0.43). This indicates that the higher the average production budget, the higher the movie's box office performance. This can be explained by a number of factors: increased spending on marketing, expensive special effects to attract a wider audience, elaborate action sequences, famous high-paid actors, etc.
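The `pandas.stats.api.ols` helper used in the cell above was dropped from later pandas releases, so that cell may not run on a current install. As a minimal sketch, assuming a simple one-variable least-squares fit, the same R-squared could be obtained with NumPy alone. The function name `r_squared` and the toy numbers are illustrative and not taken from the dataset.

```python
import numpy as np

def r_squared(x, y):
    """R-squared of a one-variable least-squares fit of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)    # fit y = slope*x + intercept
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)           # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)      # total variation in y
    return 1.0 - ss_res / ss_tot

# Hypothetical values in millions of dollars (not from the dataset)
gross = [10.0, 40.0, 90.0, 150.0, 300.0]
budget = [5.0, 30.0, 60.0, 80.0, 200.0]
print(round(r_squared(gross, budget), 3))
```

With a single regressor, R-squared equals the squared Pearson correlation, so it is the same whether Budget or Gross is placed on the y axis.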
However, one issue in our data set is that we didn't account for inflation, and considering it includes movies from the 1930s onwards, it wouldn't make sense to compare old movies to newer ones.
Because inflation makes it easier for more recent movies to post larger nominal figures, we adjusted both budget and box office revenue for inflation.
To do so, we used historical CPI data (1913 to 2016) taken from the following website: http://www.usinflationcalculator.com/inflation/consumer-price-index-and-annual-percent-changes-from-1913-to-2008/
In [212]:
url = "/Users/AnthonyEid/Desktop/CPI_data.csv"
df_CPI = pd.read_csv(url)
print('\n', df_CPI)
In [213]:
df_CPI.head()
Out[213]:
In [214]:
df_CPI.columns = ['Year','CPI Avg']
In [215]:
df_budget = df_budget.reset_index()
df_inflation = pd.merge(df_budget, df_CPI, on='Year')
df_inflation.head()
Out[215]:
In [216]:
df_inflation = df_inflation.set_index("Movie Name")
In [217]:
df_inflation = df_inflation.sort_values('Year', ascending=False).iloc[:6000]
df_inflation.head(5)
Out[217]:
In [218]:
df_inflation = df_inflation[pd.notnull(df_inflation['Gross'])]
df_inflation = df_inflation[pd.notnull(df_inflation['Budget'])]
In [219]:
df_inflation.tail()
Out[219]:
To find the values of Gross and Budget in 2016 dollars, we need to adjust for inflation using the CPI yearly average. The CPI for 2016 is 240.007
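The adjustment is a simple ratio: a value from a given year is multiplied by the 2016 CPI and divided by that year's CPI. The sketch below shows the arithmetic on its own; the helper name `to_2016_dollars` and the example inputs are hypothetical.

```python
CPI_2016 = 240.007  # annual average CPI for 2016, as stated above

def to_2016_dollars(amount, cpi_in_year, cpi_2016=CPI_2016):
    """Scale a dollar amount from its release year into 2016 dollars."""
    return amount * cpi_2016 / cpi_in_year

# Hypothetical example: a budget of 50 (million) from a year whose CPI was
# half the 2016 level is worth roughly 100 (million) in 2016 dollars.
example = to_2016_dollars(50.0, cpi_in_year=CPI_2016 / 2)
print(example)
```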
In [220]:
df_inflation["Budget In 2016 $"] = (240.007/df_inflation["CPI Avg"])*df_inflation["Budget"]
df_inflation.head()
Out[220]:
To find box office grosses in 2016 dollars, we applied the same inflation adjustment.
In [221]:
df_inflation["Gross In 2016 $"] = (240.007/df_inflation["CPI Avg"])*df_inflation["Gross"]
df_inflation.tail(15)
Out[221]:
In [222]:
sns.regplot(x = 'Gross In 2016 $', y = 'Budget In 2016 $', data=df_inflation)
Out[222]:
In [223]:
from pandas.stats.api import ols
regression = ols(y=df_inflation['Budget In 2016 $'],x=df_inflation['Gross In 2016 $'])
regression
Out[223]:
We find that correlation is lower now. Why? Let's try to eliminate outliers.
In [224]:
df_inflation1 = df_inflation[((df_inflation['Gross In 2016 $'] - df_inflation['Gross In 2016 $'].mean()) / df_inflation['Gross In 2016 $'].std()).abs() < 3]
In [225]:
sns.regplot(x = 'Gross In 2016 $', y = 'Budget In 2016 $', data=df_inflation1)
Out[225]:
In [226]:
from pandas.stats.api import ols
regression = ols(y=df_inflation1['Budget In 2016 $'],x=df_inflation1['Gross In 2016 $'])
regression
Out[226]:
After eliminating outliers, our R-squared increased from 0.15 to 0.36, but this is still lower than the value we found prior to adjusting for inflation (0.43).
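The outlier filter used above keeps only rows whose inflation-adjusted gross lies within three standard deviations of the mean. Here is a minimal sketch of that rule on toy data; the function name `drop_outliers` and the values are assumptions made for illustration.

```python
import pandas as pd

def drop_outliers(frame, column, max_z=3.0):
    """Keep rows whose value in `column` is within max_z standard deviations of the mean."""
    z = (frame[column] - frame[column].mean()) / frame[column].std()
    return frame[z.abs() < max_z]

# Hypothetical toy data: twenty ordinary films and one extreme value
toy = pd.DataFrame({"Gross": [50.0] * 10 + [60.0] * 10 + [5000.0]})
trimmed = drop_outliers(toy, "Gross")
print(len(toy), len(trimmed))
```

One caveat of this rule is that the mean and standard deviation are themselves pulled up by the extreme values, so in small samples a z-score threshold can miss outliers that a median-based rule would catch.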
Over time, movie tastes would have changed, meaning the characteristics of a profitable movie would have also changed, so it didn't make sense for us to compare movies from significantly different eras.
Instead of adjusting for inflation, we chose to simply focus on movies post-2000. This time period restriction was placed so that gross revenue comparisons wouldn’t be significantly impacted by the rates of inflation, and to account for changes in movie-going habits.
In [227]:
df_budget = df_budget.sort_values('Year', ascending=False).iloc[:6000]
df_budget.head()
Out[227]:
In [228]:
yearlist = range(2000,2017)
df_newbudget = df_budget[df_budget['Year'].isin(yearlist)]
df_newbudget.tail()
Out[228]:
In [229]:
sns.regplot(x = 'Gross', y = 'Budget', data=df_newbudget)
Out[229]:
In [230]:
regression = ols(y=df_newbudget['Budget'],x=df_newbudget['Gross'])
regression
Out[230]:
The R-squared now increased to 0.51, showing a relatively strong correlation between budget and box office grosses. This is consistent with our initial hypothesis that higher production costs, on average, go with higher revenues (given the amount invested in star power, marketing, etc.).
To better visualize this relationship, we grouped different budget ranges into buckets (0-1, 1-10, 10-50, 50-100, 100-150, and 150-300) and generated a bar chart of gross revenues.
In [231]:
df_budget = df_budget.sort_values('Budget', ascending=False).iloc[:6000]
df_budget.head()
Out[231]:
In [232]:
df_budget.tail()
Out[232]:
In [233]:
bins = [0,1,10,50,100,150,301]
group_names = ['0-1','1-10','10-50','50-100','100-150','150-300']
categories = pd.cut(df_budget['Budget'], bins, labels=group_names)
df_budget['Avg Budget'] = pd.cut(df_budget['Budget'], bins, labels=group_names)
categories.head(5)
Out[233]:
In [234]:
df_budget.tail()
Out[234]:
In [235]:
df_budget['Avg Budget'].value_counts()
Out[235]:
In [236]:
f, ax = plt.subplots(figsize=(10, 5))
Budget = sns.barplot(x="Avg Budget", y="Gross", data=df_budget, palette="Set3")
Budget.axes.set_title('Average Budget v. Gross Revenue', fontsize=15,color="r",alpha=0.5)
Out[236]:
As can be seen above, the higher the budget range, the more financially successful movies are, on average.
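By default `sns.barplot` draws the mean of the y variable for each category, so the bar heights above can also be read off as a table. The sketch below reproduces the bucketing and per-bucket mean on hypothetical rows; the column name `Budget Range` is ours, and the figures are not from the dataset.

```python
import pandas as pd

bins = [0, 1, 10, 50, 100, 150, 301]
labels = ['0-1', '1-10', '10-50', '50-100', '100-150', '150-300']

# Hypothetical budgets and grosses in millions (not from the dataset)
toy = pd.DataFrame({
    "Budget": [0.5, 5.0, 10.0, 40.0, 120.0, 200.0],
    "Gross":  [0.2, 8.0, 12.0, 55.0, 150.0, 400.0],
})

# pd.cut uses right-closed intervals by default, so a budget of exactly 10
# falls in the '1-10' bucket rather than '10-50'.
toy["Budget Range"] = pd.cut(toy["Budget"], bins, labels=labels)

# Mean gross per bucket; observed=False keeps empty buckets in the result.
mean_gross = toy.groupby("Budget Range", observed=False)["Gross"].mean()
print(mean_gross)
```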
Next, we chose to look at the relationship between the IMDB Score and Gross Revenue. The IMDB score is a weighted average of the votes a movie received from IMDB users. It is meant to reflect how the general public receives a movie.
Understanding this relationship is important because it can highlight whether the quality of the movie produced should be considered. Although scores are tallied after the movie is initially released, many people look to the rating to decide what film to watch, which will likely affect how profitable a movie is after the initial opening weekend.
In [237]:
df.head()
Out[237]:
In [238]:
df_review = df.drop(['Director','Actor 1','Actor 2','Actor 3','Genre','Rating','Year','Budget','Country'], axis=1)
In [239]:
df_review.head()
Out[239]:
In [240]:
IMDB = sns.regplot(y="Gross", x="IMDB Score", data=df_review)
IMDB.axes.set_title('Gross Revenue vs. IMDB Score', fontsize=15,color="r",alpha=0.5)
Out[240]:
In [241]:
regression_IMBD = ols(x=df_review['IMDB Score'],y=df_review['Gross'])
regression_IMBD
Out[241]:
In [243]:
bins = [0.5,1.5,2.5,3.5,4.5,5.5,6.5,7.5,8.5,9.5]
group_names = [1,2,3,4,5,6,7,8,9]
categories2 = pd.cut(df_review['IMDB Score'], bins, labels=group_names)
df_review['Avg_Score'] = pd.cut(df_review['IMDB Score'], bins, labels=group_names)
categories2.head()
Out[243]:
In [272]:
f, ax = plt.subplots(figsize=(10, 5))
AvgScore = sns.boxplot(x="Avg_Score", y="Gross", data=df_review, palette="Set3")
AvgScore.axes.set_title('Average Score v. Gross Revenue', fontsize=15,color="r",alpha=0.5)
Out[272]:
The R-squared indicates that there is very little linear relationship between IMDB scores and gross revenue. However, the box plot shows that movies that are rated highly still tend to generate higher revenues. The outliers highlight that the highest grossing movies don't necessarily have the highest scores.
In [274]:
df_review.head(10)
Out[274]:
This statement is supported by a quick look at the top 10 grossing movies in our dataset. Of the 10, only two received a score in the 9 range.
Regardless, on average, highly rated movies tend to have higher gross revenues.
In [58]:
df_rating = df.drop(['Director','Actor 1','Actor 2','Actor 3','Country','Genre','Year','Budget','IMDB Score'], axis=1)
df_rating.head(5)
Out[58]:
In [246]:
f, ax = plt.subplots(figsize=(11, 6))
Rating = sns.barplot(x="Rating", y="Gross", data=df,palette="Set3")
Since the 1980s, the MPAA has only used the following ratings, which we are going to focus on:
G – General Audiences
PG – Parental Guidance Suggested
PG-13 – Parents Strongly Cautioned
R – Restricted
NC-17 – Adults Only
In [60]:
value_list = ['PG-13','PG','R','G','NC-17']
df_rating = df_rating[df_rating.Rating.isin(value_list)]
df_rating.head()
Out[60]:
In [61]:
f, ax = plt.subplots(figsize=(11, 6))
Rating = sns.barplot(x="Rating", y="Gross", data=df_rating, order=['G','PG','PG-13','R','NC-17'], palette="Set3")
Rating.axes.set_title('Rating vs. Gross Revenue', fontsize=15,color="r",alpha=0.5)
Out[61]:
We found that, on average, movies with a G rating tend to perform best. This can be explained by the fact that movies rated G attract a wider range of audiences, given that the rating doesn't impose any age restrictions. The more restrictive the rating is, the lower the average box office revenues are.
In [251]:
f, ax = plt.subplots(figsize=(11, 6))
Rating = sns.boxplot(x="Rating", y="Gross", data=df_rating, order=['G','PG','PG-13','R','NC-17'], palette="Set3")
Rating.axes.set_title('Rating vs. Gross Revenue', fontsize=15,color="r",alpha=0.5)
Out[251]:
While age restrictions tend to lower average grosses, Hollywood's highest grossing productions have often been rated PG-13. It’s not a coincidence that most films are rated PG-13: not only are they the most lucrative, but, according to The Wrap, "life is unrated, and telling a story right frequently requires including sex or violence. Try to find a G-rated war movie, for example."
We then looked at different movie genres to see whether they affect box office receipts. One issue, however, is that our dataset categorizes each movie under a combination of several genres, making it difficult to analyze each genre separately. A little manipulation allowed us to fix this problem.
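The cells that follow do this reshaping step by step with `split`, `concat` and `stack`. As a compact alternative sketch, assuming pandas 0.25 or later (where `explode` is available), the same long format of one row per movie and genre could be produced as below. The toy rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows using the same pipe-separated Genre format
toy = pd.DataFrame(
    {"Gross": [760.5, 100.0],
     "Genre": ["Action|Adventure|Fantasy", "Drama"]},
    index=pd.Index(["Film A", "Film B"], name="Movie Name"),
)

# Split each genre string into a list, then give every genre its own row.
long_form = (
    toy.assign(Genre=toy["Genre"].str.split("|"))
       .explode("Genre")
       .reset_index()
)
print(long_form)
```

Note that in this layout a film's gross is counted once for every genre it lists, so the per-genre averages overlap rather than partition the data.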
In [253]:
df['Genre'].head()
Out[253]:
In [254]:
df_genre = df['Genre'].str.split('|', expand=True).astype(str)
df_genre.head()
Out[254]:
In [255]:
df_genre.columns = ['Genre1','Genre2','Genre3','Genre4','Genre5','Genre6','Genre7','Genre8']
In [256]:
df_gross = df.drop(['Director','Actor 1','Actor 2','Actor 3','Genre','Number of Votes','Country','Rating','Budget','IMDB Score','Year'], axis=1)
df_gross.head(10)
Out[256]:
In [257]:
df_genre = pd.concat([df_gross, df_genre], axis=1)
df_genre.head()
Out[257]:
In [258]:
df_genre = df_genre.reset_index()
df_genre.head()
Out[258]:
In [259]:
df_genre.columns = ['Movie Name','Gross','Genre1','Genre2','Genre3','Genre4','Genre5','Genre6','Genre7','Genre8']
df_genre.head()
Out[259]:
In [260]:
df_genre.set_index(['Movie Name','Gross'],inplace=True)
In [261]:
df_genre = df_genre.stack()
In [262]:
df_genre = df_genre.to_frame('Genre')  # stacked Series -> one-column DataFrame
df_genre.head(10)
Out[262]:
In [263]:
df_genre.reset_index(inplace=True)
df_genre.head()
Out[263]:
In [264]:
df_genre['Genre'].value_counts()
Out[264]:
In [266]:
vlist = ['Action', 'Adventure','Fantasy','Drama','Romance','Thriller','Crime','Animation','Comedy','Family','Musical','Biography','History','War','Mystery','Horror','Sport','Music','Western','Documentary']
df_genre = df_genre[df_genre['Genre'].isin(vlist)]
In [284]:
f, ax = plt.subplots(figsize=(20, 10))
Genre = sns.barplot(x="Genre", y="Gross", data=df_genre, palette="Set3")
Genre.axes.set_title('Genre v. Gross', fontsize=25,color="r",alpha=0.5)
Out[284]:
We found that Animation movies tend to gross the most, followed by Family, Fantasy, Adventure and Action.
Animation and Family movies attract a wider pool of people given their non-restrictive G rating (animation movies are also geared towards kids, who are most often accompanied by guardians, which increases the number of tickets sold).
Action, Adventure and Fantasy movies, on the other hand, provide more of a movie-going experience (special effects, elaborate action scenes, etc.), not only attracting more people, but also charging more per ticket (IMAX or 3D movies tend to be more expensive). Further, these types of movies are often big-budget Hollywood productions, which ties back to our first conclusion that big budget films tend to generate higher revenues.
However, outside of the top 5, there is very little variation in the average revenue across the other genres. This indicates that, unless a movie falls in one of the top 5 genres, its genre will probably not affect how successful it is domestically.
Our dataset does not contain any indicators of director popularity (the "Facebook Likes" column contains too many wrong or missing values). Therefore, we focused on the ones who directed the highest number of movies. Studios only hire directors to direct again if their movies have been successful in the past or if they are famous, with the hopes that the director will repeat his or her success or the value of their celebrity will attract enough attention to the movie to generate profit.
In [275]:
df.head()
Out[275]:
In [276]:
df_director = df.drop(['Actor 1','Actor 2','Actor 3','Number of Votes','Genre','Country','Rating','IMDB Score','Budget','Year'], axis=1)
df_director.head()
Out[276]:
In [277]:
df_director['Value Counts'] = df_director.groupby('Director')['Director'].transform('count')
df_director.head()
Out[277]:
In [278]:
df_director['Director'].value_counts()
Out[278]:
In [279]:
df_director[df_director['Director'].str.contains("Steven Spielberg",na=False)].head()
Out[279]:
In [280]:
bins = [0,12,30]
group_names = ['Less than 12','More than 12']
categories = pd.cut(df_director['Value Counts'], bins, labels=group_names)
df_director['Popularity of Director'] = pd.cut(df_director['Value Counts'], bins, labels=group_names)
categories.head(5)
Out[280]:
In [281]:
df_director['Popularity of Director'].value_counts()
Out[281]:
In [85]:
f, ax = plt.subplots(figsize=(5, 7))
Fame = sns.barplot(x="Popularity of Director", y="Gross", data=df_director, palette="Set3")
Fame.axes.set_title('Popularity of Director v. Gross', fontsize=15,color="r",alpha=0.5)
Out[85]:
By comparing the revenue of movies from directors who made fewer films with that of directors who made the most films (our measure of how popular a director is), we see that there is a relationship between the popularity of the director and the success of the film. On average, directors who make a lot of films also tend to make films with higher revenues.
However, looking at this relationship in isolation can be misleading because it is unclear whether these directors simply make movies that are of higher quality so they generate high revenue, or they generate more revenue because the more recognizable the director is the more likely people are to go see the film.
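The director comparison rests on two steps: counting each director's films with `transform('count')`, then splitting directors into two groups with `pd.cut`. The sketch below shows both steps on hypothetical data; the threshold of 2 is chosen only so the toy example has rows in both groups.

```python
import pandas as pd

# Hypothetical directors and grosses (not from the dataset)
toy = pd.DataFrame({
    "Director": ["A"] * 3 + ["B"],
    "Gross": [100.0, 200.0, 300.0, 50.0],
})

# Number of films per director, repeated on each of that director's rows
toy["Value Counts"] = toy.groupby("Director")["Director"].transform("count")

# Right-closed bins: (0, 2] and (2, 30]
toy["Popularity of Director"] = pd.cut(
    toy["Value Counts"], [0, 2, 30], labels=["2 or fewer", "More than 2"]
)

means = toy.groupby("Popularity of Director", observed=False)["Gross"].mean()
print(means)
```

Because `pd.cut` closes intervals on the right by default, the notebook's bins `[0,12,30]` place a director with exactly 12 films in the first group, so the label 'Less than 12' effectively means 12 or fewer.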
We applied the same rationale to actors as we did to directors, and found very similar results.
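For actors, the three actor columns first have to be collapsed into a single column before appearances can be counted. The cells that follow do this with `stack`; a roughly equivalent sketch using `melt` is shown here on hypothetical rows. The column name `Appearances` is ours.

```python
import pandas as pd

# Hypothetical films and actors (not from the dataset)
toy = pd.DataFrame({
    "Movie Name": ["Film A", "Film B"],
    "Gross": [100.0, 40.0],
    "Actor 1": ["X", "X"],
    "Actor 2": ["Y", "Z"],
    "Actor 3": ["Z", None],
})

# One row per (movie, actor); rows with a missing actor are dropped
stars = (
    toy.melt(id_vars=["Movie Name", "Gross"],
             value_vars=["Actor 1", "Actor 2", "Actor 3"],
             value_name="Star")
       .dropna(subset=["Star"])
       .drop(columns="variable")
)

# Number of films each actor appears in, repeated on each of their rows
stars["Appearances"] = stars.groupby("Star")["Star"].transform("count")
print(stars.sort_values("Star"))
```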
In [87]:
df_stars = df.drop(['Director','Number of Votes','Genre','Country','Rating','IMDB Score','Budget','Year'], axis=1)
df_stars.head()
Out[87]:
In [88]:
df_stars = df_stars.reset_index()
df_stars.head()
Out[88]:
In [89]:
df_stars.set_index(['Movie Name','Gross'],inplace=True)
df_stars.head()
Out[89]:
In [90]:
df_stars = df_stars.stack()
df_stars.head(10)
Out[90]:
In [91]:
df_stars = df_stars.to_frame('Star')  # stacked Series -> one-column DataFrame
df_stars.head(10)
Out[91]:
In [92]:
df_stars.reset_index(inplace=True)
df_stars.head()
Out[92]:
In [93]:
df_stars['Star'].value_counts()
Out[93]:
In [94]:
df_stars['Star Popularity'] = df_stars.groupby('Star')['Star'].transform('count')
df_stars.head()
Out[94]:
In [95]:
df_stars[df_stars['Star'].str.contains("Robert De Niro",na=False)].head()
Out[95]:
In [96]:
bins = [0,22,60]
group_names = ['Less than 22','More than 22']
categories = pd.cut(df_stars['Star Popularity'], bins, labels=group_names)
df_stars['Star Popularity'] = pd.cut(df_stars['Star Popularity'], bins, labels=group_names)
categories.head(5)
Out[96]:
In [97]:
df_stars['Star Popularity'].value_counts()
Out[97]:
In [98]:
f, ax = plt.subplots(figsize=(5, 7))
StarPower = sns.barplot(x="Star Popularity", y="Gross", data=df_stars, palette="Set3")
StarPower.axes.set_title('Star Power vs. Gross', fontsize=15,color="r",alpha=0.5)
Out[98]:
Similarly to what we found with directors, movies featuring famous actors (i.e. actors who have starred in a higher number of films) tend to have higher box office revenues.
After analyzing all these factors, we found that it is difficult to generalize about the film industry as a whole. A huge number of variables can influence whether a movie is successful or not, which explains the industry's highly unpredictable nature. If there were a perfect formula, big Hollywood studios would never produce any flops.
We found, however, that there are some trends in predicting future success:
This, however, is a small proportion of all the variables that influence movie-going habits and box-office performance. It would be valuable to look at different factors such as month of release, macro-economic influences, or even the weather.