Get Out has been one of the most talked-about films of 2017 and, as of April 2017, is the highest-grossing debut film based on an original screenplay in history. We want to programmatically find out how Get Out ranks among other 2017 American films and which films earned the most revenue in 2017. This tutorial assumes readers have a basic working knowledge of Python.
Install the Python packages imported below (requests, pandas and matplotlib), ideally inside a virtualenv.
In addition to installing the above dependencies we will need to request an API key from The Movie DB (TMDB). TMDB has a free API to programmatically access information about movies.
In [702]:
import config # to hide TMDB API keys
import requests # to make TMDB API calls
import locale # to format currency as USD
locale.setlocale( locale.LC_ALL, '' )
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter # to format currency on charts axis
api_key = config.tmdb_api_key # get TMDB API key from config.py file
If you plan on committing your project to GitHub or another public repository and need help setting up config, you should read this article about using config to hide API keys.
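For reference, here is a minimal sketch of what such a config.py could look like. It assumes the key is kept in an environment variable named TMDB_API_KEY, a name chosen for illustration rather than anything TMDB requires, and load_tmdb_api_key is a hypothetical helper. Only the tmdb_api_key attribute matches what the import cell above reads.

```python
# config.py (hypothetical sketch, not the file used for this tutorial)
import os

def load_tmdb_api_key(env_var="TMDB_API_KEY"):
    """Return the key stored in the given environment variable, or '' if unset."""
    return os.environ.get(env_var, "")

# the notebook reads this attribute as config.tmdb_api_key
tmdb_api_key = load_tmdb_api_key()
```

Because the key lives in the environment in this variant, the file itself contains no secret.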
In this section we will request 2017 data from TMDB, store the JSON we receive in a dataframe and then use matplotlib to visualize our data.
In order to get the highest earning films from TMDB, an API request needs to be constructed to return films with a primary_release_year of 2017, sorted in descending order by revenue.
In [703]:
response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' + api_key + '&primary_release_year=2017&sort_by=revenue.desc')
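The query string in the request above can also be assembled from a dictionary, which makes it easier to see which filters are in play. The helper below is only a sketch: build_discover_url is a hypothetical name, and it uses just the parameters already shown in this tutorial (api_key, primary_release_year and sort_by).

```python
from urllib.parse import urlencode

DISCOVER_URL = 'https://api.themoviedb.org/3/discover/movie'

def build_discover_url(api_key, year=None, sort_by='revenue.desc'):
    """Build a TMDB discover URL; leave out the year to search all years."""
    params = {'api_key': api_key, 'sort_by': sort_by}
    if year is not None:
        params['primary_release_year'] = year
    return DISCOVER_URL + '?' + urlencode(params)

# 'YOUR_KEY' is a placeholder, not a real key
url_2017 = build_discover_url('YOUR_KEY', year=2017)
```

Passing url_2017 to requests.get should behave like the hand-built string. requests.get also accepts a params dictionary directly, which avoids building the string at all.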
In [704]:
highest_revenue = response.json() # store parsed json response
# uncomment the next line to get a peek at the highest_revenue json structure
# highest_revenue
highest_revenue_films = highest_revenue['results']
In [705]:
# define column names for our new dataframe
columns = ['film', 'revenue']
# create dataframe with film and revenue columns
df = pd.DataFrame(columns=columns)  # pandas was imported as pd above
Now to add the data to our dataframe we will need to loop through the data.
In [706]:
# for each of the highest revenue films make an api call for that specific movie to return the budget and revenue
for film in highest_revenue_films:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    # print(locale.currency(film_revenue['revenue'], grouping=True ))
    df.loc[len(df)]=[film['title'],film_revenue['revenue']] # store title and revenue in our dataframe
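Appending to an empty dataframe one row at a time works for a handful of films, but the columns can end up with the generic object dtype. One alternative, sketched here, is to collect plain dictionaries first and build the dataframe once at the end. The name films_to_rows and the sample records are made up for illustration, and the sample numbers are placeholders, not real box-office figures.

```python
def films_to_rows(film_details):
    """Turn TMDB movie-detail dictionaries into {'film', 'revenue'} rows."""
    rows = []
    for detail in film_details:
        rows.append({
            'film': detail.get('title'),
            # treat a missing revenue as 0 (an assumption of this sketch)
            'revenue': detail.get('revenue') or 0,
        })
    return rows

# placeholder records, not real data
sample = [{'title': 'Film A', 'revenue': 300}, {'title': 'Film B'}]
rows = films_to_rows(sample)
```

Calling pd.DataFrame(rows) on the result gives a dataframe with the same film and revenue columns, with revenue stored as integers.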
Below is what the dataframe head (top 5 lines) looks like after iterating through the films our API call returned.
In [707]:
df.head()
Out[707]:
We will create a horizontal bar chart using matplotlib to display the revenue earned for each film.
In [708]:
matplotlib.style.use('ggplot')
fig, ax = plt.subplots()
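df.plot(kind="barh", y='revenue', color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'], x='film', ax=ax)
# format x-axis in terms of currency
# NOTE: currency() is not shown in the original listing; this definition is an assumption
# based on the locale import above. It needs a locale with currency support (not the 'C' locale).
def currency(x, pos):
    return locale.currency(x, grouping=True)
formatter = FuncFormatter(currency)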
ax.xaxis.set_major_formatter(formatter)
ax.legend().set_visible(False)
avg = df['revenue'].mean()
# Add a line for the average
ax.axvline(x=avg, color='b', label='Average', linestyle='--', linewidth=1)
ax.set(title='American Films with Highest Revenue (2017)', xlabel='Revenue', ylabel='Film')
Out[708]:
In this section we will request all-time data from TMDB, store the JSON we receive in a dataframe and then use matplotlib to visualize our data. Our API call will be similar to the one we used in the previous section, but without &primary_release_year=2017.
In [709]:
response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=' + api_key + '&sort_by=revenue.desc')
highest_revenue_ever = response.json()
highest_revenue_films_ever = highest_revenue_ever['results']
columns = ['film', 'revenue', 'budget', 'release_date']
highest_revenue_ever_df = pd.DataFrame(columns=columns)
for film in highest_revenue_films_ever:
    # print(film['title'])
    film_revenue = requests.get('https://api.themoviedb.org/3/movie/'+ str(film['id']) +'?api_key='+ api_key+'&language=en-US')
    film_revenue = film_revenue.json()
    # print(film_revenue)
    # print(locale.currency(film_revenue['revenue'], grouping=True ))
    # A Lord of the Rings duplicate with bad data was being returned: https://www.themoviedb.org/movie/454499-the-lord-of-the-rings
    # Its budget was $281, which is far too low for a top-earning film. To be added to the dataframe,
    # a film's budget must therefore be greater than $281.
    if film_revenue['budget'] > 281:
        # print(film_revenue['budget'])
        # add film title, revenue, budget and release date to the dataframe;
        # the budget is stored as a negative number
        highest_revenue_ever_df.loc[len(highest_revenue_ever_df)]=[film['title'],film_revenue['revenue'], (film_revenue['budget'] * -1), film_revenue['release_date']]
highest_revenue_ever_df.head()
Out[709]:
In [710]:
highest_revenue_ever_df['gross'] = highest_revenue_ever_df['revenue'] + highest_revenue_ever_df['budget']
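Because the budget was stored as a negative number in the loop above, adding the two columns is the same as subtracting the budget from the revenue. Here is a tiny worked example with made-up numbers (not real film figures); gross_from_signed_budget is a hypothetical helper used only to show the arithmetic.

```python
def gross_from_signed_budget(revenue, signed_budget):
    """Profit when the budget is already stored as a negative value."""
    return revenue + signed_budget

revenue = 500                  # placeholder value
budget = 200                   # placeholder value
signed_budget = budget * -1    # same transformation the loop applies
gross = gross_from_signed_budget(revenue, signed_budget)
```

Here gross comes out equal to revenue - budget, which is what the gross column holds for every film.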
What does the dataframe look like now?
In [711]:
highest_revenue_ever_df.head()
Out[711]:
In [712]:
fig, ax = plt.subplots()
highest_revenue_ever_df.plot(kind="barh", y='revenue', color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'], x='film', ax=ax)
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.legend().set_visible(False)
ax.set(title='American Films with Highest Revenue (All Time)', xlabel='Revenue', ylabel='Film')
Out[712]:
In [713]:
fig, ax = plt.subplots()
highest_revenue_ever_df.plot(kind="barh", y='gross', color = ['#624ea7', '#599ad3', '#f9a65a', '#9e66ab', 'purple'], x='film', ax=ax)
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.legend().set_visible(False)
ax.set(title='Gross Profit of the American Films with Highest Revenue (All Time)', xlabel='Gross Profit', ylabel='Film')
Out[713]:
In [722]:
fig, ax = plt.subplots()
highest_revenue_ever_df.plot(kind='scatter', y='gross', x='budget', ax=ax)
formatter = FuncFormatter(currency)
ax.xaxis.set_major_formatter(formatter)
ax.yaxis.set_major_formatter(formatter)
ax.set(title='Profit vs Budget of the American Films with Highest Revenue (All Time)', xlabel='Budget', ylabel='Gross Profit')
Out[722]:
In [657]:
# Adding release year to dataframe
# highest_revenue_ever_df['year'] = pd.DatetimeIndex(highest_revenue_ever_df['release_date']).year
# print(highest_revenue_ever_df)
The above data and graphs do not account for inflation (the TMDB API returns revenue unadjusted for inflation), so the earnings of more recent films are weighted more heavily than those of their earlier counterparts. All-time comparisons should be adjusted for inflation, although over a shorter time period the adjustment might not be necessary. Older films would appear in the charts above if inflation were taken into account; as it is now, the oldest film on this list is Titanic, released in 1997.
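One common way to make such an adjustment is to scale each amount by the ratio of a price index in the target year to the index in the release year. The sketch below illustrates that idea only: adjust_for_inflation is a hypothetical helper, and the index values are placeholders chosen for easy arithmetic, not real CPI data.

```python
def adjust_for_inflation(amount, index_then, index_now):
    """Scale an amount by the ratio of two price-index values."""
    if index_then <= 0:
        raise ValueError('index_then must be positive')
    return amount * index_now / index_then

# placeholder index values, not real CPI figures
adjusted = adjust_for_inflation(1000000, index_then=100.0, index_now=150.0)
```

With real index values for each release year, the same function could be applied to the revenue and budget columns before plotting.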
Cover photo is Chris Washington, played by Daniel Kaluuya, from Get Out (Universal Pictures).