Hollywood hunks come and go, but every so often a star builds a lasting career out of blowing stuff up. Currently, there is no shortage of beefcake on the silver screen, with Chris Evans, Chris Hemsworth, and Chris Pratt all regularly starring in blockbuster films. There is no denying the bankability of the Chrises, but which Chris has staying power?
The now-defunct Grantland podcast had a “market correction” theory that its hosts applied to Hollywood actors. The idea is that there’s only room in the market for one A-list celebrity of a particular type, and that over time the market will choose its favorite. The hosts would compare two Hollywood actors with similar “types” and predict which one would still have a career in 20 years.
Using data from Box Office Mojo, we decided to test the market correction theory on the Chrises by comparing the box office numbers of their biggest hits to those of heroes from the days of yore: Tom Cruise, Arnold Schwarzenegger, and Bruce Willis. We were looking for patterns in the box office receipts of the old guard that might shed some light on which Chris will be on top in 2035, and to see whether any of the box office heroes of yesteryear had a little more staying power than the others.
In [115]:
#This guided coding exercise requires the associated .csv files: CE1.csv, CH1.csv, CP1.csv, Arnold1.csv, Bruce1.csv, and Tom1.csv
#make sure you have these supplemental materials ready to go in your working directory before proceeding
#Let's start coding! We first need to make sure our preliminary packages are in order. We imported the following...
#some may have ended up superfluous, but we figured it was better to cover our bases!
import pandas as pd
import sys
import matplotlib as mpl
import matplotlib.pyplot as plt
import os
import datetime as dt
import csv
import requests, io
from bs4 import BeautifulSoup
%matplotlib inline
print('\nPython version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Requests version: ', requests.__version__)
print("Today's date:", dt.date.today())
Methodology
To dive into which Chris will have staying power in years to come, we looked to the authoritative Hollywood data source BoxOfficeMojo.com. A bit of simple web scraping gave us film titles broken out by actor, with adjusted box office revenues in tow.
We wanted to aggregate data for our "Three Chrises" and compare it to three Hollywood legends who have had variable staying power over the years: Bruce Willis, Tom Cruise, and Arnold Schwarzenegger.
Digging up data on our leading gentlemen
Cells that follow show our process for scraping and organizing the data for the Chris contenders.
In [2]:
# data scraped from Box Office Mojo, the authoritative source for Hollywood Box Office Data
# chris evans
url = 'http://www.boxofficemojo.com/people/chart/?view=Actor&id=chrisevans.htm'
evans = pd.read_html(url)
print('Output has type', type(evans), 'and length', len(evans))
print('First element has type', type(evans[0]))
#we have a list of dataframes, and the cut of data we want is represented by the below
evans[2]
Out[2]:
In [7]:
ce=evans[2]
print("type=", type(ce)," ", "length=", len(ce), "shape=", ce.shape)
print(ce)
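The position of the film table in the scraped list is not the same for every actor (index 2 for some pages, 3 for others), so the right index has to be found by inspection each time. As a rough sketch, one could instead search the list for the table that mentions a known column label. `pick_table` is a hypothetical helper, the marker text `"Adjusted Gross"` is an assumption about what the film table contains, and the three small frames below are placeholders standing in for the output of `pd.read_html`.

```python
import pandas as pd

def pick_table(tables, marker="Adjusted Gross"):
    """Return the first table whose header or cells mention `marker`.

    Hypothetical helper: `marker` is an assumed label, not a confirmed
    property of the scraped page.
    """
    for table in tables:
        in_header = any(marker in str(col) for col in table.columns)
        in_cells = any(marker in str(value) for value in table.to_numpy().ravel())
        if in_header or in_cells:
            return table
    raise ValueError("no table mentions " + repr(marker))

# Placeholder tables standing in for the list returned by pd.read_html(url)
tables = [
    pd.DataFrame([["Home", "Charts"]]),                       # navigation residue
    pd.DataFrame([["Actor summary"]]),                        # summary box
    pd.DataFrame([["Date", "Title", "Adjusted Gross"],        # film table with its
                  ["5/4/12", "Placeholder Film", "$1"]]),     # header in row 0
]

films = pick_table(tables)
print(films)
```

This avoids hard-coding a different index for each actor, at the cost of depending on the marker text staying stable.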
In [4]:
ce.to_csv("ce.csv")
#since scraped dataset is small, and had a tricky double index, we decided to export to csv and do a quick cleanup there
#removed indices; cleaned titles; cleaned date
#Clean File saved as CE1.csv
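The cleanup described in the comments above was done by hand in a spreadsheet. For readers who would rather stay in pandas, here is a minimal sketch of the same three steps (drop the stray index, fix the header, turn dates and dollar strings into numbers). It is not the process we actually used: it assumes the scraped table carries its real header in the first row and has columns named `Date`, `Title`, and `Adjusted Gross`. `clean_scraped` is a hypothetical helper, and the two rows are illustrative only (the grosses are the figures quoted in this post).

```python
import pandas as pd

# Illustrative stand-in for a scraped table: the real header sits in row 0
raw = pd.DataFrame([
    ["Date", "Title", "Adjusted Gross"],
    ["5/4/12", "Marvel's The Avengers", "$659,640,800"],
    ["6/12/15", "Jurassic World", "$678,242,100"],
])

def clean_scraped(raw):
    """Promote row 0 to the header and convert text columns to numbers."""
    df = raw.iloc[1:].reset_index(drop=True)   # drop the header row from the data
    df.columns = raw.iloc[0].tolist()          # use it as the column names instead
    # "$659,640,800" -> 659640800
    df["Adjusted Gross"] = (
        df["Adjusted Gross"].str.replace(r"[$,]", "", regex=True).astype("int64")
    )
    # "5/4/12" -> 2012 (assumes month/day/two-digit-year dates)
    df["Release Year"] = pd.to_datetime(df["Date"], format="%m/%d/%y").dt.year
    return df[["Title", "Release Year", "Adjusted Gross"]]

cleaned = clean_scraped(raw)
print(cleaned)
```

Saving the result with `cleaned.to_csv("CE1.csv", index=False)` would leave out the extra index column that the manual cleanup removed.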
In [33]:
#this is the path for my machine; you'll have to link to the CE1.csv file that you've saved on your machine
path='C:\\Users\\Nick\\Desktop\\Data_Bootcamp\\Final Project\\CE1.csv'
CE = pd.read_csv(path)
print(type(CE), "shape is", CE.shape, "types:", CE.dtypes)
print(CE) #this is going to be much better for us to work with
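Because the path above only exists on one machine, a small loader that takes the data folder as an argument can make the notebook easier to rerun elsewhere. This is a sketch under assumptions: `load_actor_csv` is a hypothetical helper, and the demo writes a one-row placeholder file into a temporary folder purely so the example runs anywhere.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_actor_csv(filename, data_dir="."):
    """Read one cleaned actor file from data_dir (hypothetical helper)."""
    return pd.read_csv(Path(data_dir) / filename)

# Demo only: create a throwaway folder holding a placeholder CE1.csv
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "CE1.csv").write_text(
    "Title,Release Year,Adjusted Gross\nPlaceholder Film,2012,659640800\n"
)

CE_demo = load_actor_csv("CE1.csv", demo_dir)
print(CE_demo.dtypes)
```

With the real files in the working directory, `load_actor_csv("CE1.csv")` would be enough, and `pathlib` takes care of the slash differences between Windows and macOS.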
In [14]:
#this looks good! let's test and make sure the data makes sense with a simple plot:
CE.plot.scatter('Release Year', 'Adjusted Gross')
Out[14]:
In [67]:
#we love what we see, let's repeat it for our other leading gentlemen
“The Heartthrob”
Age: 32
Height: 6’ 3”
Known for: Thor; The Avengers; Snow White and the Huntsman
Legit Roles: Rush
Biggest Hit: Marvel’s The Avengers $659,640,800
Biggest Thor Movie: $212,276,600
In [88]:
# same process for our second leading Chris
# chris hemsworth
url = 'http://www.boxofficemojo.com/people/chart/?view=Actor&id=chrishemsworth.htm'
hemsworth = pd.read_html(url)
print('Output has type', type(hemsworth), 'and length', len(hemsworth))
print('First element has type', type(hemsworth[0]))
hemsworth[3]
Out[88]:
In [87]:
ch=hemsworth[3]
print("type=", type(ch)," ", "length=", len(ch), "shape=", ch.shape)
print(ch)
ch.to_csv("ch.csv")
#since scraped dataset is small, and had a tricky double index, we decided to export to csv and do a quick cleanup there
#Cleaned File saved as CH1.csv
path='C:\\Users\\Nick\\Desktop\\Data_Bootcamp\\Final Project\\CH1.csv'
#again, this is the path on my machine, you'll want to make sure you adjust to wherever you saved down CH1
CH = pd.read_csv(path)
print(type(CH), "shape is", CH.shape, "types:", CH.dtypes)
CH.plot.scatter('Release Year', 'Adjusted Gross')
Out[87]:
Our data looks good! The axes are a little strange, but we just want to make sure we have data we can work with!
“The Everyman”
Age: 36
Height: 6’ 2”
Known for: Guardians of the Galaxy ($353,303,500); Jurassic World (one released, one in pre-production); Parks & Rec (TV)
Legit Roles: Her, Moneyball
Biggest Role: Jurassic World $678,242,100
In [ ]:
# Chris number three, coming through!
# chris pratt
url = 'http://www.boxofficemojo.com/people/chart/?view=Actor&id=chrispratt.htm'
pratt = pd.read_html(url)
print('Output has type', type(pratt), 'and length', len(pratt))
print('First element has type', type(pratt[0]))
pratt[3]
In [90]:
cp=pratt[3]
print("type=", type(cp)," ", "length=", len(cp), "shape=", cp.shape)
print(cp)
cp.to_csv("cp.csv")
#since scraped dataset is small, and had a tricky double index, we decided to export to csv and do a quick cleanup there
#Cleaned File saved as CP1.csv
path='C:\\Users\\Nick\\Desktop\\Data_Bootcamp\\Final Project\\CP1.csv'
#remember to adjust path to where you've saved the .csv down
CP = pd.read_csv(path)
print(type(CP), "shape is", CP.shape, "types:", CP.dtypes)
CP.plot.scatter('Release Year', 'Adjusted Gross')
Out[90]:
Now that we've got that sorted out, let's take a look at all three Chrises together. How do their box office titles stack up with one another over time?
In [80]:
plt.scatter(CE['Release Year'], CE['Adjusted Gross'],
color="purple")
plt.scatter(CH['Release Year'], CH['Adjusted Gross'],
color="red")
plt.scatter(CP['Release Year'], CP['Adjusted Gross'],
color="orange")
plt.title('Chris Film Box Office Share Over Time')
Out[80]:
In the graph above, we color coded our Chris contingent as follows:
Chris Evans: Purple
Chris Hemsworth: Red
Chris Pratt: Orange
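The color key above can also be drawn on the chart itself by passing `label=` to each `scatter` call and then calling `legend()`. The sketch below uses tiny placeholder frames in place of `CE`, `CH`, and `CP` (the numbers are made up for illustration, not box office data), and the axis labels are our assumption about what the columns represent.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder frames standing in for CE, CH, and CP (values are invented)
demo = {
    "Chris Evans": (pd.DataFrame({"Release Year": [2005, 2012],
                                  "Adjusted Gross": [2.0e8, 6.0e8]}), "purple"),
    "Chris Hemsworth": (pd.DataFrame({"Release Year": [2011, 2012],
                                      "Adjusted Gross": [1.9e8, 6.0e8]}), "red"),
    "Chris Pratt": (pd.DataFrame({"Release Year": [2014, 2015],
                                  "Adjusted Gross": [3.5e8, 6.5e8]}), "orange"),
}

fig, ax = plt.subplots()
for name, (df, color) in demo.items():
    # label= is what legend() picks up for each series
    ax.scatter(df["Release Year"], df["Adjusted Gross"], color=color, label=name)

ax.set_xlabel("Release Year")
ax.set_ylabel("Adjusted Gross ($)")
ax.set_title("Chris Film Box Office Share Over Time")
ax.legend()

legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
```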
A few things stand out. First, we can see right away that Chris Evans has, to date, had the longest career at the box office, dating back to 2001. Does this maybe suggest some longevity right off the bat? We're not so quick to draw that conclusion, especially since his biggest box office hit is shared with Chris Hemsworth in the Marvel Avengers movie.
Looking back at our raw data, we can also note that Pratt seems to have had the biggest breakout hit with 2015's Jurassic World, one of the top-grossing films of all time, where he was the sole leading man.
This data gives us one view, but what other cuts might we want to look at?
In [108]:
fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=True)
CE['Adjusted Gross'].head(10).plot(kind="bar",ax=ax[0], color='purple', title="Evans")
CH['Adjusted Gross'].head(10).plot(kind="bar",ax=ax[1], color='red', title="Hemsworth")
CP['Adjusted Gross'].head(10).plot(kind="bar",ax=ax[2], color='orange', title="Pratt")
Out[108]:
In the above, we take a look at the box office grosses for the top 10 films for each Chris. Here, we start to wonder if maybe Evans has a more consistent box office performance. Of his top 10 films, 9 are in the $200 million range, a stat unmatched by our other two gentlemen.
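A consistency claim like this one can be checked with a short count rather than by reading bars. The sketch below reads "in the $200 million range" as "at or above $200 million", which is our assumption; `count_hits` is a hypothetical helper and the demo values are placeholders, not any actor's real grosses.

```python
import pandas as pd

def count_hits(df, threshold=200_000_000, top_n=10):
    """Count how many of the top_n films by adjusted gross reach the threshold."""
    top = df["Adjusted Gross"].nlargest(top_n)
    return int((top >= threshold).sum())

# Placeholder data: three films at or above the bar, two below it
demo = pd.DataFrame({"Adjusted Gross": [650e6, 250e6, 210e6, 90e6, 40e6]})

hits = count_hits(demo)
print(hits)
```

Running `count_hits(CE)`, `count_hits(CH)`, and `count_hits(CP)` on the cleaned files would put the three actors on the same footing.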
This is an interesting insight, but what does it look like over time?
In [89]:
plt.bar(CE['Release Year'], CE['Adjusted Gross'],
align='center',
color='pink')
plt.title('Chris Evans')
Out[89]:
Buoyed by franchise films in the last five years, Chris Evans has been a steady player, but hasn't excelled outside the Marvel universe franchises. All his biggest hits are as a member of a franchise / ensemble. Evans's Marvel hits since 2011 have performed well, though non-Marvel titles have largely been blips on the radar.
In [85]:
plt.bar(CH['Release Year'], CH['Adjusted Gross'],
align='center',
color='red')
plt.title("Chris Hemsworth")
Out[85]:
Hemsworth had a very rough 2015. He featured prominently in 4 films, only one of which was a box office success (another Marvel Avengers installment). After a breakout 2012, are the tides turning after major flops like In the Heart of the Sea?
In [86]:
plt.bar(CP['Release Year'], CP['Adjusted Gross'],
align='center',
color='orange')
plt.title("Chris Pratt")
Out[86]:
Pratt may have been a slower starter than our other leading gentlemen, but his 2014 breakout Guardians of the Galaxy cemented his leading-man potential, and 2015's Jurassic World broke tons of box office records. As a non-Marvel film (though a franchise reboot), Jurassic World is unique in that it may be a standalone hit for Pratt, and everyone will be closely watching his box office performance in whatever leading-man project he chooses next.
In [120]:
plt.bar(CE['Release Year'], CE['Adjusted Gross'],
align='center',
color='purple')
plt.bar(CH['Release Year'], CH['Adjusted Gross'],
align='center',
color='red')
plt.bar(CP['Release Year'], CP['Adjusted Gross'],
align='center',
color='orange')
plt.title('Chris Film Box Office Share Over Time')
Out[120]:
We love this data cut. Here, we take a comparative look at our Chrises over time. Keeping our colors consistent, Evans is purple, Hemsworth is red, Pratt is orange.
One slight issue: bars that share a release year are drawn on top of one another, so for movies where both Hemsworth and Evans were cast (The Avengers), only the color plotted last is visible. Here's a flipped view:
In [121]:
plt.bar(CH['Release Year'], CH['Adjusted Gross'],
align='center',
color='red')
plt.bar(CE['Release Year'], CE['Adjusted Gross'],
align='center',
color='purple')
plt.bar(CP['Release Year'], CP['Adjusted Gross'],
align='center',
color='orange')
plt.title('Chris Film Box Office Share Over Time')
Out[121]:
Whoa! Where did Hemsworth go?
What these two cuts show us is that Evans and Hemsworth are both heavily reliant on their Marvel franchise hits, where they are sharing the limelight, whereas Pratt has been more of a solo vehicle, especially in more recent years.
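One way around the disappearing bars is to nudge each actor's bars sideways so that films from the same year sit next to each other instead of on top of each other. The sketch below is a rough illustration with placeholder frames (invented values, not the scraped data); the bar width of 0.25 and the offsets are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder frames standing in for CE, CH, and CP (values are invented)
evans = pd.DataFrame({"Release Year": [2012, 2014], "Adjusted Gross": [6.0e8, 2.7e8]})
hemsworth = pd.DataFrame({"Release Year": [2012, 2013], "Adjusted Gross": [6.0e8, 2.1e8]})
pratt = pd.DataFrame({"Release Year": [2014, 2015], "Adjusted Gross": [3.5e8, 6.5e8]})

series = [("Chris Evans", evans, "purple"),
          ("Chris Hemsworth", hemsworth, "red"),
          ("Chris Pratt", pratt, "orange")]

width = 0.25
fig, ax = plt.subplots()
for i, (name, df, color) in enumerate(series):
    offset = (i - 1) * width  # -0.25, 0, +0.25 around each year
    ax.bar(df["Release Year"] + offset, df["Adjusted Gross"],
           width=width, color=color, label=name)

ax.set_title("Chris Film Box Office Share Over Time")
ax.legend()

bar_count = len(ax.patches)  # one rectangle per film row
```

Two films by the same actor in the same year would still overlap with this approach, so it fixes the shared-film case rather than every collision.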
In [122]:
#Movie scraping and data arranging like we did before
#Bruce Willis
url = 'http://www.boxofficemojo.com/people/chart/?id=brucewillis.htm'
willis = pd.read_html(url)
print('Output has type', type(willis), 'and length', len(willis))
print('First element has type', type(willis[0]))
willis[2]
Out[122]:
In [123]:
bruce=willis[2]
bruce.to_csv("Bruce.csv") #Converting dataframe into a csv file
#editing and cleaning as needed, resaved as Bruce1.csv
In [124]:
path='/Users/Nick/Desktop/data_bootcamp/Final Project/Bruce1.csv'
BWillis = pd.read_csv(path)
print(type(BWillis), BWillis.shape, BWillis.dtypes)
In [126]:
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
In [127]:
BWillis.plot.scatter('Release Year', 'Adjusted Gross')
Out[127]:
In [129]:
#That's a lot of films! Let's narrow:
BW=BWillis.head(11)
print(BW)
In [131]:
#we'll come back to this later, but let's get our other leading men in the frame!
In [132]:
#here we go again!
#Arnold Schwarzenegger
url = 'http://www.boxofficemojo.com/people/chart/?id=arnoldschwarzenegger.htm'
schwarz = pd.read_html(url)
print('Output has type', type(schwarz), 'and length', len(schwarz))
print('First element has type', type(schwarz[0]))
schwarz[2]
Out[132]:
In [133]:
arnold=schwarz[2]
print("type=", type(arnold)," ", "length=", len(arnold), "shape=", arnold.shape)
print(arnold)
In [134]:
arnold.to_csv("Arnold.csv")
In [135]:
path='/Users/Nick/Desktop/data_bootcamp/Final Project/Arnold1.csv'
ASchwarz = pd.read_csv(path)
print(type(ASchwarz), ASchwarz.shape, ASchwarz.dtypes)
print(ASchwarz)
In [136]:
ASchwarz.plot.scatter('Release Year', 'Adjusted Gross')
Out[136]:
In [137]:
#let's scale back sample size again
AS=ASchwarz.head(11)
#we'll use this soon
In [138]:
#last but not least, our data for Tom Cruise
url = 'http://www.boxofficemojo.com/people/chart/?id=tomcruise.htm'
cruise = pd.read_html(url)
print('Output has type', type(cruise), 'and length', len(cruise))
print('First element has type', type(cruise[0]))
cruise[3]
Tom=cruise[3]
Tom.to_csv("Tom.csv")
In [139]:
path='/Users/Nick/Desktop/data_bootcamp/Final Project/Tom1.csv'
TCruise = pd.read_csv(path)
print(type(TCruise), TCruise.shape, TCruise.dtypes)
print(TCruise)
In [140]:
TCruise.plot.scatter('Release Year', 'Adjusted Gross')
Out[140]:
In [141]:
#cutting down to the first 11 rows, matching the cut we took for Willis and Schwarzenegger
TC=TCruise.head(11)
In [143]:
#All of the old school action stars in one bar chart, showing adjusted gross by release year.
plt.bar(TC['Release Year'],
TC['Adjusted Gross'],
align='center',
color='Blue')
plt.bar(BW['Release Year'],
BW['Adjusted Gross'],
align='center',
color='Green')
plt.bar(AS['Release Year'],
AS['Adjusted Gross'],
align='center',
color='Yellow')
plt.title('"OG" Leading Box Office over Time')
Out[143]:
LEGEND:
Tom Cruise = Blue
Bruce Willis = Green
Arnold Schwarzenegger = Yellow
In [145]:
#As a reminder, here's what we are comparing against:
fig, ax = plt.subplots(nrows=3, ncols=1, sharex=True, sharey=True)
CE['Adjusted Gross'].head(10).plot(kind="bar",ax=ax[0], color='purple', title="Evans")
CH['Adjusted Gross'].head(10).plot(kind="bar",ax=ax[1], color='red', title="Hemsworth")
CP['Adjusted Gross'].head(10).plot(kind="bar",ax=ax[2], color='orange', title="Pratt")
Out[145]:
In [146]:
plt.bar(CE['Release Year'], CE['Adjusted Gross'],
align='center',
color='purple')
plt.bar(CH['Release Year'], CH['Adjusted Gross'],
align='center',
color='red')
plt.bar(CP['Release Year'], CP['Adjusted Gross'],
align='center',
color='orange')
plt.title('Chris Film Box Office Share Over Time')
Out[146]:
LEGEND:
Chris Evans = Purple
Chris Hemsworth = Red
Chris Pratt = Orange
Tom Cruise (blue) has obvious staying power, with films raking in over $200 million across two decades. Arnold's biggest films are clustered in a 10-year period. Bruce Willis also had clusters of hits, with his biggest successes in the late nineties. If our Chrises want to stay relevant in 2035, they'll need to adopt the "slow and steady wins the race" strategy of Tom Cruise (as long as slow and steady comes with strong receipts).
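The "staying power" comparison above is made by eye from the bar charts. One rough way to put a number on it is to measure the gap between an actor's first and last film above some bar. The sketch below is our own illustrative metric, not something used in the analysis: `hit_span` is a hypothetical helper, the $200 million threshold is an assumption borrowed from the paragraph above, and both demo frames contain invented values.

```python
import pandas as pd

def hit_span(df, threshold=200_000_000):
    """Return (first_year, last_year, span) for films at or above threshold.

    Returns None when no film reaches the threshold.
    """
    years = df.loc[df["Adjusted Gross"] >= threshold, "Release Year"]
    if years.empty:
        return None
    first, last = int(years.min()), int(years.max())
    return first, last, last - first

# Invented examples: one career with spread-out hits, one with clustered hits
steady = pd.DataFrame({"Release Year": [1986, 1996, 2005],
                       "Adjusted Gross": [3.5e8, 3.0e8, 2.8e8]})
clustered = pd.DataFrame({"Release Year": [1990, 1991, 1994, 2003],
                          "Adjusted Gross": [2.5e8, 4.0e8, 2.6e8, 1.5e8]})

steady_span = hit_span(steady)
clustered_span = hit_span(clustered)
print(steady_span, clustered_span)
```

Applied to `TC`, `BW`, and `AS`, a longer span would correspond to the slow-and-steady pattern, though a single late hit can stretch the span, so it is best read alongside the charts.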
The Winner: Chris Pratt! Looking at the data we predict that Chris Pratt is in the best position to capitalize going forward given his strong hauls in solo vehicles over the past several years. If he can keep his popularity up over the next decade he will be the Chris you take your grandkids to the movies to see. The upward trajectory matches our legends, and we like the trend that we see coupled with soft factors like his "everyman" appeal.
Dark Horse: Chris Evans, if he can successfully spin his Marvel success into solo leading roles that aren't franchises.
Throw him a lifesaver: Chris Hemsworth. The once bright Thor star is floundering in solo projects, and may go the downward route of Bruce Willis.