ReproduceIt is a series of articles that reproduce the results from data analysis articles, with a focus on open data and open code. All the code and data are available on GitHub: reproduceit-538-adam-sandler-movies. This post is a more verbose version of that content and will probably become outdated, while the GitHub version may be updated with fixes.

I am a fan of FiveThirtyEight and of how most of their articles are based on data analysis. I am also a fan of how they open source a lot of their code and data on GitHub. The ReproduceIt series of articles is heavily based on their work.

In this first article of ReproduceIt I am going to try to reproduce the analysis Walt Hickey did for the article "The Three Types Of Adam Sandler Movies". That article is a simple data analysis of Adam Sandler movies, and no code or data was provided for it, so I think it is a nice opportunity to start this series of posts.

The other objective of these posts is to learn something new. In this case I made my first ever Bokeh plots, which are interactive. This makes it easier to explore the movie data, something that unfortunately is not possible with the static image in the original article.

Let's start by printing the version of the language (Python in this case) and of the libraries used in this analysis. An environment.yml is available in the GitHub repo, so you can easily create a conda environment from it.



In [1]:

    
import sys
sys.version_info









    Out[1]:





sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)



In [2]:

    
import requests
requests.__version__









    Out[2]:





'2.6.2'



In [3]:

    
import bs4
from bs4 import BeautifulSoup
bs4.__version__









    Out[3]:





'4.3.2'



In [4]:

    
import numpy as np
np.__version__









    Out[4]:





'1.9.2'



In [5]:

    
import pandas as pd
pd.__version__









    Out[5]:





'0.16.0'



In [6]:

    
from sklearn.cluster import KMeans

import sklearn
sklearn.__version__









    Out[6]:





'0.16.1'



In [7]:

    
import bokeh.plotting as plt
from bokeh.models import HoverTool
plt.output_notebook()

import bokeh
bokeh.__version__









    




    
        
        
        
    
        
        BokehJS successfully loaded.
    






    Out[7]:





'0.8.2'
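
The version cells above can also be collapsed into a single one. The following is a minimal sketch and not part of the original notebook: `collect_versions` is a hypothetical helper name, and it assumes each library exposes a `__version__` attribute, which is a common convention but not a guarantee.

```python
import importlib


def collect_versions(module_names):
    """Map each module name to its version string, tolerating missing modules."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = 'not installed'
            continue
        # Not every module defines __version__, so fall back to a placeholder.
        versions[name] = str(getattr(module, '__version__', 'unknown'))
    return versions


# The module names mirror the imports used in this post.
for name, version in collect_versions(
        ['requests', 'bs4', 'numpy', 'pandas', 'sklearn', 'bokeh']).items():
    print(name, version)
```

A library that is missing from the environment is reported as 'not installed' instead of stopping the notebook at the first failed import.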

Getting Data

The original article is based on Rotten Tomatoes ratings, but it does not mention exactly which rating was used: "TOMATOMETER" or "AUDIENCE SCORE". I decided to use "TOMATOMETER" since it is the default that Rotten Tomatoes shows on the actor page. This metric is defined as: "The percentage of Approved Tomatometer Critics who have given this movie a positive review".

For the Box Office Gross, the article mentions using Opus Data, which is behind a paywall with a 7-day demo. To keep this analysis more open, I decided to replace Opus Data with the Box Office Gross data available on Rotten Tomatoes. This also made the task easier, since I was able to get all the data from one source, and even from one page.

To parse the HTML from Rotten Tomatoes I used BeautifulSoup. I used the handy SelectorGadget tool to get the CSS selector of the content, a single <table> in this case. We can pass that HTML to pandas, which returns a nice DataFrame.



In [8]:

    
def get_soup(url):
    r = requests.get(url)
    return BeautifulSoup(r.text, 'html5lib')



In [9]:

    
rotten_sandler_url = 'http://www.rottentomatoes.com/celebrity/adam_sandler/'



In [10]:

    
soup = get_soup(rotten_sandler_url)



In [11]:

    
films_table = str(soup.select('#filmography_box table:first-child')[0])



In [12]:

    
rotten = pd.read_html(films_table)[0]
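
To make the table-to-rows step less of a black box, here is a rough sketch of the idea using only the standard library. It deliberately swaps in `html.parser` for BeautifulSoup and pandas, so it is not what `pd.read_html` does internally; the `TableRows` class and the tiny HTML snippet are made up for illustration, with cell values borrowed from the table shown in this post.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th') and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = """
<table>
  <tr><th>RATING</th><th>TITLE</th><th>BOX OFFICE</th></tr>
  <tr><td>6%</td><td>Paul Blart: Mall Cop 2</td><td>$43.2M</td></tr>
  <tr><td>9%</td><td>The Cobbler</td><td>--</td></tr>
</table>
"""

parser = TableRows()
parser.feed(html)
header, *body = parser.rows
# One dict per movie, keyed by the header row.
records = [dict(zip(header, row)) for row in body]
print(records)
```

In the notebook itself, pandas handles all of this (plus type inference and missing values) in the single `pd.read_html` call.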



In [13]:

    
rotten.head()









    Out[13]:






  
    
      
   RATING                   TITLE                             CREDIT BOX OFFICE  YEAR
0     NaN              Candy Land                              Actor         --  2015
1      6%  Paul Blart: Mall Cop 2                           Producer     $43.2M  2015
2     NaN             Hello Ghost                     Actor Producer         --  2015
3      9%             The Cobbler                         Max Simkin         --  2015
4     NaN                  Pixels  Producer Screenwriter Sam Brenner         --  2015

The data needed to be cleaned up a little bit: I converted the "Rating" and "Box Office" columns to numeric values with some simple transformations that remove a few text characters, and replaced empty values with numpy.nan.



In [14]:

    
rotten.RATING = rotten.RATING.str.replace('%', '').astype(float)



In [15]:

    
rotten['BOX OFFICE'] = rotten['BOX OFFICE'].str.replace('$', '').str.replace('M', '').str.replace('-', '0')
rotten['BOX OFFICE'] = rotten['BOX OFFICE'].astype(float)



In [16]:

    
rotten.loc[rotten['BOX OFFICE'] == 0, ['BOX OFFICE']] = np.nan
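
The same clean-up can also be written as two small parsing functions. This is only a sketch: `parse_rating` and `parse_box_office` are hypothetical helper names, and the code assumes every gross looks like '$43.2M' (millions of dollars) or the '--' placeholder, as in the table above. Unlike the cells above, it maps the placeholder straight to NaN instead of going through 0.

```python
import math


def parse_rating(text):
    """'6%' -> 6.0; missing values (None or NaN) -> NaN."""
    if not isinstance(text, str):
        return math.nan
    return float(text.replace('%', ''))


def parse_box_office(text):
    """'$43.2M' -> 43.2 (millions of dollars); the '--' placeholder -> NaN."""
    if not isinstance(text, str) or text.strip('-') == '':
        return math.nan
    return float(text.replace('$', '').replace('M', ''))


print(parse_rating('6%'), parse_box_office('$43.2M'), parse_box_office('--'))
```

On the raw DataFrame these could be applied with something like `rotten['BOX OFFICE'].map(parse_box_office)`. A value in an unexpected format raises a ValueError, which at least makes surprises visible.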



In [17]:

    
rotten.head()









    Out[17]:






  
    
      
   RATING                   TITLE                             CREDIT  BOX OFFICE  YEAR
0     NaN              Candy Land                              Actor         NaN  2015
1       6  Paul Blart: Mall Cop 2                           Producer        43.2  2015
2     NaN             Hello Ghost                     Actor Producer         NaN  2015
3       9             The Cobbler                         Max Simkin         NaN  2015
4     NaN                  Pixels  Producer Screenwriter Sam Brenner         NaN  2015



In [18]:

    
rotten = rotten.set_index('TITLE')

Finally, save the dataset.



In [19]:

    
rotten.to_csv('rotten.csv')

Plot

This is the original plot for comparison.



In [20]:

    
from IPython.display import Image



In [21]:

    
Image(url='https://espnfivethirtyeight.files.wordpress.com/2015/04/hickey-datalab-sandler.png', width=550)









    Out[21]:

We load the saved data into a pandas DataFrame and plot it using Bokeh, which gives some nice interactive features that the original chart does not have. This makes it easier to explore the different movies in the multiple clusters.

To keep it simple, I removed all the movies that have either data point missing, so the set of movies may differ a little from the one analysed in the original article.



In [22]:

    
rotten = pd.read_csv('rotten.csv', index_col=0)



In [23]:

    
rotten = rotten.dropna()



In [24]:

    
len(rotten)









    Out[24]:





37



In [25]:

    
rotten.index









    Out[25]:





Index(['Paul Blart: Mall Cop 2', 'Blended', 'Top Five', 'Grown Ups 2', 'Hotel Transylvania', 'That's My Boy', 'Here Comes the Boom', 'Bucky Larson: Born to Be a Star', 'Jack and Jill', 'Zookeeper', 'Just Go with It', 'Grown Ups', 'Funny People', 'Paul Blart: Mall Cop', 'Bedtime Stories', 'You Don't Mess With the Zohan', 'The House Bunny', 'Strange Wilderness', 'I Now Pronounce You Chuck & Larry', 'Reign Over Me', 'The Benchwarmers', 'Grandma's Boy', 'Click', 'The Longest Yard', 'Deuce Bigalow: European Gigolo', 'Spanglish', '50 First Dates', 'Anger Management', 'Dickie Roberts: Former Child Star', 'Punch-Drunk Love', 'Adam Sandler's Eight Crazy Nights', 'The Hot Chick', 'Mr. Deeds', 'The Master of Disguise', 'The Animal', 'Joe Dirt', 'Little Nicky'], dtype='object')



In [26]:

    
source = plt.ColumnDataSource(
    data=dict(
        rating=rotten.RATING,
        gross=rotten['BOX OFFICE'],
        movie=rotten.index,
    )
)

p = plt.figure(tools='reset,save,hover', x_range=[0, 100],
               title='', width=530, height=530,
               x_axis_label="Rotten Tomatoes rating",
               y_axis_label="Box Office Gross")
p.scatter(rotten.RATING, rotten['BOX OFFICE'], size=10, source=source)

hover = p.select(dict(type=HoverTool))

hover.tooltips = [
    ("Movie", "@movie"),
    ("Rating", "@rating"),
    ("Box Office Gross", "@gross"),
]

plt.show(p)

Clusters

The article also mentioned some simple clustering on the dataset. I used scikit-learn to reproduce that result.



In [27]:

    
X = rotten[['RATING', 'BOX OFFICE']].values



In [28]:

    
clf = KMeans(n_clusters=3)



In [29]:

    
clf.fit(X)









    Out[29]:





KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)



In [30]:

    
clusters = clf.predict(X)
clusters









    Out[30]:





array([0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 0, 0, 1, 2, 0, 0, 1,
       1, 0, 2, 1, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0], dtype=int32)
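
For readers who have not met k-means before, the idea can be sketched in plain Python. This is Lloyd's algorithm with hand-picked starting centroids, not scikit-learn's implementation (which, as Out[29] shows, uses k-means++ initialisation and 10 restarts). The `kmeans` function name and the seven (rating, gross) points are made up for illustration and are not taken from the dataset.

```python
def kmeans(points, centroids, iterations=20):
    """Plain-Python Lloyd's algorithm; returns (labels, centroids).

    Assumes iterations >= 1 and that centroids is a list of tuples.
    """
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(point, centroids[k])))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved in this iteration
        centroids = new_centroids
    return labels, centroids


# Toy (rating, gross) pairs forming three obvious groups.
points = [(10, 20), (12, 25), (8, 30), (40, 150), (45, 160), (80, 30), (85, 25)]
labels, centroids = kmeans(points, [points[0], points[3], points[5]])
print(labels)
print(centroids)
```

The label numbers only reflect the order of the starting centroids. With a random start, as in the notebook, the same groups can come out under different numbers from one run to the next.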



In [31]:

    
colors = clusters.astype(str)
colors[clusters == 0] = 'green'
colors[clusters == 1] = 'red'
colors[clusters == 2] = 'gold'
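
An alternative way to write this mapping is a dictionary lookup, sketched below with the first five labels from Out[30]. The `palette` name is just an illustration. Keep in mind that the label numbers are arbitrary: with random_state=None, as shown in Out[29], which group ends up as 0, 1 or 2 can change between runs, so the colour assigned to each group can change too.

```python
# Map each cluster label to a colour name.
palette = {0: 'green', 1: 'red', 2: 'gold'}

clusters = [0, 0, 2, 1, 1]  # first five labels from Out[30]
colors = [palette[label] for label in clusters]
print(colors)
```

A label without an entry in `palette` raises a KeyError, which is a useful reminder to update the colours when n_clusters changes.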



In [32]:

    
source = plt.ColumnDataSource(
    data=dict(
        rating=rotten.RATING,
        gross=rotten['BOX OFFICE'],
        movie=rotten.index,
    )
)

p = plt.figure(tools='reset,save,hover', x_range=[0, 100],
               title='', width=530, height=530,
               x_axis_label="Rotten Tomatoes rating",
               y_axis_label="Box Office Gross")
p.scatter(rotten.RATING, rotten['BOX OFFICE'], size=10, source=source, color=colors)

hover = p.select(dict(type=HoverTool))

hover.tooltips = [
    ("Movie", "@movie"),
    ("Rating", "@rating"),
    ("Box Office Gross", "@gross"),
]

plt.show(p)

We can see a very similar result to the original plot. You can read how the author analyses these three clusters in the original article: The Three Types Of Adam Sandler Movies.

IMDB

I was curious about how this result compares if we change the sources of the data a little bit: what happens if we use IMDB ratings instead of Rotten Tomatoes?

IMDB: Ratings

We apply a similar procedure for getting the data from IMDB with a basic crawler to get the rating from every movie page.



In [33]:

    
imdb_sandler_url = 'http://www.imdb.com/name/nm0001191/'



In [34]:

    
soup = get_soup(imdb_sandler_url)



In [35]:

    
a_tags = soup.select('div#filmo-head-actor + div b a')



In [36]:

    
a_tags[:5]









    Out[36]:





[<a href="/title/tt2479478/?ref_=nm_flmg_act_1">The Ridiculous 6</a>,
 <a href="/title/tt2510894/?ref_=nm_flmg_act_2">Hotel Transylvania 2</a>,
 <a href="/title/tt2120120/?ref_=nm_flmg_act_3">Pixels</a>,
 <a href="/title/tt3203616/?ref_=nm_flmg_act_4">The Cobbler</a>,
 <a href="/title/tt3179568/?ref_=nm_flmg_act_5">Men, Women &amp; Children</a>]



In [37]:

    
movies = {}
for a_tag in a_tags:
    movie_name = a_tag.text
    movie_url = 'http://www.imdb.com' + a_tag['href']
    soup = get_soup(movie_url)
    rating = soup.select('.star-box-giga-star')
    if len(rating) == 1:
        movies[movie_name] = float(rating[0].text)
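
The movie URL in the loop above is built by string concatenation. A slightly more defensive variant can be sketched with the standard library's `urllib.parse`. The `movie_url` helper name is hypothetical, and dropping the `ref_` query string rests on the assumption that it is not needed to load the page.

```python
from urllib.parse import urljoin, urlsplit

BASE_URL = 'http://www.imdb.com'


def movie_url(href):
    """Resolve a relative href against the site root and drop its query string."""
    absolute = urljoin(BASE_URL, href)
    return urlsplit(absolute)._replace(query='').geturl()


# href taken from the first link in Out[36]
print(movie_url('/title/tt2479478/?ref_=nm_flmg_act_1'))
```

`urljoin` also copes with hrefs that are already absolute, which plain concatenation would turn into a broken URL.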



In [38]:

    
ratings = pd.DataFrame.from_dict(movies, orient='index')
ratings.columns = ['rating']



In [39]:

    
ratings.head()









    Out[39]:






  
    
      
                               rating
Brooklyn Nine-Nine                8.3
A Day with the Meatball           6.6
You Don't Mess with the Zohan     5.5
Click                             6.4
Deuce Bigalow: Male Gigolo        5.7



In [40]:

    
len(ratings)









    Out[40]:





53



In [41]:

    
ratings.index.name = 'Title'



In [42]:

    
ratings.to_csv('imdb-ratings.csv')

IMDB: Box Office Mojo

IMDB also provides the Box Office Gross information from Box Office Mojo.



In [43]:

    
box_sandler_url = 'http://www.boxofficemojo.com/people/chart/?view=Actor&id=adamsandler.htm'



In [44]:

    
soup = get_soup(box_sandler_url)



In [45]:

    
box_gross_table = str(soup.select('br + table')[0])



In [46]:

    
gross = pd.read_html(box_gross_table, header=0)[0]



In [47]:

    
gross.head()









    Out[47]:






  
    
      
      Date      Title (click to view) Studio Lifetime Gross / Theaters Opening / Theaters         Rank Unnamed: 6 Unnamed: 7
0  10/1/14      Men, Women & Children   Par.                  $705,908                608      $48,024         17         30
1  5/23/14                    Blended     WB               $46,294,610               3555  $14,284,031       3555         18
2  7/12/13                Grown Ups 2   Sony              $133,668,525               3491  $41,508,572       3491          8
3  9/28/12  Hotel Transylvania(Voice)   Sony              $148,313,048               3375  $42,522,194       3349          5
4  6/15/12              That's My Boy   Sony               $36,931,089               3030  $13,453,714       3030         22



In [48]:

    
# Drop every column that is not needed for the analysis in one call.
gross.drop(['Unnamed: 6', 'Unnamed: 7', 'Opening / Theaters', 'Rank', 'Studio'],
           axis=1, inplace=True)



In [49]:

    
gross.columns = ['Date', 'Title', 'Gross']



In [50]:

    
gross.set_index('Title', inplace=True)



In [51]:

    
gross.Gross = gross.Gross.str.replace(r'[$,]', '').astype(int)
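
The pattern `[$,]` is a regular-expression character class that matches either a dollar sign or a comma. The sketch below shows the same transformation on a single string with the standard library's `re` module; `parse_gross` is a hypothetical helper name. As far as I know, more recent pandas releases no longer treat the pattern in `str.replace` as a regular expression by default, so the cell above may need an explicit `regex=True` there.

```python
import re


def parse_gross(text):
    """'$46,294,610' -> 46294610 (whole dollars)."""
    # Inside a character class the dollar sign is a literal character.
    return int(re.sub(r'[$,]', '', text))


print(parse_gross('$705,908'))
```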



In [52]:

    
gross.head()









    Out[52]:






  
    
      
                              Date      Gross
Title
Men, Women & Children      10/1/14     705908
Blended                    5/23/14   46294610
Grown Ups 2                7/12/13  133668525
Hotel Transylvania(Voice)  9/28/12  148313048
That's My Boy              6/15/12   36931089



In [53]:

    
gross.to_csv('imdb-gross.csv')

IMDB: Analysis

Load both datasets and merge them, fixing some naming issues caused by inconsistencies between the two sites.



In [54]:

    
ratings = pd.read_csv('imdb-ratings.csv', index_col=0)



In [55]:

    
gross = pd.read_csv('imdb-gross.csv', index_col=0)



In [56]:

    
gross.Gross = gross.Gross / 1e6



In [57]:

    
len(ratings)









    Out[57]:





53



In [58]:

    
len(gross)









    Out[58]:





37



In [59]:

    
gross.ix['Just Go with It'] = gross.ix['Just Go With It']
gross = gross.drop('Just Go With It')



In [60]:

    
gross.ix['I Now Pronounce You Chuck & Larry'] = gross.ix['I Now Pronounce You Chuck and Larry']
gross = gross.drop('I Now Pronounce You Chuck and Larry')
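
The two manual renames above could in principle be replaced by normalising the titles on both sides before joining. The following is only a sketch under assumptions: `normalize_title` is a hypothetical helper, and its rules (ignore case, treat "&" as "and", drop parenthetical notes such as "(Voice)") were chosen to cover the mismatches visible in this post. They could wrongly merge or mangle other titles, for example ones where the parentheses are part of the real name.

```python
import re


def normalize_title(title):
    """Reduce a movie title to a form that is comparable across sites."""
    title = re.sub(r'\([^)]*\)', '', title)  # drop notes such as "(Voice)"
    title = title.replace('&', 'and')
    title = re.sub(r'\s+', ' ', title)       # collapse repeated whitespace
    return title.strip().lower()


pairs = [
    ('Just Go With It', 'Just Go with It'),
    ('I Now Pronounce You Chuck and Larry', 'I Now Pronounce You Chuck & Larry'),
]
for box_office_title, imdb_title in pairs:
    print(normalize_title(box_office_title) == normalize_title(imdb_title))
```

The normalised string would only serve as the join key; the original title can be kept in a separate column for display in the plots.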



In [61]:

    
imdb = gross.join(ratings)



In [62]:

    
len(imdb), len(imdb.dropna())









    Out[62]:





(37, 33)



In [63]:

    
imdb = imdb.dropna()



In [64]:

    
source = plt.ColumnDataSource(
    data=dict(
        rating=imdb.rating,
        gross=imdb.Gross,
        movie=imdb.index,
    )
)

p = plt.figure(tools='reset,save,hover', x_range=[0, 10], 
               title='', width=530, height=530,
               x_axis_label="IMDB rating",
               y_axis_label="Box Office Gross")
p.scatter(imdb.rating, imdb.Gross, size=10, source=source)

hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ("Movie", "@movie"),
    ("Rating", "@rating"),
    ("Box Office Gross", "@gross"),
]

plt.show(p)



In [65]:

    
X = imdb[['rating', 'Gross']].values



In [66]:

    
clf = KMeans(n_clusters=2)



In [67]:

    
clf.fit(X)









    Out[67]:





KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=2, n_init=10,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)



In [68]:

    
clusters = clf.predict(X)
clusters









    Out[68]:





array([1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 0], dtype=int32)



In [69]:

    
colors = clusters.astype(str)
colors[clusters == 0] = 'green'
colors[clusters == 1] = 'red'



In [70]:

    
source = plt.ColumnDataSource(
    data=dict(
        rating=imdb.rating,
        gross=imdb.Gross,
        movie=imdb.index,
    )
)

p = plt.figure(tools='reset,save,hover', x_range=[0, 10],
               title='', width=530, height=530,
               x_axis_label="IMDB rating",
               y_axis_label="Box Office Gross")
p.scatter(imdb.rating, imdb.Gross, size=10, source=source, color=colors)

hover = p.select(dict(type=HoverTool))
hover.tooltips = [
    ("Movie", "@movie"),
    ("Rating", "@rating"),
    ("Box Office Gross", "@gross"),
]

plt.show(p)

Conclusions

It was quite easy to reproduce most of the results from the original article with publicly available data. Since the Box Office Gross data comes from a different source, the results differ a little bit. Also, the original article only included movies in which Sandler is an actor, so movies like Bucky Larson: Born to Be a Star, where Sandler is a writer, are not part of their analysis but are part of this one. This particular movie can be found in the first plot close to the (0,0) point.

It is interesting how different the results from IMDB and Rotten Tomatoes actually are. On IMDB the ratings are not as extreme as on Rotten Tomatoes: almost all the movies are between 4 and 8, and there are two clear clusters, basically the movies that made more than 100M and the movies that made less than that, with only two movies bridging a big 40M gap.

With these results you could say that all Sandler movies are average (6/10), with some of them making money and some not.

With this code it should now be possible to reproduce this analysis for other actors. In fact, the Sandler article was a follow-up to another FiveThirtyEight article: The Four Types Of Will Ferrell Movies.

All code and data are available on GitHub: reproduceit-538-adam-sandler-movies. Issues, reports of inconsistencies and suggestions for improvement are welcome.
