Analysis of Oscar-nominated Films



In [188]:

    
import re
import numpy as np
import pandas as pd
import scipy.stats as stats

pd.set_option('display.float_format', lambda x: '%.3f' % x)



In [153]:

    
%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sb

sb.set(color_codes=True)
sb.set_palette("muted")

np.random.seed(sum(map(ord, "regression")))



In [182]:

    
awards = pd.read_csv('../data/nominations.csv')
oscars = pd.read_csv('../data/analysis.csv')

Descriptive Analysis

To better understand general trends in the data. This is a work in progress. last updated on: February 26, 2017

Seasonality

It is well known that movies gunning for an Academy Award aim to be released between December and February, two months before the award ceremony. This is pretty evident looking at a distribution of film release months:



In [175]:

    
sb.countplot(x="release_month", data=oscars)









    Out[175]:





<matplotlib.axes.AxesSubplot at 0x110372750>

This can be more or less confirmed by calculating the Pearson correlation coefficient, which measures the linear dependence between two variables:



In [212]:

    
def print_pearsonr(data, dependent, independent):
    for field in independent:
        coeff = stats.pearsonr(data[dependent], data[field])
        print "{0} | coeff: {1} | p-value: {2}".format(field, coeff[0], coeff[1])

print_pearsonr(oscars, 'Oscar', ['q1_release', 'q2_release', 'q3_release', 'q4_release'])









    



q1_release | coeff: 0.0516477112662 | p-value: 0.0457924227928
q2_release | coeff: -0.0315055870144 | p-value: 0.223276479778
q3_release | coeff: 0.00108129557868 | p-value: 0.966668039253
q4_release | coeff: -0.0317501964736 | p-value: 0.219701562299

Q1 and Q4 have a higher coefficient than Q2 and Q3, so that points in the right direction...

This won't really help us determine who will win the actual Oscar, but at least we know that if we want a shot, we need to be releasing in late Q4 and early Q1.

Profitability

How do the financial details contribute to Oscar success?



In [179]:

    
# In case we want to examine the data based on the release decade...
oscars['decade'] = oscars['year'].apply(lambda y: str(y)[2] + "0")



In [186]:

    
# Adding some fields to slice and dice...
profit = oscars[~oscars['budget'].isnull()]
profit = profit[~profit['box_office'].isnull()]

profit['profit'] = profit['box_office'] - profit['budget']
profit['margin'] = profit['profit'] / profit['box_office']

Profitability by Award Category

Since 1980, the profitability for films which won an Oscar were on average higher than all films nominated that year.



In [187]:

    
avg_margin_for_all = profit.groupby(['category'])['margin'].mean()
avg_margin_for_win = profit[profit['Oscar'] == 1].groupby(['category'])['margin'].mean()

fig, ax = plt.subplots()

index = np.arange(len(profit['category'].unique()))

rects1 = plt.bar(index, avg_margin_for_win, 0.45, color='r', label='Won')
rects2 = plt.bar(index, avg_margin_for_all, 0.45, color='b', label='All')

plt.xlabel('Award Category')
ax.set_xticklabels(profit['category'].unique(), rotation='vertical')

plt.ylabel('Profit Margin (%)')
plt.title('Average Profit Margin by Award Category')
plt.legend()
plt.show()

The biggest losers...that won?

This is just a fun fact. There were 5 awards since 1980 that were given to films that actually lost money.



In [82]:

    
fields = ['year', 'film', 'category', 'name', 'budget', 'box_office', 'profit', 'margin']
profit[(profit['profit'] < 0) & (profit['Oscar'] == 1)][fields]









    Out[82]:






  
    
      
      year
      film
      category
      name
      budget
      box_office
      profit
      margin
    
  
  
    
      218
      1994
      Blue Sky
      Actress
      Jessica Lange
      16000000.000
      3359465.000
      -12640535.000
      -3.763
    
    
      394
      1998
      Gods and Monsters
      Adapted Screenplay
      Bill Condon
      10000000.000
      6500000.000
      -3500000.000
      -0.538
    
    
      747
      2000
      Almost Famous
      Original Screenplay
      Cameron Crowe
      60000000.000
      47400000.000
      -12600000.000
      -0.266
    
    
      1103
      1994
      Ed Wood
      Supporting Actor
      Martin Landau
      18000000.000
      5900000.000
      -12100000.000
      -2.051
    
    
      1262
      1994
      Bullets over Broadway
      Supporting Actress
      Dianne Wiest
      20000000.000
      13400000.000
      -6600000.000
      -0.493

Other Awards

Do the BAFTAs, Golden Globes, Screen Actors Guild Awards, etc. forecast who is going to win the Oscars? Let's find out...



In [249]:

    
winning_awards = oscars[['category', 'Oscar', 'BAFTA', 'Golden Globe', 'Guild']]
winning_awards.head()









    Out[249]:






  
    
      
      category
      Oscar
      BAFTA
      Golden Globe
      Guild
    
  
  
    
      0
      Actor
      1.000
      0.000
      1.000
      0.000
    
    
      1
      Actor
      0.000
      0.000
      0.000
      0.000
    
    
      2
      Actor
      0.000
      1.000
      0.000
      0.000
    
    
      3
      Actor
      0.000
      0.000
      0.000
      0.000
    
    
      4
      Actor
      0.000
      0.000
      0.000
      0.000



In [244]:

    
acting_categories = ['Actor', 'Actress', 'Supporting Actor', 'Supporting Actress']
                     
y = winning_awards[(winning_awards['Oscar'] == 1)&(winning_awards['category'].isin(acting_categories))]

fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, sharey=True)

plt.title('Count Plot of Wins by Award')

sb.countplot(x="BAFTA", data=y, ax=ax1)
sb.countplot(x="Golden Globe", data=y, ax=ax2)
sb.countplot(x="Guild", data=y, ax=ax3)

print "Pearson correlation for acting categories\n"
print_pearsonr(oscars[oscars['category'].isin(acting_categories)], 'Oscar', ['BAFTA', 'Golden Globe', 'Guild'])









    



Pearson correlation for acting categories

BAFTA | coeff: 0.343106753782 | p-value: 1.53555701801e-21
Golden Globe | coeff: 0.471264631568 | p-value: 1.59689509599e-41
Guild | coeff: 0.467345698872 | p-value: 8.92371589745e-41

It looks like if the Golden Globes and Screen Actors Guild awards are better indicators of Oscar success than the BAFTAs. Let's take a look at the same analysis, but for Best Picture. The "Guild" award we use is the Screen Actor Guild Award for Outstanding Performance by a Cast in a Motion Picture.



In [247]:

    
y = winning_awards[(winning_awards['Oscar'] == 1)&(winning_awards['category'] == 'Picture')]

fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, sharey=True)

plt.title('Count Plot of Wins by Award')

sb.countplot(x="BAFTA", data=y, ax=ax1)
sb.countplot(x="Golden Globe", data=y, ax=ax2)
sb.countplot(x="Guild", data=y, ax=ax3)

print "Pearson correlation for acting categories\n"
print_pearsonr(oscars[oscars['category'] == 'Picture'], 'Oscar', ['BAFTA', 'Golden Globe', 'Guild'])









    



Pearson correlation for acting categories

BAFTA | coeff: 0.303661894517 | p-value: 5.22830069622e-06
Golden Globe | coeff: 0.367860033304 | p-value: 2.34135839185e-08
Guild | coeff: 0.286173459405 | p-value: 1.8594940365e-05

Seems like the BAFTAs hold a bit more weight than the SAG awar, but the Golden Globes are still the best way to forecast an Oscar win.

	year	film	category	name	budget	box_office	profit	margin
218	1994	Blue Sky	Actress	Jessica Lange	16000000.000	3359465.000	-12640535.000	-3.763
394	1998	Gods and Monsters	Adapted Screenplay	Bill Condon	10000000.000	6500000.000	-3500000.000	-0.538
747	2000	Almost Famous	Original Screenplay	Cameron Crowe	60000000.000	47400000.000	-12600000.000	-0.266
1103	1994	Ed Wood	Supporting Actor	Martin Landau	18000000.000	5900000.000	-12100000.000	-2.051
1262	1994	Bullets over Broadway	Supporting Actress	Dianne Wiest	20000000.000	13400000.000	-6600000.000	-0.493

	category	Oscar	BAFTA	Golden Globe
0	Actor	1.000	0.000	1.000
1	Actor	0.000	0.000	0.000
2	Actor	0.000	1.000	0.000
3	Actor	0.000	0.000	0.000
4	Actor	0.000	0.000	0.000