Preparing Scraped Data for Prediction

This notebook describes the process in which the raw films.csv and nominations.csv files are "wrangled" into a workable format for our classifier(s). At the time of this writing (February 25, 2017), the resulting dataset is only used in a decision tree classifier.



In [15]:

    
import re
import pandas as pd
import numpy as np

pd.set_option('display.float_format', lambda x: '%.3f' % x)



In [16]:

    
nominations = pd.read_csv('../data/nominations.csv')

# clean out some obvious mistakes...
nominations = nominations[~nominations['film'].isin(['2001: A Space Odyssey', 'Oliver!', 'Closely Observed Train'])]
nominations = nominations[nominations['year'] >= 1980]

# scraper pulled in some character names instead of film names...
nominations.loc[nominations['film'] == 'Penny Lane', 'film'] = 'Almost Famous'
nominations.loc[nominations['film'] == 'Sister James', 'film'] = 'Doubt'

Pivot Nominations

Since we pull in four award types, we know that each nominee can have a maximum of four line items. The nominations table is pivoted to ensure that each nomination has its own unique line while still maintaining a count of wins per award.



In [17]:

    
wins = pd.pivot_table(nominations, values='winner', index=['year', 'category', 'film', 'name'], columns=['award'], aggfunc=np.sum)
wins = wins.fillna(0) # if a nominee wasn't in a specific ceremony, we just fill it as a ZERO.

wins.reset_index(inplace=True) # flattens the dataframe
wins.head()









    Out[17]:






  
    
      award
      year
      category
      film
      name
      BAFTA
      Golden Globe
      Guild
      Oscar
    
  
  
    
      0
      1980
      Actor
      All That Jazz
      Roy Scheider
      0.000
      0.000
      0.000
      0.000
    
    
      1
      1980
      Actor
      Being There
      Peter Sellers
      0.000
      0.000
      0.000
      0.000
    
    
      2
      1980
      Actor
      Coal Miner's Daughter
      Tommy Lee Jones
      0.000
      0.000
      0.000
      0.000
    
    
      3
      1980
      Actor
      Hopscotch
      Walter Matthau
      0.000
      0.000
      0.000
      0.000
    
    
      4
      1980
      Actor
      Kramer vs. Kramer
      Dustin Hoffman
      0.000
      0.000
      0.000
      0.000

Merge Oscars with Nominations

We only care about films that were nominated for an Academy Award. The pd.merge function is used to perform a left join between the oscars dataframe and the wins. In other words, we are pruning out any films that were never nominated for an Academy Award based on the join fields.



In [18]:

    
oscars = nominations[nominations['award'] == 'Oscar'][['year', 'category', 'film', 'name']]
awards = pd.merge(oscars, wins, how='left', on=['year', 'category', 'name', 'film'])
awards.head()









    Out[18]:






  
    
      
      year
      category
      film
      name
      BAFTA
      Golden Globe
      Guild
      Oscar
    
  
  
    
      0
      1980
      Actor
      Raging Bull
      Robert De Niro
      0.000
      1.000
      0.000
      1.000
    
    
      1
      1980
      Actor
      The Great Santini
      Robert Duvall
      0.000
      0.000
      0.000
      0.000
    
    
      2
      1980
      Actor
      The Elephant Man
      John Hurt
      1.000
      0.000
      0.000
      0.000
    
    
      3
      1980
      Actor
      Tribute
      Jack Lemmon
      0.000
      0.000
      0.000
      0.000
    
    
      4
      1980
      Actor
      The Stunt Man
      Peter O'Toole
      0.000
      0.000
      0.000
      0.000

Read in Films Dataframe

We pull the films.csv file into a dataframe called films. This is then merged to the awards dataframe from above. Note that we only include specific fields. Fields like metacritic_score and bom_worldwide have been excluded because too many null values exist, which would have an adverse effect on our model.



In [19]:

    
films = pd.read_csv('../data/films.csv')



In [20]:

    
relevant_fields = [
    'film',
    'country',
    'release_date', 
    'running_time', 
    'mpaa',
    'box_office',
    'budget',
    'imdb_score', 
    'rt_audience_score', 
    'rt_critic_score', 
    'stars_count', 
    'writers_count'
]

df = pd.merge(awards, films[relevant_fields], how='left', on='film')



In [21]:

    
print "Total Observations:", len(df)
print
print "Observations with NaN fields:"

for column in df.columns:
    l = len(df[df[column].isnull()])
    if l != 0:
        print len(df[df[column].isnull()]), "\t", column









    



Total Observations: 1506

Observations with NaN fields:
9 	country
2 	running_time
9 	mpaa
17 	box_office
137 	budget
2 	rt_audience_score
4 	rt_critic_score
5 	stars_count

So we obviously have some null values, which is disappointing. We'll take the time to clean these up.



In [22]:

    
### FIX RUN TIME ###
# df[df['running_time'].isnull()] # Hilary and Jackie
df.loc[df['film'] == 'Hilary and Jackie', 'running_time'] = '121 minutes'
df.loc[df['film'] == 'Fanny and Alexander', 'running_time'] = '121 minutes'

### FIX MPAA RATING ###
df = df.replace('NOT RATED', np.nan)
df = df.replace('UNRATED', np.nan)
df = df.replace('M', np.nan)
df = df.replace('NC-17', np.nan)
df = df.replace('APPROVED', np.nan)

# df[df['mpaa'].isnull()]
df.loc[df['film'].isin(['L.A. Confidential', 'In the Loop']), 'mpaa'] = 'R'
df.loc[df['film'].isin(['True Grit', 'A Room with a View']), 'mpaa'] = 'PG-13'

### FIX COUNTRY ###
# df[df['country'].isnull()] # Ulee's Gold, The Constant Gardner, Dave
df.loc[df['film'].isin(["Ulee's Gold", "Dave"]), 'country'] = 'United States'
df.loc[df['country'].isnull(), 'country'] = 'United Kingdom'
df.loc[df['country'] == 'Germany\\', 'country'] = 'Germany'
df.loc[df['country'] == 'United States & Australia', 'country'] = 'United States'
df['country'].unique()

### FIX STARS COUNT ###
# df[df['stars_count'].isnull()]

df.loc[df['film'].isin(['Before Sunset', 'Before Midnight']), 'stars_count'] = 2
df.loc[df['film'] == 'Dick Tracy', 'stars_count'] = 10
df.loc[df['stars_count'].isnull(), 'stars_count'] = 1

df = df[~df['release_date'].isin(['1970'])]



In [23]:

    
def to_numeric(value):
    multiplier = 1
    
    try:
        value = re.sub(r'([$,])', '', str(value)).strip() 
        value = re.sub(r'\([^)]*\)', '', str(value)).strip()
        
        if 'million' in value:
            multiplier = 1000000  
        elif 'billion' in value:
            multiplier = 10000000
        
        for replace in ['US', 'billion', 'million']:
            value = value.replace(replace, '')
            
        value = value.split(' ')[0]
        
        if isinstance(value, str):
            value = value.split('-')[0]
        
        value = float(value) * multiplier
    except:
        return np.nan
    
    return value

def to_runtime(value):
    try:
        return re.findall(r'\d+', value)[0]
    except:
        return np.nan

                   
### Apply function to appropriate fields ###
for field in ['box_office', 'budget']:
    df[field] = df[field].apply(to_numeric)
    
df['release_month'] = df['release_date'].apply(lambda y: int(y.split('-')[1]))
df['running_time'] = df['running_time'].apply(to_runtime)



In [25]:

    
### FIX BOX OFFICE ###
list(df[df['mpaa'].isnull()]['film'].unique())

# cleaned_box_offices = {
#     'Mona Lisa': 5794184, 
#     'Testament': 2044982, 
#     'Pennies from Heaven': 9171289, 
#     'The Year of Living Dangerously': 10300000
# }

# for key, value in cleaned_box_offices.items():
#     df.loc[df['film'] == key, 'box_office'] = value
    
# ### FIX BUDGET ###
# # df[(df['budget'].isnull())]['film'].unique()

# cleaned_budgets = {'Juno': 6500000, 'Blue Sky': 16000000, 'Pollock': 6000000 }

# for key, value in cleaned_budgets.items():
#     df.loc[df['film'] == key, 'budget'] = value









    Out[25]:





['Dark Eyes',
 'Pelle the Conqueror',
 'Resurrection',
 'The Bostonians',
 'My Life as a Dog',
 'Hamlet',
 'The Official Story',
 'The Whales of August']

Adding some more fields and removing remaining nulls

While we are pretty happy with our MPAA field, we can't input it into a predictive model as is. The decision tree would not know how to treat a string (e.g., "PG"). So, instead, we pivot those values into separate, boolean fields.

So instead of...

film	mpaa
Raging Bull	R
Kramer vs. Kramer	PG

We get...

film	G	PG	PG13	R
Raging Bull	0	0	0	1
Kramer vs. Kramer	0	1	0	0

This essentially "quantifies" the MPAA feature so that our algorithm can properly interpret it. Note that we perform a similar action for production country (just for the USA) and seasonality.



In [26]:

    
df = df[~df['mpaa'].isnull()]



In [27]:

    
df['produced_USA'] = df['country'].apply(lambda x: 1 if x == 'United States' else 0)

for column in df['mpaa'].unique():
    df[column.replace('-', '')] = df['mpaa'].apply(lambda x: 1 if x == column else 0)

df['q1_release'] = df['release_month'].apply(lambda m: 1 if m <= 3 else 0)
df['q2_release'] = df['release_month'].apply(lambda m: 1 if m > 3 and m <= 6 else 0)
df['q3_release'] = df['release_month'].apply(lambda m: 1 if m > 6 and m <= 9 else 0)
df['q4_release'] = df['release_month'].apply(lambda m: 1 if m > 9 else 0)



In [28]:

    
df.to_csv('../data/analysis.csv', index=False)



In [29]:

    
del df['mpaa']
del df['country']
del df['release_date']
del df['release_month']
del df['budget']

for column in df.columns:
    df = df[~df[column].isnull()]



In [30]:

    
df.to_csv('../data/prepared.csv', index=False)

award	year	category	film	name
0	1980	Actor	All That Jazz	Roy Scheider
1	1980	Actor	Being There	Peter Sellers
2	1980	Actor	Coal Miner's Daughter	Tommy Lee Jones
3	1980	Actor	Hopscotch	Walter Matthau
4	1980	Actor	Kramer vs. Kramer	Dustin Hoffman