Preparing Scraped Data for Prediction

This notebook describes the process in which the raw films.csv and nominations.csv files are "wrangled" into a workable format for our classifier(s). At the time of this writing (February 25, 2017), the resulting dataset is only used in a decision tree classifier.


In [15]:
import re
import pandas as pd
import numpy as np

pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [16]:
nominations = pd.read_csv('../data/nominations.csv')

# clean out some obvious mistakes...
nominations = nominations[~nominations['film'].isin(['2001: A Space Odyssey', 'Oliver!', 'Closely Observed Train'])]
nominations = nominations[nominations['year'] >= 1980]

# scraper pulled in some character names instead of film names...
nominations.loc[nominations['film'] == 'Penny Lane', 'film'] = 'Almost Famous'
nominations.loc[nominations['film'] == 'Sister James', 'film'] = 'Doubt'

Pivot Nominations

Since we pull in four award types, we know that each nominee can have a maximum of four line items. The nominations table is pivoted to ensure that each nomination has its own unique line while still maintaining a count of wins per award.


In [17]:
wins = pd.pivot_table(nominations, values='winner', index=['year', 'category', 'film', 'name'], columns=['award'], aggfunc=np.sum)
wins = wins.fillna(0) # if a nominee wasn't in a specific ceremony, we just fill it as a ZERO.

wins.reset_index(inplace=True) # flattens the dataframe
wins.head()


Out[17]:
award year category film name BAFTA Golden Globe Guild Oscar
0 1980 Actor All That Jazz Roy Scheider 0.000 0.000 0.000 0.000
1 1980 Actor Being There Peter Sellers 0.000 0.000 0.000 0.000
2 1980 Actor Coal Miner's Daughter Tommy Lee Jones 0.000 0.000 0.000 0.000
3 1980 Actor Hopscotch Walter Matthau 0.000 0.000 0.000 0.000
4 1980 Actor Kramer vs. Kramer Dustin Hoffman 0.000 0.000 0.000 0.000

Merge Oscars with Nominations

We only care about films that were nominated for an Academy Award. The pd.merge function is used to perform a left join between the oscars dataframe and the wins. In other words, we are pruning out any films that were never nominated for an Academy Award based on the join fields.


In [18]:
oscars = nominations[nominations['award'] == 'Oscar'][['year', 'category', 'film', 'name']]
awards = pd.merge(oscars, wins, how='left', on=['year', 'category', 'name', 'film'])
awards.head()


Out[18]:
year category film name BAFTA Golden Globe Guild Oscar
0 1980 Actor Raging Bull Robert De Niro 0.000 1.000 0.000 1.000
1 1980 Actor The Great Santini Robert Duvall 0.000 0.000 0.000 0.000
2 1980 Actor The Elephant Man John Hurt 1.000 0.000 0.000 0.000
3 1980 Actor Tribute Jack Lemmon 0.000 0.000 0.000 0.000
4 1980 Actor The Stunt Man Peter O'Toole 0.000 0.000 0.000 0.000

Read in Films Dataframe

We pull the films.csv file into a dataframe called films. This is then merged to the awards dataframe from above. Note that we only include specific fields. Fields like metacritic_score and bom_worldwide have been excluded because too many null values exist, which would have an adverse effect on our model.


In [19]:
films = pd.read_csv('../data/films.csv')

In [20]:
relevant_fields = [
    'film',
    'country',
    'release_date', 
    'running_time', 
    'mpaa',
    'box_office',
    'budget',
    'imdb_score', 
    'rt_audience_score', 
    'rt_critic_score', 
    'stars_count', 
    'writers_count'
]

df = pd.merge(awards, films[relevant_fields], how='left', on='film')

In [21]:
print "Total Observations:", len(df)
print
print "Observations with NaN fields:"

for column in df.columns:
    l = len(df[df[column].isnull()])
    if l != 0:
        print len(df[df[column].isnull()]), "\t", column


Total Observations: 1506

Observations with NaN fields:
9 	country
2 	running_time
9 	mpaa
17 	box_office
137 	budget
2 	rt_audience_score
4 	rt_critic_score
5 	stars_count

So we obviously have some null values, which is disappointing. We'll take the time to clean these up.


In [22]:
### FIX RUN TIME ###
# df[df['running_time'].isnull()] # Hilary and Jackie
df.loc[df['film'] == 'Hilary and Jackie', 'running_time'] = '121 minutes'
df.loc[df['film'] == 'Fanny and Alexander', 'running_time'] = '121 minutes'

### FIX MPAA RATING ###
df = df.replace('NOT RATED', np.nan)
df = df.replace('UNRATED', np.nan)
df = df.replace('M', np.nan)
df = df.replace('NC-17', np.nan)
df = df.replace('APPROVED', np.nan)

# df[df['mpaa'].isnull()]
df.loc[df['film'].isin(['L.A. Confidential', 'In the Loop']), 'mpaa'] = 'R'
df.loc[df['film'].isin(['True Grit', 'A Room with a View']), 'mpaa'] = 'PG-13'

### FIX COUNTRY ###
# df[df['country'].isnull()] # Ulee's Gold, The Constant Gardner, Dave
df.loc[df['film'].isin(["Ulee's Gold", "Dave"]), 'country'] = 'United States'
df.loc[df['country'].isnull(), 'country'] = 'United Kingdom'
df.loc[df['country'] == 'Germany\\', 'country'] = 'Germany'
df.loc[df['country'] == 'United States & Australia', 'country'] = 'United States'
df['country'].unique()

### FIX STARS COUNT ###
# df[df['stars_count'].isnull()]

df.loc[df['film'].isin(['Before Sunset', 'Before Midnight']), 'stars_count'] = 2
df.loc[df['film'] == 'Dick Tracy', 'stars_count'] = 10
df.loc[df['stars_count'].isnull(), 'stars_count'] = 1

df = df[~df['release_date'].isin(['1970'])]

In [23]:
def to_numeric(value):
    multiplier = 1
    
    try:
        value = re.sub(r'([$,])', '', str(value)).strip() 
        value = re.sub(r'\([^)]*\)', '', str(value)).strip()
        
        if 'million' in value:
            multiplier = 1000000  
        elif 'billion' in value:
            multiplier = 10000000
        
        for replace in ['US', 'billion', 'million']:
            value = value.replace(replace, '')
            
        value = value.split(' ')[0]
        
        if isinstance(value, str):
            value = value.split('-')[0]
        
        value = float(value) * multiplier
    except:
        return np.nan
    
    return value

def to_runtime(value):
    try:
        return re.findall(r'\d+', value)[0]
    except:
        return np.nan

                   
### Apply function to appropriate fields ###
for field in ['box_office', 'budget']:
    df[field] = df[field].apply(to_numeric)
    
df['release_month'] = df['release_date'].apply(lambda y: int(y.split('-')[1]))
df['running_time'] = df['running_time'].apply(to_runtime)

In [25]:
### FIX BOX OFFICE ###
list(df[df['mpaa'].isnull()]['film'].unique())

# cleaned_box_offices = {
#     'Mona Lisa': 5794184, 
#     'Testament': 2044982, 
#     'Pennies from Heaven': 9171289, 
#     'The Year of Living Dangerously': 10300000
# }

# for key, value in cleaned_box_offices.items():
#     df.loc[df['film'] == key, 'box_office'] = value
    
# ### FIX BUDGET ###
# # df[(df['budget'].isnull())]['film'].unique()

# cleaned_budgets = {'Juno': 6500000, 'Blue Sky': 16000000, 'Pollock': 6000000 }

# for key, value in cleaned_budgets.items():
#     df.loc[df['film'] == key, 'budget'] = value


Out[25]:
['Dark Eyes',
 'Pelle the Conqueror',
 'Resurrection',
 'The Bostonians',
 'My Life as a Dog',
 'Hamlet',
 'The Official Story',
 'The Whales of August']

Adding some more fields and removing remaining nulls

While we are pretty happy with our MPAA field, we can't input it into a predictive model as is. The decision tree would not know how to treat a string (e.g., "PG"). So, instead, we pivot those values into separate, boolean fields.

So instead of...

film mpaa
Raging Bull R
Kramer vs. Kramer PG

We get...

film G PG PG13 R
Raging Bull 0 0 0 1
Kramer vs. Kramer 0 1 0 0

This essentially "quantifies" the MPAA feature so that our algorithm can properly interpret it. Note that we perform a similar action for production country (just for the USA) and seasonality.


In [26]:
df = df[~df['mpaa'].isnull()]

In [27]:
df['produced_USA'] = df['country'].apply(lambda x: 1 if x == 'United States' else 0)

for column in df['mpaa'].unique():
    df[column.replace('-', '')] = df['mpaa'].apply(lambda x: 1 if x == column else 0)

df['q1_release'] = df['release_month'].apply(lambda m: 1 if m <= 3 else 0)
df['q2_release'] = df['release_month'].apply(lambda m: 1 if m > 3 and m <= 6 else 0)
df['q3_release'] = df['release_month'].apply(lambda m: 1 if m > 6 and m <= 9 else 0)
df['q4_release'] = df['release_month'].apply(lambda m: 1 if m > 9 else 0)

In [28]:
df.to_csv('../data/analysis.csv', index=False)

In [29]:
del df['mpaa']
del df['country']
del df['release_date']
del df['release_month']
del df['budget']

for column in df.columns:
    df = df[~df[column].isnull()]

In [30]:
df.to_csv('../data/prepared.csv', index=False)