This notebook describes the process by which the raw films.csv and nominations.csv files are "wrangled" into a workable format for our classifier(s). At the time of this writing (February 25, 2017), the resulting dataset is used only by a decision tree classifier.
In [15]:
import re
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f' % x)
In [16]:
nominations = pd.read_csv('../data/nominations.csv')
# clean out some obvious mistakes...
nominations = nominations[~nominations['film'].isin(['2001: A Space Odyssey', 'Oliver!', 'Closely Observed Train'])]
nominations = nominations[nominations['year'] >= 1980]
# scraper pulled in some character names instead of film names...
nominations.loc[nominations['film'] == 'Penny Lane', 'film'] = 'Almost Famous'
nominations.loc[nominations['film'] == 'Sister James', 'film'] = 'Doubt'
In [17]:
wins = pd.pivot_table(nominations, values='winner', index=['year', 'category', 'film', 'name'], columns=['award'], aggfunc=np.sum)
wins = wins.fillna(0) # if a nominee wasn't in a specific ceremony, we just fill it as a ZERO.
wins.reset_index(inplace=True) # flattens the dataframe
wins.head()
Out[17]:
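For intuition, here is a minimal sketch of what that pivot does on a couple of made-up rows (toy data, not the real nominations file): each award becomes its own column, and the summed winner flag is the cell value.
toy = pd.DataFrame({
    'year': [2000, 2000],
    'category': ['Best Actor', 'Best Actor'],
    'film': ['Gladiator', 'Gladiator'],
    'name': ['Russell Crowe', 'Russell Crowe'],
    'award': ['Oscar', 'BAFTA'],
    'winner': [1, 0],
})
pd.pivot_table(toy, values='winner', index=['year', 'category', 'film', 'name'],
               columns=['award'], aggfunc=np.sum)
# award                                      BAFTA  Oscar
# 2000 Best Actor Gladiator Russell Crowe        0      1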
We only care about films that were nominated for an Academy Award. The pd.merge function is used to perform a left join between the oscars dataframe and the wins dataframe. In other words, based on the join fields, we prune out any films that were never nominated for an Academy Award.
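As a quick, self-contained illustration of why a left join prunes this way (toy frames with hypothetical rows, not the real data):
left = pd.DataFrame({'film': ['Gladiator']})            # the Oscar-nominated side
right = pd.DataFrame({'film': ['Gladiator', 'Snatch'],  # nominees across all ceremonies
                      'BAFTA': [0, 1]})
pd.merge(left, right, how='left', on='film')
#         film  BAFTA
# 0  Gladiator      0
# 'Snatch' disappears because it never occurs on the left (Oscar) side.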
In [18]:
oscars = nominations[nominations['award'] == 'Oscar'][['year', 'category', 'film', 'name']]
awards = pd.merge(oscars, wins, how='left', on=['year', 'category', 'name', 'film'])
awards.head()
Out[18]:
We pull the films.csv file into a dataframe called films. This is then merged with the awards dataframe from above. Note that we include only specific fields; columns like metacritic_score and bom_worldwide are excluded because they contain too many null values, which would have an adverse effect on our model.
In [19]:
films = pd.read_csv('../data/films.csv')
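As a sanity check on the claim above about sparse columns, the per-column null counts can be inspected directly; this is just a quick look, not part of the wrangling itself.
# Count missing values per column to see which fields are too sparse to keep.
films.isnull().sum().sort_values(ascending=False).head(10)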
In [20]:
relevant_fields = [
'film',
'country',
'release_date',
'running_time',
'mpaa',
'box_office',
'budget',
'imdb_score',
'rt_audience_score',
'rt_critic_score',
'stars_count',
'writers_count'
]
df = pd.merge(awards, films[relevant_fields], how='left', on='film')
In [21]:
print "Total Observations:", len(df)
print
print "Observations with NaN fields:"
for column in df.columns:
    l = len(df[df[column].isnull()])
    if l != 0:
        print l, "\t", column
So we obviously have some null values, which is disappointing. We'll take the time to clean these up.
In [22]:
### FIX RUN TIME ###
# df[df['running_time'].isnull()] # Hilary and Jackie
df.loc[df['film'] == 'Hilary and Jackie', 'running_time'] = '121 minutes'
df.loc[df['film'] == 'Fanny and Alexander', 'running_time'] = '188 minutes'  # theatrical cut
### FIX MPAA RATING ###
df = df.replace('NOT RATED', np.nan)
df = df.replace('UNRATED', np.nan)
df = df.replace('M', np.nan)
df = df.replace('NC-17', np.nan)
df = df.replace('APPROVED', np.nan)
# df[df['mpaa'].isnull()]
df.loc[df['film'].isin(['L.A. Confidential', 'In the Loop']), 'mpaa'] = 'R'
df.loc[df['film'].isin(['True Grit', 'A Room with a View']), 'mpaa'] = 'PG-13'
### FIX COUNTRY ###
# df[df['country'].isnull()] # Ulee's Gold, The Constant Gardener, Dave
df.loc[df['film'].isin(["Ulee's Gold", "Dave"]), 'country'] = 'United States'
df.loc[df['country'].isnull(), 'country'] = 'United Kingdom'
df.loc[df['country'] == 'Germany\\', 'country'] = 'Germany'
df.loc[df['country'] == 'United States & Australia', 'country'] = 'United States'
df['country'].unique()
### FIX STARS COUNT ###
# df[df['stars_count'].isnull()]
df.loc[df['film'].isin(['Before Sunset', 'Before Midnight']), 'stars_count'] = 2
df.loc[df['film'] == 'Dick Tracy', 'stars_count'] = 10
df.loc[df['stars_count'].isnull(), 'stars_count'] = 1
### FIX RELEASE DATE ###
# drop rows whose release_date is just a bare year ('1970'); they can't be parsed into a month below
df = df[~df['release_date'].isin(['1970'])]
In [23]:
def to_numeric(value):
    """Convert a raw money string (e.g. '$25.5 million (US)') to a float, or NaN."""
    multiplier = 1
    try:
        value = re.sub(r'([$,])', '', str(value)).strip()     # strip dollar signs and commas
        value = re.sub(r'\([^)]*\)', '', str(value)).strip()  # strip parenthetical notes
        if 'million' in value:
            multiplier = 1000000
        elif 'billion' in value:
            multiplier = 1000000000
        for replace in ['US', 'billion', 'million']:
            value = value.replace(replace, '')
        value = value.split(' ')[0]
        if isinstance(value, str):
            value = value.split('-')[0]  # keep the low end of a range like '20-25'
        value = float(value) * multiplier
    except:
        return np.nan
    return value

def to_runtime(value):
    """Pull the first number out of a running-time string like '121 minutes' (returned as a string)."""
    try:
        return re.findall(r'\d+', value)[0]
    except:
        return np.nan
### Apply function to appropriate fields ###
for field in ['box_office', 'budget']:
    df[field] = df[field].apply(to_numeric)
df['release_month'] = df['release_date'].apply(lambda y: int(y.split('-')[1]))
df['running_time'] = df['running_time'].apply(to_runtime)
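A few illustrative calls showing what these helpers are expected to return (made-up inputs; actual strings in films.csv may differ):
to_numeric('$25.5 million (estimated)')   # -> 25500000.0
to_numeric('US$1.2 billion')              # -> 1200000000.0
to_numeric('not available')               # -> nan
to_runtime('121 minutes')                 # -> '121'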
In [25]:
### FIX BOX OFFICE / BUDGET ###
# The fills below are left commented out; first, peek at the films that still
# have no MPAA rating (these are dropped a few cells down).
list(df[df['mpaa'].isnull()]['film'].unique())
# cleaned_box_offices = {
# 'Mona Lisa': 5794184,
# 'Testament': 2044982,
# 'Pennies from Heaven': 9171289,
# 'The Year of Living Dangerously': 10300000
# }
# for key, value in cleaned_box_offices.items():
# df.loc[df['film'] == key, 'box_office'] = value
# ### FIX BUDGET ###
# # df[(df['budget'].isnull())]['film'].unique()
# cleaned_budgets = {'Juno': 6500000, 'Blue Sky': 16000000, 'Pollock': 6000000 }
# for key, value in cleaned_budgets.items():
# df.loc[df['film'] == key, 'budget'] = value
Out[25]:
While we are pretty happy with our MPAA field, we can't input it into a predictive model as is. The decision tree would not know how to treat a string (e.g., "PG"). So, instead, we pivot those values into separate, boolean fields.
So instead of...
film | mpaa |
---|---|
Raging Bull | R |
Kramer vs. Kramer | PG |
We get...
film | G | PG | PG13 | R |
---|---|---|---|---|
Raging Bull | 0 | 0 | 0 | 1 |
Kramer vs. Kramer | 0 | 1 | 0 | 0 |
This essentially "quantifies" the MPAA feature so that our algorithm can properly interpret it. Note that we perform a similar action for production country (just for the USA) and seasonality.
In [26]:
df = df[~df['mpaa'].isnull()]
In [27]:
df['produced_USA'] = df['country'].apply(lambda x: 1 if x == 'United States' else 0)
for column in df['mpaa'].unique():
df[column.replace('-', '')] = df['mpaa'].apply(lambda x: 1 if x == column else 0)
df['q1_release'] = df['release_month'].apply(lambda m: 1 if m <= 3 else 0)
df['q2_release'] = df['release_month'].apply(lambda m: 1 if m > 3 and m <= 6 else 0)
df['q3_release'] = df['release_month'].apply(lambda m: 1 if m > 6 and m <= 9 else 0)
df['q4_release'] = df['release_month'].apply(lambda m: 1 if m > 9 else 0)
In [28]:
df.to_csv('../data/analysis.csv', index=False)
In [29]:
del df['mpaa']
del df['country']
del df['release_date']
del df['release_month']
del df['budget']
for column in df.columns:
    df = df[~df[column].isnull()]
In [30]:
df.to_csv('../data/prepared.csv', index=False)
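For completeness, here is a rough sketch of how prepared.csv might feed the decision tree mentioned at the top of this notebook. The choice of feature columns, the 'Oscar' column as the label, the train/test split, and the tree depth are all assumptions here rather than something this notebook defines.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

prepared = pd.read_csv('../data/prepared.csv')
# Assumed layout: 'Oscar' is the label, identifier-like columns are dropped from the features.
X = prepared.drop(['Oscar', 'film', 'name', 'category'], axis=1)
y = prepared['Oscar']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
print clf.score(X_test, y_test)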