This notebook describes the process by which the raw films.csv and nominations.csv files are "wrangled" into a workable format for our classifier(s). At the time of this writing (February 25, 2017), the resulting dataset is only used in a decision tree classifier.
In [15]:
import re
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f' % x)
In [16]:
nominations = pd.read_csv('../data/nominations.csv')
# clean out some obvious mistakes...
nominations = nominations[~nominations['film'].isin(['2001: A Space Odyssey', 'Oliver!', 'Closely Observed Train'])]
nominations = nominations[nominations['year'] >= 1980]
# scraper pulled in some character names instead of film names...
nominations.loc[nominations['film'] == 'Penny Lane', 'film'] = 'Almost Famous'
nominations.loc[nominations['film'] == 'Sister James', 'film'] = 'Doubt'
In [17]:
wins = pd.pivot_table(nominations, values='winner', index=['year', 'category', 'film', 'name'], columns=['award'], aggfunc=np.sum)
wins = wins.fillna(0) # if a nominee wasn't in a specific ceremony, we just fill it as a ZERO.
wins.reset_index(inplace=True) # flattens the dataframe
wins.head()
Out[17]:
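Since the rendered table isn't reproduced here, a rough, hypothetical sketch of what the pivot produces: one row per (year, category, film, name) with one 0/1 column per award. For example (toy rows; the film, name, and award values are placeholders, not the real nominations data):

# Toy illustration of the pivot (made-up rows; names are placeholders).
toy = pd.DataFrame({
    'year': [2000, 2000], 'category': ['Best Actor', 'Best Actor'],
    'film': ['Film A', 'Film B'], 'name': ['Actor A', 'Actor B'],
    'award': ['Oscar', 'Golden Globe'], 'winner': [1, 1],
})
pd.pivot_table(toy, values='winner', index=['year', 'category', 'film', 'name'],
               columns=['award'], aggfunc=np.sum).fillna(0).reset_index()
# -> columns: year, category, film, name, Golden Globe, Oscar (0/1 flags)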
We only care about films that were nominated for an Academy Award. The pd.merge function performs a left join between the oscars dataframe and the wins dataframe; in other words, based on the join fields, we prune out any films that were never nominated for an Academy Award.
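As a quick illustration of that pruning (toy frames with hypothetical names, not the real data):

# Toy illustration: a left join keeps every row from the left frame and drops
# right-frame rows whose keys never appear on the left.
toy_oscars = pd.DataFrame({'film': ['Film A', 'Film B']})
toy_wins = pd.DataFrame({'film': ['Film B', 'Film C'], 'Golden Globe': [1, 0]})
pd.merge(toy_oscars, toy_wins, how='left', on='film')
# 'Film A' is kept with NaN for 'Golden Globe'; 'Film C' is dropped entirely.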
In [18]:
oscars = nominations[nominations['award'] == 'Oscar'][['year', 'category', 'film', 'name']]
awards = pd.merge(oscars, wins, how='left', on=['year', 'category', 'name', 'film'])
awards.head()
Out[18]:
We pull the films.csv file into a dataframe called films, which is then merged with the awards dataframe from above. Note that we only include specific fields: columns like metacritic_score and bom_worldwide are excluded because they contain too many null values, which would adversely affect our model.
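To see why those columns were left out, a quick null count per column is enough (an illustrative check, not part of the pipeline):

# Count missing values per column in films.csv; the sparsest columns are excluded below.
pd.read_csv('../data/films.csv').isnull().sum().sort_values(ascending=False).head(10)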
In [19]:
films = pd.read_csv('../data/films.csv')
In [20]:
relevant_fields = [
'film',
'country',
'release_date',
'running_time',
'mpaa',
'box_office',
'budget',
'imdb_score',
'rt_audience_score',
'rt_critic_score',
'stars_count',
'writers_count'
]
df = pd.merge(awards, films[relevant_fields], how='left', on='film')
In [21]:
print "Total Observations:", len(df)
print
print "Observations with NaN fields:"
for column in df.columns:
    l = len(df[df[column].isnull()])
    if l != 0:
        print l, "\t", column
So we obviously have some null values, which is disappointing. We'll take the time to clean these up.
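The pattern throughout the next cell is inspect-then-patch: find the rows with a missing value, then fill them in by film title. A minimal sketch of that pattern (with a placeholder title and runtime; the real fixes follow):

# Inspect-then-patch sketch; 'Some Film' and '120 minutes' are placeholders.
df[df['running_time'].isnull()][['film', 'running_time']]
df.loc[df['film'] == 'Some Film', 'running_time'] = '120 minutes'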
In [22]:
### FIX RUN TIME ###
# df[df['running_time'].isnull()] # Hilary and Jackie
df.loc[df['film'] == 'Hilary and Jackie', 'running_time'] = '121 minutes'
df.loc[df['film'] == 'Fanny and Alexander', 'running_time'] = '121 minutes'
### FIX MPAA RATING ###
df = df.replace('NOT RATED', np.nan)
df = df.replace('UNRATED', np.nan)
df = df.replace('M', np.nan)
df = df.replace('NC-17', np.nan)
df = df.replace('APPROVED', np.nan)
# df[df['mpaa'].isnull()]
df.loc[df['film'].isin(['L.A. Confidential', 'In the Loop']), 'mpaa'] = 'R'
df.loc[df['film'].isin(['True Grit', 'A Room with a View']), 'mpaa'] = 'PG-13'
### FIX COUNTRY ###
# df[df['country'].isnull()] # Ulee's Gold, The Constant Gardner, Dave
df.loc[df['film'].isin(["Ulee's Gold", "Dave"]), 'country'] = 'United States'
df.loc[df['country'].isnull(), 'country'] = 'United Kingdom'
df.loc[df['country'] == 'Germany\\', 'country'] = 'Germany'
df.loc[df['country'] == 'United States & Australia', 'country'] = 'United States'
# df['country'].unique()
### FIX STARS COUNT ###
# df[df['stars_count'].isnull()]
df.loc[df['film'].isin(['Before Sunset', 'Before Midnight']), 'stars_count'] = 2
df.loc[df['film'] == 'Dick Tracy', 'stars_count'] = 10
df.loc[df['stars_count'].isnull(), 'stars_count'] = 1
### FIX RELEASE DATE ###
# drop rows whose release_date was scraped as just '1970' (no month component)
df = df[~df['release_date'].isin(['1970'])]
In [23]:
def to_numeric(value):
    """Convert scraped currency strings (e.g. '$25 million') to a float."""
    multiplier = 1
    try:
        value = re.sub(r'([$,])', '', str(value)).strip()     # drop dollar signs and commas
        value = re.sub(r'\([^)]*\)', '', str(value)).strip()  # drop parentheticals like '(estimated)'
        if 'million' in value:
            multiplier = 1000000
        elif 'billion' in value:
            multiplier = 1000000000
        for replace in ['US', 'billion', 'million']:
            value = value.replace(replace, '')
        value = value.split(' ')[0]
        if isinstance(value, str):
            value = value.split('-')[0]  # for ranges like '10-12', keep the low end
        value = float(value) * multiplier
    except:
        return np.nan
    return value
def to_runtime(value):
    """Pull the first number out of strings like '121 minutes'."""
    try:
        return re.findall(r'\d+', value)[0]
    except:
        return np.nan
### Apply function to appropriate fields ###
for field in ['box_office', 'budget']:
    df[field] = df[field].apply(to_numeric)
df['release_month'] = df['release_date'].apply(lambda y: int(y.split('-')[1]))  # release_date is 'YYYY-MM-DD'
df['running_time'] = df['running_time'].apply(to_runtime)
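A few hand-picked inputs (illustrative strings, not actual rows from the data) show what the helpers are expected to return:

# Spot-check the parsing helpers on typical scraped strings (made-up examples).
to_numeric('$25 million')      # -> 25000000.0
to_numeric('US$1.2 billion')   # -> 1200000000.0
to_numeric('$7,500,000')       # -> 7500000.0
to_runtime('121 minutes')      # -> '121' (note: returned as a string)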
In [25]:
### FIX BOX OFFICE ###
list(df[df['mpaa'].isnull()]['film'].unique())
# cleaned_box_offices = {
# 'Mona Lisa': 5794184,
# 'Testament': 2044982,
# 'Pennies from Heaven': 9171289,
# 'The Year of Living Dangerously': 10300000
# }
# for key, value in cleaned_box_offices.items():
# df.loc[df['film'] == key, 'box_office'] = value
# ### FIX BUDGET ###
# # df[(df['budget'].isnull())]['film'].unique()
# cleaned_budgets = {'Juno': 6500000, 'Blue Sky': 16000000, 'Pollock': 6000000 }
# for key, value in cleaned_budgets.items():
# df.loc[df['film'] == key, 'budget'] = value
Out[25]:
While we are pretty happy with our MPAA field, we can't feed it into a predictive model as is: the decision tree would not know how to treat a string (e.g., "PG"). So instead, we pivot those values into separate boolean fields.
So instead of...
| film | mpaa |
|---|---|
| Raging Bull | R |
| Kramer vs. Kramer | PG |
We get...
| film | G | PG | PG13 | R |
|---|---|---|---|---|
| Raging Bull | 0 | 0 | 0 | 1 |
| Kramer vs. Kramer | 0 | 1 | 0 | 0 |
This essentially "quantifies" the MPAA feature so that our algorithm can properly interpret it. Note that we perform a similar action for production country (just for the USA) and seasonality.
In [26]:
df = df[~df['mpaa'].isnull()]
In [27]:
df['produced_USA'] = df['country'].apply(lambda x: 1 if x == 'United States' else 0)
for column in df['mpaa'].unique():
    df[column.replace('-', '')] = df['mpaa'].apply(lambda x: 1 if x == column else 0)
df['q1_release'] = df['release_month'].apply(lambda m: 1 if m <= 3 else 0)
df['q2_release'] = df['release_month'].apply(lambda m: 1 if m > 3 and m <= 6 else 0)
df['q3_release'] = df['release_month'].apply(lambda m: 1 if m > 6 and m <= 9 else 0)
df['q4_release'] = df['release_month'].apply(lambda m: 1 if m > 9 else 0)
In [28]:
df.to_csv('../data/analysis.csv', index=False)
In [29]:
del df['mpaa']
del df['country']
del df['release_date']
del df['release_month']
del df['budget']
# drop any remaining rows that still contain nulls
for column in df.columns:
    df = df[~df[column].isnull()]
In [30]:
df.to_csv('../data/prepared.csv', index=False)