This notebook explores using data science techniques on a dataset of 5,000+ movies to predict whether a movie will be highly rated on IMDb.
The objective is to follow a step-by-step workflow, explaining each step and the rationale for every decision we take during solution development.
This notebook is adapted from "Titanic Data Science Solutions" by Manav Sehgal
This workflow goes through seven stages.
The workflow indicates the general sequence in which the stages typically follow one another, though there are use cases with exceptions.
The original dataset used in this notebook can be found on Kaggle.
Given a training set of samples listing movies and their IMDb scores, can our model determine, for a test dataset that does not contain the scores, whether each movie in the test dataset scored highly or not?
The data science solutions workflow solves for seven major goals.
Classifying. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.
Correlating. One can approach the problem based on the available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than the IMDb score for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features (see the sketch after this list).
Converting. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may require all features to be converted to numerical equivalents, for instance converting text categorical values to numeric values.
Completing. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.
Correcting. We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.
Creating. Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?
Charting. How do we select the right visualization plots and charts, depending on the nature of the data and the solution goals? A good start is to read the Tableau paper "Which chart or graph is right for you?".
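As a small illustration of the correlating goal, pivoting a categorical feature against a binary target gives a quick read on correlation. This is a minimal sketch on toy data (not the movie dataset loaded below); we do the same thing for real features later in the notebook.
import pandas as pd
# Toy data: a categorical feature and a binary solution goal.
toy = pd.DataFrame({
    'country': ['USA', 'UK', 'USA', 'France', 'UK'],
    'high_score': [0, 1, 0, 1, 1],
})
# The mean of the binary target per category approximates P(high score | category).
print(toy.groupby('country', as_index=False)['high_score'].mean()
         .sort_values(by='high_score', ascending=False))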
In [348]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
from scipy.stats import truncnorm
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
The Python Pandas package helps us work with our datasets. We start by acquiring the dataset into a Pandas DataFrame.
We will partition off 80% of the data for training and hold out the remaining 20% as our test set.
We defer the partitioning until after the data wrangling: it keeps the code simpler, makes no real difference to the analysis, and removes any run-to-run differences in the banding steps.
In [349]:
df = pd.read_csv('../input/movie_metadata.csv')
# train_df, test_df = train_test_split(df, test_size = 0.2)
# test_actual = test_df['imdb_score']
# test_df = test_df.drop('imdb_score', axis=1)
# combine = [train_df, test_df]
Pandas also helps us describe the dataset, answering the following questions early in our project.
Which features are available in the dataset?
We note the feature names so we can manipulate or analyze them directly. These features are described on the dataset's Kaggle page.
In [350]:
print(df.columns.values)
Which features are categorical?
Which features are numerical?
In [351]:
pd.set_option('display.max_columns', 50)
In [352]:
# preview the data
df.head()
Out[352]:
In [353]:
incomplete = df.columns[pd.isnull(df).any()].tolist()
df[incomplete].info()
Which features contain blank, null or empty values?
These will require correcting.
What are the data types for various features?
This helps us with the converting goal.
In [354]:
df.info()
In [355]:
df.describe()
Out[355]:
What is the distribution of categorical features?
In [356]:
df.describe(include=['O'])
Out[356]:
In [357]:
# Binarize the target: 1 = highly rated (IMDb score >= 7.0), 0 = otherwise.
df.loc[ df['imdb_score'] < 7.0, 'imdb_score'] = 0
df.loc[ df['imdb_score'] >= 7.0, 'imdb_score'] = 1
df.head()
Out[357]:
We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.
Correlating.
We want to know how well each feature correlates with the IMDb score. We want to do this early in the project and match these quick correlations with modelled correlations later in the project.
Completing.
Correcting.
Creating.
Classifying.
To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. At this stage it only makes sense to do so for categorical features with few missing values, such as content_rating, color, country, and director_name.
In [358]:
df[['content_rating', 'imdb_score']].groupby(['content_rating'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[358]:
In [359]:
df[["color", "imdb_score"]].groupby(['color'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[359]:
In [360]:
df[["director_name", "imdb_score"]].groupby(['director_name'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[360]:
In [361]:
df[["country", "imdb_score"]].groupby(['country'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[361]:
In [362]:
g = sns.FacetGrid(df, col='imdb_score')
g.map(plt.hist, 'title_year', bins=20)
Out[362]:
In [363]:
grid = sns.FacetGrid(df, col='imdb_score', row='color', size=2.2, aspect=1.6)
grid.map(plt.hist, 'title_year', alpha=.5, bins=20)
grid.add_legend();
We have collected several assumptions and decisions regarding our dataset and solution requirements. Apart from binarizing the IMDb score, we have not had to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for the correcting, creating, and completing goals.
Dropping features is a good goal to execute first: with fewer data points the notebook runs faster and the analysis is easier.
Based on our assumptions and decisions we want to drop the genres, movie_title, plot_keywords, and movie_imdb_link features right away; the director and actor name features will be dropped later, after we have extracted film counts from them.
Note that because we deferred the train/test split, these operations are applied to the full dataset, which keeps the training and test portions consistent.
In [366]:
print("Before", df.shape)
df = df.drop(['genres', 'movie_title', 'plot_keywords', 'movie_imdb_link'], axis=1)
"After", df.shape
Out[366]:
We want to analyze whether the director and actor name features can be engineered to extract the number of films each person directed or starred in, and to test the correlation between the number of films and the score, before dropping the name features.
In the following code we build the num_of_films_director and num_of_films_actor counts by iterating over the dataframe.
Where the director name field is empty, we fill the derived count with 1.
Observations.
When we plot the number of films directed and the number of films acted in, we note the following observations.
Decision.
In [367]:
actors = {}
directors = {}
for index, row in df.iterrows():
    # Tally how many films each actor appears in across the three actor columns.
    for actor in row[['actor_1_name', 'actor_2_name', 'actor_3_name']]:
        if pd.notnull(actor):
            if actor not in actors:
                actors[actor] = 0
            actors[actor] += 1
    # Tally how many films each director directed.
    director = row['director_name']
    if pd.notnull(director):
        if director not in directors:
            directors[director] = 0
        directors[director] += 1
In [368]:
df['num_of_films_director'] = df["director_name"].dropna().map(directors).astype(int)
df['num_of_films_director'] = df['num_of_films_director'].fillna(1)
In [369]:
df['NumFilmsBand'] = pd.cut(df['num_of_films_director'], 4)
df[['NumFilmsBand', 'imdb_score']].groupby(['NumFilmsBand'], as_index=False).mean().sort_values(by='NumFilmsBand', ascending=True)
Out[369]:
In [370]:
df.loc[ df['num_of_films_director'] <= 7, 'num_of_films_director'] = 0
df.loc[(df['num_of_films_director'] > 7) & (df['num_of_films_director'] <= 13), 'num_of_films_director'] = 1
df.loc[(df['num_of_films_director'] > 13) & (df['num_of_films_director'] <= 19), 'num_of_films_director'] = 2
df.loc[ df['num_of_films_director'] > 19, 'num_of_films_director'] = 3
df.head()
Out[370]:
We can now remove the director_name and NumFilmsBand features.
In [371]:
df = df.drop(['director_name', 'NumFilmsBand'], axis=1)
Now, let's examine actors by number of films acted in. Since we have three actors listed per film, we'll need to combine these numbers. Let's fill any empty fields with the median value for that field, and sum the columns.
In [372]:
df["actor_1_name"].dropna().map(actors).describe()
Out[372]:
In [373]:
df['num_of_films_actor_1'] = df["actor_1_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_1'] = df['num_of_films_actor_1'].fillna(8)
In [374]:
df["actor_2_name"].dropna().map(actors).describe()
Out[374]:
In [375]:
df['num_of_films_actor_2'] = df["actor_2_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_2'] = df['num_of_films_actor_2'].fillna(4)
In [376]:
df["actor_3_name"].dropna().map(actors).describe()
Out[376]:
In [377]:
df['num_of_films_actor_3'] = df["actor_3_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_3'] = df['num_of_films_actor_3'].fillna(2)
In [378]:
df['actor_sum'] = df["num_of_films_actor_1"] + df["num_of_films_actor_2"] + df["num_of_films_actor_3"]
In [379]:
df['ActorSumBand'] = pd.cut(df['actor_sum'], 5)
df[['ActorSumBand', 'imdb_score']].groupby(['ActorSumBand'], as_index=False).mean().sort_values(by='ActorSumBand', ascending=True)
Out[379]:
In [380]:
df.loc[ df['actor_sum'] <= 24, 'actor_sum'] = 0
df.loc[(df['actor_sum'] > 24) & (df['actor_sum'] <= 46), 'actor_sum'] = 1
df.loc[(df['actor_sum'] > 46) & (df['actor_sum'] <= 67), 'actor_sum'] = 2
df.loc[(df['actor_sum'] > 67) & (df['actor_sum'] <= 89), 'actor_sum'] = 3
df.loc[ df['actor_sum'] > 89, 'actor_sum'] = 4
In [381]:
df.head()
Out[381]:
Now we can remove actor_1_name, actor_2_name, actor_3_name, the individual num_of_films_actor columns, and ActorSumBand.
In [382]:
df = df.drop(['actor_1_name', 'num_of_films_actor_1', 'actor_2_name', 'num_of_films_actor_2', 'actor_3_name', 'num_of_films_actor_3', 'ActorSumBand'], axis=1)
In [383]:
df.head()
Out[383]:
Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.
Let us start by converting the color feature to a numeric value, where Black and White = 1 and Color = 0.
Since some values are null, let's fill them with the most common value, Color.
In [384]:
df['color'] = df['color'].fillna("Color")
In [385]:
df['color'] = df['color'].map( {' Black and White': 1, 'Color': 0} ).astype(int)  # the dataset stores this label with a leading space
df.head()
Out[385]:
Next, let's look at the language and country features
In [386]:
df['language'].value_counts()
Out[386]:
The bulk of the films are in English. Let's convert this field to 1 for non-English and 0 for English.
First, let's fill any null values with English.
In [387]:
df['language'] = df['language'].fillna("English")
In [388]:
df['language'] = df['language'].map(lambda l: 0 if l == 'English' else 1)
In [389]:
df.head()
Out[389]:
Next, let's explore country
In [390]:
df['country'].value_counts()
Out[390]:
Again, most films are from the USA. Taking the same approach, we'll fill NaNs with USA, and transform USA to 0 and all other countries to 1.
In [391]:
df['country'] = df['country'].fillna("USA")
In [392]:
df['country'] = df['country'].map(lambda c: 0 if c == 'USA' else 1)
In [393]:
df.head()
Out[393]:
Next up is content rating. Let's look at that
In [394]:
df['content_rating'].value_counts()
Out[394]:
The majority of the films use the standard MPAA ratings: G, PG, PG-13, and R.
Let's group the rest of the films (and the null values) into a fifth 'Not Rated' category, and then transform the ratings to integers.
In [395]:
df['content_rating'] = df['content_rating'].map({'G':0, 'PG':1, 'PG-13': 2, 'R': 3}).fillna(4).astype(int)
In [396]:
df.head()
Out[396]:
Aspect ratio may seem like a numerical feature, but it is really more of a categorical one. First, what values do we find in the dataset?
In [397]:
df['aspect_ratio'].value_counts()
Out[397]:
Some of these values seem to be in the wrong format: 16.00 is most likely 16:9 (1.78) and 4.00 is most likely 4:3 (1.33). Let's fix those.
In [398]:
# Fill nulls with the predominant 2.35 ratio, then correct the mis-coded values.
df['aspect_ratio'] = df['aspect_ratio'].fillna(2.35)
df['aspect_ratio'] = df['aspect_ratio'].map(lambda ar: 1.33 if ar == 4.00 else ar)
df['aspect_ratio'] = df['aspect_ratio'].map(lambda ar: 1.78 if ar == 16.00 else ar)
In [399]:
df[['aspect_ratio', 'imdb_score']].groupby(pd.cut(df['aspect_ratio'], 4)).mean()
Out[399]:
The above banding looks good. It separates out the two predominant aspect ratios (2.35 and 1.85) and also gives one band below and one above them. Let's use that.
In [400]:
df.loc[ df['aspect_ratio'] <= 1.575, 'aspect_ratio'] = 0
df.loc[(df['aspect_ratio'] > 1.575) & (df['aspect_ratio'] <= 1.97), 'aspect_ratio'] = 1
df.loc[(df['aspect_ratio'] > 1.97) & (df['aspect_ratio'] <= 2.365), 'aspect_ratio'] = 2
df.loc[ df['aspect_ratio'] > 2.365, 'aspect_ratio'] = 3
In [401]:
df.head()
Out[401]:
Now we should start estimating and completing features with missing or null values. We will first do this for the Duration feature.
We can consider several methods to complete a numerical continuous feature.
1. The easiest way is to use the median value.
2. Another simple way is to generate random numbers within one standard deviation of the mean.
3. A more accurate way of guessing missing values is to use other correlated features, taking the median value within groups defined by a correlated feature.
4. Combine methods 2 and 3: instead of drawing from the overall mean and standard deviation, draw random numbers from the mean and standard deviation within each group of correlated feature values.
We will use method 2. Sketches of methods 1 and 3 follow for comparison.
In [402]:
mean = df['duration'].mean()
std = df['duration'].std()
mean, std
Out[402]:
In [403]:
# Draw replacement values from a normal distribution truncated to within one standard deviation of the mean.
df['duration'] = df['duration'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)
Let us create Duration bands and determine correlations with IMDb score.
In [404]:
df[['duration', 'imdb_score']].groupby(pd.qcut(df['duration'], 5)).mean()
Out[404]:
Let us replace Duration with ordinals based on these bands.
In [405]:
df.loc[ df['duration'] <= 91, 'duration'] = 0
df.loc[(df['duration'] > 91) & (df['duration'] <= 99), 'duration'] = 1
df.loc[(df['duration'] > 99) & (df['duration'] <= 108), 'duration'] = 2
df.loc[(df['duration'] > 108) & (df['duration'] <= 122), 'duration'] = 3
df.loc[ df['duration'] > 122, 'duration'] = 4
df.head()
Out[405]:
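Before moving on, note that this fill-band-ordinalize pattern will be repeated by hand for many features below. A small helper along the following lines (a hypothetical sketch, not used in the cells that follow) could encapsulate it:
def band_feature(series, n_bands=5, fill_value=None):
    # Fill missing values (defaulting to the mean), then replace each value
    # with the label of its quantile band: 0 .. n_bands-1.
    filled = series.fillna(series.mean() if fill_value is None else fill_value)
    return pd.qcut(filled, n_bands, labels=False, duplicates='drop')
# Example usage (hypothetical):
# df['num_critic_for_reviews'] = band_feature(df['num_critic_for_reviews'])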
Let's apply the same techniques to the following features:
num_critic_for_reviews
In [406]:
mean = df['num_critic_for_reviews'].mean()
std = df['num_critic_for_reviews'].std()
mean, std
Out[406]:
In [407]:
df['num_critic_for_reviews'] = df['num_critic_for_reviews'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)
In [408]:
df[['num_critic_for_reviews', 'imdb_score']].groupby(pd.qcut(df['num_critic_for_reviews'], 5)).mean()
Out[408]:
In [409]:
df.loc[ df['num_critic_for_reviews'] <= 40, 'num_critic_for_reviews'] = 0
df.loc[(df['num_critic_for_reviews'] > 40) & (df['num_critic_for_reviews'] <= 84), 'num_critic_for_reviews'] = 1
df.loc[(df['num_critic_for_reviews'] > 84) & (df['num_critic_for_reviews'] <= 140), 'num_critic_for_reviews'] = 2
df.loc[(df['num_critic_for_reviews'] > 140) & (df['num_critic_for_reviews'] <= 222), 'num_critic_for_reviews'] = 3
df.loc[ df['num_critic_for_reviews'] > 222, 'num_critic_for_reviews'] = 4
df.head()
Out[409]:
director_facebook_likes
In [410]:
mean = df['director_facebook_likes'].mean()
std = df['director_facebook_likes'].std()
mean, std
Out[410]:
Since the standard deviation for this field is ~4x the mean, we'll just stick to using the mean value for nulls
In [411]:
df['director_facebook_likes'] = df['director_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [412]:
df[['director_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['director_facebook_likes'], 5)).mean()
Out[412]:
In [413]:
df.loc[ df['director_facebook_likes'] <= 3, 'director_facebook_likes'] = 0
df.loc[(df['director_facebook_likes'] > 3) & (df['director_facebook_likes'] <= 27.8), 'director_facebook_likes'] = 1
df.loc[(df['director_facebook_likes'] > 27.8) & (df['director_facebook_likes'] <= 91), 'director_facebook_likes'] = 2
df.loc[(df['director_facebook_likes'] > 91) & (df['director_facebook_likes'] <= 309), 'director_facebook_likes'] = 3
df.loc[ df['director_facebook_likes'] > 309, 'director_facebook_likes'] = 4
df.head()
Out[413]:
actor_1_facebook_likes
In [414]:
mean = df['actor_1_facebook_likes'].mean()
std = df['actor_1_facebook_likes'].std()
mean, std
Out[414]:
In [415]:
df['actor_1_facebook_likes'] = df['actor_1_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [416]:
df['actor_1_facebook_likes'].describe()
Out[416]:
In [417]:
df[['actor_1_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_1_facebook_likes'], 5)).mean()
Out[417]:
In [418]:
df.loc[ df['actor_1_facebook_likes'] <= 523, 'actor_1_facebook_likes'] = 0
df.loc[(df['actor_1_facebook_likes'] > 523) & (df['actor_1_facebook_likes'] <= 865), 'actor_1_facebook_likes'] = 1
df.loc[(df['actor_1_facebook_likes'] > 865) & (df['actor_1_facebook_likes'] <= 2000), 'actor_1_facebook_likes'] = 2
df.loc[(df['actor_1_facebook_likes'] > 2000) & (df['actor_1_facebook_likes'] <= 13000), 'actor_1_facebook_likes'] = 3
df.loc[ df['actor_1_facebook_likes'] > 13000, 'actor_1_facebook_likes'] = 4
df.head()
Out[418]:
actor_2_facebook_likes
In [419]:
mean = df['actor_2_facebook_likes'].mean()
std = df['actor_2_facebook_likes'].std()
mean, std
Out[419]:
In [420]:
df['actor_2_facebook_likes'] = df['actor_2_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [421]:
df[['actor_2_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_2_facebook_likes'], 5)).mean()
Out[421]:
In [422]:
df.loc[ df['actor_2_facebook_likes'] <= 218, 'actor_2_facebook_likes'] = 0
df.loc[(df['actor_2_facebook_likes'] > 218) & (df['actor_2_facebook_likes'] <= 486), 'actor_2_facebook_likes'] = 1
df.loc[(df['actor_2_facebook_likes'] > 486) & (df['actor_2_facebook_likes'] <= 726.2), 'actor_2_facebook_likes'] = 2
df.loc[(df['actor_2_facebook_likes'] > 726.2) & (df['actor_2_facebook_likes'] <= 979), 'actor_2_facebook_likes'] = 3
df.loc[ df['actor_2_facebook_likes'] > 979, 'actor_2_facebook_likes'] = 4
df.head()
Out[422]:
actor_3_facebook_likes
In [423]:
mean = df['actor_3_facebook_likes'].mean()
std = df['actor_3_facebook_likes'].std()
mean, std
Out[423]:
In [424]:
df['actor_3_facebook_likes'] = df['actor_3_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [425]:
df['actor_3_facebook_likes'].describe()
Out[425]:
In [426]:
df[['actor_3_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_3_facebook_likes'], 5)).mean()
Out[426]:
In [427]:
df.loc[ df['actor_3_facebook_likes'] <= 97, 'actor_3_facebook_likes'] = 0
df.loc[(df['actor_3_facebook_likes'] > 97) & (df['actor_3_facebook_likes'] <= 265), 'actor_3_facebook_likes'] = 1
df.loc[(df['actor_3_facebook_likes'] > 265) & (df['actor_3_facebook_likes'] <= 472), 'actor_3_facebook_likes'] = 2
df.loc[(df['actor_3_facebook_likes'] > 472) & (df['actor_3_facebook_likes'] <= 700), 'actor_3_facebook_likes'] = 3
df.loc[ df['actor_3_facebook_likes'] > 700, 'actor_3_facebook_likes'] = 4
df.head()
Out[427]:
gross
In [428]:
mean = df['gross'].mean()
std = df['gross'].std()
mean, std
Out[428]:
In [429]:
df['gross'] = df['gross'].map(lambda v: mean if pd.isnull(v) else v)
In [430]:
df['gross'].describe()
Out[430]:
In [431]:
df[['gross', 'imdb_score']].groupby(pd.qcut(df['gross'], 5)).mean()
Out[431]:
In [432]:
df.loc[ df['gross'] <= 4909758.4, 'gross'] = 0
df.loc[(df['gross'] > 4909758.4) & (df['gross'] <= 24092475.2), 'gross'] = 1
df.loc[(df['gross'] > 24092475.2) & (df['gross'] <= 48468407.527), 'gross'] = 2
df.loc[(df['gross'] > 48468407.527) & (df['gross'] <= 64212162.4), 'gross'] = 3
df.loc[ df['gross'] > 64212162.4, 'gross'] = 4
df.head()
Out[432]:
facenumber_in_poster
In [433]:
mean = df['facenumber_in_poster'].mean()
std = df['facenumber_in_poster'].std()
mean, std
Out[433]:
In [434]:
df['facenumber_in_poster'].value_counts()
Out[434]:
In [435]:
df['facenumber_in_poster'].median()
Out[435]:
In [436]:
df['facenumber_in_poster'] = df['facenumber_in_poster'].map(lambda v: 1 if pd.isnull(v) else v)
In [437]:
df['facenumber_in_poster'].describe()
Out[437]:
In [438]:
df[['facenumber_in_poster', 'imdb_score']].groupby(pd.cut(df['facenumber_in_poster'], [-1,0,1,2,100])).mean()
Out[438]:
In [439]:
df.loc[ df['facenumber_in_poster'] <= 0, 'facenumber_in_poster'] = 0
df.loc[(df['facenumber_in_poster'] > 0) & (df['facenumber_in_poster'] <= 1), 'facenumber_in_poster'] = 1
df.loc[(df['facenumber_in_poster'] > 1) & (df['facenumber_in_poster'] <= 2), 'facenumber_in_poster'] = 2
df.loc[ df['facenumber_in_poster'] > 2, 'facenumber_in_poster'] = 3
df.head()
Out[439]:
num_user_for_reviews
In [440]:
mean = df['num_user_for_reviews'].mean()
std = df['num_user_for_reviews'].std()
mean, std
Out[440]:
In [441]:
df['num_user_for_reviews'] = df['num_user_for_reviews'].map(lambda v: mean if pd.isnull(v) else v)
In [442]:
df['num_user_for_reviews'].describe()
Out[442]:
In [443]:
df[['num_user_for_reviews', 'imdb_score']].groupby(pd.qcut(df['num_user_for_reviews'], 5)).mean()
Out[443]:
In [444]:
df.loc[ df['num_user_for_reviews'] <= 48, 'num_user_for_reviews'] = 0
df.loc[(df['num_user_for_reviews'] > 48) & (df['num_user_for_reviews'] <= 116), 'num_user_for_reviews'] = 1
df.loc[(df['num_user_for_reviews'] > 116) & (df['num_user_for_reviews'] <= 210), 'num_user_for_reviews'] = 2
df.loc[(df['num_user_for_reviews'] > 210) & (df['num_user_for_reviews'] <= 389), 'num_user_for_reviews'] = 3
df.loc[ df['num_user_for_reviews'] > 389, 'num_user_for_reviews'] = 4
df.head()
Out[444]:
budget
In [445]:
mean = df['budget'].mean()
std = df['budget'].std()
mean, std
Out[445]:
In [446]:
df['budget'] = df['budget'].map(lambda v: mean if pd.isnull(v) else v)
In [447]:
df['budget'].describe()
Out[447]:
In [448]:
df[['budget', 'imdb_score']].groupby(pd.qcut(df['budget'], 3)).mean()
Out[448]:
In [449]:
df.loc[ df['budget'] <= 12000000, 'budget'] = 0
df.loc[(df['budget'] > 12000000) & (df['budget'] <= 39752620.436), 'budget'] = 1
df.loc[ df['budget'] > 39752620.436, 'budget'] = 2
df.head()
Out[449]:
title_year
In [450]:
mean = df['title_year'].mean()
std = df['title_year'].std()
mean, std
Out[450]:
In [451]:
df['title_year'] = df['title_year'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)
In [452]:
df[['title_year', 'imdb_score']].groupby(pd.cut(df['title_year'], 5)).mean()
Out[452]:
In [453]:
df.loc[ df['title_year'] <= 1936, 'title_year'] = 0
df.loc[(df['title_year'] > 1936) & (df['title_year'] <= 1956), 'title_year'] = 1
df.loc[(df['title_year'] > 1956) & (df['title_year'] <= 1976), 'title_year'] = 2
df.loc[(df['title_year'] > 1976) & (df['title_year'] <= 1996), 'title_year'] = 3
df.loc[ df['title_year'] > 1996, 'title_year'] = 4
df.head()
Out[453]:
num_voted_users
In [454]:
mean = df['num_voted_users'].mean()
std = df['num_voted_users'].std()
mean, std
Out[454]:
In [455]:
df['num_voted_users'] = df['num_voted_users'].map(lambda v: mean if pd.isnull(v) else v)
In [456]:
df['num_voted_users'].describe()
Out[456]:
In [457]:
df[['num_voted_users', 'imdb_score']].groupby(pd.qcut(df['num_voted_users'], 5)).mean()
Out[457]:
In [458]:
df.loc[ df['num_voted_users'] <= 5623.8, 'num_voted_users'] = 0
df.loc[(df['num_voted_users'] > 5623.8) & (df['num_voted_users'] <= 21478.4), 'num_voted_users'] = 1
df.loc[(df['num_voted_users'] > 21478.4) & (df['num_voted_users'] <= 53178.2), 'num_voted_users'] = 2
df.loc[(df['num_voted_users'] > 53178.2) & (df['num_voted_users'] <= 1.24e+05), 'num_voted_users'] = 3
df.loc[ df['num_voted_users'] > 1.24e+05, 'num_voted_users'] = 4
df.head()
Out[458]:
cast_total_facebook_likes
In [459]:
mean = df['cast_total_facebook_likes'].mean()
std = df['cast_total_facebook_likes'].std()
mean, std
Out[459]:
In [460]:
df['cast_total_facebook_likes'] = df['cast_total_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [461]:
df['cast_total_facebook_likes'].describe()
Out[461]:
In [462]:
df[['cast_total_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['cast_total_facebook_likes'], 5)).mean()
Out[462]:
In [463]:
df.loc[ df['cast_total_facebook_likes'] <= 1136, 'cast_total_facebook_likes'] = 0
df.loc[(df['cast_total_facebook_likes'] > 1136) & (df['cast_total_facebook_likes'] <= 2366.6), 'cast_total_facebook_likes'] = 1
df.loc[(df['cast_total_facebook_likes'] > 2366.6) & (df['cast_total_facebook_likes'] <= 4369.2), 'cast_total_facebook_likes'] = 2
df.loc[(df['cast_total_facebook_likes'] > 4369.2) & (df['cast_total_facebook_likes'] <= 16285.8), 'cast_total_facebook_likes'] = 3
df.loc[ df['cast_total_facebook_likes'] > 16285.8, 'cast_total_facebook_likes'] = 4
df.head()
Out[463]:
movie_facebook_likes
In [464]:
mean = df['movie_facebook_likes'].mean()
std = df['movie_facebook_likes'].std()
mean, std
Out[464]:
In [465]:
df['movie_facebook_likes'] = df['movie_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [466]:
df['movie_facebook_likes'].describe()
Out[466]:
In [467]:
df[df['movie_facebook_likes'] > 0][['movie_facebook_likes', 'imdb_score']].groupby(pd.qcut(df[df['movie_facebook_likes'] > 0]['movie_facebook_likes'], 4)).mean()
Out[467]:
In [468]:
df.loc[ df['movie_facebook_likes'] <= 0, 'movie_facebook_likes'] = 0
df.loc[(df['movie_facebook_likes'] > 0) & (df['movie_facebook_likes'] <= 401), 'movie_facebook_likes'] = 1
df.loc[(df['movie_facebook_likes'] > 401) & (df['movie_facebook_likes'] <= 1000), 'movie_facebook_likes'] = 2
df.loc[(df['movie_facebook_likes'] > 1000) & (df['movie_facebook_likes'] <= 17000), 'movie_facebook_likes'] = 3
df.loc[ df['movie_facebook_likes'] > 17000, 'movie_facebook_likes'] = 4
df.head()
Out[468]:
num_of_films_director
In [469]:
mean = df['num_of_films_director'].mean()
std = df['num_of_films_director'].std()
mean, std
Out[469]:
In [470]:
df['num_of_films_director'].value_counts()
Out[470]:
In [471]:
df['num_of_films_director'] = df['num_of_films_director'].map(lambda v: 1 if pd.isnull(v) else v)
In [472]:
df[['num_of_films_director', 'imdb_score']].groupby(pd.cut(df['num_of_films_director'], 3)).mean()
Out[472]:
In [473]:
df.loc[ df['num_of_films_director'] <= 1, 'num_of_films_director'] = 0
df.loc[(df['num_of_films_director'] > 1) & (df['num_of_films_director'] <= 2), 'num_of_films_director'] = 1
df.loc[ df['num_of_films_director'] > 2, 'num_of_films_director'] = 2
df.head()
Out[473]:
In [474]:
incomplete = df.columns[pd.isnull(df).any()].tolist()
df[incomplete].info()
In [476]:
train_df, test_df = train_test_split(df, test_size = 0.2)
Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirements to narrow the field down to a few models we can evaluate. Our problem is a classification problem: we want to identify the relationship between the output (highly rated or not) and the other features (duration, budget, country, and so on). We are also performing a category of machine learning called supervised learning, since we are training our model with a labelled dataset. With these two criteria, supervised learning plus classification, we can narrow down our choice of models to a few. These include Logistic Regression, k-Nearest Neighbors (KNN), Support Vector Machines, Naive Bayes, Perceptron, Linear SVC, Stochastic Gradient Descent, Decision Tree, and Random Forest.
In [477]:
X_train = train_df.drop("imdb_score", axis=1)
Y_train = train_df["imdb_score"]
X_test = test_df.drop("imdb_score", axis=1)
Y_test = test_df["imdb_score"]
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
Out[477]:
Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference Wikipedia.
Note the accuracy score the model achieves on the training dataset.
In [478]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Out[478]:
We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.
Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
In [479]:
for i, value in enumerate(train_df.columns):
    print(i, value)
In [480]:
# Column 17 is imdb_score, the target; delete it so the feature names line up with the coefficients.
coeff_df = pd.DataFrame(train_df.columns.delete(17))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
Out[480]:
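As an optional aside, exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of a high score for a one-unit increase in that (banded) feature, holding the others fixed. A quick sketch using the coeff_df built above:
# Odds ratios derived from the fitted logistic regression coefficients.
coeff_df['OddsRatio'] = np.exp(coeff_df['Correlation'])
coeff_df.sort_values(by='OddsRatio', ascending=False)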
Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference Wikipedia.
Note that the model's training accuracy is higher than the Logistic Regression model's.
In [481]:
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
Out[481]:
In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference Wikipedia.
The KNN training accuracy is better than both Logistic Regression and SVM.
In [482]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
Out[482]:
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Reference Wikipedia.
The model's training accuracy is the lowest among the models evaluated so far.
In [483]:
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
Out[483]:
The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. Reference Wikipedia.
In [484]:
# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
Out[484]:
In [485]:
# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
Out[485]:
In [486]:
# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
Out[486]:
This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference Wikipedia.
The model's training accuracy is the highest among the models evaluated so far.
In [487]:
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
Out[487]:
The next model, Random Forest, is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Reference Wikipedia.
The model's training accuracy is the highest among the models evaluated so far. We decide to use this model's output (Y_pred) for the final evaluation on our held-out test set.
In [488]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
Out[488]:
In [489]:
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Decent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
Out[489]:
In [491]:
print(accuracy_score(Y_test, Y_pred, normalize=False), '/', len(Y_test))
print(accuracy_score(Y_test, Y_pred))
Using the Random Forest classifier resulted in correctly predicting 793 out of 1009 movies, or 78.6%. Not bad for our first attempt. Any suggestions to improve our score are most welcome.
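One straightforward next step (a sketch, not run as part of the workflow above) would be to score the models with k-fold cross-validation instead of training accuracy, which is optimistic for the tree-based models:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy of the random forest on the full wrangled dataset.
X_all = df.drop('imdb_score', axis=1)
y_all = df['imdb_score']
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_all, y_all, cv=5)
print(cv_scores.mean(), cv_scores.std())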