IMDB Predictive Analytics

This notebook applies data science techniques to a dataset of 5000+ movies to predict whether a movie will be highly rated on IMDb.

The objective of this notebook is to follow a step-by-step workflow, explaining each step and the rationale for every decision we make during solution development.

This notebook is adapted from "Titanic Data Science Solutions" by Manav Sehgal.

Workflow stages

This workflow goes through seven stages.

  1. Question or problem definition.
  2. Acquire training and testing data.
  3. Wrangle, prepare, cleanse the data.
  4. Analyze, identify patterns, and explore the data.
  5. Model, predict and solve the problem.
  6. Visualize, report, and present the problem solving steps and final solution.
  7. Supply or submit the results.

The workflow indicates the general sequence in which each stage may follow the other. However, there are use cases with exceptions.

  • We may combine multiple workflow stages, e.g. analyze by visualizing data.
  • We may perform a stage earlier than indicated, e.g. analyze data before and after wrangling.
  • We may perform a stage multiple times in our workflow, e.g. the visualize stage.

Question and problem definition

The original data set used in this notebook can be found here at Kaggle.

Given a training set of samples listing movies and their IMDb scores, can our model determine, for a test dataset that does not contain the scores, whether each movie in the test dataset scored highly or not?

Workflow goals

The data science solutions workflow solves for seven major goals.

Classifying. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.

Correlating. One can approach the problem based on available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than the IMDb score for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features.
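
A minimal sketch of such quick checks, assuming the movies DataFrame df and its imdb_score column, both loaded later in this notebook:

In [ ]:
# Sketch: quick correlation checks against the solution goal
print(df.corr()['imdb_score'].sort_values(ascending=False))  # numerical features
print(df.groupby('content_rating')['imdb_score'].mean())     # a categorical feature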

Converting. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may require all features to be converted to numerical equivalent values, for instance converting text categorical values to numeric values.

Completing. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.

Correcting. We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.
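
One common outlier check, as an illustrative sketch only (budget is used here just as an example column, and df is the DataFrame loaded later in this notebook):

In [ ]:
# Sketch: flag samples more than three standard deviations from the mean
col = df['budget']
print(df[(col - col.mean()).abs() > 3 * col.std()].shape)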

Creating. Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?

Charting. How do we select the right visualization plots and charts depending on the nature of the data and the solution goals? A good start is to read the Tableau paper "Which chart or graph is right for you?".


In [348]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
from scipy.stats import truncnorm

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

Acquire data

The Python Pandas package helps us work with our datasets. We start by acquiring the dataset into a Pandas DataFrame.

We will partition off 80% as our training data and 20% as our test data. We would also combine these datasets to run certain operations on both together.

However, let's move the partitioning to after the data wrangling. This makes the code simpler without making a real difference, and it removes any differences in the banding portions between runs.


In [349]:
df = pd.read_csv('../input/movie_metadata.csv')
# train_df, test_df = train_test_split(df, test_size = 0.2)
# test_actual = test_df['imdb_score']
# test_df = test_df.drop('imdb_score', axis=1)
# combine = [train_df, test_df]
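
For reference, once wrangling is complete the deferred split will mirror the commented-out lines above:

In [ ]:
# Deferred 80/20 split, to be run after all wrangling steps (sketch)
train_df, test_df = train_test_split(df, test_size=0.2)
test_actual = test_df['imdb_score']
test_df = test_df.drop('imdb_score', axis=1)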

Analyze by describing data

Pandas also helps describe the datasets, answering the following questions early in our project.

Which features are available in the dataset?

Note the feature names, which we will use to directly manipulate or analyze the data. These feature names are described on the Kaggle page here.


In [350]:
print(df.columns.values)


['color' 'director_name' 'num_critic_for_reviews' 'duration'
 'director_facebook_likes' 'actor_3_facebook_likes' 'actor_2_name'
 'actor_1_facebook_likes' 'gross' 'genres' 'actor_1_name' 'movie_title'
 'num_voted_users' 'cast_total_facebook_likes' 'actor_3_name'
 'facenumber_in_poster' 'plot_keywords' 'movie_imdb_link'
 'num_user_for_reviews' 'language' 'country' 'content_rating' 'budget'
 'title_year' 'actor_2_facebook_likes' 'imdb_score' 'aspect_ratio'
 'movie_facebook_likes']

Which features are categorical?

  • Color, Director name, Actor 1 name, Actor 2 name, Actor 3 name, Genres, Language, Country, Content Rating, Movie title, Plot keywords, Movie IMDb link

Which features are numerical?

  • Number of critics for reviews, Duration, Director Facebook likes, Actor 1 Facebook likes, Actor 2 Facebook likes, Actor 3 Facebook likes, Gross, Number of voted users, Cast total Facebook likes, Number of faces in poster, Number of users for reviews, Budget, Title year, IMDb score, Aspect ratio, Movie Facebook likes

In [351]:
pd.set_option('display.max_columns', 50)

In [352]:
# preview the data
df.head()


Out[352]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres actor_1_name movie_title num_voted_users cast_total_facebook_likes actor_3_name facenumber_in_poster plot_keywords movie_imdb_link num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 4834 Wes Studi 0.0 avatar|future|marine|native|paraplegic http://www.imdb.com/title/tt0499549/?ref_=fn_t... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 48350 Jack Davenport 0.0 goddess|marriage ceremony|marriage proposal|pi... http://www.imdb.com/title/tt0449088/?ref_=fn_t... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller Christoph Waltz Spectre 275868 11700 Stephanie Sigman 1.0 bomb|espionage|sequel|spy|terrorist http://www.imdb.com/title/tt2379713/?ref_=fn_t... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 106759 Joseph Gordon-Levitt 0.0 deception|imprisonment|lawlessness|police offi... http://www.imdb.com/title/tt1345836/?ref_=fn_t... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary Doug Walker Star Wars: Episode VII - The Force Awakens  ... 8 143 NaN 0.0 NaN http://www.imdb.com/title/tt5289954/?ref_=fn_t... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

In [353]:
incomplete = df.columns[pd.isnull(df).any()].tolist()
df[incomplete].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 21 columns):
color                      5024 non-null object
director_name              4939 non-null object
num_critic_for_reviews     4993 non-null float64
duration                   5028 non-null float64
director_facebook_likes    4939 non-null float64
actor_3_facebook_likes     5020 non-null float64
actor_2_name               5030 non-null object
actor_1_facebook_likes     5036 non-null float64
gross                      4159 non-null float64
actor_1_name               5036 non-null object
actor_3_name               5020 non-null object
facenumber_in_poster       5030 non-null float64
plot_keywords              4890 non-null object
num_user_for_reviews       5022 non-null float64
language                   5031 non-null object
country                    5038 non-null object
content_rating             4740 non-null object
budget                     4551 non-null float64
title_year                 4935 non-null float64
actor_2_facebook_likes     5030 non-null float64
aspect_ratio               4714 non-null float64
dtypes: float64(12), object(9)
memory usage: 827.4+ KB

Which features contain blank, null or empty values?

These will require completing.

  • color
  • director_name
  • num_critic_for_reviews
  • duration
  • director_facebook_likes
  • actor_3_facebook_likes
  • actor_2_name
  • actor_1_facebook_likes
  • gross
  • actor_1_name
  • actor_3_name
  • facenumber_in_poster
  • plot_keywords
  • num_user_for_reviews
  • language
  • country
  • content_rating
  • budget
  • title_year
  • actor_2_facebook_likes
  • aspect_ratio

What are the data types for various features?

This helps us during the converting goal.

  • Thirteen features are floats and three are integers.
  • Twelve features are strings (object).

In [354]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Data columns (total 28 columns):
color                        5024 non-null object
director_name                4939 non-null object
num_critic_for_reviews       4993 non-null float64
duration                     5028 non-null float64
director_facebook_likes      4939 non-null float64
actor_3_facebook_likes       5020 non-null float64
actor_2_name                 5030 non-null object
actor_1_facebook_likes       5036 non-null float64
gross                        4159 non-null float64
genres                       5043 non-null object
actor_1_name                 5036 non-null object
movie_title                  5043 non-null object
num_voted_users              5043 non-null int64
cast_total_facebook_likes    5043 non-null int64
actor_3_name                 5020 non-null object
facenumber_in_poster         5030 non-null float64
plot_keywords                4890 non-null object
movie_imdb_link              5043 non-null object
num_user_for_reviews         5022 non-null float64
language                     5031 non-null object
country                      5038 non-null object
content_rating               4740 non-null object
budget                       4551 non-null float64
title_year                   4935 non-null float64
actor_2_facebook_likes       5030 non-null float64
imdb_score                   5043 non-null float64
aspect_ratio                 4714 non-null float64
movie_facebook_likes         5043 non-null int64
dtypes: float64(13), int64(3), object(12)
memory usage: 1.1+ MB

In [355]:
df.describe()


Out[355]:
num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
count 4993.000000 5028.000000 4939.000000 5020.000000 5036.000000 4.159000e+03 5.043000e+03 5043.000000 5030.000000 5022.000000 4.551000e+03 4935.000000 5030.000000 5043.000000 4714.000000 5043.000000
mean 140.194272 107.201074 686.509212 645.009761 6560.047061 4.846841e+07 8.366816e+04 9699.063851 1.371173 272.770808 3.975262e+07 2002.470517 1651.754473 6.442138 2.220403 7525.964505
std 121.601675 25.197441 2813.328607 1665.041728 15020.759120 6.845299e+07 1.384853e+05 18163.799124 2.013576 377.982886 2.061149e+08 12.474599 4042.438863 1.125116 1.385113 19320.445110
min 1.000000 7.000000 0.000000 0.000000 0.000000 1.620000e+02 5.000000e+00 0.000000 0.000000 1.000000 2.180000e+02 1916.000000 0.000000 1.600000 1.180000 0.000000
25% 50.000000 93.000000 7.000000 133.000000 614.000000 5.340988e+06 8.593500e+03 1411.000000 0.000000 65.000000 6.000000e+06 1999.000000 281.000000 5.800000 1.850000 0.000000
50% 110.000000 103.000000 49.000000 371.500000 988.000000 2.551750e+07 3.435900e+04 3090.000000 1.000000 156.000000 2.000000e+07 2005.000000 595.000000 6.600000 2.350000 166.000000
75% 195.000000 118.000000 194.500000 636.000000 11000.000000 6.230944e+07 9.630900e+04 13756.500000 2.000000 326.000000 4.500000e+07 2011.000000 918.000000 7.200000 2.350000 3000.000000
max 813.000000 511.000000 23000.000000 23000.000000 640000.000000 7.605058e+08 1.689764e+06 656730.000000 43.000000 5060.000000 1.221550e+10 2016.000000 137000.000000 9.500000 16.000000 349000.000000

What is the distribution of categorical features?


In [356]:
df.describe(include=['O'])


Out[356]:
color director_name actor_2_name genres actor_1_name movie_title actor_3_name plot_keywords movie_imdb_link language country content_rating
count 5024 4939 5030 5043 5036 5043 5020 4890 5043 5031 5038 4740
unique 2 2398 3032 914 2097 4917 3521 4760 4919 47 65 18
top Color Steven Spielberg Morgan Freeman Drama Robert De Niro Halloween Steve Coogan based on novel http://www.imdb.com/title/tt0077651/?ref_=fn_t... English USA R
freq 4815 26 20 236 49 3 8 4 3 4704 3807 2118

Transformation of IMDb score

Let's simplify this problem into a binary classification. Let us treat all movies with an IMDb score of 7.0 or higher as "good" (with a value of 1) and all below as "bad" (with a value of 0).


In [357]:
df.loc[ df['imdb_score'] < 7.0, 'imdb_score'] = 0
df.loc[ df['imdb_score'] >= 7.0, 'imdb_score'] = 1

df.head()


Out[357]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres actor_1_name movie_title num_voted_users cast_total_facebook_likes actor_3_name facenumber_in_poster plot_keywords movie_imdb_link num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi CCH Pounder Avatar 886204 4834 Wes Studi 0.0 avatar|future|marine|native|paraplegic http://www.imdb.com/title/tt0499549/?ref_=fn_t... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 1.0 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy Johnny Depp Pirates of the Caribbean: At World's End 471220 48350 Jack Davenport 0.0 goddess|marriage ceremony|marriage proposal|pi... http://www.imdb.com/title/tt0449088/?ref_=fn_t... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 1.0 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller Christoph Waltz Spectre 275868 11700 Stephanie Sigman 1.0 bomb|espionage|sequel|spy|terrorist http://www.imdb.com/title/tt2379713/?ref_=fn_t... 994.0 English UK PG-13 245000000.0 2015.0 393.0 0.0 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller Tom Hardy The Dark Knight Rises 1144337 106759 Joseph Gordon-Levitt 0.0 deception|imprisonment|lawlessness|police offi... http://www.imdb.com/title/tt1345836/?ref_=fn_t... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 1.0 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary Doug Walker Star Wars: Episode VII - The Force Awakens  ... 8 143 NaN 0.0 NaN http://www.imdb.com/title/tt5289954/?ref_=fn_t... NaN NaN NaN NaN NaN NaN 12.0 1.0 NaN 0
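
As a quick sanity check on the resulting class balance, we can compute the share of films labeled "good"; the per-category means in the pivots below suggest roughly a third of films clear the 7.0 bar. A minimal sketch:

In [ ]:
# Fraction of films labeled "good" (1) after binarization
print(df['imdb_score'].mean())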

Assumptions based on data analysis

We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.

Correlating.

We want to know how well each feature correlates with the IMDb score. We want to do this early in our project and match these quick correlations with modelled correlations later in the project.

Completing.

Correcting.

Creating.

Classifying.

Analyze by pivoting features

To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. At this stage it makes sense to do so for features which are categorical or discrete, such as content_rating, color, director_name, and country (rows with empty values are simply ignored by the groupby).

  • content_rating The TV ratings (TV-Y7, TV-Y, TV-MA) show perfect success rates, but on very few samples. Among the common MPAA ratings, PG-13 films score well least often (~27%), with R films at ~37%.
  • color Black and White films are rated "good" roughly twice as often (~67%) as Color films (~34%).
  • director_name and country Both have many levels with only a handful of films each, so their per-level means are noisy.

In [358]:
df[['content_rating', 'imdb_score']].groupby(['content_rating'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)


Out[358]:
content_rating imdb_score
15 TV-Y7 1.000000
14 TV-Y 1.000000
12 TV-MA 1.000000
0 Approved 0.781818
8 Passed 0.777778
13 TV-PG 0.769231
10 TV-14 0.766667
16 Unrated 0.548387
11 TV-G 0.500000
17 X 0.461538
5 Not Rated 0.456897
1 G 0.455357
4 NC-17 0.428571
3 M 0.400000
9 R 0.367328
2 GP 0.333333
6 PG 0.310984
7 PG-13 0.266940

In [359]:
df[["color", "imdb_score"]].groupby(['color'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)


Out[359]:
color imdb_score
0 Black and White 0.665072
1 Color 0.339356

In [360]:
df[["director_name", "imdb_score"]].groupby(['director_name'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)


Out[360]:
director_name imdb_score
1277 Ken Loach 1.0
353 Christian Carion 1.0
589 Don Kempf 1.0
2020 Sam Martin 1.0
593 Don Siegel 1.0
1620 Mitchell Altieri 1.0
596 Doug Atchison 1.0
597 Doug Block 1.0
1622 Molly Bernstein 1.0
600 Doug Walker 1.0
2012 Sai Varadan 1.0
2011 Sadyk Sher-Niyaz 1.0
607 Drew Goddard 1.0
608 Dror Moreh 1.0
1625 Mor Loushy 1.0
610 Duke Johnson 1.0
611 Duncan Jones 1.0
612 Duncan Tucker 1.0
2009 S.S. Rajamouli 1.0
1628 Morgan Neville 1.0
2003 Ryan Fleck 1.0
588 Don Hall 1.0
1618 Miranda July 1.0
2029 Sanjay Rawal 1.0
566 Dennis Hopper 1.0
1274 Ken Annakin 1.0
1613 Mike van Diem 1.0
558 Debra Granik 1.0
559 Deepa Mehta 1.0
561 Dena Seidel 1.0
... ... ...
917 Jamel Debbouze 0.0
961 Jared Hess 0.0
962 Jason Alexander 0.0
963 Jason Bateman 0.0
996 Jeff Crook 0.0
1010 Jennifer Finnigan 0.0
1009 Jem Cohen 0.0
1007 Jeffrey W. Byrd 0.0
1006 Jeffrey St. Jules 0.0
1005 Jeff Wadlow 0.0
1003 Jeff Schaffer 0.0
1001 Jeff Nathanson 0.0
1000 Jeff Lowell 0.0
999 Jeff Kanew 0.0
998 Jeff Garlin 0.0
997 Jeff Franklin 0.0
995 Jeff Burr 0.0
964 Jason Connery 0.0
993 Jeb Stuart 0.0
986 Jean-Jacques Mantello 0.0
984 Jean-François Richet 0.0
978 Jay Duplass 0.0
976 Jay Alaimo 0.0
973 Jason Zada 0.0
972 Jason Trost 0.0
971 Jason Stone 0.0
969 Jason Naumann 0.0
966 Jason Friedberg 0.0
965 Jason Eisener 0.0
2397 Étienne Faure 0.0

2398 rows × 2 columns


In [361]:
df[["country", "imdb_score"]].groupby(['country'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)


Out[361]:
country imdb_score
0 Afghanistan 1.000000
33 Kenya 1.000000
63 United Arab Emirates 1.000000
18 Finland 1.000000
54 Soviet Union 1.000000
17 Egypt 1.000000
44 Panama 1.000000
43 Pakistan 1.000000
30 Israel 1.000000
9 Cameroon 1.000000
27 Indonesia 1.000000
35 Libya 1.000000
34 Kyrgyzstan 1.000000
13 Colombia 1.000000
37 Netherlands 0.800000
47 Poland 0.800000
1 Argentina 0.750000
28 Iran 0.750000
39 New Zealand 0.733333
56 Sweden 0.666667
64 West Germany 0.666667
14 Czech Republic 0.666667
15 Denmark 0.636364
6 Brazil 0.625000
53 South Korea 0.571429
32 Japan 0.521739
55 Spain 0.515152
61 UK 0.508929
58 Taiwan 0.500000
41 Norway 0.500000
... ... ...
36 Mexico 0.470588
23 Hong Kong 0.470588
12 China 0.433333
19 France 0.428571
3 Australia 0.381818
21 Germany 0.371134
25 Iceland 0.333333
29 Ireland 0.333333
62 USA 0.317310
10 Canada 0.309524
49 Russia 0.272727
5 Belgium 0.250000
52 South Africa 0.250000
59 Thailand 0.200000
40 Nigeria 0.000000
16 Dominican Republic 0.000000
2 Aruba 0.000000
60 Turkey 0.000000
20 Georgia 0.000000
57 Switzerland 0.000000
11 Chile 0.000000
4 Bahamas 0.000000
42 Official site 0.000000
51 Slovenia 0.000000
50 Slovakia 0.000000
46 Philippines 0.000000
45 Peru 0.000000
38 New Line 0.000000
8 Cambodia 0.000000
7 Bulgaria 0.000000

65 rows × 2 columns

Analyze by visualizing data

Now we can continue confirming some of our assumptions using visualizations for analyzing the data.

Correlating numerical features

Let us start by understanding correlations between numerical features and our solution goal (IMDb score).

Observations.

Decisions.


In [362]:
g = sns.FacetGrid(df, col='imdb_score')
g.map(plt.hist, 'title_year', bins=20)


Out[362]:
<seaborn.axisgrid.FacetGrid at 0x1130acd90>

Correlating numerical and ordinal features

We can combine multiple features for identifying correlations using a single plot. This can be done with numerical and categorical features which have numeric values.

Observations.

Decisions.


In [363]:
# grid = sns.FacetGrid(df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(df, col='imdb_score', row='color', size=2.2, aspect=1.6)
grid.map(plt.hist, 'title_year', alpha=.5, bins=20)
grid.add_legend();


Correlating categorical features

Now we can correlate categorical features with our solution goal.

Observations.

Decisions.


In [364]:
# grid = sns.FacetGrid(df, col='Embarked')
# grid = sns.FacetGrid(df, row='Embarked', size=2.2, aspect=1.6)
# grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
# grid.add_legend()

Correlating categorical and numerical features

Observations.

Decisions.


In [365]:
# grid = sns.FacetGrid(df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
# grid = sns.FacetGrid(df, row='Embarked', col='Survived', size=2.2, aspect=1.6)
# grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
# grid.add_legend()

Wrangle data

We have collected several assumptions and decisions regarding our datasets and solution requirements. So far, apart from binarizing the IMDb score, we did not have to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for the correcting, creating, and completing goals.

Correcting by dropping features

This is a good starting goal to execute. By dropping features we are dealing with fewer data points, which speeds up our notebook and eases the analysis.

Based on our assumptions and decisions we want to drop the genres, movie_title, plot_keywords, and movie_imdb_link features now; the director and actor name features will be dropped later, after we engineer new features from them.

Note that since we deferred the train/test split, these operations run once on the full dataset, which keeps the future training and testing partitions consistent.


In [366]:
print("Before", df.shape)

df = df.drop(['genres', 'movie_title', 'plot_keywords', 'movie_imdb_link'], axis=1)

"After", df.shape


('Before', (5043, 28))
Out[366]:
('After', (5043, 24))

Creating new feature extracting from existing

We want to analyze whether the director and actor name features can be engineered to extract the number of films each person directed or starred in, and to test the correlation between the number of films and the score, before dropping the name features.

In the following code we build per-director and per-actor film counts (used for the num_of_films_director and num_of_films_actor features) by iterating over the dataframe.

If the director name field is empty, we fill the derived film count with 1.

Observations.

When we examine the number of films directed and the number of films acted in (banded below), we note the following observations.

  • Directors with more films under their belt tend to have a higher success rate. It seems practice does make perfect.
  • Films where the actors have a higher number of combined films have a higher success rate. A more experienced cast, a better movie.

Decision.

  • We decide to band the directors into groups by number of films directed.
  • We decide to band the "total combined films acted in" into groups.

In [367]:
actors = {}
directors = {}

for index, row in df.iterrows():
    for actor in row[['actor_1_name', 'actor_2_name', 'actor_3_name']]:
        if actor is not np.nan:
            if actor not in actors:
                actors[actor] = 0
            actors[actor] += 1
    director = row['director_name']
    if director is not np.nan:
        if director not in directors:
            directors[director] = 0
        directors[director] += 1
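
As an aside, the same counts could be computed without the explicit loop. A vectorized sketch (value_counts skips NaN by default, so the results should match the dicts built above):

In [ ]:
# Vectorized equivalent of the loop above (sketch)
directors_vc = df['director_name'].value_counts().to_dict()
actors_vc = pd.concat([df['actor_1_name'], df['actor_2_name'],
                       df['actor_3_name']]).value_counts().to_dict()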

In [368]:
df['num_of_films_director'] = df["director_name"].dropna().map(directors).astype(int)
df['num_of_films_director'] = df['num_of_films_director'].fillna(1)

In [369]:
df['NumFilmsBand'] = pd.cut(df['num_of_films_director'], 4)
df[['NumFilmsBand', 'imdb_score']].groupby(['NumFilmsBand'], as_index=False).mean().sort_values(by='NumFilmsBand', ascending=True)


Out[369]:
NumFilmsBand imdb_score
0 (0.975, 7.25] 0.334201
1 (7.25, 13.5] 0.418641
2 (13.5, 19.75] 0.414894
3 (19.75, 26] 0.693182

In [370]:
df.loc[ df['num_of_films_director'] <= 7, 'num_of_films_director'] = 0
df.loc[(df['num_of_films_director'] > 7) & (df['num_of_films_director'] <= 13), 'num_of_films_director'] = 1
df.loc[(df['num_of_films_director'] > 13) & (df['num_of_films_director'] <= 19), 'num_of_films_director'] = 2
df.loc[ df['num_of_films_director'] > 19, 'num_of_films_director'] = 3
df.head()


Out[370]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name num_voted_users cast_total_facebook_likes actor_3_name facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director NumFilmsBand
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder 886204 4834 Wes Studi 0.0 3054.0 English USA PG-13 237000000.0 2009.0 936.0 1.0 1.78 33000 0.0 (0.975, 7.25]
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp 471220 48350 Jack Davenport 0.0 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 1.0 2.35 0 0.0 (0.975, 7.25]
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz 275868 11700 Stephanie Sigman 1.0 994.0 English UK PG-13 245000000.0 2015.0 393.0 0.0 2.35 85000 1.0 (7.25, 13.5]
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy 1144337 106759 Joseph Gordon-Levitt 0.0 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 1.0 2.35 164000 1.0 (7.25, 13.5]
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Doug Walker 8 143 NaN 0.0 NaN NaN NaN NaN NaN NaN 12.0 1.0 NaN 0 0.0 (0.975, 7.25]

We can now remove the director_name and NumFilmsBand features.


In [371]:
df = df.drop(['director_name', 'NumFilmsBand'], axis=1)

Now, let's examine actors by number of films acted in. Since we have three actors listed per film, we'll need to combine these numbers. Let's fill any empty fields with the median value for that field, and sum the columns.


In [372]:
df["actor_1_name"].dropna().map(actors).describe()


Out[372]:
count    5036.000000
mean       11.901509
std        11.397618
min         1.000000
25%         3.000000
50%         8.000000
75%        19.000000
max        54.000000
Name: actor_1_name, dtype: float64

In [373]:
df['num_of_films_actor_1'] = df["actor_1_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_1'] = df['num_of_films_actor_1'].fillna(8)

In [374]:
df["actor_2_name"].dropna().map(actors).describe()


Out[374]:
count    5030.000000
mean        6.411730
std         7.443835
min         1.000000
25%         1.000000
50%         4.000000
75%         8.000000
max        54.000000
Name: actor_2_name, dtype: float64

In [375]:
df['num_of_films_actor_2'] = df["actor_2_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_2'] = df['num_of_films_actor_2'].fillna(4)

In [376]:
df["actor_3_name"].dropna().map(actors).describe()


Out[376]:
count    5020.000000
mean        4.116135
std         4.746142
min         1.000000
25%         1.000000
50%         2.000000
75%         5.000000
max        47.000000
Name: actor_3_name, dtype: float64

In [377]:
df['num_of_films_actor_3'] = df["actor_3_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_3'] = df['num_of_films_actor_3'].fillna(2)

In [378]:
df['actor_sum'] = df["num_of_films_actor_1"] + df["num_of_films_actor_2"] + df["num_of_films_actor_3"]

In [379]:
df['ActorSumBand'] = pd.cut(df['actor_sum'], 5)
df[['ActorSumBand', 'imdb_score']].groupby(['ActorSumBand'], as_index=False).mean().sort_values(by='ActorSumBand', ascending=True)


Out[379]:
ActorSumBand imdb_score
0 (2.892, 24.6] 0.342672
1 (24.6, 46.2] 0.342809
2 (46.2, 67.8] 0.408257
3 (67.8, 89.4] 0.457746
4 (89.4, 111] 0.571429

In [380]:
df.loc[ df['actor_sum'] <= 24, 'actor_sum'] = 0
df.loc[(df['actor_sum'] > 24) & (df['actor_sum'] <= 46), 'actor_sum'] = 1
df.loc[(df['actor_sum'] > 46) & (df['actor_sum'] <= 67), 'actor_sum'] = 2
df.loc[(df['actor_sum'] > 67) & (df['actor_sum'] <= 89), 'actor_sum'] = 3
df.loc[ df['actor_sum'] > 89, 'actor_sum'] = 4

In [381]:
df.head()


Out[381]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross actor_1_name num_voted_users cast_total_facebook_likes actor_3_name facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director num_of_films_actor_1 num_of_films_actor_2 num_of_films_actor_3 actor_sum ActorSumBand
0 Color 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 CCH Pounder 886204 4834 Wes Studi 0.0 3054.0 English USA PG-13 237000000.0 2009.0 936.0 1.0 1.78 33000 0.0 7.0 7.0 6.0 0.0 (2.892, 24.6]
1 Color 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Johnny Depp 471220 48350 Jack Davenport 0.0 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 1.0 2.35 0 0.0 41.0 11.0 6.0 2.0 (46.2, 67.8]
2 Color 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Christoph Waltz 275868 11700 Stephanie Sigman 1.0 994.0 English UK PG-13 245000000.0 2015.0 393.0 0.0 2.35 85000 1.0 10.0 5.0 1.0 0.0 (2.892, 24.6]
3 Color 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Tom Hardy 1144337 106759 Joseph Gordon-Levitt 0.0 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 1.0 2.35 164000 1.0 13.0 28.0 19.0 2.0 (46.2, 67.8]
4 NaN NaN NaN 131.0 NaN Rob Walker 131.0 NaN Doug Walker 8 143 NaN 0.0 NaN NaN NaN NaN NaN NaN 12.0 1.0 NaN 0 0.0 1.0 1.0 2.0 0.0 (2.892, 24.6]

Now we can remove actor_1_name, actor_2_name, actor_3_name, the intermediate num_of_films_actor_* columns, and ActorSumBand.


In [382]:
df = df.drop(['actor_1_name', 'num_of_films_actor_1', 'actor_2_name', 'num_of_films_actor_2', 'actor_3_name', 'num_of_films_actor_3', 'ActorSumBand'], axis=1)

In [383]:
df.head()


Out[383]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 Color 723.0 178.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 English USA PG-13 237000000.0 2009.0 936.0 1.0 1.78 33000 0.0 0.0
1 Color 302.0 169.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 1.0 2.35 0 0.0 2.0
2 Color 602.0 148.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 English UK PG-13 245000000.0 2015.0 393.0 0.0 2.35 85000 1.0 0.0
3 Color 813.0 164.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 1.0 2.35 164000 1.0 2.0
4 NaN NaN NaN 131.0 NaN 131.0 NaN 8 143 0.0 NaN NaN NaN NaN NaN NaN 12.0 1.0 NaN 0 0.0 0.0

Converting a categorical feature

Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.

Let us start by converting the color feature to numeric values, where Black and White = 1 and Color = 0.

Since some values are null, let's fill them with the most common value, Color.


In [384]:
df['color'] = df['color'].fillna("Color")

In [385]:
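# note: the dataset stores this value with a leading space, ' Black and White'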
df['color'] = df['color'].map( {' Black and White': 1, 'Color': 0} ).astype(int)

df.head()


Out[385]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 723.0 178.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 English USA PG-13 237000000.0 2009.0 936.0 1.0 1.78 33000 0.0 0.0
1 0 302.0 169.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 1.0 2.35 0 0.0 2.0
2 0 602.0 148.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 English UK PG-13 245000000.0 2015.0 393.0 0.0 2.35 85000 1.0 0.0
3 0 813.0 164.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 1.0 2.35 164000 1.0 2.0
4 0 NaN NaN 131.0 NaN 131.0 NaN 8 143 0.0 NaN NaN NaN NaN NaN NaN 12.0 1.0 NaN 0 0.0 0.0

Next, let's look at the language and country features


In [386]:
df['language'].value_counts()


Out[386]:
English       4704
French          73
Spanish         40
Hindi           28
Mandarin        26
German          19
Japanese        18
Cantonese       11
Russian         11
Italian         11
Korean           8
Portuguese       8
Arabic           5
Swedish          5
Hebrew           5
Danish           5
Persian          4
Dutch            4
Polish           4
Norwegian        4
Thai             3
Chinese          3
None             2
Zulu             2
Aboriginal       2
Icelandic        2
Romanian         2
Dari             2
Indonesian       2
Urdu             1
Czech            1
Filipino         1
Tamil            1
Slovenian        1
Swahili          1
Aramaic          1
Greek            1
Hungarian        1
Bosnian          1
Kazakh           1
Panjabi          1
Kannada          1
Mongolian        1
Dzongkha         1
Vietnamese       1
Maya             1
Telugu           1
Name: language, dtype: int64

The bulk of the films are in English. Let's convert this field to 1 for Non-English, and 0 for English

First, let's fill any null values with English


In [387]:
df['language'] = df['language'].fillna("English")

In [388]:
df['language'] = df['language'].map(lambda l: 0 if l == 'English' else 1)

In [389]:
df.head()


Out[389]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 723.0 178.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 0 USA PG-13 237000000.0 2009.0 936.0 1.0 1.78 33000 0.0 0.0
1 0 302.0 169.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 0 USA PG-13 300000000.0 2007.0 5000.0 1.0 2.35 0 0.0 2.0
2 0 602.0 148.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 0 UK PG-13 245000000.0 2015.0 393.0 0.0 2.35 85000 1.0 0.0
3 0 813.0 164.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 0 USA PG-13 250000000.0 2012.0 23000.0 1.0 2.35 164000 1.0 2.0
4 0 NaN NaN 131.0 NaN 131.0 NaN 8 143 0.0 NaN 0 NaN NaN NaN NaN 12.0 1.0 NaN 0 0.0 0.0

Next, let's explore country


In [390]:
df['country'].value_counts()


Out[390]:
USA                     3807
UK                       448
France                   154
Canada                   126
Germany                   97
Australia                 55
India                     34
Spain                     33
China                     30
Italy                     23
Japan                     23
Mexico                    17
Hong Kong                 17
New Zealand               15
South Korea               14
Ireland                   12
Russia                    11
Denmark                   11
South Africa               8
Brazil                     8
Norway                     8
Sweden                     6
Thailand                   5
Poland                     5
Netherlands                5
Belgium                    4
Iran                       4
Israel                     4
Argentina                  4
Romania                    4
                        ... 
Greece                     2
Taiwan                     2
Bulgaria                   1
Cambodia                   1
Official site              1
Aruba                      1
Cameroon                   1
Slovenia                   1
Colombia                   1
United Arab Emirates       1
Georgia                    1
Pakistan                   1
Chile                      1
Soviet Union               1
Kenya                      1
Kyrgyzstan                 1
Turkey                     1
Afghanistan                1
Nigeria                    1
Indonesia                  1
Slovakia                   1
Peru                       1
Bahamas                    1
Philippines                1
Dominican Republic         1
Libya                      1
Finland                    1
Panama                     1
Egypt                      1
New Line                   1
Name: country, dtype: int64

Again, most films are from the USA. Taking the same approach, we'll fill NaNs with USA, and transform USA to 0 and all others to 1.


In [391]:
df['country'] = df['country'].fillna("USA")

In [392]:
df['country'] = df['country'].map(lambda c: 0 if c == 'USA' else 1)

In [393]:
df.head()


Out[393]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 723.0 178.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 0 0 PG-13 237000000.0 2009.0 936.0 1.0 1.78 33000 0.0 0.0
1 0 302.0 169.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 0 0 PG-13 300000000.0 2007.0 5000.0 1.0 2.35 0 0.0 2.0
2 0 602.0 148.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 0 1 PG-13 245000000.0 2015.0 393.0 0.0 2.35 85000 1.0 0.0
3 0 813.0 164.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 0 0 PG-13 250000000.0 2012.0 23000.0 1.0 2.35 164000 1.0 2.0
4 0 NaN NaN 131.0 NaN 131.0 NaN 8 143 0.0 NaN 0 0 NaN NaN NaN 12.0 1.0 NaN 0 0.0 0.0

Next up is content rating. Let's look at that


In [394]:
df['content_rating'].value_counts()


Out[394]:
R            2118
PG-13        1461
PG            701
Not Rated     116
G             112
Unrated        62
Approved       55
TV-14          30
TV-MA          20
X              13
TV-PG          13
TV-G           10
Passed          9
NC-17           7
GP              6
M               5
TV-Y7           1
TV-Y            1
Name: content_rating, dtype: int64

The majority of the films use the standard MPAA ratings: G, PG, PG-13, and R.

Let's group the rest of the films (and null values) into a single "other/unrated" category, and then transform the ratings to integers.


In [395]:
df['content_rating'] = df['content_rating'].map({'G':0, 'PG':1, 'PG-13': 2, 'R': 3}).fillna(4).astype(int)

In [396]:
df.head()


Out[396]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 723.0 178.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 936.0 1.0 1.78 33000 0.0 0.0
1 0 302.0 169.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 5000.0 1.0 2.35 0 0.0 2.0
2 0 602.0 148.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 393.0 0.0 2.35 85000 1.0 0.0
3 0 813.0 164.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 23000.0 1.0 2.35 164000 1.0 2.0
4 0 NaN NaN 131.0 NaN 131.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 12.0 1.0 NaN 0 0.0 0.0

Aspect ratio may seem like a numerical feature, but it's somewhat of a categorical one. First, what values do we find in the dataset?


In [397]:
df['aspect_ratio'].value_counts()


Out[397]:
2.35     2360
1.85     1906
1.78      110
1.37      100
1.33       68
1.66       64
16.00      45
2.20       15
2.39       15
4.00        7
2.00        5
1.75        3
2.40        3
2.76        3
2.55        2
1.50        2
2.24        1
1.20        1
1.18        1
1.44        1
1.77        1
1.89        1
Name: aspect_ratio, dtype: int64

Some of these values seem to be in the wrong format: 16.00 is most likely 16:9 (1.78), and 4.00 is most likely 4:3 (1.33). Let's fix those, and fill the null values with the most common ratio, 2.35.


In [398]:
df['aspect_ratio'] = df['aspect_ratio'].fillna(2.35)
df['aspect_ratio'] = df['aspect_ratio'].map(lambda ar: 1.33 if ar == 4.00 else ar)
df['aspect_ratio'] = df['aspect_ratio'].map(lambda ar: 1.78 if ar == 16.00 else ar)

In [399]:
df[['aspect_ratio', 'imdb_score']].groupby(pd.cut(df['aspect_ratio'], 4)).mean()


Out[399]:
aspect_ratio imdb_score
aspect_ratio
(1.178, 1.575] 1.353167 0.672222
(1.575, 1.97] 1.839038 0.357277
(1.97, 2.365] 2.348483 0.326199
(2.365, 2.76] 2.453478 0.521739

The above banding looks good. It separates out the two predominant aspect ratios (2.35 and 1.85), and also has one band below and one above these ratios. Let's use that.


In [400]:
df.loc[ df['aspect_ratio'] <= 1.575, 'aspect_ratio'] = 0
df.loc[(df['aspect_ratio'] > 1.575) & (df['aspect_ratio'] <= 1.97), 'aspect_ratio'] = 1
df.loc[(df['aspect_ratio'] > 1.97) & (df['aspect_ratio'] <= 2.365), 'aspect_ratio'] = 2
df.loc[ df['aspect_ratio'] > 2.365, 'aspect_ratio'] = 3

In [401]:
df.head()


Out[401]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 723.0 178.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 936.0 1.0 1.0 33000 0.0 0.0
1 0 302.0 169.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 5000.0 1.0 2.0 0 0.0 2.0
2 0 602.0 148.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 393.0 0.0 2.0 85000 1.0 0.0
3 0 813.0 164.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 23000.0 1.0 2.0 164000 1.0 2.0
4 0 NaN NaN 131.0 NaN 131.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 12.0 1.0 2.0 0 0.0 0.0

Completing a numerical continuous feature

Now we should start estimating and completing features with missing or null values. We will first do this for the Duration feature.

We can consider four methods to complete a numerical continuous feature.

  1. The easiest way is to use the median value.

  2. Another simple way is to generate random numbers between the mean and the standard deviation (that is, within one standard deviation of the mean).

  3. A more accurate way of guessing missing values is to use other correlated features, taking the median value within groups defined by those features.

  4. Combine methods 1 and 2: instead of guessing values based on the median, use random numbers between the mean and the standard deviation within sets of correlated feature combinations.

We will use method 2.
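
The truncnorm call used below takes its bounds in units of the standard deviation: rvs(-1, 1, loc=mean, scale=std) draws only values within one standard deviation of the mean. A minimal sketch of its behavior:

In [ ]:
# truncnorm.rvs(-1, 1, loc=m, scale=s) is restricted to [m - s, m + s]
samples = truncnorm.rvs(-1, 1, loc=100, scale=25, size=5)
print(samples)  # every value falls between 75 and 125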


In [402]:
mean = df['duration'].mean()
std = df['duration'].std()
mean, std


Out[402]:
(107.2010739856802, 25.19744080882403)

In [403]:
df['duration'] = df['duration'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)

Let us create Duration bands and determine correlations with IMDb score.


In [404]:
df[['duration', 'imdb_score']].groupby(pd.qcut(df['duration'], 5)).mean()


Out[404]:
duration imdb_score
duration
[7, 91] 81.340020 0.283286
(91, 99] 95.590629 0.217039
(99, 108] 103.739559 0.286982
(108, 122] 114.939788 0.396702
(122, 511] 143.169953 0.591815

Let us replace Duration with ordinals based on these bands.


In [405]:
df.loc[ df['duration'] <= 91, 'duration'] = 0
df.loc[(df['duration'] > 91) & (df['duration'] <= 99), 'duration'] = 1
df.loc[(df['duration'] > 99) & (df['duration'] <= 108), 'duration'] = 2
df.loc[(df['duration'] > 108) & (df['duration'] <= 122), 'duration'] = 3
df.loc[ df['duration'] > 122, 'duration'] = 4

df.head()


Out[405]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 723.0 4.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 936.0 1.0 1.0 33000 0.0 0.0
1 0 302.0 4.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 5000.0 1.0 2.0 0 0.0 2.0
2 0 602.0 4.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 393.0 0.0 2.0 85000 1.0 0.0
3 0 813.0 4.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 23000.0 1.0 2.0 164000 1.0 2.0
4 0 NaN 1.0 131.0 NaN 131.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 12.0 1.0 2.0 0 0.0 0.0

Let's apply the same techniques to the following features (a reusable helper is sketched after the list):

  • num_critic_for_reviews
  • director_facebook_likes
  • actor_1_facebook_likes
  • actor_2_facebook_likes
  • actor_3_facebook_likes
  • gross
  • facenumber_in_poster
  • num_user_for_reviews
  • budget
  • title_year
  • num_voted_users
  • cast_total_facebook_likes
  • movie_facebook_likes
  • num_of_films_director
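
Rather than repeating these steps by hand, the fill-and-band pattern could be factored into a helper like the hypothetical sketch below (the cells that follow keep the explicit per-feature code):

In [ ]:
# Hypothetical helper mirroring the per-feature steps below: fill missing
# values with a truncated-normal draw around the mean, then replace the
# raw values with equal-frequency band ordinals (0..n_bands-1).
def fill_and_band(series, n_bands=5):
    m, s = series.mean(), series.std()
    filled = series.map(
        lambda v: truncnorm.rvs(-1, 1, loc=m, scale=s) if pd.isnull(v) else v)
    return pd.qcut(filled, n_bands, labels=False)

# usage sketch: df['gross'] = fill_and_band(df['gross'])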

num_critic_for_reviews


In [406]:
mean = df['num_critic_for_reviews'].mean()
std = df['num_critic_for_reviews'].std()
mean, std


Out[406]:
(140.1942719807731, 121.60167539623113)

In [407]:
df['num_critic_for_reviews'] = df['num_critic_for_reviews'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)

In [408]:
df[['num_critic_for_reviews', 'imdb_score']].groupby(pd.qcut(df['num_critic_for_reviews'], 5)).mean()


Out[408]:
num_critic_for_reviews imdb_score
num_critic_for_reviews
[1, 40] 19.028646 0.248790
(40, 84] 62.355562 0.276171
(84, 140] 111.599699 0.317682
(140, 222] 176.511550 0.386454
(222, 813] 335.060839 0.536926

In [409]:
df.loc[ df['num_critic_for_reviews'] <= 40, 'num_critic_for_reviews'] = 0
df.loc[(df['num_critic_for_reviews'] > 40) & (df['num_critic_for_reviews'] <= 84), 'num_critic_for_reviews'] = 1
df.loc[(df['num_critic_for_reviews'] > 84) & (df['num_critic_for_reviews'] <= 140), 'num_critic_for_reviews'] = 2
df.loc[(df['num_critic_for_reviews'] > 140) & (df['num_critic_for_reviews'] <= 222), 'num_critic_for_reviews'] = 3
df.loc[ df['num_critic_for_reviews'] > 222, 'num_critic_for_reviews'] = 4

df.head()


Out[409]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 936.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 563.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 5000.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 393.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 22000.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 23000.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 131.0 NaN 131.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 12.0 1.0 2.0 0 0.0 0.0

director_facebook_likes


In [410]:
mean = df['director_facebook_likes'].mean()
std = df['director_facebook_likes'].std()
mean, std


Out[410]:
(686.5092123911724, 2813.328606865637)

Since the standard deviation for this field is ~4x the mean, random draws within one standard deviation of the mean could go negative, so we'll just stick to using the mean value for nulls.


In [411]:
df['director_facebook_likes'] = df['director_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)

In [412]:
df[['director_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['director_facebook_likes'], 5)).mean()


Out[412]:
director_facebook_likes imdb_score
director_facebook_likes
[0, 3] 0.323077 0.429808
(3, 27.8] 13.329580 0.184237
(27.8, 91] 54.767717 0.245079
(91, 309] 175.680915 0.347913
(309, 23000] 3203.513902 0.549801

In [413]:
df.loc[ df['director_facebook_likes'] <= 3, 'director_facebook_likes'] = 0
df.loc[(df['director_facebook_likes'] > 3) & (df['director_facebook_likes'] <= 27.8), 'director_facebook_likes'] = 1
df.loc[(df['director_facebook_likes'] > 27.8) & (df['director_facebook_likes'] <= 91), 'director_facebook_likes'] = 2
df.loc[(df['director_facebook_likes'] > 91) & (df['director_facebook_likes'] <= 309), 'director_facebook_likes'] = 3
df.loc[ df['director_facebook_likes'] > 309, 'director_facebook_likes'] = 4

df.head()


Out[413]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 855.0 1000.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 936.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 1000.0 40000.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 5000.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 161.0 11000.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 393.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 23000.0 27000.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 23000.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 NaN 131.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 12.0 1.0 2.0 0 0.0 0.0

actor_1_facebook_likes


In [414]:
mean = df['actor_1_facebook_likes'].mean()
std = df['actor_1_facebook_likes'].std()
mean, std


Out[414]:
(6560.04706115965, 15020.759119983974)

In [415]:
df['actor_1_facebook_likes'] = df['actor_1_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)

In [416]:
df['actor_1_facebook_likes'].describe()


Out[416]:
count      5043.000000
mean       6560.047061
std       15010.328553
min           0.000000
25%         615.500000
50%         989.000000
75%       11000.000000
max      640000.000000
Name: actor_1_facebook_likes, dtype: float64

In [417]:
df[['actor_1_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_1_facebook_likes'], 5)).mean()


Out[417]:
actor_1_facebook_likes imdb_score
actor_1_facebook_likes
[0, 523] 250.647233 0.426877
(523, 865] 702.347869 0.302279
(865, 2000] 1137.117597 0.280034
(2000, 13000] 8956.717418 0.346652
(13000, 640000] 24597.374179 0.425602

In [418]:
df.loc[ df['actor_1_facebook_likes'] <= 523, 'actor_1_facebook_likes'] = 0
df.loc[(df['actor_1_facebook_likes'] > 523) & (df['actor_1_facebook_likes'] <= 865), 'actor_1_facebook_likes'] = 1
df.loc[(df['actor_1_facebook_likes'] > 865) & (df['actor_1_facebook_likes'] <= 2000), 'actor_1_facebook_likes'] = 2
df.loc[(df['actor_1_facebook_likes'] > 2000) & (df['actor_1_facebook_likes'] <= 13000), 'actor_1_facebook_likes'] = 3
df.loc[ df['actor_1_facebook_likes'] > 13000, 'actor_1_facebook_likes'] = 4

df.head()


Out[418]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 855.0 2.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 936.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 1000.0 4.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 5000.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 161.0 3.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 393.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 23000.0 4.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 23000.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 NaN 0.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 12.0 1.0 2.0 0 0.0 0.0

actor_2_facebook_likes


In [419]:
mean = df['actor_2_facebook_likes'].mean()
std = df['actor_2_facebook_likes'].std()
mean, std


Out[419]:
(1651.7544731610337, 4042.4388626418645)

In [420]:
df['actor_2_facebook_likes'] = df['actor_2_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)

In [421]:
df[['actor_2_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_2_facebook_likes'], 5)).mean()


Out[421]:
actor_2_facebook_likes imdb_score
actor_2_facebook_likes
[0, 218] 88.086139 0.443564
(218, 486] 353.162537 0.335976
(486, 726.2] 602.864945 0.308838
(726.2, 979] 859.861249 0.302279
(979, 137000] 6358.933341 0.372024

In [422]:
df.loc[ df['actor_2_facebook_likes'] <= 218, 'actor_2_facebook_likes'] = 0
df.loc[(df['actor_2_facebook_likes'] > 218) & (df['actor_2_facebook_likes'] <= 486), 'actor_2_facebook_likes'] = 1
df.loc[(df['actor_2_facebook_likes'] > 486) & (df['actor_2_facebook_likes'] <= 726.2), 'actor_2_facebook_likes'] = 2
df.loc[(df['actor_2_facebook_likes'] > 726.2) & (df['actor_2_facebook_likes'] <= 979), 'actor_2_facebook_likes'] = 3
df.loc[ df['actor_2_facebook_likes'] > 979, 'actor_2_facebook_likes'] = 4

df.head()


Out[422]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 855.0 2.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 1000.0 4.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 161.0 3.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 23000.0 4.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 NaN 0.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 0.0 1.0 2.0 0 0.0 0.0

actor_3_facebook_likes


In [423]:
mean = df['actor_3_facebook_likes'].mean()
std = df['actor_3_facebook_likes'].std()
mean, std


Out[423]:
(645.0097609561753, 1665.0417284458572)

In [424]:
df['actor_3_facebook_likes'] = df['actor_3_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)

In [425]:
df['actor_3_facebook_likes'].describe()


Out[425]:
count     5043.000000
mean       645.009761
std       1661.239692
min          0.000000
25%        134.500000
50%        374.000000
75%        638.000000
max      23000.000000
Name: actor_3_facebook_likes, dtype: float64

In [426]:
df[['actor_3_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_3_facebook_likes'], 5)).mean()


Out[426]:
actor_3_facebook_likes imdb_score
actor_3_facebook_likes
[0, 97] 38.345203 0.460930
(97, 265] 179.494059 0.347525
(265, 472] 370.867589 0.298419
(472, 700] 582.733025 0.315055
(700, 23000] 2058.519364 0.340616

In [427]:
df.loc[ df['actor_3_facebook_likes'] <= 97, 'actor_3_facebook_likes'] = 0
df.loc[(df['actor_3_facebook_likes'] > 97) & (df['actor_3_facebook_likes'] <= 265), 'actor_3_facebook_likes'] = 1
df.loc[(df['actor_3_facebook_likes'] > 265) & (df['actor_3_facebook_likes'] <= 472), 'actor_3_facebook_likes'] = 2
df.loc[(df['actor_3_facebook_likes'] > 472) & (df['actor_3_facebook_likes'] <= 700), 'actor_3_facebook_likes'] = 3
df.loc[ df['actor_3_facebook_likes'] > 700, 'actor_3_facebook_likes'] = 4

df.head()


Out[427]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 760505847.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 309404152.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 200074175.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 448130642.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 NaN 8 143 0.0 NaN 0 0 4 NaN NaN 0.0 1.0 2.0 0 0.0 0.0

gross


In [428]:
mean = df['gross'].mean()
std = df['gross'].std()
mean, std


Out[428]:
(48468407.52680933, 68452990.43875286)

In [429]:
df['gross'] = df['gross'].map(lambda v: mean if pd.isnull(v) else v)

In [430]:
df['gross'].describe()


Out[430]:
count    5.043000e+03
mean     4.846841e+07
std      6.216318e+07
min      1.620000e+02
25%      8.460992e+06
50%      3.743230e+07
75%      5.135707e+07
max      7.605058e+08
Name: gross, dtype: float64

In [431]:
df[['gross', 'imdb_score']].groupby(pd.qcut(df['gross'], 5)).mean()


Out[431]:
gross imdb_score
gross
[162, 4909758.4] 1.309169e+06 0.366700
(4909758.4, 24092475.2] 1.338883e+07 0.293651
(24092475.2, 48468407.527] 4.204045e+07 0.337059
(48468407.527, 64212162.4] 5.575126e+07 0.312303
(64212162.4, 760505847] 1.392144e+08 0.436075

In [432]:
df.loc[ df['gross'] <= 4909758.4, 'gross'] = 0
df.loc[(df['gross'] > 4909758.4) & (df['gross'] <= 24092475.2), 'gross'] = 1
df.loc[(df['gross'] > 24092475.2) & (df['gross'] <= 48468407.527), 'gross'] = 2
df.loc[(df['gross'] > 48468407.527) & (df['gross'] <= 64212162.4), 'gross'] = 3
df.loc[ df['gross'] > 64212162.4, 'gross'] = 4

df.head()


Out[432]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 8 143 0.0 NaN 0 0 4 NaN NaN 0.0 1.0 2.0 0 0.0 0.0

facenumber_in_poster


In [433]:
mean = df['facenumber_in_poster'].mean()
std = df['facenumber_in_poster'].std()
mean, std


Out[433]:
(1.3711729622266402, 2.0135759199960632)

In [434]:
df['facenumber_in_poster'].value_counts()


Out[434]:
0.0     2152
1.0     1251
2.0      716
3.0      380
4.0      207
5.0      114
6.0       76
7.0       48
8.0       37
9.0       18
10.0      10
15.0       6
11.0       5
12.0       4
13.0       2
19.0       1
14.0       1
31.0       1
43.0       1
Name: facenumber_in_poster, dtype: int64

In [435]:
df['facenumber_in_poster'].median()


Out[435]:
1.0

In [436]:
df['facenumber_in_poster'] = df['facenumber_in_poster'].map(lambda v: 1 if pd.isnull(v) else v)

In [437]:
df['facenumber_in_poster'].describe()


Out[437]:
count    5043.000000
mean        1.370216
std         2.011066
min         0.000000
25%         0.000000
50%         1.000000
75%         2.000000
max        43.000000
Name: facenumber_in_poster, dtype: float64

In [438]:
df[['facenumber_in_poster', 'imdb_score']].groupby(pd.cut(df['facenumber_in_poster'], [-1,0,1,2,100])).mean()


Out[438]:
facenumber_in_poster imdb_score
facenumber_in_poster
(-1, 0] 0.000000 0.398234
(0, 1] 1.000000 0.347310
(1, 2] 2.000000 0.310056
(2, 100] 4.625686 0.285401

In [439]:
df.loc[ df['facenumber_in_poster'] <= 0, 'facenumber_in_poster'] = 0
df.loc[(df['facenumber_in_poster'] > 0) & (df['facenumber_in_poster'] <= 1), 'facenumber_in_poster'] = 1
df.loc[(df['facenumber_in_poster'] > 1) & (df['facenumber_in_poster'] <= 2), 'facenumber_in_poster'] = 2
df.loc[ df['facenumber_in_poster'] > 2, 'facenumber_in_poster'] = 3

df.head()


Out[439]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 886204 4834 0.0 3054.0 0 0 2 237000000.0 2009.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 471220 48350 0.0 1238.0 0 0 2 300000000.0 2007.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 275868 11700 1.0 994.0 0 1 2 245000000.0 2015.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 1144337 106759 0.0 2701.0 0 0 2 250000000.0 2012.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 8 143 0.0 NaN 0 0 4 NaN NaN 0.0 1.0 2.0 0 0.0 0.0

num_user_for_reviews


In [440]:
mean = df['num_user_for_reviews'].mean()
std = df['num_user_for_reviews'].std()
mean, std


Out[440]:
(272.77080844285143, 377.9828855657681)

In [441]:
df['num_user_for_reviews'] = df['num_user_for_reviews'].map(lambda v: mean if pd.isnull(v) else v)

In [442]:
df['num_user_for_reviews'].describe()


Out[442]:
count    5043.000000
mean      272.770808
std       377.194912
min         1.000000
25%        65.000000
50%       157.000000
75%       324.000000
max      5060.000000
Name: num_user_for_reviews, dtype: float64

In [443]:
df[['num_user_for_reviews', 'imdb_score']].groupby(pd.qcut(df['num_user_for_reviews'], 5)).mean()


Out[443]:
num_user_for_reviews imdb_score
num_user_for_reviews
[1, 48] 22.003960 0.238614
(48, 116] 81.848365 0.238850
(116, 210] 159.629336 0.303271
(210, 389] 286.411308 0.389275
(389, 5060] 814.773810 0.593254

In [444]:
df.loc[ df['num_user_for_reviews'] <= 48, 'num_user_for_reviews'] = 0
df.loc[(df['num_user_for_reviews'] > 48) & (df['num_user_for_reviews'] <= 116), 'num_user_for_reviews'] = 1
df.loc[(df['num_user_for_reviews'] > 116) & (df['num_user_for_reviews'] <= 210), 'num_user_for_reviews'] = 2
df.loc[(df['num_user_for_reviews'] > 210) & (df['num_user_for_reviews'] <= 389), 'num_user_for_reviews'] = 3
df.loc[ df['num_user_for_reviews'] > 389, 'num_user_for_reviews'] = 4

df.head()


Out[444]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 886204 4834 0.0 4.0 0 0 2 237000000.0 2009.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 471220 48350 0.0 4.0 0 0 2 300000000.0 2007.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 275868 11700 1.0 4.0 0 1 2 245000000.0 2015.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 1144337 106759 0.0 4.0 0 0 2 250000000.0 2012.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 8 143 0.0 3.0 0 0 4 NaN NaN 0.0 1.0 2.0 0 0.0 0.0

budget


In [445]:
mean = df['budget'].mean()
std = df['budget'].std()
mean, std


Out[445]:
(39752620.436387606, 206114898.44868386)

In [446]:
df['budget'] = df['budget'].map(lambda v: mean if pd.isnull(v) else v)

In [447]:
df['budget'].describe()


Out[447]:
count    5.043000e+03
mean     3.975262e+07
std      1.958004e+08
min      2.180000e+02
25%      7.000000e+06
50%      2.300000e+07
75%      4.000000e+07
max      1.221550e+10
Name: budget, dtype: float64

In [448]:
df[['budget', 'imdb_score']].groupby(pd.qcut(df['budget'], 3)).mean()


Out[448]:
budget imdb_score
budget
[218, 12000000] 4.710360e+06 0.398175
(12000000, 39752620.436] 2.748087e+07 0.343246
(39752620.436, 12215500000] 1.054312e+08 0.305513

In [449]:
df.loc[ df['budget'] <= 12000000, 'budget'] = 0
df.loc[(df['budget'] > 12000000) & (df['budget'] <= 39752620.436), 'budget'] = 1
df.loc[ df['budget'] > 39752620.436, 'budget'] = 2

df.head()


Out[449]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 886204 4834 0.0 4.0 0 0 2 2.0 2009.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 471220 48350 0.0 4.0 0 0 2 2.0 2007.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 275868 11700 1.0 4.0 0 1 2 2.0 2015.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 1144337 106759 0.0 4.0 0 0 2 2.0 2012.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 8 143 0.0 3.0 0 0 4 2.0 NaN 0.0 1.0 2.0 0 0.0 0.0

title_year


In [450]:
mean = df['title_year'].mean()
std = df['title_year'].std()
mean, std


Out[450]:
(2002.4705167173252, 12.47459891927068)

In [451]:
df['title_year'] = df['title_year'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)
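Unlike the mean imputation used for the other features, missing title_year values are filled with a random draw from a truncated normal, which keeps imputed years within one standard deviation of the mean (roughly 1990 to 2015) instead of piling them all on a single value. A standalone illustration, with values approximated from the mean and std above (truncnorm comes from scipy.stats, presumably imported earlier in the notebook):

from scipy.stats import truncnorm

# rvs(-1, 1, loc, scale) samples a normal(loc, scale) clipped to
# [loc - scale, loc + scale]: here, plausible release years.
print(truncnorm.rvs(-1, 1, loc=2002.47, scale=12.47, size=5).round(1))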

In [452]:
df[['title_year', 'imdb_score']].groupby(pd.cut(df['title_year'], 5)).mean()


Out[452]:
title_year imdb_score
title_year
(1915.9, 1936] 1929.642857 0.714286
(1936, 1956] 1947.240741 0.814815
(1956, 1976] 1968.262069 0.710345
(1976, 1996] 1989.096261 0.417582
(1996, 2016] 2007.044377 0.321063

In [453]:
df.loc[ df['title_year'] <= 1936, 'title_year'] = 0
df.loc[(df['title_year'] > 1936) & (df['title_year'] <= 1956), 'title_year'] = 1
df.loc[(df['title_year'] > 1956) & (df['title_year'] <= 1976), 'title_year'] = 2
df.loc[(df['title_year'] > 1976) & (df['title_year'] <= 1996), 'title_year'] = 3
df.loc[ df['title_year'] > 1996, 'title_year'] = 4

df.head()


Out[453]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 886204 4834 0.0 4.0 0 0 2 2.0 4.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 471220 48350 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 275868 11700 1.0 4.0 0 1 2 2.0 4.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 1144337 106759 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 8 143 0.0 3.0 0 0 4 2.0 4.0 0.0 1.0 2.0 0 0.0 0.0

num_voted_users


In [454]:
mean = df['num_voted_users'].mean()
std = df['num_voted_users'].std()
mean, std


Out[454]:
(83668.16081697402, 138485.25680596207)

In [455]:
df['num_voted_users'] = df['num_voted_users'].map(lambda v: mean if pd.isnull(v) else v)

In [456]:
df['num_voted_users'].describe()


Out[456]:
count    5.043000e+03
mean     8.366816e+04
std      1.384853e+05
min      5.000000e+00
25%      8.593500e+03
50%      3.435900e+04
75%      9.630900e+04
max      1.689764e+06
Name: num_voted_users, dtype: float64

In [457]:
df[['num_voted_users', 'imdb_score']].groupby(pd.qcut(df['num_voted_users'], 5)).mean()


Out[457]:
num_voted_users imdb_score
num_voted_users
[5, 5623.8] 2063.805748 0.225966
(5623.8, 21478.4] 12599.737103 0.246032
(21478.4, 53178.2] 35314.418236 0.273538
(53178.2, 1.24e+05] 81242.864087 0.361111
(1.24e+05, 1689764] 287047.140733 0.656095

In [458]:
df.loc[ df['num_voted_users'] <= 5623.8, 'num_voted_users'] = 0
df.loc[(df['num_voted_users'] > 5623.8) & (df['num_voted_users'] <= 21478.4), 'num_voted_users'] = 1
df.loc[(df['num_voted_users'] > 21478.4) & (df['num_voted_users'] <= 53178.2), 'num_voted_users'] = 2
df.loc[(df['num_voted_users'] > 53178.2) & (df['num_voted_users'] <= 1.24e+05), 'num_voted_users'] = 3
df.loc[ df['num_voted_users'] > 1.24e+05, 'num_voted_users'] = 4

df.head()


Out[458]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 4 4834 0.0 4.0 0 0 2 2.0 4.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 4 48350 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 4 11700 1.0 4.0 0 1 2 2.0 4.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 4 106759 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 0 143 0.0 3.0 0 0 4 2.0 4.0 0.0 1.0 2.0 0 0.0 0.0

cast_total_facebook_likes


In [459]:
mean = df['cast_total_facebook_likes'].mean()
std = df['cast_total_facebook_likes'].std()
mean, std


Out[459]:
(9699.06385088241, 18163.799124045912)

In [460]:
df['cast_total_facebook_likes'] = df['cast_total_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)

In [461]:
df['cast_total_facebook_likes'].describe()


Out[461]:
count      5043.000000
mean       9699.063851
std       18163.799124
min           0.000000
25%        1411.000000
50%        3090.000000
75%       13756.500000
max      656730.000000
Name: cast_total_facebook_likes, dtype: float64

In [462]:
df[['cast_total_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['cast_total_facebook_likes'], 5)).mean()


Out[462]:
cast_total_facebook_likes imdb_score
cast_total_facebook_likes
[0, 1136] 522.920792 0.436634
(1136, 2366.6] 1725.176763 0.328699
(2366.6, 4369.2] 3172.140733 0.258672
(4369.2, 16285.8] 10094.041667 0.317460
(16285.8, 656730] 32974.719524 0.421209

In [463]:
df.loc[ df['cast_total_facebook_likes'] <= 1136, 'cast_total_facebook_likes'] = 0
df.loc[(df['cast_total_facebook_likes'] > 1136) & (df['cast_total_facebook_likes'] <= 2366.6), 'cast_total_facebook_likes'] = 1
df.loc[(df['cast_total_facebook_likes'] > 2366.6) & (df['cast_total_facebook_likes'] <= 4369.2), 'cast_total_facebook_likes'] = 2
df.loc[(df['cast_total_facebook_likes'] > 4369.2) & (df['cast_total_facebook_likes'] <= 16285.8), 'cast_total_facebook_likes'] = 3
df.loc[ df['cast_total_facebook_likes'] > 16285.8, 'cast_total_facebook_likes'] = 4

df.head()


Out[463]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 4 3 0.0 4.0 0 0 2 2.0 4.0 3.0 1.0 1.0 33000 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 4 4 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 4 3 1.0 4.0 0 1 2 2.0 4.0 1.0 0.0 2.0 85000 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 4 4 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 164000 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 0 0 0.0 3.0 0 0 4 2.0 4.0 0.0 1.0 2.0 0 0.0 0.0

movie_facebook_likes


In [464]:
mean = df['movie_facebook_likes'].mean()
std = df['movie_facebook_likes'].std()
mean, std


Out[464]:
(7525.9645052548085, 19320.445109946737)

In [465]:
df['movie_facebook_likes'] = df['movie_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)

In [466]:
df['movie_facebook_likes'].describe()


Out[466]:
count      5043.000000
mean       7525.964505
std       19320.445110
min           0.000000
25%           0.000000
50%         166.000000
75%        3000.000000
max      349000.000000
Name: movie_facebook_likes, dtype: float64

In [467]:
nonzero = df[df['movie_facebook_likes'] > 0]
nonzero[['movie_facebook_likes', 'imdb_score']].groupby(pd.qcut(nonzero['movie_facebook_likes'], 4)).mean()


Out[467]:
movie_facebook_likes imdb_score
movie_facebook_likes
[2, 401] 184.490934 0.186890
(401, 1000] 725.322139 0.192786
(1000, 17000] 10425.727412 0.499234
(17000, 349000] 44229.651163 0.604651

In [468]:
df.loc[ df['movie_facebook_likes'] <= 0, 'movie_facebook_likes'] = 0
df.loc[(df['movie_facebook_likes'] > 0) & (df['movie_facebook_likes'] <= 401), 'movie_facebook_likes'] = 1
df.loc[(df['movie_facebook_likes'] > 401) & (df['movie_facebook_likes'] <= 1000), 'movie_facebook_likes'] = 2
df.loc[(df['movie_facebook_likes'] > 1000) & (df['movie_facebook_likes'] <= 17000), 'movie_facebook_likes'] = 3
# Note: likes above 17000 also map to 3, merging the top two quartile
# bands of the nonzero values into a single band.
df.loc[ df['movie_facebook_likes'] > 17000, 'movie_facebook_likes'] = 3

df.head()


Out[468]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 4 3 0.0 4.0 0 0 2 2.0 4.0 3.0 1.0 1.0 3 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 4 4 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 4 3 1.0 4.0 0 1 2 2.0 4.0 1.0 0.0 2.0 3 1.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 4 4 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 3 1.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 0 0 0.0 3.0 0 0 4 2.0 4.0 0.0 1.0 2.0 0 0.0 0.0

num_of_films_director


In [469]:
mean = df['num_of_films_director'].mean()
std = df['num_of_films_director'].std()
mean, std


Out[469]:
(0.21514971247273448, 0.5575849212989642)

In [470]:
df['num_of_films_director'].value_counts()


Out[470]:
0.0    4228
1.0     633
2.0      94
3.0      88
Name: num_of_films_director, dtype: int64

In [471]:
df['num_of_films_director'] = df['num_of_films_director'].map(lambda v: 1 if pd.isnull(v) else v)

In [472]:
df[['num_of_films_director', 'imdb_score']].groupby(pd.cut(df['num_of_films_director'], 3)).mean()


Out[472]:
num_of_films_director imdb_score
num_of_films_director
(-0.003, 1] 0.13022 0.345196
(1, 2] 2.00000 0.414894
(2, 3] 3.00000 0.693182

In [473]:
df.loc[ df['num_of_films_director'] <= 1, 'num_of_films_director'] = 0
df.loc[(df['num_of_films_director'] > 1) & (df['num_of_films_director'] <= 2), 'num_of_films_director'] = 1
df.loc[ df['num_of_films_director'] > 2, 'num_of_films_director'] = 2

df.head()


Out[473]:
color num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_1_facebook_likes gross num_voted_users cast_total_facebook_likes facenumber_in_poster num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes num_of_films_director actor_sum
0 0 4.0 4.0 0.0 4.0 2.0 4.0 4 3 0.0 4.0 0 0 2 2.0 4.0 3.0 1.0 1.0 3 0.0 0.0
1 0 4.0 4.0 4.0 4.0 4.0 4.0 4 4 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 0 0.0 2.0
2 0 4.0 4.0 0.0 1.0 3.0 4.0 4 3 1.0 4.0 0 1 2 2.0 4.0 1.0 0.0 2.0 3 0.0 0.0
3 0 4.0 4.0 4.0 4.0 4.0 4.0 4 4 0.0 4.0 0 0 2 2.0 4.0 4.0 1.0 2.0 3 0.0 2.0
4 0 4.0 1.0 3.0 3.0 0.0 2.0 0 0 0.0 3.0 0 0 4 2.0 4.0 0.0 1.0 2.0 0 0.0 0.0

In [474]:
incomplete = df.columns[pd.isnull(df).any()].tolist()
df[incomplete].info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5043 entries, 0 to 5042
Empty DataFrame
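The empty result confirms that every feature is now complete: no column contains missing values. A terser programmatic check (a sketch, not part of the original run):

# True only when every cell in every column is non-null.
assert df.notnull().all().all(), "some features still contain NaNs"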


Partition Data

Now we randomly partition our dataset into two DataFrames: 80% of the data becomes our training set, and the remaining 20% becomes our test set.


In [476]:
train_df, test_df = train_test_split(df, test_size=0.2)
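As written, the split changes on every run. A reproducible variant (a sketch, assuming a scikit-learn version whose train_test_split supports stratify): fixing random_state pins the split, and stratifying on the label keeps the class balance equal in both halves.

from sklearn.model_selection import train_test_split

# Reproducible, class-balanced 80/20 split.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['imdb_score'])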

Model, predict and solve

Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification problem: we want to identify the relationship between the output (a high IMDb score or not) and the other features (budget, gross, title year, and so on). We are also performing a category of machine learning called supervised learning, as we are training our model with a labelled dataset. With these two criteria - supervised learning plus classification - we can narrow down our choice of models to a few, listed below (the scikit-learn imports these cells rely on are sketched after the list). These include:

  • Logistic Regression
  • KNN or k-Nearest Neighbors
  • Support Vector Machines
  • Naive Bayes classifier
  • Decision Tree
  • Random Forest
  • Perceptron
  • Artificial neural network
  • RVM or Relevance Vector Machine
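The cells below assume the usual scikit-learn imports were run earlier in the notebook (not shown in this section); under current scikit-learn module paths they would look roughly like this:

from sklearn.linear_model import LogisticRegression, Perceptron, SGDClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score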

In [477]:
X_train = train_df.drop("imdb_score", axis=1)
Y_train = train_df["imdb_score"]
X_test  = test_df.drop("imdb_score", axis=1)
Y_test = test_df["imdb_score"]
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape


Out[477]:
((4034, 21), (4034,), (1009, 21), (1009,))

Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference Wikipedia.

Note the confidence score generated by the model based on our training dataset.


In [478]:
# Logistic Regression

logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log


Out[478]:
75.26

We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).

  • Language has the highest positive coefficient, implying that as the language value increases, the probability of IMDb score = 1 increases the most. Country (USA: 0 to Foreign: 1) is also strongly positive.
  • Inversely, Aspect Ratio has the most negative coefficient: as it increases, the probability of IMDb score = 1 decreases the most.
  • Color and num_voted_users carry the second and third highest positive coefficients.
  • Director Number of Films, our artificial feature, is worth keeping, with a positive coefficient of about 0.14.

In [479]:
for i, value in enumerate(train_df.columns):
    print(i, value)


0 color
1 num_critic_for_reviews
2 duration
3 director_facebook_likes
4 actor_3_facebook_likes
5 actor_1_facebook_likes
6 gross
7 num_voted_users
8 cast_total_facebook_likes
9 facenumber_in_poster
10 num_user_for_reviews
11 language
12 country
13 content_rating
14 budget
15 title_year
16 actor_2_facebook_likes
17 imdb_score
18 aspect_ratio
19 movie_facebook_likes
20 num_of_films_director
21 actor_sum

In [480]:
coeff_df = pd.DataFrame(train_df.columns.delete(17))
coeff_df.columns = ['Feature']
coeff_df["Coefficient"] = pd.Series(logreg.coef_[0])

coeff_df.sort_values(by='Coefficient', ascending=False)


Out[480]:
Feature Coefficient
11 language 1.075740
0 color 0.814380
7 num_voted_users 0.619850
12 country 0.509361
13 content_rating 0.260260
2 duration 0.248950
18 movie_facebook_likes 0.182422
20 actor_sum 0.145884
19 num_of_films_director 0.137111
3 director_facebook_likes 0.087080
10 num_user_for_reviews 0.067997
8 cast_total_facebook_likes 0.042259
1 num_critic_for_reviews -0.020295
5 actor_1_facebook_likes -0.056932
16 actor_2_facebook_likes -0.062841
6 gross -0.118464
4 actor_3_facebook_likes -0.137546
9 facenumber_in_poster -0.173738
14 budget -0.276049
15 title_year -0.426405
17 aspect_ratio -0.479956
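To read these coefficients on a more intuitive scale, exponentiate them: exp(coefficient) is the multiplicative change in the odds of a high score for a one-unit increase in the feature (for instance, exp(1.08) ≈ 2.9 for language). A quick sketch reusing the fitted model:

import numpy as np
import pandas as pd

# Odds ratios, largest first; values above 1 raise the odds of a high
# IMDb score, values below 1 lower them.
odds = pd.Series(np.exp(logreg.coef_[0]), index=X_train.columns)
print(odds.sort_values(ascending=False))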

Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference Wikipedia.

Note that the model generates a confidence score which is higher than the Logistic Regression model's.


In [481]:
# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc


Out[481]:
83.42

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference Wikipedia.

The KNN confidence score is better than both Logistic Regression and SVM.


In [482]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn


Out[482]:
85.18
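The choice of n_neighbors = 3 above is a guess; a more principled way to pick k (a sketch, assuming sklearn.model_selection is available) is to cross-validate on the training set only:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean 5-fold accuracy for a few odd k values; choosing k this way
# never touches the test set.
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, Y_train, cv=5)
    print(k, round(scores.mean(), 3))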

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Reference Wikipedia.

The model generated confidence score is the lowest among the models evaluated so far.


In [483]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian


Out[483]:
72.43

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. Reference Wikipedia.


In [484]:
# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron


Out[484]:
69.48

In [485]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc


Out[485]:
75.29

In [486]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd


Out[486]:
70.53

This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference Wikipedia.

The model confidence score is the highest among models evaluated so far.


In [487]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree


Out[487]:
99.85
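A training accuracy of 99.85 is a strong hint that the tree has memorized the training set. One way to probe this (a sketch, not part of the original run) is to cap the tree depth and compare train and test accuracy:

from sklearn.tree import DecisionTreeClassifier

# If the test score holds while the train score falls, the
# full-depth tree was overfitting.
shallow = DecisionTreeClassifier(max_depth=5)
shallow.fit(X_train, Y_train)
print('train:', round(shallow.score(X_train, Y_train), 3),
      'test:', round(shallow.score(X_test, Y_test), 3))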

The next model, Random Forest, is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (here n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Reference Wikipedia.

The model confidence score ties the Decision Tree for the highest among the models evaluated. We decide to use this model's output (Y_pred) for our final results.


In [488]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest


Out[488]:
99.85
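Fitted forests also expose feature_importances_, which ranks features by their mean impurity decrease across the trees; a quick sketch, complementary to the logistic regression coefficients above:

import pandas as pd

# Top ten features by impurity-based importance.
importances = pd.Series(random_forest.feature_importances_,
                        index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))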

Model evaluation

We can now rank our evaluation of all the models to choose the best one for our problem. While both Decision Tree and Random Forest score the same on the training data, we choose Random Forest, as random forests correct for decision trees' habit of overfitting to their training set (see the test-set comparison after the table).


In [489]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Decent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)


Out[489]:
Model Score
3 Random Forest 99.85
8 Decision Tree 99.85
1 KNN 85.18
0 Support Vector Machines 83.42
7 Linear SVC 75.29
2 Logistic Regression 75.26
4 Naive Bayes 72.43
6 Stochastic Gradient Decent 70.53
5 Perceptron 69.48
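The scores in this table are training accuracies, which flatter models that overfit. A fairer ranking (a sketch reusing the fitted models) would score each one on the held-out test set:

# Test-set accuracy exposes overfit models that training accuracy hides.
fitted = [('Logistic Regression', logreg), ('SVC', svc), ('KNN', knn),
          ('Naive Bayes', gaussian), ('Perceptron', perceptron),
          ('Linear SVC', linear_svc), ('SGD', sgd),
          ('Decision Tree', decision_tree), ('Random Forest', random_forest)]
for name, model in fitted:
    print(name, round(model.score(X_test, Y_test), 3))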

In [491]:
print(accuracy_score(Y_test, Y_pred, normalize=False), '/', len(Y_test))
print(accuracy_score(Y_test, Y_pred))


793 / 1009
0.785926660059

Using the Random Forest classifier resulted in correctly predicting 793 out of 1009 test movies, or 78.6%. Not bad for a first attempt. Any suggestions to improve our score are most welcome.
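Beyond raw accuracy, a confusion matrix would show how those 216 errors split between false positives and false negatives; a quick sketch:

from sklearn.metrics import confusion_matrix

# Rows: true class (0 = not highly rated, 1 = highly rated);
# columns: predicted class. Off-diagonal cells are the mistakes.
print(confusion_matrix(Y_test, Y_pred))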

References

This notebook has been created based on great work done solving the Titanic competition and other sources.