This notebook explores using data science techniques on a dataset of 5,000+ movies to predict whether a movie will be highly rated on IMDb.
The objective is to follow a step-by-step workflow, explaining each step and the rationale for every decision we take during solution development.
This notebook is adapted from "Titanic Data Science Solutions" by Manav Sehgal
This workflow goes through seven stages.
The workflow indicates the general sequence in which the stages typically follow one another, though there are use cases with exceptions.
The original dataset used in this notebook can be found on Kaggle.
Given a training set of samples listing movies and their IMDb scores, can our model determine, for a test dataset that does not contain the scores, whether each movie in the test dataset scored highly or not?
The data science solutions workflow solves for seven major goals.
Classifying. We may want to classify or categorize our samples. We may also want to understand the implications or correlation of different classes with our solution goal.
Correlating. One can approach the problem based on the available features within the training dataset. Which features within the dataset contribute significantly to our solution goal? Statistically speaking, is there a correlation between a feature and the solution goal? As the feature values change, does the solution state change as well, and vice versa? This can be tested both for numerical and categorical features in the given dataset. We may also want to determine correlation among features other than the IMDb score for subsequent goals and workflow stages. Correlating certain features may help in creating, completing, or correcting features (see the sketch after this list).
Converting. For the modeling stage, one needs to prepare the data. Depending on the choice of model algorithm, one may require all features to be converted to numerical equivalents, for instance converting text categorical values to numeric values.
Completing. Data preparation may also require us to estimate any missing values within a feature. Model algorithms may work best when there are no missing values.
Correcting. We may also analyze the given training dataset for errors or possibly inaccurate values within features and try to correct these values or exclude the samples containing the errors. One way to do this is to detect any outliers among our samples or features. We may also completely discard a feature if it is not contributing to the analysis or may significantly skew the results.
Creating. Can we create new features based on an existing feature or a set of features, such that the new feature follows the correlation, conversion, and completeness goals?
Charting. How do we select the right visualization plots and charts, depending on the nature of the data and the solution goals? A good start is to read the Tableau paper "Which chart or graph is right for you?".
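As a small illustration of the correlating goal, pivoting a categorical feature against a binary target gives a quick read on correlation. This is a minimal sketch on toy data (not the movie dataset loaded below); we do the same thing for real features later in the notebook.
import pandas as pd
# Toy data: a categorical feature and a binary solution goal.
toy = pd.DataFrame({
    'country': ['USA', 'UK', 'USA', 'France', 'UK'],
    'high_score': [0, 1, 0, 1, 1],
})
# The mean of the binary target per category approximates P(high score | category).
print(toy.groupby('country', as_index=False)['high_score'].mean()
         .sort_values(by='high_score', ascending=False))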
In [348]:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
from scipy.stats import truncnorm
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
The Python Pandas package helps us work with our datasets. We start by acquiring the dataset into a Pandas DataFrame.
We will partition off 80% of the data for training and hold out the remaining 20% as our test set.
We defer the partitioning until after the data wrangling: it keeps the code simpler, makes no real difference to the analysis, and removes any run-to-run differences in the banding steps.
In [349]:
df = pd.read_csv('../input/movie_metadata.csv')
# train_df, test_df = train_test_split(df, test_size = 0.2)
# test_actual = test_df['imdb_score']
# test_df = test_df.drop('imdb_score', axis=1)
# combine = [train_df, test_df]
Pandas also helps us describe the dataset, answering the following questions early in our project.
Which features are available in the dataset?
We note the feature names so we can manipulate or analyze them directly. These features are described on the dataset's Kaggle page.
In [350]:
print(df.columns.values)
Which features are categorical?
Which features are numerical?
In [351]:
pd.set_option('display.max_columns', 50)
In [352]:
# preview the data
df.head()
Out[352]:
In [353]:
incomplete = df.columns[pd.isnull(df).any()].tolist()
df[incomplete].info()
Which features contain blank, null or empty values?
These will require correcting.
What are the data types for various features?
This helps us with the converting goal.
In [354]:
df.info()
In [355]:
df.describe()
Out[355]:
What is the distribution of categorical features?
In [356]:
df.describe(include=['O'])
Out[356]:
In [357]:
# Binarize the target: 1 = highly rated (IMDb score >= 7.0), 0 = otherwise.
df.loc[ df['imdb_score'] < 7.0, 'imdb_score'] = 0
df.loc[ df['imdb_score'] >= 7.0, 'imdb_score'] = 1
df.head()
Out[357]:
We arrive at the following assumptions based on the data analysis done so far. We may validate these assumptions further before taking appropriate actions.
Correlating.
We want to know how well each feature correlates with the IMDb score. We want to do this early in the project and match these quick correlations with modelled correlations later in the project.
Completing.
Correcting.
Creating.
Classifying.
To confirm some of our observations and assumptions, we can quickly analyze our feature correlations by pivoting features against each other. At this stage it only makes sense to do so for categorical features with few missing values, such as content_rating, color, country, and director_name.
In [358]:
df[['content_rating', 'imdb_score']].groupby(['content_rating'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[358]:
In [359]:
df[["color", "imdb_score"]].groupby(['color'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[359]:
In [360]:
df[["director_name", "imdb_score"]].groupby(['director_name'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[360]:
In [361]:
df[["country", "imdb_score"]].groupby(['country'], as_index=False).mean().sort_values(by='imdb_score', ascending=False)
Out[361]:
In [362]:
g = sns.FacetGrid(df, col='imdb_score')
g.map(plt.hist, 'title_year', bins=20)
Out[362]:
In [363]:
grid = sns.FacetGrid(df, col='imdb_score', row='color', size=2.2, aspect=1.6)
grid.map(plt.hist, 'title_year', alpha=.5, bins=20)
grid.add_legend();
We have collected several assumptions and decisions regarding our dataset and solution requirements. Apart from binarizing the IMDb score, we have not had to change a single feature or value to arrive at these. Let us now execute our decisions and assumptions for the correcting, creating, and completing goals.
Dropping features is a good goal to execute first: with fewer data points the notebook runs faster and the analysis is easier.
Based on our assumptions and decisions we want to drop the genres, movie_title, plot_keywords, and movie_imdb_link features right away; the director and actor name features will be dropped later, after we have extracted film counts from them.
Note that because we deferred the train/test split, these operations are applied to the full dataset, which keeps the training and test portions consistent.
In [366]:
print("Before", df.shape)
df = df.drop(['genres', 'movie_title', 'plot_keywords', 'movie_imdb_link'], axis=1)
"After", df.shape
Out[366]:
We want to analyze whether the director and actor name features can be engineered to extract the number of films each person directed or starred in, and to test the correlation between the number of films and the score, before dropping the name features.
In the following code we build the num_of_films_director and num_of_films_actor counts by iterating over the dataframe.
Where the director name field is empty, we fill the derived count with 1.
Observations.
When we plot the number of films directed and the number of films acted in, we note the following observations.
Decision.
In [367]:
actors = {}
directors = {}
for index, row in df.iterrows():
    # Tally how many films each actor appears in across the three actor columns.
    for actor in row[['actor_1_name', 'actor_2_name', 'actor_3_name']]:
        if pd.notnull(actor):
            if actor not in actors:
                actors[actor] = 0
            actors[actor] += 1
    # Tally how many films each director directed.
    director = row['director_name']
    if pd.notnull(director):
        if director not in directors:
            directors[director] = 0
        directors[director] += 1
In [368]:
df['num_of_films_director'] = df["director_name"].dropna().map(directors).astype(int)
df['num_of_films_director'] = df['num_of_films_director'].fillna(1)
In [369]:
df['NumFilmsBand'] = pd.cut(df['num_of_films_director'], 4)
df[['NumFilmsBand', 'imdb_score']].groupby(['NumFilmsBand'], as_index=False).mean().sort_values(by='NumFilmsBand', ascending=True)
Out[369]:
In [370]:
df.loc[ df['num_of_films_director'] <= 7, 'num_of_films_director'] = 0
df.loc[(df['num_of_films_director'] > 7) & (df['num_of_films_director'] <= 13), 'num_of_films_director'] = 1
df.loc[(df['num_of_films_director'] > 13) & (df['num_of_films_director'] <= 19), 'num_of_films_director'] = 2
df.loc[ df['num_of_films_director'] > 19, 'num_of_films_director'] = 3
df.head()
Out[370]:
We can now remove the director_name and NumFilmsBand features.
In [371]:
df = df.drop(['director_name', 'NumFilmsBand'], axis=1)
Now, let's examine actors by number of films acted in. Since we have three actors listed per film, we'll need to combine these numbers. Let's fill any empty fields with the median value for that field, and sum the columns.
In [372]:
df["actor_1_name"].dropna().map(actors).describe()
Out[372]:
In [373]:
df['num_of_films_actor_1'] = df["actor_1_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_1'] = df['num_of_films_actor_1'].fillna(8)
In [374]:
df["actor_2_name"].dropna().map(actors).describe()
Out[374]:
In [375]:
df['num_of_films_actor_2'] = df["actor_2_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_2'] = df['num_of_films_actor_2'].fillna(4)
In [376]:
df["actor_3_name"].dropna().map(actors).describe()
Out[376]:
In [377]:
df['num_of_films_actor_3'] = df["actor_3_name"].dropna().map(actors).astype(int)
df['num_of_films_actor_3'] = df['num_of_films_actor_3'].fillna(2)
In [378]:
df['actor_sum'] = df["num_of_films_actor_1"] + df["num_of_films_actor_2"] + df["num_of_films_actor_3"]
In [379]:
df['ActorSumBand'] = pd.cut(df['actor_sum'], 5)
df[['ActorSumBand', 'imdb_score']].groupby(['ActorSumBand'], as_index=False).mean().sort_values(by='ActorSumBand', ascending=True)
Out[379]:
In [380]:
df.loc[ df['actor_sum'] <= 24, 'actor_sum'] = 0
df.loc[(df['actor_sum'] > 24) & (df['actor_sum'] <= 46), 'actor_sum'] = 1
df.loc[(df['actor_sum'] > 46) & (df['actor_sum'] <= 67), 'actor_sum'] = 2
df.loc[(df['actor_sum'] > 67) & (df['actor_sum'] <= 89), 'actor_sum'] = 3
df.loc[ df['actor_sum'] > 89, 'actor_sum'] = 4
In [381]:
df.head()
Out[381]:
Now we can remove actor_1_name, actor_2_name, actor_3_name, the individual num_of_films_actor columns, and ActorSumBand.
In [382]:
df = df.drop(['actor_1_name', 'num_of_films_actor_1', 'actor_2_name', 'num_of_films_actor_2', 'actor_3_name', 'num_of_films_actor_3', 'ActorSumBand'], axis=1)
In [383]:
df.head()
Out[383]:
Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.
Let us start by converting the color feature to a numeric value, where Black and White = 1 and Color = 0.
Since some values are null, let's fill them with the most common value, Color.
In [384]:
df['color'] = df['color'].fillna("Color")
In [385]:
df['color'] = df['color'].map( {' Black and White': 1, 'Color': 0} ).astype(int)  # the dataset stores this label with a leading space
df.head()
Out[385]:
Next, let's look at the language and country features
In [386]:
df['language'].value_counts()
Out[386]:
The bulk of the films are in English. Let's convert this field to 1 for non-English and 0 for English.
First, let's fill any null values with English.
In [387]:
df['language'] = df['language'].fillna("English")
In [388]:
df['language'] = df['language'].map(lambda l: 0 if l == 'English' else 1)
In [389]:
df.head()
Out[389]:
Next, let's explore country
In [390]:
df['country'].value_counts()
Out[390]:
Again, most films are from the USA. Taking the same approach, we'll fill NaNs with USA, and transform USA to 0 and all other countries to 1.
In [391]:
df['country'] = df['country'].fillna("USA")
In [392]:
df['country'] = df['country'].map(lambda c: 0 if c == 'USA' else 1)
In [393]:
df.head()
Out[393]:
Next up is content rating. Let's look at that
In [394]:
df['content_rating'].value_counts()
Out[394]:
The majority of the films use the standard MPAA ratings: G, PG, PG-13, and R.
Let's group the rest of the films (and the null values) into a fifth 'Not Rated' category, and then transform the ratings to integers.
In [395]:
df['content_rating'] = df['content_rating'].map({'G':0, 'PG':1, 'PG-13': 2, 'R': 3}).fillna(4).astype(int)
In [396]:
df.head()
Out[396]:
Aspect ratio may seem like a numerical feature, but it is really more of a categorical one. First, what values do we find in the dataset?
In [397]:
df['aspect_ratio'].value_counts()
Out[397]:
Some of these values seem to be in the wrong format: 16.00 is most likely 16:9 (1.78) and 4.00 is most likely 4:3 (1.33). Let's fix those.
In [398]:
# Fill nulls with the predominant 2.35 ratio, then correct the mis-coded values.
df['aspect_ratio'] = df['aspect_ratio'].fillna(2.35)
df['aspect_ratio'] = df['aspect_ratio'].map(lambda ar: 1.33 if ar == 4.00 else ar)
df['aspect_ratio'] = df['aspect_ratio'].map(lambda ar: 1.78 if ar == 16.00 else ar)
In [399]:
df[['aspect_ratio', 'imdb_score']].groupby(pd.cut(df['aspect_ratio'], 4)).mean()
Out[399]:
The above banding looks good. It separates out the two predominant aspect ratios (2.35 and 1.85) and also gives one band below and one above them. Let's use that.
In [400]:
df.loc[ df['aspect_ratio'] <= 1.575, 'aspect_ratio'] = 0
df.loc[(df['aspect_ratio'] > 1.575) & (df['aspect_ratio'] <= 1.97), 'aspect_ratio'] = 1
df.loc[(df['aspect_ratio'] > 1.97) & (df['aspect_ratio'] <= 2.365), 'aspect_ratio'] = 2
df.loc[ df['aspect_ratio'] > 2.365, 'aspect_ratio'] = 3
In [401]:
df.head()
Out[401]:
Now we should start estimating and completing features with missing or null values. We will first do this for the Duration feature.
We can consider several methods to complete a numerical continuous feature.
1. The easiest way is to use the median value.
2. Another simple way is to generate random numbers within one standard deviation of the mean.
3. A more accurate way of guessing missing values is to use other correlated features, taking the median value within groups defined by a correlated feature.
4. Combine methods 2 and 3: instead of drawing from the overall mean and standard deviation, draw random numbers from the mean and standard deviation within each group of correlated feature values.
We will use method 2. Sketches of methods 1 and 3 follow for comparison.
In [402]:
mean = df['duration'].mean()
std = df['duration'].std()
mean, std
Out[402]:
In [403]:
# Draw replacement values from a normal distribution truncated to within one standard deviation of the mean.
df['duration'] = df['duration'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)
Let us create Duration bands and determine correlations with IMDb score.
In [404]:
df[['duration', 'imdb_score']].groupby(pd.qcut(df['duration'], 5)).mean()
Out[404]:
Let us replace Duration with ordinals based on these bands.
In [405]:
df.loc[ df['duration'] <= 91, 'duration'] = 0
df.loc[(df['duration'] > 91) & (df['duration'] <= 99), 'duration'] = 1
df.loc[(df['duration'] > 99) & (df['duration'] <= 108), 'duration'] = 2
df.loc[(df['duration'] > 108) & (df['duration'] <= 122), 'duration'] = 3
df.loc[ df['duration'] > 122, 'duration'] = 4
df.head()
Out[405]:
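Before moving on, note that this fill-band-ordinalize pattern will be repeated by hand for many features below. A small helper along the following lines (a hypothetical sketch, not used in the cells that follow) could encapsulate it:
def band_feature(series, n_bands=5, fill_value=None):
    # Fill missing values (defaulting to the mean), then replace each value
    # with the label of its quantile band: 0 .. n_bands-1.
    filled = series.fillna(series.mean() if fill_value is None else fill_value)
    return pd.qcut(filled, n_bands, labels=False, duplicates='drop')
# Example usage (hypothetical):
# df['num_critic_for_reviews'] = band_feature(df['num_critic_for_reviews'])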
Let's apply the same techniques to the following features:
num_critic_for_reviews
In [406]:
mean = df['num_critic_for_reviews'].mean()
std = df['num_critic_for_reviews'].std()
mean, std
Out[406]:
In [407]:
df['num_critic_for_reviews'] = df['num_critic_for_reviews'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)
In [408]:
df[['num_critic_for_reviews', 'imdb_score']].groupby(pd.qcut(df['num_critic_for_reviews'], 5)).mean()
Out[408]:
In [409]:
df.loc[ df['num_critic_for_reviews'] <= 40, 'num_critic_for_reviews'] = 0
df.loc[(df['num_critic_for_reviews'] > 40) & (df['num_critic_for_reviews'] <= 84), 'num_critic_for_reviews'] = 1
df.loc[(df['num_critic_for_reviews'] > 84) & (df['num_critic_for_reviews'] <= 140), 'num_critic_for_reviews'] = 2
df.loc[(df['num_critic_for_reviews'] > 140) & (df['num_critic_for_reviews'] <= 222), 'num_critic_for_reviews'] = 3
df.loc[ df['num_critic_for_reviews'] > 222, 'num_critic_for_reviews'] = 4
df.head()
Out[409]:
director_facebook_likes
In [410]:
mean = df['director_facebook_likes'].mean()
std = df['director_facebook_likes'].std()
mean, std
Out[410]:
Since the standard deviation for this field is ~4x the mean, we'll just stick to using the mean value for nulls
In [411]:
df['director_facebook_likes'] = df['director_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [412]:
df[['director_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['director_facebook_likes'], 5)).mean()
Out[412]:
In [413]:
df.loc[ df['director_facebook_likes'] <= 3, 'director_facebook_likes'] = 0
df.loc[(df['director_facebook_likes'] > 3) & (df['director_facebook_likes'] <= 27.8), 'director_facebook_likes'] = 1
df.loc[(df['director_facebook_likes'] > 27.8) & (df['director_facebook_likes'] <= 91), 'director_facebook_likes'] = 2
df.loc[(df['director_facebook_likes'] > 91) & (df['director_facebook_likes'] <= 309), 'director_facebook_likes'] = 3
df.loc[ df['director_facebook_likes'] > 309, 'director_facebook_likes'] = 4
df.head()
Out[413]:
actor_1_facebook_likes
In [414]:
mean = df['actor_1_facebook_likes'].mean()
std = df['actor_1_facebook_likes'].std()
mean, std
Out[414]:
In [415]:
df['actor_1_facebook_likes'] = df['actor_1_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [416]:
df['actor_1_facebook_likes'].describe()
Out[416]:
In [417]:
df[['actor_1_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_1_facebook_likes'], 5)).mean()
Out[417]:
In [418]:
df.loc[ df['actor_1_facebook_likes'] <= 523, 'actor_1_facebook_likes'] = 0
df.loc[(df['actor_1_facebook_likes'] > 523) & (df['actor_1_facebook_likes'] <= 865), 'actor_1_facebook_likes'] = 1
df.loc[(df['actor_1_facebook_likes'] > 865) & (df['actor_1_facebook_likes'] <= 2000), 'actor_1_facebook_likes'] = 2
df.loc[(df['actor_1_facebook_likes'] > 2000) & (df['actor_1_facebook_likes'] <= 13000), 'actor_1_facebook_likes'] = 3
df.loc[ df['actor_1_facebook_likes'] > 13000, 'actor_1_facebook_likes'] = 4
df.head()
Out[418]:
actor_2_facebook_likes
In [419]:
mean = df['actor_2_facebook_likes'].mean()
std = df['actor_2_facebook_likes'].std()
mean, std
Out[419]:
In [420]:
df['actor_2_facebook_likes'] = df['actor_2_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [421]:
df[['actor_2_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_2_facebook_likes'], 5)).mean()
Out[421]:
In [422]:
df.loc[ df['actor_2_facebook_likes'] <= 218, 'actor_2_facebook_likes'] = 0
df.loc[(df['actor_2_facebook_likes'] > 218) & (df['actor_2_facebook_likes'] <= 486), 'actor_2_facebook_likes'] = 1
df.loc[(df['actor_2_facebook_likes'] > 486) & (df['actor_2_facebook_likes'] <= 726.2), 'actor_2_facebook_likes'] = 2
df.loc[(df['actor_2_facebook_likes'] > 726.2) & (df['actor_2_facebook_likes'] <= 979), 'actor_2_facebook_likes'] = 3
df.loc[ df['actor_2_facebook_likes'] > 979, 'actor_2_facebook_likes'] = 4
df.head()
Out[422]:
actor_3_facebook_likes
In [423]:
mean = df['actor_3_facebook_likes'].mean()
std = df['actor_3_facebook_likes'].std()
mean, std
Out[423]:
In [424]:
df['actor_3_facebook_likes'] = df['actor_3_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [425]:
df['actor_3_facebook_likes'].describe()
Out[425]:
In [426]:
df[['actor_3_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['actor_3_facebook_likes'], 5)).mean()
Out[426]:
In [427]:
df.loc[ df['actor_3_facebook_likes'] <= 97, 'actor_3_facebook_likes'] = 0
df.loc[(df['actor_3_facebook_likes'] > 97) & (df['actor_3_facebook_likes'] <= 265), 'actor_3_facebook_likes'] = 1
df.loc[(df['actor_3_facebook_likes'] > 265) & (df['actor_3_facebook_likes'] <= 472), 'actor_3_facebook_likes'] = 2
df.loc[(df['actor_3_facebook_likes'] > 472) & (df['actor_3_facebook_likes'] <= 700), 'actor_3_facebook_likes'] = 3
df.loc[ df['actor_3_facebook_likes'] > 700, 'actor_3_facebook_likes'] = 4
df.head()
Out[427]:
gross
In [428]:
mean = df['gross'].mean()
std = df['gross'].std()
mean, std
Out[428]:
In [429]:
df['gross'] = df['gross'].map(lambda v: mean if pd.isnull(v) else v)
In [430]:
df['gross'].describe()
Out[430]:
In [431]:
df[['gross', 'imdb_score']].groupby(pd.qcut(df['gross'], 5)).mean()
Out[431]:
In [432]:
df.loc[ df['gross'] <= 4909758.4, 'gross'] = 0
df.loc[(df['gross'] > 4909758.4) & (df['gross'] <= 24092475.2), 'gross'] = 1
df.loc[(df['gross'] > 24092475.2) & (df['gross'] <= 48468407.527), 'gross'] = 2
df.loc[(df['gross'] > 48468407.527) & (df['gross'] <= 64212162.4), 'gross'] = 3
df.loc[ df['gross'] > 64212162.4, 'gross'] = 4
df.head()
Out[432]:
facenumber_in_poster
In [433]:
mean = df['facenumber_in_poster'].mean()
std = df['facenumber_in_poster'].std()
mean, std
Out[433]:
In [434]:
df['facenumber_in_poster'].value_counts()
Out[434]:
In [435]:
df['facenumber_in_poster'].median()
Out[435]:
In [436]:
df['facenumber_in_poster'] = df['facenumber_in_poster'].map(lambda v: 1 if pd.isnull(v) else v)
In [437]:
df['facenumber_in_poster'].describe()
Out[437]:
In [438]:
df[['facenumber_in_poster', 'imdb_score']].groupby(pd.cut(df['facenumber_in_poster'], [-1,0,1,2,100])).mean()
Out[438]:
In [439]:
df.loc[ df['facenumber_in_poster'] <= 0, 'facenumber_in_poster'] = 0
df.loc[(df['facenumber_in_poster'] > 0) & (df['facenumber_in_poster'] <= 1), 'facenumber_in_poster'] = 1
df.loc[(df['facenumber_in_poster'] > 1) & (df['facenumber_in_poster'] <= 2), 'facenumber_in_poster'] = 2
df.loc[ df['facenumber_in_poster'] > 2, 'facenumber_in_poster'] = 3
df.head()
Out[439]:
num_user_for_reviews
In [440]:
mean = df['num_user_for_reviews'].mean()
std = df['num_user_for_reviews'].std()
mean, std
Out[440]:
In [441]:
df['num_user_for_reviews'] = df['num_user_for_reviews'].map(lambda v: mean if pd.isnull(v) else v)
In [442]:
df['num_user_for_reviews'].describe()
Out[442]:
In [443]:
df[['num_user_for_reviews', 'imdb_score']].groupby(pd.qcut(df['num_user_for_reviews'], 5)).mean()
Out[443]:
In [444]:
df.loc[ df['num_user_for_reviews'] <= 48, 'num_user_for_reviews'] = 0
df.loc[(df['num_user_for_reviews'] > 48) & (df['num_user_for_reviews'] <= 116), 'num_user_for_reviews'] = 1
df.loc[(df['num_user_for_reviews'] > 116) & (df['num_user_for_reviews'] <= 210), 'num_user_for_reviews'] = 2
df.loc[(df['num_user_for_reviews'] > 210) & (df['num_user_for_reviews'] <= 389), 'num_user_for_reviews'] = 3
df.loc[ df['num_user_for_reviews'] > 389, 'num_user_for_reviews'] = 4
df.head()
Out[444]:
budget
In [445]:
mean = df['budget'].mean()
std = df['budget'].std()
mean, std
Out[445]:
In [446]:
df['budget'] = df['budget'].map(lambda v: mean if pd.isnull(v) else v)
In [447]:
df['budget'].describe()
Out[447]:
In [448]:
df[['budget', 'imdb_score']].groupby(pd.qcut(df['budget'], 3)).mean()
Out[448]:
In [449]:
df.loc[ df['budget'] <= 12000000, 'budget'] = 0
df.loc[(df['budget'] > 12000000) & (df['budget'] <= 39752620.436), 'budget'] = 1
df.loc[ df['budget'] > 39752620.436, 'budget'] = 2
df.head()
Out[449]:
title_year
In [450]:
mean = df['title_year'].mean()
std = df['title_year'].std()
mean, std
Out[450]:
In [451]:
df['title_year'] = df['title_year'].map(lambda v: truncnorm.rvs(-1, 1, loc=mean, scale=std) if pd.isnull(v) else v)
In [452]:
df[['title_year', 'imdb_score']].groupby(pd.cut(df['title_year'], 5)).mean()
Out[452]:
In [453]:
df.loc[ df['title_year'] <= 1936, 'title_year'] = 0
df.loc[(df['title_year'] > 1936) & (df['title_year'] <= 1956), 'title_year'] = 1
df.loc[(df['title_year'] > 1956) & (df['title_year'] <= 1976), 'title_year'] = 2
df.loc[(df['title_year'] > 1976) & (df['title_year'] <= 1996), 'title_year'] = 3
df.loc[ df['title_year'] > 1996, 'title_year'] = 4
df.head()
Out[453]:
num_voted_users
In [454]:
mean = df['num_voted_users'].mean()
std = df['num_voted_users'].std()
mean, std
Out[454]:
In [455]:
df['num_voted_users'] = df['num_voted_users'].map(lambda v: mean if pd.isnull(v) else v)
In [456]:
df['num_voted_users'].describe()
Out[456]:
In [457]:
df[['num_voted_users', 'imdb_score']].groupby(pd.qcut(df['num_voted_users'], 5)).mean()
Out[457]:
In [458]:
df.loc[ df['num_voted_users'] <= 5623.8, 'num_voted_users'] = 0
df.loc[(df['num_voted_users'] > 5623.8) & (df['num_voted_users'] <= 21478.4), 'num_voted_users'] = 1
df.loc[(df['num_voted_users'] > 21478.4) & (df['num_voted_users'] <= 53178.2), 'num_voted_users'] = 2
df.loc[(df['num_voted_users'] > 53178.2) & (df['num_voted_users'] <= 1.24e+05), 'num_voted_users'] = 3
df.loc[ df['num_voted_users'] > 1.24e+05, 'num_voted_users'] = 4
df.head()
Out[458]:
cast_total_facebook_likes
In [459]:
mean = df['cast_total_facebook_likes'].mean()
std = df['cast_total_facebook_likes'].std()
mean, std
Out[459]:
In [460]:
df['cast_total_facebook_likes'] = df['cast_total_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [461]:
df['cast_total_facebook_likes'].describe()
Out[461]:
In [462]:
df[['cast_total_facebook_likes', 'imdb_score']].groupby(pd.qcut(df['cast_total_facebook_likes'], 5)).mean()
Out[462]:
In [463]:
df.loc[ df['cast_total_facebook_likes'] <= 1136, 'cast_total_facebook_likes'] = 0
df.loc[(df['cast_total_facebook_likes'] > 1136) & (df['cast_total_facebook_likes'] <= 2366.6), 'cast_total_facebook_likes'] = 1
df.loc[(df['cast_total_facebook_likes'] > 2366.6) & (df['cast_total_facebook_likes'] <= 4369.2), 'cast_total_facebook_likes'] = 2
df.loc[(df['cast_total_facebook_likes'] > 4369.2) & (df['cast_total_facebook_likes'] <= 16285.8), 'cast_total_facebook_likes'] = 3
df.loc[ df['cast_total_facebook_likes'] > 16285.8, 'cast_total_facebook_likes'] = 4
df.head()
Out[463]:
movie_facebook_likes
In [464]:
mean = df['movie_facebook_likes'].mean()
std = df['movie_facebook_likes'].std()
mean, std
Out[464]:
In [465]:
df['movie_facebook_likes'] = df['movie_facebook_likes'].map(lambda v: mean if pd.isnull(v) else v)
In [466]:
df['movie_facebook_likes'].describe()
Out[466]:
In [467]:
df[df['movie_facebook_likes'] > 0][['movie_facebook_likes', 'imdb_score']].groupby(pd.qcut(df[df['movie_facebook_likes'] > 0]['movie_facebook_likes'], 4)).mean()
Out[467]:
In [468]:
df.loc[ df['movie_facebook_likes'] <= 0, 'movie_facebook_likes'] = 0
df.loc[(df['movie_facebook_likes'] > 0) & (df['movie_facebook_likes'] <= 401), 'movie_facebook_likes'] = 1
df.loc[(df['movie_facebook_likes'] > 401) & (df['movie_facebook_likes'] <= 1000), 'movie_facebook_likes'] = 2
df.loc[(df['movie_facebook_likes'] > 1000) & (df['movie_facebook_likes'] <= 17000), 'movie_facebook_likes'] = 3
df.loc[ df['movie_facebook_likes'] > 17000, 'movie_facebook_likes'] = 4
df.head()
Out[468]:
num_of_films_director
In [469]:
mean = df['num_of_films_director'].mean()
std = df['num_of_films_director'].std()
mean, std
Out[469]:
In [470]:
df['num_of_films_director'].value_counts()
Out[470]:
In [471]:
df['num_of_films_director'] = df['num_of_films_director'].map(lambda v: 1 if pd.isnull(v) else v)
In [472]:
df[['num_of_films_director', 'imdb_score']].groupby(pd.cut(df['num_of_films_director'], 3)).mean()
Out[472]:
In [473]:
df.loc[ df['num_of_films_director'] <= 1, 'num_of_films_director'] = 0
df.loc[(df['num_of_films_director'] > 1) & (df['num_of_films_director'] <= 2), 'num_of_films_director'] = 1
df.loc[ df['num_of_films_director'] > 2, 'num_of_films_director'] = 2
df.head()
Out[473]:
In [474]:
incomplete = df.columns[pd.isnull(df).any()].tolist()
df[incomplete].info()
In [476]:
train_df, test_df = train_test_split(df, test_size = 0.2)
Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirements to narrow the field down to a few models we can evaluate. Our problem is a classification problem: we want to identify the relationship between the output (highly rated or not) and the other features (duration, budget, country, and so on). We are also performing a category of machine learning called supervised learning, since we are training our model with a labelled dataset. With these two criteria, supervised learning plus classification, we can narrow down our choice of models to a few. These include Logistic Regression, k-Nearest Neighbors (KNN), Support Vector Machines, Naive Bayes, Perceptron, Linear SVC, Stochastic Gradient Descent, Decision Tree, and Random Forest.
In [477]:
X_train = train_df.drop("imdb_score", axis=1)
Y_train = train_df["imdb_score"]
X_test = test_df.drop("imdb_score", axis=1)
Y_test = test_df["imdb_score"]
X_train.shape, Y_train.shape, X_test.shape, Y_test.shape
Out[477]:
Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference Wikipedia.
Note the accuracy score the model achieves on the training dataset.
In [478]:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Out[478]:
We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.
Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).
In [479]:
for i, value in enumerate(train_df.columns):
    print(i, value)
In [480]:
# Column 17 is imdb_score, the target; delete it so the feature names line up with the coefficients.
coeff_df = pd.DataFrame(train_df.columns.delete(17))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
Out[480]:
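As an optional aside, exponentiating a coefficient gives an odds ratio: the multiplicative change in the odds of a high score for a one-unit increase in that (banded) feature, holding the others fixed. A quick sketch using the coeff_df built above:
# Odds ratios derived from the fitted logistic regression coefficients.
coeff_df['OddsRatio'] = np.exp(coeff_df['Correlation'])
coeff_df.sort_values(by='OddsRatio', ascending=False)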
Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier. Reference Wikipedia.
Note that the model's training accuracy is higher than the Logistic Regression model's.
In [481]:
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
Out[481]:
In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. Reference Wikipedia.
The KNN training accuracy is better than both Logistic Regression and SVM.
In [482]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
Out[482]:
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem. Reference Wikipedia.
The model's training accuracy is the lowest among the models evaluated so far.
In [483]:
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
Out[483]:
The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. Reference Wikipedia.
In [484]:
# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
Out[484]:
In [485]:
# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
Out[485]:
In [486]:
# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd
Out[486]:
This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. Reference Wikipedia.
The model's training accuracy is the highest among the models evaluated so far.
In [487]:
# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
Out[487]:
The next model, Random Forest, is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. Reference Wikipedia.
The model's training accuracy is the highest among the models evaluated so far. We decide to use this model's output (Y_pred) for the final evaluation on our held-out test set.
In [488]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
Out[488]:
In [489]:
models = pd.DataFrame({
'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Perceptron',
'Stochastic Gradient Decent', 'Linear SVC',
'Decision Tree'],
'Score': [acc_svc, acc_knn, acc_log,
acc_random_forest, acc_gaussian, acc_perceptron,
acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
Out[489]:
In [491]:
print(accuracy_score(Y_test, Y_pred, normalize=False), '/', len(Y_test))
print(accuracy_score(Y_test, Y_pred))
Using the Random Forest classifier resulted in correctly predicting 793 out of 1009 movies, or 78.6%. Not bad for our first attempt. Any suggestions to improve our score are most welcome.
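One straightforward next step (a sketch, not run as part of the workflow above) would be to score the models with k-fold cross-validation instead of training accuracy, which is optimistic for the tree-based models:
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy of the random forest on the full wrangled dataset.
X_all = df.drop('imdb_score', axis=1)
y_all = df['imdb_score']
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=100), X_all, y_all, cv=5)
print(cv_scores.mean(), cv_scores.std())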