This notebook designs a movie recommendation system.
We make recommendations using the content of the TMDB dataset that contains around 45,000 movies listed in the Full MovieLens Dataset. The dataset consists of movies released on or before July 2017. Data points include cast, crew, plot keywords, budget, revenue, posters, release dates, languages, production companies, countries, TMDB vote counts and vote averages.
In practice, recommendation engines are of three kinds: popularity-based, content-based, and collaborative-filtering engines.
We aim to build a movie recommendation engine based on several aspects, such as popularity and similarity. We find similar items using a similarity metric; once we have the similarity matrix, we use it to determine the best recommendations for a user based on the movies he has already liked and watched. In other words, our engine is based on popularity and content.
This notebook is organized as follows:
1. Data Acquisition
2. Data Exploration
3. Recommendation Engine Based on Movie
4. Recommendation Engine Based on User
5. Conclusion
In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import math, nltk, warnings
from scipy import sparse, stats, spatial
import scipy.sparse.linalg
import scipy.sparse.csgraph
from nltk.corpus import wordnet
from sklearn import linear_model
from sklearn.neighbors import NearestNeighbors
from fuzzywuzzy import fuzz
from wordcloud import WordCloud, STOPWORDS
import spacy
nlp = spacy.load('en')
plt.rcParams["patch.force_edgecolor"] = True
plt.style.use('fivethirtyeight')
mpl.rc('patch', edgecolor = 'dimgray', linewidth=1)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "last_expr"
pd.options.display.max_columns = 50
%matplotlib inline
warnings.filterwarnings('ignore')
PS = nltk.stem.PorterStemmer()
In [2]:
def load_credits(path):
df = pd.read_csv(path)
return df
def load_movies(path):
df = pd.read_csv(path)
df['release_date'] = pd.to_datetime(df['release_date'],errors='coerce').apply(lambda x: x.date())
return df
def load_keywords(path):
df = pd.read_csv(path)
return df
In [3]:
# load the dataset
credits = load_credits("../data/credits.csv")
movies = load_movies("../data/movies_metadata.csv")
keywords = load_keywords("../data/keywords.csv")
You can see the data in the original dataset. We aim to extract useful information from it.
In [4]:
credits.iloc[:5]
Out[4]:
In [5]:
# Function for getting strings of keywords and genres and connecting them by "|"
def get_str(series, index_values):
# return missing value rather than an error upon indexing/key failure
try:
idx = ast.literal_eval(series)
if type(idx)== bool or type(idx) != list:
values=[]
else:
values ='|'.join(str(x[index_values]) for x in idx)
except (ValueError, TypeError):
values = []
return values
In [6]:
# Function for getting values
# series: df.series
# index_values: Column name
# index: return which one
def get_value(series,index_values,index):
try:
list_t = ast.literal_eval(series)
if list_t:
length = len(list_t)
if index > length:
values = pd.np.nan
else:
values =list_t[index-1][index_values]
else:
values= pd.np.nan
except (ValueError, TypeError):
values = pd.np.nan
return values
def get_director(crew_data,index_values):
crew_data = ast.literal_eval(crew_data)
for x in crew_data:
if x['job'] == 'Director':
return x[index_values]
else:
return pd.np.nan
In [7]:
def convert_to_easy(movies, credits, keywords):
temp_movies = movies.copy(deep=True)
temp_movies['title_year'] = pd.to_datetime(temp_movies['release_date'],errors='coerce').apply(lambda x: x.year)
temp_movies['production_company_id'] = temp_movies['production_companies'].apply(lambda x: get_value(x, 'id',1))
# Crew information
temp_movies['director_id'] = credits['crew'].apply(lambda x: get_director(x,'id'))
temp_movies['director'] = credits['crew'].apply(lambda x: get_director(x,'name'))
temp_movies['actor_id_1'] = credits['cast'].apply(lambda x: get_value(x, 'id',1))
temp_movies['actor_name_1'] = credits['cast'].apply(lambda x: get_value(x,'name',1))
temp_movies['actor_id_2'] = credits['cast'].apply(lambda x: get_value(x, 'id',2))
temp_movies['actor_name_2'] = credits['cast'].apply(lambda x: get_value(x,'name',2))
temp_movies['actor_id_3'] = credits['cast'].apply(lambda x: get_value(x, 'id',3))
temp_movies['actor_name_3'] = credits['cast'].apply(lambda x: get_value(x,'name',3))
# Content information
temp_movies['keywords'] = keywords['keywords'].apply(lambda x: get_str(x,'name'))
temp_movies['genres_id'] = temp_movies['genres'].apply(lambda x: get_value(x, 'id',1))
temp_movies['genres'] = temp_movies['genres'].apply(lambda x: get_str(x,'name'))
# Movie information
temp_movies['country'] = temp_movies['production_countries'].apply(lambda x: get_str(x, 'name'))
temp_movies['language'] = temp_movies['spoken_languages'].apply(lambda x: get_str(x, 'name'))
temp_movies['production_company'] = temp_movies['production_companies'].apply(lambda x: get_str(x, 'name'))
del temp_movies['tagline']
del temp_movies['homepage']
return temp_movies
We use this function to convert the loaded data into a clearer layout.
In [8]:
df_original = convert_to_easy(movies, credits, keywords)
In [9]:
df_original.iloc[:5]
Out[9]:
In [10]:
# df : dataframe
# columns: name of removed columns
def delete_column(df,columns):
df_new = df.copy(deep=True)
delete = columns
for i in delete:
if i in df_new.columns:
del df_new[i]
else:
print('There is no column named '+ str(i) +' in original dataframe')
return df_new
In [11]:
col = ['adult','belongs_to_collection','imdb_id',
'original_language','original_title','poster_path','production_companies',
'production_countries','spoken_languages','status','video']
df_initial = delete_column(df_original,col)
In [12]:
print('Shape:',df_initial.shape)
# info on variable types and filling factor
tab_info=pd.DataFrame(df_initial.dtypes).T.rename(index={0:'column type'})
tab_info=tab_info.append(pd.DataFrame(df_initial.isnull().sum()).T.rename(index={0:'null values'}))
tab_info=tab_info.append(pd.DataFrame(df_initial.isnull().sum()/df_initial.shape[0]*100).T.
rename(index={0:'null values (%)'}))
tab_info
Out[12]:
Now we have reached the clear dataframe.
In [13]:
df_initial.head(5)
Out[13]:
We plan to make extensive use of the keywords that describe the movies. Indeed, a basic assumption is that films described by similar keywords should have similar contents. Hence, as a first step, we have a close look at the way keywords are defined.
We define a function to calculate the frequency of content and return an ordered list.
Note that this function will be used again in other sections of this notebook, when exploring the content of the 'genres' variable and subsequently, when cleaning the keywords. Finally, calling this function gives access to a list of keywords which are sorted by decreasing frequency:
In [14]:
def count_word(df, column, list_obj, type_store):
# If the column stores '|'-separated strings, use type_store = 1
# otherwise (the values are already iterables), use type_store = 2
keyword_count = dict()
for s in list_obj: keyword_count[s] = 0
if type_store == 1:
set_temp = df[column].str.split('|')
elif type_store == 2:
set_temp = df[column]
for list_keywords in set_temp:
if type(list_keywords) == float and pd.isnull(list_keywords): continue # skip NaN entries
for s in [s for s in list_keywords if s in list_obj]:
if pd.notnull(s): keyword_count[s] += 1
# convert the dictionary in a list to sort the keywords by frequency
keyword_occurences = []
for k,v in keyword_count.items():
keyword_occurences.append([k,v])
keyword_occurences.sort(key = lambda x:x[1], reverse = True)
# keyword_count is a dictionary and keyword_occurences is sorted list [].
return keyword_occurences, keyword_count
Build the set of keywords, then calculate their frequencies.
In [15]:
# keep the original keywords set
set_keywords_0 = set()
for list_keywords in df_initial['keywords'].str.split('|').values:
if isinstance(list_keywords, float): continue # only happen if liste_keywords = NaN
set_keywords_0 = set_keywords_0.union(list_keywords)
# remove null chain entry
set_keywords_0.remove('')
In [16]:
keyword_occurences_0, dum = count_word(df_initial, 'keywords', set_keywords_0,1)
keyword_occurences_0[-20:]
Out[16]:
You can see there are many keywords appearing only once. We plan to set a threshold for keeping "more" important keywords.
In [17]:
# Delete the low-frequency elements
# df: the original dataframe
# column: the column to modify
# few_set: the set of low-frequency keywords to remove
def delete_few(df,column,few_set):
df_new = df.copy(deep = True)
for index, row in df_new.iterrows():
chaine = row[column]
if pd.isnull(chaine): continue
nouvelle_liste = []
for s in chaine.split('|'):
if s not in few_set:
nouvelle_liste.append(s)
df_new.set_value(index, column, '|'.join(nouvelle_liste))
return df_new
Delete the keywords that occur only once.
In [18]:
keyword_few = [i[0] for i in keyword_occurences_0 if i[1]<2]
len(keyword_few)
Out[18]:
In [19]:
df_initial = delete_few(df_initial,'keywords',keyword_few)
Recompute the set of keywords, now keeping only those that appear more than once.
In [20]:
set_keywords = set()
for list_keywords in df_initial['keywords'].str.split('|').values:
if isinstance(list_keywords, float): continue # only happen if liste_keywords = NaN
set_keywords = set_keywords.union(list_keywords)
set_keywords.remove('')
In [21]:
keyword_occurences, dum = count_word(df_initial, 'keywords', set_keywords,1)
keyword_occurences[-5:]
Out[21]:
In [22]:
len(keyword_occurences)
Out[22]:
However, there are still 11320 keywords. Later, we will reduce this number further, for instance by grouping keywords with similar meanings.
At this stage, the list of keywords has been created and we know the number of times each of them appear in the dataset. In fact, this list can be used to have a feeling of the content of the most popular movies. A fancy manner to give that information makes use of the wordcloud package. All the words are arranged in a figure with sizes that depend on their respective frequencies.
In [23]:
fig = plt.figure(1, figsize=(18,13))
ax1 = fig.add_subplot(2,1,1)
words = dict()
trunc_occurences = keyword_occurences[0:100]
for s in trunc_occurences:
words[s[0]] = s[1]
wordcloud = WordCloud(width=1000,height=300, background_color='white',
max_words=1628,relative_scaling=1,
colormap="Set2_r",
normalize_plurals = False)
wordcloud.generate_from_frequencies(words)
ax1.imshow(wordcloud, interpolation="bilinear")
ax1.axis('off')
# LOWER PANEL: HISTOGRAMS
ax2 = fig.add_subplot(2,1,2)
y_axis = [i[1] for i in trunc_occurences]
x_axis = [k for k,i in enumerate(trunc_occurences)]
x_label = [i[0] for i in trunc_occurences]
plt.xticks(rotation=85, fontsize = 10)
plt.yticks(fontsize = 15)
plt.xticks(x_axis, x_label)
plt.ylabel("Nb. of occurences", fontsize = 15, labelpad = 10)
ax2.bar(x_axis, y_axis, align = 'center', color=(1.0,0.5,0.62))
#_______________________
plt.title("Keywords popularity",bbox={'facecolor':'k', 'pad':5},color='w',fontsize = 25)
plt.show()
We can see people pay much attention to the gender of the director.
In [24]:
missing_df = df_initial.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df['filling_factor'] = (df_initial.shape[0] - missing_df['missing_count']) / df_initial.shape[0] * 100
missing_df.sort_values('filling_factor').reset_index(drop = True)
Out[24]:
We can see that most of the variables are well filled, with only a handful having a filling factor below 93%.
The title_year variable indicates when films were released.
In [25]:
sns.set(style="white")
# Draw a count plot to show the number of movies each year
g = sns.factorplot(x="title_year", data=df_initial, kind="count",palette="BuPu", size=7, aspect=1.5)
g.set_xticklabels(step=10)
Out[25]:
The number of films increases over the years, with a peak in 2013.
The genres will surely be important while building the recommendation system based on content since it describes the content of the film (i.e. Drama, Comedy, Action, ...). To see exactly which genres are the most popular, we use the same approach applied in keywords.
As with the keywords, we first build the original set of genres.
In [26]:
set_genre_0 = set()
for s in df_initial['genres'].str.split('|').values:
set_genre_0 = set_genre_0.union(set(s))
and then count how many times each of them occurs:
In [27]:
genre_occurences_0, genre_dum = count_word(df_initial, 'genres', set_genre_0,1)
genre_occurences_0
Out[27]:
We would like to remove the genres which appear only a few times.
In [28]:
genre_few = [i[0] for i in genre_occurences_0 if i[1]<10]
len(genre_few)
df_initial = delete_few(df_initial,'genres',genre_few)
Recalculate the set of genres.
In [29]:
set_genre = set()
for s in df_initial['genres'].str.split('|').values:
set_genre = set_genre.union(set(s))
set_genre.remove('')
In [30]:
genre_occurences, genre_dum = count_word(df_initial, 'genres', set_genre,1)
genre_occurences
Out[30]:
In [31]:
len(genre_occurences)
Out[31]:
Now we only have 20 genres in our dataframe. Finally, the result is shown as a wordcloud:
In [32]:
words = dict()
trunc_occurences = genre_occurences[0:50]
for s in trunc_occurences:
words[s[0]] = s[1]
tone = 14 # define the color of the words
f, ax = plt.subplots(figsize=(13, 14))
wordcloud = WordCloud(width=700,height=400, background_color='white',
max_words=1628,relative_scaling=0.5,
colormap="Set2_r",
normalize_plurals=False)
wordcloud.generate_from_frequencies(words)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
We will try to get some insight into the habits of actors, focusing on a small part of the data.
Our first goal is to see the favorite genres of actors. For simplicity, in what follows, we only consider the first three actors listed for each movie in the original dataset; a more exhaustive view could be obtained by also considering the remaining cast. To proceed, we perform some one-hot encoding.
To begin, we build the set of actors so we can identify the most prolific ones.
In [33]:
df_actor = df_initial.copy(deep=True)
set_actor = set(df_actor['actor_name_1']).union(df_actor['actor_name_2']).union(df_actor['actor_name_3'])
set_actor.remove(np.nan)
Compute actor_occurences (how many movies each actor appears in).
In [34]:
list_actor = df_actor['actor_name_1'].append(df_actor['actor_name_2']).append(df_actor['actor_name_3'])
list_actor = list_actor.value_counts().reset_index()
actor_occurences = []
for index in list_actor.index:
actor_occurences.append([list_actor.iloc[index]['index'],list_actor.iloc[index][0]])
Now we can draw a wordcloud of the most prolific actors.
In [35]:
words = dict()
trunc_occurences = actor_occurences[0:50]
for s in trunc_occurences:
words[s[0]] = s[1]
tone = 14 # define the color of the words
f, ax = plt.subplots(figsize=(14, 13))
wordcloud = WordCloud(width=900,height=400, background_color='white',
max_words=1628,relative_scaling=0.5,
colormap="Set2_r",
normalize_plurals=False)
wordcloud.generate_from_frequencies(words)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()
In our system, the keywords play an important role in the functioning of the engine. Our recommendation engine calculates the similarity between movies; for the keywords, it looks for films described by the same keywords. In order to speed up the engine, we have to keep the number of keywords small enough.
Collect the keywords
In [36]:
df_cleaned = df_initial.copy(deep = True)
In [37]:
def keywords_inventory(dataframe, colonne = 'keywords'):
PS = nltk.stem.PorterStemmer() # implement nltk PorterStemmer
LB = nltk.stem.WordNetLemmatizer() # implement Lemmatizer
keywords_roots = dict() # collect the words / root
keywords_select = dict() # association: root <-> keyword
category_keys = []
icount = 0
for s in dataframe[colonne]:
if pd.isnull(s): continue
for t in s.split('|'):
t = t.lower() ;
pre_racine = LB.lemmatize(t) # the WordNetLemmatizer handles irregular plurals
racine = PS.stem(pre_racine) # the PorterStemmer reduces the word to its root
if racine in keywords_roots:
keywords_roots[racine].add(t)
else:
keywords_roots[racine] = {t}
# process multi-word keywords: split the phrase into words and select one representative word
long_similar = dict()
for s in keywords_roots.keys():
tt = s.split()
for t in tt:
count = 0
for elem in keywords_roots.keys(): # check whether this word also appears in every keyword of the other root's set
if t == elem and s!=elem:
for ss in keywords_roots[s]:
if t in ss.split():
count += 1
if count >= len(keywords_roots[elem]):
long_similar[s] = elem
for s in long_similar.keys():
for tt in keywords_roots[s]:
keywords_roots[long_similar[s]].add(tt)
del keywords_roots[s]
# select a proper keywords:
for s in keywords_roots.keys():
if len(keywords_roots[s]) > 1:
min_length = 1000 #
for k in keywords_roots[s]:
if len(k) < min_length:
clef = k ; min_length = len(k)
# if the shortest variant is itself a phrase, use its most common word to represent the keyword
for tt in clef.split():
count = 0
for element in keywords_roots[s]:
if tt in element :
count += 1
if count >= len(keywords_roots[s]):
final_clef = tt
category_keys.append(final_clef)
keywords_select[s] = final_clef
else:
category_keys.append(list(keywords_roots[s])[0])
keywords_select[s] = list(keywords_roots[s])[0]
print("Nb of keywords in variable '{}': {}".format(colonne,len(category_keys)))
return category_keys, keywords_roots, keywords_select
Find the keywords sharing the same root and keep the shortest word for each root.
In [38]:
keywords, keywords_roots, keywords_select = keywords_inventory(df_cleaned, colonne = 'keywords')
Print a sample of keywords that appear in several close varieties.
In [39]:
# Plot of a sample of keywords that appear in close varieties
icount = 0
complex_set = dict()
for s in keywords_roots.keys():
if len(keywords_roots[s]) > 5:
icount += 1
complex_set[s] = keywords_roots[s]
if icount < 2:
print(icount, keywords_roots[s], len(keywords_roots[s]))
So far we only know the relations between similar keywords. Next, we replace each original keyword by its shortest variant.
In [40]:
def remplacement_df_keywords(df, dico_remplacement, roots):
df_new = df.copy(deep = True)
LB = nltk.stem.WordNetLemmatizer()
for index, row in df_new.iterrows():
chaine = row['keywords']
if pd.isnull(chaine): continue
nouvelle_liste = []
for s in chaine.split('|'):
for key in roots.keys():
if s in roots[key]:
if key in dico_remplacement.keys():
nouvelle_liste.append(dico_remplacement[key])
else:
nouvelle_liste.append(s)
df_new.set_value(index, 'keywords', '|'.join(nouvelle_liste))
return df_new
Then we replace the keywords using the relations obtained above.
In [41]:
df_keywords_cleaned = remplacement_df_keywords(df_cleaned, keywords_select,keywords_roots)
keywords.remove('')
In [42]:
# Plot of a sample of keywords that appear in close varieties
icount = 0
for s in keywords_roots.keys():
if len(keywords_roots[s]) > 3:
icount += 1
if icount < 5: print(icount, keywords_roots[s], len(keywords_roots[s]))
In [43]:
# Count of the keywords occurences
keyword_occurences, keywords_count = count_word(df_keywords_cleaned,'keywords',keywords,type_store=1)
In [44]:
print(keyword_occurences[:5])
print('\n'+'The number of keywords-cleaned data has changed to '+str(len(keyword_occurences)))
In [46]:
# get the synonyms of the word 'word'
def get_synonymes(word):
lemma = set()
for ss in wordnet.synsets(word):
for w in ss.lemma_names():
# We just get the 'nouns':
index = ss.name().find('.')+1
if ss.name()[index] == 'n': lemma.add(w.lower().replace('_',' '))
return lemma
In [47]:
# Example of a list of synonyms given by NLTK
mot_cle = 'alien'
lemma = get_synonymes(mot_cle)
for s in lemma:
print(' "{:<30}" in keywords list -> {} {}'.format(s, s in keywords,keywords_count[s] if s in keywords else 0 ))
In [48]:
# check if 'mot' is a key of 'key_count' with a test on the number of occurences
def test_keyword(mot, key_count, threshold):
return key_count.get(mot, 0) >= threshold
Replace the words with their synonyms.
In [49]:
keyword_occurences.sort(key = lambda x:x[1], reverse = False)
key_count = dict()
for s in keyword_occurences:
key_count[s[0]] = s[1]
# Creation of a dictionary to replace keywords by higher frequency keywords
remplacement_mot = dict()
icount = 0
for index, [mot, nb_apparitions] in enumerate(keyword_occurences):
if nb_apparitions > 5: continue # only consider keywords that appear 5 times or fewer
lemma = get_synonymes(mot)
if len(lemma) == 0: continue # case of the plurals
liste_mots = [(s, key_count[s]) for s in lemma
if test_keyword(s, key_count, key_count[mot])]
liste_mots.sort(key = lambda x:(x[1],x[0]), reverse = True)
if len(liste_mots) <= 1: continue # no replacement
if mot == liste_mots[0][0]: continue # replacement by itself
icount += 1
if icount < 8:
print('{:<12} -> {:<12} (init: {})'.format(mot, liste_mots[0][0], liste_mots))
remplacement_mot[mot] = liste_mots[0][0]
print(90*'_'+'\n'+'The replacement concerns {}% of the keywords.'
.format(round(len(remplacement_mot)/len(keywords)*100,2)))
In [50]:
print('Keywords that appear both in keys and values:'.upper()+'\n'+45*'-')
icount = 0
for s in remplacement_mot.values():
if s in remplacement_mot.keys():
icount += 1
if icount < 10: print('{:<20} -> {:<20}'.format(s, remplacement_mot[s]))
for key, value in remplacement_mot.items():
if value in remplacement_mot.keys():
remplacement_mot[key] = remplacement_mot[value]
In [51]:
# Replacement of the keywords by the main form(for synonyms)
def remplacement_df_keywords1(df, dico_remplacement, roots):
df_new = df.copy(deep = True)
LB = nltk.stem.WordNetLemmatizer()
for index, row in df_new.iterrows():
chaine = row['keywords']
if pd.isnull(chaine): continue
nouvelle_liste = []
for s in chaine.split('|'):
clef = PS.stem(LB.lemmatize(s)) if roots else s
if clef in dico_remplacement.keys():
nouvelle_liste.append(dico_remplacement[clef])
else:
nouvelle_liste.append(s)
df_new.set_value(index, 'keywords', '|'.join(nouvelle_liste))
return df_new
In [52]:
# replacement of keyword varieties by the main keyword
df_keywords_synonyms = remplacement_df_keywords1(df_keywords_cleaned, remplacement_mot, roots = False)
keywords, keywords_roots, keywords_select = keywords_inventory(df_keywords_synonyms, colonne = 'keywords')
keywords.remove('')
In [53]:
# New count of keyword occurences
new_keyword_occurences, keywords_count = count_word(df_keywords_synonyms,'keywords',keywords,1)
In [54]:
# deletion of keywords with low frequencies
def remplacement_df_low_frequency_keywords(df, keyword_occurences):
df_new = df.copy(deep = True)
key_count = dict()
for s in keyword_occurences:
key_count[s[0]] = s[1]
for index, row in df_new.iterrows():
chaine = row['keywords']
if pd.isnull(chaine): continue
nouvelle_liste = []
for s in chaine.split('|'):
if key_count.get(s,4) > 5: nouvelle_liste.append(s)
df_new.set_value(index, 'keywords', '|'.join(nouvelle_liste))
return df_new
In [55]:
# Creation of a dataframe where keywords of low frequencies are suppressed
df_keywords_occurence = remplacement_df_low_frequency_keywords(df_keywords_synonyms, new_keyword_occurences)
keywords, keywords_roots, keywords_select = keywords_inventory(df_keywords_occurence, colonne = 'keywords')
keywords.remove('')
In [56]:
# New keywords count
new_keyword_occurences, keywords_count = count_word(df_keywords_occurence,'keywords',keywords,1)
new_keyword_occurences[:5]
Out[56]:
Plot the graph of keyword occurrences.
In [57]:
font = {'family' : 'fantasy', 'weight' : 'normal', 'size' : 15}
mpl.rc('font', **font)
keyword_occurences.sort(key = lambda x:x[1], reverse = True)
y_axis = [i[1] for i in keyword_occurences]
x_axis = [k for k,i in enumerate(keyword_occurences)]
new_y_axis = [i[1] for i in new_keyword_occurences]
new_x_axis = [k for k,i in enumerate(new_keyword_occurences)]
f, ax = plt.subplots(figsize=(9, 5))
ax.plot(x_axis, y_axis, 'r-', label='before cleaning')
ax.plot(new_x_axis, new_y_axis, 'b-', label='after cleaning')
# Now add the legend with some customizations.
legend = ax.legend(loc='upper right', shadow=True)
frame = legend.get_frame()
frame.set_facecolor('0.90')
for label in legend.get_texts():
label.set_fontsize('medium')
plt.ylim((0,25))
plt.axhline(y=3.5, linewidth=2, color = 'k')
plt.xlabel("keywords index", family='fantasy', fontsize = 15)
plt.ylabel("Nb. of occurences", family='fantasy', fontsize = 15)
#plt.suptitle("Nombre d'occurences des mots clés", fontsize = 18, family='fantasy')
plt.text(3500, 5, 'threshold for keyword deletion', fontsize = 13)
plt.show()
The number of keywords has decreased sharply: from 11320 at the beginning to about 3200.
Similarity is determined by comparing word vectors or "word embeddings", multi-dimensional meaning representations of a word. SpaCy's similarity model assumes a fairly general-purpose definition of similarity. We use it to shrink the number of keywords further, down to whatever number we want to keep.
Set nm_keep to the number of keywords you want to keep.
In [58]:
final_keywords_set = list()
nm_keep = 1200 # number of keywords that you finally want
for i in range(len(new_keyword_occurences)):
final_keywords_set.append(new_keyword_occurences[i][0])
set_1 = final_keywords_set[nm_keep:len(new_keyword_occurences)]
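As a quick illustration of the spaCy similarity call used in the next cells (the example words are arbitrary; meaningful scores require a model shipped with word vectors, otherwise spaCy falls back to approximate vectors and warns about it):
# Illustrative only: similarity between single-word documents,
# using the nlp pipeline loaded at the top of the notebook.
doc_a = nlp('alien')
doc_b = nlp('extraterrestrial')
doc_c = nlp('romance')
print(doc_a.similarity(doc_b))  # expected to be comparatively high
print(doc_a.similarity(doc_c))  # expected to be lower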
In [59]:
documents_left = {}
documents_whole = dict()
for words in final_keywords_set:
document = nlp(words)
documents_whole[words] = [document]
for i in range(len(documents_whole)-nm_keep):
documents_left[set_1[i]]=[]
for j in range(nm_keep):
corr_final = 0
count1 = 0
for token in documents_whole[set_1[i]]:
corr = 0
count = 0
for token1 in documents_whole[final_keywords_set[j]]:
corr += token.similarity(token1)
count+=1
aaa = corr/count # average similarity
count1 +=1
corr_final = corr_final+aaa
documents_left[set_1[i]].append(corr_final/count1)
In [60]:
replace_dict = dict()
for elem in set_1:
replace_dict[elem] = final_keywords_set[np.argmax(documents_left[elem])]
In [61]:
# Replacement of the keywords by the main form(for the word vector)
def remplacement_df_keywords_inverse(df, dico_remplacement):
df_new = df.copy(deep = True)
for index, row in df_new.iterrows():
chaine = row['keywords']
if pd.isnull(chaine): continue
nouvelle_liste = []
for s in chaine.split('|'):
if s in dico_remplacement.keys():
nouvelle_liste.append(dico_remplacement[s])
else:
nouvelle_liste.append(s)
df_new.set_value(index, 'keywords', '|'.join(nouvelle_liste))
return df_new
In [62]:
df_keywords_cleaned_again = remplacement_df_keywords_inverse(df_keywords_occurence, replace_dict)
df_keywords_occurence = df_keywords_cleaned_again
keywords, keywords_roots, keywords_select = keywords_inventory(df_keywords_occurence, colonne = 'keywords')
keywords.remove('')
In [63]:
new_keyword_occurences, keywords_count = count_word(df_keywords_occurence,'keywords',keywords,1)
In [64]:
# Graph of keyword occurences
font = {'family' : 'fantasy', 'weight' : 'normal', 'size' : 30}
mpl.rc('font', **font)
keyword_occurences.sort(key = lambda x:x[1], reverse = True)
new_y_axis = [i[1] for i in new_keyword_occurences]
new_x_axis = [k for k,i in enumerate(new_keyword_occurences)]
f, ax = plt.subplots(figsize=(9, 5))
ax.plot(new_x_axis, new_y_axis, '-*', label='after cleaning')
plt.ylim((0,1000))
# Now add the legend with some customizations.
legend = ax.legend(loc='upper right', shadow=True)
frame = legend.get_frame()
frame.set_facecolor('0.90')
for label in legend.get_texts():
label.set_fontsize('medium')
plt.axhline(y=3.5, linewidth=2, color = 'k')
plt.xlabel("keywords index", family='fantasy', fontsize = 15)
plt.ylabel("Nb. of occurences", family='fantasy', fontsize = 15)
plt.show()
From the plot, we see that only 1200 keywords remain, and only a few of them occur very frequently.
In [65]:
df_actor = df_actor.dropna(how='any')
df_actor['popularity'] = df_actor['popularity'].apply(lambda x: float(x))
df_actor['actor_id_1'] = df_actor['actor_id_1'].apply(lambda x: float(x))
df_actor['actor_id_2'] = df_actor['actor_id_2'].apply(lambda x: float(x))
df_actor['actor_id_3'] = df_actor['actor_id_3'].apply(lambda x: float(x))
In [66]:
df_actor_1 = df_actor[['actor_id_1','actor_id_2','actor_id_3','id']].reset_index(drop = True)
for genre in set_genre:
df_actor_1[genre] = df_initial['genres'].str.contains(genre).apply(lambda x:1 if x else 0)
df_actor_1[0:10]
Out[66]:
In [67]:
e=[]
for i in range(3):
name = 'df_actor_NUM'.replace('NUM', str(i+1))
actor = 'actor_id_NUM'.replace('NUM', str(i+1))
actor_1 = 'actor_id_NUM'.replace('NUM', str((i+1)%3+1))
actor_2 = 'actor_id_NUM'.replace('NUM', str((i+2)%3+1))
df_actor_2 = df_actor_1.copy(deep=True)
del df_actor_2[actor_1]
del df_actor_2[actor_2]
df_actor_2.rename(columns={actor:'actor_id'},inplace=True)
e.append(df_actor_2)
df_actor_2 = pd.concat(e)
We get a new dataframe indexed by actor. Each movie id appears 3 times because we keep only 3 actors per movie in our case.
In [68]:
df_actor_2[0:10]
Out[68]:
We only keep the actors who acted in more than 8 movies.
In [69]:
temp = df_actor_2['actor_id'].value_counts()
df_actor_3 = df_actor_2.set_index(['actor_id'])
df_actor_3 = df_actor_3.loc[temp > 8].sort_index()
df_actor_3.reset_index(inplace=True)
In [70]:
df_actor_3[0:10]
Out[70]:
Now we can build the actor-genre table.
In [71]:
feature = df_actor_3.groupby('actor_id').sum()
feature.head(5)
Out[71]:
As the table above shows, each actor tends to act in a few specific genres (high counts) while the other genres stay at 0.
Calculate the distance between actors based on cosine similarity.
In [72]:
distances = spatial.distance.pdist(feature,'cosine')
distances = spatial.distance.squareform(distances)
In [73]:
kernel_width = distances.mean()
weight = np.exp(-np.power(distances,2)/(kernel_width**2)) - np.eye(distances.shape[0])
Keep only the 50 strongest connections for each actor.
In [74]:
NEIGHBORS = 50
N = len(weight)
# sort the weights from biggest to smallest one.
strongest = np.argsort(-weight)
# keep only the NEIGHBORS (50) strongest ones.
strongest = strongest[:,0:NEIGHBORS]
choose = np.zeros([N,N])
for i in range(N):
choose[i,strongest[i,:]]=1
In [75]:
# Be sure that weight matrix stays symmetric.
choose = np.triu(choose)+np.triu(choose).T
weights = choose * weight
np.nonzero(weights-weights.transpose())
Out[75]:
In [76]:
fix, axes = plt.subplots(2, 2, figsize=(17, 8))
def plot(weights, axes):
axes[0].spy(weights) # The first plot of weights matrix
axes[1].hist(weights[weights > 0].reshape(-1), bins=50); # the distribution of the nonzero weights
plot(weight, axes[:, 0]) # left side plot
plot(weights, axes[:, 1]) # right side graph
From the relations between actors and movies, we can see that actors and movies are likely to cluster by genre. We then plot a correlation map. The correlations give us a clear view of the most relevant features, which will help us determine the features to use in the engine.
In [77]:
df_keywords_occurence_de = df_keywords_occurence.copy(deep =True)
df_keywords_occurence_de.loc[df_keywords_occurence_de['budget'] == 0, 'budget'] = np.nan
df_keywords_occurence_de.loc[df_keywords_occurence_de['revenue'] == 0, 'revenue'] = np.nan
df_keywords_occurence_de.loc[df_keywords_occurence_de['runtime'] == 0, 'runtime'] = np.nan
df_keywords_occurence_de.loc[df_keywords_occurence_de['vote_count'] == 0, 'vote_count'] = np.nan
df_keywords_occurence_de = df_keywords_occurence_de.dropna(how='any')
df_keywords_occurence_de['popularity'] = df_keywords_occurence_de['popularity'].apply(lambda x: float(x))
In [78]:
f, ax = plt.subplots(figsize=(12, 9))
#_____________________________
# calculations of correlations
corrmat = df_keywords_occurence_de.corr()
#________________________________________
k = 17 # number of variables for heatmap
cols = corrmat.nlargest(k, 'vote_count')['vote_count'].index
cm = np.corrcoef(df_keywords_occurence_de[cols].dropna(how='any').values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True,
fmt='.2f', annot_kws={'size': 10}, linewidth = 0.1, cmap = 'coolwarm',
yticklabels=cols.values, xticklabels=cols.values)
f.text(0.5, 0.93, "Correlation coefficients", ha='center', fontsize = 18, family='fantasy')
plt.show()
According to the values reported above, we delete a few variables from the dataframe and then re-order the columns.
In [79]:
df_var_cleaned = df_keywords_occurence.copy(deep = True)
In [80]:
missing_df = df_var_cleaned.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df['filling_factor'] = (df_var_cleaned.shape[0]
- missing_df['missing_count']) / df_var_cleaned.shape[0] * 100
missing_df = missing_df.sort_values('filling_factor').reset_index(drop = True)
missing_df
Out[80]:
The content of this table is now shown graphically:
In [81]:
y_axis = missing_df['filling_factor']
x_label = missing_df['column_name']
x_axis = missing_df.index
fig = plt.figure(figsize=(11, 4))
plt.xticks(rotation=80, fontsize = 14)
plt.yticks(fontsize = 13)
N_thresh = 5
plt.axvline(x=N_thresh-0.5, linewidth=2, color = 'r')
plt.text(N_thresh-4.8, 30, 'filling factor \n < {}%'.format(round(y_axis[N_thresh],1)),
fontsize = 15, family = 'fantasy', bbox=dict(boxstyle="round",
ec=(1.0, 0.5, 0.5),
fc=(0.8, 0.5, 0.5)))
N_thresh = 17
plt.axvline(x=N_thresh-0.5, linewidth=2, color = 'g')
plt.text(N_thresh, 30, 'filling factor \n = {}%'.format(round(y_axis[N_thresh],1)),
fontsize = 15, family = 'fantasy', bbox=dict(boxstyle="round",
ec=(1., 0.5, 0.5),
fc=(0.5, 0.8, 0.5)))
plt.xticks(x_axis, x_label,family='fantasy', fontsize = 14 )
plt.ylabel('Filling factor (%)', family='fantasy', fontsize = 16)
plt.bar(x_axis, y_axis);
In [82]:
df_filling = df_var_cleaned.copy(deep=True)
missing_year_info = df_filling[df_filling['title_year'].isnull()][['director_id','actor_id_1','actor_id_2', 'actor_id_3']]
In [83]:
def fill_year(df):
col = ['director_id', 'actor_id_1', 'actor_id_2', 'actor_id_3']
usual_year = [0 for _ in range(4)]
var = [0 for _ in range(4)]
# I get the mean years of activity for the actors and director
for i in range(4):
usual_year[i] = df.groupby(col[i])['title_year'].mean()
# create a dictionary collecting this info
actor_year = dict()
for i in range(4):
for s in usual_year[i].index:
if s in actor_year.keys():
if pd.notnull(usual_year[i][s]) and pd.notnull(actor_year[s]):
actor_year[s] = (actor_year[s] + usual_year[i][s])/2
elif pd.isnull(actor_year[s]):
actor_year[s] = usual_year[i][s]
else:
actor_year[s] = usual_year[i][s]
# identification of missing title years
missing_year_info = df[df['title_year'].isnull()]
# filling of missing values
icount_replaced = 0
for index, row in missing_year_info.iterrows():
value = [ np.NaN for _ in range(4)]
icount = 0 ; sum_year = 0
for i in range(4):
var[i] = df.loc[index][col[i]]
if pd.notnull(var[i]): value[i] = actor_year[var[i]]
if pd.notnull(value[i]): icount += 1 ; sum_year += actor_year[var[i]]
if icount != 0: sum_year = sum_year / icount
if int(sum_year) > 0:
icount_replaced += 1
df.set_value(index, 'title_year', int(sum_year))
if icount_replaced < 10:
print("{:<45} -> {:<20}".format(df.loc[index]['title'],int(sum_year)))
return
In [84]:
fill_year(df_filling)
Keywords play an important role in the functioning of the engine, hence we fill the missing values of the keywords variable using the words of the title. First, we create the list of synonyms of all the words contained in the title and check whether any of these synonyms are already in the keyword list. When that is the case, we add this keyword to the entry:
In [85]:
icount = 0
for index, row in df_filling[df_filling['keywords'].isnull()].iterrows():
icount += 1
liste_mot = row['title'].strip().split()
new_keyword = []
for s in liste_mot:
lemma = get_synonymes(s)
for t in list(lemma):
if t in keywords:
new_keyword.append(t)
if new_keyword and icount < 15:
print('{:<50} -> {:<30}'.format(row['title'], str(new_keyword)))
if new_keyword:
df_filling.set_value(index, 'keywords', '|'.join(new_keyword))
In [86]:
f, ax = plt.subplots(figsize=(12, 9))
cols = corrmat.nlargest(9, 'vote_average')['vote_count'].index
temp = df_keywords_occurence[cols].dropna(how='any')
temp['popularity'] = temp['popularity'].apply(lambda x: float(x))
# temp.dtype = 'float'
cm = np.corrcoef(temp.values.T)
sns.set(font_scale=1.25)
cmap = sns.diverging_palette(256, 10, as_cmap=True)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True,cmap =cmap,
fmt='.2f', annot_kws={'size': 10},
yticklabels=cols.values, xticklabels=cols.values)
plt.show()
In [87]:
sns.set(font_scale=1.25)
cols = ['revenue', 'vote_count']
sns.pairplot(df_filling.dropna(how='any')[cols],diag_kind='kde', size = 5)
plt.show();
First, we define a function that imputes the missing values from a linear fit of the data:
In [88]:
def variable_linreg_imputation(df, col_to_predict, ref_col):
regr = linear_model.LinearRegression()
test = df[[col_to_predict,ref_col]].dropna(how='any', axis = 0)
X = np.array(test[ref_col])
Y = np.array(test[col_to_predict])
X = X.reshape(len(X),1)
Y = Y.reshape(len(Y),1)
regr.fit(X, Y)
test = df[df[col_to_predict].isnull() & df[ref_col].notnull()]
for index, row in test.iterrows():
value = float(regr.predict(row[ref_col]))
df.set_value(index, col_to_predict, value)
This function takes the dataframe as input, as well as the names of two columns. A linear fit is performed between those two columns which is used to fill the holes in the first column that was given:
In [89]:
variable_linreg_imputation(df_filling, 'revenue', 'vote_count')
Finally, I examine which amount of data is still missing in the dataframe:
In [90]:
df = df_filling.copy(deep = True)
missing_df = df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
missing_df['filling_factor'] = (df.shape[0]
- missing_df['missing_count']) / df.shape[0] * 100
missing_df = missing_df.sort_values('filling_factor').reset_index(drop = True)
missing_df
Out[90]:
and we can see that in the worst case, the filling factor is around 96% (excluding the homepage and tagline variables).
In [91]:
df = df_filling.copy(deep=True)
df.reset_index(inplace = True, drop = True)
In [92]:
df[:5]
Out[92]:
In [93]:
df.to_csv('df.csv',index=False)
Cluster the movies into two groups with the help of the Laplacian eigenvectors (spectral clustering).
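Before walking through the full pipeline below, here is a compact, self-contained sketch of the spectral bipartition it implements, on a random toy weight matrix (so the resulting split is arbitrary; on the movie weights below, the split follows the genres): build a symmetric weight matrix, form the normalized graph Laplacian, compute its smallest eigenvectors, and threshold the second one (the Fiedler vector) to split the nodes into two groups.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian as graph_laplacian
from scipy.sparse.linalg import eigsh

rng = np.random.RandomState(0)
toy_weights = rng.rand(20, 20)
toy_weights = (toy_weights + toy_weights.T) / 2   # symmetric weight matrix
np.fill_diagonal(toy_weights, 0)                  # no self-loops
toy_laplacian = graph_laplacian(toy_weights, normed=True)
eigenvalues, eigenvectors = eigsh(csr_matrix(toy_laplacian), k=3, which='SM')
toy_labels = eigenvectors[:, 1] > 0               # threshold the Fiedler vector
print(toy_labels)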
In [94]:
gaussian_filter = lambda x,y,sigma: math.exp(-(x-y)**2/(2*sigma**2))
In [95]:
def entry_variables(df, id_entry):
col_labels = []
if pd.notnull(df['director'].iloc[id_entry]):
for s in df['director'].iloc[id_entry].split('|'):
col_labels.append(s)
for i in range(3):
column = 'actor_name_NUM'.replace('NUM', str(i+1))
if pd.notnull(df[column].iloc[id_entry]):
for s in df[column].iloc[id_entry].split('|'):
col_labels.append(s)
if pd.notnull(df['keywords'].iloc[id_entry]):
for s in df['keywords'].iloc[id_entry].split('|'):
col_labels.append(s)
return col_labels
In [96]:
def add_variables(df, REF_VAR):
for s in REF_VAR: df[s] = pd.Series([0 for _ in range(len(df))])
colonnes = ['genres', 'actor_name_1', 'actor_name_2',
'actor_name_3', 'director', 'keywords']
for categorie in colonnes:
for index, row in df.iterrows():
if pd.isnull(row[categorie]): continue
for s in str(row[categorie]).split('|'):
if s in REF_VAR:
if categorie == 'genres': df.set_value(index, s, 1.3)
else: df.set_value(index, s, 1)
return df
Subsample the movies by deleting those with NaN data.
In [97]:
df = df_filling.copy(deep=True)
df.reset_index(inplace = True, drop = True)
In [98]:
df_cluster= df_keywords_occurence.copy(deep=True)
df_cluster = df_cluster.dropna(how = 'any').reset_index()
Sample the dataset.
In [99]:
df_cluster = df_cluster[:255]
In [100]:
s = (len(df_cluster),len(df_cluster))
cluster_weights = np.zeros(s)
In [102]:
for i in range(len(df_cluster)):
id_entry = i
df_copy = df_cluster.copy(deep = True)
liste_genres = set()
for s in df['genres'].str.split('|').values:
liste_genres = liste_genres.union(set(s))
# Create additional variables to check the similarity
variables = entry_variables(df_copy, id_entry)
variables += list(liste_genres)
df_new = add_variables(df_copy, variables)
# determination of the closest neighbors: the distance is calculated / new variables
X = df_new.as_matrix(variables)
nbrs = NearestNeighbors(n_neighbors=31, algorithm='auto', metric='euclidean').fit(X)
distances, indices = nbrs.kneighbors(X)
xtest = df_new.iloc[id_entry].as_matrix(variables)
xtest = xtest.reshape(1, -1)
distances, indices = nbrs.kneighbors(xtest)
for j in range(len(indices[0])):
a = distances[0]
b = indices[0]
cluster_weights[i][b[j]] = a[j]
cluster_weights[b[j]][i] = a[j]
In [104]:
s=set()
for i in df['genres'].str.split('|').values:
s=s.union(set(i))
In [105]:
weights = cluster_weights.copy()
plt.spy(weights)
Out[105]:
In [106]:
degrees =[sum(weights[i,:]) for i in range(len(weights))]
plt.hist(degrees, bins=50);
In [107]:
laplacian = sparse.csgraph.laplacian(weights, normed=True)
plt.spy(laplacian);
In [108]:
laplacian = sparse.csr_matrix(laplacian)
laplacian.count_nonzero()
remaining_edges = laplacian.count_nonzero()/2
print("the remaining edges number is {}".format(int(remaining_edges)))
In [109]:
eigenvalues, eigenvectors = sparse.linalg.eigsh(laplacian,k = 10,which='SM')
plt.plot(eigenvalues, '.-', markersize=15);
In [110]:
x =eigenvectors[:, 1]
y =eigenvectors[:, 2]
In [111]:
labels = eigenvectors[:, 1]> eigenvectors.mean()
plt.scatter(x, y, c=labels, cmap='RdBu', alpha=0.5);
In [112]:
a = []
for i in range(len(labels)):
if labels[i] == True:
a.append(i)
b = []
for i in range(len(labels)):
if labels[i] == False:
b.append(i)
In [113]:
df_group1 = df_cluster.copy()
df_group1 = df_group1.drop(a)
df_group2 = df_cluster.copy()
df_group2 = df_group2.drop(b)
In [114]:
set_keywords = set()
for list_keywords in df_group1['genres'].str.split('|').values:
if isinstance(list_keywords, float): continue # only happen if liste_keywords = NaN
set_keywords = set_keywords.union(list_keywords)
# remove null chain entry
set_keywords2 = set()
for list_keywords in df_group2['genres'].str.split('|').values:
if isinstance(list_keywords, float): continue # only happen if liste_keywords = NaN
set_keywords2 = set_keywords2.union(list_keywords)
In [115]:
Group1 = count_word(df_group1,'genres',set_keywords,1)
Group2 = count_word(df_group2,'genres',set_keywords2,1)
In [116]:
Group1_list = Group1[0]
Group1_list[:3]
Out[116]:
In [117]:
Group2_list = Group2[0]
Group2_list[:3]
Out[117]:
From the results above, we can see that the genres within group 1 are similar to one another, as are those within group 2.
In order to build the recommendation engine, we basically proceed in two steps.
The first step is to define a criterion that tells us how close two films are. We extract features such as the director's name, the names of the actors, a few keywords and the genres, then build a matrix like the one below:
movie title | director | actor 1 | actor 2 | actor 3 | keyword 1 | keyword 2 | genre 1 | genre 2 | ... | genre k |
---|---|---|---|---|---|---|---|---|---|---|
Film 1 | $a_{11}$ | $a_{12}$ | ... | $a_{1q}$ | ||||||
... | ... | |||||||||
Film i | $a_{i1}$ | $a_{i2}$ | $a_{ij}$ | $a_{iq}$ | ||||||
... | ... | |||||||||
Film p | $a_{p1}$ | $a_{p2}$ | ... | $a_{pq}$ |
We compare these features of each movie with the features of the given movie, and put the outcome into the matrix (the entries can be 0, 1 or other numbers > 0).
If a movie does not share a feature with the selected movie, we set the corresponding entry to 0; if they share the feature, we set it to a positive number, usually 1. A feature can also be given a larger weight: for example, we give the "genres" feature a weight of 1.7.
For example, if "keyword 1" is in film $i$, we will have $a_{ij} = 1$ and 0 otherwise. Since we think the genres feature is more important, we set $a_{ij} = p$ or 0 (with $p = 1.7$ here).
Once this matrix has been defined, we determine the distance between two films according to:
\begin{eqnarray} d_{m, n} = \sqrt{ \sum_{i = 1}^{N} \left( a_{m,i} - a_{n,i} \right)^2 } \end{eqnarray}
At this stage, we just have to select the $N$ films which are closest to the entry selected by the user.
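As a small worked example (toy numbers, not the real feature matrix), this is how the distance above is computed between two encoded films:
import numpy as np
# columns: director, actor 1, keyword 1, genre 1 (the genre column uses the 1.7 weight)
film_m = np.array([1.0, 1.0, 0.0, 1.7])
film_n = np.array([1.0, 0.0, 1.0, 1.7])
d_mn = np.sqrt(np.sum((film_m - film_n) ** 2))
print(d_mn)  # sqrt(2) ~ 1.41: the films differ by one actor and one keyword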
According to the similarities between entries, we get a list of $N$ films. At this stage, we select 5 films from this list and give a score to every entry. We decide to compute the score according to 3 criteria: the IMDB score (vote average), the number of votes, and the release year.
Then, we calculate the score according to the formula:
\begin{eqnarray} \mathrm{score} = \mathrm{IMDB}^2 \times \phi_{\sigma_1, c_1}(\mathrm{votes}) \times \phi_{\sigma_2, c_2}(\mathrm{year}) \end{eqnarray}
where $\phi$ is a gaussian function:
\begin{eqnarray} \phi_{\sigma, c}(x) \propto \mathrm{exp}\left(-\frac{(x-c)^2}{2 \, \sigma^2}\right) \end{eqnarray}
For votes, we get the maximum number of votes among the $N$ films and set $\sigma_1 = c_1 = $ max_users.
For years, we put $\sigma_2 = 20$ and center the gaussian on the title year of the film selected by the user. With these gaussians, we give more weight to the entries with a large number of votes and to the films whose release year is close to the title selected by the user.
In [118]:
def gaussian_filter(x,y,sigma):
value = math.exp(-(x-y)**2/(2*sigma**2))
return value
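As a purely illustrative example (the numbers below are made up, not taken from the dataset), this is how the score combines the three factors once gaussian_filter is defined:
# Hypothetical values for one candidate film and the reference film
imdb_score, year_ref, year, votes, max_users = 7.5, 2005, 2009, 800, 1200
factor_year = gaussian_filter(year_ref, year, 20)            # close release years -> factor near 1
factor_vote = gaussian_filter(votes, max_users, max_users)   # many votes -> factor near 1
score = imdb_score**2 * factor_year * factor_vote
print(score)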
Function initial_variables: this function returns the values taken by the variables 'director', 'keywords' and 'actor_name_N' (N $\in$ [1:3]) for the film selected by the user.
In [119]:
def initial_variables(df_tem, id_entry):
labels = []
if pd.notnull(df_tem['director'].iloc[id_entry]):
for v in df_tem['director'].iloc[id_entry].split('|'):
labels.append(v)
if pd.notnull(df_tem['keywords'].iloc[id_entry]):
for v in df_tem['keywords'].iloc[id_entry].split('|'):
labels.append(v)
for i in range(3):
column = 'actor_name_NUM'.replace('NUM', str(i+1))
if pd.notnull(df_tem[column].iloc[id_entry]):
for v in df_tem[column].iloc[id_entry].split('|'):
labels.append(v)
return labels
Function create_matrix: this function adds a list of variables to the dataframe given in input and initializes them to 0, 1, or 1.7 depending on the correspondence between the description of the films and the content of the intial_var argument.
In [120]:
def create_matrix(df_tem, intial_var):
columns = ['genres', 'actor_name_1', 'actor_name_2',
'actor_name_3', 'director', 'keywords']
for s in intial_var:
df_tem[s] = pd.Series([0 for _ in range(len(df_tem))])
for categorie in columns:
for index, row in df_tem.iterrows():
if pd.isnull(row[categorie]): continue
for s in row[categorie].split('|'):
if s in intial_var:
#setting genres weight is 1.7
if categorie == 'genres': df_tem.set_value(index, s, 1.7)
else: df_tem.set_value(index, s, 1)
return df_tem
Function similarity: this function creates a list of N (= 31) films similar to the film selected by the user, according to their Euclidean distances.
In [121]:
def similarity(df_tem,id_entry):
list_genres = set()
for s in df_tem['genres'].str.split('|').values:
list_genres = list_genres.union(set(s))
# Create additional variables to check the similarity
variables = initial_variables(df_tem, id_entry)
variables += list(list_genres)
df_distance = create_matrix(df_tem, variables)
# use NearestNeighbors to calculate the euclidean distance
dist_matrix = df_distance.as_matrix(variables)
nbrs = NearestNeighbors(n_neighbors=31, algorithm='auto', metric='euclidean').fit(dist_matrix)
distances, indices = nbrs.kneighbors(dist_matrix)
dist_matrix_test = df_distance.iloc[id_entry].as_matrix(variables).reshape(1,-1)
distances, indices = nbrs.kneighbors(dist_matrix_test)
return indices[0][:]
Function popularity_score: this function gives a score to a film depending on its IMDB score, its release year, and the number of users who have voted for it (a film whose title matches the query film gets a score of 0).
In [122]:
def popularity_score(title_main, max_users, year_ref, title, year, imdb_score, votes):
if pd.notnull(year_ref):
factor_year = gaussian_filter(year_ref, year, 20)
else:
factor_year = 1
sigma = max_users * 1.0
if pd.notnull(votes):
factor_vote = gaussian_filter(votes, max_users, sigma)
else:
factor_vote = 0
if title_similartiy(title_main, title):
note = 0
else:
note = imdb_score**2 * factor_year * factor_vote
return note
Function extract_parameters: this function extracts some variables from the dataframe given in input for a selection of N films, and returns the list ordered according to the criteria established in the popularity_score function.
In [123]:
def extract_parameters(df_tem, list_films):
parametres_films = ['_' for _ in range(31)]
i =0
max_users = -1
for index in list_films:
parametres_films[i] = list(df_tem.iloc[index][['title', 'title_year',
'vote_average', 'overview',
'vote_count','language','genres']])
#add index
parametres_films[i].append(index)
#get the maximum vote_count
max_users = max(max_users, parametres_films[i][4] )
i += 1
title_main = parametres_films[0][0]
year_ref = parametres_films[0][1]
#sorted the parametres_films according to their popularity score
parametres_films.sort(key = lambda x:popularity_score(title_main, max_users,
year_ref, x[0], x[1], x[2], x[4]), reverse = True)
return parametres_films
Function title_similartiy: this function compares the 2 titles passed in input and determines whether they are similar or not.
In [124]:
def title_similartiy(title_1, title_2):
if fuzz.ratio(title_1, title_2) > 50 or fuzz.token_set_ratio(title_1, title_2) > 50:
return True
else:
return False
Function popularity_selection: this function completes the film_selection list, which contains the 5 films that will be recommended to the user. The films are selected from the parametres_films list and are taken into account only if their title is different enough from the other film titles.
In [125]:
def popularity_selection(film_selection, parametres_films):
film_list = film_selection[:]
film_num = len(film_list)
for i in range(31):
already_in_list = False
for s in film_selection:
if s[0] == parametres_films[i][0]:
already_in_list = True
if title_similartiy(parametres_films[i][0], s[0]):
already_in_list = True
if already_in_list:
continue
film_num += 1
if film_num <= 5:
film_list.append(parametres_films[i])
return film_list
Function reserve_latest_sametitle: this function removes sequels from the list if two or more films from the same series are present; the oldest one is kept.
In [126]:
def reserve_latest_sametitle(film_selection):
removed_from_selection = []
for i, film_1 in enumerate(film_selection):
for j, film_2 in enumerate(film_selection):
if j <= i: continue
if title_similartiy(film_1[0], film_2[0]):
last_film = film_2[0] if film_1[1] < film_2[1] else film_1[0]
removed_from_selection.append(last_film)
film_list = [film for film in film_selection if film[0] not in removed_from_selection]
return film_list
Main function: creates the list of 5 films that will be recommended to the user.
In [127]:
def movie_recommendation(df_tem, id_entry, del_new, verbose):
if verbose:
print(100*'_' + '\n' + "Recommendation: Films similar to id={} -> title: '{}' genres:'{}'.".format(id_entry,
df_tem.iloc[id_entry]['title'],df_tem.iloc[id_entry]['genres']))
print('This film is about: {}'.format(df_tem.iloc[id_entry]['overview']))
# Create a list of 31 films according to the distance
list_films = similarity(df_tem, id_entry)
# Sort the list of 31 films according to the popularity score
parametres_films = extract_parameters(df_tem, list_films)
# Select 5 films from this list
film_selection = []
film_selection = popularity_selection(film_selection, parametres_films)
# Remove the more recent films that have near-duplicate titles (the oldest is kept)
if del_new: film_selection = reserve_latest_sametitle(film_selection)
# Add new films to complete the list
film_selection = popularity_selection(film_selection, parametres_films)
selection_titles = []
#Print the recommendation
for i,s in enumerate(film_selection):
selection_titles.append([s[0].replace(u'\xa0', u''), s[3],s[6]])
if verbose:
print("nº{:<2} -> {:<30}".format(i+1, s[0]))
return selection_titles
First, we face a small issue: films from the same series share very similar titles, which can make some recommendations seem quite unreasonable. For example, somebody who liked 'The NeverEnding Story' would be recommended movies that he may not like:
In [128]:
dum = movie_recommendation(df, 2052, False, True)
This issue is easily understood: films from the same series share the same director, actors and keywords, so it is quite probable that this engine will recommend a series of films. In the previous example, we see that the engine recommends several films from the same series, such as "Harry Potter and the Philosopher's Stone", "Harry Potter and the Chamber of Secrets" and "Harry Potter and the Prisoner of Azkaban".
Hence, we use the fuzzywuzzy package to build the reserve_latest_sametitle function. This function measures the degree of similarity of two film titles and, if they are too close, the most recent film is removed from the list of recommendations.
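A quick illustration of the fuzzywuzzy scores on which title_similartiy relies (the exact values depend on the fuzzywuzzy version, but titles from the same series typically score well above the 50 threshold):
# Two titles from the same series (illustrative check only)
print(fuzz.ratio("Harry Potter and the Philosopher's Stone",
                 "Harry Potter and the Chamber of Secrets"))
print(fuzz.token_set_ratio("Harry Potter and the Philosopher's Stone",
                           "Harry Potter and the Chamber of Secrets"))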
In [129]:
df2=df.copy(deep=True)
dum2 = movie_recommendation(df2, 2052, True, True)
In [130]:
dum2
Out[130]:
We print the parameters of each recommended film. As shown above, the overviews and genres of the selected film and of the recommended films are similar. The recommendations are reasonable enough: from one film, the engine produces five similar films.
In [131]:
selection = dict()
for i in range(0, 1000,300):
selection[i] = movie_recommendation(df, i, True, True)
In [132]:
# read user's rating data
rate = pd.read_csv("../data/ratings.csv")
Function rated_movie: fetch all the movies rated by the user, remove movies with missing parameters, and return a new DataFrame with the information of the available movies.
In [133]:
def rated_movie(rate_t, df_t, userId):
# find all movies the user has voted
rate_id = rate_t[rate_t['userId'] == userId]
all_movies = list(df_t['id'])
rate_id.reset_index(inplace=True)
# remove movies that do not exist
a=[]
for i in range(rate_id.shape[0]):
if (str)((int)(rate_id.iloc[i]['movieId'])) not in all_movies:
a.append(i)
rate_id.drop(a,inplace=True)
rate_clear = rate_id
rate_clear.reset_index(inplace=True)
# find concrete information of rated movies
list_index = []
for i in range(rate_clear.shape[0]):
list_index.append(df_t[df_t['id'] == (str)((int)(rate_clear.iloc[i]['movieId']))].index.values[0])
rate_movie = df_t.iloc[list_index]
rate_movie.reset_index(inplace=True)
# merge two DataFrames
rate_all = pd.concat([rate_clear, rate_movie], axis=1)
return rate_all, rate_movie
Function favor_genres: collect the user's num_types favorite movie genres and return them as a list.
In [134]:
def favor_genres(rate_movie):
# get user's favorite genres
set_genre = set()
for s in rate_movie['genres'].str.split('|').values:
set_genre = set_genre.union(set(s))
genre_occurences, genre_d = count_word(rate_movie, 'genres', set_genre,1)
favorite_genres = []
num_types = 2;
for i in range(num_types):
favorite_genres.append(genre_occurences[i][0])
return favorite_genres
Function favor_movie: extract the user's favorite movie of each favorite genre and return a list of these movies.
In [135]:
def favor_movie(favorite_genres, rate_all):
# extract the favorite movie of each favorite genre
favor_movie = []
max_rate = []
for j in range(len(favorite_genres)):
favor_movie.append(-1)
max_rate.append(-1)
for i in range(rate_all.shape[0]):
if favorite_genres[j] in rate_all['genres'].str.split('|').values[i]:
if rate_all.iloc[i]['rating'] > max_rate[j]:
max_rate[j] = rate_all.iloc[i]['rating']
favor_movie[j] = rate_all.iloc[i]['movieId']
favor_movie_id = list(set(favor_movie))
a = list(df['id'])
favor_movie_id_location = []
for i in favor_movie_id:
favor_movie_id_location.append(a.index(str(i)))
return favor_movie_id_location
Function film_recom: give recommendations based on the user's favorite movies.
In [136]:
def film_recom(df_t, id_entry, del_new, verbose ):
if verbose:
print("You may like these movies:")
# Create a list of 31 films according to the distance
list_films = similarity(df_t, id_entry)
# Sort the list of 31 films according to the popularity score
parametres_films = extract_parameters(df_t, list_films)
# Select 5 films from this list
film_selection = []
film_selection = popularity_selection(film_selection, parametres_films)
# Remove the more recent films that have near-duplicate titles (the oldest is kept)
if del_new: film_selection = reserve_latest_sametitle(film_selection)
# Add new films to complete the list
film_selection = popularity_selection(film_selection, parametres_films)
selection_titles = []
for i,s in enumerate(film_selection[:2]):
selection_titles.append([s[0].replace(u'\xa0', u''), s[2]])
if verbose: print("nº{:<2} -> {:<30}".format(i+1, s[0]))
return selection_titles
Function user_recommendation: give 2 * num_types recommendations based on the user's ratings.
In [137]:
def user_recommendation(rate_t, df_t, userId):
#clear data
rate_all, rate_movie = rated_movie(rate_t, df_t, userId)
# if rate_all== 'error':
# return
#get user's favorite genres
favorite_genres = favor_genres(rate_movie)
#get movid id of favorite movie
favor_movie_id = favor_movie(favorite_genres, rate_all)
print("It seems that you really like watching {} and {} !".format(favorite_genres[0],favorite_genres[1]))
selection = []
for i in range(len(favor_movie_id)):
selection.extend(film_recom(df_t, favor_movie_id[i], True, True))
return selection
A user's favorite movies can be affected by two main factors: the genres the user prefers and the specific movies the user has rated highest.
To combine these two effects, we first find the two genres the user likes best. Then, for each genre, we recommend movies related to the user's favorite movie of that genre. The highest-rated movie may turn out to be the same for both genres; in that case, fewer recommendations are returned.
In [138]:
df3=df.copy(deep=True)
rate3=rate.copy(deep=True)
user_recommendation(rate3, df3, 10000)
Out[138]:
In [139]:
df3=df.copy(deep=True)
rate3=rate.copy(deep=True)
user_recommendation(rate3, df3, 222)
Out[139]:
In [140]:
df3=df.copy(deep=True)
rate3=rate.copy(deep=True)
user_recommendation(rate3, df3, 666)
Out[140]:
Besides the title of each recommended film, we also print the number of people who voted for it. This reflects how many people care about the film, and thus indicates how popular it is.
Finally, a few things were not considered when building the engine and deserve some attention: