Collaborative Filtering in 9 Lines of Code

I was looking for sample code to get my hands dirty with recommender systems, particularly collaborative filtering. I came across this blog post with the same title, which is based on Toby Segaran's book Programming Collective Intelligence.

This notebook is based on the original blog post. The code had to be adjusted to work with Python 3 and the latest version of pandas.



In [15]:

    
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

The file contains ratings from different critics on various titles.



In [19]:

    
rating = pd.read_csv('data/movie_rating.csv')
rating.head()









    Out[19]:






  
    
      
          critic               title  rating
0  Jack Matthews   Lady in the Water     3.0
1  Jack Matthews   Snakes on a Plane     4.0
2  Jack Matthews   You Me and Dupree     3.5
3  Jack Matthews    Superman Returns     5.0
4  Jack Matthews  The Night Listener     3.0

We will first create a matrix with movie titles as rows and critics as columns. Each cell contains the rating that the corresponding critic gave to that title, or NaN if the critic has not rated it.



In [22]:

    
rp = rating.pivot_table(index=['title'], columns=['critic'], values='rating')
rp









    Out[22]:






  
    
critic              Claudia Puig  Gene Seymour  Jack Matthews  Lisa Rose  Mick LaSalle  Toby
title
Just My Luck                 3.0           1.5            NaN        3.0           2.0   NaN
Lady in the Water            NaN           3.0            3.0        2.5           3.0   NaN
Snakes on a Plane            3.5           3.5            4.0        3.5           4.0   4.5
Superman Returns             4.0           5.0            5.0        3.5           3.0   4.0
The Night Listener           4.5           3.0            3.0        3.0           3.0   NaN
You Me and Dupree            2.5           3.5            3.5        2.5           2.0   1.0
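
To see the reshaping step in isolation, here is a minimal sketch on a hand-typed subset of the ratings shown above. The name long_form is made up for this illustration and is not part of the original notebook.

```python
import pandas as pd

# A few rows in the same long format as the CSV (values taken from the tables above).
long_form = pd.DataFrame({
    "critic": ["Toby", "Toby", "Lisa Rose", "Lisa Rose", "Lisa Rose"],
    "title": ["Snakes on a Plane", "Superman Returns",
              "Snakes on a Plane", "Superman Returns", "Just My Luck"],
    "rating": [4.5, 4.0, 3.5, 3.5, 3.0],
})

# One row per title, one column per critic; missing combinations become NaN.
wide = long_form.pivot_table(index="title", columns="critic", values="rating")
print(wide)
```

Toby has no rating for Just My Luck in this subset, so that cell comes out as NaN. Note that pivot_table averages duplicate (title, critic) pairs by default, which does not matter here because each critic rates a title at most once.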

The next step is to find the similarity score between the critics. We will use Toby as an example and use the Pearson correlation score. pandas provides the corrwith() method, which computes the correlation between each column of a DataFrame and a given Series, using only the titles both critics have rated. As you can see from the result below, Toby's taste is similar to Lisa Rose's but not so much to Gene Seymour's.

Note that we could have used some other similarity metric such as cosine similarity.
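
As a rough illustration of that alternative, the sketch below computes cosine similarity between Toby and Lisa Rose in plain Python. Restricting the vectors to the three titles Toby has rated is an assumption made here (it mirrors how the correlation skips missing ratings), and cosine_similarity is a hypothetical helper, not a pandas function.

```python
import math

def cosine_similarity(xs, ys):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum(x * x for x in xs))
    norm_y = math.sqrt(sum(y * y for y in ys))
    return dot / (norm_x * norm_y)

# Ratings for Snakes on a Plane, Superman Returns, You Me and Dupree.
toby = [4.5, 4.0, 1.0]
lisa_rose = [3.5, 3.5, 2.5]

cos_lisa = cosine_similarity(toby, lisa_rose)
print(round(cos_lisa, 4))
```

Because all ratings are positive, cosine similarity stays close to 1 for every pair of critics, so it separates tastes less sharply than the Pearson score on this data.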



In [37]:

    
rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)
sim_toby









    Out[37]:





critic
Claudia Puig     0.893405
Gene Seymour     0.381246
Jack Matthews    0.662849
Lisa Rose        0.991241
Mick LaSalle     0.924473
Toby             1.000000
dtype: float64
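
As a sanity check on these scores, the Pearson correlation can be recomputed by hand for two of the critics. This is a minimal sketch in plain Python; pearson is a hypothetical helper written for this check, and the vectors contain only the three titles Toby has rated.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Ratings for Snakes on a Plane, Superman Returns, You Me and Dupree.
toby = [4.5, 4.0, 1.0]
lisa_rose = [3.5, 3.5, 2.5]
gene_seymour = [3.5, 5.0, 3.5]

r_lisa = pearson(toby, lisa_rose)
r_gene = pearson(toby, gene_seymour)
print(round(r_lisa, 6), round(r_gene, 6))
```

Both values should agree with the Lisa Rose and Gene Seymour entries in the output above.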

To make recommendations for Toby, we calculate the other critics' ratings weighted by their similarity to Toby. Note that we only need to calculate ratings for movies Toby has not yet seen. The first line below filters out irrelevant data: it keeps only ratings from other critics for titles Toby has not rated. The following lines then attach the similarity score and the weighted rating to each remaining row.



In [75]:

    
rating_c = rating.loc[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')]
rating_c_similarity = rating_c['critic'].map(sim_toby)

rating_c = rating_c.assign(similarity=rating_c_similarity, sim_rating=rating_c_similarity * rating_c.rating)
rating_c.head()









    Out[75]:






  
    
      
           critic               title  rating  sim_rating  similarity
0   Jack Matthews   Lady in the Water     3.0    1.988547    0.662849
4   Jack Matthews  The Night Listener     3.0    1.988547    0.662849
5    Mick LaSalle   Lady in the Water     3.0    2.773420    0.924473
7    Mick LaSalle        Just My Luck     2.0    1.848947    0.924473
10   Mick LaSalle  The Night Listener     3.0    2.773420    0.924473
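
The filtering expression is the densest line in the notebook, so here is a small sketch of what it does on a four-row frame. The miniature data and the names mask_original and mask_map are made up for this illustration; the second spelling with map() is offered as an assumed equivalent, not as the original author's code.

```python
import pandas as pd

# Hypothetical miniature of the ratings file (values taken from the tables above).
rating = pd.DataFrame({
    "critic": ["Lisa Rose", "Lisa Rose", "Toby", "Gene Seymour"],
    "title": ["Just My Luck", "Snakes on a Plane", "Snakes on a Plane", "Just My Luck"],
    "rating": [3.0, 3.5, 4.5, 1.5],
})
rp = rating.pivot_table(index="title", columns="critic", values="rating")
rating_toby = rp["Toby"]

# Same test as in the notebook: look up Toby's rating for each row's title,
# keep rows where it is missing and the critic is someone other than Toby.
mask_original = rating_toby[rating.title].isnull().values & (rating.critic != "Toby")

# An arguably more readable spelling of the same lookup.
mask_map = rating["title"].map(rating_toby).isna() & (rating["critic"] != "Toby")

print(rating.loc[mask_map])
```

Only the two Just My Luck rows survive, because Toby has already rated Snakes on a Plane.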

Lastly, we add up the weighted scores for each title using groupby(). We also normalize each score by dividing it by the sum of the weights, that is, the similarities of the critics who rated that title. Based on the other critics' similarity and their ratings, we have made a movie recommendation for Toby. The numbers match the results in the book.



In [25]:

    
recommendations = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
# Rank titles by predicted rating, highest first.
recommendations.sort_values(ascending=False)









    Out[25]:





title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981
dtype: float64
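
To make the normalization concrete, the score for The Night Listener can be reproduced by hand. The sketch below copies the similarities and ratings from the outputs above into a dictionary named votes (a name chosen for this illustration) and computes the similarity-weighted average.

```python
# critic: (similarity to Toby, rating for The Night Listener)
votes = {
    "Claudia Puig": (0.893405, 4.5),
    "Gene Seymour": (0.381246, 3.0),
    "Jack Matthews": (0.662849, 3.0),
    "Lisa Rose": (0.991241, 3.0),
    "Mick LaSalle": (0.924473, 3.0),
}

# Numerator: each rating weighted by how similar the critic is to Toby.
weighted_sum = sum(sim * score for sim, score in votes.values())
# Denominator: total weight, so the result stays on the original rating scale.
weight_total = sum(sim for sim, _ in votes.values())

predicted = weighted_sum / weight_total
print(round(predicted, 4))
```

Claudia Puig's 4.5 pulls the prediction above the 3.0 that everyone else gave, in proportion to her similarity to Toby.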

Putting it all together:



In [36]:

    
rating = pd.read_csv('data/movie_rating.csv')
rp = rating.pivot_table(index=['title'], columns=['critic'], values='rating')

rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)

rating_c = rating.loc[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')]
rating_c_similarity = rating_c['critic'].map(sim_toby)
rating_c = rating_c.assign(similarity=rating_c_similarity, sim_rating=rating_c_similarity * rating_c.rating)

recommendations = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendations.sort_values(ascending=False)









    Out[36]:





title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981
dtype: float64
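
For readers without data/movie_rating.csv at hand, the following self-contained sketch rebuilds the ratings from the pivot table shown earlier and runs the same steps. It assumes that table holds the complete contents of the file. The names table, unseen and totals are illustrative, and the groupby().sum() form is one possible spelling of the weighted average rather than the original code.

```python
import pandas as pd

critics = ["Claudia Puig", "Gene Seymour", "Jack Matthews",
           "Lisa Rose", "Mick LaSalle", "Toby"]
# Ratings transcribed from the pivot table above (None = not rated).
table = {
    "Just My Luck":       [3.0, 1.5, None, 3.0, 2.0, None],
    "Lady in the Water":  [None, 3.0, 3.0, 2.5, 3.0, None],
    "Snakes on a Plane":  [3.5, 3.5, 4.0, 3.5, 4.0, 4.5],
    "Superman Returns":   [4.0, 5.0, 5.0, 3.5, 3.0, 4.0],
    "The Night Listener": [4.5, 3.0, 3.0, 3.0, 3.0, None],
    "You Me and Dupree":  [2.5, 3.5, 3.5, 2.5, 2.0, 1.0],
}
rating = pd.DataFrame([
    {"critic": c, "title": t, "rating": r}
    for t, values in table.items()
    for c, r in zip(critics, values)
    if r is not None
])

rp = rating.pivot_table(index="title", columns="critic", values="rating")
sim_toby = rp.corrwith(rp["Toby"])

# Ratings by other critics for titles Toby has not rated.
unseen = rating["title"].map(rp["Toby"]).isna() & (rating["critic"] != "Toby")
rating_c = rating.loc[unseen]
similarity = rating_c["critic"].map(sim_toby)
rating_c = rating_c.assign(similarity=similarity,
                           sim_rating=similarity * rating_c["rating"])

# Similarity-weighted average per title, highest prediction first.
totals = rating_c.groupby("title")[["sim_rating", "similarity"]].sum()
recommendations = (totals["sim_rating"] / totals["similarity"]).sort_values(ascending=False)
print(recommendations)
```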
