Collaborative Filtering in 9 Lines of Code

I was looking for sample code to get my hands dirty with recommender systems, particularly collaborative filtering. I came across a blog post with the same title as this notebook, which is based on Toby Segaran's book Programming Collective Intelligence.

This notebook is based on the original blog post. The code had to be adjusted to work with Python 3 and the latest version of pandas.


In [15]:
import numpy as np
import pandas as pd

The data file contains one row per rating: a critic, a movie title, and the rating that critic gave the title.


In [19]:
rating = pd.read_csv('data/movie_rating.csv')
rating.head()


Out[19]:
critic title rating
0 Jack Matthews Lady in the Water 3.0
1 Jack Matthews Snakes on a Plane 4.0
2 Jack Matthews You Me and Dupree 3.5
3 Jack Matthews Superman Returns 5.0
4 Jack Matthews The Night Listener 3.0

We will first create a matrix with movie titles as rows and critics as columns. Each cell contains the rating given by the corresponding critic to the corresponding title, or NaN if that critic has not rated it.


In [22]:
rp = rating.pivot_table(index=['title'], columns=['critic'], values='rating')
rp


Out[22]:
critic Claudia Puig Gene Seymour Jack Matthews Lisa Rose Mick LaSalle Toby
title
Just My Luck 3.0 1.5 NaN 3.0 2.0 NaN
Lady in the Water NaN 3.0 3.0 2.5 3.0 NaN
Snakes on a Plane 3.5 3.5 4.0 3.5 4.0 4.5
Superman Returns 4.0 5.0 5.0 3.5 3.0 4.0
The Night Listener 4.5 3.0 3.0 3.0 3.0 NaN
You Me and Dupree 2.5 3.5 3.5 2.5 2.0 1.0

The next step is to find the similarity score between critics. We will use Toby as an example and use the Pearson correlation score. Pandas provides the function corrwith(), which computes the correlation between each column and a given Series. As you can see from the result below, Toby's taste is very similar to Lisa Rose's but not so much to Gene Seymour's.

Note that we could have used some other similarity metric such as cosine similarity.


In [37]:
rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)
sim_toby


Out[37]:
critic
Claudia Puig     0.893405
Gene Seymour     0.381246
Jack Matthews    0.662849
Lisa Rose        0.991241
Mick LaSalle     0.924473
Toby             1.000000
dtype: float64
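
As a rough sketch of the alternative metric mentioned above, cosine similarity can be computed over the titles that both critics have rated. The matrix below is copied from the pivot table output; the helper name cosine_with is hypothetical and not part of pandas, and restricting to co-rated titles is an assumption made here to mirror how the correlation handles missing ratings.

```python
import numpy as np
import pandas as pd

nan = np.nan
titles = pd.Index(
    ["Just My Luck", "Lady in the Water", "Snakes on a Plane",
     "Superman Returns", "The Night Listener", "You Me and Dupree"],
    name="title",
)
# Ratings copied from the pivot table shown earlier (rows: titles, columns: critics)
rp = pd.DataFrame(
    {
        "Claudia Puig":  [3.0, nan, 3.5, 4.0, 4.5, 2.5],
        "Gene Seymour":  [1.5, 3.0, 3.5, 5.0, 3.0, 3.5],
        "Jack Matthews": [nan, 3.0, 4.0, 5.0, 3.0, 3.5],
        "Lisa Rose":     [3.0, 2.5, 3.5, 3.5, 3.0, 2.5],
        "Mick LaSalle":  [2.0, 3.0, 4.0, 3.0, 3.0, 2.0],
        "Toby":          [nan, nan, 4.5, 4.0, nan, 1.0],
    },
    index=titles,
)


def cosine_with(matrix, target):
    """Cosine similarity between `target` and every column of `matrix`,
    using only the rows where both have a rating (hypothetical helper)."""
    scores = {}
    for critic in matrix.columns:
        # Keep only the titles rated by both critics
        pair = pd.concat(
            [matrix[critic], target], axis=1, keys=["other", "target"]
        ).dropna()
        a = pair["other"].to_numpy()
        b = pair["target"].to_numpy()
        scores[critic] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return pd.Series(scores, name="cosine")


cos_toby = cosine_with(rp, rp["Toby"])
print(cos_toby.sort_values(ascending=False))
```

Because all ratings are positive, the raw cosine scores land between 0 and 1 and tend to sit close together. Pearson correlation is the same calculation applied to mean-centered ratings, which is one reason it separates the critics more clearly on this data.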

To make recommendations for Toby, we calculate the other critics' ratings weighted by their similarity to Toby. Note that we only need to do this for movies Toby has not yet rated. The first line below filters the ratings down to those movies and excludes Toby's own rows. The following lines attach the similarity score and the similarity-weighted rating to each remaining row.


In [75]:
rating_c = rating.loc[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')]
rating_c_similarity = rating_c['critic'].map(sim_toby)

rating_c = rating_c.assign(similarity=rating_c_similarity, sim_rating=rating_c_similarity * rating_c.rating)
rating_c.head()


Out[75]:
critic title rating sim_rating similarity
0 Jack Matthews Lady in the Water 3.0 1.988547 0.662849
4 Jack Matthews The Night Listener 3.0 1.988547 0.662849
5 Mick LaSalle Lady in the Water 3.0 2.773420 0.924473
7 Mick LaSalle Just My Luck 2.0 1.848947 0.924473
10 Mick LaSalle The Night Listener 3.0 2.773420 0.924473
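
The filter in the first line of the cell above is compact, so here is a sketch that unpacks it and compares it with a more explicit version built on isin(). The long-format table is rebuilt from the pivot table output rather than read from the CSV, and the names mask_original, mask_stepwise, and unseen_titles are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan
titles = pd.Index(
    ["Just My Luck", "Lady in the Water", "Snakes on a Plane",
     "Superman Returns", "The Night Listener", "You Me and Dupree"],
    name="title",
)
# Ratings copied from the pivot table shown earlier
rp = pd.DataFrame(
    {
        "Claudia Puig":  [3.0, nan, 3.5, 4.0, 4.5, 2.5],
        "Gene Seymour":  [1.5, 3.0, 3.5, 5.0, 3.0, 3.5],
        "Jack Matthews": [nan, 3.0, 4.0, 5.0, 3.0, 3.5],
        "Lisa Rose":     [3.0, 2.5, 3.5, 3.5, 3.0, 2.5],
        "Mick LaSalle":  [2.0, 3.0, 4.0, 3.0, 3.0, 2.0],
        "Toby":          [nan, nan, 4.5, 4.0, nan, 1.0],
    },
    index=titles,
)

# Rebuild the long (critic, title, rating) table, one row per existing rating
rating = (
    rp.reset_index()
    .melt(id_vars="title", var_name="critic", value_name="rating")
    .dropna(subset=["rating"])
    .reset_index(drop=True)
)

rating_toby = rp["Toby"]

# The one-liner from the notebook: look up Toby's rating for the title on
# each row, and keep rows where it is missing and the critic is not Toby
mask_original = rating_toby[rating.title].isnull().values & (rating.critic != "Toby")

# The same idea in two steps: list the unrated titles, then select them
unseen_titles = rating_toby.index[rating_toby.isnull()]
mask_stepwise = rating.title.isin(unseen_titles) & (rating.critic != "Toby")

rating_c = rating.loc[mask_stepwise]
print(sorted(rating_c.title.unique()))
print(np.array_equal(np.asarray(mask_original), np.asarray(mask_stepwise)))
```

Both masks select the same rows. The two-step form is a little longer but makes it easier to see that the selection depends only on which titles Toby has not rated.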

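Lastly, we add up the weighted scores for each title using groupby() and normalize by dividing by the sum of the similarity weights. Without this normalization, a movie rated by more critics would score higher simply because it has more terms in its sum. Based on the other critics' similarities and ratings, this gives a predicted rating for each movie Toby has not rated, which serves as our recommendation. The numbers match the results in the book.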


In [25]:
recommendations = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendations.sort_values(ascending=False)


Out[25]:
title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981
dtype: float64
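
To make the normalization concrete, here is the arithmetic for a single title worked by hand. The similarity scores and the ratings for The Night Listener are copied from the outputs above, rounded as displayed, so the result should agree with the table only to about four decimal places.

```python
import numpy as np

# Similarity of each critic to Toby, copied from the corrwith() output
similarity = {
    "Claudia Puig": 0.893405,
    "Gene Seymour": 0.381246,
    "Jack Matthews": 0.662849,
    "Lisa Rose": 0.991241,
    "Mick LaSalle": 0.924473,
}
# Each critic's rating of The Night Listener, copied from the pivot table
night_listener = {
    "Claudia Puig": 4.5,
    "Gene Seymour": 3.0,
    "Jack Matthews": 3.0,
    "Lisa Rose": 3.0,
    "Mick LaSalle": 3.0,
}

critics = list(similarity)
w = np.array([similarity[c] for c in critics])
r = np.array([night_listener[c] for c in critics])

# Similarity-weighted average: sum(w * r) / sum(w)
score = float((w * r).sum() / w.sum())
print(round(score, 4))  # 3.3478
```

Four of the five critics gave the movie 3.0, and Claudia Puig's 4.5 pulls the prediction up in proportion to her fairly high similarity to Toby.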

Putting it all together:


In [36]:
rating = pd.read_csv('data/movie_rating.csv')
rp = rating.pivot_table(index=['title'], columns=['critic'], values='rating')

rating_toby = rp['Toby']
sim_toby = rp.corrwith(rating_toby)

rating_c = rating.loc[rating_toby[rating.title].isnull().values & (rating.critic != 'Toby')]
rating_c_similarity = rating_c['critic'].map(sim_toby)
rating_c = rating_c.assign(similarity=rating_c_similarity, sim_rating=rating_c_similarity * rating_c.rating)

recommendations = rating_c.groupby('title').apply(lambda s: s.sim_rating.sum() / s.similarity.sum())
recommendations.sort_values(ascending=False)


Out[36]:
title
The Night Listener    3.347790
Lady in the Water     2.832550
Just My Luck          2.530981
dtype: float64
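
The same steps can be wrapped in a function so they work for any critic, not just Toby. This is only a sketch: the name recommend_for is hypothetical, the ratings are rebuilt from the pivot table output instead of the CSV, and dropping critics whose similarity is not positive is an assumption added here to keep the weighted average well-behaved (it makes no difference for Toby, whose similarities are all positive).

```python
import numpy as np
import pandas as pd


def recommend_for(rating, critic):
    """Predicted ratings for the titles `critic` has not rated,
    as a similarity-weighted average of the other critics' ratings."""
    rp = rating.pivot_table(index="title", columns="critic", values="rating")
    target = rp[critic]

    # Pearson similarity to every other critic
    sim = rp.corrwith(target).drop(critic)
    # Assumption: ignore critics with zero, negative, or undefined similarity
    sim = sim[sim > 0]

    # Ratings by the remaining critics for titles the target has not rated
    unseen = target.index[target.isnull()]
    cand = rating[rating["title"].isin(unseen) & rating["critic"].isin(sim.index)]

    weights = cand["critic"].map(sim)
    weighted_sum = (weights * cand["rating"]).groupby(cand["title"]).sum()
    weight_total = weights.groupby(cand["title"]).sum()
    return (weighted_sum / weight_total).sort_values(ascending=False)


nan = np.nan
titles = pd.Index(
    ["Just My Luck", "Lady in the Water", "Snakes on a Plane",
     "Superman Returns", "The Night Listener", "You Me and Dupree"],
    name="title",
)
# Ratings copied from the pivot table shown earlier
wide = pd.DataFrame(
    {
        "Claudia Puig":  [3.0, nan, 3.5, 4.0, 4.5, 2.5],
        "Gene Seymour":  [1.5, 3.0, 3.5, 5.0, 3.0, 3.5],
        "Jack Matthews": [nan, 3.0, 4.0, 5.0, 3.0, 3.5],
        "Lisa Rose":     [3.0, 2.5, 3.5, 3.5, 3.0, 2.5],
        "Mick LaSalle":  [2.0, 3.0, 4.0, 3.0, 3.0, 2.0],
        "Toby":          [nan, nan, 4.5, 4.0, nan, 1.0],
    },
    index=titles,
)
# Long (title, critic, rating) table with one row per existing rating
rating = (
    wide.reset_index()
    .melt(id_vars="title", var_name="critic", value_name="rating")
    .dropna(subset=["rating"])
    .reset_index(drop=True)
)

recs = recommend_for(rating, "Toby")
print(recs)
```

For Toby this should reproduce the three scores shown above. For critics who have already rated every title, the function returns an empty Series.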