Building a Recommendation Engine

Adapted from the PyCon 2013 tutorial "How to Build a Minimal Recommendation Engine" by Unata

The Recommendation Problem

Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:

  • Amazon
  • Netflix
  • Facebook
  • Last.fm

What exactly do they do?

Definitions from the literature

  • In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997

  • Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992

  • In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005

  • Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012

Further Reading:

Video Course: https://www.coursera.org/course/recsys

Notation

  • $U$ is the set of users in our domain. Its size is $|U|$.
  • $I$ is the set of items in our domain. Its size is $|I|$.
  • $I(u)$ is the set of items that user $u$ has rated.
  • $-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet seen by user $u$.
  • $U(i)$ is the set of users that have rated item $i$.
  • $-U(i)$ is the complement of $U(i)$.
  • $S(u,i)$ is a function that measures the utility of item $i$ for user $u$.

Goal of a recommendation system

$$ i^{*} = \operatorname*{argmax}_{i \in -I(u)} S(u,i), \quad \forall u \in U $$
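
In code, this goal translates directly: score every unseen item for a user and pick the best one. Here is a minimal sketch; score is a stand-in for whatever concrete $S(u,i)$ we build later in this tutorial:

def recommend(user, all_items, items_rated_by_user, score):
    """ Return the unseen item i* maximizing score(user, item). """
    unseen = [item for item in all_items if item not in items_rated_by_user]  # -I(u)
    return max(unseen, key=lambda item: score(user, item))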

Problem statement

The recommendation problem in its most basic form is quite simple to define:

| user \ movie | m_1 | m_2 | m_3 | m_4 | m_5 |
|--------------|-----|-----|-----|-----|-----|
| u_1          | ?   | ?   | 4   | ?   | 1   |
| u_2          | 3   | ?   | ?   | 2   | 2   |
| u_3          | 3   | ?   | ?   | ?   | ?   |
| u_4          | ?   | 1   | 2   | 1   | 1   |
| u_5          | ?   | ?   | ?   | ?   | ?   |
| u_6          | 2   | ?   | 2   | ?   | ?   |
| u_7          | ?   | ?   | ?   | ?   | ?   |
| u_8          | 3   | 1   | 5   | ?   | ?   |
| u_9          | ?   | ?   | ?   | ?   | 2   |

Given a partially filled ratings matrix of size $|U| \times |I|$, estimate the missing values.
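
To make this concrete, the toy matrix above can be written down directly in pandas, with NaN playing the role of '?'. Estimating those NaNs is exactly the job of the estimate functions we define below:

import numpy as np
import pandas as pd

toy = pd.DataFrame(
    [[np.nan, np.nan, 4, np.nan, 1],
     [3, np.nan, np.nan, 2, 2],
     [3, np.nan, np.nan, np.nan, np.nan],
     [np.nan, 1, 2, 1, 1],
     [np.nan] * 5,
     [2, np.nan, 2, np.nan, np.nan],
     [np.nan] * 5,
     [3, 1, 5, np.nan, np.nan],
     [np.nan, np.nan, np.nan, np.nan, 2]],
    index=['u_%d' % k for k in range(1, 10)],
    columns=['m_%d' % k for k in range(1, 6)])

toy.isnull()  # True marks the missing ratings we want to estimate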

Dataset

MovieLens from GroupLens Research: grouplens.org

Datasets are available at http://grouplens.org/datasets/movielens/

We will be using the MovieLens 1M dataset, which contains 1 million ratings collected from 6000 users on 4000 movies.


In [1]:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 10)

In [2]:
users = pd.read_table('../data/movie-lens-1m/users.dat',
                      sep='::', header=None, 
                      names=['user_id', 'gender', 'age', 'occupation', 'zip'], engine='python')

ratings = pd.read_table('../data/movie-lens-1m/ratings.dat',
                        sep='::', header=None, 
                        names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')

movies = pd.read_table('../data/movie-lens-1m/movies.dat',
                       sep='::', header=None, 
                       names=['movie_id', 'title', 'genres'], engine='python')

# show how one of them looks
ratings.head(5)


Out[2]:
user_id movie_id rating timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291

In [3]:
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens.head(5)


Out[3]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama

In [4]:
movielens.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
user_id       1000209 non-null int64
movie_id      1000209 non-null int64
rating        1000209 non-null int64
timestamp     1000209 non-null int64
gender        1000209 non-null object
age           1000209 non-null int64
occupation    1000209 non-null int64
zip           1000209 non-null object
title         1000209 non-null object
genres        1000209 non-null object
dtypes: int64(6), object(4)
memory usage: 83.9+ MB

Evaluation

Before we attempt to express the basic equations for content-based or collaborative filtering, we need a basic mechanism to evaluate the performance of our engine.

Evaluation: split ratings into train and test sets

This subsection generates the training and test sets for evaluation. You do not need to understand every single line of code; just get the general gist:

  • take a smaller sample from the full 1M dataset for speed reasons;
  • make sure that we have at least 2 ratings per user in that subset;
  • split the result into training and testing sets.

In [5]:
# let's work with a smaller subset for speed reasons
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print movielens.shape
print movielens.user_id.nunique()
print movielens.movie_id.nunique()


(10000, 10)
3659
2244

In [6]:
user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
movielens = movielens[user_ids_larger_1[movielens.user_id].values]
print movielens.shape
np.all(movielens.user_id.value_counts() > 1)


(8489, 10)
Out[6]:
True

In [7]:
def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.ix[sampled_ids, 'for_testing'] = True
    return df

movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print movielens_train.shape
print movielens_test.shape


(5842, 11)
(2647, 11)

In [8]:
movielens_train.shape


Out[8]:
(5842, 11)

In [9]:
movielens_test.shape


Out[9]:
(2647, 11)

In [10]:
movielens_train.to_csv('../data/movie-lens-1m/movielens_train.csv')
movielens_test.to_csv('../data/movie-lens-1m/movielens_test.csv')

Evaluation: performance criterion

  • RMSE: $\sqrt{\frac{\sum_i (\hat y_i - y_i)^2}{n}}$

Evaluation: the 'evaluate' method


In [11]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
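
A quick sanity check, with the expected value computed by hand:

# predictions off by 2 and 0 -> MSE = 2 -> RMSE = sqrt(2) ~= 1.414
compute_rmse(np.array([3, 4]), np.array([5, 4]))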

In [12]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)

Minimal reco engine v1.0: simple mean ratings

Content-based filtering using mean ratings

With this table-like representation of the ratings data, a basic content-based filter becomes a one-liner function.


In [13]:
def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return movielens_train.ix[movielens_train.user_id == user_id, 'rating'].mean()

print 'RMSE for estimate1: %s' % evaluate(estimate1)


RMSE for estimate1: 1.23911345489

Collaborative-based filtering using mean ratings


In [14]:
def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: 
        return 3.0
    return ratings_by_others.rating.mean()

print 'RMSE for estimate2: %s' % evaluate(estimate2)


RMSE for estimate2: 1.14635101187

In [15]:
print movielens_train.groupby('gender')['rating'].mean()


gender
F         3.598227
M         3.542112
Name: rating, dtype: float64

In [16]:
print movielens_train.groupby(['gender', 'age'])['rating'].mean()


gender  age
F       1      3.613636
        18     3.449057
        25     3.631985
...
M       45     3.640351
        50     3.518644
        56     3.662500
Name: rating, Length: 14, dtype: float64

In [17]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean()
print by_gender_title


gender  title                            
F       'Night Mother (1986)                 1
        10 Things I Hate About You (1999)    4
        101 Dalmatians (1961)                4
...
M       Zed & Two Noughts, A (1985)    1
        Zero Effect (1998)             2
        eXistenZ (1999)                3
Name: rating, Length: 2614, dtype: float64

In [18]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean().unstack('gender')
by_gender_title.head(10)


Out[18]:
gender F M
title
'Night Mother (1986) 1.0 NaN
'burbs, The (1989) NaN 4.000000
...And Justice for All (1979) NaN 1.000000
10 Things I Hate About You (1999) 4.0 5.000000
101 Dalmatians (1961) 4.0 3.166667
101 Dalmatians (1996) NaN 1.500000
13th Warrior, The (1999) NaN 4.000000
20,000 Leagues Under the Sea (1954) NaN 2.800000
2001: A Space Odyssey (1968) 3.5 4.100000
2010 (1984) NaN 4.000000

In [19]:
by_gender_title = movielens_train.pivot_table(values='rating', index='title', columns='gender')
by_gender_title.head(10)


Out[19]:
gender F M
title
'Night Mother (1986) 1.0 NaN
'burbs, The (1989) NaN 4.000000
...And Justice for All (1979) NaN 1.000000
10 Things I Hate About You (1999) 4.0 5.000000
101 Dalmatians (1961) 4.0 3.166667
101 Dalmatians (1996) NaN 1.500000
13th Warrior, The (1999) NaN 4.000000
20,000 Leagues Under the Sea (1954) NaN 2.800000
2001: A Space Odyssey (1968) 3.5 4.100000
2010 (1984) NaN 4.000000

Minimal reco engine v1.1: implicit sim functions

We're going to need a user index from the users portion of the dataset. This will let us retrieve information for a specific user_id more conveniently:


In [20]:
user_info = users.set_index('user_id')
user_info.head(5)


Out[20]:
gender age occupation zip
user_id
1 F 1 10 48067
2 M 56 16 70072
3 M 25 15 55117
4 M 45 7 02460
5 M 25 20 55455

With this in hand, we can now look up the gender of a particular user_id like so:


In [21]:
user_id = 3
user_info.ix[user_id, 'gender']


Out[21]:
'M'

Collaborative-based filtering by Groups

Using the pandas aggregation framework, we will build a collaborative filter that estimates ratings using an implicit sim(u,u') function to compare different users.


In [22]:
def estimate3(user_id, movie_id):
    """ Collaborative filtering using an implicit sim(u,u'). """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: 
        return 3.0
    means_by_gender = ratings_by_others.pivot_table('rating', index='movie_id', columns='gender')
    user_gender = user_info.ix[user_id, 'gender']
    if user_gender in means_by_gender.columns: 
        return means_by_gender.ix[movie_id, user_gender]
    else:
        return means_by_gender.ix[movie_id].mean()

print 'RMSE for reco3: %s' % evaluate(estimate3)


RMSE for reco3: 1.19104892024

At this point it seems worthwhile to write a learn() method that pre-computes whatever data structures we need at estimation time.


In [23]:
class Reco3:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_gender = movielens_train.pivot_table('rating', index='movie_id', columns='gender')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same gender. """
        if movie_id not in self.means_by_gender.index: 
            return 3.0
        user_gender = user_info.ix[user_id, 'gender']
        if not np.isnan(self.means_by_gender.ix[movie_id, user_gender]):
            return self.means_by_gender.ix[movie_id, user_gender]
        else:
            return self.means_by_gender.ix[movie_id].mean()

reco = Reco3()
reco.learn()
print 'RMSE for reco3: %s' % evaluate(reco.estimate)


RMSE for reco3: 1.19104892024

In [24]:
class Reco4:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_age = movielens_train.pivot_table('rating', index='movie_id', columns='age')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same age. """
        if movie_id not in self.means_by_age.index: return 3.0
        user_age = user_info.ix[user_id, 'age']
        if not np.isnan(self.means_by_age.ix[movie_id, user_age]):
            return self.means_by_age.ix[movie_id, user_age]
        else:
            return self.means_by_age.ix[movie_id].mean()

reco = Reco4()
reco.learn()
print 'RMSE for reco4: %s' % evaluate(reco.estimate)


RMSE for reco4: 1.21790603479

In [25]:
movielens.pivot_table('rating', index='movie_id', columns='user_id')


Out[25]:
user_id 2 5 10 11 17 18 22 25 26 28 ... 6010 6011 6016 6018 6023 6025 6026 6035 6036 6037
movie_id
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3946 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3948 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3949 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3952 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2161 rows × 2148 columns

A few similarity functions

These were all written to operate on two pandas Series, each representing the rating history of a different user. You can also apply them to any two feature vectors that describe users or items. In all cases, the higher the return value, the more similar the two Series are. You might need to add checks for edge cases, such as division by zero (illustrated after the function definitions below).

  • Euclidean 'similarity'
$$ sim(x,y) = \frac{1}{1 + \sqrt{\sum (x - y)^2}} $$
  • Cosine similarity
$$ sim(x,y) = \frac{x \cdot y}{\sqrt{(x \cdot x)\,(y \cdot y)}} $$
  • Pearson correlation
$$ sim(x,y) = \frac{(x - \bar x) \cdot (y - \bar y)}{\sqrt{((x - \bar x) \cdot (x - \bar x))\,((y - \bar y) \cdot (y - \bar y))}} $$

In [26]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

In [27]:
# Test the euclidean similarity function
euclidean(np.array([1, 2, 3]), np.array([1, 2, 5]))


Out[27]:
0.33333333333333331

In [28]:
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

In [29]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
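
The same kind of spot check works for these two. Vectors that are perfectly linearly related should score exactly 1.0, and a constant vector exposes the division-by-zero edge case mentioned above:

s1 = pd.Series([1.0, 2.0, 3.0])
s2 = pd.Series([2.0, 4.0, 6.0])

print cosine(s1, s2)                           # 1.0: s2 is a scaled copy of s1
print pearson(s1, s2)                          # 1.0: perfect positive linear correlation
print pearson(s1, pd.Series([2.0, 2.0, 2.0]))  # nan: a constant series has zero variance (0/0)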

In [30]:
user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')

In [31]:
class Reco5:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.all_user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty: return 3.0
        ratings_by_others = ratings_by_others.set_index('user_id')
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)

In [32]:
reco = Reco5()
reco.learn()
print 'RMSE for reco5: %s' % evaluate(reco.estimate)


RMSE for reco5: 1.09772699942
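
With an RMSE of roughly 1.10, the similarity-weighted filter outperforms all of the mean-based estimators above (1.24, 1.15, 1.19 and 1.22), suggesting that even an implicit notion of user similarity carries real signal.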
