Building an Recommendation Engine

Adopted from PyCon2013 Tutorial - How to Build a Minimal Recommendation Engine by Unata

The Recommendation Problem

Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals:

Amazon
Netflix
Facebook
Last.fm.

What exactly do they do?

Definitions from the literature

In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012

$U$ is the set of users in our domain. Its size is $|U|$.
$I$ is the set of items in our domain. Its size is $|I|$.
$I(u)$ is the set of items that user $u$ has rated.
$-I(u)$ is the complement of $I(u)$ i.e., the set of items not yet seen by user $u$.
$U(i)$ is the set of users that have rated item $i$.
$-U(i)$ is the complement of $U(i)$.
$S(u,i)$ is a function that measures the utility of item $i$ for user $u$.

Goal of a recommendation system

$ \newcommand{\argmax}{\mathop{\rm argmax}\nolimits} i^{*} = \argmax_{i \in -I(u)} S(u,i), \forall{u \in U} $

Problem statement

The recommendation problem in its most basic form is quite simple to define:

|-------------------+-----+-----+-----+-----+-----|
| user_id, movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1               | ?   | ?   | 4   | ?   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_2               | 3   | ?   | ?   | 2   | 2   |
|-------------------+-----+-----+-----+-----+-----|
| u_3               | 3   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_4               | ?   | 1   | 2   | 1   | 1   |
|-------------------+-----+-----+-----+-----+-----|
| u_5               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_6               | 2   | ?   | 2   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_7               | ?   | ?   | ?   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_8               | 3   | 1   | 5   | ?   | ?   |
|-------------------+-----+-----+-----+-----+-----|
| u_9               | ?   | ?   | ?   | ?   | 2   |
|-------------------+-----+-----+-----+-----+-----|

Given a partially filled matrix of ratings ($|U|x|I|$), estimate the missing values.

Dataset

MovieLens from GroupLens Research: grouplens.org

Datasets are available at http://grouplens.org/datasets/movielens/

We will be using the MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies.



In [ ]:

    
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 10)



In [ ]:

    
users = pd.read_table('../data/movie-lens-1m/users.dat',
                      sep='::', header=None, 
                      names=['user_id', 'gender', 'age', 'occupation', 'zip'], engine='python')

ratings = pd.read_table('../data/movie-lens-1m/ratings.dat',
                        sep='::', header=None, 
                        names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')

movies = pd.read_table('../data/movie-lens-1m/movies.dat',
                       sep='::', header=None, 
                       names=['movie_id', 'title', 'genres'], engine='python')

# show how one of them looks
ratings.head(5)



In [ ]:

    
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens.head(5)



In [ ]:

    
movielens.info()

Evaluation

Before we attempt to express the basic equations for content-based or collaborative filtering we need a basic mechanism to evaluate the performance of our engine.

Evaluation: split ratings into train and test sets

This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:

take a smaller sample from the full 1M dataset for speed reasons;
make sure that we have at least 2 ratings per user in that subset;
split the result into training and testing sets.



In [ ]:

    
# let's work with a smaller subset for speed reasons
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print movielens.shape
print movielens.user_id.nunique()
print movielens.movie_id.nunique()



In [ ]:

    
user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
movielens = movielens[user_ids_larger_1[movielens.user_id].values]
print movielens.shape
np.all(movielens.user_id.value_counts() > 1)



In [ ]:

    
def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.ix[sampled_ids, 'for_testing'] = True
    return df

movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print movielens_train.shape
print movielens_test.shape



In [ ]:

    
movielens_train.shape



In [ ]:

    
movielens_test.shape



In [ ]:

    
movielens_train.to_csv('../data/movie-lens-1m/movielens_train.csv')
movielens_test.to_csv('../data/movie-lens-1m/movielens_test.csv')

Evaluation: performance criterion

RMSE: $\sqrt{\frac{\sum(\hat y - y)^2}{n}}$

Evaluation: the 'evaluate' method



In [ ]:

    
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))



In [ ]:

    
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)



In [ ]:

    
def estimate0(user_id, item_id):
    """ The Answer is always 3 (c is the correct answer). """
    return 3.0

print 'RMSE for estimate1: %s' % evaluate(estimate0)

Minimal reco engine v1.0: simple mean ratings

Content-based filtering using mean ratings

With this table-like representation of the ratings data, a basic content-based filter becomes a one-liner function.



In [ ]:

    
movielens_train.head()



In [ ]:

    
def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return None

print 'RMSE for estimate1: %s' % evaluate(estimate1)

Collaborative-based filtering using mean ratings



In [ ]:

    
def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    return None

print 'RMSE for estimate2: %s' % evaluate(estimate2)



In [ ]:

    
print movielens_train.groupby('gender')['rating'].mean()



In [ ]:

    
print movielens_train.groupby(['gender', 'age'])['rating'].mean()



In [ ]:

    
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean()
print by_gender_title



In [ ]:

    
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean().unstack('gender')
by_gender_title.head(10)



In [ ]:

    
by_gender_title = movielens_train.pivot_table(values='rating', index='title', columns='gender')
by_gender_title.head(10)

Minimal reco engine v1.1: implicit sim functions

We're going to need a user index from the users portion of the dataset. This will allow us to retrieve information given a specific user_id in a more convenient way:



In [ ]:

    
user_info = users.set_index('user_id')
user_info.head(5)

With this in hand, we can now ask what the gender of a particular user_id is like so:



In [ ]:

    
user_id = 3
user_info.ix[user_id, 'gender']

Collaborative-based filtering by Groups

Using the pandas aggregation framework we will build a collaborative filter that estimates ratings using an implicit sim(u,u') function to compare different users.



In [ ]:

    
def estimate3(user_id, movie_id):
    """ Collaborative filtering using an implicit sim(u,u'). """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: 
        return 3.0
    means_by_gender = ratings_by_others.pivot_table('rating', index='movie_id', columns='gender')
    user_gender = user_info.ix[user_id, 'gender']
    if user_gender in means_by_gender.columns: 
        return means_by_gender.ix[movie_id, user_gender]
    else:
        return means_by_gender.ix[movie_id].mean()

print 'RMSE for reco3: %s' % evaluate(estimate3)

At this point it seems worthwhile to write a learn that pre-computes whatever datastructures we need at estimation time.



In [ ]:

    
class Reco3:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_gender = movielens_train.pivot_table('rating', index='movie_id', columns='gender')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same gender. """
        if movie_id not in self.means_by_gender.index: 
            return 3.0
        user_gender = user_info.ix[user_id, 'gender']
        if ~np.isnan(self.means_by_gender.ix[movie_id, user_gender]):
            return self.means_by_gender.ix[movie_id, user_gender]
        else:
            return self.means_by_gender.ix[movie_id].mean()

reco = Reco3()
reco.learn()
print 'RMSE for reco3: %s' % evaluate(reco.estimate)



In [ ]:

    
class Reco4:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_age = movielens_train.pivot_table('rating', index='movie_id', columns='age')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same age. """
        if movie_id not in self.means_by_age.index: return 3.0
        user_age = user_info.ix[user_id, 'age']
        if ~np.isnan(self.means_by_age.ix[movie_id, user_age]):
            return self.means_by_age.ix[movie_id, user_age]
        else:
            return self.means_by_age.ix[movie_id].mean()

reco = Reco4()
reco.learn()
print 'RMSE for reco4: %s' % evaluate(reco.estimate)



In [ ]:

    
movielens.pivot_table('rating', index='movie_id', columns='user_id')

A few similarity functions

These were all written to operate on two pandas Series, each one representing the rating history of two different users. You can also apply them to any two feature vectors that describe users or items. In all cases, the higher the return value, the more similar two Series are. You might need to add checks for edge cases, such as divisions by zero, etc.

Euclidean 'similarity'

$$ sim(x,y) = \frac{1}{1 + \sqrt{\sum (x - y)^2}}$$

Cosine similarity

$$ sim(x,y) = \frac{(x . y)}{\sqrt{(x . x) (y . y)}} $$

Pearson correlation

$$ sim(x,y) = \frac{(x - \bar x).(y - \bar y)}{\sqrt{(x - \bar x).(x - \bar x) * (y - \bar y)(y - \bar y)}} $$



In [ ]:

    
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))



In [ ]:

    
# Test Euclidean Distance function
euclidean(np.array([1, 2, 3]), np.array([1, 2, 5]))



In [ ]:

    
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))



In [ ]:

    
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))



In [ ]:

    
user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')



In [ ]:

    
user_profiles



In [ ]:

    
class Reco5:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.all_user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty: 
            return 3.0
        ratings_by_others.set_index('user_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)



In [ ]:

    
reco = Reco5()
reco.learn()
print 'RMSE for reco5: %s' % evaluate(reco.estimate)



In [ ]: