Building a Recommendation Engine

The Recommendation Problem

Recommenders have been around since at least 1992. Today we see many flavours of recommender deployed across different verticals:

  • Amazon
  • Netflix
  • Facebook
  • Last.fm

What exactly do they do?

Definitions from the literature

  • In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997

  • Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al., 1992

  • In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005

  • Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012


Notation

  • $U$ is the set of users in our domain. Its size is $|U|$.
  • $I$ is the set of items in our domain. Its size is $|I|$.
  • $I(u)$ is the set of items that user $u$ has rated.
  • $-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet seen by user $u$.
  • $U(i)$ is the set of users that have rated item $i$.
  • $-U(i)$ is the complement of $U(i)$.
  • $S(u,i)$ is a function that measures the utility of item $i$ for user $u$.

Goal of a recommendation system

$$ i^{*} = \mathop{\rm argmax}\limits_{i \in -I(u)} S(u,i), \quad \forall u \in U $$
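
Expressed in code, the goal is a per-user search for the best-scoring unseen item. Here is a minimal sketch, assuming some scoring function `S(u, i)` is available (building a good `S` is what the rest of this notebook is about); the toy items and utility function below are purely illustrative.

In [ ]:
def recommend(u, S, all_items, rated_by_u):
    """ Return the unseen item with the highest estimated utility for user u. """
    unseen = all_items - rated_by_u             # this is -I(u)
    return max(unseen, key=lambda i: S(u, i))   # argmax of S(u, i) over -I(u)

# toy example with a made-up utility function; prints 'm_22'
print(recommend('u_1', lambda u, i: len(i), {'m_1', 'm_22', 'm_333'}, {'m_333'}))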

Problem statement

The recommendation problem in its most basic form is quite simple to define:

| user_id \ movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|--------------------+-----+-----+-----+-----+-----|
| u_1                | ?   | ?   | 4   | ?   | 1   |
| u_2                | 3   | ?   | ?   | 2   | 2   |
| u_3                | 3   | ?   | ?   | ?   | ?   |
| u_4                | ?   | 1   | 2   | 1   | 1   |
| u_5                | ?   | ?   | ?   | ?   | ?   |
| u_6                | 2   | ?   | 2   | ?   | ?   |
| u_7                | ?   | ?   | ?   | ?   | ?   |
| u_8                | 3   | 1   | 5   | ?   | ?   |
| u_9                | ?   | ?   | ?   | ?   | 2   |

Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.
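
Concretely, the toy matrix above maps naturally onto a pandas DataFrame, with `NaN` standing in for the missing ratings. An illustrative sketch showing the first three users:

In [ ]:
import numpy as np
import pandas as pd

# the first three rows of the toy matrix, with NaN for unknown ratings
toy = pd.DataFrame([[np.nan, np.nan, 4, np.nan, 1],
                    [3, np.nan, np.nan, 2, 2],
                    [3, np.nan, np.nan, np.nan, np.nan]],
                   index=['u_1', 'u_2', 'u_3'],
                   columns=['m_1', 'm_2', 'm_3', 'm_4', 'm_5'])
print(toy)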

Dataset

MovieLens from GroupLens Research: grouplens.org

Datasets are available at http://grouplens.org/datasets/movielens/

We will be using the MovieLens 1M data set, which contains 1 million ratings collected from 6,000 users on 4,000 movies.


In [26]:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 10)

In [27]:
users = pd.read_table('../data/movie-lens-1m/users.dat',
                      sep='::', header=None, 
                      names=['user_id', 'gender', 'age', 'occupation', 'zip'], engine='python')

ratings = pd.read_table('../data/movie-lens-1m/ratings.dat',
                        sep='::', header=None, 
                        names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')

movies = pd.read_table('../data/movie-lens-1m/movies.dat',
                       sep='::', header=None, 
                       names=['movie_id', 'title', 'genres'], engine='python')

# show how one of them looks
ratings.head(5)



In [3]:
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens.head(5)


Out[3]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama

In [4]:
movielens.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
user_id       1000209 non-null int64
movie_id      1000209 non-null int64
rating        1000209 non-null int64
timestamp     1000209 non-null int64
gender        1000209 non-null object
age           1000209 non-null int64
occupation    1000209 non-null int64
zip           1000209 non-null object
title         1000209 non-null object
genres        1000209 non-null object
dtypes: int64(6), object(4)
memory usage: 83.9+ MB

Evaluation

Before we attempt to express the basic equations for content-based or collaborative filtering, we need a basic mechanism to evaluate the performance of our engine.

Evaluation: split ratings into train and test sets

This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:

  • take a smaller sample from the full 1M dataset for speed reasons;
  • make sure that we have at least 2 ratings per user in that subset;
  • split the result into training and testing sets.

In [5]:
# let's work with a smaller subset for speed reasons
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print movielens.shape
print movielens.user_id.nunique()
print movielens.movie_id.nunique()


(10000, 10)
3659
2244

In [6]:
user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
movielens = movielens[user_ids_larger_1[movielens.user_id].values]
print movielens.shape
np.all(movielens.user_id.value_counts() > 1)


(8489, 10)
Out[6]:
True

In [7]:
def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.ix[sampled_ids, 'for_testing'] = True
    return df

movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print movielens_train.shape
print movielens_test.shape


(5842, 11)
(2647, 11)

In [8]:
movielens_train.shape


Out[8]:
(5842, 11)

In [9]:
movielens_test.shape


Out[9]:
(2647, 11)

In [10]:
movielens_train.to_csv('../data/movielens_train.csv', index=False)
movielens_test.to_csv('../data/movielens_test.csv', index=False)

In [13]:
movielens_train = pd.read_csv('../data/movielens_train.csv')
movielens_test = pd.read_csv('../data/movielens_test.csv')

In [14]:
movielens_train.columns


Out[14]:
Index([u'user_id', u'movie_id', u'rating', u'timestamp', u'gender', u'age', u'occupation', u'zip', u'title', u'genres'], dtype='object')

In [15]:
movielens_train.head()


Out[15]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1793 1641 4 974703937 M 25 4 20008 Full Monty, The (1997) Comedy
1 73 70 5 977865460 M 18 4 53706 From Dusk Till Dawn (1996) Action|Comedy|Crime|Horror|Thriller
2 4725 805 3 963382233 M 35 5 96707-1321 Time to Kill, A (1996) Drama
3 2288 3869 2 974522733 M 1 10 13066 Naked Gun 2 1/2: The Smell of Fear, The (1991) Comedy
4 5630 2599 4 980383876 M 35 17 06854 Election (1999) Comedy

Evaluation: performance criterion

  • RMSE: $\sqrt{\frac{\sum_i (\hat y_i - y_i)^2}{n}}$

Evaluation: the 'evaluate' method


In [16]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
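
A quick sanity check on toy arrays (illustrative): predictions that are off by exactly one everywhere should give an RMSE of exactly 1.

In [ ]:
print(compute_rmse(np.array([3., 4., 5.]), np.array([2., 3., 4.])))  # -> 1.0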

In [17]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)

Minimal reco engine v1.0: simple mean ratings

Content-based filtering using mean ratings

With this table-like representation of the ratings data, a basic content-based filter becomes a one-liner function.


In [18]:
def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return movielens_train.ix[movielens_train.user_id == user_id, 'rating'].mean()

print 'RMSE for estimate1: %s' % evaluate(estimate1)


RMSE for estimate1: 1.27463489115

Collaborative filtering using mean ratings


In [19]:
def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: 
        return 3.0
    return ratings_by_others.rating.mean()

print 'RMSE for estimate2: %s' % evaluate(estimate2)


RMSE for estimate2: 1.14835537277

In [20]:
print movielens_train.groupby('gender')['rating'].mean()


gender
F         3.588112
M         3.541770
Name: rating, dtype: float64

In [21]:
print movielens_train.groupby(['gender', 'age'])['rating'].mean()


gender  age
F       1      3.625000
        18     3.347222
        25     3.587912
...
M       45     3.492877
        50     3.723032
        56     3.688889
Name: rating, Length: 14, dtype: float64

In [22]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean()
print by_gender_title


gender  title                            
F       'burbs, The (1989)                   1.0
        10 Things I Hate About You (1999)    3.5
        101 Dalmatians (1961)                3.0
...
M       Young Poisoner's Handbook, The (1995)    3.666667
        Your Friends and Neighbors (1998)        2.500000
        Zero Effect (1998)                       4.000000
Name: rating, Length: 2662, dtype: float64

In [23]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean().unstack('gender')
by_gender_title.head(10)


Out[23]:
gender F M
title
'burbs, The (1989) 1.0 2.666667
...And Justice for All (1979) NaN 3.000000
10 Things I Hate About You (1999) 3.5 3.000000
101 Dalmatians (1961) 3.0 4.000000
101 Dalmatians (1996) NaN 2.000000
12 Angry Men (1957) 4.5 4.250000
13th Warrior, The (1999) NaN 2.600000
2 Days in the Valley (1996) 4.0 NaN
20,000 Leagues Under the Sea (1954) 4.0 3.600000
2001: A Space Odyssey (1968) 5.0 4.363636

In [24]:
by_gender_title = movielens_train.pivot_table(values='rating', index='title', columns='gender')
by_gender_title.head(10)


Out[24]:
gender F M
title
'burbs, The (1989) 1.0 2.666667
...And Justice for All (1979) NaN 3.000000
10 Things I Hate About You (1999) 3.5 3.000000
101 Dalmatians (1961) 3.0 4.000000
101 Dalmatians (1996) NaN 2.000000
12 Angry Men (1957) 4.5 4.250000
13th Warrior, The (1999) NaN 2.600000
2 Days in the Valley (1996) 4.0 NaN
20,000 Leagues Under the Sea (1954) 4.0 3.600000
2001: A Space Odyssey (1968) 5.0 4.363636

Minimal reco engine v1.1: implicit sim functions

We're going to need a user index from the users portion of the dataset. This will let us retrieve information for a specific user_id more conveniently:


In [25]:
user_info = users.set_index('user_id')
user_info.head(5)



With this in hand, we can look up the gender of a particular user_id like so:


In [21]:
user_id = 3
user_info.ix[user_id, 'gender']


Out[21]:
'M'

Collaborative filtering by groups

Using the pandas aggregation framework we will build a collaborative filter that estimates ratings using an implicit sim(u,u') function to compare different users.


In [22]:
def estimate3(user_id, movie_id):
    """ Collaborative filtering using an implicit sim(u,u'). """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: 
        return 3.0
    means_by_gender = ratings_by_others.pivot_table('rating', index='movie_id', columns='gender')
    user_gender = user_info.ix[user_id, 'gender']
    if user_gender in means_by_gender.columns: 
        return means_by_gender.ix[movie_id, user_gender]
    else:
        return means_by_gender.ix[movie_id].mean()

print 'RMSE for reco3: %s' % evaluate(estimate3)


RMSE for reco3: 1.19104892024

At this point it seems worthwhile to write a `learn()` method that pre-computes whatever data structures we need at estimation time.


In [23]:
class Reco3:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_gender = movielens_train.pivot_table('rating', index='movie_id', columns='gender')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same gender. """
        if movie_id not in self.means_by_gender.index: 
            return 3.0
        user_gender = user_info.ix[user_id, 'gender']
        if not np.isnan(self.means_by_gender.ix[movie_id, user_gender]):
            return self.means_by_gender.ix[movie_id, user_gender]
        else:
            return self.means_by_gender.ix[movie_id].mean()

reco = Reco3()
reco.learn()
print 'RMSE for reco3: %s' % evaluate(reco.estimate)


RMSE for reco3: 1.19104892024

In [24]:
class Reco4:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_age = movielens_train.pivot_table('rating', index='movie_id', columns='age')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same age. """
        if movie_id not in self.means_by_age.index: return 3.0
        user_age = user_info.ix[user_id, 'age']
        if not np.isnan(self.means_by_age.ix[movie_id, user_age]):
            return self.means_by_age.ix[movie_id, user_age]
        else:
            return self.means_by_age.ix[movie_id].mean()

reco = Reco4()
reco.learn()
print 'RMSE for reco4: %s' % evaluate(reco.estimate)


RMSE for reco4: 1.21790603479

In [25]:
movielens.pivot_table('rating', index='movie_id', columns='user_id')


Out[25]:
user_id 2 5 10 11 17 18 22 25 26 28 ... 6010 6011 6016 6018 6023 6025 6026 6035 6036 6037
movie_id
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3946 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3948 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3949 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3952 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2161 rows × 2148 columns
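
Note how sparse this matrix is: 8489 ratings spread over 2161 × 2148 cells is a density of roughly 0.2%. A quick check (illustrative):

In [ ]:
profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')
# fraction of cells that actually hold a rating
density = float(profiles.count().sum()) / profiles.size
print('matrix density: %.4f' % density)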

A few similarity functions

These are all written to operate on two pandas Series, each representing the rating history of a different user. You can also apply them to any two feature vectors that describe users or items. In all cases, the higher the return value, the more similar the two Series are. You may need to add checks for edge cases such as division by zero.

  • Euclidean 'similarity'
$$ sim(x,y) = \frac{1}{1 + \sqrt{\sum (x - y)^2}}$$
  • Cosine similarity
$$ sim(x,y) = \frac{x \cdot y}{\sqrt{(x \cdot x)\,(y \cdot y)}} $$
  • Pearson correlation
$$ sim(x,y) = \frac{(x - \bar x) \cdot (y - \bar y)}{\sqrt{\big((x - \bar x) \cdot (x - \bar x)\big)\big((y - \bar y) \cdot (y - \bar y)\big)}} $$

In [26]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

In [27]:
# test the euclidean similarity function (it also works on plain numpy arrays)
euclidean(np.array([1, 2, 3]), np.array([1, 2, 5]))


Out[27]:
0.33333333333333331

In [28]:
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

In [29]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
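
Before using them, a quick comparison on toy vectors (illustrative). Pearson correlation is just cosine similarity applied to mean-centered vectors, so two proportional vectors score 1.0 under both:

In [ ]:
s1 = pd.Series([1., 2., 3.])
s2 = pd.Series([2., 4., 6.])   # same direction, twice the magnitude
print(cosine(s1, s2))    # 1.0
print(pearson(s1, s2))   # 1.0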

In [30]:
user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')

In [31]:
class Reco5:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.all_user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty: return 3.0
        ratings_by_others.set_index('user_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)

In [32]:
reco = Reco5()
reco.learn()
print 'RMSE for reco5: %s' % evaluate(reco.estimate)


RMSE for reco5: 1.09772699942
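
To close the loop with the goal stated at the beginning, we can turn rating estimates into actual recommendations by ranking a user's unseen movies by estimated rating. A minimal sketch using the `reco` object above (the helper function and choice of user are illustrative, and scoring every candidate this way is slow):

In [ ]:
def recommend_top_n(reco, user_id, n=5):
    """ Rank the movies this user has not rated by estimated rating. """
    seen = set(movielens_train.ix[movielens_train.user_id == user_id, 'movie_id'])
    candidates = [m for m in movielens_train.movie_id.unique() if m not in seen]
    ranked = sorted(candidates, key=lambda m: reco.estimate(user_id, m), reverse=True)
    return ranked[:n]

# pick an arbitrary user from the training set and show their top 5 movie_ids
some_user = movielens_train.user_id.values[0]
print(recommend_top_n(reco, some_user))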
