Building a Recommendation Engine

The Recommendation Problem

Recommenders have been around since at least 1992. Today we see many flavours of recommender deployed across different verticals:

  • Amazon
  • Netflix
  • Facebook
  • Last.fm

What exactly do they do?

Definitions from the literature

  • In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997

  • Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al., 1992

  • In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005

  • Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012


Notation

  • $U$ is the set of users in our domain. Its size is $|U|$.
  • $I$ is the set of items in our domain. Its size is $|I|$.
  • $I(u)$ is the set of items that user $u$ has rated.
  • $-I(u)$ is the complement of $I(u)$, i.e., the set of items not yet seen by user $u$.
  • $U(i)$ is the set of users that have rated item $i$.
  • $-U(i)$ is the complement of $U(i)$.
  • $S(u,i)$ is a function that measures the utility of item $i$ for user $u$.

Goal of a recommendation system

$$ i^{*} = \mathop{\rm argmax}\limits_{i \in -I(u)} S(u,i), \quad \forall u \in U $$
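
Expressed in code, the goal is a per-user search for the best-scoring unseen item. Here is a minimal sketch, assuming some scoring function `S(u, i)` is available (building a good `S` is what the rest of this notebook is about); the toy items and utility function below are purely illustrative.

In [ ]:
def recommend(u, S, all_items, rated_by_u):
    """ Return the unseen item with the highest estimated utility for user u. """
    unseen = all_items - rated_by_u             # this is -I(u)
    return max(unseen, key=lambda i: S(u, i))   # argmax of S(u, i) over -I(u)

# toy example with a made-up utility function; prints 'm_22'
print(recommend('u_1', lambda u, i: len(i), {'m_1', 'm_22', 'm_333'}, {'m_333'}))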

Problem statement

The recommendation problem in its most basic form is quite simple to define:

| user_id \ movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|--------------------+-----+-----+-----+-----+-----|
| u_1                | ?   | ?   | 4   | ?   | 1   |
| u_2                | 3   | ?   | ?   | 2   | 2   |
| u_3                | 3   | ?   | ?   | ?   | ?   |
| u_4                | ?   | 1   | 2   | 1   | 1   |
| u_5                | ?   | ?   | ?   | ?   | ?   |
| u_6                | 2   | ?   | 2   | ?   | ?   |
| u_7                | ?   | ?   | ?   | ?   | ?   |
| u_8                | 3   | 1   | 5   | ?   | ?   |
| u_9                | ?   | ?   | ?   | ?   | 2   |

Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.
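
Concretely, the toy matrix above maps naturally onto a pandas DataFrame, with `NaN` standing in for the missing ratings. An illustrative sketch showing the first three users:

In [ ]:
import numpy as np
import pandas as pd

# the first three rows of the toy matrix, with NaN for unknown ratings
toy = pd.DataFrame([[np.nan, np.nan, 4, np.nan, 1],
                    [3, np.nan, np.nan, 2, 2],
                    [3, np.nan, np.nan, np.nan, np.nan]],
                   index=['u_1', 'u_2', 'u_3'],
                   columns=['m_1', 'm_2', 'm_3', 'm_4', 'm_5'])
print(toy)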

Dataset

MovieLens from GroupLens Research: grouplens.org

Datasets are available at http://grouplens.org/datasets/movielens/

We will be using the MovieLens 1M data set, which contains 1 million ratings collected from 6,000 users on 4,000 movies.


In [26]:
import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 10)

In [27]:
users = pd.read_table('../data/movie-lens-1m/users.dat',
                      sep='::', header=None, 
                      names=['user_id', 'gender', 'age', 'occupation', 'zip'], engine='python')

ratings = pd.read_table('../data/movie-lens-1m/ratings.dat',
                        sep='::', header=None, 
                        names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')

movies = pd.read_table('../data/movie-lens-1m/movies.dat',
                       sep='::', header=None, 
                       names=['movie_id', 'title', 'genres'], engine='python')

# show how one of them looks
ratings.head(5)



In [3]:
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens.head(5)


Out[3]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1 1193 5 978300760 F 1 10 48067 One Flew Over the Cuckoo's Nest (1975) Drama
1 2 1193 5 978298413 M 56 16 70072 One Flew Over the Cuckoo's Nest (1975) Drama
2 12 1193 4 978220179 M 25 12 32793 One Flew Over the Cuckoo's Nest (1975) Drama
3 15 1193 4 978199279 M 25 7 22903 One Flew Over the Cuckoo's Nest (1975) Drama
4 17 1193 5 978158471 M 50 1 95350 One Flew Over the Cuckoo's Nest (1975) Drama

In [4]:
movielens.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
user_id       1000209 non-null int64
movie_id      1000209 non-null int64
rating        1000209 non-null int64
timestamp     1000209 non-null int64
gender        1000209 non-null object
age           1000209 non-null int64
occupation    1000209 non-null int64
zip           1000209 non-null object
title         1000209 non-null object
genres        1000209 non-null object
dtypes: int64(6), object(4)
memory usage: 83.9+ MB

Evaluation

Before we attempt to express the basic equations for content-based or collaborative filtering, we need a basic mechanism to evaluate the performance of our engine.

Evaluation: split ratings into train and test sets

This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:

  • take a smaller sample from the full 1M dataset for speed reasons;
  • make sure that we have at least 2 ratings per user in that subset;
  • split the result into training and testing sets.

In [5]:
# let's work with a smaller subset for speed reasons
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print movielens.shape
print movielens.user_id.nunique()
print movielens.movie_id.nunique()


(10000, 10)
3659
2244

In [6]:
user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
movielens = movielens[user_ids_larger_1[movielens.user_id].values]
print movielens.shape
np.all(movielens.user_id.value_counts() > 1)


(8489, 10)
Out[6]:
True

In [7]:
def assign_to_set(df):
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.ix[sampled_ids, 'for_testing'] = True
    return df

movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print movielens_train.shape
print movielens_test.shape


(5842, 11)
(2647, 11)

In [8]:
movielens_train.shape


Out[8]:
(5842, 11)

In [9]:
movielens_test.shape


Out[9]:
(2647, 11)

In [10]:
movielens_train.to_csv('../data/movielens_train.csv', index=False)
movielens_test.to_csv('../data/movielens_test.csv', index=False)

In [13]:
movielens_train = pd.read_csv('../data/movielens_train.csv')
movielens_test = pd.read_csv('../data/movielens_test.csv')

In [14]:
movielens_train.columns


Out[14]:
Index([u'user_id', u'movie_id', u'rating', u'timestamp', u'gender', u'age', u'occupation', u'zip', u'title', u'genres'], dtype='object')

In [15]:
movielens_train.head()


Out[15]:
user_id movie_id rating timestamp gender age occupation zip title genres
0 1793 1641 4 974703937 M 25 4 20008 Full Monty, The (1997) Comedy
1 73 70 5 977865460 M 18 4 53706 From Dusk Till Dawn (1996) Action|Comedy|Crime|Horror|Thriller
2 4725 805 3 963382233 M 35 5 96707-1321 Time to Kill, A (1996) Drama
3 2288 3869 2 974522733 M 1 10 13066 Naked Gun 2 1/2: The Smell of Fear, The (1991) Comedy
4 5630 2599 4 980383876 M 35 17 06854 Election (1999) Comedy

Evaluation: performance criterion

  • RMSE: $\sqrt{\frac{\sum_i (\hat y_i - y_i)^2}{n}}$

Evaluation: the 'evaluate' method


In [16]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
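
A quick sanity check on toy arrays (illustrative): predictions that are off by exactly one everywhere should give an RMSE of exactly 1.

In [ ]:
print(compute_rmse(np.array([3., 4., 5.]), np.array([2., 3., 4.])))  # -> 1.0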

In [17]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)

Minimal reco engine v1.0: simple mean ratings

Content-based filtering using mean ratings

With this table-like representation of the ratings data, a basic content-based filter becomes a one-liner function.


In [18]:
def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return movielens_train.ix[movielens_train.user_id == user_id, 'rating'].mean()

print 'RMSE for estimate1: %s' % evaluate(estimate1)


RMSE for estimate1: 1.27463489115

Collaborative filtering using mean ratings


In [19]:
def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: 
        return 3.0
    return ratings_by_others.rating.mean()

print 'RMSE for estimate2: %s' % evaluate(estimate2)


RMSE for estimate2: 1.14835537277

In [20]:
print movielens_train.groupby('gender')['rating'].mean()


gender
F         3.588112
M         3.541770
Name: rating, dtype: float64

In [21]:
print movielens_train.groupby(['gender', 'age'])['rating'].mean()


gender  age
F       1      3.625000
        18     3.347222
        25     3.587912
...
M       45     3.492877
        50     3.723032
        56     3.688889
Name: rating, Length: 14, dtype: float64

In [22]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean()
print by_gender_title


gender  title                            
F       'burbs, The (1989)                   1.0
        10 Things I Hate About You (1999)    3.5
        101 Dalmatians (1961)                3.0
...
M       Young Poisoner's Handbook, The (1995)    3.666667
        Your Friends and Neighbors (1998)        2.500000
        Zero Effect (1998)                       4.000000
Name: rating, Length: 2662, dtype: float64

In [23]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean().unstack('gender')
by_gender_title.head(10)


Out[23]:
gender F M
title
'burbs, The (1989) 1.0 2.666667
...And Justice for All (1979) NaN 3.000000
10 Things I Hate About You (1999) 3.5 3.000000
101 Dalmatians (1961) 3.0 4.000000
101 Dalmatians (1996) NaN 2.000000
12 Angry Men (1957) 4.5 4.250000
13th Warrior, The (1999) NaN 2.600000
2 Days in the Valley (1996) 4.0 NaN
20,000 Leagues Under the Sea (1954) 4.0 3.600000
2001: A Space Odyssey (1968) 5.0 4.363636

In [24]:
by_gender_title = movielens_train.pivot_table(values='rating', index='title', columns='gender')
by_gender_title.head(10)


Out[24]:
gender F M
title
'burbs, The (1989) 1.0 2.666667
...And Justice for All (1979) NaN 3.000000
10 Things I Hate About You (1999) 3.5 3.000000
101 Dalmatians (1961) 3.0 4.000000
101 Dalmatians (1996) NaN 2.000000
12 Angry Men (1957) 4.5 4.250000
13th Warrior, The (1999) NaN 2.600000
2 Days in the Valley (1996) 4.0 NaN
20,000 Leagues Under the Sea (1954) 4.0 3.600000
2001: A Space Odyssey (1968) 5.0 4.363636

Minimal reco engine v1.1: implicit sim functions

We're going to need a user index from the users portion of the dataset. This will let us retrieve information for a specific user_id more conveniently:


In [25]:
user_info = users.set_index('user_id')
user_info.head(5)



With this in hand, we can look up the gender of a particular user_id like so:


In [21]:
user_id = 3
user_info.ix[user_id, 'gender']


Out[21]:
'M'

Collaborative filtering by groups

Using the pandas aggregation framework we will build a collaborative filter that estimates ratings using an implicit sim(u,u') function to compare different users.


In [22]:
def estimate3(user_id, movie_id):
    """ Collaborative filtering using an implicit sim(u,u'). """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty: 
        return 3.0
    means_by_gender = ratings_by_others.pivot_table('rating', index='movie_id', columns='gender')
    user_gender = user_info.ix[user_id, 'gender']
    if user_gender in means_by_gender.columns: 
        return means_by_gender.ix[movie_id, user_gender]
    else:
        return means_by_gender.ix[movie_id].mean()

print 'RMSE for reco3: %s' % evaluate(estimate3)


RMSE for reco3: 1.19104892024

At this point it seems worthwhile to write a `learn()` method that pre-computes whatever data structures we need at estimation time.


In [23]:
class Reco3:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_gender = movielens_train.pivot_table('rating', index='movie_id', columns='gender')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same gender. """
        if movie_id not in self.means_by_gender.index: 
            return 3.0
        user_gender = user_info.ix[user_id, 'gender']
        if not np.isnan(self.means_by_gender.ix[movie_id, user_gender]):
            return self.means_by_gender.ix[movie_id, user_gender]
        else:
            return self.means_by_gender.ix[movie_id].mean()

reco = Reco3()
reco.learn()
print 'RMSE for reco3: %s' % evaluate(reco.estimate)


RMSE for reco3: 1.19104892024

In [24]:
class Reco4:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_age = movielens_train.pivot_table('rating', index='movie_id', columns='age')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same age. """
        if movie_id not in self.means_by_age.index: return 3.0
        user_age = user_info.ix[user_id, 'age']
        if not np.isnan(self.means_by_age.ix[movie_id, user_age]):
            return self.means_by_age.ix[movie_id, user_age]
        else:
            return self.means_by_age.ix[movie_id].mean()

reco = Reco4()
reco.learn()
print 'RMSE for reco4: %s' % evaluate(reco.estimate)


RMSE for reco4: 1.21790603479

In [25]:
movielens.pivot_table('rating', index='movie_id', columns='user_id')


Out[25]:
user_id 2 5 10 11 17 18 22 25 26 28 ... 6010 6011 6016 6018 6023 6025 6026 6035 6036 6037
movie_id
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3946 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3947 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3948 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3949 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3952 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

2161 rows × 2148 columns
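
Note how sparse this matrix is: 8489 ratings spread over 2161 × 2148 cells is a density of roughly 0.2%. A quick check (illustrative):

In [ ]:
profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')
# fraction of cells that actually hold a rating
density = float(profiles.count().sum()) / profiles.size
print('matrix density: %.4f' % density)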

A few similarity functions

These are all written to operate on two pandas Series, each representing the rating history of a different user. You can also apply them to any two feature vectors that describe users or items. In all cases, the higher the return value, the more similar the two Series are. You may need to add checks for edge cases such as division by zero.

  • Euclidean 'similarity'
$$ sim(x,y) = \frac{1}{1 + \sqrt{\sum (x - y)^2}}$$
  • Cosine similarity
$$ sim(x,y) = \frac{x \cdot y}{\sqrt{(x \cdot x)\,(y \cdot y)}} $$
  • Pearson correlation
$$ sim(x,y) = \frac{(x - \bar x) \cdot (y - \bar y)}{\sqrt{\big((x - \bar x) \cdot (x - \bar x)\big)\big((y - \bar y) \cdot (y - \bar y)\big)}} $$

In [26]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

In [27]:
# test the euclidean similarity function (it also works on plain numpy arrays)
euclidean(np.array([1, 2, 3]), np.array([1, 2, 5]))


Out[27]:
0.33333333333333331

In [28]:
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

In [29]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
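
Before using them, a quick comparison on toy vectors (illustrative). Pearson correlation is just cosine similarity applied to mean-centered vectors, so two proportional vectors score 1.0 under both:

In [ ]:
s1 = pd.Series([1., 2., 3.])
s2 = pd.Series([2., 4., 6.])   # same direction, twice the magnitude
print(cosine(s1, s2))    # 1.0
print(pearson(s1, s2))   # 1.0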

In [30]:
user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')

In [31]:
class Reco5:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.all_user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty: return 3.0
        ratings_by_others.set_index('user_id', inplace=True)
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        ratings_sims = ratings_sims[ ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)

In [32]:
reco = Reco5()
reco.learn()
print 'RMSE for reco5: %s' % evaluate(reco.estimate)


RMSE for reco5: 1.09772699942
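
To close the loop with the goal stated at the beginning, we can turn rating estimates into actual recommendations by ranking a user's unseen movies by estimated rating. A minimal sketch using the `reco` object above (the helper function and choice of user are illustrative, and scoring every candidate this way is slow):

In [ ]:
def recommend_top_n(reco, user_id, n=5):
    """ Rank the movies this user has not rated by estimated rating. """
    seen = set(movielens_train.ix[movielens_train.user_id == user_id, 'movie_id'])
    candidates = [m for m in movielens_train.movie_id.unique() if m not in seen]
    ranked = sorted(candidates, key=lambda m: reco.estimate(user_id, m), reverse=True)
    return ranked[:n]

# pick an arbitrary user from the training set and show their top 5 movie_ids
some_user = movielens_train.user_id.values[0]
print(recommend_top_n(reco, some_user))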
