Recommenders have been around since at least 1992. Today we see different flavours of recommenders, deployed across different verticals.
What exactly do they do?
In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al, 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012
The recommendation problem in its most basic form is quite simple to define:
|-------------------+-----+-----+-----+-----+-----|
| user_id, movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|-------------------+-----+-----+-----+-----+-----|
| u_1 | ? | ? | 4 | ? | 1 |
|-------------------+-----+-----+-----+-----+-----|
| u_2 | 3 | ? | ? | 2 | 2 |
|-------------------+-----+-----+-----+-----+-----|
| u_3 | 3 | ? | ? | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_4 | ? | 1 | 2 | 1 | 1 |
|-------------------+-----+-----+-----+-----+-----|
| u_5 | ? | ? | ? | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_6 | 2 | ? | 2 | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_7 | ? | ? | ? | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_8 | 3 | 1 | 5 | ? | ? |
|-------------------+-----+-----+-----+-----+-----|
| u_9 | ? | ? | ? | ? | 2 |
|-------------------+-----+-----+-----+-----+-----|
Given a partially filled matrix of ratings ($|U| \times |I|$), estimate the missing values.
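As a minimal sketch of this setup, the toy matrix above can be held in a NumPy array with `np.nan` standing in for each `?`. The item-mean fill used here is only a naive placeholder estimator chosen for illustration, and the names `R`, `item_means` and `R_filled` are our own assumptions rather than part of the tutorial's code:

```python
import numpy as np

nan = np.nan
# Rows are users u_1..u_9, columns are movies m_1..m_5; nan marks an unknown rating.
R = np.array([
    [nan, nan, 4,   nan, 1],    # u_1
    [3,   nan, nan, 2,   2],    # u_2
    [3,   nan, nan, nan, nan],  # u_3
    [nan, 1,   2,   1,   1],    # u_4
    [nan, nan, nan, nan, nan],  # u_5
    [2,   nan, 2,   nan, nan],  # u_6
    [nan, nan, nan, nan, nan],  # u_7
    [3,   1,   5,   nan, nan],  # u_8
    [nan, nan, nan, nan, 2],    # u_9
], dtype=float)

known = ~np.isnan(R)                      # True where a rating exists
n_known = int(known.sum())
sparsity = 1.0 - float(n_known) / R.size  # fraction of cells we must estimate

# Naive estimator (an assumption for illustration): fill each gap with the movie's mean.
item_means = np.nanmean(R, axis=0)
R_filled = np.where(known, R, item_means)

print('known ratings: %d of %d' % (n_known, R.size))
print(item_means)
```

Only 16 of the 45 cells are known here, and real rating matrices are far sparser, which is why the choice of estimator matters.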
MovieLens from GroupLens Research: grouplens.org
Datasets are available at http://grouplens.org/datasets/movielens/
We will be using the MovieLens 1M data set, which contains approximately 1 million ratings collected from about 6000 users on about 4000 movies.
In [26]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 10)
In [27]:
users = pd.read_table('../data/movie-lens-1m/users.dat',
                      sep='::', header=None,
                      names=['user_id', 'gender', 'age', 'occupation', 'zip'], engine='python')
ratings = pd.read_table('../data/movie-lens-1m/ratings.dat',
                        sep='::', header=None,
                        names=['user_id', 'movie_id', 'rating', 'timestamp'], engine='python')
movies = pd.read_table('../data/movie-lens-1m/movies.dat',
                       sep='::', header=None,
                       names=['movie_id', 'title', 'genres'], engine='python')
# show how one of them looks
ratings.head(5)
In [3]:
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens.head(5)
Out[3]:
In [4]:
movielens.info()
This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:
In [5]:
# let's work with a smaller subset for speed reasons
movielens = movielens.loc[np.random.choice(movielens.index, size=10000, replace=False)]
print(movielens.shape)
print(movielens.user_id.nunique())
print(movielens.movie_id.nunique())
In [6]:
# keep only users with more than one rating, so each user can appear in both sets
user_ids_larger_1 = movielens.user_id.value_counts(sort=False) > 1
movielens = movielens[user_ids_larger_1[movielens.user_id].values]
print(movielens.shape)
np.all(movielens.user_id.value_counts() > 1)
Out[6]:
In [7]:
def assign_to_set(df):
    """ Flag a random 20% (rounded up) of one user's ratings for testing. """
    sampled_ids = np.random.choice(df.index,
                                   size=np.int64(np.ceil(df.index.size * 0.2)),
                                   replace=False)
    df.loc[sampled_ids, 'for_testing'] = True
    return df

movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print(movielens_train.shape)
print(movielens_test.shape)
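To see the gist of the split without the pandas `groupby`/`apply` machinery, here is a minimal sketch on a made-up frame. The data, the fixed seed and the names `toy`, `toy_train` and `toy_test` are assumptions for illustration only. For each user we hold out 20% of their ratings, rounded up, and keep the rest for training:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable

# Made-up ratings: user 1 has five ratings, user 2 has three.
toy = pd.DataFrame({
    'user_id':  [1, 1, 1, 1, 1, 2, 2, 2],
    'movie_id': [10, 11, 12, 13, 14, 10, 12, 15],
    'rating':   [4, 3, 5, 2, 4, 1, 3, 5],
})

test_idx = []
for _, group in toy.groupby('user_id'):
    # 20% of this user's ratings, rounded up, go to the test set
    n_test = int(np.ceil(len(group) * 0.2))
    test_idx.extend(rng.choice(group.index.values, size=n_test, replace=False))

toy_test = toy.loc[test_idx]
toy_train = toy.drop(test_idx)

print(toy_train.shape)
print(toy_test.shape)
```

Because every user has at least two ratings, each user ends up with at least one rating in each set, which is what the filtering step before the split guarantees for the real data.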
In [8]:
movielens_train.shape
Out[8]:
In [9]:
movielens_test.shape
Out[9]:
In [10]:
movielens_train.to_csv('../data/movielens_train.csv')
movielens_test.to_csv('../data/movielens_test.csv')
In [13]:
movielens_train = pd.read_csv('../data/movielens_train.csv')
movielens_test = pd.read_csv('../data/movielens_test.csv')
In [14]:
movielens_train.columns
Out[14]:
In [15]:
movielens_train.head()
Out[15]:
In [16]:
def compute_rmse(y_pred, y_true):
    """ Compute Root Mean Squared Error. """
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
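For reference, the quantity computed here is $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$, where $\hat{y}_i$ is the predicted rating and $y_i$ the true one. A small worked check with made-up numbers (the arrays below are assumptions, not MovieLens data):

```python
import numpy as np

def compute_rmse(y_pred, y_true):
    """Root Mean Squared Error, same definition as in the cell above."""
    return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))

y_pred = np.array([3.0, 4.0, 5.0])
y_true = np.array([2.0, 4.0, 7.0])

# errors: 1, 0, -2  ->  squares: 1, 0, 4  ->  mean: 5/3
rmse = compute_rmse(y_pred, y_true)
print(rmse)  # about 1.291
```

Note how the single error of 2 contributes four times as much as the error of 1: RMSE penalises large misses more heavily than small ones.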
In [17]:
def evaluate(estimate_f):
    """ RMSE-based predictive performance evaluation with pandas. """
    ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
    estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
    real = movielens_test.rating.values
    return compute_rmse(estimated, real)
In [18]:
def estimate1(user_id, item_id):
    """ Simple content-filtering based on mean ratings. """
    return movielens_train.loc[movielens_train.user_id == user_id, 'rating'].mean()

print('RMSE for estimate1: %s' % evaluate(estimate1))
In [19]:
def estimate2(user_id, movie_id):
    """ Simple collaborative filter based on mean ratings. """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty:
        return 3.0
    return ratings_by_others.rating.mean()

print('RMSE for estimate2: %s' % evaluate(estimate2))
In [20]:
print(movielens_train.groupby('gender')['rating'].mean())
In [21]:
print(movielens_train.groupby(['gender', 'age'])['rating'].mean())
In [22]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean()
print(by_gender_title)
In [23]:
by_gender_title = movielens_train.groupby(['gender', 'title'])['rating'].mean().unstack('gender')
by_gender_title.head(10)
Out[23]:
In [24]:
by_gender_title = movielens_train.pivot_table(values='rating', index='title', columns='gender')
by_gender_title.head(10)
Out[24]:
We're going to need a user index from the users portion of the dataset. This will allow us to retrieve information given a specific user_id in a more convenient way:
In [25]:
user_info = users.set_index('user_id')
user_info.head(5)
With this in hand, we can look up the gender of a particular user_id like so:
In [21]:
user_id = 3
user_info.loc[user_id, 'gender']
Out[21]:
In [22]:
def estimate3(user_id, movie_id):
    """ Collaborative filtering using an implicit sim(u,u'). """
    ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
    if ratings_by_others.empty:
        return 3.0
    means_by_gender = ratings_by_others.pivot_table('rating', index='movie_id', columns='gender')
    user_gender = user_info.loc[user_id, 'gender']
    if user_gender in means_by_gender.columns:
        return means_by_gender.loc[movie_id, user_gender]
    else:
        return means_by_gender.loc[movie_id].mean()

print('RMSE for estimate3: %s' % evaluate(estimate3))
At this point it seems worthwhile to write a learn method that pre-computes whatever data structures we need at estimation time.
In [23]:
class Reco3:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_gender = movielens_train.pivot_table('rating', index='movie_id', columns='gender')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same gender. """
        if movie_id not in self.means_by_gender.index:
            return 3.0
        user_gender = user_info.loc[user_id, 'gender']
        if not np.isnan(self.means_by_gender.loc[movie_id, user_gender]):
            return self.means_by_gender.loc[movie_id, user_gender]
        else:
            return self.means_by_gender.loc[movie_id].mean()

reco = Reco3()
reco.learn()
print('RMSE for reco3: %s' % evaluate(reco.estimate))
In [24]:
class Reco4:
    """ Collaborative filtering using an implicit sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        self.means_by_age = movielens_train.pivot_table('rating', index='movie_id', columns='age')

    def estimate(self, user_id, movie_id):
        """ Mean ratings by other users of the same age. """
        if movie_id not in self.means_by_age.index:
            return 3.0
        user_age = user_info.loc[user_id, 'age']
        if not np.isnan(self.means_by_age.loc[movie_id, user_age]):
            return self.means_by_age.loc[movie_id, user_age]
        else:
            return self.means_by_age.loc[movie_id].mean()

reco = Reco4()
reco.learn()
print('RMSE for reco4: %s' % evaluate(reco.estimate))
In [25]:
movielens.pivot_table('rating', index='movie_id', columns='user_id')
Out[25]:
The similarity functions below were all written to operate on two pandas Series, each representing the rating history of one user. You can also apply them to any two feature vectors that describe users or items. In all cases, the higher the return value, the more similar the two Series are. You might need to add checks for edge cases, such as division by zero.
In [26]:
def euclidean(s1, s2):
    """Take two pd.Series objects and return their euclidean 'similarity'."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))
In [27]:
# Test Euclidean Distance function
euclidean(np.array([1, 2, 3]), np.array([1, 2, 5]))
Out[27]:
In [28]:
def cosine(s1, s2):
    """Take two pd.Series objects and return their cosine similarity."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))
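To make the difference between these two measures concrete, here is a small worked check on made-up vectors. The Series `a`, `b` and `c` are assumptions for illustration, and the two functions are repeated so the block runs on its own:

```python
import numpy as np
import pandas as pd

def euclidean(s1, s2):
    """Euclidean 'similarity': 1 / (1 + distance)."""
    diff = s1 - s2
    return 1 / (1 + np.sqrt(np.sum(diff ** 2)))

def cosine(s1, s2):
    """Cosine of the angle between the two vectors."""
    return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))

a = pd.Series([1.0, 2.0, 3.0])
b = pd.Series([1.0, 2.0, 5.0])  # differs from a only in the last rating
c = pd.Series([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

eu_ab = euclidean(a, b)  # distance is 2, so the similarity is 1 / (1 + 2)
eu_ac = euclidean(a, c)  # distance is sqrt(14), so the similarity is lower
cos_ac = cosine(a, c)    # parallel vectors, so the cosine is 1

print(eu_ab, eu_ac, cos_ac)
```

Cosine similarity ignores magnitude, so a user who rates everything twice as high as another still counts as a perfect match, whereas the Euclidean measure treats them as fairly far apart.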
In [29]:
def pearson(s1, s2):
    """Take two pd.Series objects and return a pearson correlation."""
    s1_c = s1 - s1.mean()
    s2_c = s2 - s2.mean()
    return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
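The edge cases mentioned earlier can be handled in several ways. The sketch below is one possibility, not the tutorial's own method: `pearson_safe` and its `min_overlap` parameter are hypothetical names, and unlike `pearson` above it restricts the computation to items both users have rated. It returns 0.0 whenever the correlation is undefined (too few co-rated items, or a user whose co-rated ratings are all identical):

```python
import numpy as np
import pandas as pd

def pearson_safe(s1, s2, min_overlap=2):
    """Pearson correlation over co-rated items; 0.0 when it is undefined.

    Hypothetical variant of `pearson`; `min_overlap` is an assumed parameter.
    """
    # Labels for which both Series hold an actual (non-NaN) rating
    common = s1.dropna().index.intersection(s2.dropna().index)
    if len(common) < min_overlap:
        return 0.0
    s1_c = s1.loc[common] - s1.loc[common].mean()
    s2_c = s2.loc[common] - s2.loc[common].mean()
    denom = np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
    if denom == 0:
        return 0.0  # at least one user has zero variance on the overlap
    return float(np.sum(s1_c * s2_c) / denom)

# Made-up rating histories indexed by movie label
u = pd.Series({'m1': 1.0, 'm2': 2.0, 'm3': 3.0, 'm4': np.nan})
v = pd.Series({'m1': 2.0, 'm2': 4.0, 'm3': 6.0, 'm4': 5.0})
w = pd.Series({'m1': 3.0, 'm2': 3.0, 'm3': 3.0, 'm4': 1.0})
x = pd.Series({'m1': np.nan, 'm2': np.nan, 'm3': np.nan, 'm4': 4.0})

sim_uv = pearson_safe(u, v)  # perfectly correlated on m1..m3
sim_uw = pearson_safe(u, w)  # w is constant on the overlap
sim_ux = pearson_safe(u, x)  # no co-rated items
print(sim_uv, sim_uw, sim_ux)
```

Returning 0.0 for the undefined cases is a design choice: such neighbours then carry no weight in a similarity-weighted average instead of propagating `NaN` into the estimate.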
In [30]:
user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')
In [31]:
class Reco5:
    """ Collaborative filtering using a custom sim(u,u'). """

    def learn(self):
        """ Prepare datastructures for estimation. """
        # build profiles from the training set only, so that test ratings
        # do not leak into the similarities
        self.all_user_profiles = movielens_train.pivot_table('rating', index='movie_id', columns='user_id')

    def estimate(self, user_id, movie_id):
        """ Ratings weighted by correlation similarity. """
        ratings_by_others = movielens_train[movielens_train.movie_id == movie_id]
        if ratings_by_others.empty:
            return 3.0
        # set_index returns a new frame; avoids modifying a slice in place
        ratings_by_others = ratings_by_others.set_index('user_id')
        their_ids = ratings_by_others.index
        their_ratings = ratings_by_others.rating
        their_profiles = self.all_user_profiles[their_ids]
        user_profile = self.all_user_profiles[user_id]
        sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
        ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
        # keep only positively correlated neighbours
        ratings_sims = ratings_sims[ratings_sims.sim > 0]
        if ratings_sims.empty:
            return their_ratings.mean()
        else:
            return np.average(ratings_sims.rating, weights=ratings_sims.sim)
In [32]:
reco = Reco5()
reco.learn()
print('RMSE for reco5: %s' % evaluate(reco.estimate))
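The last step of `Reco5.estimate` is a similarity-weighted average over the positively correlated neighbours. A minimal worked sketch with made-up numbers (the neighbour ids, ratings and similarities below are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Three made-up neighbours who rated the target movie
ratings_sims = pd.DataFrame({
    'rating': [4.0, 2.0, 5.0],
    'sim':    [0.5, -0.2, 1.0],
}, index=[101, 102, 103])

# Drop neighbours with non-positive similarity, as in Reco5.estimate
positive = ratings_sims[ratings_sims.sim > 0]

# (4.0 * 0.5 + 5.0 * 1.0) / (0.5 + 1.0) = 7.0 / 1.5
estimate = np.average(positive.rating, weights=positive.sim)
print(estimate)  # about 4.667
```

The estimate lands closer to 5 than to 4 because the neighbour who gave the 5 is twice as similar to the target user.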