Credit where credit is due. Please check the references at the bottom of this document.
MovieLens from GroupLens Research: grouplens.org
The MovieLens 1M data set contains 1 million ratings collected from 6000 users on 4000 movies.
The goal of this tutorial is to provide you with a hands-on overview of two of the main libraries from the scientific and data analysis communities. We're going to use NumPy and pandas.
What exactly are we going to do? Here's a high-level overview:
Recommenders have been around since at least 1992. Today we see different flavours of recommenders deployed across many different verticals, such as e-commerce, streaming media, news, and social networks.
What exactly do they do?
In a typical recommender system people provide recommendations as inputs, which the system then aggregates and directs to appropriate recipients. -- Resnick and Varian, 1997
Collaborative filtering simply means that people collaborate to help one another perform filtering by recording their reactions to documents they read. -- Goldberg et al., 1992
In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for the items that have not been seen by a user. Intuitively, this estimation is usually based on the ratings given by this user to other items and on some other information [...] Once we can estimate ratings for the yet unrated items, we can recommend to the user the item(s) with the highest estimated rating(s). -- Adomavicius and Tuzhilin, 2005
Driven by computer algorithms, recommenders help consumers by selecting products they will probably like and might buy based on their browsing, searches, purchases, and preferences. -- Konstan and Riedl, 2012
The recommendation problem in its most basic form is quite simple to define:
|--------------------+-----+-----+-----+-----+-----|
| user_id \ movie_id | m_1 | m_2 | m_3 | m_4 | m_5 |
|--------------------+-----+-----+-----+-----+-----|
| u_1                |  ?  |  ?  |  4  |  ?  |  1  |
| u_2                |  3  |  ?  |  ?  |  2  |  2  |
| u_3                |  3  |  ?  |  ?  |  ?  |  ?  |
| u_4                |  ?  |  1  |  2  |  1  |  1  |
| u_5                |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_6                |  2  |  ?  |  2  |  ?  |  ?  |
| u_7                |  ?  |  ?  |  ?  |  ?  |  ?  |
| u_8                |  3  |  1  |  5  |  ?  |  ?  |
| u_9                |  ?  |  ?  |  ?  |  ?  |  2  |
|--------------------+-----+-----+-----+-----+-----|
Given a partially filled $|U| \times |I|$ matrix of ratings, estimate the missing values.
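To make this concrete, here is a minimal sketch, using our own toy data (not the MovieLens set) and pandas, which is introduced just below: the matrix above expressed as a DataFrame, with the missing entries filled by a naive per-movie mean.
In [ ]:
import numpy as np
import pandas as pd
# our own toy example mirroring the table above; np.nan marks the unknown
# entries a recommender has to estimate
nan = np.nan
toy_ratings = pd.DataFrame(
    [[nan, nan, 4.0, nan, 1.0],
     [3.0, nan, nan, 2.0, 2.0],
     [3.0, nan, nan, nan, nan],
     [nan, 1.0, 2.0, 1.0, 1.0],
     [nan, nan, nan, nan, nan],
     [2.0, nan, 2.0, nan, nan],
     [nan, nan, nan, nan, nan],
     [3.0, 1.0, 5.0, nan, nan],
     [nan, nan, nan, nan, 2.0]],
    index=['u_%d' % i for i in range(1, 10)],
    columns=['m_%d' % j for j in range(1, 6)])
# one naive estimate of the missing values: each movie's mean rating
print(toy_ratings.fillna(toy_ratings.mean()))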
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain-specific language like R.
The heart of pandas is the DataFrame object for data manipulation: a two-dimensional, labeled data structure with fast indexing, alignment, and reshaping operations.
In [10]:
import numpy as np
import pandas as pd
# set some print options
np.set_printoptions(precision=4)
np.set_printoptions(threshold=5)
np.set_printoptions(suppress=True)
pd.set_option('precision', 3, 'notebook_repr_html', True)
# init random gen
np.random.seed(2)
In [11]:
ser = pd.Series([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.06, 8.0])
ser
Out[11]:
In [12]:
values = np.array([2.0, 1.0, 5.0, 0.97, 3.0, 10.0, 0.0599, 8.0])
labels = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
ser = pd.Series(data=values, index=labels)
ser
Out[12]:
In [13]:
movie_rating = {
'age': 1,
'gender': 'F',
'genres': 'Drama',
'movie_id': 1193,
'occupation': 10,
'rating': 5,
'timestamp': 978300760,
'title': "One Flew Over the Cuckoo's Nest (1975)",
'user_id': 1,
'zip': '48067'
}
ser = pd.Series(movie_rating)
print (ser)
In [14]:
ser.index
Out[14]:
In [15]:
ser.values
Out[15]:
In [16]:
ser.loc['gender']
Out[16]:
In [17]:
ser.loc[['gender', 'zip']]
Out[17]:
In [18]:
booleans = np.array([False, False, False, False, False, True, False, False, False, False])
booleans
Out[18]:
In [19]:
ser.loc[booleans]
Out[19]:
In [20]:
ser.iloc[1]
Out[20]:
In [21]:
ser.iloc[[1,2]]
Out[21]:
In [22]:
ser.ix['gender']
Out[22]:
In [23]:
ser.ix[1]
Out[23]:
In [24]:
ser['gender']
Out[24]:
In [25]:
ser[[1,2]]
Out[25]:
In [26]:
ser_1 = pd.Series(data=[1,3,4], index=['A', 'B', 'C'])
ser_2 = pd.Series(data=[5,5,5], index=['A', 'G', 'C'])
print (ser_1 + ser_2)
Automatic upcasting when performing operations between Series with different dtypes:
In [27]:
ser_1 = pd.Series(data=[1,3,4], index=['A', 'B', 'C'], dtype='int')
ser_2 = pd.Series(data=[5,5,5], index=['A', 'G', 'C'], dtype='float')
ser_1 + ser_2
Out[27]:
In [28]:
# build from a dict of equal-length lists or ndarrays
pd.DataFrame({'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]})
Out[28]:
You can explicitly set the column names and index values as well.
In [29]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'])
Out[29]:
In [30]:
pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'],
index=['obs1', 'obs2', 'obs3', 'obs4'])
Out[30]:
You can also think of it as a dictionary of Series objects.
In [31]:
movie_rating = {
'gender': 'F',
'genres': 'Drama',
'movie_id': 1193,
'rating': 5,
'timestamp': 978300760,
'user_id': 1,
}
ser_1 = pd.Series(movie_rating)
ser_2 = pd.Series(movie_rating)
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.columns.name = 'rating_events'
df.index.name = 'rating_data'
df
Out[31]:
In [32]:
df = df.T
df
Out[32]:
In [33]:
df.columns
Out[33]:
In [34]:
df.index
Out[34]:
In [35]:
df.values
Out[35]:
In [36]:
df = pd.DataFrame({'r_1': ser_1, 'r_2': ser_2})
df.drop('genres', axis=0)
Out[36]:
In [37]:
df.drop('r_1', axis=1)
Out[37]:
Add a new column using dictionary notation:
In [38]:
# careful with the order here
df['r_3'] = ['F', 'Drama', 1193, 5, 978300760, 1]
df
Out[38]:
In [39]:
df['r_4'] = pd.Series({'gender': 'M'})
df
Out[39]:
--> Go to "Pandas questions: Series and DataFrames"
You can index into a column using its label, or with dot notation:
In [40]:
df = pd.DataFrame(data={'col_1': [0.12, 7, 45, 10], 'col_2': [0.9, 9, 34, 11]},
columns=['col_1', 'col_2', 'col_3'],
index=['obs1', 'obs2', 'obs3', 'obs4'])
df
Out[40]:
In [41]:
df['col_1']
Out[41]:
In [42]:
df.col_1
Out[42]:
You can also pass a list of column labels to select a subset of the columns:
In [43]:
df[['col_2', 'col_1']]
Out[43]:
DataFrame has similar .loc and .iloc methods:
In [44]:
df.loc['obs1', 'col_1']
Out[44]:
In [45]:
df.iloc[0, 0]
Out[45]:
The .ix method gives you the most flexibility to index into certain rows, or even rows and columns:
In [46]:
df.ix['obs3']
Out[46]:
In [47]:
df.ix[0]
Out[47]:
In [48]:
df.ix[:2]
Out[48]:
In [49]:
df.ix[:2, 'col_2']
Out[49]:
In [50]:
df.ix[:2, ['col_1', 'col_2']]
Out[50]:
--> Go to "Pandas questions: Indexing"
In [51]:
users = pd.read_table('data/ml-1m/users.dat',
sep='::', header=None,
names=['user_id', 'gender', 'age', 'occupation', 'zip'])
ratings = pd.read_table('data/ml-1m/ratings.dat',
sep='::', header=None,
names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies = pd.read_table('data/ml-1m/movies.dat',
sep='::', header=None,
names=['movie_id', 'title', 'genres'])
# show how one of them looks
ratings.head(5)
Out[51]:
Using pd.merge we get it all into one big DataFrame.
In [52]:
movielens = pd.merge(pd.merge(ratings, users), movies)
movielens.head()
Out[52]:
This subsection will generate training and testing sets for evaluation. You do not need to understand every single line of code, just the general gist:
In [53]:
# let's work with a smaller subset for speed reasons
movielens = movielens.ix[np.random.choice(movielens.index, size=10000, replace=False)]
print (movielens.shape)
print (movielens.user_id.nunique())
print (movielens.movie_id.nunique())
In [54]:
user_ids_larger_1 = pd.value_counts(movielens.user_id, sort=False) > 1
user_ids_larger_1 = user_ids_larger_1[user_ids_larger_1].index
movielens = movielens[movielens.user_id.isin(user_ids_larger_1)]
print (movielens.shape)
assert np.all(movielens.user_id.value_counts() > 1)
We now generate train and test subsets by marking 20% of each user's ratings, using groupby and apply.
In [55]:
def assign_to_set(df):
sampled_ids = np.random.choice(df.index,
size=np.int64(np.ceil(df.index.size * 0.2)),
replace=False)
df.ix[sampled_ids, 'for_testing'] = True
return df
movielens['for_testing'] = False
grouped = movielens.groupby('user_id', group_keys=False).apply(assign_to_set)
movielens_train = movielens[grouped.for_testing == False]
movielens_test = movielens[grouped.for_testing == True]
print (movielens.shape)
print (movielens_train.shape)
print (movielens_test.shape)
assert len(movielens_train.index & movielens_test.index) == 0
Store these two sets in text files:
In [56]:
movielens_train.to_csv('data/my_generated_movielens_train.csv')
movielens_test.to_csv('data/my_generated_movielens_test.csv')
In [57]:
def compute_rmse(y_pred, y_true):
""" Compute Root Mean Squared Error. """
return np.sqrt(np.mean(np.power(y_pred - y_true, 2)))
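A quick sanity check on made-up arrays: predictions that are each off by exactly 1 should yield an RMSE of exactly 1.0.
In [ ]:
# sanity check on toy arrays: every prediction is off by exactly 1
print(compute_rmse(np.array([3.0, 3.0, 3.0]), np.array([2.0, 4.0, 2.0])))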
In [58]:
def evaluate(estimate_f):
""" RMSE-based predictive performance evaluation with pandas. """
ids_to_estimate = zip(movielens_test.user_id, movielens_test.movie_id)
estimated = np.array([estimate_f(u,i) for (u,i) in ids_to_estimate])
real = movielens_test.rating.values
return compute_rmse(estimated, real)
In [59]:
def my_estimate_function(user_id, movie_id):
return 3
In [60]:
print ('RMSE for my estimate function: %s' % evaluate(my_estimate_function))
--> Go to "Mini Challenge prep: data loading & evaluation functions"
Recommend based on the user's rating history.
Generic expression (notice how this is kind of a 'row-based' approach):
$$ \newcommand{\aggr}{\mathop{\rm aggr}\nolimits} r_{u,i} = \aggr_{i' \in I(u)} [r_{u,i'}] $$

A simple example using the mean as the aggregation function:

$$ r_{u,i} = \bar r_u = \frac{\sum_{i' \in I(u)} r_{u,i'}}{|I(u)|} $$

The same idea applied 'column-wise', aggregating instead over the users who have rated the item:

$$ r_{u,i} = \bar r_i = \frac{\sum_{u' \in U(i)} r_{u',i}}{|U(i)|} $$

The literature has lots of examples of systems that try to combine the strengths of the two main approaches. This can be done in a number of ways, for example by combining the two sets of predictions, or by incorporating content information directly into a collaborative model.
Content-based techniques are limited by the amount of metadata available to describe an item. There are domains in which feature extraction is expensive or time-consuming, e.g., processing multimedia data such as graphics or audio/video streams. In the context of grocery items, for example, item information is often only partial or missing entirely.
A user has to have rated a sufficient number of items before a recommender system can have a good idea of what their preferences are. In a content-based system, the aggregation function needs ratings to aggregate.
Collaborative filters rely on an item being rated by many users to compute aggregates of those ratings. Think of this as the exact counterpart of the new user problem for content-based systems.
When looking at the more general versions of content-based and collaborative systems, the success of the recommender system depends on the availability of a critical mass of user/item interactions. We get a first glance at the data sparsity problem by quantifying the ratio of existing ratings to $|U| \times |I|$. A highly sparse matrix of interactions makes it difficult to compute similarities between users and items. As an example, for a user whose tastes are unusual compared to the rest of the population, there will not be any other users who are particularly similar, leading to poor recommendations.
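As a rough sketch of that ratio for the movielens subsample built above, the density of observed ratings is simply the number of ratings divided by $|U| \times |I|$:
In [ ]:
# rough sketch: density of observed ratings in our subsample
n_users = movielens.user_id.nunique()
n_movies = movielens.movie_id.nunique()
density = len(movielens) / float(n_users * n_movies)
print('rating density: %.4f (sparsity: %.4f)' % (density, 1 - density))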
In [61]:
def content_mean(user_id, movie_id):
""" Simple content-filtering based on mean ratings. """
user_condition = movielens_train.user_id == user_id
return movielens_train.loc[user_condition, 'rating'].mean()
print ('RMSE for content_mean: %s' % evaluate(content_mean))
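For comparison, here is a minimal sketch of the item-mean aggregate from the 'column-wise' formula above; the name collab_mean is ours and is not part of the original notebook.
In [ ]:
def collab_mean(user_id, movie_id):
    """ Simple collaborative estimate based on the item's mean rating. """
    movie_condition = movielens_train.movie_id == movie_id
    if not movie_condition.any():
        # fall back to a neutral rating for movies unseen in the training set
        return 3.0
    return movielens_train.loc[movie_condition, 'rating'].mean()
print('RMSE for collab_mean: %s' % evaluate(collab_mean))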
--> Go to "Reco systems questions: Minimal reco engine v1.0"
Possibly incorporating metadata about items, which is where the term 'content' becomes more meaningful. The estimate is a similarity-weighted aggregate over the items the user has already rated, in plain and mean-centered form:

$$ r_{u,i} = k \sum_{i' \in I(u)} sim(i, i') \; r_{u,i'} $$

$$ r_{u,i} = \bar r_u + k \sum_{i' \in I(u)} sim(i, i') \; (r_{u,i'} - \bar r_u) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{i' \in I(u)} |sim(i,i')|} $$

and $\bar r_u$ is the average rating of user $u$:

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$

The collaborative counterpart aggregates instead over the other users who have rated the item, possibly incorporating metadata about users:

$$ r_{u,i} = k \sum_{u' \in U(i)} sim(u, u') \; r_{u',i} $$

$$ r_{u,i} = \bar r_u + k \sum_{u' \in U(i)} sim(u, u') \; (r_{u',i} - \bar r_u) $$

Here $k$ is a normalizing factor,

$$ k = \frac{1}{\sum_{u' \in U(i)} |sim(u,u')|} $$

and $\bar r_u$ is again the average rating of user $u$:

$$ \bar r_u = \frac{\sum_{i \in I(u)} r_{u,i}}{|I(u)|} $$
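Before wiring this into the data, here is a minimal sketch with made-up numbers showing that the normalized weighted sum above (with $k = 1 / \sum |sim|$ and non-negative similarities) boils down to np.average with the similarities as weights; the neighbour ids and values are hypothetical.
In [ ]:
# hypothetical similarities sim(u, u') and the neighbours' ratings r_{u',i}
sims = pd.Series([0.9, 0.4, 0.1], index=['u_2', 'u_5', 'u_9'])
neighbour_ratings = pd.Series([5.0, 3.0, 1.0], index=['u_2', 'u_5', 'u_9'])
# k * sum(sim * r) with k = 1 / sum(|sim|) is a similarity-weighted average
print(np.average(neighbour_ratings, weights=sims))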
In [62]:
movielens_train.groupby('gender')['rating'].mean()
Out[62]:
In [63]:
movielens_train.groupby(['gender', 'age'])['rating'].mean()
Out[63]:
In [64]:
# transform the ratings frame into a ratings matrix
ratings_mtx_df = movielens_train.pivot_table(values='rating',
index='user_id',
columns='movie_id')
ratings_mtx_df.head(3)
Out[64]:
In [65]:
# grab a sub-square of the ratings matrix to actually display some real entries!
ratings_mtx_df.loc[11:16, 1196:1200]
Out[65]:
The more interesting case with pivot_table is as an interface to groupby:
In [66]:
movielens_train.pivot_table(values='rating', index='age', columns='gender', aggfunc='mean')
Out[66]:
You can pass in a list of functions, such as [np.mean, np.std], to compute mean ratings and a measure of disagreement.
In [67]:
movielens_train.pivot_table(values='rating', index='age', columns='gender', aggfunc=[np.mean, np.std])
Out[67]:
We're going to need a user index from the users portion of the dataset. This will allow us to retrieve information given a specific user_id in a more convenient way:
In [68]:
user_info = users.set_index('user_id')
user_info.head(5)
Out[68]:
With this in hand, we can now look up the gender of a particular user_id like so:
In [69]:
user_id = 3
user_info.loc[user_id, 'gender']
Out[69]:
In [70]:
def collab_gender(user_id, movie_id):
""" Collaborative filtering using an implicit sim(u,u') based on gender. """
user_condition = movielens_train.user_id != user_id
movie_condition = movielens_train.movie_id == movie_id
ratings_by_others = movielens_train.loc[user_condition & movie_condition]
if ratings_by_others.empty:
return 3.0
means_by_gender = ratings_by_others.pivot_table('rating', index='movie_id', columns='gender')
user_gender = user_info.ix[user_id, 'gender']
if user_gender in means_by_gender.columns:
return means_by_gender.ix[movie_id, user_gender]
else:
return means_by_gender.ix[movie_id].mean()
print ('RMSE for collab_gender: %s' % evaluate(collab_gender))
At this point it seems worthwhile to write a learn function to pre-compute whatever data structures we need at estimation time.
In [71]:
class CollabGenderReco:
""" Collaborative filtering using an implicit sim(u,u'). """
def learn(self):
""" Prepare datastructures for estimation. """
self.means_by_gender = movielens_train.pivot_table('rating', index='movie_id', columns='gender')
def estimate(self, user_id, movie_id):
""" Mean ratings by other users of the same gender. """
if movie_id not in self.means_by_gender.index:
return 3.0
user_gender = user_info.ix[user_id, 'gender']
if ~np.isnan(self.means_by_gender.ix[movie_id, user_gender]):
return self.means_by_gender.ix[movie_id, user_gender]
else:
return self.means_by_gender.ix[movie_id].mean()
reco = CollabGenderReco()
reco.learn()
print ('RMSE for CollabGenderReco: %s' % evaluate(reco.estimate))
Break!!
These were all written to operate on two pandas Series, each representing the rating history of a different user. You can also apply them to any two feature vectors that describe users or items. In all cases, the higher the return value, the more similar the two Series are. You might need to add checks for edge cases, such as division by zero.
In [72]:
def euclidean(s1, s2):
"""Take two pd.Series objects and return their euclidean 'similarity'."""
diff = s1 - s2
return 1 / (1 + np.sqrt(np.sum(diff ** 2)))
In [73]:
def cosine(s1, s2):
"""Take two pd.Series objects and return their cosine similarity."""
return np.sum(s1 * s2) / np.sqrt(np.sum(s1 ** 2) * np.sum(s2 ** 2))
In [74]:
def pearson(s1, s2):
"""Take two pd.Series objects and return a pearson correlation."""
s1_c = s1 - s1.mean()
s2_c = s2 - s2.mean()
return np.sum(s1_c * s2_c) / np.sqrt(np.sum(s1_c ** 2) * np.sum(s2_c ** 2))
In [75]:
def jaccard(s1, s2):
dotp = np.sum(s1 * s2)
return dotp / (np.sum(s1 ** 2) + np.sum(s2 ** 2) - dotp)
def binjaccard(s1, s2):
dotp = (s1.index & s2.index).size
return dotp / (s1.sum() + s2.sum() - dotp)
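A quick illustration of these functions on two made-up rating histories sharing the same index (values in the usual 1-5 range):
In [ ]:
# two hypothetical rating histories over the same three movies
u1 = pd.Series([5.0, 4.0, 1.0], index=['m_1', 'm_2', 'm_3'])
u2 = pd.Series([4.0, 5.0, 1.0], index=['m_1', 'm_2', 'm_3'])
print('euclidean:', euclidean(u1, u2))
print('cosine:   ', cosine(u1, u2))
print('pearson:  ', pearson(u1, u2))
print('jaccard:  ', jaccard(u1, u2))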
In [76]:
class CollabPearsonReco:
""" Collaborative filtering using a custom sim(u,u'). """
def learn(self):
""" Prepare datastructures for estimation. """
self.all_user_profiles = movielens.pivot_table('rating', index='movie_id', columns='user_id')
def estimate(self, user_id, movie_id):
""" Ratings weighted by correlation similarity. """
user_condition = movielens_train.user_id != user_id
movie_condition = movielens_train.movie_id == movie_id
ratings_by_others = movielens_train.loc[user_condition & movie_condition]
if ratings_by_others.empty:
return 3.0
ratings_by_others.set_index('user_id', inplace=True)
their_ids = ratings_by_others.index
their_ratings = ratings_by_others.rating
their_profiles = self.all_user_profiles[their_ids]
user_profile = self.all_user_profiles[user_id]
sims = their_profiles.apply(lambda profile: pearson(profile, user_profile), axis=0)
ratings_sims = pd.DataFrame({'sim': sims, 'rating': their_ratings})
ratings_sims = ratings_sims[ratings_sims.sim > 0]
if ratings_sims.empty:
return their_ratings.mean()
else:
return np.average(ratings_sims.rating, weights=ratings_sims.sim)
reco = CollabPearsonReco()
reco.learn()
print ('RMSE for CollabPearsonReco: %s' % evaluate(reco.estimate))
In [77]:
def manhattan_distance(s1, s2):
    """Manhattan (L1) distance: for p1 at (x1, y1) and p2 at (x2, y2) in a plane,
    it is |x1 - x2| + |y1 - y2|; generalized here to any two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(s1, s2))
In [78]:
from decimal import Decimal
def nth_root(value, n_root):
    """Return the n-th root of value, rounded to 3 decimal places."""
    root_value = 1 / float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)
def minkowski_distance(x, y, p_value):
    """Minkowski distance of order p between two equal-length sequences."""
    return nth_root(sum(pow(abs(a - b), p_value) for a, b in zip(x, y)), p_value)
In [79]:
# test the distance functions above
print('manhattan_distance: ', manhattan_distance([1,2,3,4], [9,6,3,0]))
print('minkowski_distance: ', minkowski_distance([1,2,3,4], [9,6,3,0], 2))
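As an extra cross-check (our own addition), the Minkowski distance of order 1 reduces to the Manhattan distance on the same vectors:
In [ ]:
# Minkowski with p=1 is the Manhattan distance: both should print 16
print('manhattan_distance: ', manhattan_distance([1,2,3,4], [9,6,3,0]))
print('minkowski p=1:      ', minkowski_distance([1,2,3,4], [9,6,3,0], 1))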
In [ ]: