Recommendation systems are engines that predict the ratings users may give to certain items. They are widely used to recommend movies, books, news and many other things. There are several ways to produce recommendations, and one of them is collaborative filtering. This method collects information about users' behaviour or preferences and bases predictions on similarities between users. Or, in simpler terms: if users have similar preferences on some issues, they will likely have similar preferences on other issues.
Usually collaborative filtering is divided into two categories: user-based and item-based. In user-based filtering we look for users who are similar to the target user and use their ratings to calculate the prediction for the target user. In item-based filtering we build a matrix of relationships between the items, derive the preferences of the active user from this matrix and find items they could like.
These two ways are often called the memory-based approach, as they load all the data into memory. Advantages of the memory-based approach are:
- it is simple to implement and the recommendations are easy to explain;
- new users, items and ratings can be added without retraining anything.
Disadvantages:
- it doesn't scale well to large numbers of users and items;
- it struggles when the ratings matrix is very sparse.
Another approach is model-based. This means building a model on the dataset to find underlying patterns in the data.
Advantages:
- it handles sparse data better and usually scales better;
- it can give more accurate predictions by generalizing from the data.
Disadvantages:
- building the model can be expensive;
- some information is lost along the way, and the predictions are harder to explain.
In this notebook I show how these methods can be implemented.
I use a dataset with restaurant ratings. Citation, as requested:
Blanca Vargas-Govea, Juan Gabriel González-Serna, Rafael Ponce-Medellín. Effects of relevant contextual features in the performance of a restaurant recommender system. In RecSys’11: Workshop on Context Aware Recommender Systems (CARS-2011), Chicago, IL, USA, October 23, 2011.
In [1]:
import numpy as np
import pandas as pd
In [2]:
data = pd.read_csv('UCI/RCdata/rating_final.csv')
In [3]:
data.head(10)
Out[3]:
Users give restaurants ratings based on food, service and overall quality. Possible ratings are 0, 1 and 2. To distinguish zero ratings from missing ratings I replace zeros with a very small value. I'll use only the overall rating in the analysis.
In [4]:
data['rating'] = data['rating'].apply(lambda x: 0.000001 if x == 0 else x)
In [5]:
#Sparse matrix.
ratings = data.pivot_table(index='userID', columns='placeID', values='rating')
The algorithm uses Pearson correlation:
$$sim(x,y)=\frac{\sum_{i\in I_{xy}}(r_{x,i}-\bar{r}_x)(r_{y,i}-\bar{r}_y)}{\sqrt{\sum_{i\in I_{xy}}(r_{x,i}-\bar{r}_x)^2}\sqrt{\sum_{i\in I_{xy}}(r_{y,i}-\bar{r}_y)^2}}$$
where: r - ratings; x, y - users; $I_{xy}$ - the set of items rated by both user x and user y.
At first, similarities between users are calculated using Pearson correlation; then the users who are most similar to the target user are identified. Recommendations are generated based on their ratings.
In [6]:
def pearson(user1, user2, df):
    '''
    Calculates similarity between two users. Takes user's ids and dataframe as inputs.
    '''
    df_short = df[df[user1].notnull() & df[user2].notnull()]
    if len(df_short) == 0:
        return 0
    else:
        rat1 = [row[user1] for i, row in df_short.iterrows()]
        rat2 = [row[user2] for i, row in df_short.iterrows()]
        mean1 = sum(rat1) / len(rat1)
        mean2 = sum(rat2) / len(rat2)
        numerator = sum([(rat1[i] - mean1) * (rat2[i] - mean2) for i in range(len(df_short))])
        denominator1 = sum([(rat1[i] - mean1) ** 2 for i in range(len(df_short))])
        denominator2 = sum([(rat2[i] - mean2) ** 2 for i in range(len(df_short))])
        if denominator1 * denominator2 == 0:
            return 0
        else:
            return numerator / ((denominator1 * denominator2) ** 0.5)
In [7]:
#The dataframe is transposed for easier processing.
pearson('U1103', 'U1028', ratings.transpose())
Out[7]:
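For a sanity check, the result can be compared with scipy's implementation (a sketch assuming scipy is available and that the two users share at least two co-rated places with non-constant ratings; in the degenerate cases pearsonr is undefined while the function above returns 0):
from scipy.stats import pearsonr

ratings_t = ratings.transpose()
#Keep only the places rated by both users and compare with scipy's Pearson correlation.
both = ratings_t[ratings_t['U1103'].notnull() & ratings_t['U1028'].notnull()]
r, p = pearsonr(both['U1103'], both['U1028'])
print(r)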
In [8]:
def get_neighbours(user_id, df):
    '''
    Creates a sorted list of the users who are most similar to the specified user. Calculates similarity between the
    current user and all other users and sorts by similarity, keeping only positive similarities.
    '''
    distances = [(user, pearson(user_id, user, df)) for user in df.columns if user != user_id]
    distances.sort(key=lambda x: x[1], reverse=True)
    distances = [i for i in distances if i[1] > 0]
    return distances
In [9]:
get_neighbours('U1103', ratings.transpose())
Out[9]:
In [10]:
def recommend(user, df, n_users=2, n_recommendations=2):
    '''
    Generates recommendations for the user. Takes a userID and a Dataframe as input. Gets the neighbours and a weighted
    score for each place they rated. Returns a sorted list of places and their scores.
    '''
    recommendations = {}
    nearest = get_neighbours(user, df)
    n_users = n_users if n_users <= len(nearest) else len(nearest)
    user_ratings = df[df[user].notnull()][user]
    place_ratings = []
    for i in range(n_users):
        neighbour_ratings = df[df[nearest[i][0]].notnull()][nearest[i][0]]
        for place in neighbour_ratings.index:
            #Only consider places the target user hasn't rated yet.
            if place not in user_ratings.index:
                place_ratings.append([place, neighbour_ratings[place], nearest[i][1]])
    recommendations = get_ratings(place_ratings)
    return recommendations[:n_recommendations]

def get_ratings(place_ratings):
    '''
    Creates a Dataframe from a list of lists. Calculates weighted ratings for each place.
    '''
    ratings_df = pd.DataFrame(place_ratings, columns=['placeID', 'rating', 'weight'])
    ratings_df['total_weight'] = ratings_df['weight'].groupby(ratings_df['placeID']).transform('sum')
    recommendations = []
    for i in ratings_df.placeID.unique():
        place_rating = 0
        df_short = ratings_df.loc[ratings_df.placeID == i]
        for j, row in df_short.iterrows():
            #Weight each neighbour's rating by its share of the total similarity for this place.
            place_rating += row['rating'] * row['weight'] / row['total_weight']
        recommendations.append((i, place_rating))
    recommendations = [i for i in recommendations if i[1] >= 1]
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations
In [11]:
recommend('U1068', ratings.transpose(),5,5)
Out[11]:
The idea behind the slope one algorithm is simple: calculate the average difference in ratings for each pair of items and use this difference as the prediction. For example, if users generally rate item A higher than item B by 1 point, then to predict a user's rating of item A we take the rating that user gave to item B and add 1 point. Usually there are more items than two, so a weighted average is used. The algorithm's main advantages are simplicity and speed.
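A tiny worked illustration of this with made-up numbers (not from the dataset):
import numpy as np

#Two users rated both items; the target user rated only item B.
ratings_a = np.array([2.0, 1.0])          #item A ratings by users 1 and 2
ratings_b = np.array([1.0, 0.0])          #item B ratings by the same users
dev_ab = (ratings_a - ratings_b).mean()   #on average item A is rated 1 point higher than item B
target_b = 1.0                            #the target user's rating of item B
print(target_b + dev_ab)                  #predicted rating of item A: 2.0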
In [12]:
def get_dev_fr(data):
    '''
    Calculates the average difference between each pair of places and the frequency - the number of ratings. Both values
    are calculated over the cases where a user rated both places.
    '''
    data_dev = pd.DataFrame(index=data.columns, columns=data.columns)
    data_fr = pd.DataFrame(index=data.columns, columns=data.columns)
    for i in data_dev.columns:
        for j in data_dev.columns:
            df_loc = data[data[i].notnull() & data[j].notnull()]
            if len(df_loc) != 0:
                data_dev.loc[i, j] = (sum(df_loc[i]) - sum(df_loc[j])) / len(df_loc) if i != j else 0
                data_fr.loc[i, j] = len(df_loc) if i != j else 0
    return data_dev, data_fr
In [13]:
data_dev, data_fr = get_dev_fr(ratings)
In [14]:
def slopeone(user, data):
    '''
    Generates recommended ratings for each place which the user didn't rate by adding weighted differences between places.
    '''
    #Places which the user didn't rate. The condition finds nan values.
    recommendation = [i for i in data.columns if data.loc[user, i] != data.loc[user, i]]
    recommendation_dictionary = {}
    for j in recommendation:
        score = 0
        denominator = 0
        for i in data.columns.drop(recommendation):
            #Skip pairs of places with no common raters (nan deviation/frequency).
            if data_dev.loc[j, i] == data_dev.loc[j, i] and data_fr.loc[j, i] == data_fr.loc[j, i]:
                score += (data.loc[user, i] + data_dev.loc[j, i]) * data_fr.loc[j, i]
                denominator += data_fr.loc[j, i]
        if denominator == 0:
            recommendation_dictionary[j] = 0
        else:
            score = score / denominator
            recommendation_dictionary[j] = score
    recommendation_dictionary = {k: round(v, 2) for k, v in recommendation_dictionary.items()}
    return sorted(recommendation_dictionary.items(), key=lambda x: x[1], reverse=True)[:5]
In [15]:
slopeone('U1103', ratings)
Out[15]:
Another metric for similarity is cosine similarity.
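For two rating vectors x and y it is the cosine of the angle between them:
$$sim(x,y)=\frac{x\cdot y}{\left\|x\right\|\left\|y\right\|}=\frac{\sum_{i}x_i y_i}{\sqrt{\sum_{i}x_i^2}\sqrt{\sum_{i}y_i^2}}$$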
Predictions are generated in a similar way to the previous methods - as a weighted average of the other users'/items' ratings. It is also a good idea to remove user bias: some users tend to give low or high ratings across the board, so I'll take the users' average ratings into consideration.
In [16]:
ratings_filled = data.pivot_table(index='userID', columns='placeID', values='rating', fill_value=0)
ratings_filled = ratings_filled.astype(float).values
In [17]:
def similarity(ratings, matrix_type='user', epsilon=1e-9):
    '''
    Computes cosine similarity between users or between places. Epsilon keeps the norms from being exactly zero.
    '''
    if matrix_type == 'user':
        sim = ratings.dot(ratings.T) + epsilon
    elif matrix_type == 'place':
        sim = ratings.T.dot(ratings) + epsilon
    norms = np.array([np.sqrt(np.diagonal(sim))])
    return (sim / norms / norms.T)
In [18]:
user_similarity = similarity(ratings_filled, matrix_type='user')
item_similarity = similarity(ratings_filled, matrix_type='place')
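As a quick cross-check (a sketch assuming scikit-learn is installed), the same similarities can be computed with sklearn's cosine_similarity; any difference comes from the epsilon term and the tiny placeholder ratings:
from sklearn.metrics.pairwise import cosine_similarity

#User-user and place-place cosine similarity matrices computed by scikit-learn.
user_similarity_sk = cosine_similarity(ratings_filled)
item_similarity_sk = cosine_similarity(ratings_filled.T)

#Largest absolute differences from the matrices above.
print(np.abs(user_similarity - user_similarity_sk).max())
print(np.abs(item_similarity - item_similarity_sk).max())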
In [19]:
def predict(ratings, similarity, matrix_type='user'):
    '''
    Predict ratings for places based on similarity.
    '''
    if matrix_type == 'user':
        #Bias as the sum of non-zero values divided by the number of non-zero values.
        user_bias = np.true_divide(ratings.sum(axis=1), (ratings != 0).sum(axis=1))
        ratings = (ratings - user_bias[:, np.newaxis]).copy()
        pred = similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
        pred += user_bias[:, np.newaxis]
    elif matrix_type == 'place':
        item_bias = np.true_divide(ratings.sum(axis=0), (ratings != 0).sum(axis=0))
        ratings = (ratings - item_bias[np.newaxis, :]).copy()
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
        pred += item_bias[np.newaxis, :]
    return pred

def recommend_cosine(rating, matrix, user):
    '''
    If the user has rated a place, replace the predicted rating with 0. Return the top-5 predictions.
    '''
    predictions = [[0 if rating[j][i] > 0 else matrix[j][i] for i in range(len(matrix[j]))] for j in range(len(matrix))]
    #Use the labels of the original pivot table for the output.
    recommendations = pd.DataFrame(index=ratings.index, columns=ratings.columns, data=np.round(predictions, 4)).transpose()
    return recommendations[user].sort_values(ascending=False)[:5]
In [20]:
user_pred = predict(ratings_filled, user_similarity, matrix_type='user')
item_pred = predict(ratings_filled, item_similarity, matrix_type='place')
In [21]:
recommend_cosine(ratings_filled, item_pred, 'U1103')
Out[21]:
In [22]:
recommend_cosine(ratings_filled, user_pred, 'U1103')
Out[22]:
Model-based collaborative filtering strives to find latent features in the data. Matrix factorization is a commonly used method. The idea is to find two matrices whose product yields the ratings matrix: both the ratings we already have and the predicted ones. One matrix is for users (P), the other one is for items (Q). The ratings matrix (R) is their product. One dimension of P and Q is the number of users/items respectively, the other is the number of latent features.
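A minimal shape sketch of this idea (random toy matrices, purely illustrative):
import numpy as np

m, n, k = 100, 50, 8        #toy numbers of users, items and latent features
P = np.random.rand(k, m)    #latent user feature matrix
Q = np.random.rand(k, n)    #latent item feature matrix
R_approx = P.T.dot(Q)       #the product approximates the full ratings matrix
print(R_approx.shape)       #(100, 50) - one predicted rating per user-item pair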
There are several algorithms with which matrix factorization could be done. Alternating least squares is one of them.
$$\min_{Q^*,P^*}\sum_{(u,i)\in K}(r_{ui}-Q_i^T P_u)^2+\lambda\left(\left\|Q_i\right\|^2 + \left\|P_u\right\|^2\right)$$
The algorithm minimizes the difference between the original ratings and the ratings produced by multiplying the aforementioned matrices. The second part of the formula is regularization. The idea of ALS is to fix one of the matrices (P or Q) and optimize for the other; at the next step the second matrix is fixed and the first one is optimized.
$$p_{i} = A^{-1}_{i} V_{i} \quad \text{with} \quad A_{i} = Q_{I_i} Q_{I_i}^{T} + \lambda n_{p_i} E \quad \text{and} \quad V_{i} = Q_{I_i} R^{T}(i,I_{i})$$
$$q_{j} = A^{-1}_{j} V_{j} \quad \text{with} \quad A_{j} = P_{I_j} P_{I_j}^{T} + \lambda n_{q_j} E \quad \text{and} \quad V_{j} = P_{I_j} R(I_{j},j)$$
P. S. This part is heavily based on this article.
At first I need to split the data into train and test sets to check how the error changes with each step. Train and test should have the same dimensions as the original data, so here is a function for it. I take users who have rated at least one place and select three of their ratings - this is the test set. The train set is all the other values.
In [23]:
def train_test_split(ratings):
    test = np.zeros(ratings.shape)
    train = ratings.copy()
    non_zero = [i for i in range(ratings.shape[0]) if sum(ratings[i]) > 0]
    for user in non_zero:
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0],
                                        size=3,
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]
    return train, test
In [24]:
R, T = train_test_split(ratings_filled)
Now I need index matrices for the train and test sets, where value 1 means that a certain user has rated a particular place.
In [25]:
I = R.copy()
I[I > 0] = 1
I[I == 0] = 0
I2 = T.copy()
I2[I2 > 0] = 1
I2[I2 == 0] = 0
In [26]:
def rmse(I, R, Q, P):
    return np.sqrt(np.sum((I * (R - np.dot(P.T, Q))) ** 2) / len(R[R > 0]))
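The rmse function above measures the error only over the observed (non-zero) entries of the given matrix, with predictions taken from $\hat{R} = P^T Q$:
$$RMSE=\sqrt{\frac{\sum_{(u,i):\,r_{ui}>0}\left(r_{ui}-\hat{r}_{ui}\right)^2}{\left|\left\{(u,i):r_{ui}>0\right\}\right|}}$$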
In [27]:
def als(R=R, T=T, lmbda=0.1, k=40, n_epochs=30, I=I, I2=I2):
    '''
    Function for ALS. Takes matrices and parameters as inputs.
    lmbda - regularization strength;
    k - dimensionality of the latent feature space;
    n_epochs - number of epochs for training.
    '''
    #Number of users and items.
    m, n = R.shape
    P = 1.5 * np.random.rand(k, m) # Latent user feature matrix.
    Q = 1.5 * np.random.rand(k, n) # Latent place feature matrix.
    Q[0, :] = np.true_divide(R.sum(axis=0), np.maximum((R != 0).sum(axis=0), 1)) # Avg. rating of each place for the initial step.
    E = np.eye(k) # (k x k)-dimensional identity matrix.
    train_errors = []
    test_errors = []
    for epoch in range(n_epochs):
        # Fix Q and estimate P
        for i, Ii in enumerate(I):
            nui = np.count_nonzero(Ii)
            if (nui == 0): nui = 1
            a = np.dot(np.diag(Ii), Q.T)
            Ai = np.dot(Q, a) + lmbda * nui * E
            v = np.dot(np.diag(Ii), R[i].T)
            Vi = np.dot(Q, v)
            P[:, i] = np.linalg.solve(Ai, Vi)
        # Fix P and estimate Q
        for j, Ij in enumerate(I.T):
            nmj = np.count_nonzero(Ij)
            if (nmj == 0): nmj = 1
            a = np.dot(np.diag(Ij), P.T)
            Aj = np.dot(P, a) + lmbda * nmj * E
            v = np.dot(np.diag(Ij), R[:, j])
            Vj = np.dot(P, v)
            Q[:, j] = np.linalg.solve(Aj, Vj)
        train_rmse = rmse(I, R, Q, P)
        test_rmse = rmse(I2, T, Q, P)
        train_errors.append(train_rmse)
        test_errors.append(test_rmse)
        print(f'[Epoch {epoch+1}/{n_epochs}] train error: {train_rmse:6.6}, test error: {test_rmse:6.6}')
        if len(train_errors) > 1 and test_errors[-1] > test_errors[-2]:
            print('Test error stopped improving, algorithm stopped')
            break
    R = pd.DataFrame(R)
    R.columns = ratings.columns
    R.index = ratings.index
    R_pred = pd.DataFrame(np.dot(P.T, Q))
    R_pred.columns = ratings.columns
    R_pred.index = ratings.index
    return pd.DataFrame(R), R_pred
In [28]:
R, R_pred = als()
So this is it. Now let's compare the original and predicted ratings of a user.
In [29]:
user_ratings = R.transpose()['U1123'][R.transpose()['U1123'].sort_values(ascending=False) >=1]
predictions = pd.DataFrame(user_ratings)
predictions.columns = ['Actual']
predictions['Predicted'] = R_pred.loc['U1123',user_ratings.index]
predictions
Out[29]:
The difference in ratings truly isn't very big. And now let's see the recommendations.
In [30]:
R_pred.loc['U1123',set(R_pred.transpose().index)-set(user_ratings.index)].sort_values(ascending=False)[:5]
Out[30]: