The goal of this notebook is first to demonstrate how to build a bi-linear recommender system using only positive feedback data.
In a later section we show that it is possible to train deeper architectures following the same design principles.
This notebook is inspired by Maciej Kula's Recommendations in Keras using triplet loss. Unlike Maciej, we won't use the BPR loss but will instead introduce the more common margin-based comparator loss.
For the sake of computation time, we will only use the smallest variant of the MovieLens reviews dataset. Beware that the architectural choices and hyperparameters that work well on such a toy dataset will not necessarily be representative of the behavior on a more realistic dataset such as MovieLens 10M or the Yahoo Songs dataset with 700M ratings.
In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os.path as op
from zipfile import ZipFile
try:
from urllib.request import urlretrieve
except ImportError: # Python 2 compat
from urllib import urlretrieve
ML_100K_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
ML_100K_FILENAME = ML_100K_URL.rsplit('/', 1)[1]
ML_100K_FOLDER = 'ml-100k'
if not op.exists(ML_100K_FILENAME):
print('Downloading %s to %s...' % (ML_100K_URL, ML_100K_FILENAME))
urlretrieve(ML_100K_URL, ML_100K_FILENAME)
if not op.exists(ML_100K_FOLDER):
print('Extracting %s to %s...' % (ML_100K_FILENAME, ML_100K_FOLDER))
ZipFile(ML_100K_FILENAME).extractall('.')
In [ ]:
data_train = pd.read_csv(op.join(ML_100K_FOLDER, 'ua.base'), sep='\t',
names=["user_id", "item_id", "rating", "timestamp"])
data_test = pd.read_csv(op.join(ML_100K_FOLDER, 'ua.test'), sep='\t',
names=["user_id", "item_id", "rating", "timestamp"])
data_train.describe()
In [ ]:
def extract_year(release_date):
if hasattr(release_date, 'split'):
components = release_date.split('-')
if len(components) == 3:
return int(components[2])
# Missing value marker
return 1920
m_cols = ['item_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
items = pd.read_csv(op.join(ML_100K_FOLDER, 'u.item'), sep='|',
names=m_cols, usecols=range(5), encoding='latin-1')
items['release_year'] = items['release_date'].map(extract_year)
data_train = pd.merge(data_train, items)
data_test = pd.merge(data_test, items)
In [ ]:
data_train.head()
In [ ]:
# data_test.describe()
In [ ]:
max_user_id = max(data_train['user_id'].max(), data_test['user_id'].max())
max_item_id = max(data_train['item_id'].max(), data_test['item_id'].max())
n_users = max_user_id + 1
n_items = max_item_id + 1
print('n_users=%d, n_items=%d' % (n_users, n_items))
In [ ]:
pos_data_train = data_train.query("rating >= 4")
pos_data_test = data_test.query("rating >= 4")
Because the mean rating is around 3.5, this cut removes approximately half of the ratings from the datasets:
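As a quick check of this claim, one can look at the empirical rating distribution of the training set:
In [ ]:
# Mean rating and the fraction of each rating value in the training set.
print(data_train['rating'].mean())
print(data_train['rating'].value_counts(normalize=True).sort_index())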
In [ ]:
pos_data_train['rating'].count()
In [ ]:
pos_data_test['rating'].count()
The following section demonstrates how to build a low-rank quadratic interaction model between users and items. The similarity score between a user and an item is defined as the unnormalized dot product of their respective embeddings.
The matching scores can be used to rank items to recommend to a specific user.
Training of the model parameters is achieved by randomly sampling negative items not seen by a pre-selected anchor user. We want the model embedding matrices to be such that the similarity between the user vector and the negative item vector is smaller than the similarity between the user vector and the positive item vector. Furthermore, we use a margin to push the negative item further apart from the anchor user.
Here is the architecture of such a triplet model. The name comes from the fact that the loss to optimize is defined for triplets (anchor_user, positive_item, negative_item):
We call this model a triplet model with bi-linear interactions because the similarity between a user and an item is captured by a dot product of the first-level embedding vectors. This is therefore not a deep architecture.
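To make the objective concrete, here is a tiny NumPy sketch of the margin loss on a single made-up triplet (the embedding values below are illustrative only): the loss is max(0, margin + sim(user, negative) - sim(user, positive)), so it is zero only when the positive item scores higher than the negative item by at least the margin.
In [ ]:
# Toy illustration of the margin loss on one made-up triplet.
user = np.array([1., 0.])
pos_item = np.array([0.9, 0.1])   # similar to the user vector
neg_item = np.array([0.1, 0.9])   # dissimilar from the user vector

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

margin = 1.
loss = max(0., margin + cosine(user, neg_item) - cosine(user, pos_item))
print(loss)  # non-zero: the ordering is correct but not by a full margin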
In [ ]:
import tensorflow as tf
from tensorflow.keras import layers
def identity_loss(y_true, y_pred):
"""Ignore y_true and return the mean of y_pred
This is a hack to work around the design of the Keras API, which is not
really suited to training networks with a triplet loss by default.
"""
return tf.reduce_mean(y_pred)
class MarginLoss(layers.Layer):
def __init__(self, margin=1.):
super().__init__()
self.margin = margin
def call(self, inputs):
pos_pair_similarity = inputs[0]
neg_pair_similarity = inputs[1]
diff = neg_pair_similarity - pos_pair_similarity
return tf.maximum(diff + self.margin, 0.)
Here is the actual code that builds the model(s) with shared weights. Note that we use the cosine similarity here instead of unnormalized dot products (both seem to yield comparable results).
The triplet model is used to train the weights of the companion similarity model. The triplet model takes one user, one positive item (relative to the selected user) and one negative item, and is trained with the margin comparator loss.
The similarity model takes one user and one item as input and returns a compatibility score (a.k.a. the match score).
In [ ]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Flatten, Input, Dense
from tensorflow.keras.layers import Lambda, Dot
from tensorflow.keras.regularizers import l2
class TripletModel(Model):
def __init__(self, n_users, n_items, latent_dim=64,
l2_reg=None, margin=1.):
super().__init__(name="TripletModel")
self.margin = margin
l2_reg = None if not l2_reg else l2(l2_reg)
self.user_layer = Embedding(n_users, latent_dim,
input_length=1,
input_shape=(1,),
name='user_embedding',
embeddings_regularizer=l2_reg)
# The following embedding parameters will be shared to
# encode both the positive and negative items.
self.item_layer = Embedding(n_items, latent_dim,
input_length=1,
name="item_embedding",
embeddings_regularizer=l2_reg)
# The 2 following layers are without parameters, and can
# therefore be used for both positive and negative items.
self.flatten = Flatten()
self.dot = Dot(axes=1, normalize=True)
self.margin_loss = MarginLoss(margin)
def call(self, inputs, training=False):
user_input = inputs[0]
pos_item_input = inputs[1]
neg_item_input = inputs[2]
user_embedding = self.user_layer(user_input)
user_embedding = self.flatten(user_embedding)
pos_item_embedding = self.item_layer(pos_item_input)
pos_item_embedding = self.flatten(pos_item_embedding)
neg_item_embedding = self.item_layer(neg_item_input)
neg_item_embedding = self.flatten(neg_item_embedding)
# Similarity computation between embeddings
pos_similarity = self.dot([user_embedding, pos_item_embedding])
neg_similarity = self.dot([user_embedding, neg_item_embedding])
return self.margin_loss([pos_similarity, neg_similarity])
triplet_model = TripletModel(n_users, n_items,
latent_dim=64, l2_reg=1e-6)
In [ ]:
class MatchModel(Model):
def __init__(self, user_layer, item_layer):
super().__init__(name="MatchModel")
# Reuse shared weights for those layers:
self.user_layer = user_layer
self.item_layer = item_layer
self.flatten = Flatten()
self.dot = Dot(axes=1, normalize=True)
def call(self, inputs):
user_input = inputs[0]
pos_item_input = inputs[1]
user_embedding = self.user_layer(user_input)
user_embedding = self.flatten(user_embedding)
pos_item_embedding = self.item_layer(pos_item_input)
pos_item_embedding = self.flatten(pos_item_embedding)
pos_similarity = self.dot([user_embedding,
pos_item_embedding])
return pos_similarity
match_model = MatchModel(triplet_model.user_layer,
triplet_model.item_layer)
Note that triplet_model and match_model have exactly the same number of parameters: they share both the user and the item embeddings. Their only difference is that the latter does not compute the similarity to the negative item.
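A quick sanity check of this claim (a minimal sketch: the integer ids below are arbitrary placeholders, only used to build the models so that count_params can be called):
In [ ]:
# Build both models on dummy ids so their weights are created, then
# compare the total number of parameters: they should be identical.
dummy_user = np.array([[1]])
dummy_item = np.array([[2]])
_ = match_model([dummy_user, dummy_item])
_ = triplet_model([dummy_user, dummy_item, dummy_item])
print(triplet_model.count_params(), match_model.count_params())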
Now that we have a randomly initialized model we can start computing random recommendations. To assess their quality, for each user we rank all the items that the user has not seen in the training set by their predicted match score and compute the ROC AUC of that ranking against the positive items of the test set; the per-user scores are then averaged.
In [ ]:
from sklearn.metrics import roc_auc_score
def average_roc_auc(model, data_train, data_test):
"""Compute the ROC AUC for each user and average over users"""
max_user_id = max(data_train['user_id'].max(),
data_test['user_id'].max())
max_item_id = max(data_train['item_id'].max(),
data_test['item_id'].max())
user_auc_scores = []
for user_id in range(1, max_user_id + 1):
pos_item_train = data_train[data_train['user_id'] == user_id]
pos_item_test = data_test[data_test['user_id'] == user_id]
# Rank all items except those already seen by the user in the training set
all_item_ids = np.arange(1, max_item_id + 1)
items_to_rank = np.setdiff1d(
all_item_ids, pos_item_train['item_id'].values)
# Ground truth: return 1 for each item positively present in
# the test set and 0 otherwise.
expected = np.in1d(
items_to_rank, pos_item_test['item_id'].values)
if np.sum(expected) >= 1:
# At least one positive test value to rank
repeated_user_id = np.empty_like(items_to_rank)
repeated_user_id.fill(user_id)
predicted = model.predict(
[repeated_user_id, items_to_rank], batch_size=4096)
user_auc_scores.append(roc_auc_score(expected, predicted))
return sum(user_auc_scores) / len(user_auc_scores)
By default the model should make predictions that rank the items in random order. The ROC AUC score is a ranking metric: it represents the probability of correctly ordering a uniformly sampled (positive, negative) pair of items.
A random (untrained) model should yield 0.50 ROC AUC on average.
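This baseline can be illustrated directly with scikit-learn on synthetic data (random labels and random scores, unrelated to the MovieLens ratings):
In [ ]:
# Random scores against random binary labels yield an AUC close to 0.5.
rng = np.random.RandomState(42)
toy_labels = rng.randint(0, 2, size=10000)
toy_scores = rng.rand(10000)
print(roc_auc_score(toy_labels, toy_scores))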
In [ ]:
average_roc_auc(match_model, pos_data_train, pos_data_test)
Let's now fit the parameters of the model by sampling triplets: for each user, select a movie from that user's positive feedback set and randomly sample another movie to serve as the negative item.
Note that this sampling scheme could be improved by removing items that are marked as positive in the data, to reduce label noise (a sketch of such a variant is given after the baseline sampler below). In practice this does not seem to be a problem though.
In [ ]:
def sample_triplets(pos_data, max_item_id, random_seed=0):
"""Sample negatives at random"""
rng = np.random.RandomState(random_seed)
user_ids = pos_data['user_id'].values
pos_item_ids = pos_data['item_id'].values
neg_item_ids = rng.randint(low=1, high=max_item_id + 1,
size=len(user_ids))
return [user_ids, pos_item_ids, neg_item_ids]
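As a possible refinement of the scheme above, here is a sketch that resamples negatives colliding with the user's known positives (the helper name is ours; it is not used in the rest of the notebook):
In [ ]:
def sample_triplets_no_collision(pos_data, max_item_id, random_seed=0):
    """Variant of sample_triplets that rejects negatives that are
    actually positive items for the same user."""
    rng = np.random.RandomState(random_seed)
    user_ids = pos_data['user_id'].values
    pos_item_ids = pos_data['item_id'].values
    pos_sets = pos_data.groupby('user_id')['item_id'].apply(set).to_dict()
    neg_item_ids = rng.randint(low=1, high=max_item_id + 1,
                               size=len(user_ids))
    for i, (user_id, neg_id) in enumerate(zip(user_ids, neg_item_ids)):
        while neg_id in pos_sets.get(user_id, ()):
            neg_id = rng.randint(1, max_item_id + 1)
        neg_item_ids[i] = neg_id
    return [user_ids, pos_item_ids, neg_item_ids]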
Let's train the triplet model:
In [ ]:
# We plug the identity loss and a fake target variable ignored by
# the model, to be able to use the Keras API to train the triplet model.
fake_y = np.ones_like(pos_data_train["user_id"])
triplet_model.compile(loss=identity_loss, optimizer="adam")
n_epochs = 10
batch_size = 64
for i in range(n_epochs):
# Sample new negatives to build different triplets at each epoch
triplet_inputs = sample_triplets(pos_data_train, max_item_id,
random_seed=i)
# Fit the model incrementally by doing a single pass over the
# sampled triplets.
triplet_model.fit(x=triplet_inputs, y=fake_y, shuffle=True,
batch_size=batch_size, epochs=1)
# Monitor the convergence of the model at each epoch. Ideally we would
# use a held-out validation set rather than the test set for this.
test_auc = average_roc_auc(match_model, pos_data_train, pos_data_test)
print("Epoch %d/%d: test ROC AUC: %0.4f"
% (i + 1, n_epochs, test_auc))
In [ ]:
# print(match_model.summary())
In [ ]:
# print(triplet_model.summary())
In [ ]:
# %load solutions/triplet_parameter_count.py
Instead of using a hard-coded cosine similarity to predict the match of a (user_id, item_id)
pair, we can specify a deep neural network based parametrization of the similarity. The parameters of that matching model are also trained with the margin comparator loss:
Implement a deep_match_model
, deep_triplet_model
pair of models
for the architecture described in the schema. The last layer of
the embedded Multi Layer Perceptron outputs a single scalar that
encodes the similarity between a user and a candidate item.
Evaluate the resulting model by computing the per-user average ROC AUC score on the test feedback data.
Check that the AUC ROC score is close to 0.50 for a randomly initialized model.
Check that you can reach at least 0.91 ROC AUC with this deep model (you might need to adjust the hyperparameters).
Hints:
it is possible to reuse the code to create embeddings from the previous model definition;
the concatenation of the user embedding and the positive item embedding can be
obtained with the Concatenate
layer:
concat = Concatenate()
positive_embeddings_pair = concat([user_embedding,
positive_item_embedding])
negative_embeddings_pair = concat([user_embedding,
negative_item_embedding])
In [ ]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Embedding, Flatten, Dense
from tensorflow.keras.layers import Concatenate, Dropout
from tensorflow.keras.regularizers import l2
class MLP(layers.Layer):
def __init__(self, n_hidden=1, hidden_size=64, dropout=0.,
l2_reg=None):
super().__init__()
# TODO
class DeepTripletModel(Model):
def __init__(self, n_users, n_items, user_dim=32, item_dim=64,
margin=1., n_hidden=1, hidden_size=64, dropout=0,
l2_reg=None):
super().__init__()
# TODO
class DeepMatchModel(Model):
def __init__(self, user_layer, item_layer, mlp):
super().__init__(name="MatchModel")
# TODO
In [ ]:
# %load solutions/deep_implicit_feedback_recsys.py
In [ ]:
# print(deep_match_model.summary())
In [ ]:
# print(deep_triplet_model.summary())
In [ ]:
# %load solutions/deep_triplet_parameter_count.py
You can implement any of the following ideas if you want to get a deeper understanding of recommender systems.
As we did for the Explicit Feedback model, it's also possible to extend our models to take additional user and item metadata as side information when computing the match score.
In this notebook we evaluated the quality of the ranked recommendations using the ROC AUC metric. This score reflects the ability of the model to correctly rank any pair of items (sampled uniformly at random among all possible items).
In practice recommender systems will only display a few recommendations to the user (typically 1 to 10). It is therefore more informative to use an evaluation metric that characterizes the quality of the top-ranked items and gives little or no importance to items that are not good recommendations for a specific user. Popular ranking metrics therefore include Precision at k and Mean Average Precision.
You can read up online about those metrics and try to implement them here.
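For instance, a minimal sketch of Precision at k could look as follows (here ranked_item_ids is assumed to be an array of item ids sorted by decreasing match score for one user, and relevant_item_ids the set of that user's positive test items):
In [ ]:
def precision_at_k(ranked_item_ids, relevant_item_ids, k=10):
    """Fraction of the top-k ranked items that are relevant for the user."""
    top_k = np.asarray(ranked_item_ids)[:k]
    return np.in1d(top_k, list(relevant_item_ids)).mean()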
In this experiment we sampled negative items uniformly at random. However, after training the model for a while, it is possible that the vast majority of sampled negatives already have a similarity much lower than the positive pair, so that the margin comparator loss sets most of the gradients to zero, effectively wasting a lot of computation.
Given the current state of the recsys model we could preferentially sample harder negatives so as to train the model closer to its decision boundary (a simple sketch is given below). This strategy is implemented in the WARP loss [1].
The main drawback of hard negative sampling is an increased risk of severe overfitting if a significant fraction of the labels are noisy.
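As announced above, here is a sketch of such a scheme: score a few random candidate negatives with the current match_model and keep the hardest (highest-scoring) one. The helper and its n_candidates parameter are illustrative and not part of the original code:
In [ ]:
def sample_hard_triplets(pos_data, max_item_id, model, n_candidates=5,
                         random_seed=0):
    """Sample several random negatives per positive pair and keep the one
    with the highest current match score (the 'hardest' negative)."""
    rng = np.random.RandomState(random_seed)
    user_ids = pos_data['user_id'].values
    pos_item_ids = pos_data['item_id'].values
    candidates = rng.randint(1, max_item_id + 1,
                             size=(len(user_ids), n_candidates))
    repeated_users = np.repeat(user_ids, n_candidates)
    scores = model.predict([repeated_users, candidates.ravel()],
                           batch_size=4096)
    scores = scores.reshape(len(user_ids), n_candidates)
    hardest = candidates[np.arange(len(user_ids)), scores.argmax(axis=1)]
    return [user_ids, pos_item_ids, hardest]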
A very popular recommender systems model is called Factorization Machines [2][3]. They too use low-rank vector representations of the inputs, but they do not use a cosine similarity or a neural network to model user/item compatibility.
It would be possible to adapt our previous Keras code to replace the cosine similarity / MLP with the low-rank FM quadratic interactions by reading through this gentle introduction.
If you choose to do so, you can compare the quality of the predictions with those obtained by the pywFM project, which provides a Python wrapper for the official libFM C++ implementation. Maciej Kula also maintains lightfm, which implements an efficient and well-documented variant in Cython and Python.
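For reference, the second-order FM interaction term can be computed in linear time using the reformulation from [2]. Here is a small NumPy sketch for a single example with binary (one-hot) features, where embeddings stacks the latent vectors of the active features (e.g. one user row and one item row):
In [ ]:
def fm_pairwise_interaction(embeddings):
    """Sum of pairwise dot products between the rows of `embeddings`,
    computed as 0.5 * (square of the sum minus sum of the squares)."""
    summed_square = embeddings.sum(axis=0) ** 2
    squared_sum = (embeddings ** 2).sum(axis=0)
    return 0.5 * (summed_square - squared_sum).sum()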
[1] Wsabie: Scaling Up To Large Vocabulary Image Annotation
Jason Weston, Samy Bengio, Nicolas Usunier, 2011
https://research.google.com/pubs/pub37180.html
[2] Factorization Machines, Steffen Rendle, 2010
https://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle2010FM.pdf
[3] Factorization Machines with libFM, Steffen Rendle, 2012
in ACM Trans. Intell. Syst. Technol., 3(3), May.
http://doi.acm.org/10.1145/2168752.2168771
In [ ]: