In this example we'll use the StackExchange dataset to explore recommendations under item-cold start. Data dumps from the StackExchange network are available at https://archive.org/details/stackexchange, and we'll use one of them --- for stats.stackexchange.com --- here.
The dataset consists of users answering questions: in the user-item interaction matrix, each user is a row and each question is a column. Based on which users answered which questions in the training set, we'll try to recommend new questions from the test set.
Let's start by loading the data. We'll use the datasets module.
In [1]:
import numpy as np
from lightfm.datasets import fetch_stackexchange
data = fetch_stackexchange('crossvalidated',
                           test_set_fraction=0.1,
                           indicator_features=False,
                           tag_features=True)
train = data['train']
test = data['test']
Let's examine the data:
In [2]:
print('The dataset has %s users and %s items, '
      'with %s interactions in the test and %s interactions in the training set.'
      % (train.shape[0], train.shape[1], test.getnnz(), train.getnnz()))
The training and test sets are divided chronologically: the test set contains the 10% of interactions that happened after the 90% in the training set. This means that many of the questions in the test set have no interactions in the training set at all. This is an accurate description of a question answering system: it is most important to recommend questions that have not yet been answered to the expert users who can answer them.
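To get a sense of how cold the test set really is, we can count the test questions that never appear in the training interactions. This is a quick sketch, assuming the interaction matrices returned above are sparse matrices holding binary values:
# Count test-set questions that have no training interactions at all
# (sketch; assumes interaction values are binary, one per user/question pair).
train_counts = np.asarray(train.sum(axis=0)).ravel()  # answers per question in train
test_counts = np.asarray(test.sum(axis=0)).ravel()    # answers per question in test

cold_items = np.sum((test_counts > 0) & (train_counts == 0))
print('%s of the %s questions answered in the test set have no training interactions.'
      % (cold_items, np.sum(test_counts > 0)))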
This is clearly a cold-start scenario, and so we can expect a traditional collaborative filtering model to do very poorly. Let's check if that's the case:
In [3]:
# Import the model
from lightfm import LightFM
# Set the number of threads; you can increase this
# if you have more physical cores available.
NUM_THREADS = 2
NUM_COMPONENTS = 30
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6
# Let's fit a WARP model: these generally have the best performance.
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)
# Run 3 epochs and time it.
%time model = model.fit(train, epochs=NUM_EPOCHS, num_threads=NUM_THREADS)
As a means of sanity checking, let's calculate the model's AUC on the training set first. If it's reasonably high, we can be sure that the model is not doing anything stupid and is fitting the training data well.
In [4]:
# Import the evaluation routines
from lightfm.evaluation import auc_score
# Compute and print the AUC score
train_auc = auc_score(model, train, num_threads=NUM_THREADS).mean()
print('Collaborative filtering train AUC: %s' % train_auc)
Fantastic, the model is fitting the training set well. But what about the test set?
In [5]:
# We pass in the train interactions to exclude them from predictions.
# This is to simulate a recommender system where we do not
# re-recommend things the user has already interacted with in the train
# set.
test_auc = auc_score(model, test, train_interactions=train, num_threads=NUM_THREADS).mean()
print('Collaborative filtering test AUC: %s' % test_auc)
This is terrible: we do worse than random! This is not very surprising: as there is no training data for the majority of the test questions, the model cannot compute reasonable representations of the test set items.
The fact that we score them lower than other items (AUC < 0.5) is due to estimated per-item biases, which can be confirmed by setting them to zero and re-evaluating the model.
In [6]:
# Set biases to zero
model.item_biases *= 0.0
test_auc = auc_score(model, test, train_interactions=train, num_threads=NUM_THREADS).mean()
print('Collaborative filtering test AUC: %s' % test_auc)
The obvious solution to the cold-start problem is to use item features. Fortunately, the dataset comes with such features: the tags users have applied to each question. Let's have a look at them.
In [7]:
item_features = data['item_features']
tag_labels = data['item_feature_labels']
print('There are %s distinct tags, with values like %s.' % (item_features.shape[1], tag_labels[:3].tolist()))
We can use these features (instead of an identity feature matrix like in a pure CF model) to estimate a model which will generalize better to unseen examples: it will simply use its representations of item features to infer representations of previously unseen questions.
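For comparison, when no item features are supplied (as in the pure collaborative filtering model above), LightFM falls back to an identity feature matrix: each question gets a single indicator feature of its own. A quick sketch of the difference:
# Sketch: the implicit feature matrix of the pure CF model vs. the tag
# feature matrix we'll use for the hybrid model.
import scipy.sparse as sp

identity_features = sp.identity(train.shape[1], format='csr')

print('Pure CF (identity) item features: %s x %s' % identity_features.shape)
print('Tag item features:                %s x %s' % item_features.shape)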
Let's go ahead and fit a model of this type.
In [8]:
# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                  item_features=item_features,
                  epochs=NUM_EPOCHS,
                  num_threads=NUM_THREADS)
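Under the hood, the hybrid model represents each question as the sum of the embeddings of its tags, which is what lets it handle questions it has never seen interactions for. A minimal sketch of that idea (the question id below is purely illustrative):
# Sketch: rebuild one question's latent representation from its tag embeddings.
example_item = 0  # illustrative question id

item_representation = item_features.tocsr()[example_item].dot(model.item_embeddings)
item_bias = item_features.tocsr()[example_item].dot(model.item_biases)

print('Latent representation shape: %s' % (item_representation.shape,))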
As before, let's sanity check the model on the training set.
In [9]:
# Don't forget to pass in the item features again!
train_auc = auc_score(model,
                      train,
                      item_features=item_features,
                      num_threads=NUM_THREADS).mean()
print('Hybrid training set AUC: %s' % train_auc)
Note that the training set AUC is lower than in a pure CF model. This is fine: by using a lower-rank item feature matrix, we have effectively regularized the model, giving it less freedom to fit the training data.
Despite this, the model does much better on the test set:
In [10]:
test_auc = auc_score(model,
                     test,
                     train_interactions=train,
                     item_features=item_features,
                     num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)
This is as expected: because items in the test set share tags with items in the training set, we can provide better test set recommendations by using the tag representations learned from training.
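With the item features in hand, we can also produce actual recommendations for cold-start questions using model.predict. As a sketch (the user id below is purely illustrative):
# Sketch: score every question for one user and show the top-ranked ids.
example_user = 0  # illustrative user id

scores = model.predict(example_user,
                       np.arange(train.shape[1]),
                       item_features=item_features,
                       num_threads=NUM_THREADS)
top_questions = np.argsort(-scores)[:5]
print('Top question ids for user %s: %s' % (example_user, top_questions))
In a real system we would also filter out questions the user has already answered, just as the evaluation code above excludes the training interactions.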
One of the nice properties of the hybrid model is that the estimated tag embeddings capture semantic characteristics of the tags. As with word2vec embeddings, we can use this property to explore semantic tag similarity:
In [11]:
def get_similar_tags(model, tag_id):
    # Define similarity as the cosine of the angle
    # between the tag latent vectors.
    # Normalize the vectors to unit length.
    tag_embeddings = (model.item_embeddings.T
                      / np.linalg.norm(model.item_embeddings, axis=1)).T

    query_embedding = tag_embeddings[tag_id]
    similarity = np.dot(tag_embeddings, query_embedding)
    most_similar = np.argsort(-similarity)[1:4]

    return most_similar

for tag in (u'bayesian', u'regression', u'survival'):
    tag_id = tag_labels.tolist().index(tag)
    print('Most similar tags for %s: %s' % (tag_labels[tag_id],
                                            tag_labels[get_similar_tags(model, tag_id)]))