Bayesian Personalised Ranking (BPR) is a state-of-the-art algorithm for predicting personalised preferences over a range of items from implicit feedback. One of the main differences from other machine learning approaches to item recommendation is that it does not assume that unseen items are items a user disliked: it only assumes that, for a given user, seen items should be ranked above unseen ones.
BPR has also been applied to a number of other tasks, such as tag prediction, link prediction, and matrix completion.
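The core idea can be sketched in plain Python. This is a toy illustration of one BPR matrix-factorisation gradient step, not the theano_bpr implementation: for a sampled triple (user u, seen item i, unseen item j), we do gradient ascent on ln σ(x_ui − x_uj), pushing the seen item's score above the unseen item's.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpr_update(U, V, u, i, j, lr=0.05, reg=0.01):
    """One BPR-MF gradient step on the triple (u, i, j):
    user u is assumed to prefer seen item i over unseen item j."""
    x_uij = sum(uf * (vif - vjf) for uf, vif, vjf in zip(U[u], V[i], V[j]))
    g = sigmoid(-x_uij)  # derivative factor of ln(sigmoid(x_uij))
    for f in range(len(U[u])):
        uf, vif, vjf = U[u][f], V[i][f], V[j][f]
        U[u][f] += lr * (g * (vif - vjf) - reg * uf)
        V[i][f] += lr * (g * uf - reg * vif)
        V[j][f] += lr * (-g * uf - reg * vjf)

# Toy example: 1 user, 2 items, 2 latent factors
random.seed(0)
U = [[random.uniform(-0.1, 0.1) for _ in range(2)]]
V = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(2)]
for _ in range(100):
    bpr_update(U, V, 0, 0, 1)  # user 0 prefers item 0 over item 1

score = lambda u, i: sum(a * b for a, b in zip(U[u], V[i]))
print(score(0, 0) > score(0, 1))  # the preferred item should now score higher
```

Each update increases the score gap between the sampled positive and negative item; repeated over many sampled triples, this optimises the ranking of seen over unseen items for every user.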
We start by loading training and testing data from the MovieLens dataset. The theano_bpr library provides a small utility function for this, which extracts all user-item pairs whose rating is above a given threshold, and maps user and item identifiers to coordinates in a user-item matrix.
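Conceptually, what such a loader does can be sketched as follows. This is a simplified sketch, not the library code (the real function also fetches the file over HTTP); the optional mapping arguments let a later call reuse coordinates built from an earlier one:

```python
def load_pairs(lines, threshold, users_to_index=None, items_to_index=None):
    """Parse MovieLens-style 'user<TAB>item<TAB>rating<TAB>timestamp' lines,
    keep pairs rated above the threshold, and map raw identifiers to
    matrix coordinates (row index for users, column index for items)."""
    users_to_index = users_to_index if users_to_index is not None else {}
    items_to_index = items_to_index if items_to_index is not None else {}
    data = []
    for line in lines:
        user, item, rating = line.split('\t')[:3]
        if int(rating) > threshold:
            # Assign the next free index to any identifier not seen before
            u = users_to_index.setdefault(user, len(users_to_index))
            i = items_to_index.setdefault(item, len(items_to_index))
            data.append((u, i))
    return data, users_to_index, items_to_index

sample = ["1\t10\t5\t0", "1\t20\t2\t0", "2\t10\t4\t0"]
data, u2i, i2i = load_pairs(sample, 3)
print(data)  # [(0, 0), (1, 0)] -- the rating-2 pair is filtered out
```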
In [1]:
from theano_bpr.utils import load_data_from_movielens
In [2]:
training_data, users_to_index, items_to_index = load_data_from_movielens('http://files.grouplens.org/datasets/movielens/ml-100k/ua.base', 3)
This function returns three things: the training data as a list of (user, item) coordinate pairs, a dictionary mapping user identifiers to row indices, and a dictionary mapping item identifiers to column indices.
In [3]:
training_data[0:5]
Out[3]:
In [4]:
list(users_to_index.keys())[0:10]
Out[4]:
In [5]:
users_to_index['811']
Out[5]:
In [6]:
list(items_to_index.keys())[0:10]
Out[6]:
In [7]:
items_to_index['349']
Out[7]:
We now load the testing data in the same way. The only difference is the two extra arguments, which pass in the user and item index mappings built from the training data, so that the same identifiers map to the same matrix coordinates in both datasets.
In [8]:
testing_data, users_to_index, items_to_index = load_data_from_movielens('http://files.grouplens.org/datasets/movielens/ml-100k/ua.test', 3, users_to_index, items_to_index)
In [9]:
testing_data[0:5]
Out[9]:
We now have data for training our model, and data for testing the trained model. We import the BPR class from the theano_bpr library, an implementation of Bayesian Personalised Ranking for matrix factorisation built on the Theano numerical computation library.
In [10]:
from theano_bpr import BPR
We create a new BPR object. The required initialisation arguments are the rank of the factorisation (the number of latent factors), the number of users, and the number of items.
In [18]:
bpr = BPR(10, len(users_to_index.keys()), len(items_to_index.keys()))
We now train our BPR matrix factorisation model on our training data. We can also define the number of passes (epochs) through the entire training set at this stage. The actual learning is done by stochastic gradient descent over randomly sampled batches of training data, with uniform user sampling to make sure the model is not skewed towards very active users.
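Uniform user sampling can be sketched as follows (a hypothetical helper for illustration, not the theano_bpr internals): each training triple is drawn by first picking a user uniformly at random, then one of that user's seen items, then a randomly drawn unseen item.

```python
import random

def sample_triples(training_data, n_items, n_samples, seed=0):
    """Sample (user, seen item, unseen item) triples with uniform user
    sampling, so very active users don't dominate the gradient updates."""
    rng = random.Random(seed)
    seen = {}
    for u, i in training_data:
        seen.setdefault(u, set()).add(i)
    users = list(seen)
    triples = []
    for _ in range(n_samples):
        u = rng.choice(users)            # uniform over users, not over events
        i = rng.choice(sorted(seen[u]))  # a positive (seen) item for this user
        j = rng.randrange(n_items)       # rejection-sample a negative item
        while j in seen[u]:
            j = rng.randrange(n_items)
        triples.append((u, i, j))
    return triples

triples = sample_triples([(0, 0), (0, 1), (1, 2)], n_items=4, n_samples=5)
print(triples)
```

Sampling users uniformly (rather than sampling observed events uniformly) means a user with thousands of views contributes the same number of updates per epoch as a user with a handful.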
For training, we recommend using a GPU or multithreaded OpenBLAS.
In [19]:
bpr.train(training_data, epochs=50)
We can now evaluate how well our trained model does on the test data. How accurately can it predict user viewings in the testing dataset? The metric we use to quantify this is the Area Under the ROC Curve (AUC).
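For implicit feedback, AUC can be read as the probability that a randomly chosen held-out (positive) item is scored above a randomly chosen unseen item. A direct pairwise sketch for a single user (an illustration, not the theano_bpr implementation):

```python
def auc_per_user(scores, test_items, train_items):
    """AUC for one user: the fraction of (held-out positive, unseen negative)
    pairs in which the positive item gets the higher predicted score.
    `scores` is the model's predicted score for every item index."""
    negatives = [i for i in range(len(scores))
                 if i not in test_items and i not in train_items]
    hits = sum(1 for p in test_items for n in negatives
               if scores[p] > scores[n])
    return hits / float(len(test_items) * len(negatives))

# Toy example: 5 items, item 0 held out for testing, item 4 seen in training
scores = [0.9, 0.2, 0.8, 0.1, 0.5]
print(auc_per_user(scores, test_items={0}, train_items={4}))  # 1.0
```

Averaging this quantity over all test users gives the overall AUC; 0.5 corresponds to random ranking and 1.0 to a model that ranks every held-out item above every unseen item.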
In [20]:
bpr.test(testing_data)
Out[20]: