Using theano_bpr for recommendations

Bayesian Personalised Ranking (BPR) is a state-of-the-art algorithm for predicting personalised preferences over a range of items from implicit feedback. One of the main differences from other machine learning approaches to item recommendation is that it doesn't assume that unseen items are items a user didn't like.
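
The core idea can be summarised by the BPR-Opt criterion from Rendle et al. (2009): for each user $u$, an observed item $i$ should be scored higher than an unobserved item $j$. With $\hat{x}_{ui}$ the model's score for a user-item pair, $\sigma$ the logistic sigmoid and $\Theta$ the model parameters, BPR maximises

$$\text{BPR-Opt} = \sum_{(u,i,j) \in D_S} \ln \sigma(\hat{x}_{ui} - \hat{x}_{uj}) \; - \; \lambda_\Theta \lVert \Theta \rVert^2$$

where $D_S$ is the set of (user, observed item, unobserved item) triples.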

BPR is also used for a number of other tasks, such as tag and link prediction or matrix completion.

Loading data

We start by loading training and testing data from the MovieLens dataset. The theano_bpr library provides a small utility function for that, which extracts all user-item pairs whose rating is above a given threshold and maps user and item identifiers to coordinates in a user-item matrix.
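
Conceptually, the mapping works roughly like the sketch below (an illustrative Python sketch, not theano_bpr's actual code; the ratings iterable, the function name and the exact threshold comparison are assumptions):

def to_implicit_pairs(ratings, threshold):
    # ratings: iterable of (user_id, item_id, rating) tuples -- hypothetical input
    users_to_index, items_to_index, data = {}, {}, []
    for user_id, item_id, rating in ratings:
        if rating <= threshold:  # keep only ratings above the threshold
            continue
        # map raw identifiers to dense 0-based coordinates
        u = users_to_index.setdefault(user_id, len(users_to_index))
        i = items_to_index.setdefault(item_id, len(items_to_index))
        data.append((u, i))
    return data, users_to_index, items_to_index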


In [1]:
from theano_bpr.utils import load_data_from_movielens

In [2]:
training_data, users_to_index, items_to_index = load_data_from_movielens('http://files.grouplens.org/datasets/movielens/ml-100k/ua.base', 3)

This function returns three things:

  • Entries in the user-item matrix

In [3]:
training_data[0:5]


Out[3]:
[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4)]
  • A mapping from user identifiers to coordinates

In [4]:
users_to_index.keys()[0:10]


Out[4]:
['344', '345', '346', '347', '340', '341', '342', '343', '810', '811']

In [5]:
users_to_index['811']


Out[5]:
809
  • A mapping from item identifiers (in our case movie identifiers) to coordinates

In [6]:
items_to_index.keys()[0:10]


Out[6]:
['344', '345', '346', '347', '340', '341', '342', '343', '348', '349']

In [7]:
items_to_index['349']


Out[7]:
1090

We now load the testing data in the same way. The only difference is the two extra arguments, which pass in the user and item mappings generated from the training data, so that the test set uses the same coordinates.


In [8]:
testing_data, users_to_index, items_to_index = load_data_from_movielens('http://files.grouplens.org/datasets/movielens/ml-100k/ua.test', 3, users_to_index, items_to_index)

In [9]:
testing_data[0:5]


Out[9]:
[(0, 820), (0, 442), (0, 513), (0, 446), (0, 658)]

We now have data for training our model, and data to test the trained model. We import the BPR class from the theano_bpr library, an implementation of Bayesian Personalised Ranking for matrix factorisation built on the Theano library.
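
In a matrix factorisation model, the predicted preference of user $u$ for item $i$ is simply the dot product of their latent vectors, $\hat{x}_{ui} = \mathbf{w}_u \cdot \mathbf{h}_i$; BPR learns those vectors so that the pairwise ranking criterion above is maximised.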

Training a BPR model


In [10]:
from theano_bpr import BPR

We create a new BPR object. The required initialisation arguments are:

  • The number of latent dimensions in our matrix factorisation model
  • The total number of users
  • The total number of items

In [18]:
bpr = BPR(10, len(users_to_index.keys()), len(items_to_index.keys()))

We now train our BPR matrix factorisation model on the training data, specifying the number of passes (epochs) over the full training set at this stage. The actual learning is done by stochastic gradient descent over randomly sampled batches of training triples, with uniform user sampling to make sure we don't skew the model towards very active users.

For training, we recommend using a GPU or multithreaded OpenBLAS.
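
To illustrate what one such stochastic gradient step might look like for a matrix factorisation model, here is a rough NumPy sketch (illustrative only; the function, variable names and hyperparameter values are assumptions, and theano_bpr's actual Theano implementation will differ):

import numpy as np

def bpr_sgd_step(user_factors, item_factors, seen_items, n_items,
                 learning_rate=0.05, reg=0.002):
    # Sample a user uniformly, one of their observed items as the positive,
    # and an item they haven't interacted with as the negative.
    u = np.random.randint(user_factors.shape[0])
    positives = list(seen_items[u])
    i = positives[np.random.randint(len(positives))]
    j = np.random.randint(n_items)
    while j in seen_items[u]:
        j = np.random.randint(n_items)
    # Gradient ascent on ln sigmoid(x_ui - x_uj), with L2 regularisation.
    wu, hi, hj = user_factors[u].copy(), item_factors[i].copy(), item_factors[j].copy()
    x_uij = wu.dot(hi - hj)
    g = 1.0 / (1.0 + np.exp(x_uij))  # sigmoid(-x_uij)
    user_factors[u] += learning_rate * (g * (hi - hj) - reg * wu)
    item_factors[i] += learning_rate * (g * wu - reg * hi)
    item_factors[j] += learning_rate * (-g * wu - reg * hj)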


In [19]:
bpr.train(training_data, epochs=50)


Generating 2495300 random training samples
Processed 2495000 ( 99.99% ) in 0.0179 seconds
Total training time 53.76 seconds; 2.154323e-05 per sample

Testing a BPR model

We can now evaluate how well our trained model is doing on our test data: how accurately can it rank the held-out user interactions in the testing dataset? The metric we use to quantify that is the Area Under the ROC Curve (AUC).
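
Intuitively, the AUC here is the probability that a randomly chosen held-out item is ranked above a randomly chosen item the user never interacted with. A per-user sketch of that quantity (with hypothetical names, not the library's internals) could look like:

def user_auc(scores, test_items, train_items, n_items):
    # scores: predicted score for every item for this user -- hypothetical input
    negatives = [k for k in range(n_items)
                 if k not in test_items and k not in train_items]
    hits, total = 0, 0
    for pos in test_items:
        for neg in negatives:
            total += 1
            if scores[pos] > scores[neg]:
                hits += 1
    return hits / float(total) if total else 0.0

An AUC evaluation typically reports the mean of such per-user values.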


In [20]:
bpr.test(testing_data)


Current AUC mean (900 samples): 0.91430
Out[20]:
0.91428679132906043
