Getting the data

The first step is to get the movielens data.

Let's import the utility functions from data.py:



In [1]:

    
import data

The following functions get the dataset, and save it to a local file, and parse it into sparse matrices we can pass into LightFM.

In particular, _build_interaction_matrix constructs the interaction matrix: a (no_users, no_items) matrix with 1 in place of positive interactions, and -1 in place of negative interactions. For this experiment, any rating lower than 4 is a negative rating.



In [3]:

    
import inspect



In [7]:

    
print(inspect.getsource(data._build_interaction_matrix))









    



def _build_interaction_matrix(rows, cols, data):
    """
    Build the training matrix (no_users, no_items),
    with ratings >= 4.0 being marked as positive and
    the rest as negative.
    """

    mat = sp.lil_matrix((rows, cols), dtype=np.int32)

    for uid, iid, rating, timestamp in data:
        if rating >= 4.0:
            mat[uid, iid] = 1.0
        else:
            mat[uid, iid] = -1.0

    return mat.tocoo()

Let's run it! The dataset will be automatically downloaded and processed.



In [8]:

    
train, test = data.get_movielens_data()

Let's check the matrices.



In [9]:

    
train









    Out[9]:





<944x1683 sparse matrix of type '<type 'numpy.int32'>'
	with 90570 stored elements in COOrdinate format>



In [10]:

    
test









    Out[10]:





<944x1683 sparse matrix of type '<type 'numpy.int32'>'
	with 9430 stored elements in COOrdinate format>

Looks good and ready to go.

Fitting the model

Let's import the lightfm model.



In [11]:

    
from lightfm import LightFM



In [12]:

    
model = LightFM(no_components=30)

In this case, we set the latent dimensionality of the model to 30. Fitting is straightforward.



In [23]:

    
model.fit(train, epochs=50)









    Out[23]:





<lightfm.lightfm.LightFM at 0x7feac361c190>

Let's try to get a handle on the model accuracy using the ROC AUC score.



In [24]:

    
from sklearn.metrics import roc_auc_score

train_predictions = model.predict(train.row,
                                  train.col)



In [25]:

    
train_predictions









    Out[25]:





array([ 0.57735386,  0.12810806,  0.70434413, ...,  0.37278502,
        0.1001321 ,  0.07673392])



In [26]:

    
roc_auc_score(train.data, train_predictions)









    Out[26]:





0.98793016085665009

We've got very high accuracy on the train dataset; let's check the test set.



In [27]:

    
test_predictions = model.predict(test.row, test.col)



In [28]:

    
roc_auc_score(test.data, test_predictions)









    Out[28]:





0.72499325915332191

The accuracy is much lower on the test data, suggesting a high degree of overfitting. We can combat this by regularizing the model.



In [34]:

    
model = LightFM(no_components=30, user_alpha=0.0001, item_alpha=0.0001)
model.fit(train, epochs=50)
roc_auc_score(test.data, model.predict(test.row, test.col))









    Out[34]:





0.76052953487950203

A modicum of regularization gives much better results.

Using metadata

The promise of lightfm is the possibility of using metadata in cold-start scenarios. The Movielens dataset has genre data for the movies it contains. Let's use that to train the LightFM model.

The get_movielens_item_metadata function constructs a (no_items, no_features) matrix containing features for the movies; if we use genres this will be a (no_items, no_genres) feature matrix.



In [36]:

    
item_features = data.get_movielens_item_metadata(use_item_ids=False)
item_features









    Out[36]:





<1683x19 sparse matrix of type '<type 'numpy.int32'>'
	with 2893 stored elements in LInked List format>

We need to pass these to the fit method in order to use them.



In [37]:

    
model = LightFM(no_components=30, user_alpha=0.0001, item_alpha=0.0001)
model.fit(train, item_features=item_features, epochs=50)
roc_auc_score(test.data, model.predict(test.row, test.col, item_features=item_features))









    Out[37]:





0.67178594791630175

This is not as accurate as a pure collaborative filtering solution, but should enable us to make recommendations new movies.

If we add item-specific features back, we should get the original accuracy back.



In [38]:

    
item_features = data.get_movielens_item_metadata(use_item_ids=True)
item_features
model = LightFM(no_components=30, user_alpha=0.0001, item_alpha=0.0001)
model.fit(train, item_features=item_features, epochs=50)
roc_auc_score(test.data, model.predict(test.row, test.col, item_features=item_features))









    Out[38]:





0.75693132377857264



In [ ]: