In [1]:
import data
The following functions download the dataset, save it to a local file, and parse it into sparse matrices that we can pass to LightFM.
In particular, _build_interaction_matrix constructs the interaction matrix: a (no_users, no_items) matrix with 1 in place of positive interactions, and -1 in place of negative interactions. For this experiment, any rating lower than 4 is a negative rating.
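To make the encoding concrete, here is a minimal sketch of how such a matrix could be built. It is not the source of _build_interaction_matrix (that is printed in the next cells); the helper name build_interactions, its min_positive parameter, and the toy ratings are assumptions made for illustration.

```python
import numpy as np
import scipy.sparse as sp


def build_interactions(ratings, no_users, no_items, min_positive=4):
    """Hypothetical helper: ratings is an iterable of (user_id, item_id, rating)."""
    rows, cols, values = [], [], []
    for user_id, item_id, rating in ratings:
        rows.append(user_id)
        cols.append(item_id)
        # Ratings of min_positive or more count as positive, the rest as negative.
        values.append(1 if rating >= min_positive else -1)
    return sp.coo_matrix((values, (rows, cols)),
                         shape=(no_users, no_items), dtype=np.int32)


# Toy data, not taken from MovieLens.
toy_ratings = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 0, 1)]
interactions = build_interactions(toy_ratings, no_users=3, no_items=3)
print(interactions.toarray())
```

Unrated (user, item) pairs are simply not stored, so they read back as 0 when the matrix is densified.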
In [3]:
import inspect
In [7]:
print(inspect.getsource(data._build_interaction_matrix))
Let's run it! The dataset will be automatically downloaded and processed.
In [8]:
train, test = data.get_movielens_data()
Let's check the matrices.
In [9]:
train
Out[9]:
In [10]:
test
Out[10]:
Looks good and ready to go.
In [11]:
from lightfm import LightFM
In [12]:
model = LightFM(no_components=30)
In this case, we set the latent dimensionality of the model to 30. Fitting is straightforward.
In [23]:
model.fit(train, epochs=50)
Out[23]:
Let's try to get a handle on the model accuracy using the ROC AUC score.
In [24]:
from sklearn.metrics import roc_auc_score
train_predictions = model.predict(train.row, train.col)
In [25]:
train_predictions
Out[25]:
In [26]:
roc_auc_score(train.data, train_predictions)
Out[26]:
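As a reminder of what this score measures: ROC AUC is the probability that a randomly chosen positive interaction receives a higher predicted score than a randomly chosen negative one. The sketch below computes it by brute force over all pairs. The function pairwise_auc and the toy labels and scores are made up for illustration; for real data, roc_auc_score is the tool to use.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == -1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Toy labels in the same 1 / -1 encoding as the interaction matrix.
toy_labels = [1, -1, 1, -1]
toy_scores = [0.9, 0.8, 0.4, 0.1]
auc = pairwise_auc(toy_labels, toy_scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the score is 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.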
The AUC on the training set is very high; let's check the test set.
In [27]:
test_predictions = model.predict(test.row, test.col)
In [28]:
roc_auc_score(test.data, test_predictions)
Out[28]:
The AUC is much lower on the test data, suggesting a high degree of overfitting. We can combat this by regularizing the model.
In [34]:
model = LightFM(no_components=30, user_alpha=0.0001, item_alpha=0.0001)
model.fit(train, epochs=50)
roc_auc_score(test.data, model.predict(test.row, test.col))
Out[34]:
A modicum of regularization gives much better results.
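The user_alpha and item_alpha arguments are L2 penalties on the user and item embeddings. To see why a penalty helps, here is a rough sketch of a single plain SGD step with an L2 term. LightFM's own optimizer is adaptive and differs in detail, so the function sgd_step and its learning rate are illustrative assumptions only.

```python
import numpy as np


def sgd_step(embedding, gradient, learning_rate=0.05, alpha=0.0):
    """One gradient step on loss + (alpha / 2) * ||embedding||^2."""
    return embedding - learning_rate * (gradient + alpha * embedding)


embedding = np.array([2.0, -1.0, 0.5])
no_gradient = np.zeros(3)

# Without a penalty, a zero data gradient leaves the embedding untouched.
plain = sgd_step(embedding, no_gradient, alpha=0.0)
# With a penalty, every step also pulls the embedding towards zero.
penalized = sgd_step(embedding, no_gradient, alpha=0.1)

print(np.linalg.norm(plain), np.linalg.norm(penalized))
```

The penalty keeps the embeddings small unless the data consistently pushes them away from zero, which limits how closely the model can fit individual training interactions.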
The promise of LightFM is the possibility of using metadata in cold-start scenarios. The MovieLens dataset has genre data for the movies it contains. Let's use that to train the LightFM model.
The get_movielens_item_metadata function constructs a (no_items, no_features) matrix containing features for the movies; if we use genres, this will be a (no_items, no_genres) feature matrix.
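The reason such a matrix helps with cold start is that LightFM represents each item as the sum of the latent vectors of its features. The sketch below illustrates the idea with a dense toy matrix. The genre list, the movies, and the embedding values are all made up, and a real feature matrix would be sparse.

```python
import numpy as np

# Toy genre vocabulary and movies; not the actual MovieLens data.
genres = ["Action", "Comedy", "Drama"]
movie_genres = [["Action"], ["Comedy", "Drama"], ["Action", "Drama"]]

# (no_items, no_genres) indicator matrix: 1 where a movie has the genre.
item_features = np.zeros((len(movie_genres), len(genres)))
for item_id, names in enumerate(movie_genres):
    for name in names:
        item_features[item_id, genres.index(name)] = 1.0

# Hypothetical learned embeddings, one 2-dimensional vector per genre.
genre_embeddings = np.array([[1.0, 0.0],
                             [0.0, 1.0],
                             [0.5, 0.5]])

# Each movie's representation is the sum of its genres' embeddings.
item_embeddings = item_features @ genre_embeddings
print(item_embeddings)
```

A movie that never appeared in training still gets a representation this way, as long as its genres are known.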
In [36]:
item_features = data.get_movielens_item_metadata(use_item_ids=False)
item_features
Out[36]:
We need to pass these to the fit method (and again to predict) in order to use them.
In [37]:
model = LightFM(no_components=30, user_alpha=0.0001, item_alpha=0.0001)
model.fit(train, item_features=item_features, epochs=50)
roc_auc_score(test.data, model.predict(test.row, test.col, item_features=item_features))
Out[37]:
This is not as accurate as a pure collaborative filtering solution, but it should enable us to make recommendations for new movies.
If we add item-specific features back, we should get the original accuracy back.
In [38]:
item_features = data.get_movielens_item_metadata(use_item_ids=True)
item_features
model = LightFM(no_components=30, user_alpha=0.0001, item_alpha=0.0001)
model.fit(train, item_features=item_features, epochs=50)
roc_auc_score(test.data, model.predict(test.row, test.col, item_features=item_features))
Out[38]:
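One way to picture "adding item-specific features back" is to place an identity block next to the genre columns, so that every movie has one indicator feature of its own in addition to its genres. The sketch below assumes this layout; the actual column order and construction inside get_movielens_item_metadata may differ, and the toy genre matrix is made up.

```python
import numpy as np
import scipy.sparse as sp

no_items = 3
# Toy (no_items, no_genres) genre indicators; not the actual MovieLens data.
genre_features = sp.csr_matrix(np.array([[1, 0],
                                         [0, 1],
                                         [1, 1]], dtype=np.float32))

# One indicator column per item, followed by the genre columns.
id_features = sp.identity(no_items, format="csr", dtype=np.float32)
combined = sp.hstack([id_features, genre_features]).tocsr()

print(combined.toarray())
```

With the identity block in place, the model can learn a separate embedding for each movie, as in pure collaborative filtering, while the genre columns remain available for movies with few or no interactions.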