The main goal of this tutorial is to demonstrate how one can use Polara to conduct custom experiments with very specific requirements. Needless to say, it can be useful for reproducing someone else's research as well. Below you'll find a particular example based on one of my favorite papers, "EIGENREC: generalizing PureSVD for effective and efficient top-N recommendations", which helped me see standard SVD-based models in a different light and even led me to my own discoveries. Even though it's not necessary for understanding the material below, I strongly recommend reading the original paper, as it builds on top of clear ideas and contains a very thorough analysis.
The key take-home message from this paper, for me personally, is that SVD can be viewed as a particular case of a more general eigendecomposition problem for a scaled similarity matrix. Based on that insight, the authors of the EigenRec model propose several modifications, which involve tuning the scaling factor as well as the similarity measure in order to improve the model's performance.
Nevertheless, the tutorial allows you to verify the validity of the proposed ideas and creates a convenient playground for further exploration.
As in the previous tutorials, let's download the Movielens dataset. The task should already be familiar to you. This is one of the datasets used in the paper as well. One could use other datasets instead; it wouldn't change anything in the later parts of the tutorial. The main requirement is to have the data in the form of a Pandas dataframe, similar to what is returned by the get_movielens_data function.
In [1]:
from polara import RecommenderData
from polara import get_movielens_data
In [2]:
data = get_movielens_data() # will automatically download it
# alternatively you can specify a path to the local copy as an argument to the function
data.head()
Out[2]:
In [3]:
data_model = RecommenderData(data, 'userid', 'movieid', 'rating', seed=0)
The EigenRec paper follows a specific experimentation setup, mainly based on the settings proposed in another of my favorite papers, Performance of recommender algorithms on top-n recommendation tasks, devoted to the PureSVD model itself. For evaluation purposes, the authors sample 1.4% of all available ratings and additionally shrink the resulting sample by leaving only 5-star ratings. Quote from the paper (Section 4.2.1):
This setup can be easily implemented in Polara with the help of the test_ratio and holdout_size parameters of the RecommenderData instance. It requires a two-step preparation procedure.
The first step is to sample data without filtering top-rated items. The following configuration does the job:
In [4]:
data_model.test_ratio = 0 # do not split dataset into folds, use entire dataset for sampling
data_model.holdout_size = 0.014 # sample this fraction of ratings from data
data_model.random_holdout = True # sample ratings randomly (not just 5-star)
data_model.warm_start = False # allow test users to be part of the training (excluding holdout items)
data_model.prepare() # perform sampling
Mind the test_ratio parameter setting. Together with the test_fold parameter it controls which fraction of the dataset to sample from; 0 means the whole dataset and turns off the data splitting mechanism used by Polara for cross-validation. The value of test_fold has no effect in that case. Also note that by default Polara performs some additional manipulations with the data, like cleaning and reindexing, to transform it into a uniform internal representation for further use. Key actions and their results are reported in the output text, which can be turned off by setting data_model.verbose = False. Here's how to see the final result of sampling:
In [5]:
data_model.test.holdout.head()
Out[5]:
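As an optional sanity check (not part of the original setup), the holdout at this point should contain roughly 1.4% of all ratings:

print(data_model.test.holdout.shape[0] / data.shape[0]) # should be close to 0.014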
The second step is to leave only items with rating 5, as was done in the original paper. The easiest way in our case would be to simply run:
data_model.test.holdout.query('rating==5', inplace=True)
However, in general, you shouldn't manually change the data after it has been processed by Polara, as it may break some internal logic. A more appropriate and safer way to achieve the same is to use the set_test_data method, specifically designed to cover custom configurations:
In [6]:
data_model.set_test_data(holdout=data_model.test.holdout.query('rating==5'), # select only 5-star ratings
warm_start=data_model.warm_start,
reindex=False, # avoid reindexing users and items second time
ensure_consistency=False # do not try to filter out unseen entities (they are already excluded)
# leaving it as True wouldn't change the result but would lead to extra checks
)
Note that we reuse the previously sampled holdout dataset (the $\mathcal{P}$ dataset in the authors' notation), which is already reindexed by Polara's built-in data pre-processing procedure. In order not to lose the index mapping between the internal and external representations of movies and users (stored in the data_model.index attribute), it's very important to set the reindex argument of the set_test_data method to False. Now the data_model.test.holdout dataframe stores the final result, namely the $\mathcal{T}$ dataset:
In [7]:
data_model.test.holdout.head()
Out[7]:
In the simplest case of the EigenRec model, when only the scaling factor is changed, we can go with a very straightforward approach. Instead of computing similarity matrices and solving an eigendecomposition problem, it is sufficient to apply standard SVD to a scaled rating matrix $\tilde R$:
$$ \tilde R = R \, S^{d-1} \approx U\Sigma V^T, $$ where $R$ is an $M \times N$ rating matrix, $S = \text{diag}\{\|r_1\|_2, \dots, \|r_N\|_2\}$ is a diagonal scaling matrix whose non-zero values are raised to a power controlled by the scaling parameter $d$, and $r_i$ denotes the $i$-th column of $R$. Note that due to the orthogonality of columns in the SVD factors, the approximation of $\tilde R$ can be written in an equivalent and more convenient form $\tilde R VV^T$, which can be used to generate recommendations.
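To make the formula concrete, here is a minimal, self-contained sketch (plain NumPy on a toy random matrix, not part of the actual experiments) that scales the columns, computes a truncated SVD of the result, and verifies that $\tilde R VV^T$ coincides with the rank-$k$ approximation $U_k\Sigma_k V_k^T$:

import numpy as np

rng = np.random.RandomState(0)
R = rng.binomial(5, 0.2, size=(100, 50)).astype(float)  # toy "rating" matrix
d = 0.5                                                  # scaling parameter

col_norms = np.linalg.norm(R, axis=0)                    # ||r_i||_2 for every column
weights = np.where(col_norms > 0, col_norms**(d - 1), 0) # diagonal of S^(d-1)
R_tilde = R * weights                                    # column scaling, i.e. R @ S^(d-1)

U, sigma, Vt = np.linalg.svd(R_tilde, full_matrices=False)
k = 10
V = Vt[:k].T                                             # top-k right singular vectors
scores = R_tilde @ V @ V.T                               # \tilde{R} V V^T

assert np.allclose(scores, (U[:, :k] * sigma[:k]) @ Vt[:k]) # same as U_k Sigma_k V_k^T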
In order to calculate the scaled version of the PureSVD approach we can reuse the SVDModel class implemented in Polara. One way to do that is to redefine the build method in an SVDModel subclass. A simpler solution, however, is to directly modify the output of the get_training_matrix method, which is generally available for all models in Polara and is used internally by the SVDModel in particular. This method returns the rating matrix in a sparse format, which is then fed into scipy's truncated SVD implementation within the build method (you can run the SVDModel.build?? command with a double question mark to see it). Assuming we already have a sparse rating matrix, the following function will help to scale it:
In [8]:
import numpy as np # needed below for np.power
from scipy.sparse import diags
from scipy.sparse.linalg import norm as spnorm
def sparse_normalize(matrix, scaling, axis):
'''Function to scale either rows or columns of the sparse rating matrix'''
if scaling == 1: # no scaling (standard SVD case)
return matrix
    norm = spnorm(matrix, axis=axis, ord=2) # compute Euclidean norms of rows or columns
    scaling_weights = np.power(norm, scaling-1, where=norm!=0, out=np.zeros_like(norm)) # zero norms stay zero (avoids uninitialized output)
    scaling_matrix = diags(scaling_weights)
if axis == 0: # scale columns
return matrix.dot(scaling_matrix)
if axis == 1: # scale rows
return scaling_matrix.dot(matrix)
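Here is a quick sanity check of the helper on a tiny handcrafted sparse matrix (toy values, not taken from the dataset). Both columns have Euclidean norm 5, so with $d = 0.5$ every non-zero entry gets divided by $\sqrt{5}$:

import numpy as np
from scipy.sparse import csr_matrix
toy = csr_matrix(np.array([[3., 0.], [4., 5.]]))
print(sparse_normalize(toy, 0.5, axis=0).toarray()) # every non-zero entry divided by sqrt(5) ~ 2.236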
Somewhat more involved modifications are required to generate model predictions, as the prediction procedure is based on an additional sampling of items not previously seen by the test users. Quote from the paper (Section 4.2.1):
This means that we need to generate prediction scores for 1000 randomly selected unseen items in addition to every item from the holdout. Moreover, every set of 1001 items is treated independently of the user it belongs to. Normally, Polara performs evaluation on a per-user basis; however, in this case the logic is different and we have to take care of users with multiple items in the holdout. The line below clearly shows that some test users have as many as 8 items:
In [9]:
data_model.test.holdout.userid.value_counts().max()
Out[9]:
In order to "flatten" the holdout dataset and to independently generate prediction scores for every holdout item (and 1000 of additionally sampled items) we will customize the get_recommendations
method of the SVDModel
class. Below is the support function, that helps to achieve the necessary result. It iterates over all holdout items, randomly samples a predefined amount of previously unrated items and generates prediction scores for them:
In [10]:
def sample_scores_flat(useridx, itemidx, seen_data, all_items, user_factors, item_factors, sample_size=1000, random_state=None):
'''Function to randomly sample unrated items and generate prediction scores for them.'''
scores = []
for user, items in itemidx.groupby(useridx): # iterate over every test user and get all user items
seen_items = seen_data[1][seen_data[0]==user].tolist() # list of the previously rated items of the user
seen_items.extend(items.tolist()) # take holdout items into account as well
item_pool = all_items[~all_items.isin(seen_items)] # exclude seen items from all available items
for item in items:
sampled_items = item_pool.sample(n=sample_size, random_state=random_state)
scores.append(item_factors[sampled_items.values, :].dot(user_factors[user, :]))
return scores
The code above complies with this definition by expecting user_factors to be the product $V^T r$ for the set of test users and item_factors to be $V$ itself. Below you can find a full implementation of our new model.
In [11]:
import numpy as np
import pandas as pd
from polara import SVDModel
class ScaledSVD(SVDModel):
'''Class that adds scaling functionality to the PureSVD model'''
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
        self.col_scaling = 1 # scaling parameter d; this initial value corresponds to PureSVD
self.n_rnd_items = 1000 # number of randomly sampled items
self.seed = 0 # to control randomization
self.method = 'ScaledSVD'
def get_training_matrix(self, *args, **kwargs):
svd_matrix = super().get_training_matrix(*args, **kwargs) # get sparse rating matrix
return sparse_normalize(svd_matrix, self.col_scaling, 0)
def get_recommendations(self):
holdout = self.data.test.holdout
        itemid = self.data.fields.itemid # "movieid" in the case of the Movielens dataset
        userid = self.data.fields.userid # "userid" in the case of the Movielens dataset
itemidx = holdout[itemid] # holdout items of the test users
useridx = pd.factorize(holdout[userid])[0] # have to "rebase" user index;
# necessary for indexing rows of the matrix with test user ratings
# prediction scores for holdout items
test_matrix, seen_data = self.get_test_matrix()
item_factors = self.factors[itemid] # right singular vectors, matrix V
user_factors = test_matrix.dot(item_factors) # according to product V^T r for every test user
holdout_scores = (user_factors[useridx, :] * item_factors[itemidx.values, :]).sum(axis=1).squeeze()
# scores for randomly sampled unseen items
all_items = self.data.index.itemid.new # all unique (reindexed) items
rs = np.random.RandomState(self.seed) # fixing random state to control random output
sampled_scores = sample_scores_flat(useridx, itemidx,
seen_data, all_items,
user_factors, item_factors,
self.n_rnd_items, random_state=rs)
# combine all scores and rank selected items
scores = np.concatenate((holdout_scores[:, None], sampled_scores), axis=1) # stack into array with 1001 columns
rankings = np.apply_along_axis(np.argsort, 1, -scores)
return rankings
The model is ready and can be used in a standard way:
In [12]:
svd = ScaledSVD(data_model) # create model
svd.rank = 50
svd.col_scaling = 0.5
svd.build() # fit model
Now that we have our model computed, it's time to evaluate it. However, we cannot use the built-in evaluation routine directly. Normally, the number of test users is equal to the number of rows in the recommendations array, and that's the logic Polara relies on. In our case the number of test users is lower than the number of rows in the recommendations array, which actually corresponds to the total number of ratings in the holdout:
In [13]:
# if you run the cell for the first time you'll notice a short delay before print output due to calculation of recommendations
print('# of test users:', data_model.test.holdout.userid.nunique())
print('# of rows and columns in recommendations array:', svd.recommendations.shape)
print('# of ratings in the holdout:', data_model.test.holdout.shape[0])
We will fix this inconsistency in the next section.
It is worth noting here that Polara implements a unified system of callbacks, which resets the svd.recommendations property whenever either the data_model or the model itself is changed in a way that affects the model's output (try, for example, calling svd.recommendations, then setting the rank of the model to some higher value and calling svd.recommendations again). This mechanism helps ensure a predictable and consistent state and prevents accidental reuse of cached results during experiments.
It can also be extended with user-defined triggers, which is probably a topic for another tutorial.
When you try to evaluate your model, it calls the model.recommendations property, which is automatically filled with the result of the get_recommendations method. The simplest way to evaluate the result in accordance with the new structure of the recommendations array is to define a small function as shown below:
In [14]:
def evaluate_mrr(model):
'''Function to calculate MRR score.'''
is_holdout = model.recommendations==0 # holdout items are always in the first column before sorting
pos = np.where(is_holdout)[1] + 1.0 # position of holdout items (indexing starts from 0, so adding 1)
mrr = np.reciprocal(pos).mean() # mean reciprocal rank
return mrr
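To make the metric concrete, here is a tiny toy illustration (a hypothetical ranking array with 3 holdout items and 3 sampled candidates each; not part of the experiment). The holdout item (original index 0) ends up at positions 1, 3 and 2 after sorting, so MRR = (1/1 + 1/3 + 1/2) / 3 ≈ 0.61:

toy_rankings = np.array([[0, 2, 1, 3],
                         [1, 3, 0, 2],
                         [2, 0, 3, 1]])
toy_pos = np.where(toy_rankings == 0)[1] + 1.0 # positions of the holdout items
print(np.reciprocal(toy_pos).mean()) # ~0.611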
Finally, to compute the MRR score, as it is done in the original paper, simply run:
In [15]:
evaluate_mrr(svd)
Out[15]:
While the previously described approach is fully functional and easy, in some cases you may want to use the built-in model.evaluate method, as it provides additional functionality. It is also useful to see how Polara can be customized to serve specific needs. The key ingredient here is control over the type of entities that are recommended. By default, Polara expects items to be recommended to users and looks for the corresponding fields in the test data. These fields are defined via the data_model.fields.userid and data_model.fields.itemid attributes respectively. The default behavior, however, can be redefined at the model level by setting the model._prediction_key (users by default) and model._prediction_target (items by default) attributes to custom values. This scheme, for example, can be utilized in cold start experiments, where the task is to find users potentially interested in a "cold" item instead of recommending items to users (see polara.recommender.coldstart for implementation details). The following lines show how to change the default settings for our needs:
In [16]:
svd._prediction_key = 'xuser'
svd._prediction_target = 'xitem'
Now, we need to specify the corresponding fields in the holdout data. Recall that our goal is to treat every item in the holdout independently of the user or, in other words, to assign every item to a unique "virtual" user ('xuser'). Furthermore, by construction, prediction scores for holdout items are located in the first column of the recommendations array. This means that every holdout item ('xitem') should have index 0. Here's the necessary modification:
In [17]:
data_model.test.holdout['xuser'] = np.arange(data_model.test.holdout.shape[0]) # number of rated items defines the range
data_model.test.holdout['xitem'] = 0
Let's check that the result is the same (up to a small rounding error due to different calculation schemes):
In [18]:
svd.evaluate('ranking', simple_rates=True) # `simple_rates` is used to enforce calculation of MRR
Out[18]:
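If you want to see both numbers side by side, you can print them together, reusing the helper defined earlier:

print(evaluate_mrr(svd), svd.evaluate('ranking', simple_rates=True).mrr) # should agree up to rounding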
Fewer lines of code typically mean fewer risks of bugs or inconsistencies. By following a certain protocol, Polara provides a high-level interface that abstracts away many technical aspects, allowing you to focus on the most important parts of your research.
The next task is to repeat the experiments from the EigenRec paper, where the authors compute the MRR score over a range of scaling parameter values and ranks.
The beauty of SVD-based models is that it is much easier to perform a grid search for finding optimal values of hyper-parameters. Once you have computed a model for a certain set of hyper-parameters with some rank value $k$, you can quickly obtain all other models of rank $k' < k$ without recomputing the SVD.
This not only allows us to perform experiments faster, but also simplifies the code for them. Moreover, SVDModel already has the necessary rank-check procedures, which avoid rebuilding the model when the user sets a smaller rank value. No special actions are required here. Below is the code that implements the grid search experiment, taking that feature into account (note that on moderate hardware the code will run for approximately half an hour):
In [20]:
try:
from ipypb import track
except ImportError:
from tqdm import tqdm_notebook as track
%matplotlib inline
svd_mrr_flat = {} # will store results here
svd.verbose = False
max_rank = 150
scaling_params = np.arange(-20, 21, 2) / 10 # values of d from -2 to 2 with step 0.2
svd_ranks = range(10, max_rank+1, 10) # ranks from 10 to max_ranks with step 10
for scaling in track(scaling_params):
svd.col_scaling = scaling
svd.rank = max_rank
svd.build()
for rank in list(reversed(svd_ranks)): # iterating over rank values in a descending order
svd.rank = rank # allows to truncate factor matrices without recomputing SVD
svd_mrr_flat[(scaling, rank)] = svd.evaluate('ranking', simple_rates=True).mrr
Now we have the results of the grid search stored in the svd_mrr_flat dictionary. There's one catch that wasn't clear to me at first:
This means that the curve on Figure 1 in the original paper is obtained with a fixed value of rank, corresponding to the optimal point at the top of the curve, and all other points are obtained by only changing the scaling factor. Here's one way to draw it:
In [21]:
result_flat = pd.Series(svd_mrr_flat)
best_d, best_rank = result_flat.idxmax()
best_d, best_rank
Out[21]:
In [22]:
result_flat.xs(best_rank, axis=0, level=1).plot(label='fixed rank', legend=True, title='MRR',
figsize=(4.3, 2), ylim=(0, None), xlim=(-2, 2), grid=True);
Comparing this picture to the bottom left graph of Figure 1 in the original paper leads to a satisfying conclusion: the curves on the two graphs are very close. Of course, there are slight differences; however, there are many factors that may affect them, such as data sampling and the randomization of unrated items. It would be a good idea to repeat the experiment with different seed values and draw a confidence region around the curve; a sketch of how this could be done follows below. In any case, there are no dramatic differences in the general behavior of the curves, which is a very nice result that didn't take too much effort.
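Here is a rough sketch (not executed in this tutorial) of how such a repetition over different seeds might look. It simply repeats the two-step holdout preparation described earlier for several seed values and collects the MRR score of the best configuration found above; drawing a confidence band from mrr_by_seed is left as an exercise:

mrr_by_seed = {}
for seed in range(5):
    dm = RecommenderData(data, 'userid', 'movieid', 'rating', seed=seed)
    dm.verbose = False
    dm.test_ratio = 0          # sample from the entire dataset, as before
    dm.holdout_size = 0.014
    dm.random_holdout = True
    dm.warm_start = False
    dm.prepare()
    dm.set_test_data(holdout=dm.test.holdout.query('rating==5'), # keep 5-star ratings only
                     warm_start=dm.warm_start, reindex=False, ensure_consistency=False)
    model = ScaledSVD(dm)
    model.verbose = False
    model.rank = best_rank
    model.col_scaling = best_d
    model.build()
    mrr_by_seed[seed] = evaluate_mrr(model)

Here are some top-score configurations from the experiment: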
In [23]:
result_flat.sort_values(ascending=False).head()
Out[23]:
The difference between the best result achieved with the EigenRec approach and the standard PureSVD result (that corresponds to the point with scaling parameter equal to 1) is quite large. However, such a comparison is a bit unfair as the restriction on having a fixed value of rank is artificial. We can draw another curve that corresponds to optimal values of both scaling parameter and rank of the decomposition:
In [24]:
result_flat.groupby(level=0).max().plot(label='optimal rank', legend=True)
result_flat.xs(best_rank, axis=0, level=1).plot(label='fixed rank', legend=True, title='MRR',
figsize=(4.3, 2), ylim=(0, None), xlim=(-2, 2), grid=True);
Now the difference is less pronounced. Nevertheless, the EigenRec approach still performs better. Moreover, the difference varies significantly from dataset to dataset, and in some cases it can be much more noticeable. Another degree of freedom here, which may increase the top score, is the maximum rank value used in the grid search. We have manually set it to 150. Let's look at which rank values were used at each point of the curve:
In [25]:
ax = result_flat.groupby(level=0).idxmax().str[1].plot(label='optimal rank value', ls=":", legend=True,
secondary_y=True, c='g')
result_flat.groupby(level=0).max().plot(label='optimal rank experiment', legend=True)
result_flat.xs(best_rank, axis=0, level=1).plot(label='fixed rank experiment', legend=True, title='MRR',
figsize=(4.3, 2), ylim=(0, None), xlim=(-2, 2), grid=True);
Clearly, the values are capped on the left half of the graph, which leaves room for further improvement. I have performed experiments with a higher threshold and was able to achieve a higher top score with a somewhat lower value of the scaling parameter. If you want to see this, simply rerun the grid search experiment with a higher value of the max_rank variable. Be prepared for it to take longer (hint: reduce the search space for the scaling parameter). Anyway, the key conclusion doesn't change: even a simple scaling factor can be advantageous and allows the model to outperform the standard one. This conclusion is supported by many other experiments in the original paper, which we haven't run here.
The main change in the model is basically enclosed in the following line
scaled_matrix = sparse_normalize(svd_matrix, self.col_scaling, 0)
which scales the columns of the rating matrix. However, there's nothing really special about this particular type of scaling, and we could also scale rows instead of, or in addition to, the columns. Scaling rows would help control the contribution of users with either too many or too few rated items. The entire code defining the double-scaled model is listed below:
class DScaledSVD(ScaledSVD):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.row_scaling = 1 # PureSVD config
self.method = 'DScaledSVD'
def get_training_matrix(self, *args, **kwargs):
svd_matrix = super().get_training_matrix(*args, **kwargs)
return sparse_normalize(svd_matrix, self.row_scaling, 1)
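A hypothetical usage sketch is shown below (the parameter values are arbitrary and chosen purely for illustration); it reuses the simple evaluate_mrr helper defined earlier:

dsvd = DScaledSVD(data_model)
dsvd.verbose = False
dsvd.rank = 50
dsvd.col_scaling = 0.4 # column (item popularity) scaling
dsvd.row_scaling = 0.8 # row (user activity) scaling
dsvd.build()
print(evaluate_mrr(dsvd))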
Note that running the grid search experiment with both scaling parameters, row_scaling and col_scaling, will take more time. In my experiments with the Movielens data and the rank value limited to 150 from above, there was only a very weak improvement, which wasn't worth the effort. This, however, may change with higher rank values or at least with other data. I'll leave verifying this to the reader.
Being a researcher myself, I'm often involved in some sort of "redoing" of work that has already been done by someone else. Leaving the reproducibility aspect aside, there are many other reasons why this can be useful: for example, it may help you understand the presented ideas better or reveal subtle points in the original work that are not evident at first glance. It also helps to create a playground ready for further exploration and may even lead to new ideas. I hope that Polara will help you with this as it helps me in my own research.