Demonstration of a simple evaluation pipeline

Polara supports various evaluation regimes and can be tuned flexibly to achieve the setup you need.

Note that particular evaluation settings may not be directly supported by some models and may require certain model modifications.

For example, matrix factorization models are not directly applicable in the warm-start scenario (when test users are not present in the training set) unless the folding-in technique is implemented for generating recommendations. Keep this in mind when creating your custom solutions.
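
For PureSVD-like models, folding-in has a simple closed form: given the item factor matrix V learned on the training data, the scores for an unseen user with preference vector p can be obtained as V V^T p. Below is a minimal NumPy sketch of this idea; the names item_factors and new_user_vector are illustrative and not part of Polara's API.

import numpy as np

def fold_in_scores(item_factors, new_user_vector):
    # project the unseen user onto the latent item space and map back to item scores:
    # scores = V @ (V.T @ p)
    latent = item_factors.T @ new_user_vector
    return item_factors @ latent

# hypothetical usage:
# scores = fold_in_scores(V, p_new)    # V has shape (n_items, rank), p_new has shape (n_items,)
# top10 = np.argsort(-scores)[:10]     # indices of the 10 highest-scored items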

In this example we demonstrate basic evaluation of two well-known factorization models - `PureSVD` and `iALS` - in both standard and warm-start settings.

The models can be tested in both scenarios without any modification, thanks to Polara's automatic handling of the warm-start case.

Preparing the MovieLens-1M data


In [1]:
from polara.recommender.data import RecommenderData
from polara.datasets.movielens import get_movielens_data

In [2]:
data = get_movielens_data() # will automatically download it, or you can specify a path to the local copy
data.head()


Out[2]:
userid movieid rating
0 1 1193 5
1 1 661 3
2 1 914 3
3 1 3408 4
4 1 2355 5

In [3]:
data.shape


Out[3]:
(1000209, 3)

In [4]:
data_model = RecommenderData(data, 'userid', 'movieid', 'rating', seed=0)
data_model.get_configuration()


Out[4]:
{'permute_tops': False,
 'warm_start': True,
 'holdout_size': 3,
 'test_sample': None,
 'shuffle_data': False,
 'random_holdout': False,
 'negative_prediction': False,
 'test_fold': 5,
 'test_ratio': 0.2}

Standard scenario with known users


In [5]:
data_model.random_holdout = True # sample holdout items randomly instead of taking only the top-rated ones; this reduces evaluation bias
data_model.warm_start = False # standard case
data_model.prepare()


Preparing data...
Done.
There are 996585 events in the training and 3624 events in the holdout.

For demonstration purposes, let's check that all test users are indeed present in the training set:


In [6]:
data_model.test.holdout['userid'].isin(data_model.index.userid.training.new).all()


Out[6]:
True

PureSVD


In [7]:
from polara.recommender.models import SVDModel

In [8]:
svd = SVDModel(data_model) # create model
svd.switch_positive = 4 # mark ratings below 4 as negative feedback and treat them accordingly in evaluation
svd.build() # fit model
svd.evaluate() # by default, all groups of metrics are computed


PureSVD training time: 0.149s
Out[8]:
[Relevance(precision=0.34864790286975716, recall=0.2008830022075055, fallout=0.05601545253863134, specifity=0.6227924944812362, miss_rate=0.7287527593818984),
 Ranking(nDCG=0.1426077960282924, nDCL=0.04915993850533),
 Experience(coverage=0.12169454937938479),
 Hits(true_positive=512, false_positive=112, true_negative=1164, false_negative=1836)]

iALS


In [9]:
# the implicit library must be installed separately; follow the instructions at https://github.com/benfred/implicit
from polara.recommender.external.implicit.ialswrapper import ImplicitALS

In [10]:
als = ImplicitALS(data_model) # create model
als.switch_positive = 4 # same as for PureSVD, affects only evaluation
als.build()
als.evaluate()


WARNING:root:Intel MKL BLAS detected. Its highly recommend to set the environment variable 'export MKL_NUM_THREADS=1' to disable its internal multithreading
iALS training time: 1.837s
Out[10]:
[Relevance(precision=0.34864790286975716, recall=0.2015728476821192, fallout=0.06084437086092715, specifity=0.6179635761589404, miss_rate=0.7280629139072847),
 Ranking(nDCG=0.14200497688853128, nDCL=0.055115784933481085),
 Experience(coverage=0.14813815434430652),
 Hits(true_positive=514, false_positive=118, true_negative=1158, false_negative=1834)]
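
The warning above is emitted by the implicit library itself. If you want to follow its advice, the environment variable has to be set before implicit (and the MKL-backed code it relies on) is imported; a small sketch of how this could be done from Python:

import os
os.environ['MKL_NUM_THREADS'] = '1'  # must be set before the first import of implicit
from polara.recommender.external.implicit.ialswrapper import ImplicitALS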

The maximum possible number of correct recommendations is:


In [11]:
data_model.test.holdout.query('rating>=4').shape[0]


Out[11]:
2348
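
Relating the hit counts reported above to this maximum gives the fraction of relevant items the models actually recover (the numbers below are copied from the outputs above):

max_hits = 2348
print(512 / max_hits)   # PureSVD true positives, roughly 0.218
print(514 / max_hits)   # iALS true positives, roughly 0.219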

Both models correctly retrieve slightly more than a fifth of all relevant items. Let's look at the averaged relevance scores:


In [12]:
svd.evaluate('relevance')


Out[12]:
Relevance(precision=0.34864790286975716, recall=0.2008830022075055, fallout=0.05601545253863134, specifity=0.6227924944812362, miss_rate=0.7287527593818984)

In [13]:
als.evaluate('relevance')


Out[13]:
Relevance(precision=0.34864790286975716, recall=0.2015728476821192, fallout=0.06084437086092715, specifity=0.6179635761589404, miss_rate=0.7280629139072847)
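
Since the printed representation suggests that the metrics are returned as namedtuples, the scores of several models can be collected into a single table for a side-by-side comparison; a small sketch under that assumption:

import pandas as pd

scores = {'PureSVD': svd.evaluate('relevance'),
          'iALS': als.evaluate('relevance')}
# namedtuples expose _asdict(), which makes building a comparison table straightforward
pd.DataFrame({name: pd.Series(res._asdict()) for name, res in scores.items()})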

Warm-start scenario

This will split test users from the training data.


In [14]:
data_model.warm_start = True # warm-start case
data_model.prepare()


Preparing data...
19 unique movieid's within 26 testset interactions were filtered. Reason: not in the training data.
1 unique movieid's within 1 holdout interactions were filtered. Reason: not in the training data.
1 of 1208 userid's were filtered out from holdout. Reason: incompatible number of items.
1 userid's were filtered out from testset. Reason: inconsistent with holdout.
Done.
There are 807458 events in the training and 3621 events in the holdout.

There's no intersection between test and training users:


In [15]:
data_model.index.userid.test.old.isin(data_model.index.userid.training.old).any()


Out[15]:
False

Polara makes an effort to preserve data sanity and consistency.

For example, as can be seen from the log messages above, it filters out items that happen to be in the test split but are not part of the training set.
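
The internal index attributes used above can also be inspected directly, for example to see how many users ended up in the warm-start test group versus the training group (relying only on the attributes already shown in this notebook):

# test.old holds the original ids of the warm-start (test-only) users,
# training.old holds the original ids of the training users
len(data_model.index.userid.test.old), len(data_model.index.userid.training.old)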

PureSVD


In [16]:
svd.build()
svd.evaluate()


PureSVD training time: 0.107s
Out[16]:
[Relevance(precision=0.34907484120408727, recall=0.2020160176746755, fallout=0.05661419497376415, specifity=0.6219276442971554, miss_rate=0.7275614471140569),
 Ranking(nDCG=0.1425979962160517, nDCL=0.04954089721828449),
 Experience(coverage=0.12042310821806347),
 Hits(true_positive=515, false_positive=111, true_negative=1164, false_negative=1831)]

Note that you do not have to recreate the models, as they operate on top of the data_model instance.
In fact, state changes in data_model are propagated to the dependent models and will force them to rebuild themselves even if you do not explicitly request it (although being explicit is recommended, in line with the Zen of Python).

iALS


In [17]:
als.evaluate()


WARNING:root:Intel MKL BLAS detected. Its highly recommend to set the environment variable 'export MKL_NUM_THREADS=1' to disable its internal multithreading
iALS model is not ready. Rebuilding.
iALS training time: 1.615s
Out[17]:
[Relevance(precision=0.35183650925158794, recall=0.19994476663904998, fallout=0.05979011322838994, specifity=0.6187517260425297, miss_rate=0.7296326981496823),
 Ranking(nDCG=0.1389285485663426, nDCL=0.05443458388828408),
 Experience(coverage=0.14591809058855437),
 Hits(true_positive=510, false_positive=117, true_negative=1158, false_negative=1836)]

The maximum possible number of correct recommendations is:


In [18]:
data_model.test.holdout.query('rating>=4').shape[0]


Out[18]:
2346

Check relevance scores:


In [19]:
svd.evaluate('relevance')


Out[19]:
Relevance(precision=0.34907484120408727, recall=0.2020160176746755, fallout=0.05661419497376415, specifity=0.6219276442971554, miss_rate=0.7275614471140569)

In [20]:
als.evaluate('relevance')


Out[20]:
Relevance(precision=0.35183650925158794, recall=0.19994476663904998, fallout=0.05979011322838994, specifity=0.6187517260425297, miss_rate=0.7296326981496823)

Final remark

In these experiments we used the default settings for both models, so the results are not necessarily optimal. Also note that the output of SVD is deterministic, while iALS tends to produce varying results spread around some average value.

In order to provide a fair comparison of these models, one has to run a full cross-validation experiment with parameter tuning and confidence interval estimation.

This can also be done easily within Polara and will be covered in a separate guide.
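
As a teaser, a simple grid search over the SVD rank can already be expressed in a few lines. This is only a sketch and assumes that SVDModel exposes the number of latent factors through a rank attribute:

results = {}
for rank in [10, 30, 50]:
    svd.rank = rank       # assumed attribute controlling the number of latent factors
    svd.build()
    results[rank] = svd.evaluate('relevance')

for rank, scores in results.items():
    print(rank, scores)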

