Using Polara for custom evaluation scenarios

Polara is designed to automate the process of model prototyping and evaluation as much as possible. As a part of this, Polara follows a certain data management workflow aimed at maintaining a consistent and predictable internal state.

By default, it implements several conventional evaluation scenarios, fully controlled by a set of configuration parameters. A user does not have to worry about anything beyond setting the appropriate values of these parameters (a complete list of them can be obtained by calling the get_configuration method of a RecommenderData instance). As a result, the input preference data is automatically pre-processed and converted into a convenient representation with independent access to the training and evaluation parts.
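For example, once a RecommenderData instance has been created (as will be done below), the full parameter list can be inspected with a call along these lines (a minimal sketch; the exact output format depends on your Polara version):

config = data_model.get_configuration()  # expected to return the configuration parameters and their current values
print(config)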

This default behaviour, however, can be flexibly manipulated to run custom scenarios with externally provided evaluation data. This flexibility is achieved with the help of the special set_test_data method implemented in the RecommenderData class. This guide demonstrates how to use the configuration parameters in conjunction with this method to cover various customizations.

Prepare data

We will use Movielens-1M data for experimentation. The data will be divided into several parts:

  1. observations, used for training,
  2. holdout, used for evaluating recommendations against the true preferences,
  3. unseen data, used for warm-start scenarios, where test users and their preferences are not a part of the training data.

The last two datasets serve as an imitation of external data sources, which are not a part of the initial data model.
Also note that the holdout dataset contains items of both known and unseen (warm-start) users.


In [1]:
import numpy as np
from polara.datasets.movielens import get_movielens_data

In [2]:
seed = 0
def random_state(seed=seed): # to fix random state in experiments
    return np.random.RandomState(seed=seed)

Downloading the data (alternatively you can provide a path to the local copy of the data as an argument to the function):


In [3]:
data = get_movielens_data()
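If you already have a local copy of the Movielens archive, the call would look roughly like this (the local_file argument name is an assumption here, check the function signature in your Polara version):

data = get_movielens_data(local_file='path/to/ml-1m.zip')  # hypothetical argument name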

Setting aside 5% of the preference data to form the holdout dataset (the remaining 95% is kept in data_sampled):


In [4]:
data_sampled = data.sample(frac=0.95, random_state=random_state()).sort_values('userid')

In [5]:
holdout = data[~data.index.isin(data_sampled.index)]

Make 20% of all users unseen during the training phase:


In [6]:
users, unseen_users = np.split(data_sampled.userid.drop_duplicates().values,
                               [int(0.8*data_sampled.userid.nunique()),])

In [7]:
observations = data_sampled.query('userid in @users')

Scenario 0: building a recommender model without any evaluation

This is the simplest case, which allows you to completely ignore the evaluation phase. It also sets the initial configuration for all further evaluation scenarios.


In [8]:
from polara.recommender.data import RecommenderData
from polara.recommender.models import SVDModel

In [9]:
data_model = RecommenderData(observations, 'userid', 'movieid', 'rating', seed=seed)

We will use the prepare_training_only method instead of the more general prepare method:


In [10]:
data_model.prepare_training_only()


Preparing data...
Done.
There are 766928 events in the training and 0 events in the holdout.

This sets all the required configuration parameters and transforms the data accordingly.

Let's check that the test data is empty,


In [11]:
data_model.test


Out[11]:
TestData(testset=None, holdout=None)

and that the whole input was used as the training part:


In [12]:
data_model.training.shape


Out[12]:
(766928, 3)

In [13]:
observations.shape


Out[13]:
(766928, 3)

Internally, the data was transformed into a certain numeric representation, which Polara relies on:


In [14]:
data_model.training.head()


Out[14]:
userid movieid rating
15 0 2571 4
48 0 1837 5
12 0 2193 4
8 0 575 4
35 0 733 4

In [15]:
observations.head()


Out[15]:
userid movieid rating
15 1 2791 4
48 1 2028 5
12 1 2398 4
8 1 594 4
35 1 783 4

The mapping between external and internal data representations is stored in the `data_model.index` attribute.
The transformation can be disabled by setting the build_index attribute to False before data processing (not recommended).
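For example, the user and item mappings can be previewed directly (a quick sketch using the same index attribute that is examined later in this guide; the old column holds external id's, the new column internal ones):

data_model.index.userid.training.head()  # external ('old') to internal ('new') user id mapping
data_model.index.itemid.head()           # the analogous mapping for items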

You can easily build a recommendation model now:


In [16]:
svd = SVDModel(data_model)
svd.build()


PureSVD training time: 0.128s

However, recommendations cannot be generated, as there is no test data. The following function call will raise an error:

svd.get_recommendations()

Scenario 1: evaluation with pre-specified holdout data for known users

In competitions like the Netflix Prize you may be provided with a dedicated evaluation dataset (a probe set), which contains hidden preference information about known users. In Polara's terms, this is a holdout set.

You can assign this holdout set to the data model by calling the set_test_data method as follows:


In [17]:
data_model.set_test_data(holdout=holdout, warm_start=False)


6 unique movieid's within 6 holdout interactions were filtered. Reason: not in the training data.
1129 unique userid's within 9479 holdout interactions were filtered. Reason: not in the training data.

Mind the warm_start=False argument, which tells Polara to work only with known users. If some users from the holdout are not a part of the training data, they will be filtered out and a corresponding notification message will be displayed (you can turn it off by setting data_model.verbose=False). In this example 1129 users were filtered out, as the holdout set initially contained both known and unknown users.

Note that items not present in the training data are also filtered. This behavior can be changed by setting data_model.ensure_consistency=False (not recommended).
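Both safeguards can be toggled before assigning the test data (shown here only as a sketch and not executed, since the defaults are used throughout this guide):

data_model.verbose = False             # suppress the filtering notifications
data_model.ensure_consistency = False  # keep holdout entries even if absent from the training data (not recommended)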


In [18]:
data_model.test.holdout.userid.nunique()


Out[18]:
4484

The recommendation model can now be evaluated:


In [19]:
svd.switch_positive = 4 # treat ratings below 4 as negative feedback
svd.evaluate()


Out[19]:
[Relevance(precision=0.4671998676068703, recall=0.18790260795025798, fallout=0.0587545652398142, specifity=0.7075255418074651, miss_rate=0.7362722359391265),
 Ranking(nDCG=0.16791941496631102, nDCL=0.07078245692187013),
 Experience(coverage=0.14883215643671918),
 Hits(true_positive=4443, false_positive=1131, true_negative=15870, false_negative=19081)]

In [20]:
data_model.test.holdout.query('rating>=4').shape[0] # maximum number of possible true_positive hits


Out[20]:
23524

In [21]:
svd.evaluate('relevance')


Out[21]:
Relevance(precision=0.4671998676068703, recall=0.18790260795025798, fallout=0.0587545652398142, specifity=0.7075255418074651, miss_rate=0.7362722359391265)
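The full evaluation above also returned Ranking, Experience and Hits tuples, so the other metric groups can presumably be requested individually in the same way (the accepted argument names are an assumption here):

svd.evaluate('ranking')  # e.g. only the nDCG/nDCL part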

Scenario 2: see recommendations for selected known users without evaluation

Polara also allows you to handle cases where you don't have a probe set and the task is simply to generate recommendations for a list of selected test users. The evaluation in that case is to be performed externally.

Let's randomly pick a few test users from all known users (i.e. those who are present in the training data):


In [22]:
test_users = random_state().choice(users, size=5, replace=False)
test_users


Out[22]:
array([4138,  776, 4747, 1966, 4423], dtype=int64)

You can provide this list by setting the test_users argument of the set_test_data method:


In [23]:
data_model.set_test_data(test_users=test_users, warm_start=False)

Recommendations in that case will have the shape number of test users x top-n (top-10 by default).


In [24]:
svd.get_recommendations().shape


Out[24]:
(5, 10)

In [25]:
print((len(test_users), svd.topk))


(5, 10)
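If a different list length is required, the topk attribute can be changed before generating recommendations (a minimal sketch building on the attribute shown above):

svd.topk = 20
svd.get_recommendations().shape  # expected to become (5, 20)
svd.topk = 10  # restore the default used in the rest of this guide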

As the holdout was not provided, its previous state is cleared from the data model:


In [26]:
print(data_model.test.holdout)


None

The order of test user id's in the recommendations matrix may not correspond to their order in the test_users list. The true order can be obtained via the index attribute: the users are sorted in ascending order by their internal index, and this order is used to construct the recommendations matrix.


In [27]:
data_model.index.userid.training.query('old in @test_users')


Out[27]:
old new
775 776 775
1965 1966 1965
4137 4138 4137
4422 4423 4422
4746 4747 4746

In [28]:
test_users


Out[28]:
array([4138,  776, 4747, 1966, 4423], dtype=int64)
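Using this ordering, the rows of the recommendations matrix can be mapped back to the original user id's (a minimal sketch; it additionally assumes that the matrix entries are internal item indices, translated back here via data_model.index.itemid):

recs = svd.get_recommendations()
user_order = data_model.index.userid.training.query('old in @test_users').sort_values('new')
item_map = data_model.index.itemid.set_index('new')['old']
for uid, row in zip(user_order['old'].values, recs):
    print(uid, item_map.loc[row].values)  # original user id followed by the original movie id's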

Note that there's no need to provide the testset argument in the case of known users. All the information about test users' preferences is assumed to be fully present in the training data, and the following function call will intentionally raise an error:

data_model.set_test_data(testset=some_test_data, warm_start=False)

If the testset contains new (unseen) information, you should consider the warm-start scenarios described below.

Scenario 3: see recommendations for unseen users without evaluation

Let's form a dataset with new users and their preferences:


In [29]:
unseen_data = data_sampled.query('userid in @unseen_users')
unseen_data.shape


Out[29]:
(183271, 3)

In [30]:
assert unseen_data.userid.nunique() == len(unseen_users)
print(len(unseen_users))


1208

None of these users are present in the training data:


In [31]:
data_model.index.userid.training.old.isin(unseen_users).any()


Out[31]:
False

In order to generate recommendations for these users, we assign the dataset of their preferences as a testset (mind the warm_start argument value):


In [32]:
data_model.set_test_data(testset=unseen_data, warm_start=True)


18 unique movieid's within 26 testset interactions were filtered. Reason: not in the training data.

As we use an SVD-based model, there is no need for any modifications to generate recommendations - it uses the same analytical formula in both the standard and the warm-start regimes:


In [33]:
svd.get_recommendations().shape


Out[33]:
(1208, 10)
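The idea behind that formula is the standard PureSVD folding-in: a new user's preference vector is projected onto the latent item space and back to obtain scores for all items. A toy numpy illustration of this idea (not Polara's actual code; all names and sizes below are made up):

import numpy as np
rng = np.random.RandomState(0)
P = rng.binomial(1, 0.05, size=(3, 100))     # toy preference matrix for 3 unseen users over 100 items
V = rng.normal(size=(100, 10))               # toy rank-10 item-factor matrix standing in for the SVD factors
scores = P @ V @ V.T                         # fold the new users into the latent space and score all items
top10 = np.argsort(-scores, axis=1)[:, :10]  # top-10 internal item indices per user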

Note that internally the unseen_data dataset is transformed: users are reindexed starting from 0 and items are reindexed according to the current item index of the training set.


In [34]:
data_model.test.testset.head()


Out[34]:
userid movieid rating
807503 0 707 3
807519 0 831 5
807532 0 2542 5
807529 0 2514 4
807477 0 1051 4

In [35]:
data_model.index.userid.test.head() # test user index mapping, new index starts from 0


Out[35]:
old new
0 4833 0
1 4834 1
2 4835 2
3 4836 3
4 4837 4

In [36]:
data_model.index.itemid.head() # item index mapping


Out[36]:
old new
0 1 0
1 2 1
2 3 2
3 4 3
4 5 4

In [37]:
unseen_data.head()


Out[37]:
userid movieid rating
807503 4833 750 3
807485 4833 2555 1
807482 4833 3157 4
807480 4833 1197 5
807491 4833 24 3

Scenario 4: evaluate recommendations for unseen users with external holdout data

This is the most complete scenario. We generate recommendations based on the test users' preferences, encoded in the testset, and evaluate them against the holdout. You should use this setup only when Polara's built-in warm-start evaluation pipeline (turned on by data_model.warm_start=True) is not sufficient, e.g. when the preference data is fixed and provided externally.
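For reference, the built-in pipeline mentioned above would look roughly as follows (a sketch only, not executed in this guide):

data_model.warm_start = True
data_model.prepare()  # Polara would then sample test users and their holdout items automatically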


In [38]:
data_model.set_test_data(testset=unseen_data, holdout=holdout, warm_start=True)


18 unique movieid's within 26 testset interactions were filtered. Reason: not in the training data.
6 unique movieid's within 6 holdout interactions were filtered. Reason: not in the training data.
4484 userid's were filtered out from holdout. Reason: inconsistent with testset.
79 userid's were filtered out from testset. Reason: inconsistent with holdout.

As before, all unrelated users and items are removed from the datasets and the remaining entities are reindexed.


In [39]:
data_model.test.testset.head(10)


Out[39]:
userid movieid rating
807503 0 707 3
807519 0 831 5
807532 0 2542 5
807529 0 2514 4
807477 0 1051 4
807524 0 3113 4
807505 0 1191 5
807483 0 2294 5
807473 0 3169 2
807500 0 1170 5

In [40]:
data_model.test.holdout.head(10)


Out[40]:
userid movieid rating
807464 0 844 5
807471 0 901 5
807494 0 1116 5
807499 0 1165 5
807502 0 1182 5
807507 0 2982 4
807509 0 2389 4
807811 1 1010 4
807784 1 950 4
807756 1 2147 5

In [41]:
svd.switch_positive = 4
svd.evaluate()


Out[41]:
[Relevance(precision=0.48771352650892064, recall=0.1962177724147278, fallout=0.05871125477406844, specifity=0.6808813050133541, miss_rate=0.741780456106087),
 Ranking(nDCG=0.17149113058615703, nDCL=0.06967069219097612),
 Experience(coverage=0.10999456816947312),
 Hits(true_positive=1063, false_positive=245, true_negative=3505, false_negative=4666)]

In [42]:
data_model.test.holdout.query('rating>=4').shape[0] # maximum number of possible true positives


Out[42]:
5729

In [43]:
svd.evaluate('relevance')


Out[43]:
Relevance(precision=0.48771352650892064, recall=0.1962177724147278, fallout=0.05871125477406844, specifity=0.6808813050133541, miss_rate=0.741780456106087)
