CHAPTER 6

6.1 Benchmark: Score

Now that we have a convenient way to make recommendations and a convenient way to split our data into training and test sets, it is straightforward to benchmark our algorithms and find both the best one and the best settings for it.

For that purpose, bestPy provides the Benchmark class, which brings together a fully configured recommender (i.e., a RecoBasedOn instance) and the test data to provide a score for the algorithm used in the recommender. Let's see how that works.

Preliminaries

We only need this because the examples folder is a subdirectory of the bestPy package.


In [1]:
import sys
sys.path.append('../..')

Imports, logging, and data

In addition to what we did in the last notebook, we now also import the Benchmark class from the top-level package. As an algorithm, we are going to choose CollaborativeFiltering and also import two similarities for comparison. Importantly, we also need the Baseline to establish a basic score that any algorithm worthy of the name should beat.


In [2]:
from bestPy import write_log_to
from bestPy.datastructures import TrainTest
from bestPy import Benchmark, RecoBasedOn  # Additionally import Benchmark
from bestPy.algorithms import Baseline, CollaborativeFiltering  # Import also Baseline for score to beat
from bestPy.algorithms.similarities import kulsinski, cosine  # Import two similarities for comparison

logfile = 'logfile.txt'
write_log_to(logfile, 20)

file = 'examples_data.csv'
data = TrainTest.from_csv(file)

Split the TrainTest data and set up a recommender with the Baseline algorithm

Let's stick with holding out the last 4 unique purchases from each customer, and let's say we also want to recommend articles that customers bought before. After splitting the data accordingly, we are going to set up a recommender with the training data and the Baseline as its algorithm.


In [3]:
data.split(4, False)

algorithm = Baseline()

recommender = RecoBasedOn(data.train).using(algorithm).keeping_old

Creating a new Benchmark object

To instantiate the Benchmark class, a recommender object of type RecoBasedOn is required (think: "What do I want to benchmark?"), like so:


In [4]:
benchmark = Benchmark(recommender)

Tab inspection tells us that the newly created object has a single method against(). In order to provide a benchmark score for our recommender, which was trained on the training data, we also need the held-out test data to test it against. So the argument of against() is the held-out test data, and its return value is the benchmark object itself, now with the test data attached.


In [5]:
benchmark = benchmark.against(data.test)

The beauty of this peculiar way of calling the against() method is again revealed when we combine it with the instantiation of the Benchmark class into a single, elegant line of code that reads like an instruction in natural language.


In [6]:
benchmark = Benchmark(recommender).against(data.test)

Scoring a recommender with the Benchmark object

Now that we have set up a Benchmark object with a fully configured recommender, trained with training data, and pitted against the corresponding test data, a new attribute score has magically appeared.


In [7]:
benchmark.score


Out[7]:
0.11555864949530108

It tells us that, on average, customers actually bought about 0.12 articles out of the 4 we recommended. And that's just the baseline recommendation, which does not take differences between customers into account at all. Any algorithm that does should easily beat that number. Let's try.
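To make the meaning of such a score concrete, here is a toy, hand-rolled calculation of "average hits among the top 4 recommendations". It is not bestPy's actual implementation, and the customer and article names are made up; it merely illustrates the kind of number reported above.


In [ ]:
# Toy illustration only, NOT bestPy's internals: score hypothetical top-4
# recommendations against hypothetical held-out purchases by hand.
top_4_recommended = {'customer_1': ['A', 'B', 'C', 'D'],
                     'customer_2': ['A', 'E', 'F', 'G']}
held_out = {'customer_1': {'B', 'X', 'Y', 'Z'},   # 1 of 4 recommendations bought
            'customer_2': {'P', 'Q', 'R', 'S'}}   # 0 of 4 recommendations bought

hits = [len(set(top_4_recommended[customer]) & held_out[customer])
        for customer in held_out]
print(sum(hits) / len(hits))   # 0.5 hits per customer in this toy example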


In [8]:
algorithm = CollaborativeFiltering()
algorithm.binarize = False
algorithm.similarity = cosine

recommender = RecoBasedOn(data.train).using(algorithm).pruning_old

benchmark = Benchmark(recommender).against(data.test)
benchmark.score


Out[8]:
0.4552732335537765

Indeed, using collaborative filtering, we could significantly improve over the customer-agnostic baseline and get about 0.46 articles out of 4 right. But maybe we should not count each individual purchase and, instead, just consider whether a customer bought an article or not.
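To make that idea concrete, here is a minimal sketch, with made-up numbers and without touching bestPy's internals, of what binarizing purchase counts means: repeated purchases of the same article collapse to a simple bought-or-not flag.


In [ ]:
# Conceptual sketch with made-up numbers, not bestPy's internals:
# binarizing collapses purchase counts to a bought/not-bought flag.
purchase_counts = {'article_1': 5, 'article_2': 1, 'article_3': 0}
binarized = {article: int(count > 0) for article, count in purchase_counts.items()}
print(binarized)   # {'article_1': 1, 'article_2': 1, 'article_3': 0}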


In [9]:
algorithm.binarize = True
benchmark.score


Out[9]:
0.4684998259658893

That improved our score a little. But we also have other knobs to turn. How about trying a different way of measuring similarity between articles?


In [10]:
algorithm.similarity = kulsinski
benchmark.score


Out[10]:
0.49182039679777234

Better again! You can clearly see where this is going. How high can you get the score? Happy exploring!
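If you want a systematic starting point for that exploration, a simple sweep over the knobs introduced above, sketched below using only the classes and similarities already imported in this notebook, lets you compare scores side by side.


In [ ]:
# Simple sweep over the knobs used above. Assumes the imports and the
# train/test split from the earlier cells of this notebook are in place.
for binarize in (False, True):
    for name, similarity in (('cosine', cosine), ('kulsinski', kulsinski)):
        algorithm = CollaborativeFiltering()
        algorithm.binarize = binarize
        algorithm.similarity = similarity
        recommender = RecoBasedOn(data.train).using(algorithm).pruning_old
        score = Benchmark(recommender).against(data.test).score
        print('binarize =', binarize, '| similarity =', name, '| score =', round(score, 4))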

NOTE: You may have realized that, when we split the data, we told bestPy that we would also recommend articles that customers had already bought.

data.split(4, False)

When setting up our recommender with collaborative filtering, however, we wrote

recommender = RecoBasedOn(data.train).using(algorithm).pruning_old

meaning that we would not, in fact, recommend previously bought articles. That's a clear contradiction. Which one is it going to be? Because, in this context, the way we have split the data determines the way we have to test it, the value of the only_new attribute of the test data takes precedence over the value of the same attribute of the recommender object.


In [11]:
recommender = RecoBasedOn(data.train).using(algorithm).pruning_old
print('Recommender before:', recommender.only_new)
print('Test data:', data.test.only_new)

benchmark = Benchmark(recommender).against(data.test)
print('Recommender after:', recommender.only_new)


Recommender before: True
Test data: False
Recommender after: False

This reset conveniently takes place behind the scenes but, to keep you informed of what's happening at all times, it is of course logged.

[INFO    ]: Resetting recommender to "keeping_old" because of test-data preference. (benchmark|against)

And this concludes our discussion of how to benchmark and tune bestPy's algorithms.

