Now that we have a solid understanding of the primary data structure used in bestPy, the question is what to do with it. Obviously, to arrive at a recommendation for a customer, some sort of algorithm needs to operate on that data. To introduce the basic properties of bestPy's algorithms, we will examine the simplest (and fastest) of them all, the Baseline, before we do anything fancy.
Not to be underestimated, though, some sort of baseline algorithm is critical to the recommendation business. Specifically it is needed to provide a
We only need this because the examples folder is a subdirectory of the bestPy package.
In [1]:
import sys
sys.path.append('../..')
In [2]:
from bestPy import write_log_to
from bestPy.datastructures import Transactions
from bestPy.algorithms import Baseline # Additionally import the baseline algorithm
logfile = 'logfile.txt'
write_log_to(logfile, 20)
file = 'examples_data.csv'
data = Transactions.from_csv(file)
In [3]:
algorithm = Baseline()
Inspecting the new algorithm object with Tab completion reveals binarize as a first attribute.
In [4]:
algorithm.binarize
Out[4]:
True
What its default value of True means is that, instead of judging an article's popularity by how many times it was bought, we count each unique customer only once. How often a given customer bought a given article no longer matters. It's 0 or 1, hence the attribute's name. You can set it to False if you want to take into account multiple buys by the same customer.
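To make the difference concrete, here is a minimal NumPy sketch of the two ways of counting. It does not use bestPy and is not how the Baseline is implemented internally; the purchase matrix and the popularity() helper are made up for illustration only.

```python
import numpy as np

# Hypothetical toy data (not from examples_data.csv):
# rows are customers, columns are articles,
# entries are how many times a customer bought an article.
counts = np.array([
    [6, 1, 0],   # customer 0
    [0, 1, 0],   # customer 1
    [1, 0, 2],   # customer 2
])


def popularity(counts, binarize=True):
    """Return one popularity score per article (column)."""
    if binarize:
        # Every customer contributes at most 1 per article.
        counts = (counts > 0).astype(int)
    return counts.sum(axis=0)


unique_buyers = popularity(counts, binarize=True)
total_buys = popularity(counts, binarize=False)

print(unique_buyers)  # [2 2 1]
print(total_buys)     # [7 2 2]
```

With binarization, the first two articles tie at two unique buyers each. Without it, the first article wins by a wide margin, purely because a single customer bought it six times.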
This decision depends on the use case. Do I really like an article more because I bought more than one unit of it? If you sell both consumables and more specialized items, the answer is not so clear. Suppose I bought 6 pairs of socks (which I ended up hating) and one copy of a book (which I ended up loving). Does it really make sense to base your recommendation on the assumption that I liked the socks 6 times as much as the book? Probably not. So this is a case where the default value of True for the binarize attribute might make sense.
If, on the other hand, you are selling consumables only, then the number of times I buy an item might indeed hint towards me liking that item more than others, and setting the binarize attribute to False might be adequate.
In [5]:
algorithm.binarize = False
Up to you to test and to act accordingly.
And that's it for setting up the configurable parameters of the Baseline algorithm. Without data, there is nothing else we can do for now, other than convincing ourselves that there is indeed no data associated with the algorithm yet.
In [6]:
algorithm.has_data
Out[6]:
False
In [7]:
recommendation = algorithm.operating_on(data)
recommendation.has_data
Out[7]:
True
Note: Of course, you can also directly instantiate the algorithm with data attached
recommendation = Baseline().operating_on(data)
and configure its parameters (the binarize attribute) later.
Now that we have data attached to our algorithm, Tab completion shows us that an additional method for_one() has mysteriously appeared. This method, which does not make any sense without data and was, therefore, hidden before, returns an array of numbers, one for each article: the first for the article with index 0, the next for the article with index 1, and so on. The highest number indicates the most recommended article and the lowest the least recommended one.
In [8]:
recommendation.for_one()
Out[8]:
As discussed above, these numbers correspond to either the count of unique buyers or the count of buys, depending on whether the attribute binarize is set to True or False, respectively.
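Since the position in the array is the article index, turning such scores into a ranking is a matter of sorting indices by score. The sketch below uses plain NumPy with made-up numbers standing in for the output of for_one(); the top_articles() helper is hypothetical and not part of bestPy.

```python
import numpy as np

# Hypothetical scores, one per article index (stand-in for for_one() output).
scores = np.array([2, 7, 1, 5])


def top_articles(scores, n):
    """Return the indices of the n highest-scoring articles, best first."""
    # Sorting the negated scores ascending yields a descending order of scores.
    # The stable sort keeps the lower article index first when scores tie.
    order = np.argsort(-scores, kind='stable')
    return order[:n]


best = top_articles(scores, 2)
print(best)  # [1 3]
```

Here article 1 (score 7) comes out on top, followed by article 3 (score 5).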
In [9]:
recommendation.binarize = True
recommendation.for_one()
Out[9]:
In [10]:
recommendation.binarize = 'foo'
And that's an error! If you examine your logfile, you should find the corresponding line there.
[ERROR ]: Attempt to set "binarize" to non-boolean type. (baseline|__check_boolean_type_of)
Remember to check your logfile every once in a while to see what's going on!
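If you want the same kind of safeguard in your own classes, one common pattern is a property whose setter checks the type, logs the problem, and then raises. The following is only a sketch of that pattern under our own assumptions: the Algorithm class, the logger name, and the choice of TypeError are made up for illustration and are not taken from bestPy's source.

```python
import logging

# Hypothetical logger name; bestPy configures its own logging via write_log_to().
log = logging.getLogger('sketch')


class Algorithm:
    """Hypothetical stand-in, not the bestPy Baseline class."""

    def __init__(self):
        self._binarize = True

    @property
    def binarize(self):
        return self._binarize

    @binarize.setter
    def binarize(self, value):
        # Reject anything that is not exactly True or False.
        if not isinstance(value, bool):
            log.error('Attempt to set "binarize" to non-boolean type.')
            raise TypeError('Attribute "binarize" must be True or False!')
        self._binarize = value


algorithm = Algorithm()
algorithm.binarize = False      # accepted

try:
    algorithm.binarize = 'foo'  # logged and rejected
except TypeError as error:
    rejected = str(error)

print(algorithm.binarize)  # False
```

The rejected assignment leaves the previous value untouched, so a typo cannot silently change how popularity is counted.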