CHAPTER 4

4.2 Algorithms: Collaborative filtering

Having understood the basics of how an algorithm is configured, married with data, and deployed in bestPy, we are now ready to move from a baseline recommendation to something more involved. In particular, we are going to discuss the implementation and use of collaborative filtering without, however, going too deep into the technical details of how the algorithm works.

Preliminaries

The following path manipulation is only needed because the examples folder is a subdirectory of the bestPy package.


In [1]:
import sys
sys.path.append('../..')

Imports, logging, and data

On top of the imports we already know, we now also import the CollaborativeFiltering algorithm, which is, as should be obvious by now, accessible through the bestPy.algorithms subpackage.


In [2]:
from bestPy import write_log_to
from bestPy.datastructures import Transactions
from bestPy.algorithms import Baseline, CollaborativeFiltering  # Additionally import CollaborativeFiltering

logfile = 'logfile.txt'
write_log_to(logfile, 20)  # 20 is the numeric value of INFO in Python's standard logging module

file = 'examples_data.csv'
data = Transactions.from_csv(file)

Creating a new CollaborativeFiltering object with data

Again, this is as straightforward as you would expect. This time, we will attach the data to the algorithm right away.


In [3]:
recommendation = CollaborativeFiltering().operating_on(data)
recommendation.has_data


Out[3]:
True

Parameters of the collaborative filtering algorithm

Inspecting the new recommendation object with Tab completion again reveals binarize as a first attribute.


In [4]:
recommendation.binarize


Out[4]:
True

It has the same meaning as in the baseline recommendation: True means we only care whether or not a customer bought an article and False means we also take into account how often a customer bought an article.
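To make the effect of binarize concrete, here is a minimal sketch in plain NumPy. It is not bestPy's implementation; the purchase-count matrix and the helper name binarize_counts are made up for illustration.

```python
import numpy as np

# Hypothetical purchase counts: one row per customer, one column per article.
counts = np.array([[3, 0, 1],
                   [0, 2, 0],
                   [1, 1, 0]])


def binarize_counts(matrix):
    """Return 1 where a customer bought an article at least once, else 0."""
    return (matrix > 0).astype(int)


binary = binarize_counts(counts)
print(binary)
```

With binarization, the customer who bought the first article three times counts exactly as much as the customer who bought it once.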

Speaking about baseline, you will notice that the recommendation object we just created actually has an attribute baseline.


In [5]:
recommendation.baseline


Out[5]:
'Baseline'

Indeed, collaborative filtering cannot necessarily provide recommendations for all customers. Specifically, it fails to do so if the customer in question only bought articles that no other customer has bought. For these cases, we need a fallback solution, which is provided by the algorithm specified through the baseline attribute. As you can see, that algorithm is currently a Baseline instance. We could, of course, also provide the baseline algorithm manually.


In [6]:
recommendation.baseline = Baseline()
recommendation.baseline


Out[6]:
'Baseline'

More about that later. There is one more parameter to be explored first.


In [7]:
recommendation.similarity


Out[7]:
'kulsinski'

In short, collaborative filtering (as it is implemented in bestPy) works by recommending articles that are most similar to the articles the target customer has already bought. What exactly similar means, however, is not set in stone and quite a few similarity measures are available.

  • Dice (dice)
  • Jaccard (jaccard)
  • Kulsinski (kulsinski)
  • Sokal-Sneath (sokalsneath)
  • Russell-Rao (russellrao)
  • cosine (cosine)
  • binary cosine (cosine_binary)

In the last option, we recognize again our concept of binarize: to compute the cosine similarity between two articles, we do not count how often they have been bought by any particular user but only whether they have been bought at all.
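To get a feeling for how some of these measures differ, the following sketch computes four of them from their textbook definitions for two articles. The function names and the two purchase vectors are hypothetical, and bestPy's own implementations may differ in detail (for example in normalization or in how they handle edge cases).

```python
import numpy as np


def dice_similarity(a, b):
    """Twice the number of shared buyers over the summed number of buyers."""
    a, b = a > 0, b > 0
    total = np.sum(a) + np.sum(b)
    return 2 * np.sum(a & b) / total if total else 0.0


def jaccard_similarity(a, b):
    """Number of shared buyers over the number of buyers of either article."""
    a, b = a > 0, b > 0
    union = np.sum(a | b)
    return np.sum(a & b) / union if union else 0.0


def cosine_similarity(a, b):
    """Cosine of the angle between the two purchase-count vectors."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / norm if norm else 0.0


def binary_cosine_similarity(a, b):
    """Cosine similarity after reducing counts to bought / not bought."""
    return cosine_similarity((a > 0).astype(float), (b > 0).astype(float))


# Hypothetical purchase counts of two articles by four customers.
article_a = np.array([2, 0, 1, 0])
article_b = np.array([1, 1, 0, 0])

for measure in (dice_similarity, jaccard_similarity,
                cosine_similarity, binary_cosine_similarity):
    print(measure.__name__, measure(article_a, article_b))
```

For these two vectors, the Dice and binary cosine values both come out as 0.5 and the Jaccard value as 1/3, while the count-based cosine is larger because the shared customer bought the first article twice.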

It is not obvious which similarity measure is best in which case, so some experimentation is required. If we want to set the similarity to something other than the default choice of kulsinski, we have to import what we need from the logically located subsubpackage.


In [8]:
from bestPy.algorithms.similarities import dice, jaccard, sokalsneath, russellrao, cosine, cosine_binary

recommendation.similarity = dice
recommendation.similarity


Out[8]:
'dice'

And that's it for the parameters of the collaborative filtering algorithm.
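Before moving on, it may help to see how these three parameters could interact. The sketch below is a simplified, item-based scheme written in plain NumPy under several assumptions: it uses binary cosine similarity, scores articles by multiplying the similarity matrix with the customer's purchase history, and falls back to a simple popularity count when all scores are zero. It is not how bestPy is actually implemented, and all names in it are hypothetical.

```python
import numpy as np


def article_similarity(purchases):
    """Binary cosine similarity between all pairs of article columns."""
    bought = (purchases > 0).astype(float)
    overlap = bought.T @ bought            # number of shared buyers per pair
    norms = np.sqrt(np.diag(overlap))      # sqrt of number of buyers per article
    norms[norms == 0] = 1.0                # avoid dividing by zero for unsold articles
    similarity = overlap / np.outer(norms, norms)
    np.fill_diagonal(similarity, 0.0)      # ignore each article's similarity to itself
    return similarity


def recommend_for(customer, purchases):
    """Score all articles for one customer, with a popularity fallback."""
    scores = article_similarity(purchases) @ purchases[customer]
    if not scores.any():
        # Assumed fallback: number of distinct buyers per article.
        return (purchases > 0).sum(axis=0).astype(float)
    return scores


# Hypothetical data: three customers (rows), four articles (columns).
purchases = np.array([[1, 1, 0, 0],
                      [0, 1, 1, 0],
                      [0, 0, 0, 2]])

print(recommend_for(0, purchases))  # similarity-based scores
print(recommend_for(2, purchases))  # customer 2 shares no article with anyone
```

The last customer only bought an article that nobody else bought, so every similarity involving that article is zero and the sketch returns the popularity counts instead.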

Making a recommendation for a target customer

Now that everything is set up and we have data attached to the algorithm, its for_one() method is available and can be called with the internal integer index of the target customer as argument.


In [9]:
customer = data.user.index_of['5']
recommendation.for_one(customer)


Out[9]:
array([ 0.        ,  0.69444444,  2.62288862, ...,  0.        ,
        0.        ,  0.        ])

And, voilà, your recommendation. Again, a higher number means that the article with the same index as that number is more highly recommended for the target customer.
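Turning such an array of scores into a ranked shortlist only takes a sort. The following sketch uses a short, made-up score array in place of the actual output of for_one(), and the helper name top_articles is hypothetical. Note that the result contains internal article indices, not article IDs.

```python
import numpy as np

# Made-up scores standing in for the array returned by for_one().
scores = np.array([0.0, 0.69, 2.62, 0.0, 1.5])


def top_articles(scores, n):
    """Return the internal indices of the n highest-scoring articles, best first."""
    order = np.argsort(scores)[::-1]  # argsort is ascending, so reverse it
    return order[:n]


print(top_articles(scores, 3))
```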

To appreciate the necessity of the baseline fallback discussed above, we next try to get a recommendation for the customer with ID '4'.


In [10]:
customer = data.user.index_of['4']
recommendation.for_one(customer)


Out[10]:
array([ 1.,  1.,  8., ...,  1.,  1.,  1.])

Checking your logfile, you will now see the additional line:

[INFO]: Uncomparable user with ID 4. Returning baseline recommendation. (collaborativefiltering|__for_one)

As you try different users, you will notice that, due to the sparsity of the data, this happens more often than you'd think. Good thing we have a customer-agnostic baseline and a logging facility that lets us know when it's used.
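To put a number on how often the fallback is used, one could simply count the corresponding lines in the logfile. The sketch below assumes that every fallback produces a line containing the text shown above. The helper name count_fallbacks is hypothetical, and the second sample line is a made-up placeholder, not real bestPy output.

```python
def count_fallbacks(lines):
    """Count log lines that report a fallback to the baseline recommendation."""
    marker = 'Returning baseline recommendation'
    return sum(1 for line in lines if marker in line)


sample = [
    '[INFO]: Uncomparable user with ID 4. Returning baseline recommendation. '
    '(collaborativefiltering|__for_one)',
    '[INFO]: (placeholder for an unrelated log line, made up for this example)',
]

print(count_fallbacks(sample))
```

Since an open file can be iterated line by line, the same function could be applied to the actual logfile with `with open(logfile) as f: count_fallbacks(f)`.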

