Having understood the basics of how an algorithm is configured, married with data, and deployed in bestPy, we are now ready to move from a baseline recommendation to something more involved. In particular, we are going to discuss the implementation and use of collaborative filtering without, however, going too deep into the technical details of how the algorithm works.
The following path adjustment is only needed because the examples folder is a subdirectory of the bestPy package.
In [1]:
import sys
sys.path.append('../..')
In [2]:
from bestPy import write_log_to
from bestPy.datastructures import Transactions
from bestPy.algorithms import Baseline, CollaborativeFiltering # Additionally import CollaborativeFiltering
logfile = 'logfile.txt'
write_log_to(logfile, 20)
file = 'examples_data.csv'
data = Transactions.from_csv(file)
In [3]:
recommendation = CollaborativeFiltering().operating_on(data)
recommendation.has_data
Out[3]:
In [4]:
recommendation.binarize
Out[4]:
The binarize attribute has the same meaning as in the baseline recommendation: True means we only care whether or not a customer bought an article, and False means we also take into account how often a customer bought an article.
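To make the effect of binarizing concrete, here is a minimal sketch on a made-up table of purchase counts. The data and the helper function binarize_counts are purely illustrative assumptions and are not part of bestPy; they only show what "bought at all" versus "bought how often" means.

```python
# Toy purchase counts (made up): rows are customers, columns are articles.
counts = [
    [2, 0, 1],
    [0, 3, 0],
    [1, 1, 0],
]


def binarize_counts(matrix):
    """Hypothetical helper: replace every positive count with 1, keep zeros."""
    return [[1 if value > 0 else 0 for value in row] for row in matrix]


binary = binarize_counts(counts)
print(binary)  # [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
```

With binarization, the customer who bought the second article three times counts exactly as much as one who bought it once.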
Speaking of the baseline, you will notice that the recommendation object we just created actually has an attribute baseline.
In [5]:
recommendation.baseline
Out[5]:
Indeed, collaborative filtering cannot always provide recommendations for all customers. Specifically, it fails to do so if the customer in question only bought articles that no other customer has bought. For these cases, we need a fallback solution, which is provided by the algorithm specified through the baseline attribute. As you can see, that algorithm is currently a Baseline instance. We could, of course, also provide the baseline algorithm manually.
In [6]:
recommendation.baseline = Baseline()
recommendation.baseline
Out[6]:
More about that later. There is one more parameter to be explored first.
In [7]:
recommendation.similarity
Out[7]:
In short, collaborative filtering (as it is implemented in bestPy) works by recommending articles that are most similar to the articles the target customer has already bought. What exactly similar means, however, is not set in stone and quite a few similarity measures are available.
- Dice (dice)
- Jaccard (jaccard)
- Kulsinski (kulsinski)
- Sokal-Sneath (sokalsneath)
- Russell-Rao (russellrao)
- cosine (cosine)
- binary cosine (cosine_binary)

In the last option, we recognize again our concept of binarize: to compute the cosine similarity between two articles, we do not count how often they have been bought by any particular user but only whether they have been bought.
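To give a feeling for how such measures differ, the sketch below computes three of them from their textbook definitions, treating each article as the set of customers who bought it. The function names and the toy data are assumptions made for illustration; bestPy's own implementations may differ in detail.

```python
from math import sqrt


def jaccard_similarity(a, b):
    """Shared buyers divided by all buyers of either article."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice_similarity(a, b):
    """Twice the shared buyers divided by the summed buyer counts."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def binary_cosine_similarity(a, b):
    """Shared buyers divided by the geometric mean of the buyer counts."""
    norm = sqrt(len(a) * len(b))
    return len(a & b) / norm if norm else 0.0


# Made-up customer IDs: two customers ('2' and '3') bought both articles.
buyers_of_x = {'1', '2', '3'}
buyers_of_y = {'2', '3', '4', '5'}

print(jaccard_similarity(buyers_of_x, buyers_of_y))        # 2 / 5
print(dice_similarity(buyers_of_x, buyers_of_y))           # 4 / 7
print(binary_cosine_similarity(buyers_of_x, buyers_of_y))  # 2 / sqrt(12)
```

All three agree that the two articles are related, but they weigh the non-shared buyers differently, which is why the resulting rankings can differ.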
It is not obvious which similarity measure is best in which case, so some experimentation is required. If we want to set the similarity to something other than the default choice of kulsinski, we have to import what we need from the logically located sub-subpackage.
In [8]:
from bestPy.algorithms.similarities import dice, jaccard, sokalsneath, russellrao, cosine, cosine_binary
recommendation.similarity = dice
recommendation.similarity
Out[8]:
And that's it for the parameters of the collaborative filtering algorithm.
Now that everything is set up and we have data attached to the algorithm, its for_one() method is available and can be called with the internal integer index of the target customer as argument.
In [9]:
customer = data.user.index_of['5']
recommendation.for_one(customer)
Out[9]:
And, voilà, your recommendation. Again, a higher number means that the article with the same index as that number is more highly recommended for the target customer.
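If you prefer a ranked list of article indices over raw scores, sorting the indices by score is enough. The sketch below uses a made-up list of scores standing in for the output of for_one(); the values are assumptions, not actual results from the example data.

```python
# Made-up scores: position i holds the score of the article with index i.
scores = [0.1, 0.7, 0.0, 0.4]

# Sort the article indices so that the highest score comes first.
ranked_articles = sorted(
    range(len(scores)),
    key=lambda index: scores[index],
    reverse=True,
)
print(ranked_articles)  # [1, 3, 0, 2]
```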
To appreciate why the baseline fallback is necessary, we next try to get a recommendation for the customer with ID '4'.
In [10]:
customer = data.user.index_of['4']
recommendation.for_one(customer)
Out[10]:
Checking your logfile, you will now see the additional line:
[INFO]: Uncomparable user with ID 4. Returning baseline recommendation. (collaborativefiltering|__for_one)
As you try different users, you will notice that, due to the sparsity of the data, this happens more often than you'd think. Good thing we have a customer-agnostic baseline and a logging facility that lets us know when it's used.
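The fallback behaviour can be illustrated with a toy item-based recommender. This is only a sketch under simplifying assumptions and not bestPy's actual implementation: the function recommend, the Jaccard scoring, the popularity-based fallback, and the purchase data are all made up for illustration.

```python
def jaccard(a, b):
    """Shared buyers divided by all buyers of either article."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend(customer, purchases):
    """Score articles not yet bought; fall back to popularity if all are zero.

    Returns a (scores, used_fallback) pair.
    """
    articles = sorted(set().union(*purchases.values()))
    # For every article, collect the set of customers who bought it.
    buyers = {
        article: {c for c, bought in purchases.items() if article in bought}
        for article in articles
    }
    own = purchases[customer]
    # Score each new article by its summed similarity to the bought ones.
    scores = {
        candidate: sum(jaccard(buyers[candidate], buyers[b]) for b in own)
        for candidate in articles
        if candidate not in own
    }
    if not any(scores.values()):
        # No overlap with any other customer: use a customer-agnostic fallback.
        return {article: len(buyers[article]) for article in articles}, True
    return scores, False


# Made-up data: customer 'C' only bought an article nobody else bought.
purchases = {
    'A': {'x', 'y'},
    'B': {'x', 'z'},
    'C': {'w'},
}

print(recommend('A', purchases))  # ({'w': 0.0, 'z': 0.5}, False)
print(recommend('C', purchases))  # ({'w': 1, 'x': 2, 'y': 1, 'z': 1}, True)
```

Customer 'A' shares article 'x' with customer 'B', so article 'z' receives a non-zero score. Customer 'C' shares nothing with anyone, every similarity is zero, and the popularity counts are returned instead.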