In [1]:
import numpy as np
import json
import graphviz
from helpers import makePopulation, generatePurchaseData
from populationModels import pop_crude
from sklearn.cluster import AgglomerativeClustering

Studying relations between clusters of people and clusters of stuff they purchase

The old saying goes something along the lines of "you are what you eat". In modern, digitally connected societies the sentiment shifts towards "you are what you consume". That may include food, of course.

One important difference from a decade ago is what can be measured about how a user consumes a product. Lately, it seems like every little thing we purchase or consume can be used to build a projection of ourselves from our habits.

That "projection" allows companies to build products you enjoy more (and thus, continue paying for). Like any other tool in history it can be used for less constructive purposes. Either way, the idea is poking the reptilian parts of your reward circuitry so they release the right cocktail of hormones. A cocktail that makes you reach for your wallet or click "I accept the terms and conditions".

There are many classic recipes for such cocktails, and a lot of work goes into refining them. As a user, actively tasting those recipes in modern products can be as interesting as wine tasting, minus the inebriation.

The ethics around studying customers are not the focus of this post (as relevant as that discussion is). The actual focus is trying out different open-source libraries on a simplified model of "people according to what we know about them".

We will use graph theory, statistics, mathematics and machine learning, and borrow a technique or two from unfamiliar places. Have fun!

Less informal problem statement

Suppose we have two populations $A$ and $B$ with totally different features. The two populations are related by a function $\cal{F}$ mapping $A_{i}$ to $\{B_j, \dots, B_k\}$. That is, $\cal{F}$ takes an element of $A$ and returns a subset of $B$.

We can model $\cal{F}$ with an undirected graph. Relations between $A$ and $B$ through $\cal{F}$ can look like the following figure.


In [3]:
F = graphviz.Graph()  # (engine='neato')
F.graph_attr['rankdir'] = 'LR'
F.edge('A_1','B_1')
F.edge('A_1','B_2')
F.edge('A_2','B_1')
F.edge('A_3','B_1')
F.edge('A_4','B_2')
F.edge('A_5','B_2')
F.edge('A_5','B_3')
F


Out[3]:
[Figure: bipartite graph with edges A_1-B_1, A_1-B_2, A_2-B_1, A_3-B_1, A_4-B_2, A_5-B_2, A_5-B_3]

The structure we just described happens to match a well-studied family of graphs known as bipartite graphs. There are tons of algorithms and math tools that can be used to study such graphs. Lucky us!
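As a quick taste of that machinery, here is a minimal sketch (using networkx, which this notebook does not otherwise import) that rebuilds the toy graph above and pokes at a couple of the bipartite tools:

In [ ]:
import networkx as nx
from networkx.algorithms import bipartite

# Same toy relation F as in the figure above.
G = nx.Graph()
G.add_edges_from([("A_1", "B_1"), ("A_1", "B_2"), ("A_2", "B_1"),
                  ("A_3", "B_1"), ("A_4", "B_2"), ("A_5", "B_2"),
                  ("A_5", "B_3")])

print(bipartite.is_bipartite(G))  # True: edges only run between A and B

# Project onto the A side: two people get connected if they
# share at least one element of B.
A_nodes = {n for n in G if n.startswith("A_")}
print(bipartite.projected_graph(G, A_nodes).edges())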

If we were to find clusters of elements of $A$ and $B$ -separately-, and then count the edges between the elements inside the different clusters, that would give us one way to measure how related the clusters of $A$ and $B$ are.

For the previous example, such a graph would look similar to the following figure.


In [2]:
F = graphviz.Graph()
F.graph_attr['rankdir'] = 'LR'
# Repeated edges encode the edge count between the two clusters.
F.edge('A_1, A_2, A_3', 'B_1')  # A_1-B_1
F.edge('A_1, A_2, A_3', 'B_1')  # A_2-B_1
F.edge('A_1, A_2, A_3', 'B_1')  # A_3-B_1

F.edge('A_1, A_2, A_3', 'B_2, B_3')  # A_1-B_2

F.edge('A_4, A_5', 'B_2, B_3')  # A_4-B_2
F.edge('A_4, A_5', 'B_2, B_3')  # A_5-B_2
F.edge('A_4, A_5', 'B_2, B_3')  # A_5-B_3
F


Out[2]:
[Figure: clustered bipartite graph; three edges between {A_1, A_2, A_3} and B_1, one edge between {A_1, A_2, A_3} and {B_2, B_3}, three edges between {A_4, A_5} and {B_2, B_3}]
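To make the edge counting concrete, here is a minimal sketch (the names are ours, nothing from helpers.py) that tallies the edges of $\cal{F}$ into a cluster-by-cluster contingency matrix:

In [ ]:
# Toy inputs: the edges of F plus a cluster assignment for each node.
edges = [("A_1", "B_1"), ("A_1", "B_2"), ("A_2", "B_1"), ("A_3", "B_1"),
         ("A_4", "B_2"), ("A_5", "B_2"), ("A_5", "B_3")]
clusters_A = {"A_1": 0, "A_2": 0, "A_3": 0, "A_4": 1, "A_5": 1}
clusters_B = {"B_1": 0, "B_2": 1, "B_3": 1}

# counts[i, j] = number of edges between cluster i of A and cluster j of B.
counts = np.zeros((2, 2), dtype=int)
for a, b in edges:
    counts[clusters_A[a], clusters_B[b]] += 1
print(counts)
# [[3 1]
#  [0 3]]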

Given some data we generate, we want to see what different algorithms tell us about:

  • Easily visible clusters in $A$ and $B$.
  • Given clusters in $A$ and $B$, find what $\cal{F}$ can say about them.
  • See how well the algorithms are suited for finding the "Truth" about the data.

Simple, right? In order to do this, we will need to find a way to:

  • Generate $A$ and $B$ given some "True" cluster membership and some constraints.
  • Generate $\cal{F}$ according to some "True" relation we want to study.

For generating $A$ and $B$ we want to consider different proportions and dependencies.

We expect that different algorithms will be better suited for different types of structures and relations. Unfortunately, in real life we can't really know this beforehand. If you are lucky, there will be some theory on top of which you can -reasonably- support the choice of an algorithm, but if you are exploring something very dynamic you are better off with an open mind.

Testing different user models, spec models and purchasing functions

The bipartite graph (bigraph) representation lends itself pretty easily to modeling relations observable in snapshots of time, in which a set of people ($A$) performs an action over a set of items ($B$). Let's call that action a decision to purchase. The set of those decisions is returned by $\cal{F}$.

In that way we can take a dataset of timestamped purchases and use it to construct a bigraph. Then we can see what can be extracted out of it.
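generatePurchaseData is imported at the top, but its exact output format is not shown in this section, so the sketch below assumes simple (person, item, timestamp) records and builds a networkx bigraph out of them:

In [ ]:
import networkx as nx

# Assumed record format: (person_id, item_id, timestamp); only the
# first two fields matter for the structure of the bigraph itself.
purchases = [("A_1", "B_1", 1001), ("A_1", "B_2", 1002), ("A_2", "B_1", 1003)]

G = nx.Graph()
for person, item, _ in purchases:
    G.add_node(person, bipartite=0)  # people side
    G.add_node(item, bipartite=1)    # item side
    G.add_edge(person, item)
print(G.edges())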



Building Datasets

To generate $A$, $B$ and $\cal{F}$ we can use makePopulation from "helpers.py". This function builds labelled rows representing purchases. We can use the labels as a ground truth for the different algorithms.

Note that it is entirely possible for an algorithm to discover an interesting property of the data which was not considered while generating it; still, recovering the ground truth we planted is the main task.

TODO: explain how makePopulation works in more detail
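Until that TODO is resolved, here is a rough sketch of what makePopulation presumably does, judging only from the call below; the signature, the use of pop_crude, and the handling of c1..c3 are all assumptions:

In [ ]:
def toy_model(config):
    # Stand-in for pop_crude: draw each feature from a Gaussian around
    # the cluster's configured means (the config shape is an assumption).
    rng = np.random.default_rng()
    return [rng.normal(mu, 1.0) for mu in config["means"]]

def make_population_sketch(n, features, model, **cluster_configs):
    rng = np.random.default_rng()
    population = {f: [] for f in features}
    population["cluster_label"] = []
    for _ in range(n):
        label = rng.choice(list(cluster_configs))  # pick a cluster...
        sample = model(cluster_configs[label])     # ...and draw its features
        for f, value in zip(features, sample):
            population[f].append(value)
        population["cluster_label"].append(label)
    return population

pop = make_population_sketch(5, ["something", "age", "postcode"], toy_model,
                             c1={"means": [0, 30, 100]},
                             c2={"means": [5, 60, 200]})
print(pop["cluster_label"])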


In [ ]:
with open("population_config.json","r") as configfilex:
    cs=json.load(configfilex)

In [ ]:
population=makePopulation(100,["something","age","postcode"],
                          pop_crude,
                          c1=cs["c1"],
                          c2=cs["c2"],
                          c3=cs["c3"])

In [ ]:
print(population.keys())
n_clusters = len(set(population["cluster_label"]))
print(n_clusters)

In [ ]:
fittable=np.array([population["something"],
                   population["postcode"],
                   population["age"]]).T
alg=AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
alg.fit(fittable)

In [ ]:
clusterlabels=list(zip(alg.labels_,population["cluster_label"]))
print("Outs:{}".format(clusterlabels))

In [ ]:
# Ninja-visualize how the predicted labels line up with the true ones
# using plain HTML: one colored cell per sample, one row per labelling.
with open("htmlcolors.json", "r") as colorfile:
    colors = json.load(colorfile)
# Sample n_clusters distinct colors from that list, use them for visualizing.
colors = list(colors.keys())
colors = np.random.choice(colors, size=n_clusters, replace=False)
print(colors)
# Map each distinct label to one color; enumerating the raw label column
# would run past the end of `colors`, so enumerate the label sets instead.
colormatch = {label: colors[indx]
              for indx, label in enumerate(set(population["cluster_label"]))}
predmatch = {label: colors[indx]
             for indx, label in enumerate(set(alg.labels_))}
print(colormatch)

cell = "<td bgcolor={}>{}</td>"
row = "<tr>{}</tr>"
table = "<!DOCTYPE html><html><body><table>{}</table></body></html>"

# One row of true labels over one row of predicted labels.
rows = [row.format("".join(cell.format(colormatch[l], l)
                           for l in population["cluster_label"])),
        row.format("".join(cell.format(predmatch[l], l)
                           for l in alg.labels_))]
with open("outs/clabels.html", "w") as outfile:
    outfile.write(table.format("".join(rows)))

#%%HTML
#<iframe width="100%" height="25%" src="outs/clabels.html"></iframe>

Try other linkage methods in AgglomerativeClustering


In [ ]:
linkage_methods=['ward', 'average', 'complete']
aggl=lambda x: AgglomerativeClustering(n_clusters=n_clusters, linkage=x)
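The cell above only builds the factory; here is a sketch of the comparison loop it is presumably meant to feed, reusing the adjusted Rand index from earlier:

In [ ]:
from sklearn.metrics import adjusted_rand_score

for method in linkage_methods:
    labels = aggl(method).fit(fittable).labels_
    score = adjusted_rand_score(population["cluster_label"], labels)
    print("{}: {:.3f}".format(method, score))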

In [ ]:
import webcolors
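webcolors is not used yet; presumably it is meant to turn the sampled HTML color names into hex codes for the table cells. A minimal example of that conversion:

In [ ]:
# e.g. 'red' -> '#ff0000'; name_to_hex raises ValueError on unknown names.
print([webcolors.name_to_hex(name) for name in colors])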
