Let's assume we have a shop that sells bananas.

We buy bananas from one dealer who claims that he cultivates them organically in his own garden.

When we receive our bananas, though, they do not all look the same: some are small, some thick, some long... We observe variation.

There are two scenarios: 1) It is indeed a single variety of banana and the variation is natural. 2) There is more than one variety of banana, and our dealer probably buys them from third-party dealers.


In [7]:
%matplotlib inline
import matplotlib.pyplot as plt

In [51]:
# Let's assume our dealer lies and the bananas indeed come from elsewhere
from sklearn.datasets import make_blobs

# Each row is the typical [length, thickness] of one banana variety
bananas_dimensions = \
  [[10, 3],   # long - thin
   [5, 2],    # short - thin
   [7.5, 5]]  # medium - thick

bananas_dimensions_std = 1.0
n_bananas = 1000

X, banana_labels = make_blobs(n_samples=n_bananas,
                              centers=bananas_dimensions,
                              cluster_std=bananas_dimensions_std)
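
As a quick sanity check on what make_blobs hands back, the sketch below regenerates the same kind of data and prints its shape, the number of bananas per variety, and the sample mean of each variety. The names X_demo and labels_demo and the fixed random_state are assumptions added here for reproducibility; they are not part of the cells above.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same centers as above: [length, thickness] per variety
centers = [[10, 3], [5, 2], [7.5, 5]]

# random_state is an added assumption so the sketch is reproducible
X_demo, labels_demo = make_blobs(n_samples=1000, centers=centers,
                                 cluster_std=1.0, random_state=0)

print(X_demo.shape)              # one row per banana, two columns
print(np.bincount(labels_demo))  # number of bananas per variety

# Sample mean of each variety; it should land close to its center
for k in range(len(centers)):
    print(k, X_demo[labels_demo == k].mean(axis=0))
```

The labels are exactly the information we pretend not to have in the rest of the notebook.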

In [52]:
# Now let's pretend all we have are the n_bananas above, all in the same basket
# We know nothing about their origin
# Let's plot what we see in terms of length (x) and thickness (y)
plt.scatter(X[:,0], X[:,1])


Out[52]:
<matplotlib.collections.PathCollection at 0x113e2b2d0>

Perform Clustering (K-means)


In [53]:
# At first glance something looks suspicious
# Let's perform some clustering to see what we can measure

In [57]:
# Since we do not know whether our bananas come from different places,
# we can try different numbers of clusters and evaluate some metric

def perform_k_means(X, n_clusters):
    """Cluster X into n_clusters groups and return each sample's cluster index."""
    from sklearn.cluster import KMeans
    y_pred = KMeans(n_clusters=n_clusters)\
                .fit_predict(X)
    return y_pred
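
Besides the predicted labels, a fitted KMeans object also exposes the cluster centers it found and the inertia (the sum of squared distances from each point to its nearest center). The sketch below is a minimal illustration on regenerated data; n_init, random_state and the names X_demo, true_centers and fitted_centers are assumptions added here for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# The three banana centers from above, listed in order of increasing length
true_centers = np.array([[5, 2], [7.5, 5], [10, 3]])
X_demo, _ = make_blobs(n_samples=1000, centers=true_centers,
                       cluster_std=1.0, random_state=0)

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Cluster indices are arbitrary, so sort the fitted centers by length
order = np.argsort(model.cluster_centers_[:, 0])
fitted_centers = model.cluster_centers_[order]

print(fitted_centers)   # estimated [length, thickness] per cluster
print(model.inertia_)   # within-cluster sum of squared distances
```

With three clusters the fitted centers should sit near the centers used to generate the data, which is only something we can check here because we made the data ourselves.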

In [63]:
# Let's see some plots:

# dummy case: a single cluster
y_pred = perform_k_means(X, 1)
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)

y_pred = perform_k_means(X, 2)
plt.subplot(222)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)

y_pred = perform_k_means(X, 3)
plt.subplot(223)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)

y_pred = perform_k_means(X, 4)
plt.subplot(224)
plt.scatter(X[:, 0], X[:, 1], c=y_pred)


Out[63]:
<matplotlib.collections.PathCollection at 0x1194af0d0>

Evaluate the Clustering


In [64]:
# At this point we need an evaluation phase
# We need a metric that can hint at which of the above clusterings
# fits the data better

In [65]:
# ... TODO coming in the next episode :-)
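
One metric that could play this role is the silhouette score, which compares how close each point is to its own cluster with how close it is to the nearest other cluster; values lie between -1 and 1, and higher is better. The sketch below is only one possible choice, not necessarily the metric this notebook will end up using. The range of k, n_init, random_state and the names X_demo, scores and best_k are assumptions added for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Regenerate data like the basket of bananas above (fixed seed is an assumption)
X_demo, _ = make_blobs(n_samples=1000,
                       centers=[[10, 3], [5, 2], [7.5, 5]],
                       cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)
    print(k, round(scores[k], 3))

# The k with the highest score is a candidate for the number of varieties
best_k = max(scores, key=scores.get)
print("highest silhouette at k =", best_k)
```

Note that the single-cluster case cannot be scored this way, so the silhouette alone cannot confirm the "one variety" scenario; it can only compare clusterings with two or more groups.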
