Clustering

Goal: group observations into clusters
How this grouping is done (the properties of each cluster) is what defines each algorithm



In [6]:

    
from sklearn.datasets import load_iris
iris = load_iris()

K-Means



In [2]:

    
from sklearn.cluster import KMeans

X = iris.data

k_means = KMeans(n_clusters=3)
k_means.fit(X)









    Out[2]:





KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)



In [3]:

    
k_means.labels_









    Out[3]:





array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1,
       2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1], dtype=int32)



In [4]:

    
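# Assumes the pylab namespace (e.g. %pylab inline) for figsize, subplot, scatter, title
figsize(14,5)
# Left: petal length vs. petal width, colored by K-Means cluster label
subplot(1,2,1)
scatter(X[:,2], X[:,3], c=k_means.labels_)
title('K-Means')

# Right: same axes, colored by the true species
subplot(1,2,2)
scatter(X[:,2], X[:,3], c=iris.target)
title('Ground-truth')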









    Out[4]:

[Figure: two scatter plots of petal length vs. petal width, colored by K-Means labels (left) and by ground-truth species (right)]



In [5]:

    
from mpl_toolkits.mplot3d import Axes3D

fig = figure(figsize=(14,6))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:, 0], X[:, 2], X[:, 3], c=k_means.labels_)
ax.set_title('K-Means')
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.scatter(X[:, 0], X[:, 2], X[:, 3], c=iris.target)
ax.set_title('Ground-truth')









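    Out[5]:

[Figure: two 3-D scatter plots of sepal length, petal length, and petal width, colored by K-Means labels (left) and by ground-truth species (right)]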

K-Means results vary with the initialization
The algorithm depends on a good estimate of the number of clusters
The algorithm is also subject to the problem of local minima
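
The sensitivity to initialization can be seen in a minimal NumPy sketch of Lloyd's iterations, the basic procedure behind k-means. This is an illustration, not sklearn's implementation: the function `lloyd_kmeans`, the synthetic three-blob data, and the seeds are all assumptions made here. Each seed picks different starting centers, and the final within-cluster sum of squares (inertia) may differ between runs when a run settles in a local minimum.

```python
import numpy as np

def lloyd_kmeans(X, k, seed, n_iter=100):
    """Lloyd iterations starting from k randomly chosen data points (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for each observation.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and inertia (sum of squared distances to the assigned center).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centers, inertia

# Synthetic data (an assumption for illustration): three Gaussian blobs in 2-D.
rng = np.random.RandomState(0)
blobs = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
                   for c in [(0, 0), (5, 0), (0, 5)]])

# One run per seed; the inertias need not all be equal.
inertias = [lloyd_kmeans(blobs, k=3, seed=s)[2] for s in range(10)]
best_inertia = min(inertias)
print(inertias)
```

This is also why `KMeans` reports `n_init=10` in the output above: the fit is repeated from several initializations and the run with the lowest inertia is kept.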

Mean-shift

In this example, the mean-shift algorithm is used to group similar colors in an image.
In image processing, this process is called color quantization.



In [42]:

    
import cv2
I = cv2.cvtColor(cv2.imread('./data/BSD-118035.jpg'), cv2.COLOR_BGR2RGB)



In [43]:

    
imshow(I)









    Out[43]:

[Figure: the input image]



In [44]:

    
# I_Lab is not defined in this notebook; the RGB image I has the same height and width
h, w, _ = I.shape
X = I.reshape(h*w, -1)
X









    Out[44]:





array([[ 10,  37,  44],
       [ 10,  37,  44],
       [ 10,  37,  44],
       ..., 
       [ 91, 117, 130],
       [ 96, 120, 130],
       [ 24,  40,  40]], dtype=uint8)

The mean-shift implementation in sklearn employs a flat kernel
Such a kernel is defined by a bandwidth parameter
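
To make the role of the flat kernel concrete, here is a minimal NumPy sketch of the mean-shift update for a single starting point. It is a simplified illustration under stated assumptions, not sklearn's code: the helper `flat_kernel_mode`, its `tol` and `max_iter` defaults, and the two-blob data are hypothetical. With a flat kernel, every point within `bandwidth` of the current center counts equally and every other point is ignored, so each step simply moves the center to the mean of the points inside the window.

```python
import numpy as np

def flat_kernel_mode(X, start, bandwidth, max_iter=300, tol=1e-3):
    """Follow mean-shift updates from `start` until the center stops moving."""
    X = np.asarray(X, dtype=float)
    center = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        # Flat kernel: weight 1 inside the window of radius `bandwidth`, 0 outside.
        inside = np.linalg.norm(X - center, axis=1) <= bandwidth
        if not inside.any():
            break
        new_center = X[inside].mean(axis=0)
        shift = np.linalg.norm(new_center - center)
        center = new_center
        if shift < tol:
            break
    return center

# Synthetic data (an assumption for illustration): two well-separated 2-D blobs.
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                  rng.normal(10.0, 0.5, size=(50, 2))])

mode_a = flat_kernel_mode(data, data[0], bandwidth=2.0)    # starts in the first blob
mode_b = flat_kernel_mode(data, data[-1], bandwidth=2.0)   # starts in the second blob
print(mode_a, mode_b)
```

Starting points that converge to the same mode end up in the same cluster, which is why the number of clusters does not have to be specified in advance.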
Bandwidth can be automatically selected
- Sampling of inter-pixel color distances
  - Euclidean distance in Lab approximates human perception
- A quantile is selected to pick the bandwidth value
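
The two steps listed above can be sketched in NumPy as follows. This is a simplified reading of the description, not necessarily the exact procedure inside `estimate_bandwidth`; the helper `quantile_bandwidth`, its defaults, and the random pixel data are assumptions for illustration. Note the cast to float before subtracting: `uint8` pixel values would otherwise wrap around.

```python
import numpy as np

def quantile_bandwidth(X, quantile=0.1, n_samples=200, seed=0):
    """Pick a bandwidth as a quantile of pairwise distances in a random sample."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float)          # avoid uint8 wrap-around in differences
    n = min(n_samples, len(X))
    sample = X[rng.choice(len(X), size=n, replace=False)]
    # Euclidean distance between every pair of sampled colors.
    diff = sample[:, None, :] - sample[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Keep each unordered pair once and drop the zero self-distances.
    upper = np.triu_indices(n, k=1)
    return np.percentile(dist[upper], 100 * quantile)

# Hypothetical pixel data: 500 random 8-bit colors.
rng = np.random.RandomState(1)
pixels = rng.randint(0, 256, size=(500, 3)).astype(np.uint8)

b_small = quantile_bandwidth(pixels, quantile=0.1)
b_large = quantile_bandwidth(pixels, quantile=0.5)
print(b_small, b_large)
```

A smaller quantile gives a smaller bandwidth, which in turn tends to produce more, tighter clusters.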



In [45]:

    
from sklearn.cluster import MeanShift, estimate_bandwidth



In [46]:

    
b = estimate_bandwidth(X, quantile=0.1, n_samples=2500)
ms = MeanShift(bandwidth=b, bin_seeding=True)
ms.fit(X)









    Out[46]:





MeanShift(bandwidth=21.426797705163175, bin_seeding=True, cluster_all=True,
     min_bin_freq=1, seeds=None)

The option bin_seeding=True initializes the kernel locations to a discretized version of the points, where points are binned onto a grid whose coarseness corresponds to the bandwidth.
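
The binning idea can be sketched in a few lines of NumPy. This is an illustration of the description above rather than sklearn's code; the helper name `grid_seeds`, its `min_bin_freq` argument, and the toy points are assumptions. Each point is rounded to the nearest grid node, and each occupied node becomes one starting location, so many nearby pixels share a single seed.

```python
import numpy as np

def grid_seeds(X, bin_size, min_bin_freq=1):
    """Return one seed per occupied grid cell of width `bin_size`."""
    X = np.asarray(X, dtype=float)
    # Round each point to the index of its nearest grid node.
    bins = np.round(X / bin_size).astype(int)
    # Distinct occupied nodes and the number of points that fell on each.
    unique_bins, counts = np.unique(bins, axis=0, return_counts=True)
    # Keep sufficiently populated nodes and map them back to data coordinates.
    return unique_bins[counts >= min_bin_freq] * bin_size

# Toy 2-D points (hypothetical): two near the origin, three near (10, 10).
points = np.array([[0, 0], [1, 1], [9, 9], [10, 10], [11, 10]])

seeds = grid_seeds(points, bin_size=10)
print(seeds)          # two seeds: [0, 0] and [10, 10]

dense_seeds = grid_seeds(points, bin_size=10, min_bin_freq=3)
print(dense_seeds)    # only [10, 10] has at least three points
```

Seeding from the grid instead of from every pixel reduces the number of starting locations considerably, which matters when the input is a full image.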

ms.labels_ keeps the cluster identification for each pixel
ms.cluster_centers_ stores the cluster centers
Color quantization is performed by assigning to each pixel the value of its assigned cluster center
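
As a small illustration of this mapping, the sketch below replaces each pixel with its cluster center using NumPy integer-array indexing. The label array, the center values, and the 2x3 image size are made-up values for the example; in the notebook the corresponding inputs would be `ms.labels_` and `ms.cluster_centers_`. Rounding before the cast to `uint8` is a choice made here.

```python
import numpy as np

# Hypothetical inputs: one cluster index per pixel and one RGB center per cluster.
labels = np.array([0, 1, 1, 0, 2, 2])
centers = np.array([[10.4, 37.2, 44.0],
                    [91.0, 117.6, 130.1],
                    [24.0, 40.0, 40.3]])
h, w = 2, 3

# Indexing the centers with the label array looks up a color for every pixel at once.
quantized = centers[labels].reshape(h, w, 3)
quantized = np.clip(np.rint(quantized), 0, 255).astype(np.uint8)

# Equivalent loop over clusters, for comparison.
looped = np.zeros((h, w, 3), dtype=np.uint8)
label_image = labels.reshape(h, w)
for c in range(len(centers)):
    looped[label_image == c] = np.rint(centers[c])

print(quantized)
```

Both forms produce the same image; the indexed version avoids the explicit loop over clusters.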



In [47]:

    
S = zeros_like(I)
L = ms.labels_.reshape(h, w)
num_clusters = ms.cluster_centers_.shape[0]
print(num_clusters)

for c in range(num_clusters):
    S[L == c] = ms.cluster_centers_[c]



In [48]:

    
subplot(1,2,1)
imshow(I)
subplot(1,2,2)
imshow(S)









    Out[48]:

[Figure: the original image (left) and the color-quantized image (right)]



In [51]:

    
from mpl_toolkits.mplot3d import Axes3D

fig = figure(figsize=(14,8))
ax = fig.add_subplot(1, 2, 1, projection='3d')
# Left panel: each pixel drawn in its own RGB color, scaled to [0, 1]
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=array(X, dtype=float32)/255)
ax.set_title('Original colors')

ax = fig.add_subplot(1, 2, 2, projection='3d')
centroid_color = [ms.cluster_centers_[c]/255 for c in ms.labels_]
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=centroid_color)
ax.set_title('Clustering')









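    Out[51]:

[Figure: two 3-D scatter plots of the pixels in RGB space, drawn in their original colors (left) and in the color of their assigned cluster center (right)]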