Clustering

  • Goal: group observations into clusters
  • How the grouping is done (the properties required of each cluster) is what distinguishes one algorithm from another

In [6]:
from sklearn.datasets import load_iris
iris = load_iris()

K-Means


In [2]:
from sklearn.cluster import KMeans

X = iris.data

k_means = KMeans(n_clusters=3)
k_means.fit(X)


Out[2]:
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
    n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
    verbose=0)

In [3]:
k_means.labels_


Out[3]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1,
       2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2,
       1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1], dtype=int32)

In [4]:
figsize(14,5)
subplot(1,2,1)
scatter(X[:,2], X[:,3], c=k_means.labels_)
title(u'K-Means')

subplot(1,2,2)
scatter(X[:,2], X[:,3], c=iris.target)
title(u'Ground-truth')


Out[4]:
<matplotlib.text.Text at 0x7f47db2ee090>
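
The cluster indices returned by K-Means are arbitrary, so the colors in the two plots need not match even where the grouping agrees. A permutation-invariant score gives a numeric comparison with the ground truth. The sketch below uses adjusted_rand_score from sklearn.metrics; the fixed random_state and n_init values are assumptions added only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()

# random_state and n_init are fixed here only so the run is repeatable
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, labels)
print("Adjusted Rand index: %.3f" % ari)

# Renaming the clusters does not change the score
relabeled = (labels + 1) % 3
ari_relabeled = adjusted_rand_score(iris.target, relabeled)
```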

In [5]:
from mpl_toolkits.mplot3d import Axes3D

fig = figure(figsize=(14,6))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:, 0], X[:, 2], X[:, 3], c=k_means.labels_)
ax.set_title(u'K-Means')
ax = fig.add_subplot(1, 2, 2, projection='3d')
ax.scatter(X[:, 0], X[:, 2], X[:, 3], c=iris.target)
ax.set_title('Ground-truth')


Out[5]:
<matplotlib.text.Text at 0x7f47dac58210>

  • K-Means results vary with the initialization
  • The algorithm depends on a good estimate of the number of clusters
  • The algorithm is also subject to local minima
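
These points can be checked empirically. The sketch below is one possible experiment, not part of the original notebook: it runs K-Means with a single random initialization under several seeds and compares the final inertia (within-cluster sum of squares), then lists the inertia for several values of k. The seeds, init="random" and the range of k are assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# One random initialization per run: each run may stop at a different local minimum
single_run_inertias = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X).inertia_
    for seed in range(10)
]
print("single runs: min=%.2f max=%.2f"
      % (min(single_run_inertias), max(single_run_inertias)))

# n_init=10 repeats the procedure and keeps the run with the lowest inertia
best_of_ten = KMeans(n_clusters=3, init="random", n_init=10,
                     random_state=0).fit(X).inertia_
print("best of 10: %.2f" % best_of_ten)

# Inertia typically keeps shrinking as k grows, so it cannot pick k by itself;
# a common heuristic is to look for the "elbow" where the gain levels off
inertia_by_k = {}
for k in range(1, 7):
    inertia_by_k[k] = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print("k=%d inertia=%.2f" % (k, inertia_by_k[k]))
```

If the single-run minimum and maximum differ, the runs ended in different local minima; n_init is the usual guard against that.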

Mean-shift

  • In this example, the mean-shift algorithm is used to group similar colors in an image
  • In image processing, this process is called color quantization

In [42]:
import cv2
I = cv2.cvtColor(cv2.imread('./data/BSD-118035.jpg'), cv2.COLOR_BGR2RGB)

In [43]:
imshow(I)


Out[43]:
<matplotlib.image.AxesImage at 0x7f479745e090>

In [44]:
h, w, _ = I.shape
X = I.reshape(h*w, -1)
X


Out[44]:
array([[ 10,  37,  44],
       [ 10,  37,  44],
       [ 10,  37,  44],
       ..., 
       [ 91, 117, 130],
       [ 96, 120, 130],
       [ 24,  40,  40]], dtype=uint8)

  • The mean-shift implementation in sklearn employs a flat kernel
  • Such a kernel is defined by a bandwidth parameter
  • The bandwidth can be selected automatically
    • Color distances between pixels are computed on a sample of the image
      • Euclidean distance in Lab space approximates perceived color difference (the code in this example clusters the RGB values in I)
    • A quantile of those distances is used to pick the bandwidth value
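
To get a feel for the quantile parameter, the sketch below calls estimate_bandwidth with several quantiles on synthetic data. The three Gaussian blobs are an assumed stand-in for pixel colors, and the quantile values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import estimate_bandwidth

# Synthetic stand-in for pixel colors: three blobs in a 3-D color-like space
rng = np.random.RandomState(0)
centers = np.array([[40.0, 40.0, 40.0],
                    [128.0, 128.0, 128.0],
                    [220.0, 220.0, 220.0]])
colors = np.vstack([c + 10.0 * rng.randn(200, 3) for c in centers])

# A larger quantile takes more distant neighbors into account,
# so the estimated bandwidth grows with it
bandwidths = {}
for q in (0.05, 0.1, 0.3, 0.5):
    bandwidths[q] = estimate_bandwidth(colors, quantile=q, n_samples=300,
                                       random_state=0)
    print("quantile=%.2f -> bandwidth=%.2f" % (q, bandwidths[q]))
```

A small bandwidth tends to produce many small clusters (more colors after quantization); a large one merges them.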

In [45]:
from sklearn.cluster import MeanShift, estimate_bandwidth

In [46]:
b = estimate_bandwidth(X, quantile=0.1, n_samples=2500)
ms = MeanShift(bandwidth=b, bin_seeding=True)
ms.fit(X)


Out[46]:
MeanShift(bandwidth=21.426797705163175, bin_seeding=True, cluster_all=True,
     min_bin_freq=1, seeds=None)

The option bin_seeding=True initializes the kernel locations to a discretized version of the points, where the points are binned onto a grid whose coarseness corresponds to the bandwidth.
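
The binning idea can be sketched in a few lines of NumPy. This is an illustration of the description above, not scikit-learn's actual code; grid_seeds is a hypothetical helper and the sample points and bandwidth are made up.

```python
import numpy as np

def grid_seeds(points, bandwidth, min_bin_freq=1):
    """Hypothetical helper: snap points to a grid with cell size `bandwidth`
    and return one seed per sufficiently populated cell."""
    # Integer grid coordinates of each point
    cells = np.round(np.asarray(points, dtype=float) / bandwidth).astype(int)
    # Occupied cells and the number of points that fell into each one
    unique_cells, counts = np.unique(cells, axis=0, return_counts=True)
    # Keep populated cells and map them back to data units
    return unique_cells[counts >= min_bin_freq] * float(bandwidth)

# Four made-up colors: two dark ones and two light ones
points = np.array([[8, 37, 44], [9, 36, 45], [91, 117, 128], [96, 120, 128]])
seeds = grid_seeds(points, bandwidth=20.0)
print(seeds)
```

With far fewer seeds than pixels, mean-shift has fewer kernels to iterate, which is the reason to enable the option on an image.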

  • ms.labels_ holds the cluster label of each pixel
  • ms.cluster_centers_ stores the cluster centers
  • Color quantization is performed by assigning to each pixel the value of its cluster center

In [47]:
S = zeros_like(I)
L = ms.labels_.reshape(h, w)
num_clusters = ms.cluster_centers_.shape[0]
print(num_clusters)

for c in range(num_clusters):
    S[L == c] = ms.cluster_centers_[c]


8
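
The per-cluster loop can also be written as a single indexing operation. The sketch below uses toy values (a 2x3 label map and two made-up centers) in place of ms.labels_ and ms.cluster_centers_. It also rounds the centers before casting to 8-bit, which is a small deviation from the loop, where the float-to-uint8 assignment truncates.

```python
import numpy as np

# Toy stand-ins (assumed values) for ms.labels_ and ms.cluster_centers_
h, w = 2, 3
labels = np.array([0, 1, 1, 0, 0, 1])
centers = np.array([[10.4, 37.2, 44.9],
                    [91.0, 117.6, 130.1]])

# Indexing the centers with the label array yields one center per pixel
quantized = centers[labels].reshape(h, w, 3)

# Round and clip before casting back to 8-bit color
quantized = np.clip(np.rint(quantized), 0, 255).astype(np.uint8)
print(quantized[0, 0], quantized[0, 1])
```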

In [48]:
subplot(1,2,1)
imshow(I)
subplot(1,2,2)
imshow(S)


Out[48]:
<matplotlib.image.AxesImage at 0x7f4792027590>

In [51]:
from mpl_toolkits.mplot3d import Axes3D

fig = figure(figsize=(14,8))
ax = fig.add_subplot(1, 2, 1, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=array(X, dtype=float32)/255)
ax.set_title('Original colors')

ax = fig.add_subplot(1, 2, 2, projection='3d')
centroid_color = ms.cluster_centers_[ms.labels_] / 255.0
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=centroid_color)
ax.set_title('Clustering')


Out[51]:
<matplotlib.text.Text at 0x7f478b2924d0>