K-Means

Goal

Unsupervised learning algorithms look for structure in unlabelled data. One common objective is to find clusters: groups of datapoints that are similar to each other according to a given measure and different from the datapoints of other clusters. This definition is vague, but the idea is essentially to minimize the intra-cluster distance and maximize the inter-cluster distance (again, there exist various definitions of the distance between clusters).
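To make the objective concrete, here is a minimal sketch computing the intra-cluster quantity K-means effectively minimizes (the within-cluster sum of squared distances), on made-up toy points and assumed assignments:

import numpy as np

# hypothetical toy data: two obvious groups on the real line
points = np.array([[0.0], [1.0], [9.0], [10.0]])
labels = np.array([0, 0, 1, 1])            # assumed cluster assignments
centers = np.array([[0.5], [9.5]])         # mean of each group

# within-cluster sum of squared distances (the intra-cluster objective)
inertia = sum(((points[labels == j] - centers[j]) ** 2).sum()
              for j in range(len(centers)))
print(inertia)  # 0.25 + 0.25 + 0.25 + 0.25 = 1.0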

K-means is an iterative algorithm that can discover such clusters given the number of clusters we are looking for. This is its main drawback, as we need to specify this number beforehand. Some more advanced versions of K-means are able to detect whether the number of clusters is too low or too high, but I will not talk much about them here.

To do so, K-means alternates between the two following steps:

  • compute the centers of the clusters (for the first step, they are taken at random within the dataset; after that, they are defined as the barycenter/mean of the clusters found by the second step)
  • build the clusters from the centers. A datapoint belongs to the cluster defined by the closest center

To explain it more sequentially, it begins by choosing k centers at random. Then it separates the dataset into clusters by computing the distance from every point to every center and assigning each datapoint to the closest center. After that, it computes the mean of each cluster (note that this will change the position of the center and thus the distances to the other points) to find the new centers, and so on and so forth.
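As a minimal illustration (hypothetical 1-D points and a deliberately bad initialization), one iteration of these two steps looks like this:

import numpy as np

points = np.array([0.0, 1.0, 9.0, 10.0])   # toy dataset
centers = np.array([0.0, 1.0])             # bad random initialization

# step 1: assign each point to its closest center
labels = np.array([np.abs(centers - p).argmin() for p in points])
# labels == [0, 1, 1, 1]

# step 2: move each center to the mean of its cluster
centers = np.array([points[labels == j].mean() for j in range(len(centers))])
# centers == [0.0, 6.67]: the second center moved towards the right-hand group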

If this is unclear (as I am sure it will be if you have never heard of the algorithm), go check the Wikipedia page, which contains very well-made visual examples of this process.

Implementation

I could not plot the successive steps within the notebook, so a new window with the plot will open when executing the last cell. If it does not appear, it might be hidden behind your web browser; reducing the browser's size may allow you to see the plots.


In [1]:
import random
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from fct import normalize_min_max, plot_2d, plot_clusters

Algorithm

Here are the different elements:

  • D is the distance matrix; each row corresponds to one center and each column to one datapoint: $D[i, j] = \text{distance}(\text{center}_i, \text{datapoint}_j)$
  • G is the matrix that specifies to which center each datapoint belongs. As for the distance matrix, the rows correspond to the centers and the columns to the datapoints: $G[i, j] = 1$ if $\text{center}_i$ is the closest center to $\text{datapoint}_j$, else $G[i, j] = 0$

The algorithm runs while the new centers are not equal to the last ones.
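For instance, with two centers and three datapoints (hypothetical distance values), D and G could look like this:

import numpy as np

# D[i, j] = distance(center_i, datapoint_j): 2 centers, 3 datapoints
D = np.array([[0.5, 2.0, 4.0],
              [3.0, 1.0, 0.5]])

# G[i, j] = 1 iff center_i is the closest center to datapoint_j
G = np.zeros_like(D, dtype=int)
G[D.argmin(axis=0), np.arange(D.shape[1])] = 1
# G == [[1, 0, 0],
#       [0, 1, 1]]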


In [4]:
def build_d(datas, centers):
    """Return a 2D-numpy array of the distances between each
    point in the dataset and the centers. The distance used is
    the euclidian distance.
    """
    # the distance matrix is d
    d = []
    for center in centers:
        # the list of distances from one center to all the points in the
        # dataset
        dist = []
        for i in range(datas.shape[0]):
            dist.append(np.linalg.norm(datas[i] - center))
        d.append(dist)
    return np.array(d)

def build_g(distances):
    """Return a 2D-numpy array of 0s and 1s that determines
    to which center belong each point in the dataset.
    """
    # k is the number of clusters we look for
    k = distances.shape[0]
    # g is the matrix of affiliation
    g = []
    for i in range(distances.shape[1]):
        # gg's element is 1 only at the index of the closest
        # center; all the others are 0
        gg = [0] * k
        # computes which center is the closest to the point
        gg[distances[:,i].argmin()] = 1
        g.append(gg)
    return np.array(g).T

def build_clusters(datas, G):
    """Return a list of clusters (lists as well) of points from the dataset."""
    k = G.shape[0]
    clusters = [[] for _ in range(k)]
    # i is the index of the centers, j of the datapoints
    for i in range(G.shape[0]):
        for j in range(G.shape[1]):
            if G[i][j] == 1:
                clusters[i].append(datas[j])
    return clusters

def new_centers(clusters):
    """Return a list of points defined as the barycenter of each new cluster."""
    centers = []
    for cluster in clusters:
        # the center of each cluster is its barycenter
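        # note: if a cluster ends up empty, np.mean returns nan here;
        # a more robust version would re-seed such a center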
        center = np.mean(cluster, axis=0)
        centers.append(center)
    return centers

def k_means(datas, k):
    """Return the centers of the clusters found after the iterative process."""
    # The initial centers are taken at random without replacement within the 
    # dataset
    centers = random.sample(list(datas), k)
    D = build_d(datas, centers)
    G = build_g(D)
    clusters = build_clusters(datas, G)
    centers_new = new_centers(clusters)

    # while the new centers are not equal to the previous ones (i.e. the
    # situation is not yet stationary), we keep iterating
    while not np.array_equal(np.array(centers), np.array(centers_new)):
        # plot the clusters with different colors. The centers are plotted
        # in blue
        plt.clf()
        plot_clusters(clusters, k)
        X = [center[0] for center in centers]
        Y = [center[1] for center in centers]
        plt.scatter(X,Y)
        plt.show(block=False)
        plt.pause(0.01)
    
        # Adopt the new centers and rebuild the clusters from them
        centers = np.copy(centers_new)
        D = build_d(datas, centers)
        G = build_g(D)
        clusters = build_clusters(datas, G)
        # Build the new centers
        centers_new = new_centers(clusters)

    plt.close()
    return centers
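As an aside, the loops in build_d and build_g can be replaced by NumPy broadcasting. This is only a sketch of an equivalent vectorized version, not the code used in this notebook:

def build_d_vectorized(datas, centers):
    """Equivalent to build_d: D[i, j] = ||center_i - datapoint_j||."""
    diff = np.asarray(centers)[:, None, :] - datas[None, :, :]
    return np.linalg.norm(diff, axis=2)

def build_g_vectorized(distances):
    """Equivalent to build_g: a one-hot column per datapoint."""
    g = np.zeros_like(distances, dtype=int)
    g[distances.argmin(axis=0), np.arange(distances.shape[1])] = 1
    return g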

Application

Run this part a few times in order to see the algorithm fall into a local minimum (it will converge towards a bad solution).


In [5]:
dimension = 2
datas = pd.read_csv('datasets/data_clustering.csv')
datas = np.array(datas)
normalize_min_max(datas, dimension)

# You can play with the number of clusters K to 
# see how it affects the result.
K = 4
centers = k_means(datas, K)
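
If scikit-learn is installed (an assumption; it is not used elsewhere in this notebook), you can sanity-check the result against its KMeans, which uses the smarter k-means++ initialization and several restarts by default:

from sklearn.cluster import KMeans

model = KMeans(n_clusters=K, n_init=10).fit(datas)
print("scikit-learn centers:\n", model.cluster_centers_)
print("our centers:\n", np.array(centers))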