Unsupervised learning algorithms look for structure in unlabelled data. One of the most common objectives is to find clusters: groups of datapoints that are similar to each other according to a given measure and different from the datapoints of other clusters. This definition is vague, but the idea boils down to minimizing the intra-cluster distance while maximizing the inter-cluster distance (here too, various definitions of the distance between clusters exist).
K-means is an iterative algorithm that can discover such clusters given the number of clusters we are looking for. This is its main drawback: we need to specify this number beforehand. Some more advanced versions of K-means are able to detect whether the number of clusters is too low or too high, but I will not talk much about them here.
To do so, K-means alternates between the two following steps:

1. Assignment: each datapoint is assigned to its closest center.
2. Update: each center is moved to the mean of the datapoints assigned to it.
To explain it more sequentially, it begins by choosing k centers at random. Then it separates the dataset into clusters by computing the distance from every point to every center and assigning each datapoint to its closest center. After that, it computes the mean of each cluster to find the new centers (note that this changes the position of the centers and thus the distances to the points), and so on and so forth.
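These two steps can be read as a descent on the within-cluster sum of squares, the standard K-means objective:

$$\min_{C_1, \dots, C_k} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the barycenter of cluster $C_i$. Each assignment step and each averaging step can only decrease this sum, which is why the procedure reaches a stationary configuration after finitely many iterations.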
If this is unclear (as I am sure it will be if you have never heard of the algorithm), go check the Wikipedia page, which contains very well made visual examples of this process.
I could not plot the successive steps within the notebook, so a new window with the plot will open when executing the last cell. If it does not appear, it might be hidden behind your web browser; reducing the browser window's size may allow you to see the plots.
In [1]:
import random
import time
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from fct import normalize_min_max, plot_2d, plot_clusters
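The helper functions come from a small local module, fct, whose source is not shown in this notebook. To keep the section self-contained, here is a minimal sketch of what normalize_min_max and plot_clusters might look like; the actual implementations may differ:

# Hypothetical sketches of two of the helpers imported from fct; treat
# them as assumptions, not as the module's actual code.
def normalize_min_max(datas, dimension):
    """Rescale the first `dimension` columns of `datas` to [0, 1] in place."""
    for j in range(dimension):
        col = datas[:, j]
        datas[:, j] = (col - col.min()) / (col.max() - col.min())

def plot_clusters(clusters, k):
    """Scatter each cluster, letting matplotlib cycle through its colors."""
    for i in range(k):
        points = np.array(clusters[i])
        if points.size > 0:
            plt.scatter(points[:, 0], points[:, 1], s=10)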
Here are the different elements:

- build_d computes D, the matrix of distances between each center and each datapoint.
- build_g derives from D the affiliation matrix G, which records which center each point belongs to.
- build_clusters gathers the datapoints of each cluster from G.
- new_centers computes the barycenter of each cluster, which becomes the new set of centers.

The algorithm runs while the new centers are not equal to the previous ones.
In [4]:
def build_d(datas, centers):
    """Return a 2D numpy array of the distances between each
    point in the dataset and the centers. The distance used is
    the Euclidean distance.
    """
    # the distance matrix is d
    d = []
    for center in centers:
        # the list of distances from one center to all the points in the
        # dataset
        dist = []
        for i in range(datas.shape[0]):
            dist.append(np.linalg.norm(datas[i] - center))
        d.append(dist)
    return np.array(d)

def build_g(distances):
    """Return a 2D numpy array of 0s and 1s that determines
    to which center each point in the dataset belongs.
    """
    # k is the number of clusters we look for
    k = distances.shape[0]
    # g is the affiliation matrix
    g = []
    for i in range(distances.shape[1]):
        # gg's elements are all 0 except for a single 1 at the index of
        # the center the point belongs to
        gg = [0] * k
        # find which center is the closest to the point
        gg[distances[:, i].argmin()] = 1
        g.append(gg)
    return np.array(g).T

def build_clusters(datas, G):
    """Return a list of clusters (lists as well) of points from the dataset."""
    k = G.shape[0]
    clusters = [[] for _ in range(k)]
    # i is the index of the centers, j of the datapoints
    for i in range(G.shape[0]):
        for j in range(G.shape[1]):
            if G[i][j] == 1:
                clusters[i].append(datas[j])
    return clusters

def new_centers(clusters):
    """Return a list of points defined as the barycenter of each cluster."""
    centers = []
    for cluster in clusters:
        # the center of each cluster is its barycenter
        center = np.mean(cluster, axis=0)
        centers.append(center)
    return centers

def k_means(datas, k):
    """Return the centers of the clusters found after the iterative process."""
    # the initial centers are taken at random, without replacement, from the
    # dataset
    centers = random.sample(list(datas), k)
    D = build_d(datas, centers)
    G = build_g(D)
    clusters = build_clusters(datas, G)
    centers_new = new_centers(clusters)
    # while the new centers differ from the previous ones (i.e. the
    # situation is not yet stationary), keep iterating
    while not np.array_equal(np.array(centers), np.array(centers_new)):
        # plot the clusters with a different color each; the centers are
        # plotted in blue
        plt.clf()
        plot_clusters(clusters, k)
        X = [center[0] for center in centers]
        Y = [center[1] for center in centers]
        plt.scatter(X, Y)
        plt.show(block=False)
        plt.pause(0.01)
        # build the new clusters from the new centers
        centers = np.copy(centers_new)
        D = build_d(datas, centers)
        G = build_g(D)
        clusters = build_clusters(datas, G)
        # and compute the new centers from them
        centers_new = new_centers(clusters)
    plt.close()
    return centers
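As a side note, the explicit Python loops in build_d and build_g are easy to follow but become slow on large datasets. The same two matrices can be computed with NumPy broadcasting; here is an equivalent sketch (not the version used in this notebook):

def build_d_vectorized(datas, centers):
    """Same output as build_d: entry (i, j) is the Euclidean distance
    between center i and datapoint j."""
    centers = np.asarray(centers)
    # (k, 1, dim) - (1, n, dim) broadcasts to (k, n, dim); the norm is
    # then taken along the coordinate axis
    return np.linalg.norm(centers[:, None, :] - datas[None, :, :], axis=2)

def build_g_vectorized(distances):
    """Same output as build_g: a 1 in row i, column j means that
    datapoint j is closest to center i."""
    k, n = distances.shape
    g = np.zeros((k, n), dtype=int)
    g[distances.argmin(axis=0), np.arange(n)] = 1
    return g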
In [5]:
dimension = 2
datas = pd.read_csv('datasets/data_clustering.csv')
datas = np.array(datas)
normalize_min_max(datas, dimension)
# You can play with the number of clusters K to
# see how it affects the result.
K = 4
centers = k_means(datas, K)
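Note that k_means only returns the centers, not the affiliation of each point. The functions defined above can be reused to recover the final clusters from the returned centers:

In [ ]:
D = build_d(datas, centers)
G = build_g(D)
clusters = build_clusters(datas, G)
# number of points in each cluster
print([len(cluster) for cluster in clusters])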