Last week, we studied the classification problem using the logistic regression algorithm. Because every training point must be labeled, that setting is called supervised learning. In many applications, however, the data come without labels. What can we do then?
Clustering is one of the most common tasks in unsupervised learning.
Goal: Group similar objects together
Input: Data $X = [x_1,\dots, x_n]$ and the number of clusters $k$
Output: A partition of the data into clusters $C_1,\dots, C_k$
To make "similar" precise, one needs to define an objective function to optimize.
Given a set of data points $X = [x_1,\dots, x_n]$, where each data point is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ $(k\le n)$ sets $C = \{C_1,\dots, C_k\}$ so as to minimize the within-cluster sum of squares (WCSS), i.e., the sum of squared distances from each point to the center of its cluster. In other words, its objective is to find: $$ \arg\min_C \sum_{i=1}^k \sum_{x\in C_i} ||x - \mu_i||^2, $$ where $\mu_i$ is the mean of the points in $C_i$.
Finding an exact solution is unfortunately NP-hard. Nevertheless, there is an iterative method, known as Lloyd’s algorithm, that converges in a few steps in practice (albeit possibly to a local minimum). The procedure alternates between two operations:
Assignment step: Given a fixed set of k centers, assign each point to the nearest center in Euclidean distance $$ C_i = \{x: ||x - \mu_i|| \le ||x - \mu_j|| \ \ \forall\, 1\le j\le k \} $$
Update step: Recompute each of the k centers as the mean of the data points currently assigned to its cluster $$ \mu_i = \frac{1}{|C_i|} \sum_{x\in C_i} x $$
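To make the two steps concrete, here is a minimal sketch of one Lloyd iteration on a toy dataset (the current centers `mu` are assumed given; the lab below builds the same pieces step by step):
In [ ]:
import numpy as np

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 4.0], [5.0, 5.0]])  # 4 points in 2-D
mu = np.array([[0.0, 0.5], [5.0, 4.5]])                             # 2 current centers

# Assignment step: index of the nearest center for each point
D2 = ((X_toy[:, np.newaxis, :] - mu[np.newaxis, :, :]) ** 2).sum(axis=2)  # (4 x 2) squared distances
assignment = np.argmin(D2, axis=1)

# Update step: each center becomes the mean of its assigned points
mu_new = np.array([X_toy[assignment == i].mean(axis=0) for i in range(2)])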
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
Let's use the same dataset as in the last lab.
In [ ]:
df = pd.read_csv('http://vuduc.org/cse6040/logreg_points_train.csv')
df.head()
In [ ]:
sns.lmplot(data=df, x="x_1", y="x_2", hue="label", fit_reg=False)
In [ ]:
points = df[['x_1', 'x_2']].values
labels = df['label'].values
n = points.shape[0]
d = points.shape[1]
k = 2
Note that the labels must not be used inside the k-means algorithm. They serve only as the ground truth for verification later.
In [ ]:
def init_centers(X, k):
    # @YOUSE: randomly sample k data points as centers
    # should return a (k x d) numpy array
    pass
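If you get stuck, one possible body is below: a sketch that samples k distinct rows of X uniformly at random (assuming X is an (n x d) NumPy array; the name `init_centers_sketch` is ours, to keep the exercise stub intact).
In [ ]:
def init_centers_sketch(X, k):
    # Pick k distinct row indices, then use those rows as the initial centers
    idx = np.random.choice(len(X), size=k, replace=False)
    return X[idx, :]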
The most computationally costly step in k-means is computing the distances from each point to the k centers. We can compute all of these distances at once and store them in an (n x k) matrix. Because the objective function is a sum of squared distances, we store the squared Euclidean distances.
In [ ]:
def compute_d2(X, centers):
    D = np.empty((n, k))
    # @YOUSE: fill D[i, j] with the squared Euclidean distance from point i to center j
    pass
    return D
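For reference, here is one way the body could look, using broadcasting instead of explicit loops (a sketch under our own name `compute_d2_sketch`; unlike the stub above, it does not need the preallocated D or the globals n and k):
In [ ]:
def compute_d2_sketch(X, centers):
    # diff[i, j, :] = X[i, :] - centers[j, :], via broadcasting; shape (n, k, d)
    diff = X[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Sum the squared coordinate differences to get the (n x k) matrix of squared distances
    return (diff ** 2).sum(axis=2)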
Exercise: Can you make it even faster? We will discuss this next time.
In [ ]:
def cluster_points(D):
    # @YOUSE: return a length-n numpy array giving the assigned cluster of each point
    # For example, D = [[0.3, 0.2],
    #                   [0.1, 0.5],
    #                   [0.4, 0.2]]
    # should return np.array([1, 0, 1])
    pass
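A one-line body suffices here (a sketch, again under a `_sketch` name of our own): the assigned cluster of point i is the column index of the smallest entry in row i of D.
In [ ]:
def cluster_points_sketch(D):
    # For each row (point), pick the column (center) with the smallest squared distance
    return np.argmin(D, axis=1)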
In [ ]:
def update_centers(X, clustering):
    centers = np.empty((k, d))
    for i in range(k):
        # @YOUSE: compute the new center of cluster i (centers[i, :])
        pass
    return centers
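One possible body (a sketch; it assumes clustering is the length-n array returned by cluster_points, and it does not guard against empty clusters):
In [ ]:
def update_centers_sketch(X, clustering):
    centers = np.empty((k, d))
    for i in range(k):
        # New center of cluster i: the mean of the points currently assigned to it
        centers[i, :] = X[clustering == i].mean(axis=0)
    return centers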
In [ ]:
def WCSS(D):
    # @YOUSE: return the objective function value (within-cluster sum of squares)
    # For example, D = [[0.3, 0.2],
    #                   [0.1, 0.5],
    #                   [0.4, 0.2]]
    # should return 0.2 + 0.1 + 0.2 = 0.5
    pass
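Since each point contributes the squared distance to its nearest center, the objective is simply the sum of the row-wise minima of D (a sketch):
In [ ]:
def WCSS_sketch(D):
    # Sum, over all points, of the squared distance to the assigned (nearest) center
    return np.sum(np.min(D, axis=1))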
In [ ]:
def has_converged(old_centers, centers):
    # @YOUSE: return True if the k center points remain the same
    # note that the ordering may be different
    pass
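One way to ignore the ordering is to compare the two sets of centers as sets of coordinate tuples (a sketch; exact floating-point equality is fine here because, once the assignments stop changing, the update step reproduces the centers exactly):
In [ ]:
def has_converged_sketch(old_centers, centers):
    # Compare as sets of tuples so that the ordering of the k centers does not matter
    return set(map(tuple, old_centers)) == set(map(tuple, centers))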
In [ ]:
def kmeans(X, k):
    # @YOUSE: implement the k-means algorithm
    # print the objective function value (WCSS) at each iteration until it converges
    # return the final clustering
    pass
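Putting the pieces together, one possible main loop looks like this (a sketch; it assumes the exercise functions above have been filled in):
In [ ]:
def kmeans_sketch(X, k):
    centers = init_centers(X, k)
    converged = False
    while not converged:
        D = compute_d2(X, centers)             # (n x k) squared distances
        clustering = cluster_points(D)         # nearest center for each point
        print("WCSS = {}".format(WCSS(D)))     # objective value at this iteration
        old_centers = centers
        centers = update_centers(X, clustering)
        converged = has_converged(old_centers, centers)
    return clustering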
In [ ]:
clustering = kmeans(points, k)
In [ ]:
df['clustering'] = clustering
sns.lmplot(data=df, x="x_1", y="x_2", hue="clustering", fit_reg=False)
In [ ]:
# Cluster IDs are arbitrary, so the clustering may match the labels directly or with 0/1 swapped;
# count the mismatches both ways and report the smaller.
difference = np.sum(labels != clustering)
error = min(difference, n - difference)
error