Last week, we studied the classification problem using the logistic regression algorithm. Because every training point must be labeled, that setting is called supervised learning. In many applications, however, the data come without labels. What can we do then?
Clustering is one of the most common tasks in unsupervised learning.
Goal: Group similar objects together
Input: Data $X = [x_1,\dots, x_n]$ and the number of clusters $k$
Output: A partition of the data into clusters $C_1,\dots, C_k$
To make "similar" precise, one needs to define an objective function to optimize.
Given a set of data points $X = [x_1,\dots, x_n]$, where each data point is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ $(k\le n)$ sets $C = \{C_1,\dots, C_k\}$ so as to minimize the within-cluster sum of squares (WCSS), i.e., the sum of squared distances from each point to the center of its cluster. In other words, its objective is to find: $$ \arg\min_C \sum_{i=1}^k \sum_{x\in C_i} ||x - \mu_i||^2, $$ where $\mu_i$ is the mean of the points in $C_i$.
Finding an exact solution is unfortunately NP-hard. Nevertheless, there is an iterative method, known as Lloyd’s algorithm, that converges in a few steps in practice (albeit possibly to a local minimum). The procedure alternates between two operations:
Assignment step: Given a fixed set of k centers, assign each point to the nearest center in Euclidean distance $$ C_i = \{x: ||x - \mu_i|| \le ||x - \mu_j|| \ \ \forall\, 1\le j\le k \} $$
Update step: Recompute each of the k centers as the mean of the data points currently assigned to its cluster $$ \mu_i = \frac{1}{|C_i|} \sum_{x\in C_i} x $$
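To make the two steps concrete, here is a minimal sketch of one Lloyd iteration on a toy dataset (the current centers `mu` are assumed given; the lab below builds the same pieces step by step):
In [ ]:
import numpy as np

X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 4.0], [5.0, 5.0]])  # 4 points in 2-D
mu = np.array([[0.0, 0.5], [5.0, 4.5]])                             # 2 current centers

# Assignment step: index of the nearest center for each point
D2 = ((X_toy[:, np.newaxis, :] - mu[np.newaxis, :, :]) ** 2).sum(axis=2)  # (4 x 2) squared distances
assignment = np.argmin(D2, axis=1)

# Update step: each center becomes the mean of its assigned points
mu_new = np.array([X_toy[assignment == i].mean(axis=0) for i in range(2)])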
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
Let's use the same dataset as in the last lab.
In [ ]:
df = pd.read_csv('http://vuduc.org/cse6040/logreg_points_train.csv')
df.head()
In [ ]:
sns.lmplot(data=df, x="x_1", y="x_2", hue="label", fit_reg=False)
In [ ]:
points = df[['x_1', 'x_2']].values
labels = df['label'].values
n = points.shape[0]
d = points.shape[1]
k = 2
Note that the labels must not be used inside the k-means algorithm. They serve only as the ground truth for verification later.
In [ ]:
def init_centers(X, k):
    # @YOUSE: randomly sample k data points as centers
    # should return a (k x d) numpy array
    pass
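If you get stuck, one possible body is below: a sketch that samples k distinct rows of X uniformly at random (assuming X is an (n x d) NumPy array; the name `init_centers_sketch` is ours, to keep the exercise stub intact).
In [ ]:
def init_centers_sketch(X, k):
    # Pick k distinct row indices, then use those rows as the initial centers
    idx = np.random.choice(len(X), size=k, replace=False)
    return X[idx, :]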
The most computationally costly step in k-means is computing the distances from each point to the k centers. We can compute all of these distances at once and store them in an (n x k) matrix. Because the objective function is a sum of squared distances, we store the squared Euclidean distances.
In [ ]:
def compute_d2(X, centers):
    D = np.empty((n, k))
    # @YOUSE: fill D[i, j] with the squared Euclidean distance from point i to center j
    pass
    return D
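For reference, here is one way the body could look, using broadcasting instead of explicit loops (a sketch under our own name `compute_d2_sketch`; unlike the stub above, it does not need the preallocated D or the globals n and k):
In [ ]:
def compute_d2_sketch(X, centers):
    # diff[i, j, :] = X[i, :] - centers[j, :], via broadcasting; shape (n, k, d)
    diff = X[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Sum the squared coordinate differences to get the (n x k) matrix of squared distances
    return (diff ** 2).sum(axis=2)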
Exercise: Can you make it even faster? We will discuss this next time.
In [ ]:
def cluster_points(D):
    # @YOUSE: return a length-n numpy array giving the assigned cluster of each point
    # For example, D = [[0.3, 0.2],
    #                   [0.1, 0.5],
    #                   [0.4, 0.2]]
    # should return np.array([1, 0, 1])
    pass
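A one-line body suffices here (a sketch, again under a `_sketch` name of our own): the assigned cluster of point i is the column index of the smallest entry in row i of D.
In [ ]:
def cluster_points_sketch(D):
    # For each row (point), pick the column (center) with the smallest squared distance
    return np.argmin(D, axis=1)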
In [ ]:
def update_centers(X, clustering):
    centers = np.empty((k, d))
    for i in range(k):
        # @YOUSE: compute the new center of cluster i (centers[i, :])
        pass
    return centers
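One possible body (a sketch; it assumes clustering is the length-n array returned by cluster_points, and it does not guard against empty clusters):
In [ ]:
def update_centers_sketch(X, clustering):
    centers = np.empty((k, d))
    for i in range(k):
        # New center of cluster i: the mean of the points currently assigned to it
        centers[i, :] = X[clustering == i].mean(axis=0)
    return centers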
In [ ]:
def WCSS(D):
    # @YOUSE: return the objective function value (within-cluster sum of squares)
    # For example, D = [[0.3, 0.2],
    #                   [0.1, 0.5],
    #                   [0.4, 0.2]]
    # should return 0.2 + 0.1 + 0.2 = 0.5
    pass
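Since each point contributes the squared distance to its nearest center, the objective is simply the sum of the row-wise minima of D (a sketch):
In [ ]:
def WCSS_sketch(D):
    # Sum, over all points, of the squared distance to the assigned (nearest) center
    return np.sum(np.min(D, axis=1))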
In [ ]:
def has_converged(old_centers, centers):
    # @YOUSE: return True if the k center points remain the same
    # note that the ordering may be different
    pass
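One way to ignore the ordering is to compare the two sets of centers as sets of coordinate tuples (a sketch; exact floating-point equality is fine here because, once the assignments stop changing, the update step reproduces the centers exactly):
In [ ]:
def has_converged_sketch(old_centers, centers):
    # Compare as sets of tuples so that the ordering of the k centers does not matter
    return set(map(tuple, old_centers)) == set(map(tuple, centers))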
In [ ]:
def kmeans(X, k):
    # @YOUSE: implement the k-means algorithm
    # print the objective function value (WCSS) at each iteration until it converges
    # return the final clustering
    pass
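Putting the pieces together, one possible main loop looks like this (a sketch; it assumes the exercise functions above have been filled in):
In [ ]:
def kmeans_sketch(X, k):
    centers = init_centers(X, k)
    converged = False
    while not converged:
        D = compute_d2(X, centers)             # (n x k) squared distances
        clustering = cluster_points(D)         # nearest center for each point
        print("WCSS = {}".format(WCSS(D)))     # objective value at this iteration
        old_centers = centers
        centers = update_centers(X, clustering)
        converged = has_converged(old_centers, centers)
    return clustering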
In [ ]:
clustering = kmeans(points, k)
In [ ]:
df['clustering'] = clustering
sns.lmplot(data=df, x="x_1", y="x_2", hue="clustering", fit_reg=False)
In [ ]:
# Cluster IDs are arbitrary, so the clustering may match the labels directly or with 0/1 swapped;
# count the mismatches both ways and report the smaller.
difference = np.sum(labels != clustering)
error = min(difference, n - difference)
error