6. Unsupervised Learning

In [5]:
%matplotlib inline

Clustering (and outlier detection) and dimension reduction. Focus on former in this unit.

Points are represented in high-dim Euclidean spaces or any metric spaces.

Distance functions

  • $d : D \times D \rightarrow [0, \infty)$
  • $d(x, y) \ge 0, \> \forall x, y \in D$
  • $d(x, x) = 0, \> \forall x \in D$
  • $d(x, y) = d(y, x), \> \forall x, y \in D$
  • $d(x, y) + d(y, z) \ge d(x, z), \> \forall x, y, z \in D$ (Triangle inequality)
  • Bonus if $d(x, y) = 0 \implies x = y, \> \forall x, y \in D$, then we say that $D$ is a metric space.

Note: it seems certain authors simply ignore distance functions without the identity of indiscernables and also call metrics distance functions (e.g. in ML slides).

Types of clustering

Hierarchical clustering

Build a tree representing distances among the data points.

Each point starts by being its own cluster, and then cluster keep getting combined based on their "closeness", until a certain threshold is reached (e.g. desired cluster count or compactness). Alternatively, one can keep going until only a single "cluster" exists, and return a hierarchy of clusters.

Other merging criteria:

  • minimum distance between clusters, where that distance is minimum distance between any points of each cluster
  • cluster radius (merge clusters resulting in the smallest-radius cluster) or diameter.

Other stopping criteria:

  • a new best cluster's diameter exceeds some threshold
  • a new best cluster's density exceeds some threshold (density = e.g. nr points / volume)
  • a new best cluster is considered bad (sudden jump in average diameter, compared to previous steps)

Picking a clustroid in non-Euclidean spaces: a point which minimizes:

  • $\sum$ of distances to other points in cluster
  • max distance to another point in cluster
  • $\sum$ of square distances

Overall, not that many things change in non-Euclidean spaces.

Examples: single-, average-linkage agglomerative clustering.

Partitional approaches

Define and optimize an objective function defined over partitions. TODO: What are partitions? Nothing in textbook

Examples: spectral clustering, graph-cut based approaches.

Model-based approaches (main focus)

Maintain cluster "models" and infer cluster membership. Point assignment techniques. Viewed as the "standard" clustering methods.

Examples: k-means, BFR, CURE, Gaussian mixture models (GMMs), etc.


The GRGPF algorithm handles non-main-memory data and does not require a Euclidean space. It takes ideas from both hierarchical and point-assignments (model-based) techniques: it represents clusters by sample points in main memory, but also tries to organize them hierarchically.

GRGPF: the donut technique: in a cluster, keep track of the clustroid, the k-closest, and k-furthest elements.

Other categorization

Euclidean space assumption

Some algorithms only work in Euclidean spaces, whereas others work with any distance measure.

Key difference: Euclidean spaces allow expressing clusters as centroids (average of points). This is not possible in non-Euclidean spaces; instead, we use clustroids, which are actual points from the dataset (which we consider representative).

Scale assumption

Whether the algorithm assumes the data fits in main or just secondary memory. If the latter is true, we need to take certain shortcuts and summarize our clusters in memory.

The curse of dimensionality

In high dimensional spaces (especially but not necessarily Euclidean spaces) almost all pairs of points are equally far away from each other and almost any two vectors are orthogonal (the cosine of their angle is almost always zero, since the $\cos$ denominator grows linearly in $d$, but the numerator doesn't).

This is problematic for clustering, as it becomes very hard to build clusters when there are essentially no pairs of points which are close.

Example: Volume of a high-dimensional sphere

Consider a sphere of radius $r = 1$ in $D$-dimensional space. $V_{D}(r) = K_{D}r^{D}$, with $K_D$ being some constant (e.g. $\frac{4}{3}\pi$ in $\mathbb{R}^{2}$).

What is the volume of the sphere that lies between radius $r = 1 - \epsilon$ and $r = 1$? (i.e. its outer shell).

\begin{equation} \begin{aligned} \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} & = \frac{K_D - K_D(1 - \epsilon)^{D}}{K_D} \\ & = 1 - (1 - \epsilon)^{D} \xrightarrow {D\to\infty} 1 \end{aligned} \end{equation}

"Thus, in spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface!"

-- Bishop, PRML

The following example is based on the second data mining homework, and implements a simple form of cosine hashing, where a 1-bit hash function consists of the sign of a dot product between a to-be-hashed vector and the randomly vector which characterizes the hash function.

It can be seen that as the number of dimensions increases, the utility of the hash functions drops, as most vector pairs become orthogonal (normalized "angle" 0.5).

In [14]:
import random
import math

import numpy as np
import matplotlib.pyplot as plt

def rand_vector_uniform(dimension):
    v = [random.uniform(-1, 1) for _ in range(dimension)]
    return normalize(v)

def norm(v):
    return math.sqrt(sum([x * x for x in v]))

def normalize(v):
    nrm = norm(v)
    return [x / nrm for x in v]

def angle(v1, v2):
    assert len(v1) == len(v2)
    return math.acos(dot_product(v1, v2) / (norm(v1) * norm(v2)))

def random_hash_function(dimension):
    h = rand_vector_uniform(dimension)
    def hash_function(vector):
        return math.copysign(1.0, np.inner(vector, h))
    return hash_function

def hash_agree(v1, v2, hashes):
    return len([h for h in hashes if h(v1) == h(v2)])

def p_same_hash(N, M, d):
    """N vector pairs, M hash functions, d dimensions."""
    vecgen = rand_vector_uniform
    vector_pairs = [(vecgen(d), vecgen(d)) for _ in range(N)]
    hashes = [random_hash_function(d) for _ in range(M)]

    angle_coefs = [1 - angle(v1, v2) / math.pi for (v1, v2) in vector_pairs]
    probabilities = [hash_agree(v1, v2, hashes) * 1.0 / M for (v1, v2) in vector_pairs]
    y = angle_coefs
    x = probabilities

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(x, y, label="foo")
    ax.set_xlabel("Normalized angle (1 - angle(u, v) / pi)")
    ax.set_ylabel("Probability of u, v hashing to the same bucket")
    ax.set_ylim([0, 1])
    ax.set_xlim([0, 1])
    fig.text(0.5, 0.85, "Dimensions: %d, N = %d, M = %d" % (d, N, M), ha='center')
    # fig.savefig("plot-%d-%d-%d.png" % (N, M, d))

In [15]:
p_same_hash(200, 100, 2)

In [16]:
p_same_hash(200, 100, 10)

In [22]:
p_same_hash(200, 100, 100)

The k-means problem

  • Points in Euclidean space, $x_i \in \mathbb{R}^d$.
  • Clusters as centers $\mu_j \in \mathbb{R}^d$; need to know cluster count in advance.
  • Each point assigned to closest center.
  • Goal: pick centers to minimize average squared distance:
\begin{equation} \begin{aligned} L(\mu) & = L(\mu_1, \dots, \mu_k) = \sum_{i = 1}^{N} \min_{j} \| x_i - \mu_j \|_2^2 \\ \mu^{*} & = \arg \min_{\mu} L(\mu) \end{aligned} \end{equation}

NP-hard to solve, so we use Lloyd's heuristic, commonly (though not entirely accurately) referred to as the k-means algorithm. Even this heuristic can take exponential time in $\mathbb{R}^{2}$ with specially crafted input!

The feasible space for k-means is always $\mathbb{R}^{d \> \times \> k}$ (no projection possible).

Can pick an optimal $k$ using a binary-search technique. We search for the largest value of $k$ for which a decrease below k clusters results in a radically higher average diameter of the clusters.

The algorithm:

  • Initialize cluster centers (randomly or in a smarter way): $\mu^{(0)} = [\mu_1^{(0)}, \dots, \mu_k^{(0)} ]$
  • Assign every point to closest cluster: $z_i = \arg \min_{j} \| x_i - \mu_j^{t - 1} \|_2^2$
  • Update cluster centers to be at the center of the newly updated cluster: $\mu_j^{(t)} = \frac{1}{n_j} \sum_{i:z_i = j} x_i$
  • Repeat until convergence (e.g. minimum cluster center movement threshold).


  • Guaranteed (can be shown) to monotonically decrease average squared distance in each iteration: $L(\mu^{t+1}) \le L(\mu^{t})$
  • Converges to local optimum; in practice, we do multiple restarts so that we minimize the chance to get stuck in a local optimum.
  • $O(nkd)$ per iteration ($n$ elements, $k$ clusters, $d$ dimensions); have to process entire data set in every iteration, so difficult to parallelize.

Initializing the clusters

Want to pick points that have a good chance of lying in different clusters. Two approaches:

  1. Pick points that are far away from each other.

    • Pick first point at random, and then sequentially pick the farthest point from the already-picked points.
    • There are better way to do this; see coresets below!
  2. Cluster a subsample of the data (e.g. using hierarchical clustering) until we get k clusters, and use their centroids as initial k-means centroids.

TODO: k-means++

The BFR algorithm

Variant of k-means designed for high-dim Euclidean spaces.

Makes strong assumptions about the clusters--they must be normally distributed around a centroid, but is capable of streaming through data (though not in a parallel or distributed fashion).

Keep track of "lite" representation of data in memory:

  • The Discard Set: highly summarized information on points which clearly belong to a cluster. The points themselves are not retained, hence the name.
  • The Compressed Set: summaries for groups of points found close to each other, but not to any cluster. Also called miniclusters.
  • The Retained Set: outliers; stored uncompressed in main memory until they can be shoved into one of the previous two categories.

Image credit: Rajaraman, Anand, and Jeffrey D. Ullman. Mining of massive datasets. Vol. 77. Cambridge: Cambridge University Press, 2012.

Clustering a stream (mostly from book)

Sliding window approach (of size $N$). Ask for centroids for last $m \le N$ points.

Idea: precluster subsets of points so that we can quickly answer queries.

The BDMO Algorithm

  • Based on technique for counting items in stream (DGIM algorithm for counting 1's in the sliding window of a stream).
  • Uses buckets over time, each with its own clusters.

Parallel environments

  • Can use MapReduce, but we're almost always constrained to a single reducer.
  • Coreset merging can be hierarchical; TODO: elaborate
  • General idea:
    • mappers apply clustering on partitions of the data and output their clusters
    • reducer merges these clusters (e.g. by running a clustering algorithm again)

Online k-means

Online SVM (& SGD) had the property that theirr loss function decomposed additively over the data set; the total loss is just a sum of individual hinge loss funtion values, based on the model $w$.

\begin{equation} L(w) = \sum_{t=1}^{N} \operatorname{hinge}(x_t; y_t, w) \end{equation}

This meant that we could take a sub-gradient for every data point and iteratively update our model:

\begin{equation} w^{(t+1)} = \operatorname{Proj}_{S}\left(w^{(t)} - \eta_t \nabla f_t(w_t)\right) \end{equation}

For k-means, our loss function is also just a sum of "individual data point penalties". I.e., it also decomposes additivel over the data set.

Note: the order of $x$ and $\mu$ doesn't matter, since the one we're deriving by will always be first in the derivative formula (calculus 101, but I got confused by it for some reason).

\begin{equation} L(\mu) = \sum_{t=1}^{N} \min_j \| x_t - \mu_j \|_2^2 = \sum_{t=1}^{N} l_t(\mu) \end{equation}

We differentiate by $\mu$ since it is the thing we want to optimize (change). We obviously can't compute any gradient with respect to $x$, since we can't "change" the data.

\begin{equation} \frac{\partial}{\partial \mu_j} l_t(\mu) = 2(\mu_j - x_t) \end{equation}

Question: dm-10:11, does the order of the arguments inside $\| \cdot \|$ matter? Should it? Nope. The first, positive, term is always going to be the one we're deriving by.

Our function is not diffable all the way, since it has discontinuities (see full definition in slides), but we can still use a subgradient to implement online k-means.

The algorithm

  • Initialize centers randomly
  • For t = 1:N
    • Find closest cluster: $c = \arg \min_j \| \mu_j - x_t \|_2$
    • Move that cluster slightly towards the current point: $\mu_c \leftarrow \mu_c + \eta_t (x_t - \mu_c)$

Only requirement to converge (to local optimum): $\sum_t \eta_t = \infty$, $\sum_t\eta_t^2 < \infty$. $\eta_t = \frac{1}{t}$ is a good option.

Remember: $\sum_{n=1}^{\infty} \frac{1}{n} = \infty$ BUT $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$.

Practical aspects

  • Random ordering is best (like in SGD)
  • Want to first choose larger $k$, then drop/merge small clusters
  • Run multiple random "restarts" in parallel, by keeping track of several cluster sets at once (initialized differently)

However, this algorithm still doesn't lend itself to parallelization very nicely. Moreover, the objective function is non-convex, unlike SGD. We can get stuck in local optima.


Alternative to online k-means: summarize large data sets, and then run k-means on the summary.

Key idea: represent many points by one weighted representative. Our k-means objective now becomes ($C$ is the summarized version of our data $D$):

\begin{equation} L_k(\mu; C) = \sum_{(w, x) \in C} w \min_{j} \| \mu_j - x \|_2^2 \end{equation}

$C$ is called a $(k, \epsilon)$-coreset for data set $D$ if ($k$ clusters, $\epsilon$ error threshold vs. the clustering for the whole dataset):

\begin{equation} (1 - \epsilon)L_k(\mu; D) \le L_k(\mu; C) \le (1 + \epsilon)L_k(\mu; D) \end{equation}

Constructing a coreset

Uniform sampling fails for imbalanced data, which is very common.

Use importance sampling instead (do not confuse with uncertainty sampling from active learning).

Key idea: interpret k-means cost function as expectation, i.e. a sum over products between some function value (a point's distance to the clusters) and a probability (uniform and boring $\frac{1}{N}$ if we take all our points into consideration, which is the default in the original k-means problem statement; this is the insight -- we can sample much fewer points than $N$ according to some other (smarter) distribution and they'll still approximate our $L$ well.

So what distribution to use?

We want to be biased towards small clusters, so we don't miss those points like we would be likely to if we were doing uniform sampling. And to compensate for this bias, we also want to give them a small weight (weight $\gamma \propto \frac{1}{q}$).

This also correspons to the mathematical consequence of treating or loss function as an expectation and swapping our uniform sampling with something else. We had:

\begin{equation} L(\mu; D) \approx \frac{1}{m} \sum_{l=1}^{m} \frac{d(x_{i_l}, \mu)}{q(i_l)} \end{equation}

When sampling according to distribution $q$, we have to give every term $i$ a weight of $\frac{1}{q(i)}$ in order to correctly approximate $L(\mu; D)$.

Are coresets really better than uniform sampling?

We can verify by computing the reduction in $L$'s variance we get by using $C$, our smart importance-sampled coreset, versus $U$, an uniformly-sampled "coreset".

(see slides)

Want to maximize $q(x_i)$ for the sampled points in order to reduce variance.

Need to reduce variance for all clusterings simulatenously? (TODO: why?)

Sensitivity is the worst case (max) effect a data point can have on any possible clustering (max over $\mu$).

We want to sample the most sensitive points, since they are likely to have the highest impact on the solution. However, sensitivity is hard to compute.

TODO: wait, both q and s are distributions?

But we can show that just sampling "enough" points from $q$ is good enough to construct a coreset.

TODO: info on bicriteria approximation.


  • Repeat until no more points left
    • Sample a small set uniformly at random
    • Remove half of points nearest the samples
  • Partition data into Voronoi diagram centered at selected points
  • Construct $q$ by giving weights to points based on:
    • Points in sparse cells get more mass
    • Points far from cell center get more mass
  • Sample elements from $D$ using $q$

Coresets have multiple applications: k-means, k-line means, PageRank, SVMs, etc.

Coreset composition


The union of two $(k, \epsilon)$-coresets is a $(k, \epsilon)$-coreset.


A $(k, \delta)$ corest of a $(k, \epsilon)$-coreset is a $(k, \epsilon + \delta + \epsilon\delta)$-coreset.

Can merge & compress online. Error grows linearly with number of compressions. Can instead merge coresets as trees (error growth $\propto$ size of tree). Use more memory but lower error.

Can construct k, e coresets in parallel. And use 1 or more reduce stages to merge (and, optionally, compress) them.

In [ ]: