In [5]:
%matplotlib inline
Clustering (and outlier detection) and dimension reduction. Focus on former in this unit.
Points are represented in high-dim Euclidean spaces or any metric spaces.
Note: it seems certain authors simply gloss over whether a distance function satisfies the identity of indiscernibles, and use "metric" and "distance function" interchangeably (e.g. in the ML slides).
Build a tree representing distances among the data points.
Each point starts as its own cluster, and then clusters keep getting combined based on their "closeness", until a certain threshold is reached (e.g. a desired cluster count or compactness). Alternatively, one can keep going until only a single "cluster" remains, and return the resulting hierarchy of clusters.
Other merging criteria:
Other stopping criteria:
Picking a clustroid in non-Euclidean spaces: a point which minimizes, e.g., the sum of the distances, the maximum distance, or the sum of the squared distances to the other points in the cluster.
Overall, not that many things change in non-Euclidean spaces.
Examples: single-, average-linkage agglomerative clustering.
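A minimal sketch of agglomerative clustering using SciPy (assumed available): build the full merge hierarchy, then cut it at a desired cluster count; method='single' or 'average' selects the linkage (merging) criterion.
In [ ]:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.rand(50, 2)                       # 50 toy points in 2D
tree = linkage(points, method='single')              # full merge hierarchy (dendrogram)
labels = fcluster(tree, t=3, criterion='maxclust')   # stop once 3 clusters remain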
The GRGPF algorithm handles non-main-memory data and does not require a Euclidean space. It takes ideas from both hierarchical and point-assignments (model-based) techniques: it represents clusters by sample points in main memory, but also tries to organize them hierarchically.
GRGPF: the donut technique: in a cluster, keep track of the clustroid, the k-closest, and k-furthest elements.
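A simplified sketch of such a per-cluster summary (the actual GRGPF representation also stores row sums of squared distances for the clustroid and for each of these points; dist and k are placeholders, and the cluster is assumed to contain well over 2k points):
In [ ]:
def cluster_summary(points, dist, k):
    """Clustroid plus the k closest and k furthest points (the "donut")."""
    def rowsum(p):
        return sum(dist(p, q) ** 2 for q in points)
    clustroid = min(points, key=rowsum)
    by_dist = sorted(points, key=lambda p: dist(p, clustroid))
    return {
        'clustroid': clustroid,
        'closest': by_dist[1:k + 1],   # skip the clustroid itself
        'furthest': by_dist[-k:],
    }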
Some algorithms only work in Euclidean spaces, whereas others work with any distance measure.
Key difference: Euclidean spaces allow expressing clusters as centroids (average of points). This is not possible in non-Euclidean spaces; instead, we use clustroids, which are actual points from the dataset that we consider representative (see the sketch below).
Whether the algorithm assumes the data fits in main or just secondary memory. If the latter is true, we need to take certain shortcuts and summarize our clusters in memory.
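A small sketch of the centroid vs. clustroid distinction; the clustroid here is chosen as the point minimizing the sum of squared distances to the other cluster members, and dist can be any distance function:
In [ ]:
import numpy as np

def centroid(points):
    """Euclidean representative: the coordinate-wise mean (not necessarily a data point)."""
    return np.mean(points, axis=0)

def clustroid(points, dist):
    """Non-Euclidean representative: an actual data point minimizing the sum of squared distances."""
    return min(points, key=lambda p: sum(dist(p, q) ** 2 for q in points))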
In high-dimensional spaces (especially but not necessarily Euclidean spaces) almost all pairs of points are equally far away from each other, and almost any two vectors are nearly orthogonal (the cosine of their angle is almost always close to zero, since the denominator of the cosine grows linearly in $d$, while the numerator only grows on the order of $\sqrt{d}$).
This is problematic for clustering, as it becomes very hard to build clusters when there are essentially no pairs of points which are close.
Consider a sphere of radius $r = 1$ in $D$-dimensional space. $V_{D}(r) = K_{D}r^{D}$, with $K_D$ being some constant (e.g. $K_2 = \pi$ in $\mathbb{R}^{2}$ and $K_3 = \frac{4}{3}\pi$ in $\mathbb{R}^{3}$).
What is the volume of the sphere that lies between radius $r = 1 - \epsilon$ and $r = 1$? (i.e. its outer shell).
\begin{equation} \begin{aligned} \frac{V_D(1) - V_D(1 - \epsilon)}{V_D(1)} & = \frac{K_D - K_D(1 - \epsilon)^{D}}{K_D} \\ & = 1 - (1 - \epsilon)^{D} \xrightarrow {D\to\infty} 1 \end{aligned} \end{equation}"Thus, in spaces of high dimensionality, most of the volume of a sphere is concentrated in a thin shell near the surface!"
-- Bishop, PRML
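Quick numeric check of the shell fraction $1 - (1 - \epsilon)^{D}$ derived above, for $\epsilon = 0.01$:
In [ ]:
eps = 0.01
for D in [1, 2, 10, 100, 1000]:
    # Fraction of the unit ball's volume lying within eps of the surface.
    print("D = %4d: %.4f" % (D, 1 - (1 - eps) ** D))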
The following example is based on the second data mining homework, and implements a simple form of cosine hashing, where a 1-bit hash function consists of the sign of the dot product between the to-be-hashed vector and a random vector which characterizes the hash function.
It can be seen that as the number of dimensions increases, the utility of the hash functions drops, as most vector pairs become orthogonal (normalized "angle" 0.5).
In [14]:
import random
import math

import numpy as np
import matplotlib.pyplot as plt

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    nrm = norm(v)
    return [x / nrm for x in v]

def dot_product(v1, v2):
    return sum(a * b for a, b in zip(v1, v2))

def rand_vector_uniform(dimension):
    """Random unit vector with components initially uniform in [-1, 1]."""
    v = [random.uniform(-1, 1) for _ in range(dimension)]
    return normalize(v)

def angle(v1, v2):
    assert len(v1) == len(v2)
    cos_theta = dot_product(v1, v2) / (norm(v1) * norm(v2))
    # Clamp to acos's domain to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def random_hash_function(dimension):
    """1-bit hash: the sign of the dot product with a fixed random unit vector."""
    h = rand_vector_uniform(dimension)
    def hash_function(vector):
        return math.copysign(1.0, np.inner(vector, h))
    return hash_function

def hash_agree(v1, v2, hashes):
    """Number of hash functions on which v1 and v2 collide."""
    return len([h for h in hashes if h(v1) == h(v2)])

def p_same_hash(N, M, d):
    """N vector pairs, M hash functions, d dimensions."""
    vecgen = rand_vector_uniform
    vector_pairs = [(vecgen(d), vecgen(d)) for _ in range(N)]
    hashes = [random_hash_function(d) for _ in range(M)]

    angle_coefs = [1 - angle(v1, v2) / math.pi for (v1, v2) in vector_pairs]
    probabilities = [hash_agree(v1, v2, hashes) * 1.0 / M for (v1, v2) in vector_pairs]

    x = angle_coefs
    y = probabilities

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.scatter(x, y)
    ax.set_xlabel("Normalized angle (1 - angle(u, v) / pi)")
    ax.set_ylabel("Probability of u, v hashing to the same bucket")
    ax.set_xlim([0, 1])
    ax.set_ylim([0, 1])
    fig.text(0.5, 0.85, "Dimensions: %d, N = %d, M = %d" % (d, N, M), ha='center')
    # fig.savefig("plot-%d-%d-%d.png" % (N, M, d))
    plt.show()
In [15]:
p_same_hash(200, 100, 2)
In [16]:
p_same_hash(200, 100, 10)
In [22]:
p_same_hash(200, 100, 100)
The k-means objective is NP-hard to optimize exactly, so we use Lloyd's heuristic, commonly (though not entirely accurately) referred to as the k-means algorithm. Even this heuristic can take exponential time in $\mathbb{R}^{2}$ on specially crafted inputs!
The feasible space for k-means is always $\mathbb{R}^{d \times k}$ (so no projection step is needed).
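A minimal numpy sketch of Lloyd's heuristic (random initialization and a fixed iteration count, rather than a proper convergence check):
In [ ]:
import numpy as np

def lloyd(X, k, iterations=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    mu = X[rng.choice(len(X), k, replace=False)]   # random initial centroids
    for _ in range(iterations):
        # Assignment step: distance of every point to every centroid, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its assigned points.
        for j in range(k):
            if np.any(assign == j):
                mu[j] = X[assign == j].mean(axis=0)
    return mu, assign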
Can pick an optimal $k$ using a binary-search technique. We search for the largest value of $k$ for which a decrease below k clusters results in a radically higher average diameter of the clusters.
Want to pick points that have a good chance of lying in different clusters. Two approaches:
Pick points that are far away from each other (see the sketch below).
Cluster a subsample of the data (e.g. using hierarchical clustering) until we get k clusters, and use their centroids as initial k-means centroids.
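A sketch of the first approach (a farthest-first traversal); the function name and the choice of a random starting point are just for illustration:
In [ ]:
import numpy as np

def farthest_first_init(X, k, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    centers = [X[rng.randint(len(X))]]             # start from a random point
    while len(centers) < k:
        # Distance of every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])            # add the point farthest from all chosen centers
    return np.array(centers)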
TODO: k-means++
Variant of k-means designed for high-dim Euclidean spaces.
Makes strong assumptions about the clusters (they must be normally distributed around a centroid), but is capable of streaming through the data (though not in a parallel or distributed fashion).
Keep track of a "lite" representation of the data in memory: e.g. per-cluster point counts, coordinate sums, and sums of squared coordinates, from which centroids and variances can be recovered.
Sliding window approach (of size $N$). Ask for centroids for last $m \le N$ points.
Idea: precluster subsets of points so that we can quickly answer queries.
Online SVM (& SGD) had the property that their loss function decomposed additively over the data set; the total loss is just a sum of individual hinge loss function values, based on the model $w$.
\begin{equation} L(w) = \sum_{t=1}^{N} \operatorname{hinge}(x_t; y_t, w) \end{equation}This meant that we could take a sub-gradient for every data point and iteratively update our model:
\begin{equation} w^{(t+1)} = \operatorname{Proj}_{S}\left(w^{(t)} - \eta_t \nabla f_t(w_t)\right) \end{equation}For k-means, our loss function is also just a sum of "individual data point penalties". I.e., it also decomposes additively over the data set.
Note: the order of $x$ and $\mu$ doesn't matter, since the one we're differentiating with respect to will always come out first (with a positive sign) in the derivative formula (calculus 101, but I got confused by it for some reason).
\begin{equation} L(\mu) = \sum_{t=1}^{N} \min_j \| x_t - \mu_j \|_2^2 = \sum_{t=1}^{N} l_t(\mu) \end{equation}We differentiate with respect to $\mu$ since it is the thing we want to optimize (change). We obviously can't compute any gradient with respect to $x$, since we can't "change" the data.
\begin{equation} \frac{\partial}{\partial \mu_j} l_t(\mu) = 2(\mu_j - x_t) \end{equation}Question: dm-10:11, does the order of the arguments inside $\| \cdot \|$ matter? Should it? Nope. The first, positive term is always going to be the one we're differentiating with respect to.
Our loss function is not differentiable everywhere, since the min introduces points of non-differentiability wherever the closest centroid changes (see the full definition in the slides), but we can still use a subgradient to implement online k-means.
Only requirement to converge (to local optimum): $\sum_t \eta_t = \infty$, $\sum_t\eta_t^2 < \infty$. $\eta_t = \frac{1}{t}$ is a good option.
Remember: $\sum_{n=1}^{\infty} \frac{1}{n} = \infty$ BUT $\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$.
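A minimal sketch of online k-means along these lines, using a per-centroid step size $\eta_t = \frac{1}{t}$ (the factor of 2 from the gradient is absorbed into $\eta$); seeding the centroids with the first $k$ points of the stream is just a placeholder choice:
In [ ]:
import numpy as np

def online_kmeans(stream, k, d):
    mu = np.zeros((k, d))
    counts = np.zeros(k, dtype=int)
    initialized = 0
    for x in stream:
        x = np.asarray(x, dtype=float)
        if initialized < k:                        # seed centroids with the first k points
            mu[initialized] = x
            initialized += 1
            continue
        j = np.argmin(np.linalg.norm(mu - x, axis=1))
        counts[j] += 1
        eta = 1.0 / counts[j]
        mu[j] += eta * (x - mu[j])                 # subgradient step on l_t(mu)
    return mu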
However, this algorithm still doesn't lend itself to parallelization very nicely. Moreover, the objective function is non-convex (unlike the SVM objective we ran SGD on), so we can get stuck in local optima.
Alternative to online k-means: summarize large data sets, and then run k-means on the summary.
Key idea: represent many points by one weighted representative. Our k-means objective now becomes ($C$ is the summarized version of our data $D$):
\begin{equation} L_k(\mu; C) = \sum_{(w, x) \in C} w \min_{j} \| \mu_j - x \|_2^2 \end{equation}$C$ is called a $(k, \epsilon)$-coreset for data set $D$ if ($k$ clusters, $\epsilon$ error threshold vs. the clustering for the whole dataset):
\begin{equation} (1 - \epsilon)L_k(\mu; D) \le L_k(\mu; C) \le (1 + \epsilon)L_k(\mu; D) \end{equation}Uniform sampling fails for imbalanced data, which is very common.
Use importance sampling instead (do not confuse with uncertainty sampling from active learning).
Key idea: interpret the k-means cost function as an expectation, i.e. a sum of products between a function value (a point's distance to the closest cluster center) and a probability (a uniform, boring $\frac{1}{N}$ if we take all our points into consideration, which is the default in the original k-means problem statement). This is the insight: we can sample far fewer points than $N$ according to some other (smarter) distribution and they'll still approximate our $L$ well.
So what distribution to use?
We want to be biased towards small clusters, so that we don't miss their points the way we likely would with uniform sampling. To compensate for this bias, we also give the frequently sampled points a small weight ($\gamma \propto \frac{1}{q}$).
This also corresponds to the mathematical consequence of treating our loss function as an expectation and swapping the uniform sampling for something else. We had:
\begin{equation} L(\mu; D) \approx \frac{1}{m} \sum_{l=1}^{m} \frac{d(x_{i_l}, \mu)}{q(i_l)} \end{equation}When sampling according to distribution $q$, we have to give every term $i$ a weight of $\frac{1}{q(i)}$ in order to correctly approximate $L(\mu; D)$.
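A sketch of this estimator; q is any proposal distribution over the $N$ data points (given here explicitly as an array summing to one). A uniform q just reproduces plain subsampling -- real coreset constructions use a smarter, sensitivity-based q:
In [ ]:
import numpy as np

def point_cost(x, mu):
    """d(x, mu): squared distance from x to its closest centroid."""
    return np.min(np.linalg.norm(mu - x, axis=1) ** 2)

def importance_sampled_cost(X, mu, q, m, seed=0):
    """Approximates L(mu; D) by sampling m indices i_1, ..., i_m ~ q and reweighting by 1/q(i)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(X), size=m, p=q)
    return np.mean([point_cost(X[i], mu) / q[i] for i in idx])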
We can verify this by computing the reduction in $L$'s variance we get by using $C$, our smart importance-sampled coreset, versus $U$, a uniformly-sampled "coreset".
(see slides)
Want to maximize $q(x_i)$ for the sampled points in order to reduce variance.
Need to reduce variance for all clusterings simultaneously? (TODO: why?)
Sensitivity is the worst case (max) effect a data point can have on any possible clustering (max over $\mu$).
We want to sample the most sensitive points, since they are likely to have the highest impact on the solution. However, sensitivity is hard to compute.
TODO: wait, both q and s are distributions?
But we can show that just sampling "enough" points from $q$ is good enough to construct a coreset.
TODO: info on bicriteria approximation.
Algorithm:
Coresets have multiple applications: k-means, k-line means, PageRank, SVMs, etc.
The union of two $(k, \epsilon)$-coresets is a $(k, \epsilon)$-coreset.
A $(k, \delta)$-coreset of a $(k, \epsilon)$-coreset is a $(k, \epsilon + \delta + \epsilon\delta)$-coreset.
Can merge & compress online. Error grows linearly with the number of compressions. Can instead merge coresets in a tree structure (error grows with the height of the tree). Uses more memory, but gives lower error.
Can construct $(k, \epsilon)$-coresets in parallel, then use one or more reduce stages to merge (and, optionally, compress) them.
In [ ]: