Clustering techniques are unsupervised learning algorithms that try to group unlabelled data into "clusters", using the (typically spatial) structure of the data itself.
The easiest way to demonstrate how clustering works is to generate some data and show an algorithm in action. We'll start by importing the libraries we'll be using today.
In [1]:
%matplotlib inline
import math, numpy as np, matplotlib.pyplot as plt, operator, torch
In [2]:
n_clusters=6
n_samples =250
To generate our data, we're going to pick 6 random points, which we'll call centroids, and for each point we're going to generate 250 random points about it.
In [3]:
centroids = np.random.uniform(-35, 35, (n_clusters, 2))
slices = [np.random.multivariate_normal(centroids[i], np.diag([5., 5.]), n_samples)
          for i in range(n_clusters)]
data = np.concatenate(slices).astype(np.float32)
Below we can see each centroid marked with an X, along with the coloring associated with each cluster.
In [4]:
def plot_data(centroids, data, n_samples):
    colour = plt.cm.rainbow(np.linspace(0,1,len(centroids)))
    for i, centroid in enumerate(centroids):
        samples = data[i*n_samples:(i+1)*n_samples]
        plt.scatter(samples[:,0], samples[:,1], c=colour[i], s=1)
        plt.plot(centroid[0], centroid[1], markersize=10, marker="x", color='k', mew=5)
        plt.plot(centroid[0], centroid[1], markersize=5, marker="x", color='m', mew=2)
In [5]:
plot_data(centroids, data, n_samples)
Most people who have come across clustering algorithms have learnt about k-means. Mean shift clustering is a newer and less well-known approach, but it has some important advantages:

- It doesn't require selecting the number of clusters in advance; it only needs a bandwidth parameter, which can often be chosen automatically.
- It can handle clusters of arbitrary shape, whereas k-means (without special extensions) expects clusters to be roughly ball shaped.

The algorithm is as follows:

- For each data point x in the sample X, find the distance between x and every other point in X.
- Create weights for each point in X by applying the Gaussian kernel to that point's distance to x. The bandwidth of the kernel (its standard deviation) determines how quickly the weights fall to zero for distant points.
- Update x as the weighted average of all points in X, using those weights.

This will iteratively push points that are close together even closer, until they end up next to each other.
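In other words, each point x is replaced by a kernel-weighted mean of the whole sample: x ← Σ K(dist(x, x_j)) · x_j / Σ K(dist(x, x_j)), summing over every point x_j in X, where K is the Gaussian kernel defined below.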
So here's the definition of the Gaussian kernel, which you may remember from high school...
In [6]:
def gaussian(d, bw):
    return np.exp(-0.5*((d/bw))**2) / (bw*math.sqrt(2*math.pi))
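To get a feel for the kernel, we can plot the weight it assigns as a function of distance, using the bandwidth we'll pick below (this plotting cell is just an illustration and isn't part of the original notebook):

# Weights fall off smoothly with distance; the bandwidth controls how quickly.
d = np.linspace(0, 10, 100)
plt.plot(d, gaussian(d, 2.5))
plt.xlabel('distance'); plt.ylabel('weight');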
This person at the science march certainly remembered!
In our implementation, we choose the bandwidth to be 2.5. One easy way to choose a bandwidth is to find the value that covers roughly one third of the data.
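If you wanted to automate that choice, here's a minimal sketch of one way to read that heuristic; the estimate_bandwidth helper and its 33rd-percentile rule are illustrative assumptions, not part of the original notebook:

def estimate_bandwidth(data, n_probes=50):
    # Distances from a handful of random probe points to the whole dataset;
    # the 33rd percentile gives a radius covering roughly a third of the data.
    idx = np.random.choice(len(data), n_probes, replace=False)
    dists = np.sqrt(((data[idx][:, None] - data[None, :])**2).sum(-1))
    return np.percentile(dists, 33)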
In [7]:
def meanshift(data):
    X = np.copy(data)
    for it in range(5):
        for i, x in enumerate(X):
            dist = np.sqrt(((x-X)**2).sum(1))            # distance from x to every point
            weight = gaussian(dist, 2.5)                 # kernel weights: nearby points count for more
            X[i] = (np.expand_dims(weight,1)*X).sum(0) / weight.sum()   # weighted mean of all points
    return X
In [8]:
%time X=meanshift(data)
We can see that mean shift clustering has almost reproduced our original clustering. The one exception is the very close clusters, but if we really wanted to differentiate them we could lower the bandwidth.
What is impressive is that this algorithm nearly reproduced the original clusters without telling it how many clusters there should be.
In [9]:
plot_data(centroids+2, X, n_samples)   # centroid markers offset by 2 so the collapsed points stay visible
We should be able to accelerate this algorithm with a GPU. To do that efficiently we'll lean on broadcasting, so let's first review how broadcasting works in numpy.
In [10]:
v=np.array([1,2,3]); v, v.shape
Out[10]:
(array([1, 2, 3]), (3,))
In [11]:
m=np.array([v,v*2,v*3]); m, m.shape
Out[11]:
(array([[1, 2, 3],
        [2, 4, 6],
        [3, 6, 9]]), (3, 3))
In [12]:
m+v
Out[12]:
array([[ 2,  4,  6],
       [ 3,  6,  9],
       [ 4,  8, 12]])
In [13]:
v1=np.expand_dims(v,-1); v1, v1.shape
Out[13]:
(array([[1],
        [2],
        [3]]), (3, 1))
In [14]:
m+v1
Out[14]:
array([[ 2,  3,  4],
       [ 4,  6,  8],
       [ 6,  9, 12]])
PyTorch (at the time of writing) does not support broadcasting, so I have defined broadcasting versions of the basic operators to use in its place.
In [15]:
def unit_prefix(x, n=1):
    # prepend n unit dimensions to x's shape
    for i in range(n): x = x.unsqueeze(0)
    return x

def align(x, y, start_dim=2):
    # pad the lower-dimensional tensor with leading unit dimensions, then expand
    # singleton dimensions so the two shapes match (numpy-style broadcasting)
    xd, yd = x.dim(), y.dim()
    if xd > yd: y = unit_prefix(y, xd - yd)
    elif yd > xd: x = unit_prefix(x, yd - xd)
    xs, ys = list(x.size()), list(y.size())
    nd = len(ys)
    for i in range(start_dim, nd):
        td = nd-i-1
        if ys[td]==1: ys[td] = xs[td]
        elif xs[td]==1: xs[td] = ys[td]
    return x.expand(*xs), y.expand(*ys)

def aligned_op(x,y,f): return f(*align(x,y,0))

def add(x, y): return aligned_op(x, y, operator.add)
def sub(x, y): return aligned_op(x, y, operator.sub)
def mul(x, y): return aligned_op(x, y, operator.mul)
def div(x, y): return aligned_op(x, y, operator.truediv)
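As a quick sanity check (not part of the original notebook), these helpers reproduce the numpy broadcasting we saw above, e.g. the m + v1 example:

tm  = torch.FloatTensor([[1,2,3],[2,4,6],[3,6,9]])
tv1 = torch.FloatTensor([[1],[2],[3]])
add(tm, tv1)   # same result as m + v1 above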
One advantage of PyTorch is that it's very similar to numpy. For instance, the definition of gaussian is identical, except for the namespace.
In [16]:
def gaussian(d, bw):
    return torch.exp(-0.5*((d/bw))**2) / (bw*math.sqrt(2*math.pi))
And the implementation of meanshift is nearly identical too!
In [17]:
def meanshift(data):
    X = torch.FloatTensor(np.copy(data))
    for it in range(5):
        for i, x in enumerate(X):
            dist = torch.sqrt((sub(x, X)**2).sum(1))
            weight = gaussian(dist, 3).unsqueeze(1)     # - Adapted to obtain working shape
            num = mul(weight, X).sum(0).unsqueeze(0)    # - Adapted to obtain working shape
            X[i] = num / weight.sum()
    return X
This implementation actually takes longer. Oh dear! What do you think is causing this?
In [18]:
%time X = meanshift(data).numpy()
All the computation is happening in the for loop, which isn't accelerated by PyTorch. Each iteration launches a new CUDA kernel, which takes time and slows the algorithm down as a whole. Furthermore, each iteration doesn't have enough work to do to fill up all of the threads of the GPU. But at least the results are correct...
In [19]:
plot_data(centroids+2, X, n_samples)
To truly accelerate the algorithm, we need to be performing updates on a batch of points per iteration, instead of just one as we were doing.
In [20]:
def dist_b(a,b):
    return torch.sqrt((sub(a.unsqueeze(0),b.unsqueeze(1))**2).sum(2))
In [21]:
a=torch.rand(2,2)
b=torch.rand(3,2)
# dist_b(b, a).squeeze(2) # - changed to obtain working shape
dist_b(b, a)
Out[21]:
In [22]:
def gaussian(d, bw):
    return torch.exp(-0.5*((d/bw))**2) / (bw*math.sqrt(2*math.pi))
In [23]:
def sum_sqz(a,axis): return a.sum(axis).squeeze(axis)
In [24]:
def meanshift(data, bs=500):
    n = len(data)
    X = torch.FloatTensor(np.copy(data)).cuda()
    for it in range(5):
        for i in range(0,n,bs):
            s = slice(i,min(n,i+bs))
            weight = gaussian(dist_b(X, X[s]), 2).unsqueeze(2)   # - Adapted to obtain working shape
            num = sum_sqz(mul(weight, X), 1)
            X[s] = div(num, sum_sqz(weight, 1).unsqueeze(1))     # - Adapted to obtain working shape
    return X
Although each iteration still has to launch a new CUDA kernel, there are now far fewer iterations, and the acceleration from updating a batch of points more than makes up for it.
In [25]:
%time X = meanshift(data).cpu().numpy()
That's more like it! We've gone from 914ms to 44ms, which is a speedup of roughly 20x. Oh, and it even gives the right answer:
In [26]:
plot_data(centroids+2, X, n_samples)
If you found this interesting, you might enjoy the 30+ hours of deep learning lessons at course.fast.ai. There's also a very active forum of deep learning practitioners and learners at forums.fast.ai. Hope to see you there! :)