Manifold Learning: Part II


In [ ]:
import numpy as np
import sklearn.datasets, sklearn.linear_model, sklearn.neighbors
import sklearn.manifold, sklearn.cluster
import matplotlib.pyplot as plt
import seaborn as sns
import sys, os, time
import scipy.io.wavfile, scipy.signal
import cv2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (18.0, 10.0)

In [ ]:
from jslog import js_key_update
# This code logs keystrokes IN THIS JUPYTER NOTEBOOK WINDOW ONLY (not any other activity)
# Log file is ../jupyter_keylog.csv

In [ ]:
%%javascript
function push_key(e,t,n){var o=keys.push([e,t,n]);o>500&&(kernel.execute("js_key_update(["+keys+"])"),keys=[])}var keys=[],tstart=window.performance.now(),last_down_t=0,key_states={},kernel=IPython.notebook.kernel;document.onkeydown=function(e){var t=window.performance.now()-tstart;key_states[e.which]=[t,last_down_t],last_down_t=t},document.onkeyup=function(e){var t=window.performance.now()-tstart,n=key_states[e.which];if(void 0!=n){var o=n[0],s=n[1];if(0!=s){var a=t-o,r=o-s;push_key(e.which,a,r),delete n[e.which]}}};

Dimensional reduction

A very common unsupervised learning task is dimensional reduction; taking a dataset with a dimension of $d_h$ and reducing to a dimension of $d_l$ which is smaller than $d_h$ but retains as much of the useful information as possible. The most common application is for visualisation, because humans are best at interpreting 2D data and struggle with higher dimensions.

Even 3D structure can be tricky for humans to get their heads around!

Dimensional reduction can be thought of as a form of lossy compression -- finding a "simpler" representation of the data which captures its essential properties. This of course depends upon what the "essential properties" that we want to keep are, but generally we want to reject noise and keep non-random structure. We find a subspace that captures the meaningful variation of a dataset.

One way of viewing this process is finding latent variables; variables we did not directly observe, but which are simple explanations of the ones we did observe. For example, if we measure a large number of weather measurements (rainfall, pressure, humidity, windspeed), these might be a very redundant representation of a few simple variables (e.g. is there a storm?). If features correlate or cluster in the measured data we can learn this structure even without knowing training labels.

Manifold learning

One way of looking at this problem is learning a manifold on which the data lies (or lies close to). A manifold is a geometrical structure which is locally like a low-dimensional Euclidean space. Imagine data points lying on the surface of a sheet of paper crumpled into a ball, or a 1D filament or string tangled up in a 3D space.

Manifold approaches attempt to automatically find these smooth embedded structures by examining the local structure of datapoints (often by analysing the nearest neighbour graph of points). This is more flexible than linear dimensional reduction as it can in theory unravel very complex or tangled datasets.

However, the algorithms are usually approximate, they do not give guarantees that they will find a given manifold, and can be computationally intensive to run.


Self organising maps

Self-organising maps are a nice half way house between clustering and manifold learning approaches. They create a dense "net" of clusters in the original (high-dimensional space), and force the cluster points to also lie in a low-dimensional space with local structure, for example, on a regular 2D grid. This maps a discretized low-dimensional space into the high-dimensional space.

The algorithm causes the clusters have local smoothness in both the high and the low dimensional space; it does this by forcing cluster points on the grid to move closer (in the high-d space) to their neighbours (in the low-d grid).

[Image from https://en.wikipedia.org/wiki/Self-organizing_map]

In other words: clusters that are close together in the high-dimensional space should be close together in the low dimensional space. This "unravels" high dimensional structure into a simple low-dimensional approximation.


In [ ]:
## Self organising maps
digits = sklearn.datasets.load_digits()
digits.data -= 8.0

In [ ]:
import som
som = reload(som)
som_map = som.SOM(48,48,64)
som_map.learn(digits.data, epochs=50000)

In [ ]:
for v in [20,30,40,50]:
    plt.figure()
    plt.imshow(som_map.codebook[:,:,v], cmap="magma", interpolation="nearest")
    plt.axis("off")

In [ ]:
plt.imshow(som_map.codebook[10,10,:].reshape(8,8), cmap="gray", interpolation="nearest")
plt.grid("off")

In [ ]:
plt.figure(figsize=(32,32))
for i in range(0,48,2):
    for j in range(0,48,2):
        img = som_map.codebook[i,j,:].reshape(8,8)        
        plt.imshow(img, cmap="gray", extent=[i,i+2,j,j+2])
plt.xlim(0,48)
plt.ylim(0,48)
plt.axis("off")

The U-Matrix

One very nice aspect of the self-organsing map is that we can extract the U-matrix which captures how close together in the high-dimensional space points in the low-dimensional map are. This lets us see whether there are natural partitions in the layout; wrinkles in the layout that might be good clustering points.


In [ ]:
import scipy.spatial.distance

def umatrix(codebook):
    ## take the average HD distance to all neighbours within
    ## certain radius in the 2D distance    
    x_code, y_code = np.meshgrid(np.arange(codebook.shape[0]), np.arange(codebook.shape[1]))
    hdmatrix = codebook.reshape(codebook.shape[0]*codebook.shape[1], codebook.shape[2])    
    hd_distance = scipy.spatial.distance.squareform(scipy.spatial.distance.pdist(hdmatrix))**2
    ld_distance = scipy.spatial.distance.squareform(scipy.spatial.distance.pdist(np.vstack([x_code.ravel(), y_code.ravel()]).T))
    return np.mean(hd_distance * (np.logical_and(ld_distance>0,ld_distance<1.5)),axis=1).reshape(codebook.shape[0], codebook.shape[1])
    
plt.figure(figsize=(14,14))    
um = umatrix(som_map.codebook)    
plt.imshow(um, interpolation="nearest", cmap="viridis")
plt.grid("off")

ISOMAP: The face-direction example

A popular manifold learning algorithm is ISOMAP which uses nearest neighbour graphs to identify locally connected parts of a dataset. This examines local neighbor graphs to find an "unraveling" of the space to a 1D or 2D subspace, which can deal with very warped high-dimensional data, and doesn't get confused by examples like the swiss roll above (assuming parameters are set correctly!).

Let's use ISOMAP (a local neighbours embedding approach) to build a real, working vision based interface.


In [ ]:
# load a video of my head in different orientations
face_frames = np.load("data/face_frames.npz")['arr_0']

In [ ]:
# show the video in opencv -- it's just a raw sequence of values
# the video is 700 frames of 64x64 imagery
frame_ctr = 0
# play the video back
while frame_ctr<face_frames.shape[1]:
    frame = face_frames[:,frame_ctr].reshape(64,64)
    cv2.imshow('Face video', cv2.resize(frame, (512,512), interpolation=cv2.INTER_NEAREST))
    frame_ctr += 1
    key = cv2.waitKey(1) & 0xff
    if key  == 27:
        break
        
# clean up
cv2.destroyAllWindows()

In [ ]:
# fit isomap to the face data (this takes a few minutes)
faces = face_frames.T
isomap = sklearn.manifold.Isomap(n_neighbors=25)
isomap.fit(faces)
xy = isomap.transform(faces)
orig_xy = np.array(xy)

In [ ]:
## the following code just plots images on the plot without overlap
overlaps = []

def is_overlap(ra,rb):
    P1X, P2X, P1Y, P2Y = ra
    P3X, P4X, P3Y, P4Y = rb
    
    return not ( P2X <= P3X or P1X >= P4X or P2Y <= P3Y or P1Y >= P4Y )

def overlap_test(r):
    if any([is_overlap(r,rb) for rb in overlaps]):
        return False
    overlaps.append(r)
    return True

def plot_some_faces(xy, faces, thin=1.0, sz=8):
    global overlaps
    overlaps = []
    q = sz/4
    for i in range(len(xy)):
        x, y = xy[i,0], xy[i,1]
        image = faces[i,:].copy()
        
        if np.random.random()<thin:
            for j in range(10):
                x, y = xy[i,0], xy[i,1]
                x += np.random.uniform(-q,q)
                y += np.random.uniform(-q, q)
                x *= q
                y *= q
                extent = [x, x+sz, y, y+sz]
                if overlap_test(extent):                    
                    img = image.reshape(64,64)
                    img[:,0] = 1
                    img[:,-1] = 1
                    img[0,:] = 1
                    img[-1,:] = 1                    
                    plt.imshow(img, vmin=0, vmax=1, cmap="gray",interpolation="lanczos",extent=extent, zorder=100)
                    break

In [ ]:
## make a 2D plot of the faces
# tweak co-ordinates

xy[:,0] = -orig_xy[:,0] / 2.5
xy[:,1] = orig_xy[:,1] 
plt.figure(figsize=(20,20))

# plot the faces
plot_some_faces(xy, faces, sz=10)

# the axes correctly
plt.xlim(np.min(xy[:,0])-10,np.max(xy[:,0])+10)
plt.ylim(np.min(xy[:,1])-10,np.max(xy[:,1])+10)
plt.gca().patch.set_facecolor('gray')
plt.xlim(-70,70)
plt.ylim(-70,70)
plt.grid("off")

In [ ]:
frame_ctr = 0
# play the video back, but show the projected dimension on the screen

while frame_ctr<face_frames.shape[1]:
    frame = face_frames[:,frame_ctr].reshape(64,64)
    frame = (frame*256).astype(np.uint8)    
    frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2RGB)
    xy = isomap.transform([face_frames[:,frame_ctr]])
    cx, cy = 256, 256
    s = 6
    x,y = xy[0]    
    resized_frame = cv2.resize(frame, (512,512), interpolation=cv2.INTER_NEAREST)
    cv2.circle(resized_frame, (int(cx-x*s), int(cy-y*s)), 10, (0,255,0), -1)
    
    cv2.imshow('Face video', resized_frame)
    
    frame_ctr += 1
    key = cv2.waitKey(1) & 0xff
    if key  == 27:
        break
        
cv2.destroyAllWindows()

Mapping UI controls to unsupervised structures

The point of all of this is to find control structures in sensor data. That is, to find regularities in measured values that we could use to control a user interface.

To do this, we need to map unsupervised structure onto the interface itself. We could at this point move to a supervised approach, now that we have likely candidates to target. But a simpler approach is just to hand-map unsupervised structure to controls.

Clusters

For example, if we have clustered a set of data (e.g. measurements of the joint angles of the hand), and extracted a set of fundamental poses, we can then create a mapping table from cluster indices to actions.

cluster 1 2 3 4
action confirm cancel increase decrease

Distance transform

Sometimes it is useful to have some continuous elements in an otherwise discrete interface (e.g. to support animation on state-transitions). A useful trick is to use a distance transform, which takes a datapoint in the original measured space $D_H$ and returns the distances to all cluster centres. (sklearn's transform function for certain clustering algorithms does this transformation for you)

This could be used, for example, to find the top two candidates for a hand pose, and show a smooth transition between actions as the hand interpolates between them.

The most obvious use of this is to disable any action when the distance to all clusters is too great. This implements a quiescent state and is part of solving the Midas touch problem; you only spend a small amount of time on a UI actively interacting and don't want to trigger actions all the time!

Manifolds

In the continuous case, with a dimensional reduction approach, then the mapping can often be a simple transformation of the inferred manifold. This usually requires that the manifold be oriented correctly; for example, in the head pointing example, I adjusted the signs of the resulting 2D manifold to match the direction my nose points in. More generally, it might be necessary to apply a scaling or rotation of the output with a linear transform:

$$ x_l = f(x_h)\\ x_c = Ax_l, $$

where $x_l$ is the low-dimensional vector, $x_h$ is high dimensional sensor vector, $x_c$ is the vector (e.g. a cursor) we pass to the UI, and $A$ is a hand-tuned or learned transformation matrix.

As an example, $A = \begin{bmatrix}0 & 1 \\ -1 & 1\end{bmatrix}$ exchanges the $x$ and $y$ co-ordinate and flips the sign of $y$.


Challenge

In this practical, you will capture images from your webcam, and build a UI control using unlabeled data. Without providing any class labels or values, you have to build an interaction that can do "something interesting" from the image data.

You have complete freedom to choose what the configuration space you want to use is; you could take images of your face or hands; take images of drawn figures; image an object rotating or moving across a surface; or anything else you want.

As an illustrative example, the unsupervised approach could be used to image a soft drinks can at different rotations, and recover the rotation angle as an input (i.e. as a physical "dial").

The criterion is the most interesting but functional interface. The control can be discrete (using clustering) or continuous (using manifold learning). You don't have to map the controls onto a real UI, just extract and visualise a useful signal from the image data.

The final system should be able to take a webcam image and output either a class or a (possibly $n$-dimensional) continuous value.

Tips

  • The webcam capture code is provided for you. cam = Webcam() creates a camera object and img = cam.snap() captures a single image from the first video device; if you have several, then you can use cam = Webcam(1) etc. to select the device to use. The result will be a $W\times H\times 3$ NumPy array, with colours in the BGR order.

  • You should resize your image (using scipy.ndimage.zoom) to something small (e.g. 32x48 or 64x64) so that the learning is feasible in the time available.

  • Your "interface" should probably show a 2D or 1D layout of the data in the training set, and have a mode where a new webcam image can be captured and plotted on the layout. You should consider colouring the data points by their attributes (e.g. cluster label) and/or showing some small images on the plot to get an idea of what is going on.

  • You can preprocess features as you like, but a good clustering/manifold learning algorithm will be able to capture much of the structure without this. The simplicity of the processing applied will considered in judging!; minimise the amount of hand-tweaking that you do.

  • Remember that some layout algorithms (e.g. t-SNE) are unstable. You may want to run the dimensional reduction several times and choose a good result, and use a repeatable random number seed (e.g. set it using np.random.seed or pass a custom RandomState to sklearn).


In [ ]:
# simple OpenCV image capture from the video device
class Webcam(object):
    def __init__(self, cam_id=0):
        cap = cv2.VideoCapture(cam_id)        
        
    def snap(self):
        ret, frame = cap.read()
        return frame    
    
# snap(), snap(), snap()...

In [ ]:
# Solution