Manifold Learning: Part I



In [ ]:

    
import numpy as np
import sklearn.datasets, sklearn.linear_model, sklearn.neighbors
import sklearn.manifold, sklearn.cluster, sklearn.decomposition
import matplotlib.pyplot as plt
import seaborn as sns
import sys, os, time
import scipy.io.wavfile, scipy.signal
import cv2
%matplotlib inline
import matplotlib as mpl
mpl.rcParams['figure.figsize'] = (18.0, 10.0)



In [ ]:

    
from jslog import js_key_update
# This code logs keystrokes IN THIS JUPYTER NOTEBOOK WINDOW ONLY (not any other activity)
# Log file is ../jupyter_keylog.csv



In [ ]:

    
%%javascript
function push_key(e,t,n){var o=keys.push([e,t,n]);o>500&&(kernel.execute("js_key_update(["+keys+"])"),keys=[])}var keys=[],tstart=window.performance.now(),last_down_t=0,key_states={},kernel=IPython.notebook.kernel;document.onkeydown=function(e){var t=window.performance.now()-tstart;key_states[e.which]=[t,last_down_t],last_down_t=t},document.onkeyup=function(e){var t=window.performance.now()-tstart,n=key_states[e.which];if(void 0!=n){var o=n[0],s=n[1];if(0!=s){var a=t-o,r=o-s;push_key(e.which,a,r),delete n[e.which]}}};
IPython.OutputArea.auto_scroll_threshold = 9999;

Topic Purpose

In this part, we will explore how unsupervised learning can pull out structure from sensors. We can use this "natural", latent structure to build interfaces without having to predefine our controls.

Outline

In the next two hours, we will:

[Part I]

Think about objective methods for designing UI controls
Discuss unsupervised learning
Explore clustering algorithms

Practical: build a clustering algorithm to cluster images of day and night driving imagery

[Part II]

Challenge: build the **coolest** UI control based on unsupervised learning of webcam imagery

Motivation

For many conventional UI sensors, we already have good mappings from sensor measurements to interface actions. This is largely because the sensors were designed specifically to have electrical outputs which are very close to the intended actions; a traditional mechanical mouse literally emits electrical pulses at a rate proportional to translation.

But with optical sensors like a Kinect, or with a high-degree-of-freedom flexible sensor, or tricky sensors like electromyography (which measures the electrical signals present as muscles contract), these mappings become tricky. Supervised learning lets us train a system to recognise patterns in these signals (e.g. to classify poses or gestures). But what if you don't know what's even feasible or would make a good interface?

Inherent structures

If we take sensor measurements of a person doing "random stuff" (see Rewarding the Original, CHI 2012, for ideas on how to make "random stuff" a formal process!), we will end up with a set of feature vectors that were both performable and measurable (because we know someone did them and a sensor measured them). You will get a chance to apply the ‘Rewarding the Original’ approach tomorrow in the practicals associated with Rod’s lecture, so if you have time this evening, feel free to read the paper or watch the video in advance.

One way to look at this data is to recover inherent structure in these measurements -- are there regularities or stable points which represent things which might be good controls?

Unsupervised learning

Unsupervised learning learns "interesting things" about the structure of data without any explicit labeling of points. The key idea is that datasets may have a simple underlying or latent representation which can be determined simply by looking at the data itself.

Two common unsupervised learning tasks are clustering and dimensionality reduction. Clustering can be thought of as the unsupervised analogue of classification -- finding discrete classes in data. Dimensionality reduction can be thought of as the analogue of regression -- finding a small set of continuous variables which "explain" a higher dimensional set.
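
As a rough illustration of the difference between the two tasks, the sketch below runs both on the same synthetic data. The data set (made with make_blobs) and the names X, cluster_labels and coords_2d are assumptions for illustration only and are not part of the original notebook. Clustering returns one discrete label per point; the reduction returns a few continuous coordinates per point.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# synthetic 5-dimensional data drawn from 3 groups; the true labels are thrown away
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# clustering: one discrete label per data point
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# dimensionality reduction: two continuous coordinates per data point
coords_2d = PCA(n_components=2).fit_transform(X)

print(cluster_labels.shape)  # one integer label for each of the 300 points
print(coords_2d.shape)       # two real-valued coordinates for each point
```

PCA is used here only because it is the simplest reduction available in sklearn; any other method would show the same contrast.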

Clustering

Clustering tries to find well-separated (in some sense) partitions of a data set. It is essentially a search for natural boundaries in the data.

There are many, many clustering approaches. A simple one is k-means, which finds clusters via an iterative algorithm. The number of clusters must be chosen in advance; in general it is hard to estimate, although there are algorithms for doing so. k-means proceeds by choosing a set of $k$ random points as initial cluster seed points, classifying each data point according to its nearest seed point, and then moving each seed point to the mean position of all the data points that belong to it. The last two steps are repeated until the assignments stop changing.

The k-means algorithm is not guaranteed to find the best possible clustering -- it can get stuck in local minima. But it often works very well.
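
The steps described above can be sketched in plain NumPy. This is a minimal illustration, not the sklearn implementation used in the cells below; the function name kmeans_sketch, its parameters and the two synthetic blobs are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Minimal k-means returning (centres, labels). Illustrative only."""
    rng = np.random.default_rng(seed)
    # 1. choose k distinct data points as the initial seed points
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # 2. assign every point to its nearest centre (squared Euclidean distance)
        dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. move each centre to the mean of the points assigned to it
        new_centres = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centre
                new_centres[j] = members.mean(axis=0)
        # stop once the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels

# two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

centres, labels = kmeans_sketch(X, k=2)
print(centres)
```

Because the result depends on the random seed points, a different seed can give a different clustering on harder data; this is the local-minimum behaviour mentioned above.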



In [ ]:

    
digits = sklearn.datasets.load_digits()
digit_data = digits.data


selection = np.random.randint(0,200,(10,))

digit_seq = [digit_data[s].reshape(8,8) for s in selection]
plt.imshow(np.hstack(digit_seq), cmap="gray", interpolation="nearest")
for i, d in enumerate(selection):    
    plt.text(4+8*i,10,"%s"%digits.target[d])
plt.axis("off")
plt.title("Some random digits from the sklearn 8x8 digits set")
plt.figure()



In [ ]:

    
# apply principal component analysis
pca = sklearn.decomposition.PCA(n_components=2).fit(digit_data)
digits_2d = pca.transform(digit_data)

# plot each digit with a different color (these are the true labels)
plt.scatter(digits_2d[:,0], digits_2d[:,1], c=digits.target, cmap='jet', s=60)
plt.title("A 2D plot of the digits, colored by true label")
# show a few random draws from the examples, and their labels
plt.figure()



In [ ]:

    
## now cluster the data
kmeans = sklearn.cluster.KMeans(n_clusters=10)
kmeans_target = kmeans.fit_predict(digits.data)
plt.scatter(digits_2d[:,0], digits_2d[:,1], c=kmeans_target, cmap='jet', s=60)
plt.title("Points colored by cluster inferred")

# plot some items in the same cluster
# (which should be the same digit or similar!)
def plot_same_target(target):
    plt.figure()
    selection = np.where(kmeans_target==target)[0][0:20]
    digit_seq = [digit_data[s].reshape(8,8) for s in selection]
    plt.imshow(np.hstack(digit_seq), cmap="gray", interpolation="nearest")
    for i, d in enumerate(selection):    
        plt.text(4+8*i,10,"%s"%digits.target[d])
    plt.axis("off")
    plt.title("Images from cluster %d" % target)
    
for i in range(10):    
    plot_same_target(i)



In [ ]:

    
## now cluster the data, but do it with too few and too many clusters

for clusters in [3,20]:
    plt.figure()
    kmeans = sklearn.cluster.KMeans(n_clusters=clusters)
    kmeans_target = kmeans.fit_predict(digits.data)
    plt.scatter(digits_2d[:,0], digits_2d[:,1], c=kmeans_target, cmap='jet')
    plt.title("%d clusters is not good" % clusters)
    # plot some items in the same cluster
    # (which should be the same digit or similar!)
    def plot_same_target(target):
        plt.figure()
        selection = np.where(kmeans_target==target)[0][0:20]
        digit_seq = [digit_data[s].reshape(8,8) for s in selection]
        plt.imshow(np.hstack(digit_seq), cmap="gray", interpolation="nearest")
        for i, d in enumerate(selection):    
            plt.text(4+8*i,10,"%s"%digits.target[d])
        plt.axis("off")

    for i in range(clusters):
        plot_same_target(i)
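
The cell above picks the number of clusters by hand. One common heuristic for choosing it automatically is the silhouette score, which compares how close each point is to its own cluster with how close it is to the nearest other cluster. The sketch below is an assumption-laden illustration on synthetic blobs, not part of the original notebook: the data, the candidate range and the names scores and best_k are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic data: four well-separated blobs at the corners of a square
blob_centres = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=blob_centres, cluster_std=1.0, random_state=0)

# fit k-means for a range of candidate cluster counts and score each result
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

# pick the candidate with the highest mean silhouette score
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real data such as the digits the peak is usually much less clear, so the score is best treated as a guide rather than an answer.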

Practical: Day and night

Use a clustering algorithm (choose one from sklearn) to cluster a set of images of street footage, some filmed at night, some during the day.

The images are available by loading data/daynight.npz using np.load(). This file has 512 images of size 160x65, RGB color, 8-bit unsigned integer. You can access these as:

images = np.load("data/daynight.npz")['data']

There are also true labels for each image in ['target']. Obviously, don't use these in the clustering process!

You should be able to cluster the images according to the time of day without using any labels. The raw pixel values can be used as features for clustering, but a more sensible approach is to summarise the image as a color histogram.

This essentially splits the color space into coarse bins, and counts the occurrences of each color type. You need to choose a value for $n$ (number of bins per channel) for the histogram; smaller numbers (like 3 or 4) are usually good.
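
To make the binning concrete, here is a minimal NumPy sketch of a color histogram. It is not the color_histogram() function provided below: it bins the raw RGB values directly rather than converting to YUV first, and the name rgb_histogram_sketch and the tiny test image are assumptions for illustration.

```python
import numpy as np

def rgb_histogram_sketch(img, n):
    """Count pixels in an n x n x n grid of color bins.

    img is a (height, width, 3) uint8 array; returns a flat array of length n**3.
    """
    pixels = img.reshape(-1, 3)  # one row per pixel, one column per channel
    hist, _ = np.histogramdd(pixels, bins=(n, n, n), range=[(0, 256)] * 3)
    return hist.ravel()

# a tiny 2x2 image: three black pixels and one white pixel
img = np.zeros((2, 2, 3), dtype=np.uint8)
img[0, 0] = (255, 255, 255)

features = rgb_histogram_sketch(img, n=3)
print(features.shape)  # n**3 = 27 bins in total
print(features[0], features[-1])  # counts in the darkest and brightest bins
```

With n bins per channel the feature vector has n**3 entries, which is why small values of n keep the features compact.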

Make a function that can show the images and the corresponding cluster labels, to test how well clustering has worked; you might also see if there are additional meaningful clusters in the imagery.

Steps

Load the imagery
Check you can plot it (use plt.imshow)
Create a set of features using color_histogram()
Try clustering it and plotting the result
Experiment with clustering algorithms and color_histogram() settings and see how this affects clustering performance.



In [ ]:

    
def color_histogram(img, n):
    """Return the color histogram of the color image img (height x width x 3), which should have dtype np.uint8.
    n specifies the number of bins **per channel**. The histogram is computed in YUV space. """
    # compute 3 channel colour histogram using openCV
    # we convert to YUV space to make the histogram better spaced
    chroma_img = cv2.cvtColor(img, cv2.COLOR_BGR2YUV) 
    # compute histogram and reduce to a flat array
    return cv2.calcHist([chroma_img.astype(np.uint8)], channels=[0,1,2], mask=None, histSize=[n,n,n], ranges=[0,256,0,256,0,256]).ravel()



In [ ]:

    
## Solution

Link to Manifold Learning: Part II