Dimensionality Reduction with the Shogun Machine Learning Toolbox

By Sergey Lisitsyn (lisitsyn) and Fernando J. Iglesias Garcia (iglesias).

This notebook illustrates unsupervised learning using the suite of dimensionality reduction algorithms available in Shogun. Shogun provides access to these algorithms through Tapkee, a C++ library specialized in dimensionality reduction.

Hands-on introduction to dimension reduction

First of all, let us start by showing what the purpose of dimensionality reduction actually is. To this end, we begin by creating a function that provides us with some data:


In [ ]:
import numpy
import os
SHOGUN_DATA_DIR = os.getenv('SHOGUN_DATA_DIR', '../../../data')

def generate_data(curve_type, num_points=1000):
	if curve_type=='swissroll':
		# angle along the roll; tt is also used later to colour the points
		tt = (3*numpy.pi/2)*(1 + 2*numpy.random.rand(num_points))
		height = numpy.random.rand(num_points) - 0.5
		X = numpy.array([tt*numpy.cos(tt), 10*height, tt*numpy.sin(tt)])
		return X, tt
	if curve_type=='scurve':
		# position along the S-shaped curve
		tt = 3*numpy.pi*(numpy.random.rand(num_points) - 0.5)
		height = numpy.random.rand(num_points) - 0.5
		X = numpy.array([numpy.sin(tt), 10*height, numpy.sign(tt)*(numpy.cos(tt)-1)])
		return X, tt
	if curve_type=='helix':
		# angle along the helix, evenly spaced in (0, 2*pi]
		tt = 2*numpy.pi*numpy.linspace(1, num_points, num_points)/num_points
		X = numpy.r_[[(2+numpy.cos(8*tt))*numpy.cos(tt)],
		             [(2+numpy.cos(8*tt))*numpy.sin(tt)],
		             [numpy.sin(8*tt)]]
		return X, tt

The function above can be used to generate three-dimensional datasets with the shape of a Swiss roll, the letter S, or a helix. These are three examples of datasets which have been extensively used to compare different dimension reduction algorithms. As an illustrative exercise of what dimensionality reduction can do, we will use a few of the algorithms available in Shogun to embed this data into a two-dimensional space. This is essentially the dimension reduction process, as we reduce the number of features from 3 to 2. The question that arises is: what principle should we use to preserve the important relations between data points? In fact, different algorithms imply different criteria to answer this question.
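As a quick sanity check, we can generate one of the datasets and inspect the shapes returned (note that the data matrix holds one point per column, and the parameter tt will be used later to colour the points):


In [ ]:
data, tt = generate_data('swissroll')
print(data.shape)  # (3, 1000): three features, one datapoint per column
print(tt.shape)    # (1000,): the roll parameter, used for colouring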

Just to start, let us pick an algorithm and one of the datasets: for example, let us see what embedding of the Swiss roll is produced by the Isomap algorithm. Isomap is essentially a slightly modified Multidimensional Scaling (MDS) algorithm, which finds an embedding as the solution of the following optimization problem:

$$ \min_{x'_1, x'_2, \dots} \sum_i \sum_j \left( d'(x'_i, x'_j) - d(x_i, x_j) \right)^2, $$

with known $x_1, x_2, \dots \in X$ and unknown variables $x'_1, x'_2, \dots \in X'$, where $\text{dim}(X') < \text{dim}(X)$, and where $d: X \times X \to \mathbb{R}$ and $d': X' \times X' \to \mathbb{R}$ are arbitrary distance functions (for example, the Euclidean distance).
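To make this objective concrete, here is a minimal NumPy sketch that evaluates the stress for a candidate embedding, assuming Euclidean distances in both spaces and the same data layout as above (one point per column); the helper name mds_stress is ours, purely for illustration:


In [ ]:
import numpy

def mds_stress(X, X_embedded):
	# pairwise Euclidean distances in the original space (X is dim x num_points)
	d = numpy.sqrt(((X[:, :, None] - X[:, None, :])**2).sum(axis=0))
	# pairwise Euclidean distances in the embedded space
	d_prime = numpy.sqrt(((X_embedded[:, :, None] - X_embedded[:, None, :])**2).sum(axis=0))
	# sum of squared differences between the two distance matrices
	return ((d_prime - d)**2).sum()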

In less mathematical terms, the MDS algorithm finds an embedding that preserves the pairwise distances between points as well as possible. The Isomap algorithm changes one small but crucial detail: the distance. Instead of using plain pairwise distances, it takes global structure into account through shortest paths on the neighborhood graph (the so-called geodesic distance). The neighborhood graph is defined as the graph with data points as nodes and weighted edges, with weights equal to the distances between points; an edge between points $x_i$ and $x_j$ exists if and only if $x_j$ is among the $k$ nearest neighbors of $x_i$. Later we will see that this 'global factor' changes the game for the Swiss roll dataset.
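To illustrate the geodesic distance, here is a minimal sketch (assuming SciPy is available, which the rest of this notebook does not require) that builds the k-nearest-neighbour graph and runs Dijkstra's shortest paths over it; the helper geodesic_distances and its parameters are ours, not part of Shogun's API:


In [ ]:
import numpy
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(X, k=20):
	# X holds one datapoint per column, as returned by generate_data
	num_points = X.shape[1]
	# dense pairwise Euclidean distances
	d = numpy.sqrt(((X[:, :, None] - X[:, None, :])**2).sum(axis=0))
	# indices of the k nearest neighbours of each point (column 0 is the point itself)
	neighbors = numpy.argsort(d, axis=1)[:, 1:k+1]
	rows = numpy.repeat(numpy.arange(num_points), k)
	cols = neighbors.ravel()
	# sparse weighted adjacency matrix of the neighborhood graph
	graph = csr_matrix((d[rows, cols], (rows, cols)), shape=(num_points, num_points))
	# shortest paths on the symmetrised graph approximate geodesic distances
	return shortest_path(graph, method='D', directed=False)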

Before running Isomap and MDS on the Swiss roll, we prepare a small function to plot any of the original data sets together with its embedding.


In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


def plot(data, embedded_data, colors='m'):
	fig = plt.figure()
	fig.set_facecolor('white')
	# original three-dimensional data on the left
	ax = fig.add_subplot(121, projection='3d')
	ax.scatter(data[0], data[1], data[2], c=colors, cmap=plt.cm.Spectral)
	plt.axis('tight'); plt.axis('off')
	# two-dimensional embedding on the right
	ax = fig.add_subplot(122)
	ax.scatter(embedded_data[0], embedded_data[1], c=colors, cmap=plt.cm.Spectral)
	plt.axis('tight'); plt.axis('off')
	plt.show()

In [ ]:
from shogun import Isomap, RealFeatures, MultidimensionalScaling

# wrap data into Shogun features
data, colors = generate_data('swissroll')
features = RealFeatures(data)

# create instance of Isomap converter and configure it
isomap = Isomap()
isomap.set_target_dim(2)
# set the number of neighbours used in kNN search
isomap.set_k(20)

# create instance of Multidimensional Scaling converter and configure it
mds = MultidimensionalScaling()
mds.set_target_dim(2)

# embed Swiss roll data
embedded_data_mds = mds.embed(features).get_feature_matrix()
embedded_data_isomap = isomap.embed(features).get_feature_matrix()

plot(data, embedded_data_mds, colors)
plot(data, embedded_data_isomap, colors)

As can be seen in the figure above, Isomap has been able to "unroll" the data, reducing its dimension from three to two. At the same time, points with similar colours in the input space are close to points with similar colours in the output space. That is, a new representation of the data has been obtained; this new representation maintains the properties of the original data while reducing the amount of information required to represent it. Note that the fact that the embedding of the Swiss roll looks good in two dimensions stems from the intrinsic dimension of the input data: although the original data lives in a three-dimensional space, its intrinsic dimension is only two, since the only degrees of freedom are the position along the roll (which fixes both the polar angle and the distance from the centre) and the height.

Finally, we use yet another method, Stochastic Proximity Embedding (SPE), to embed the helix:


In [ ]:
from shogun import StochasticProximityEmbedding

# wrap data into Shogun features
data, colors = generate_data('helix')
features = RealFeatures(data)

# create instance of SPE converter and configure it
converter = StochasticProximityEmbedding()
converter.set_target_dim(2)

# embed helix data
embedded_features = converter.embed(features)
embedded_data = embedded_features.get_feature_matrix()

plot(data, embedded_data, colors)

References

  • Lisitsyn, S., Widmer, C., Iglesias Garcia, F. J. Tapkee: An Efficient Dimension Reduction Library. (Link to paper in JMLR.)
  • Tenenbaum, J. B., de Silva, V. and Langford, J. C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. (Link to Isomap's website.)