In this tutorial, we explore manifold learning techniques to visualize a set of 4000 SDSS spectra. This is a much more exploratory exercise than the previous two: the goal is to find the best visualization of this high-dimensional space, where "best" is a qualitative measure of how well the different classes of points are separated in the projected space. You will apply PCA, LLE, Modified LLE, and Isomap to the data under various normalizations.
Because we're going to be plotting things below, we'll first make sure we're in IPython's pylab mode:
In [ ]:
%pylab inline
import pylab as pl
This tutorial assumes the notebook is within the tutorial directory structure, and that the fetch_data.py script has been run to download the data locally. If the data is in a different location, you can change the DATA_HOME variable below.
In [ ]:
import os
DATA_HOME = os.path.abspath('../data/sdss_spectra')
In [ ]:
import numpy as np
data = np.load(os.path.join(DATA_HOME, 'spec4000_corrected.npz'))
X = data['X']
y = data['y']
labels = data['labels']
# shuffle the data
i = np.arange(y.shape[0], dtype=int)
np.random.shuffle(i)
X = X[i]
y = y[i]
print X.shape
So we see that the dataset consists of 4000 points in 1000 dimensions. Let's plot a few of the spectra so we can see what they look like:
In [ ]:
def plot_spectral_types(data):
    X = data['X']
    y = data['y']
    wavelengths = data['wavelengths']
    labels = data['labels']

    for i_class in (2, 3, 4, 5, 6):
        i = np.where(y == i_class)[0][0]
        l = pl.plot(wavelengths, X[i] + 20 * i_class)
        c = l[0].get_color()
        pl.text(6800, 2 + 20 * i_class, labels[i_class], color=c)

    pl.subplots_adjust(hspace=0)
    pl.xlabel('wavelength (Angstroms)')
    pl.ylabel('flux + offset')
    pl.title('Sample of Spectra')
plot_spectral_types(data)
Here we see what the 1000 dimensions mean: for each object, there is a measurement in each of 1000 wavelength bins. In other words, these objects exist in a 1000 dimensional parameter space. If we could draw a graph in 1000 dimensions, each spectrum could be represented by a single point in this 1000 dimensional space.
Unfortunately, it is very difficult for us to visualize even four or five dimensions, let alone 1000. This is why it is often useful to think about finding an optimal lower-dimensional projection of the dataset, where optimal is defined in some quantitative way.
In this exercise we will be visualizing several three-dimensional projections of high dimensional data. To streamline this, we'll first define a function which lets us scatter-plot three dimensions of data:
In [ ]:
def three_component_plot(c1, c2, c3, color, labels,
                         trim_outliers=True, sigma_cutoff=3):
    if trim_outliers:
        mask = np.zeros(c1.shape, dtype=bool)
        for c in [c1, c2, c3]:
            c = np.asarray(c)
            mu = np.mean(c)
            std = np.std(c)
            mask |= (c > mu + sigma_cutoff * std)
            mask |= (c < mu - sigma_cutoff * std)
        print "removing %i outliers" % mask.sum()
        c1 = c1[~mask]
        c2 = c2[~mask]
        c3 = c3[~mask]
        color = color[~mask]

    fig = pl.figure(figsize=(8, 8))
    fig.subplots_adjust(hspace=0.05, wspace=0.05)

    kwargs = dict(s=8, lw=0, c=color, vmin=2, vmax=6)

    ax1 = fig.add_subplot(221)
    pl.scatter(c1, c2, **kwargs)
    ax1.set_ylabel('component 2')

    ax2 = fig.add_subplot(223, sharex=ax1)
    pl.scatter(c1, c3, **kwargs)
    ax2.set_xlabel('component 1')
    ax2.set_ylabel('component 3')

    ax3 = fig.add_subplot(224, sharey=ax2)
    pl.scatter(c2, c3, **kwargs)
    ax3.set_xlabel('component 2')

    for ax in (ax1, ax2, ax3):
        ax.xaxis.set_major_formatter(pl.NullFormatter())
        ax.yaxis.set_major_formatter(pl.NullFormatter())
        ax.axis('tight')

    color_format = pl.FuncFormatter(lambda i, *args: labels[int(i)])
    pl.colorbar(ticks=range(2, 7), format=color_format,
                cax=pl.axes((0.52, 0.51, 0.02, 0.39)))
    pl.clim(1.5, 6.5)
Let's try this out right now: we'll use three randomly-drawn dimensions for the points within the dataset:
In [ ]:
three_component_plot(X[:, 100], X[:, 200], X[:, 300], y, labels, trim_outliers=True)
We see that there are some strong correlations between the different dimensions. In particular, components 2 and 3 seem to measure nearly the same information.
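We can put a rough number on this impression. The following quick check is our own addition (not part of the original exercise): it computes the correlation coefficient between the two wavelength bins plotted as components 2 and 3.
In [ ]:
# Correlation between the wavelength bins used as components 2 and 3 above
# (a quick sanity check added for illustration).
print(np.corrcoef(X[:, 200], X[:, 300])[0, 1])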
As we saw in the earlier exercise, one possible projection to use is based on Principal Component Analysis (PCA). PCA looks for the linear combinations of data attributes which show the largest variance, and which are thus in some sense the most "important" combinations of features.
We'll want to experiment with different numbers of samples, and different normalizations, so we'll create a quick convenience function to streamline this:
In [ ]:
from sklearn import preprocessing
def preprocess(data, shuffle=True, n_samples=1000, normalization=None):
    """Preprocess the data

    Parameters
    ----------
    shuffle: True or False
    n_samples: integer between 1 and 4000
    normalization: None or 'L1' or 'L2'
    """
    X = data['X']
    y = data['y']

    # shuffle the data
    if shuffle:
        i = np.arange(y.shape[0], dtype=int)
        np.random.shuffle(i)
        X = X[i]
        y = y[i]

    # truncate the data
    X = X[:n_samples]
    y = y[:n_samples]

    # normalize the data
    if not normalization:
        pass
    elif normalization.lower() == 'l2':
        X = preprocessing.normalize(X, 'l2')
    elif normalization.lower() == 'l1':
        X = preprocessing.normalize(X, 'l1')
    else:
        raise ValueError("Unrecognized normalization: '%s'" % normalization)

    return X, y
Now we'll perform the randomized PCA projection:
In [ ]:
# Principal Component Analysis
from sklearn.decomposition import RandomizedPCA
X, y = preprocess(data, shuffle=False, n_samples=1000, normalization='L2')
rpca = RandomizedPCA(n_components=3, random_state=0)
X_proj = rpca.fit_transform(X)
three_component_plot(X_proj[:, 0], X_proj[:, 1], X_proj[:, 2], y, labels, trim_outliers=True)
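To get a quantitative sense of how much information this three-component projection retains, we can also look at the explained variance of the fitted PCA. This check is our own addition and assumes your scikit-learn version exposes explained_variance_ratio_ on RandomizedPCA:
In [ ]:
# Fraction of the total variance captured by each of the three components
# (illustrative addition; assumes RandomizedPCA has explained_variance_ratio_).
print(rpca.explained_variance_ratio_)
print(rpca.explained_variance_ratio_.sum())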
The PCA projection allows us to visualize the data in a meaningful way. In particular, it shows that the absorption galaxies (blue points) occupy a different region of parameter space than the emission galaxies (green points), and that quasars (orange and red points) are relatively rare in the dataset. This sort of projection has been used successfully, for example, to understand the relationship between galaxies and quasars, and as a step toward automated classification based on spectra (see Yip et al. 2004).
But the weakness of PCA is that it is a linear projection, and features that distinguish quasars (red and orange points) from emission galaxies (green points) are non-linear: for this reason, nonlinear projections can be a better choice (see Vanderplas et al. 2009). In this exercise we will explore several of these projection methods. For a description of the available methods, see http://scikit-learn.org/stable/modules/manifold.html.
First let's try Locally Linear Embedding, available as sklearn.manifold.LocallyLinearEmbedding.
In [ ]:
# here create an object from sklearn.manifold.LocallyLinearEmbedding with n_components=3 and method='standard'
# you will have to select n_neighbors. 15 is a good first guess.
# initialize the object, fit on the dataset X, and compute the projection X_proj
# The syntax is similar to RandomizedPCA, above
# on older versions of scikit-learn, out_dim is used rather than n_components
#X, y = preprocess(...
# perform LLE fit here
#three_component_plot(...
If the notebook is within the tutorial directory structure, the following command will load the solution:
In [ ]:
%loadpy soln/03-01.py
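If you'd rather not load the solution file, the cell below is a rough sketch of what a solution might look like. It is our own version and may differ in detail from soln/03-01.py; n_neighbors=15 and L2 normalization are simply reasonable starting points.
In [ ]:
# A possible LLE solution sketch (not necessarily identical to soln/03-01.py).
from sklearn.manifold import LocallyLinearEmbedding

X, y = preprocess(data, shuffle=False, n_samples=1000, normalization='L2')

# n_neighbors=15 is just a first guess; experiment with other values
lle = LocallyLinearEmbedding(n_neighbors=15, n_components=3,
                             method='standard')
X_proj = lle.fit_transform(X)

three_component_plot(X_proj[:, 0], X_proj[:, 1], X_proj[:, 2],
                     y, labels, trim_outliers=True)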
What do you notice about this projection? What are the effects of normalizing or not normalizing? How does the number of neighbors affect the results?
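One way to explore the normalization question systematically is sketched below (our own addition; the parameter choices are arbitrary): loop over the normalization options and compare the resulting projections.
In [ ]:
# Compare LLE projections with and without L2 normalization
# (illustrative sketch for the questions above).
from sklearn.manifold import LocallyLinearEmbedding

for normalization in (None, 'L2'):
    X, y = preprocess(data, shuffle=False, n_samples=1000,
                      normalization=normalization)
    lle = LocallyLinearEmbedding(n_neighbors=15, n_components=3,
                                 method='standard')
    X_proj = lle.fit_transform(X)
    three_component_plot(X_proj[:, 0], X_proj[:, 1], X_proj[:, 2],
                         y, labels, trim_outliers=True)
    pl.suptitle('LLE, normalization=%s' % normalization)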
Next we'll try out Modified LLE (MLLE). MLLE essentially uses multiple weight coefficients in each neighborhood to better preserve the local geometry of the data in the projection. You can use MLLE by copying the above LLE code, but passing the keyword method='modified':
In [ ]:
# train and plot the MLLE projection of the data here
# use the LLE code from above as a template.
If the notebook is within the tutorial directory structure, the following command will load the solution:
In [ ]:
%loadpy soln/03-02.py
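Again, if you'd rather not load the solution file, the cell below is a rough sketch of a possible answer (our own version, which may differ from soln/03-02.py); it simply re-uses the LLE setup with method='modified'.
In [ ]:
# A possible MLLE solution sketch (not necessarily identical to soln/03-02.py).
from sklearn.manifold import LocallyLinearEmbedding

X, y = preprocess(data, shuffle=False, n_samples=1000, normalization='L2')

mlle = LocallyLinearEmbedding(n_neighbors=15, n_components=3,
                              method='modified')
X_proj = mlle.fit_transform(X)

three_component_plot(X_proj[:, 0], X_proj[:, 1], X_proj[:, 2],
                     y, labels, trim_outliers=True)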
How does this projection compare to that of standard LLE? Does this projection lead to a more intuitive representation of the relationship between the points? Experiment with a few choices of n_neighbors, and normalization vs. no normalization.
The final technique we'll explore is Isomap, short for "Isometric Mapping". It's an approach to the same problem that is based on graph theory. The scikit-learn implementation is in sklearn.manifold.Isomap. As above, you should preprocess the data, compute the Isomap projection, and then plot the results with our three_component_plot function.
In [ ]:
# compute and plot the isomap projection here:
If the notebook is within the tutorial directory structure, the following command will load the solution:
In [ ]:
%loadpy soln/03-03.py
In [ ]:
#03-03.py
X, y = preprocess(data, shuffle=False, n_samples=1000, normalization=None)
from sklearn.manifold import Isomap
iso = Isomap(n_neighbors=15, n_components=3)
X_proj = iso.fit_transform(X)
three_component_plot(X_proj[:, 0], X_proj[:, 1], X_proj[:, 2], y, labels, trim_outliers=True)
Again, explore the different options for the projection. How do the normalization, the number of neighbors, etc. affect the resulting projection? Between PCA, LLE, MLLE, and Isomap, which method leads to the most useful visualization of the data (i.e. which could be used to construct an intuitive model for a simple classification of the different objects)?
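One convenient way to structure this exploration (our own sketch; the parameter values are arbitrary) is to loop over a few settings and plot each projection in turn, for example:
In [ ]:
# Loop over a few neighbor counts for Isomap (illustrative sketch).
from sklearn.manifold import Isomap

X, y = preprocess(data, shuffle=False, n_samples=1000, normalization=None)

for n_neighbors in (5, 15, 30):
    iso = Isomap(n_neighbors=n_neighbors, n_components=3)
    X_proj = iso.fit_transform(X)
    three_component_plot(X_proj[:, 0], X_proj[:, 1], X_proj[:, 2],
                         y, labels, trim_outliers=True)
    pl.suptitle('Isomap, n_neighbors=%i' % n_neighbors)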
This has been a fairly qualitative exploration of several projection and visualization methods available in scikit-learn. These have been explored in more detail in several papers, a selection of which can be found below:
There are still many unanswered questions in the use of manifold learning methods, both in general and in Astronomical applications.