In [ ]:
from __future__ import division, print_function, absolute_import
By Matthew J. Graham (California Institute of Technology)
Acknowledgements to A. A. Miller and G. Cabrera.
(c) 2017, California Institute of Technology
Version 0.1
Given a data set of point sources, a typical machine learning problem is how many different kinds of objects can we distinguish in it? This is best answered with unsupervised algorithms, which aim to identify hidden structures in data sets. Of course, interpretation of any such structures, i.e., whether they actually indicate physically meaningful correlations or relationships, is left to the domain expert.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
You're going to use the Python package scikit-learn to attack a number of problems which are best handled with unsupervised machine learning algorithms (models). In scikit-learn, different models are accessed via the various modules with user-defined tuning parameters. Data (features) for the models are stored in a 2D numpy array, `X`, with each row representing an individual object and each column a specific feature. Unsupervised models are fit by calling `.fit(X)`. Once a model has been fit to a data set, it can be used to generate predictions (e.g., cluster labels) for a new data set `Xnew` by calling `.predict(Xnew)`. Any other method calls are described in the documentation.
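As a minimal sketch of that calling pattern (using `KMeans` on synthetic data purely for illustration; the array shapes, not the numbers, are the point here):
In [ ]:
# A toy example of the fit/predict pattern: 100 objects with 2 features each
import numpy as np
from sklearn.cluster import KMeans
X = np.random.normal(size=(100, 2))      # feature matrix: one row per object
model = KMeans(n_clusters=2)             # choose the model and its tuning parameters
model.fit(X)                             # fit the (unlabelled) data
Xnew = np.random.normal(size=(10, 2))    # new observations with the same features
labels = model.predict(Xnew)             # assign each new object to a cluster
print(labels)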
You were introduced to the $k$-means clustering algorithm in the first DSFP session in August 2016 (and in the lecture after lunch today) and to its application to the gold-standard Iris dataset. We're going to revisit a bit of that material first, but move straight to an astronomical data set from SDSS: SDSS_colors.csv.
Problem: So let's start by loading the data set and plotting the features against each other to get a feel for what the data look like (you should always do this as a matter of course when getting a new, or not so new, data set).
In [ ]:
import pandas as pd
df = pd.read_csv("SDSS_colors.csv")
# complete
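One possible way to complete the cell above is with a scatter matrix of every colour against every other (this assumes a pandas version that provides `pd.plotting.scatter_matrix`):
In [ ]:
# Pairwise scatter plots of all features; the diagonal shows each feature's histogram
pd.plotting.scatter_matrix(df, figsize=(10, 10), s=2, alpha=0.2);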
It certainly appears as though there is structure within this data set, but the human brain is very good at finding patterns where there are none (see pareidolia).
Problem: Let's see what it looks like when you've fit it with two different $k$-means models, one with 2 clusters and one with 3 clusters. Plot the results in the $g-r$ vs. $u-g$ plane (and don't forget to also plot the cluster centers).
Hint - Remember that the `.labels_` attribute of the `KMeans` object returns the cluster labels assigned by the algorithm.
In [ ]:
from sklearn.cluster import KMeans
Kcluster = KMeans( # complete
# complete
plt.figure()
plt.scatter( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
Kcluster = KMeans( # complete
plt.figure()
plt.scatter( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
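One possible completion of the cell above (a sketch, not the definitive answer), assuming the first two columns of `df` are the $u-g$ and $g-r$ colours (check `df.columns` if yours are ordered differently):
In [ ]:
X = df.values   # feature matrix: one row per object, one column per colour

for n_clusters in (2, 3):
    Kcluster = KMeans(n_clusters=n_clusters)
    Kcluster.fit(X)

    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=Kcluster.labels_, s=4, alpha=0.3)
    # Cluster centres, projected onto the first two features
    plt.scatter(Kcluster.cluster_centers_[:, 0], Kcluster.cluster_centers_[:, 1],
                marker='*', s=200, c='red')
    plt.xlabel('u - g')
    plt.ylabel('g - r')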
The scikit-learn
implementation of $k$-means has various optional parameters that can be used to fine tune the algorithm.
Problem: See how the results change if the $k=3$ cluster model is called with the `n_init = 1` and `init = 'random'` options instead. Use `rs` for the random state.
Note: the defaults for these parameters are 10 and `k-means++`, respectively. Refer to the documentation to see why they are better choices than what you are about to try.
In [ ]:
rs = 14
Kcluster1 = KMeans(# complete
plt.figure()
plt.scatter(# complete
plt.xlabel('u - g')
plt.ylabel('g - r')
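A possible completion, with `X = df.values` as in the earlier sketch:
In [ ]:
Kcluster1 = KMeans(n_clusters=3, n_init=1, init='random', random_state=rs)
Kcluster1.fit(X)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=Kcluster1.labels_, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')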
By default, $k$-means uses Euclidean distance as its similarity metric and the magnitude of individual features can therefore have a strong effect on the final clustering outcome.
Problem: Determine the mean, standard deviation, and minimum and maximum values of each feature in the data set. Do you think any one feature is particularly dominant?
In [ ]:
print("Feature\t\t\tmean\tstd\tmin\tmax")
for key in df.keys():
    print("{:s}\t{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}".format( # complete
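One way to complete the loop; `df.describe()` gives the same summary in a single call.
In [ ]:
print("Feature\t\t\tmean\tstd\tmin\tmax")
for key in df.keys():
    print("{:s}\t{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}".format(
        key, df[key].mean(), df[key].std(), df[key].min(), df[key].max()))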
When the ranges of the features are vastly different, it is common practice to rescale each feature, either to the range [0, 1] (known as normalization) or to have zero mean and unit variance (known as standardization - this assumes a Gaussian distribution for the feature).
Problem: Try rescaling the data to see if it makes a difference to previous results.
Hint - The `StandardScaler()` class within the `sklearn.preprocessing` module may prove useful.
In [ ]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit_transform( # complete
Kcluster = KMeans( # complete
plt.figure()
plt.scatter(# complete
plt.xlabel('u - g')
plt.ylabel('g - r')
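A possible completion; note that `fit_transform` returns the rescaled feature array itself, which can be passed straight to `KMeans`:
In [ ]:
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

Kcluster = KMeans(n_clusters=3)
Kcluster.fit(X_scaled)

plt.figure()
# Plot the original colours, coloured by the clusters found in the rescaled space
plt.scatter(X[:, 0], X[:, 1], c=Kcluster.labels_, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')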
scikit-learn also has an alternative implementation of the $k$-means algorithm, `MiniBatchKMeans`, which is faster but at the expense of accuracy.
Problem: See what the performance differences between the two implementations are and also whether different results are obtained.
Hint - The IPython `%timeit` magic command is very good for performance evaluations.
In [ ]:
from sklearn.cluster import MiniBatchKMeans
%timeit MiniBatchKMeans( # complete
%timeit KMeans( # complete
# complete
plt.figure()
plt.scatter(# complete
plt.xlabel('u - g')
plt.ylabel('g - r')
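A sketch of one possible comparison (the exact timings will depend on your machine):
In [ ]:
%timeit MiniBatchKMeans(n_clusters=3).fit(X)
%timeit KMeans(n_clusters=3).fit(X)

mbk = MiniBatchKMeans(n_clusters=3)
mbk.fit(X)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=mbk.labels_, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')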
One of the issues with the $k$-means algorithm is that the number of clusters, $k$, is a free parameter that has to be chosen in advance. There are a number of ways of determining it for a given data set. One of them is to calculate the mean silhouette score for (a subset of) the data for a variety of $k$ values and pick the $k$ whose score is closest to 1.
Problem: Calculate the mean silhouette value for $k$ in the range 2 to 10 and select the most appropriate one.
Hint - The `silhouette_score()` function within the `metrics` module may prove useful.
Note: This will take some time to run
In [ ]:
from sklearn import metrics
# Select a random subset of the data
N_small = 100
dfSample = df[ # complete
ss = []
nn = [2,3,5,10]
for n in # complete
    Kcluster = KMeans( # complete
    ss.append(metrics.silhouette_score( # complete
plt.figure()
plt.scatter(# complete
plt.xlabel('k')
plt.ylabel('Mean silhouette')
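A possible completion, drawing a random subset of the rows with pandas' `sample` method to keep the run time manageable, and reusing `rs` as the random seed:
In [ ]:
dfSample = df.sample(N_small, random_state=rs)   # random subset of N_small objects
Xs = dfSample.values

ss = []
nn = [2, 3, 5, 10]
for n in nn:
    Kcluster = KMeans(n_clusters=n).fit(Xs)
    ss.append(metrics.silhouette_score(Xs, Kcluster.labels_))

plt.figure()
plt.scatter(nn, ss)
plt.xlabel('k')
plt.ylabel('Mean silhouette')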
DBSCAN is a highly popular clustering algorithm that does not require the number of clusters to be specified. Instead, it defines clusters according to two parameters: `minPts`, the minimum number of points necessary for a cluster, and $\epsilon$, a distance measure. scikit-learn has an implementation of the algorithm as part of the `cluster` module: `DBSCAN`. $\epsilon$ and `minPts` are set by the implementation parameters `eps` and `min_samples`, respectively.
Problem: Cluster the SDSS color data with DBSCAN and see what effect changing the fine tuning parameters has on the results. How does DBSCAN compare to $k$-means?
Note: DBSCAN labels outliers as -1 and so `plt.scatter()` will plot all these points as the same color.
In [ ]:
from sklearn.cluster import DBSCAN
dbs = DBSCAN( # complete
dbs.fit( # complete
# Class numbers
labels = dbs.labels_
unique_labels = np.unique(labels)
count = [(labels == lab).sum() for lab in unique_labels]
plt.figure()
for i in range(len(unique_labels)):
    plt.plot( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
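A possible completion; the `eps` and `min_samples` values below are just a starting guess, so vary them and watch how the clusters and the outlier fraction change:
In [ ]:
dbs = DBSCAN(eps=0.1, min_samples=10)   # starting guesses, not tuned values
dbs.fit(X)

labels = dbs.labels_
unique_labels = np.unique(labels)

plt.figure()
for lab in unique_labels:
    mask = (labels == lab)
    # label -1 is the set of points DBSCAN considers noise
    plt.plot(X[mask, 0], X[mask, 1], '.', ms=2, alpha=0.3,
             label='{} ({} points)'.format(lab, mask.sum()))
plt.legend()
plt.xlabel('u - g')
plt.ylabel('g - r')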
A recommended approach for fine tuning `minPts` and $\epsilon$ is to set `minPts` to (at least) the dimension of the feature set and plot a $k$-distance graph for the data with $k$ = `minPts`. The appropriate value for $\epsilon$ is indicated by an elbow in the diagram.
Problem: Try this for the data set.
In [ ]:
from sklearn.neighbors import NearestNeighbors
N_neigh = # complete
radius = # complete
neigh = NearestNeighbors( # complete
# complete
dist, ind = neigh.kneighbors( # complete
plt.figure()
plt.scatter( # complete
plt.xlabel('Points sorted by distance to %sth nearest neighbor' % N_neigh)
plt.ylabel('%sth nearest neighbor distance' % N_neigh)
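A possible completion: keep the distance to the `N_neigh`-th neighbour for every point, sort those distances, and look for the elbow. Setting `N_neigh` to the number of features follows the heuristic above.
In [ ]:
N_neigh = X.shape[1]             # (at least) the dimensionality of the feature set
neigh = NearestNeighbors(n_neighbors=N_neigh)
neigh.fit(X)
dist, ind = neigh.kneighbors(X)  # kneighbors returns (distances, indices)

kdist = np.sort(dist[:, -1])     # distance to the N_neigh-th neighbour, sorted

plt.figure()
plt.scatter(np.arange(len(kdist)), kdist, s=1)
plt.xlabel('Points sorted by distance to %sth nearest neighbor' % N_neigh)
plt.ylabel('%sth nearest neighbor distance' % N_neigh)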
Gaussian mixture models (GMMs) are a popular choice for clustering data as well as density estimation.
Problem: Fit a Gaussian mixture model to the data (assume a certain number of Gaussian components). Try different numbers of components.
Hint: The `GaussianMixture` class may prove useful.
In [ ]:
from sklearn.mixture import GaussianMixture
clf = GaussianMixture(n_components = # complete
clf.fit( # complete
plt.figure()
plt.scatter( # complete
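A possible completion with three components (try other values); `predict` assigns each object to its most probable Gaussian component:
In [ ]:
clf = GaussianMixture(n_components=3)
clf.fit(X)
labels = clf.predict(X)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')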
Hierarchical clustering algorithms are an alternative to partitional clustering algorithms such as $k$-means and DBSCAN.
Problem: Fit a hierarchical clustering algorithm to the data. How does it compare with $k$-means or DBSCAN?
Hint: The `AgglomerativeClustering` class may prove useful.
In [ ]:
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering( # complete
agg.fit_predict( # complete
plt.figure()
plt.scatter( # complete
plt.scatter( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
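A possible completion; `fit_predict` returns the cluster labels directly. Agglomerative clustering scales poorly with sample size, so this sketch runs on a random subsample (adjust its size to your machine and the size of the file):
In [ ]:
Xsub = df.sample(5000, random_state=rs).values   # subsample: the full tree is expensive to build

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(Xsub)

plt.figure()
plt.scatter(Xsub[:, 0], Xsub[:, 1], c=labels, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')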
It's often unclear from simple scatter plots what the probability density function (pdf) for a data set actually looks like; density estimation algorithms attempt to recover it. Many of the basic aspects of density estimation were covered in the Visualization talk from Session 1 (Day 3) so we're going to focus more on some operational details.
By far the simplest density estimation algorithm is the histogram and several prescriptions exist for the optimal bin width/number of bins to use.
Problem: Using the SDSS galaxy data set, SDSS_gals.csv, show the redshift distribution with Scott's rule, the Freedman-Diaconis rule, the Knuth rule, and Bayesian blocks.
Hint - The `hist` function in the `astroML.plotting` module may prove useful for the Knuth rule and Bayesian blocks. `astroML` accompanies the book by Ivezic, Connolly, Gray and VanderPlas and provides additional machine learning material for astronomy.
In [ ]:
from astroML.plotting import hist
df = pd.read_csv("SDSS_gals.csv")
z = df # complete
fig, (ax0, ax1, ax2, ax3) = plt.subplots(1, 4, figsize = (17,5))
hist(z # complete
hist(z # complete
hist(z # complete
hist(z # complete
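A possible completion, assuming the redshift column in SDSS_gals.csv is called `z` (check `df.columns`) and that your astroML version accepts the bin-rule strings used below:
In [ ]:
df = pd.read_csv("SDSS_gals.csv")
z = df['z']   # assumed column name for redshift

fig, (ax0, ax1, ax2, ax3) = plt.subplots(1, 4, figsize=(17, 5))
hist(z, bins='scott', ax=ax0)
ax0.set_title("Scott's rule")
hist(z, bins='freedman', ax=ax1)
ax1.set_title('Freedman-Diaconis rule')
hist(z, bins='knuth', ax=ax2)
ax2.set_title('Knuth rule')
hist(z, bins='blocks', ax=ax3)
ax3.set_title('Bayesian blocks')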
Kernel density estimation fits a kernel to each data point and then estimates the pdf from the sum of all of these.
Problem: Fit the data with a standard KDE. How does it compare against the histograms?
Hint - The `KernelDensity` class in the `sklearn.neighbors` module may be useful.
In [ ]:
from sklearn.neighbors import KernelDensity
kde = KernelDensity( # complete
# complete
plt.figure()
plt.scatter( # complete
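A possible completion with a Gaussian kernel and a guessed bandwidth (the next problem shows how to choose it properly). `KernelDensity` expects a 2D array, and `score_samples` returns log densities:
In [ ]:
zv = np.asarray(z)[:, np.newaxis]                        # KernelDensity wants shape (N, 1)

kde = KernelDensity(kernel='gaussian', bandwidth=0.01)   # bandwidth is just a guess here
kde.fit(zv)

z_grid = np.linspace(z.min(), z.max(), 1000)
log_dens = kde.score_samples(z_grid[:, np.newaxis])

plt.figure()
plt.plot(z_grid, np.exp(log_dens))
plt.xlabel('z')
plt.ylabel('p(z)')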
The optimal bandwidth parameter, $h$, can be determined by cross-validation.
Problem: Try it for the data set over $\log(h)$ in the range [-3, 0]:
Hint - The `GridSearchCV` class in the `sklearn.grid_search` module may be useful.
In [ ]:
from sklearn.grid_search import GridSearchCV
# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': # complete
grid = GridSearchCV( # complete
# complete
# Extract the scores and plot them
# complete
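A possible completion. In recent scikit-learn versions `GridSearchCV` has moved to `sklearn.model_selection`, so adjust the import to your installation; the sketch also cross-validates on a random subsample of the redshifts to keep the run time down.
In [ ]:
# use grid search cross-validation to optimize the bandwidth
sample = z.sample(5000, random_state=0).values[:, np.newaxis]   # subsample for speed

params = {'bandwidth': 10 ** np.linspace(-3, 0, 20)}
grid = GridSearchCV(KernelDensity(), params, cv=3)
grid.fit(sample)
print("Best bandwidth: {:.4f}".format(grid.best_params_['bandwidth']))

# Extract the mean cross-validation score for each bandwidth and plot them
# (older scikit-learn versions expose grid.grid_scores_ instead of cv_results_)
plt.figure()
plt.semilogx(params['bandwidth'], grid.cv_results_['mean_test_score'])
plt.xlabel('h')
plt.ylabel('Mean CV log-likelihood')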
So far we've just been looking at one dimension but now let's consider a two-dimensional case: the density of galaxies in terms of their radius (`petroRad_r_kpc`) and magnitude (`absPetroMag_r`).
Problem: Repeat finding the optimal bandwidth via cross-validation and plot the resulting pdf.
Note: The `score_samples` method returns the $\log$ of the density.
In [ ]:
data = # complete
# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': # complete
grid = GridSearchCV( # complete
# complete
x = np.arange(.0, 40, 1)
y = np.arange(-24, -15, 0.2)
X, Y = np.meshgrid(x, y)
XY = np.array([X.flatten(), Y.flatten()]).transpose()
pdf = # complete
# Use imshow to make a color density plot
plt.imshow( # complete, interpolation='none', cmap=plt.cm.jet, origin='lower', clip_on=True)
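One possible completion of the cell above (the bandwidth grid and the subsample size are guesses to tune):
In [ ]:
sub = df[['petroRad_r_kpc', 'absPetroMag_r']].sample(5000, random_state=0).values   # subsample for speed

# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': 10 ** np.linspace(-1, 1, 10)}
grid = GridSearchCV(KernelDensity(), params, cv=3)
grid.fit(sub)
kde = grid.best_estimator_

x = np.arange(0.0, 40, 1)
y = np.arange(-24, -15, 0.2)
Xg, Yg = np.meshgrid(x, y)
XY = np.array([Xg.flatten(), Yg.flatten()]).transpose()

# score_samples returns log densities, so exponentiate before plotting
pdf = np.exp(kde.score_samples(XY)).reshape(Xg.shape)

# Use imshow to make a color density plot
plt.imshow(pdf, interpolation='none', cmap=plt.cm.jet, origin='lower',
           extent=[x.min(), x.max(), y.min(), y.max()], aspect='auto')
plt.xlabel('petroRad_r_kpc')
plt.ylabel('absPetroMag_r')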
There are a number of unsupervised learning algorithms that can produce a low-dimensional representation of a high-dimensional data set that preserves topological information, i.e., clustering. The state of the art is the t-distributed stochastic neighbor embedding (t-SNE).
Problem: Fit the data set with a t-SNE (this may take some time as the algorithm is computationally intensive).
Note: The `TSNE` class within the `sklearn.manifold` module uses a slightly different syntax from the other scikit-learn implementations: the call is `.fit_transform(X)`.
In [ ]:
from sklearn.manifold import TSNE
tsne = TSNE( # complete
# complete
plt.figure()
plt.scatter( # complete
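A possible completion; t-SNE scales poorly with the number of points, so this sketch re-reads the colour data (since `df` now holds the galaxy sample) and embeds a random subsample. The two embedded dimensions have no physical meaning; they just preserve local neighbourhood structure.
In [ ]:
colors = pd.read_csv("SDSS_colors.csv")
Xsub = colors.sample(2000, random_state=rs).values   # subsample: t-SNE is expensive

tsne = TSNE(n_components=2)
X_emb = tsne.fit_transform(Xsub)   # note the fit_transform call

plt.figure()
plt.scatter(X_emb[:, 0], X_emb[:, 1], s=4, alpha=0.5)
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')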
In [ ]: