In [ ]:
from __future__ import division, print_function, absolute_import
By Matthew J. Graham (California Institute of Technology)
Acknowledgements to A. A. Miller and G. Cabrera.
(c) 2017, California Institute of Technology
Version 0.1
Given a data set of point sources, a typical machine learning problem is how many different kinds of objects can we distinguish in it? This is best answered with unsupervised algorithms, which aim to identify hidden structures in data sets. Of course, interpretation of any such structures, i.e., whether they actually indicate physically meaningful correlations or relationships, is left to the domain expert.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
You're going to use the Python package scikit-learn to attack a number of problems which are best handled with unsupervised machine learning algorithms (models). In scikit-learn, different models are accessed via the various modules with user-defined tuning parameters. Data (features) for the models are stored in a 2D numpy array, `X`, with each row representing an individual object and each column a specific feature. Unsupervised models are fit by calling `.fit(X)`. Once a model has been fit to a data set, it can be used to generate predictions (e.g., cluster labels) for a new data set `Xnew` by calling `.predict(Xnew)`. Any other method calls are described in the documentation.
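As a minimal sketch of that calling pattern (using `KMeans` on synthetic data purely for illustration; the array shapes, not the numbers, are the point here):
In [ ]:
# A toy example of the fit/predict pattern: 100 objects with 2 features each
import numpy as np
from sklearn.cluster import KMeans
X = np.random.normal(size=(100, 2))      # feature matrix: one row per object
model = KMeans(n_clusters=2)             # choose the model and its tuning parameters
model.fit(X)                             # fit the (unlabelled) data
Xnew = np.random.normal(size=(10, 2))    # new observations with the same features
labels = model.predict(Xnew)             # assign each new object to a cluster
print(labels)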
You were introduced to the $k$-means clustering algorithm in the first DSFP session in August 2016 (and in the lecture after lunch today) and to its application to the gold-standard Iris dataset. We're going to revisit a bit of that material first, but move straight to an astronomical data set from SDSS: SDSS_colors.csv.
Problem: So let's start by loading the data set and plotting the features against each other to get a feel for what the data look like (you should always do this as a matter of course when getting a new, or not so new, data set).
In [ ]:
import pandas as pd
df = pd.read_csv("SDSS_colors.csv")
# complete
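One possible way to complete the cell above is with a scatter matrix of every colour against every other (this assumes a pandas version that provides `pd.plotting.scatter_matrix`):
In [ ]:
# Pairwise scatter plots of all features; the diagonal shows each feature's histogram
pd.plotting.scatter_matrix(df, figsize=(10, 10), s=2, alpha=0.2);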
It certainly appears as though there is structure within this data set, but the human brain is very good at finding patterns where there are none (see pareidolia).
Problem: Let's see what it looks like when you've fit it with two different $k$-means models, one with 2 clusters and one with 3 clusters. Plot the results in the $g-r$ vs. $u-g$ plane (and don't forget to also plot the cluster centers).
Hint - Remember that the `.labels_` attribute of the `KMeans` object returns the cluster labels assigned by the algorithm.
In [ ]:
from sklearn.cluster import KMeans
Kcluster = KMeans( # complete
# complete
plt.figure()
plt.scatter( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
Kcluster = KMeans( # complete
plt.figure()
plt.scatter( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
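One possible completion of the cell above (a sketch, not the definitive answer), assuming the first two columns of `df` are the $u-g$ and $g-r$ colours (check `df.columns` if yours are ordered differently):
In [ ]:
X = df.values   # feature matrix: one row per object, one column per colour

for n_clusters in (2, 3):
    Kcluster = KMeans(n_clusters=n_clusters)
    Kcluster.fit(X)

    plt.figure()
    plt.scatter(X[:, 0], X[:, 1], c=Kcluster.labels_, s=4, alpha=0.3)
    # Cluster centres, projected onto the first two features
    plt.scatter(Kcluster.cluster_centers_[:, 0], Kcluster.cluster_centers_[:, 1],
                marker='*', s=200, c='red')
    plt.xlabel('u - g')
    plt.ylabel('g - r')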
The scikit-learn
implementation of $k$-means has various optional parameters that can be used to fine tune the algorithm.
Problem: See how the results change if the $k=3$ cluster model is called with the `n_init = 1` and `init = 'random'` options instead. Use `rs` for the random state.
Note: the defaults for these parameters are 10 and `k-means++`, respectively. Refer to the documentation to see why they are better choices than what you are about to try.
In [ ]:
rs = 14
Kcluster1 = KMeans(# complete
plt.figure()
plt.scatter(# complete
plt.xlabel('u - g')
plt.ylabel('g - r')
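A possible completion, with `X = df.values` as in the earlier sketch:
In [ ]:
Kcluster1 = KMeans(n_clusters=3, n_init=1, init='random', random_state=rs)
Kcluster1.fit(X)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=Kcluster1.labels_, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')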
By default, $k$-means uses Euclidean distance as its similarity metric and the magnitude of individual features can therefore have a strong effect on the final clustering outcome.
Problem: Determine the mean, standard deviation, and minimum and maximum values of each feature in the data set. Do you think any one feature is particularly dominant?
In [ ]:
print("Feature\t\t\tmean\tstd\tmin\tmax")
for key in df.keys():
    print("{:s}\t{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}".format( # complete
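One way to complete the loop; `df.describe()` gives the same summary in a single call.
In [ ]:
print("Feature\t\t\tmean\tstd\tmin\tmax")
for key in df.keys():
    print("{:s}\t{:.2f}\t{:.2f}\t{:.2f}\t{:.2f}".format(
        key, df[key].mean(), df[key].std(), df[key].min(), df[key].max()))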
When the ranges of the features are vastly different, it is common practice to rescale each feature, either to the range [0, 1] (known as normalization) or to have zero mean and unit variance (known as standardization - this assumes a Gaussian distribution for the feature).
Problem: Try rescaling the data to see if it makes a difference to previous results.
Hint - The `StandardScaler()` class within the `sklearn.preprocessing` module may prove useful.
In [ ]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit_transform( # complete
Kcluster = KMeans( # complete
plt.figure()
plt.scatter(# complete
plt.xlabel('u - g')
plt.ylabel('g - r')
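A possible completion; note that `fit_transform` returns the rescaled feature array itself, which can be passed straight to `KMeans`:
In [ ]:
X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature

Kcluster = KMeans(n_clusters=3)
Kcluster.fit(X_scaled)

plt.figure()
# Plot the original colours, coloured by the clusters found in the rescaled space
plt.scatter(X[:, 0], X[:, 1], c=Kcluster.labels_, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')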
scikit-learn also has an alternative implementation of the $k$-means algorithm, `MiniBatchKMeans`, which is faster but at the expense of accuracy.
Problem: See what the performance differences between the two implementations are and also whether different results are obtained.
Hint - The IPython `%timeit` magic command is very good for performance evaluations.
In [ ]:
from sklearn.cluster import MiniBatchKMeans
%timeit MiniBatchKMeans( # complete
%timeit KMeans( # complete
# complete
plt.figure()
plt.scatter(# complete
plt.xlabel('u - g')
plt.ylabel('g - r')
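A sketch of one possible comparison (the exact timings will depend on your machine):
In [ ]:
%timeit MiniBatchKMeans(n_clusters=3).fit(X)
%timeit KMeans(n_clusters=3).fit(X)

mbk = MiniBatchKMeans(n_clusters=3)
mbk.fit(X)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=mbk.labels_, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')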
One of the issues with the $k$-means algorithm is that the number of clusters, $k$, is a free parameter that has to be chosen in advance. There are a number of ways of determining it for a given data set. One of them is to calculate the mean silhouette score for (a subset of) the data for a variety of $k$ values and pick the $k$ whose score is closest to 1.
Problem: Calculate the mean silhouette value for $k$ in the range 2 to 10 and select the most appropriate one.
Hint - The `silhouette_score()` function within the `metrics` module may prove useful.
Note: This will take some time to run
In [ ]:
from sklearn import metrics
# Select a random subset of the data
N_small = 100
dfSample = df[ # complete
ss = []
nn = [2,3,5,10]
for n in # complete
    Kcluster = KMeans( # complete
    ss.append(metrics.silhouette_score( # complete
plt.figure()
plt.scatter(# complete
plt.xlabel('k')
plt.ylabel('Mean silhouette')
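A possible completion, drawing a random subset of the rows with pandas' `sample` method to keep the run time manageable, and reusing `rs` as the random seed:
In [ ]:
dfSample = df.sample(N_small, random_state=rs)   # random subset of N_small objects
Xs = dfSample.values

ss = []
nn = [2, 3, 5, 10]
for n in nn:
    Kcluster = KMeans(n_clusters=n).fit(Xs)
    ss.append(metrics.silhouette_score(Xs, Kcluster.labels_))

plt.figure()
plt.scatter(nn, ss)
plt.xlabel('k')
plt.ylabel('Mean silhouette')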
DBSCAN is a highly popular clustering algorithm that does not require the number of clusters to be specified. Instead, it defines clusters according to two parameters: `minPts`, the minimum number of points necessary for a cluster, and $\epsilon$, a distance measure. scikit-learn has an implementation of the algorithm as part of the `cluster` module: `DBSCAN`. $\epsilon$ and `minPts` are set by the implementation parameters `eps` and `min_samples`, respectively.
Problem: Cluster the SDSS color data with DBSCAN and see what effect changing the fine tuning parameters has on the results. How does DBSCAN compare to $k$-means?
Note: DBSCAN labels outliers as -1 and so `plt.scatter()` will plot all these points as the same color.
In [ ]:
from sklearn.cluster import DBSCAN
dbs = DBSCAN( # complete
dbs.fit( # complete
# Class numbers
labels = dbs.labels_
unique_labels = np.unique(labels)
count = [(labels == lab).sum() for lab in unique_labels]
plt.figure()
for i in range(len(unique_labels)):
    plt.plot( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
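A possible completion; the `eps` and `min_samples` values below are just a starting guess, so vary them and watch how the clusters and the outlier fraction change:
In [ ]:
dbs = DBSCAN(eps=0.1, min_samples=10)   # starting guesses, not tuned values
dbs.fit(X)

labels = dbs.labels_
unique_labels = np.unique(labels)

plt.figure()
for lab in unique_labels:
    mask = (labels == lab)
    # label -1 is the set of points DBSCAN considers noise
    plt.plot(X[mask, 0], X[mask, 1], '.', ms=2, alpha=0.3,
             label='{} ({} points)'.format(lab, mask.sum()))
plt.legend()
plt.xlabel('u - g')
plt.ylabel('g - r')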
A recommended approach for fine tuning `minPts` and $\epsilon$ is to set `minPts` to (at least) the dimension of the feature set and plot a $k$-distance graph for the data with $k$ = `minPts`. The appropriate value for $\epsilon$ is indicated by an elbow in the diagram.
Problem: Try this for the data set.
In [ ]:
from sklearn.neighbors import NearestNeighbors
N_neigh = # complete
radius = # complete
neigh = NearestNeighbors( # complete
# complete
dist, ind = neigh.kneighbors( # complete
plt.figure()
plt.scatter( # complete
plt.xlabel('Points sorted by distance to %sth nearest neighbor' % N_neigh)
plt.ylabel('%sth nearest neighbor distance' % N_neigh)
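A possible completion: keep the distance to the `N_neigh`-th neighbour for every point, sort those distances, and look for the elbow. Setting `N_neigh` to the number of features follows the heuristic above.
In [ ]:
N_neigh = X.shape[1]             # (at least) the dimensionality of the feature set
neigh = NearestNeighbors(n_neighbors=N_neigh)
neigh.fit(X)
dist, ind = neigh.kneighbors(X)  # kneighbors returns (distances, indices)

kdist = np.sort(dist[:, -1])     # distance to the N_neigh-th neighbour, sorted

plt.figure()
plt.scatter(np.arange(len(kdist)), kdist, s=1)
plt.xlabel('Points sorted by distance to %sth nearest neighbor' % N_neigh)
plt.ylabel('%sth nearest neighbor distance' % N_neigh)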
Gaussian mixture models (GMMs) are a popular choice for clustering data as well as density estimation.
Problem: Fit a Gaussian mixture model to the data (assume a certain number of Gaussian components). Try different numbers of components.
Hint: The `GaussianMixture` class may prove useful.
In [ ]:
from sklearn.mixture import GaussianMixture
clf = GaussianMixture(n_components = # complete
clf.fit( # complete
plt.figure()
plt.scatter( # complete
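A possible completion with three components (try other values); `predict` assigns each object to its most probable Gaussian component:
In [ ]:
clf = GaussianMixture(n_components=3)
clf.fit(X)
labels = clf.predict(X)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=labels, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')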
Hierarchical clustering algorithms are an alternative to partitional clustering algorithms such as $k$-means and DBSCAN.
Problem: Fit a hierarchical clustering algorithm to the data. How does it compare with $k$-means or DBSCAN?
Hint: The `AgglomerativeClustering` class may prove useful.
In [ ]:
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering( # complete
agg.fit_predict( # complete
plt.figure()
plt.scatter( # complete
plt.scatter( # complete
plt.xlabel('u - g')
plt.ylabel('g - r')
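A possible completion; `fit_predict` returns the cluster labels directly. Agglomerative clustering scales poorly with sample size, so this sketch runs on a random subsample (adjust its size to your machine and the size of the file):
In [ ]:
Xsub = df.sample(5000, random_state=rs).values   # subsample: the full tree is expensive to build

agg = AgglomerativeClustering(n_clusters=3)
labels = agg.fit_predict(Xsub)

plt.figure()
plt.scatter(Xsub[:, 0], Xsub[:, 1], c=labels, s=4, alpha=0.3)
plt.xlabel('u - g')
plt.ylabel('g - r')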
It's often unclear from simple scatter plots what the probability density function (pdf) for a data set actually looks like; density estimation algorithms attempt to recover it. Many of the basic aspects of density estimation were covered in the Visualization talk from Session 1 (Day 3) so we're going to focus more on some operational details.
By far the simplest density estimation algorithm is the histogram and several prescriptions exist for the optimal bin width/number of bins to use.
Problem: Using the SDSS galaxy data set, SDSS_gals.csv, show the redshift distribution with Scott's rule, the Freedman-Diaconis rule, the Knuth rule, and Bayesian blocks.
Hint - The `hist` function in the `astroML.plotting` module may prove useful for the Knuth rule and Bayesian blocks. `astroML` accompanies the book by Ivezic, Connolly, Gray and VanderPlas and provides additional machine learning material for astronomy.
In [ ]:
from astroML.plotting import hist
df = pd.read_csv("SDSS_gals.csv")
z = df # complete
fig, (ax0, ax1, ax2, ax3) = plt.subplots(1, 4, figsize = (17,5))
hist(z # complete
hist(z # complete
hist(z # complete
hist(z # complete
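A possible completion, assuming the redshift column in SDSS_gals.csv is called `z` (check `df.columns`) and that your astroML version accepts the bin-rule strings used below:
In [ ]:
df = pd.read_csv("SDSS_gals.csv")
z = df['z']   # assumed column name for redshift

fig, (ax0, ax1, ax2, ax3) = plt.subplots(1, 4, figsize=(17, 5))
hist(z, bins='scott', ax=ax0)
ax0.set_title("Scott's rule")
hist(z, bins='freedman', ax=ax1)
ax1.set_title('Freedman-Diaconis rule')
hist(z, bins='knuth', ax=ax2)
ax2.set_title('Knuth rule')
hist(z, bins='blocks', ax=ax3)
ax3.set_title('Bayesian blocks')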
Kernel density estimation fits a kernel to each data point and then estimates the pdf from the sum of all of these.
Problem: Fit the data with a standard KDE. How does it compare against the histograms?
Hint - The `KernelDensity` class in the `sklearn.neighbors` module may be useful.
In [ ]:
from sklearn.neighbors import KernelDensity
kde = KernelDensity( # complete
# complete
plt.figure()
plt.scatter( # complete
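A possible completion with a Gaussian kernel and a guessed bandwidth (the next problem shows how to choose it properly). `KernelDensity` expects a 2D array, and `score_samples` returns log densities:
In [ ]:
zv = np.asarray(z)[:, np.newaxis]                        # KernelDensity wants shape (N, 1)

kde = KernelDensity(kernel='gaussian', bandwidth=0.01)   # bandwidth is just a guess here
kde.fit(zv)

z_grid = np.linspace(z.min(), z.max(), 1000)
log_dens = kde.score_samples(z_grid[:, np.newaxis])

plt.figure()
plt.plot(z_grid, np.exp(log_dens))
plt.xlabel('z')
plt.ylabel('p(z)')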
The optimal bandwidth parameter, $h$, can be determined by cross-validation.
Problem: Try it for the data set over $\log(h)$ in the range [-3, 0]:
Hint - The `GridSearchCV` class in the `sklearn.grid_search` module may be useful.
In [ ]:
from sklearn.grid_search import GridSearchCV
# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': # complete
grid = GridSearchCV( # complete
# complete
# Extract the scores and plot them
# complete
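A possible completion. In recent scikit-learn versions `GridSearchCV` has moved to `sklearn.model_selection`, so adjust the import to your installation; the sketch also cross-validates on a random subsample of the redshifts to keep the run time down.
In [ ]:
# use grid search cross-validation to optimize the bandwidth
sample = z.sample(5000, random_state=0).values[:, np.newaxis]   # subsample for speed

params = {'bandwidth': 10 ** np.linspace(-3, 0, 20)}
grid = GridSearchCV(KernelDensity(), params, cv=3)
grid.fit(sample)
print("Best bandwidth: {:.4f}".format(grid.best_params_['bandwidth']))

# Extract the mean cross-validation score for each bandwidth and plot them
# (older scikit-learn versions expose grid.grid_scores_ instead of cv_results_)
plt.figure()
plt.semilogx(params['bandwidth'], grid.cv_results_['mean_test_score'])
plt.xlabel('h')
plt.ylabel('Mean CV log-likelihood')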
So far we've just been looking at one dimension but now let's consider a two-dimensional case: the density of galaxies in terms of their radius (`petroRad_r_kpc`) and magnitude (`absPetroMag_r`).
Problem: Repeat finding the optimal bandwidth via cross-validation and plot the resulting pdf.
Note: The `score_samples` method returns the $\log$ of the density.
In [ ]:
data = # complete
# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': # complete
grid = GridSearchCV( # complete
# complete
x = np.arange(.0, 40, 1)
y = np.arange(-24, -15, 0.2)
X, Y = np.meshgrid(x, y)
XY = np.array([X.flatten(), Y.flatten()]).transpose()
pdf = # complete
# Use imshow to make a color density plot
plt.imshow( # complete, interpolation='none', cmap=plt.cm.jet, origin='lower', clip_on=True)
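One possible completion of the cell above (the bandwidth grid and the subsample size are guesses to tune):
In [ ]:
sub = df[['petroRad_r_kpc', 'absPetroMag_r']].sample(5000, random_state=0).values   # subsample for speed

# use grid search cross-validation to optimize the bandwidth
params = {'bandwidth': 10 ** np.linspace(-1, 1, 10)}
grid = GridSearchCV(KernelDensity(), params, cv=3)
grid.fit(sub)
kde = grid.best_estimator_

x = np.arange(0.0, 40, 1)
y = np.arange(-24, -15, 0.2)
Xg, Yg = np.meshgrid(x, y)
XY = np.array([Xg.flatten(), Yg.flatten()]).transpose()

# score_samples returns log densities, so exponentiate before plotting
pdf = np.exp(kde.score_samples(XY)).reshape(Xg.shape)

# Use imshow to make a color density plot
plt.imshow(pdf, interpolation='none', cmap=plt.cm.jet, origin='lower',
           extent=[x.min(), x.max(), y.min(), y.max()], aspect='auto')
plt.xlabel('petroRad_r_kpc')
plt.ylabel('absPetroMag_r')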
There are a number of unsupervised learning algorithms that can produce a low-dimensional representation of a high-dimensional data set that preserves topological information, i.e., clustering. The state of the art is the t-distributed stochastic neighbor embedding (t-SNE).
Problem: Fit the data set with a t-SNE (this may take some time as the algorithm is computationally intensive).
Note: The `TSNE` class within the `sklearn.manifold` module uses a slightly different syntax from the other scikit-learn implementations: the call is `.fit_transform(X)`.
In [ ]:
from sklearn.manifold import TSNE
tsne = TSNE( # complete
# complete
plt.figure()
plt.scatter( # complete
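A possible completion; t-SNE scales poorly with the number of points, so this sketch re-reads the colour data (since `df` now holds the galaxy sample) and embeds a random subsample. The two embedded dimensions have no physical meaning; they just preserve local neighbourhood structure.
In [ ]:
colors = pd.read_csv("SDSS_colors.csv")
Xsub = colors.sample(2000, random_state=rs).values   # subsample: t-SNE is expensive

tsne = TSNE(n_components=2)
X_emb = tsne.fit_transform(Xsub)   # note the fit_transform call

plt.figure()
plt.scatter(X_emb[:, 0], X_emb[:, 1], s=4, alpha=0.5)
plt.xlabel('t-SNE dimension 1')
plt.ylabel('t-SNE dimension 2')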
In [ ]: