The objective of this hands-on activity is to cluster a set of candidate sources from the Zwicky Transient Facility's (ZTF) image subtraction pipeline. All candidate features and postage stamps were extracted from ZTF's public alert stream.
The goal of this exercise is to become familiar with the ZTF data, examine some of its features, and run sklearn's KMeans algorithm on two or more features.
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import glob
import os
from time import time
from matplotlib.pyplot import imshow
from matplotlib.image import imread
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn import metrics
from sklearn.metrics.pairwise import euclidean_distances
You will need the following files:
You will need to unzip and unpack this last file (a "tarball") called dsfp_ztf_png_stamps.tar.gz. Run the following commands in the same directory as this notebook to unpack everything (note - some operating systems automatically unzip downloaded files):
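For reference, here is a minimal unpacking sketch using Python's tarfile module (it assumes the tarball was downloaded into the same directory as this notebook; the equivalent shell command would be tar -xzf dsfp_ztf_png_stamps.tar.gz):
import tarfile

# Extract the stamp cutouts into the current working directory.
with tarfile.open('dsfp_ztf_png_stamps.tar.gz', 'r:gz') as tf:
    tf.extractall()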
You should now have a directory in your current working directory (cwd) called dsfp_ztf_png_stamps.
Please specify the following file locations:
In [ ]:
F_META = # complete
F_FEATS = # complete
D_STAMPS = # complete
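# For example (a sketch only -- the .npy filenames below are hypothetical
# placeholders; point them at the metadata and features files you downloaded):
# F_META = 'dsfp_ztf_meta.npy'        # hypothetical filename
# F_FEATS = 'dsfp_ztf_feats.npy'      # hypothetical filename
# D_STAMPS = 'dsfp_ztf_png_stamps'    # directory unpacked from the tarball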
We are ready to get started! :) Start by loading the data and confirming that feats has the same number of columns as COL_NAMES. Note that the last column is a class label with values {0, 1}, where 0 = bogus and 1 = real. Today we are doing unsupervised learning, but some clustering evaluation methods use labels to quantitatively measure the quality of the clustering result.
In [ ]:
meta = np.load(F_META)
feats = np.load(F_FEATS)
COL_NAMES = ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap',
'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky',
'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio',
'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg',
'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig',
'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']
# INSTRUCTION: Verify that feats has the same number of columns as COL_NAMES
#
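# One way to verify the shapes (a sketch):
assert feats.shape[1] == len(COL_NAMES), 'feats and COL_NAMES disagree on the number of columns'
print(feats.shape, len(COL_NAMES))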
We will perform K-means clustering using two features: 'chipsf' and 'elong'. chipsf is the chi-square goodness-of-fit of the PSF-fit photometry: the higher the chi value, the more uncertain the source's PSF fit. elong is a measure of how elongated the source is. A transient point source should have a roughly circular point spread function, so an elongated detection may be a sign of a problem with the image subtraction.
Extract the features chipsf and elong from the data. Scatter plot them against each other, and also plot their histograms.
In [ ]:
featnames_to_select = ['chipsf', 'elong']
# Extract the Correct Features
#
featidxs_to_select_indices = [ COL_NAMES.index(x) for x in featnames_to_select]
feats_selected = feats[:,featidxs_to_select_indices]
# Scatter Plot the Two Features
#
def plot_scatter(dat, xlabel, ylabel, xscale='linear', yscale='linear'):
    plt.plot(dat[:, 0], dat[:, 1], 'k.')
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xscale(xscale)
    plt.yscale(yscale)
    plt.show()
# Histogram a Single Feature
#
def plot_histogram(dat, bins, title, xscale='linear', yscale='linear'):
    plt.hist(dat, bins)
    plt.xscale(xscale)
    plt.yscale(yscale)
    plt.title(title)
    plt.show()
# INSTRUCTION: Scatter Plot the Data
#
# INSTRUCTION: Plot the Histograms for both features. Hint, it may be helpful to plot some features on a log scale.
#
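# A possible solution sketch (the log scale for chipsf is a judgment call;
# adjust the bin count to taste):
plot_scatter(feats_selected, 'chipsf', 'elong', xscale='log')
plot_histogram(feats_selected[:, 0], 100, 'chipsf', xscale='log')
plot_histogram(feats_selected[:, 1], 100, 'elong')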
We rarely ever cluster only two features from a dataset. However, the advantage of doing so is that we can readily visualize two-dimensional data. Let's start off by clustering features elong and chipsf with KMeans. The plotKMeans function below implements a visualization of KMean's partitioning that was used in sklearn's KMean's demo.
In [ ]:
def runKMeans(dat, n_clusters=2, seed=0):
    return KMeans(n_clusters, random_state=seed).fit(dat)

def plotKMeans(kmeans_res, reduced_dat, xlabel, ylabel, xscale='linear', yscale='linear'):
    # Plot the decision boundary by assigning a color to each point in the
    # mesh [x_min, x_max] x [y_min, y_max].
    h = .02  # step size of the mesh
    x_min, x_max = reduced_dat[:, 0].min() - 1, reduced_dat[:, 0].max() + 1
    y_min, y_max = reduced_dat[:, 1].min() - 1, reduced_dat[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Obtain labels for each point in the mesh using the trained model.
    Z = kmeans_res.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot.
    Z = Z.reshape(xx.shape)
    plt.imshow(Z, interpolation='nearest',
               extent=(xx.min(), xx.max(), yy.min(), yy.max()),
               cmap=plt.cm.Paired,
               aspect='auto', origin='lower')

    # Overlay the data points and mark the cluster centers with white crosses.
    plt.plot(reduced_dat[:, 0], reduced_dat[:, 1], 'k.')
    plt.scatter(kmeans_res.cluster_centers_[:, 0], kmeans_res.cluster_centers_[:, 1],
                marker='x', s=169, linewidths=3,
                color='w', zorder=10)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xscale(xscale)
    plt.yscale(yscale)
    plt.show()
# INSTRUCTION: Use the runKMeans and plotKMeans functions to cluster the data (feats_selected)
# with several values of k.
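# A possible solution sketch: try a few values of k on the raw (unscaled)
# features. Note that with unscaled chipsf the prediction mesh in plotKMeans
# can get large, so this step may be slow.
for k in (2, 3, 4):
    km = runKMeans(feats_selected, n_clusters=k)
    plotKMeans(km, feats_selected, 'chipsf', 'elong')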
We just discovered that distance metrics can be sensitive to the scale of your data (e.g., some features span large numeric ranges while others don't). For machine learning methods that calculate similarity between feature vectors, it is important to normalize the data, either to a standard range such as (0, 1) or with z-score normalization (scaling to zero mean and unit variance). Fortunately, sklearn makes this quite easy. Please review sklearn's preprocessing module, specifically MinMaxScaler and StandardScaler (the latter corresponds to z-score normalization), and apply one of them.
After your data has been scaled, scatter plot the rescaled features and run KMeans on the transformed data. Compare the results on the transformed data with those above.
In [ ]:
# INSTRUCTION: Re-scale your data using either the MinMaxScaler or StandardScaler from sklearn
#
# INSTRUCTION: Scatter plot your rescaled data
#
# INSTRUCTION: Retry KMeans with the same values of k used above.
#
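# A possible solution sketch using z-score normalization (MinMaxScaler would
# work similarly; the choice of scaler and of k values is an assumption):
scaler = StandardScaler()
feats_selected_scaled = scaler.fit_transform(feats_selected)

plot_scatter(feats_selected_scaled, 'chipsf (scaled)', 'elong (scaled)')

for k in (2, 3, 4):
    km_scaled = runKMeans(feats_selected_scaled, n_clusters=k)
    plotKMeans(km_scaled, feats_selected_scaled, 'chipsf (scaled)', 'elong (scaled)')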
So far, we've been verifying our clusters visually. Let's use quantitative methods to evaluate our results.
The silhouette score does not require ground-truth labels.
The homogeneity, completeness, V-measure, adjusted Rand index (ARI), and adjusted mutual information (AMI) scores do require labels; they are documented in sklearn's clustering metrics module.
In [ ]:
sample_size = 300
def bench_k_means(estimator, name, data, labels):
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))
labels = feats[:,-1]
print(82 * '_')
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')
# INSTRUCTIONS: Use the bench_k_means method to compare your clustering results
#
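# One possible comparison (a sketch; assumes feats_selected and
# feats_selected_scaled exist from the previous steps):
for k in (2, 3, 4):
    bench_k_means(KMeans(n_clusters=k, random_state=0),
                  'raw k=%d' % k, feats_selected, labels)
    bench_k_means(KMeans(n_clusters=k, random_state=0),
                  'scaled k=%d' % k, feats_selected_scaled, labels)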
It can be tempting to let yourself be guided by metrics alone, and the metrics are useful guideposts for determining whether you're moving in the right direction. However, the goal of clustering is to reveal structure in your dataset. Fortunately, because the features were measured on sources detected in images, we can view the cutout of each source to visually verify whether our clusters contain homogeneous objects.
The display methods below give you an opportunity to display random candidates from each cluster, or the candidates that are closest to the cluster center.
In [ ]:
def display_stamps(candids, fig_title):
    # display five across
    num_per_row = 5
    for i, candid in enumerate(candids):
        # there should only be one file returned!
        f_stamp = glob.glob(os.path.join(D_STAMPS, 'candid{}*.png'.format(candid)))[0]
        if (i % num_per_row) == 0:
            fig = plt.figure(figsize=(18, 3))
            fig.suptitle(fig_title)
        ax = fig.add_subplot(1, num_per_row, i % num_per_row + 1)
        ax.set_axis_off()
        ax.set_title(candid)
        stamp = imread(f_stamp)
        imshow(stamp)
    return
def closest_to_centroid(centroid, cluster_feats, cluster_candids, num_to_display=10):
    dists = euclidean_distances(cluster_feats, centroid.reshape(1, -1))[:, 0]
    closest_indices = np.argsort(dists)[:num_to_display]
    return cluster_candids[closest_indices]

def show_cluster_stamps(kmeans_res, displayMode='near_centroid', num_to_display=10):
    # Display stamps from each cluster: either those nearest the cluster
    # centroid ('near_centroid') or a random selection ('random').
    # Relies on meta and feats_selected_scaled defined above.
    for i in range(kmeans_res.n_clusters):
        centroid = kmeans_res.cluster_centers_[i, :]
        mask = kmeans_res.labels_ == i
        cluster_candids = meta[mask]['candid']
        cluster_feats = feats_selected_scaled[mask]
        if displayMode == 'near_centroid':
            selected_candids = closest_to_centroid(centroid, cluster_feats, cluster_candids, num_to_display)
        if displayMode == 'random':
            np.random.shuffle(cluster_candids)
            selected_candids = cluster_candids[:num_to_display]
        display_stamps(selected_candids, 'Cluster {}'.format(i))
# INSTRUCTION: Use the show_cluster_stamps method to display cutouts associated with each cluster.
# Do you see similar objects in each cluster?
#
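# A possible solution sketch (assumes km_scaled is a KMeans result on the
# scaled two-feature data from the steps above):
show_cluster_stamps(km_scaled, displayMode='near_centroid')
show_cluster_stamps(km_scaled, displayMode='random')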
Given the tools seen above, start clustering more than two features at a time. This work is free-form; I'll start you off with some suggested features. After plotting the feature distributions, you may choose to down-select further.
Because we're now working with more than two features, use PCA to project the feature space onto its first two principal components. You can then use the methods above to run KMeans in that reduced feature space and evaluate your results.
In [ ]:
featnames_to_select = ['chipsf', 'elong', 'diffmaglim', 'magpsf', 'sigmapsf',
                       'magap', 'sigmagap', 'sky', 'magdiff', 'fwhm',
                       'mindtoedge', 'magfromlim', 'seeratio', 'aimage', 'bimage',
                       'aimagerat', 'bimagerat', 'nneg', 'nbad', 'sumrat', 'magapbig', 'sigmagapbig']
# INSTRUCTION: Visualize these features. Discard any you consider to be problematic.
# INSTRUCTION: Filter the feature space
# INSTRUCTION: Run PCA on this feature space to reduce it to 2 principal components
# INSTRUCTION: Run KMeans on this 2-dimensional PCA space, and evaluate your results both quantitatively and qualitatively.
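# A possible solution sketch. The retained features, the scaler, and k are
# all assumptions; adjust them based on your feature plots above.
from sklearn.decomposition import PCA

idxs = [COL_NAMES.index(name) for name in featnames_to_select]
X = StandardScaler().fit_transform(feats[:, idxs])

# Project onto the first two principal components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print('explained variance ratios:', pca.explained_variance_ratio_)

# Cluster and evaluate in the reduced space.
km_pca = runKMeans(X_pca, n_clusters=3)
plotKMeans(km_pca, X_pca, 'PC 1', 'PC 2')
bench_k_means(KMeans(n_clusters=3, random_state=0), 'pca k=3', X_pca, labels)

# For a qualitative check, you could adapt show_cluster_stamps (which currently
# reads the global feats_selected_scaled) to use X_pca instead.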