# K-means with SDSS data

Machine learning exercise by Group 1 at AstroHackWeek 2017, Day 1.

First, we blatantly copy some of the code from the demo-SDSS notebook...

``````

In [ ]:

from os import path
from astropy.table import Table
import h5py
import matplotlib.pyplot as plt
#plt.style.use('notebook.mplstyle')
%matplotlib inline
import numpy as np
from sklearn.cluster import KMeans

``````
``````

In [ ]:

data_path = '/Users/Meredith/Astronomy/astrohack/ahw2017-ml-data/'  # specific to my computer

``````
``````

In [ ]:

path='photoPosPlate')

``````
``````

In [ ]:

len(photoPos)

``````

Pull color information out of the photoPosPlate data file for u-g, g-r, r-i, and i-z colors

``````

In [ ]:

# 01234 = ugriz filters
u_g = photoPos['PSFMAG'][:,0] - photoPos['PSFMAG'][:,1]
g_r = photoPos['PSFMAG'][:,1] - photoPos['PSFMAG'][:,2]
r_i = photoPos['PSFMAG'][:,2] - photoPos['PSFMAG'][:,3]
i_z = photoPos['PSFMAG'][:,3] - photoPos['PSFMAG'][:,4]

``````

Let's take a look at how the spectral data was classified by SDSS into galaxies, QSOs, or stars

``````

In [ ]:

path='specObj')
spec_class = specObj['CLASS'].astype(str)
spec_classes = np.unique(spec_class)
for cls in spec_classes:
print(cls, (spec_class == cls).sum())

``````
``````

In [ ]:

fig, axes = plt.subplots(1, len(spec_classes), figsize=(12.5,5),
sharex=True, sharey=True)

for i, cls in enumerate(spec_classes):
axes[i].plot(g_r[spec_class == cls], r_i[spec_class == cls],
marker='.', linestyle='none', alpha=0.1)
axes[i].set_title(cls)
axes[i].set_xlabel('\$g-r\$ [mag]')

axes.set_xlim(-0.5, 2.5)
axes.set_ylim(-1, 2)

axes.set_ylabel('\$r-i\$ [mag]')

fig.tight_layout()

``````

Hopefully our K-means clustering will show us that the dataset breaks into somewhat similarly-shaped pieces in color-color space.

## Running K-means

To get an idea for how our data for K-means should be structured, we refer to the example at http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

``````

In [ ]:

X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

``````
``````

In [ ]:

X.shape  # (number of data points X number of things per data point)

``````
``````

In [ ]:

colors = np.array([u_g, g_r, r_i, i_z]).T  # put into the same shape as X

``````
``````

In [ ]:

colors.shape

``````
``````

In [ ]:

n_clusters = 8  # the number of clusters to use

``````
``````

In [ ]:

kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(colors)  # run the K-means analysis

``````
``````

In [ ]:

print(kmeans.labels_)  # the label from 0 to n_clusters assigned to each point

``````
``````

In [ ]:

print(kmeans.cluster_centers_)

``````
``````

In [ ]:

# make a new plot for each cluster center
for k in range(n_clusters):
plt.figure(figsize=(5,5))
idx = (kmeans.labels_ == k)
plt.scatter(g_r[idx], r_i[idx], alpha=0.1, marker='.')
plt.xlabel('\$g - r\$ [mag]')
plt.ylabel('\$r - i\$ [mag]')
plt.xlim([-0.5, 2.5])
plt.ylim([-1, 2])
plt.title('cluster label ' + str(k))

``````

At a glance, it looks like clusters 0, 4, and 6 are mostly galaxies, clusters 1 and 2 are weird outliers, cluster 3 is QSOs (plus some stellar contamination?), and clusters 5 and 7 are mostly stars. We could almost certainly refine this better given more time.

## Troubleshooting the outliers in clusters 1 and 2

Get the indices corresponding to each K-means label 0, 1, and 2 for comparison

``````

In [ ]:

zeroidx = np.where((kmeans.labels_ == 0))
oneidx = np.where((kmeans.labels_ == 1))
twoidx = np.where((kmeans.labels_ == 2))

``````
``````

In [ ]:

print(len(zeroidx))
print(len(oneidx))
print(len(twoidx))

``````
``````

In [ ]:

plt.figure(figsize=(5,5))
plt.plot(g_r, r_i, alpha=0.1, ls='None', marker='.')  # full dataset
plt.plot(g_r[oneidx], r_i[oneidx], ls='None', marker='o', mec='k')  # problem outlier 1
plt.plot(g_r[twoidx], r_i[twoidx], ls='None', marker='o', mec='k')  # problem outlier 2
plt.xlabel('\$g - r\$ [mag]')
plt.ylabel('\$r - i\$ [mag]')
plt.xlim([-0.5, 2.5])
plt.ylim([-1, 2])

``````

The two outlier points highlighted on top of the full dataset.

``````

In [ ]:

``````