An Astronomical Application of Machine Learning:

Separating Stars and Galaxies from SDSS

Version 0.1

By AA Miller 2018 Nov 06

The problems in the following notebook develop an end-to-end machine learning model using actual astronomical data to separate stars and galaxies. There are 5 steps in this machine learning workflow:

  1. Data Preparation
  2. Model Building
  3. Model Evaluation
  4. Model Optimization
  5. Model Predictions

The data come from the Sloan Digital Sky Survey (SDSS), an imaging survey that has several similarities to LSST (though the telescope was significantly smaller and the survey did not cover as large an area).

Science background: Many (nearly all?) of the science applications for LSST data will rely on the accurate separation of stars and galaxies in the LSST imaging data. As an example, imagine measuring the structure of the Milky Way without knowing which sources are galaxies and which are stars.

During this exercise, we will utilize supervised machine learning methods to separate extended sources (galaxies) and point sources (stars) in imaging data. These methods are highly flexible, and as a result can classify sources at higher fidelity than methods that simply make cuts in a low-dimensional space.


In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Problem 1) Examine the Training Data

For this problem the training set, i.e. sources with known labels, includes stars and galaxies that have been confirmed with spectroscopic observations. The machine learning model is needed because there are $\gg 10^8$ sources with photometric observations in SDSS, and only $4 \times 10^6$ sources with spectroscopic observations. The model will allow us to translate our knowledge from the spectroscopic observations to the entire data set. The features include each $r$-band magnitude measurement made by SDSS (don't worry if you don't know what this means...). This yields 8 features to train the models (significantly fewer than the 454 properties measured for each source in SDSS).

If you are curious (and it is fine if you are not) this training set was constructed by running the following query on the SDSS database:

SELECT TOP 20000
p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r, 
p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r, 
s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
ORDER BY p.objid ASC
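
If you would like to retrieve the data yourself, the query above can be run directly from Python. A minimal sketch using the astroquery package (an assumption - astroquery is not required anywhere else in this notebook):

In [ ]:
# optional: run the SDSS query directly (assumes astroquery is installed)
from astroquery.sdss import SDSS

query = """SELECT TOP 20000
           p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r,
           p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r,
           s.class
           FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
           WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
           ORDER BY p.objid ASC"""

sdss_table = SDSS.query_sql(query)  # returns an astropy Table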

First download the training set and the blind test set for this problem.

Problem 1a

Visualize the training set data. The data have 8 features ['psfMag_r', 'fiberMag_r', 'fiber2Mag_r', 'petroMag_r', 'deVMag_r', 'expMag_r', 'modelMag_r', 'cModelMag_r'], and a 9th column ['class'] corresponding to the labels ('STAR' or 'GALAXY' in this case).

Hint - just execute the cell below.


In [ ]:
sdss_df = pd.read_hdf("sdss_training_set.h5")
sns.pairplot(sdss_df, hue = 'class', diag_kind = 'hist')

Problem 1b

Based on your plots of the data, which feature do you think will be the most important for separating stars and galaxies? Why?

write your answer here - do not change it after completing the problem below

The final data preparation step is to create an independent test set to evaluate the generalization error of the final tuned model. Independent test sets are generated by withholding a fraction of the training set. No hard and fast rules apply for the fraction to be withheld, though typical choices vary between $\sim 0.2$ and $0.5$.

sklearn.model_selection has a useful helper function train_test_split.

Problem 1c Split the 20k spectroscopic sources 70-30 into training and test sets. Save the results in arrays called: train_X, train_y, test_X, test_y, respectively. Use rs for the random_state in train_test_split.

Hint - recall that sklearn utilizes X, a 2D np.array, and y for the features and labels, respectively.


In [ ]:
from sklearn.model_selection import train_test_split
rs = 1851

feats = ['psfMag_r', 'fiberMag_r', 'fiber2Mag_r', 'petroMag_r',
         'deVMag_r', 'expMag_r', 'modelMag_r', 'cModelMag_r']

X = sdss_df[feats].values
y = sdss_df['class'].values

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=rs)

We will now ignore everything in the test set until we have fully optimized the machine learning model.

Problem 2) Model Building

After curating the data, you must select a specific machine learning algorithm. With experience, it is possible to develop intuition for the best ML algorithm given a specific problem.

Short of that? Try two (or three, or four, or five) different models and choose whichever works the best.

Problem 2a

Train a $k$-nearest neighbors model on the star-galaxy training set. Select $k$ = 25 for this model.

Hint - the KNeighborsClassifier object in the sklearn.neighbors module may be useful for this task.


In [ ]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=25)
knn_clf.fit(train_X, train_y)

Problem 2b

Train a Random Forest (RF) model (Breiman 2001) on the training set. Include 50 trees in the forest using the n_estimators parameter. Again, set random_state = rs.

Hint - use the RandomForestClassifier object from the sklearn.ensemble module. Also - be sure to set n_jobs = -1 in every call of RandomForestClassifier.


In [ ]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=50, random_state=rs, n_jobs=-1)
rf_clf.fit(train_X, train_y)

A nice property of RF, relative to $k$NN, is that RF naturally provides an estimate of the most important features in a model.

One way to measure RF feature importance is to randomly shuffle the values of a particular feature and measure the decrease in the model's overall accuracy (a procedure known as permutation importance). The .feature_importances_ attribute of the RandomForestClassifier() object instead reports a closely related, impurity-based importance that is computed during training. Either way, the higher the value, the more important the feature.
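
If you want the shuffle-based importance described above, newer versions of scikit-learn (>= 0.22, released after this notebook) provide it directly. A minimal sketch, assuming rf_clf has been trained as in 2b:

In [ ]:
# permutation importance: shuffle each feature n_repeats times and
# record the mean drop in accuracy (requires scikit-learn >= 0.22)
from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(rf_clf, train_X, train_y,
                                  n_repeats=10, random_state=rs, n_jobs=-1)
print(perm_imp.importances_mean)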

Problem 2c

Calculate the relative importance of each feature.

Which feature is most important? Does this match your answer from 1b?


In [ ]:
feat_order = np.argsort(rf_clf.feature_importances_)[::-1]  # most important first
feat_str = ',\n'.join(np.array(feats)[feat_order])

print('From most to least important: \n{}'.format(feat_str))

write your answer here

Problem 3) Model Evaluation

To evaluate the performance of the model we establish a baseline (or figure of merit) that we would like to exceed. For our current application we want to maximize the accuracy of the model.

If the model does not improve upon the baseline (or reach the desired figure of merit) then one must iterate on previous steps (feature engineering, algorithm selection, etc) to accomplish the desired goal.

The SDSS photometric pipeline uses a simple parametric model to classify sources as either stars or galaxies. If we are going to the trouble of building a complex ML model, then it stands to reason that its performance should exceed that of the simple model. Thus, we adopt the SDSS photometric classifier as our baseline.

The SDSS photometric classifier uses a single hard cut to separate stars and galaxies in imaging data:

$$\mathtt{psfMag_r} - \mathtt{cModelMag_r} > 0.145.$$

Sources that satisfy this criterion are considered galaxies.

Problem 3a

Determine the baseline figure of merit by measuring the accuracy of the SDSS photometric classifier on the training set.

Hint - the accuracy_score function in the sklearn.metrics module may be useful.


In [ ]:
from sklearn.metrics import accuracy_score

# psfMag_r and cModelMag_r are columns 0 and 7 of the feature array
psf_minus_cmodel = train_X[:, 0] - train_X[:, 7]
phot_y = np.where(psf_minus_cmodel > 0.145, 'GALAXY', 'STAR')

print("The baseline FoM = {:.4f}".format(accuracy_score(train_y, phot_y)))

Problem 3b

Use 10-fold cross validation to estimate the FoM for the $k$NN model. Take the mean value across all folds as the FoM estimate.

Hint - the cross_val_score function from the sklearn.model_selection module performs the necessary calculations.


In [ ]:
from sklearn.model_selection import cross_val_score

knn_cv = cross_val_score(knn_clf, train_X, train_y, cv=10)

print('The kNN model FoM = {:.4f} +/- {:.4f}'.format(np.mean(knn_cv), np.std(knn_cv, ddof=1)))

Problem 3c

Use 10-fold cross validation to estimate the FoM for the random forest model.


In [ ]:
rf_cv = cross_val_score(rf_clf, train_X, train_y, cv=10)

print('The RF model FoM = {:.4f} +/- {:.4f}'.format(np.mean(rf_cv), np.std(rf_cv, ddof=1)))

Problem 3d

Do the machine-learning models outperform the SDSS photometric classifier?

write your answer here

Problem 4) Model Optimization

While the "off-the-shelf" model provides an improvement over the SDSS photometric classifier, we can further refine and improve the performance of the machine learning model by adjusting the model tuning parameters. A process known as model optimization.

All machine-learning models have tuning parameters. In brief, these parameters capture the smoothness of the model in the multidimensional feature space. Whether the model should be smooth or coarse is application dependent -- be wary of over-fitting or under-fitting the data. Generally speaking, RF (and most tree-based methods) has 3 flavors of tuning parameter (see the sketch after this list):

  1. $N_\mathrm{tree}$ - the number of trees in the forest n_estimators (default: 10) in sklearn
  2. $m_\mathrm{try}$ - the number of (random) features to explore as splitting criteria at each node max_features (default: sqrt(n_features)) in sklearn
  3. Pruning criteria - stopping criteria that end the continued growth of the tree; there are many choices for this in sklearn (my preference is min_samples_leaf (default: 1), which sets the minimum number of sources allowed in a terminal node, or leaf, of the tree)
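
For concreteness, a minimal sketch showing where these three knobs enter in sklearn (the values below are placeholders, not recommendations):

In [ ]:
# the three RF tuning knobs in sklearn (placeholder values, not recommendations)
rf_example = RandomForestClassifier(n_estimators=100,     # N_tree
                                    max_features=3,       # m_try
                                    min_samples_leaf=10,  # pruning criterion
                                    n_jobs=-1, random_state=rs)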

Just as we previously evaluated the model using CV, we must optimize the tuning parameters via CV. Until we "finalize" the model by fixing all the input parameters, we cannot evaluate the accuracy of the model with the test set as that would be "snooping."

Before globally optimizing the model, let's develop some intuition for how the tuning parameters affect the final model predictions.

Problem 4a

Determine the 10-fold cross validation accuracy for $k$NN models with $k$ = 1, 10, 100.

How do you expect changing the number of neighbors to affect the results?


In [ ]:
for k in [1, 10, 100]:
    knn_cv = cross_val_score(KNeighborsClassifier(n_neighbors=k), train_X, train_y, cv=10)

    print('With k = {:d}, the kNN FoM = {:.4f} +/- {:.4f}'.format(k, np.mean(knn_cv), np.std(knn_cv, ddof=1)))

write your answer here

Problem 4b

Determine the 10-fold cross validation accuracy for RF models with $N_\mathrm{tree}$ = 1, 10, 30, 100, and 300.

How do you expect changing the number of trees to affect the results?


In [ ]:
for ntree in [1, 10, 30, 100, 300]:
    rf_cv = cross_val_score(RandomForestClassifier(n_estimators=ntree, random_state=rs, n_jobs=-1),
                            train_X, train_y, cv=10)

    print('With {:d} trees the FoM = {:.4f} +/- {:.4f}'.format(ntree, np.mean(rf_cv), np.std(rf_cv, ddof=1)))

write your answer here

Now you are ready for the moment of truth!

Problem 5) Model Predictions

Problem 5a

Calculate the FoM for the SDSS photometric model on the test set.


In [ ]:
# apply the same photometric cut, now to the test set
phot_y = np.where(test_X[:, 0] - test_X[:, 7] > 0.145, 'GALAXY', 'STAR')

print("The baseline FoM = {:.4f}".format(accuracy_score(test_y, phot_y)))

Problem 5b

Using the optimal number of trees from 4b calculate the FoM for the random forest model.

Hint - remember that the model should be trained on the training set, but the predictions are for the test set.


In [ ]:
# n_estimators should be the optimal N_tree you found in 4b (300 is assumed here)
rf_clf = RandomForestClassifier(n_estimators=300, random_state=rs, n_jobs=-1)
rf_clf.fit(train_X, train_y)
test_preds = rf_clf.predict(test_X)

print("The RF model has FoM = {:.4f}".format(accuracy_score(test_y, test_preds)))

Problem 5c

Calculate the confusion matrix for the test set. Is there symmetry to the misclassifications?

Hint - the confusion_matrix function in sklearn.metrics will help.


In [ ]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(test_y, test_preds))

write your answer here

Problem 5d

Calculate the ROC curve (and plot the region of interest) assuming that stars are the positive class.

Hint 1 - you will need to calculate probabilistic classifications for the test set using the predict_proba() method.

Hint 2 - the roc_curve function in the sklearn.metrics module will be useful.


In [ ]:
from sklearn.metrics import roc_curve

# stars are the positive class
test_y_int = np.where(test_y == 'STAR', 1, 0)

test_preds_proba = rf_clf.predict_proba(test_X)

# rf_clf.classes_ is sorted alphabetically, so 'STAR' probabilities are in column 1
fpr, tpr, thresh = roc_curve(test_y_int, test_preds_proba[:, 1])

fig, ax = plt.subplots()
ax.plot(fpr, tpr)
ax.set_xlim(0, 0.1)  # zoom in on the low-FPR region of interest
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')

Problem 5e

Suppose that (like me) you really care about supernovae. In this case you want a model that correctly classifies 99% of all stars, so that stellar flares do not fool you into thinking you have found a new supernova.

What classification threshold should be adopted for this model?

What fraction of galaxies does this model misclassify?


In [ ]:
tpr_99_idx = np.argmax(tpr >= 0.99)  # first point on the ROC curve where TPR >= 0.99
tpr_99_thresh = thresh[tpr_99_idx]

print('This model requires a classification threshold of {:.4f}'.format(tpr_99_thresh))

fpr_at_tpr_99 = fpr[tpr_99_idx]

print('This model misclassifies {:.2f}% of galaxies'.format(fpr_at_tpr_99*100))

Problem 6) Classify New Data

Run the cell below to load in some new data (which in this case happens to have known labels, but in practice this will almost never be the case...)


In [ ]:
new_data_df = pd.read_hdf("blind_test_set.h5")

Problem 6a

Create a feature and label array for the new data.

Hint - copy the code you developed above in Problem 2.


In [ ]:
new_X = new_data_df[feats].values
new_y = new_data_df['class'].values

Problem 6b

Calculate the accuracy of the model predictions on the new data.


In [ ]:
new_preds = rf_clf.predict(new_X)
print("The model has an accuracy of {:.4f}".format(accuracy_score(new_y, new_preds)))

Problem 6c

Can you explain why the accuracy for the new data is significantly lower than what you calculated previously?

If you can build and train a better model (using the training data) for classifying the new data - I will be extremely impressed.

write your answer here

Challenge Problem) Full RF Optimization

Now we will optimize the model over all tuning parameters. How does one actually determine the optimal set of tuning parameters? Brute force.

We will optimize the model via a grid search that performs CV at each point in the 3D grid. The final model will adopt the point with the highest accuracy.

It is important to remember two general rules of thumb: (i) if the model is optimized at the edge of the grid, refit a new grid centered on that point, and (ii) the results should be stable in the vicinity of the grid maximum. If this is not the case the model is likely overfit.

Use GridSearchCV to perform a 3-fold CV grid search to optimize the RF star-galaxy model. Remember the rules of thumb.

What are the optimal tuning parameters for the model?

Hint 1 - think about the computational runtime based on the number of points in the grid. Do not start with a very dense or large grid.

Hint 2 - if the runtime is long, don't repeat the grid search even if the optimal model is on an edge of the grid


In [ ]:
from sklearn.model_selection import GridSearchCV

# a coarse 3x3x3 starting grid; refine it around the best point as needed
grid_results = GridSearchCV(RandomForestClassifier(n_jobs=-1, random_state=rs),
                            {'n_estimators': [30, 100, 300],
                             'max_features': [1, 3, 7],
                             'min_samples_leaf': [1, 10, 30]},
                            cv=3)
grid_results.fit(train_X, train_y)
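
To check rule of thumb (ii), inspect the scores of neighboring grid points. A minimal sketch, assuming grid_results is the fitted GridSearchCV object from the cell above:

In [ ]:
# rule of thumb (ii): scores should be stable in the vicinity of the best grid point
cv_df = pd.DataFrame(grid_results.cv_results_)
print(cv_df[['params', 'mean_test_score']].sort_values('mean_test_score',
                                                       ascending=False).head(5))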

In [ ]:
print('The best model has {}'.format(grid_results.best_params_))