The problems in the following notebook develop an end-to-end machine learning model using actual astronomical data to separate stars and galaxies. There are 5 steps in this machine learning workflow:
The data come from the Sloan Digital Sky Survey (SDSS), an imaging survey that has several similarities to LSST (though the telescope was significantly smaller and the survey did not cover as large an area).
Science background: Many (nearly all?) of the science applications for LSST data will rely on the accurate separation of stars and galaxies in the LSST imaging data. As an example, imagine measuring the structure of the Milky Way without knowing which sources are galaxies and which are stars.
During this exercise, we will utilize supervised machine learning methods to separate extended sources (galaxies) and point sources (stars) in imaging data. These methods are highly flexible, and as a result can classify sources at higher fidelity than methods that simply make cuts in a low-dimensional space.
In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
For this problem the training set, i.e. sources with known labels, includes stars and galaxies that have been confirmed with spectroscopic observations. The machine learning model is needed because there are $\gg 10^8$ sources with photometric observations in SDSS, and only $4 \times 10^6$ sources with spectroscopic observations. The model will allow us to translate our knowledge from the spectroscopic observations to the entire data set. The features include each $r$-band magnitude measurement made by SDSS (don't worry if you don't know what this means...). This yields 8 features to train the models (significantly fewer than the 454 properties measured for each source in SDSS).
If you are curious (and it is fine if you are not) this training set was constructed by running the following query on the SDSS database:
SELECT TOP 20000
p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r,
p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r,
s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
ORDER BY p.objid ASC
First download the training set and the blind test set for this problem.
Problem 1a
Visualize the training set data. The data have 8 features ['psfMag_r', 'fiberMag_r', 'fiber2Mag_r', 'petroMag_r', 'deVMag_r', 'expMag_r', 'modelMag_r', 'cModelMag_r'], and a 9th column ['class'] corresponding to the labels ('STAR' or 'GALAXY' in this case).
Hint - just execute the cell below.
In [ ]:
sdss_df = pd.read_hdf("sdss_training_set.h5")
sns.pairplot(sdss_df, hue = 'class', diag_kind = 'hist')
Problem 1b
Based on your plots of the data, which feature do you think will be the most important for separating stars and galaxies? Why?
write your answer here - do not change it after later completing the problem
The final data preparation step it to create an independent test set to evalute the generalization error of the final tuned model. Independent test sets are generated by witholding a fraction of the training set. No hard and fast rules apply for the fraction to be withheld, though typical choices vary between $\sim{0.2}-0.5$.
sklearn.model_selection
has a useful helper function train_test_split
.
Problem 1c Split the 20k spectroscopic sources 70-30 into training and test sets. Save the results in arrays called: train_X, train_y, test_X, test_y
, respectively. Use rs
for the random_state
in train_test_split
.
Hint - recall that sklearn
utilizes X, a 2D np.array()
, and y as the features and labels arrays, respecitively.
In [ ]:
from sklearn.model_selection import train_test_split
rs = 1851
# complete
X = # complete
y = # complete
train_X, test_X, train_y, test_y = # complete
We will now ignore everything in the test set until we have fully optimized the machine learning model.
After curating the data, you must select a specific machine learning algorithm. With experience, it is possible to develop intuition for the best ML algorithm given a specific problem.
Short of that? Try two (or three, or four, or five) different models and choose whichever works the best.
Problem 2a
Train a $k$-nearest neighbors model on the star-galaxy training set. Select $k$ = 25 for this model.
Hint - the KNeighborsClassifier
object in the sklearn.neighbors
module may be useful for this task.
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = # complete
# complete
Problem 2b
Train a Random Forest (RF) model (Breiman 2001) on the training set. Include 50 trees in the forest using the n_estimators
parameter. Again, set random_state
= rs.
Hint - use the RandomForestClassifier
object from the sklearn.ensemble
module. Also - be sure to set n_jobs = -1
in every call of RandomForestClassifier
.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
rf_clf = # complete
# complete
A nice property of RF, relative to $k$NN, is that RF naturally provides an estimate of the most important features in a model.
RF feature importance is measured by randomly shuffling the values of a particular feature, and measuring the decrease in the model's overall accuracy. The relative feature importances can be accessed using the .feature_importances_
attribute associated with the RandomForestClassifer()
object. The higher the value, the more important feature.
Problem 2c
Calculate the relative importance of each feature.
Which feature is most important? Does this match your answer from 1c?
In [ ]:
feat_str = ',\n'.join(['{}'.format(feat) for feat in np.array(feats)[np.argsort(rf_clf.feature_importances_)[::-1]]])
print('From most to least important: \n{}'.format(feat_str))
write your answer here
To evaluate the performance of the model we establish a baseline (or figure of merit) that we would like to exceed. For our current application we want to maximize the accuracy of the model.
If the model does not improve upon the baseline (or reach the desired figure of merit) then one must iterate on previous steps (feature engineering, algorithm selection, etc) to accomplish the desired goal.
The SDSS photometric pipeline uses a simple parametric model to classify sources as either stars or galaxies. If we are going to the trouble of building a complex ML model, then it stands to reason that its performance should exceed that of the simple model. Thus, we adopt the SDSS photometric classifier as our baseline.
The SDSS photometric classifier uses a single hard cut to separate stars and galaxies in imaging data:
$$\mathtt{psfMag_r} - \mathtt{cModelMag_r} > 0.145.$$Sources that satisfy this criteria are considered galaxies.
Problem 3a
Determine the baseline figure of merit by measuring the accuracy of the SDSS photometric classifier on the training set.
Hint - the accuracy_score
function in the sklearn.metrics
module may be useful.
In [ ]:
from sklearn.metrics import accuracy_score
phot_y = # complete
# complete
# complete
# complete
print("The baseline FoM = {:.4f}".format( # complete
Problem 3b
Use 10-fold cross validation to estimate the FoM for the $k$NN model. Take the mean value across all folds as the FoM estimate.
Hint - the cross_val_score
function from the sklearn.model_selection
module performs the necessary calculations.
In [ ]:
from sklearn.model_selection import cross_val_score
knn_cv = cross_val_score( # complete
print('The kNN model FoM = {:.4f} +/- {:.4f}'.format( # complete
Problem 3c
Use 10-fold cross validation to estimate the FoM for the random forest model.
In [ ]:
rf_cv = cross_val_score( # complete
print('The RF model FoM = {:.4f} +/- {:.4f}'.format( # complete
Problem 3d
Do the machine-learning models outperform the SDSS photometric classifier?
write your answer here
All machine-learning models have tuning parameters. In brief, these parameters capture the smoothness of the model in the multidimentional-feature space. Whether the model is smooth or coarse is application dependent -- be weary of over-fitting or under-fitting the data. Generally speaking, RF (and most tree-based methods) have 3 flavors of tuning parameter:
n_estimators
(default: 10) in sklearn
max_features
(default: sqrt(n_features)
) in sklearn
sklearn
(My preference is min_samples_leaf
(default: 1) which sets the minimum number of sources allowed in a terminal node, or leaf, of the tree)Just as we previously evaluated the model using CV, we must optimize the tuning paramters via CV. Until we "finalize" the model by fixing all the input parameters, we cannot evalute the accuracy of the model with the test set as that would be "snooping."
Before globally optimizing the model, let's develop some intuition for how the tuning parameters affect the final model predictions.
Problem 4a
Determine the 10-fold cross validation accuracy for $k$NN models with $k$ = 1, 10, 100.
How do you expect changing the number of neighbors to affect the results?
In [ ]:
for k in [1,10,100]:
# complete
print('With k = {:d}, the kNN FoM = {:.4f} +/- {:.4f}'.format( # complete
write your answer here
Problem 4b
Determine the 10-fold cross validation accuracy for RF models with $N_\mathrm{tree}$ = 1, 10, 30, 100, and 300.
How do you expect changing the number of trees to affect the results?
In [ ]:
for ntree in [1,10,30,100,300]:
# complete
print('With {:d} trees the FoM = {:.4f} +/- {:.4f}'.format( # complete
write your answer here
Now you are ready for the moment of truth!
Problem 5a
Calculate the FoM for the SDSS photometric model on the test set.
In [ ]:
phot_y = # complete
# complete
# complete
# complete
print("The baseline FoM = {:.4f}".format( # complete
Problem 5b
Using the optimal number of trees from 4b calculate the FoM for the random forest model.
Hint - remember that the model should be trained on the training set, but the predictions are for the test set.
In [ ]:
rf_clf = RandomForestClassifier( # complete
# complete
# complete
print("The RF model has FoM = {:.4f}".format( # complete
Problem 5c
Calculate the confusion matrix for the test set. Is there symmetry to the misclassifications?
Hint - the confusion_matrix
function in sklearn.metrics
will help.
In [ ]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix( # complete
write your answer here
Problem 5d
Calculate (and plot the region of interest) the ROC curve assumming that stars are the positive class.
Hint 1 - you will need to calculate probabilistic classifications for the test set using the predict_proba()
method.
Hint 2 - the roc_curve
function in the sklearn.metrics
module will be useful.
In [ ]:
from sklearn.metrics import roc_curve
test_y_int = # complete
# complete
test_preds_proba = rf_clf.predict_proba( # complete
fpr, tpr, thresh = roc_curve( # complete
fig, ax = plt.subplots()
ax.plot( # complete
ax.set_xlabel('FPR')
ax.set_ylabel('TPR')
Problem 5e
Suppose that (like me) you really care about supernovae. In this case you want a model that correctly classifies 99% of all stars, so that stellar flares do not fool you into thinking you have found a new supernova.
What classification threshold should be adopted for this model?
What fraction of galaxies does this model misclassify?
In [ ]:
tpr_99_thresh = # complete
print('This model requires a classification threshold of {:.4f}'.format(tpr_99_thresh))
fpr_at_tpr_99 = # complete
print('This model misclassifies {:.2f}% of galaxies'.format(fpr_at_tpr_99*100))
In [ ]:
new_data_df = pd.read_hdf("blind_test_set.h5")
Problem 6a
Create a feature and label array for the new data.
Hint - copy the code you developed above in Problem 2.
In [ ]:
new_X = # complete
new_y = # complete
Problem 6b
Calculate the accuracy of the model predictions on the new data.
In [ ]:
new_preds = # complete
print("The model has an accuracy of {:.4f}".format( # complete
Problem 6c
Can you explain why the accuracy for the new data is significantly lower than what you calculated previously?
If you can build and train a better model (using the trianing data) for classifying the new data - I will be extremely impressed.
write your answer here
Now we will optimize the model over all tuning parameters. How does one actually determine the optimal set of tuning parameters? Brute force.
We will optimize the model via a grid search that performs CV at each point in the 3D grid. The final model will adopt the point with the highest accuracy.
It is important to remember two general rules of thumb: (i) if the model is optimized at the edge of the grid, refit a new grid centered on that point, and (ii) the results should be stable in the vicinity of the grid maximum. If this is not the case the model is likely overfit.
Use GridSearchCV
to perform a 3-fold CV grid search to optimize the RF star-galaxy model. Remember the rules of thumb.
What are the optimal tuning parameters for the model?
Hint 1 - think about the computational runtime based on the number of points in the grid. Do not start with a very dense or large grid.
Hint 2 - if the runtime is long, don't repeat the grid search even if the optimal model is on an edge of the grid
In [ ]:
from sklearn.model_selection import GridSearchCV
# complete
In [ ]:
print('The best model has {}'.format( # complete