In [ ]:
from __future__ import division, print_function, absolute_import
We will now follow the steps from the machine learning workflow lecture to develop an end-to-end machine learning model using actual astronomical data. As a reminder, the workflow is as follows:
Some of these steps will be streamlined to allow us to fully build a model within the allotted time.
Science background: Many (nearly all?) of the science applications for LSST data will rely on the accurate separation of stars and galaxies in the LSST imaging data. As an example, imagine measuring galaxy clustering without knowing which sources are galaxies and which are stars.
During this exercise, we will utilize supervised machine-learning methods to separate extended sources (galaxies) from point sources (stars, QSOs) in imaging data. These methods are highly flexible, and as a result can classify sources with higher fidelity than methods that simply make cuts in a low-dimensional space.
In [ ]:
import numpy as np
from astropy.table import Table
import matplotlib.pyplot as plt
%matplotlib inline
The training set for this exercise uses Sloan Digital Sky Survey (SDSS) data. For features, we will start with each $r$-band magnitude measurement made by SDSS. This yields 8 features (twice that of the Iris data set, but significantly fewer than the 454 properties measured for each source in SDSS).
Step 1 in the ML workflow is data preparation - we must curate the training set. As a reminder:
A machine-learning model is only as good as its training set.
This point cannot be emphasized enough. Machine-learning models are data-driven; they do not capture any physical theory, and thus it is essential that the training set satisfy several criteria.
Two of the most important criteria for a good training set are:
So, step 1 (this is a must): we are going to examine the training set to see if anything suspicious is going on. We will use astroquery to directly access the SDSS database, and store the results in an astropy Table.
Note - the SDSS API for astroquery is not standard for the package, which leads to a warning. This is not, however, a problem for our purposes.
In [ ]:
from astroquery.sdss import SDSS # enables direct queries to the SDSS database
While it is possible to look up each of the names of the $r$-band magnitudes in the SDSS PhotoObjAll schema, the schema list is long, and thus difficult to parse by eye. Fortunately, we can identify the desired columns using the database itself:
select COLUMN_NAME
from INFORMATION_SCHEMA.Columns
where table_name = 'PhotoObjAll' AND
COLUMN_NAME like '%Mag/_r' escape '/'
which returns the following list of columns: psfMag_r, fiberMag_r, fiber2Mag_r, petroMag_r, deVMag_r, expMag_r, modelMag_r, cModelMag_r.
We now select these magnitude measurements for 10000 stars and galaxies from SDSS. Additionally, we join these photometric measurements with the SpecObjAll table to obtain their spectroscopic classifications, which will serve as labels for the machine-learning model.
Note - the SDSS database contains duplicate observations, flagged observations, and non-detections, which we condition the query to exclude (as explained further below). We also exclude quasars, as the precise photometric classification of these objects is ambiguous: low-$z$ AGN have resolvable host galaxies, while high-$z$ QSOs are point-sources. Query conditions:
p.mode = 1
select only the primary photometric detection of a source
s.sciencePrimary = 1
select only the primary spectroscopic detection of a source (together with above, prevents duplicates)
p.clean = 1
the SDSS clean flag excludes flagged observations and sources with non-detections
s.class != 'QSO'
removes potentially ambiguous QSOs from the training set
In [ ]:
sdss_query = """SELECT TOP 10000
p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r,
p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r,
s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
ORDER BY p.objid ASC
"""
sdss_set = SDSS.query_sql(sdss_query)
sdss_set
To reiterate a point from above: data-driven models are only as good as the training set. Now that we have a potential training set, it is essential to inspect the data for any peculiarities.
Problem 1a
Can you easily identify any important properties of the data from the above table?
If not - is there a better way to examine the data?
Hint - emphasis on easy.
Solution 1a
Write your answer here
Problem 1b
Visualize the 8-dimensional feature set [this is intentionally open-ended...]
Does this visualization reveal anything that is not obvious from the table?
Can you identify any biases in the training set?
Remember - always worry about the data
Hint - astropy Tables can be converted to pandas DataFrames with the .to_pandas() method.
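For a concrete starting point, here is a minimal sketch of that conversion plus a quick pairwise look at a few columns (the column subset chosen here is purely illustrative; a seaborn pairplot colored by class is another common choice):
In [ ]:
import pandas as pd

sdss_df = sdss_set.to_pandas()        # convert the astropy Table to a DataFrame
print(sdss_df.describe())             # summary statistics for every column

# pairwise scatter plots for a handful of the magnitude features
pd.plotting.scatter_matrix(sdss_df[["psfMag_r", "deVMag_r", "cModelMag_r"]],
                           figsize=(8, 8), alpha=0.2)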
In [ ]:
# complete
Solution 1b
Write your answer here
Finally, to finish off our preparation of the data, we need to create an independent test set that will be used to evaluate the accuracy/generalization properties of the model after everything has been tuned. Often, independent test sets are generated by withholding a fraction of the training set. No hard and fast rules apply for the fraction to be withheld, though typical choices vary between $\sim{0.2}-0.5$. For this problem we will adopt 0.3.
sklearn.model_selection has a handy function, train_test_split, which will simplify this process.
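As a quick illustration of the call signature only (toy arrays here; the split of the actual SDSS data is Problem 1c below):
In [ ]:
from sklearn.model_selection import train_test_split

toy_X = np.arange(20).reshape(10, 2)   # 10 "sources" with 2 features each
toy_y = np.array([0, 1] * 5)           # toy labels

# withhold 30% of the sources as an independent test set
toy_train_X, toy_test_X, toy_train_y, toy_test_y = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=23)
print(toy_train_X.shape, toy_test_X.shape)   # (7, 2) (3, 2)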
Problem 1c Split the 10k spectroscopic sources 70-30 into training and test sets. Save the results in arrays called train_X, train_y, test_X, test_y, respectively. Use rs for the random_state in train_test_split.
Hint - recall that sklearn utilizes X, a 2D np.array(), and y as the features and labels arrays, respectively.
In [ ]:
from sklearn.model_selection import train_test_split
rs = 2 # we are in the second-biggest metropolitan area in the US
# complete
X = np.array( # complete
y = np.array( # complete
train_X, test_X, train_y, test_y = train_test_split( X, y, # complete
It has been said that all machine learning is an exercise in feature engineering.
Feature engineering, the process of creating new features, combining features, removing features, collecting new data to supplement existing features, etc., is essential in the machine learning workflow. As part of the data preparation stage, it is useful to apply domain knowledge to engineer features prior to model construction. [Though it is important to know that feature engineering may be needed at any point in the ML workflow if the model does not provide the desired results.]
Due to a peculiarity of our SDSS training set, we need to briefly craft a separate problem to demonstrate the importance of feature engineering.
For this aside, we will train the model on bright ($r' < 18.5$ mag) sources and test the model on faint ($r' > 19.5$ mag) sources. As you might guess, the model will not perform well. Following some clever feature engineering, we will be able to improve this.
aside-to-the-aside
This exact situation happens in astronomy all the time, and it is known as sample selection bias. In brief, any time a larger aperture telescope is built, or instrumentation is greatly improved, a large swath of sources that were previously undetectable can now be observed. These fainter sources, however, may contain entirely different populations than their brighter counterparts, and thus any models trained on the bright sources will be biased when making predictions on the faint sources.
We train and test the model with 10000 sources using an identical query to the one employed above, with the added condition restricting the training set to bright sources and the test set to faint sources.
In [ ]:
bright_query = """SELECT TOP 10000
p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r,
p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r,
s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
AND p.cModelMag_r < 18.5
ORDER BY p.objid ASC
"""
bright_set = SDSS.query_sql(bright_query)
bright_set
In [ ]:
faint_query = """SELECT TOP 10000
p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r,
p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r,
s.class
FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
AND p.cModelMag_r > 19.5
ORDER BY p.objid ASC
"""
faint_set = SDSS.query_sql(faint_query)
faint_set
Problem 2a
Train a $k$ Nearest Neighbors model with $k = 11$ neighbors on the 10k source training set. Note - for this particular problem, the number of neighbors does not matter much.
In [ ]:
from sklearn.neighbors import KNeighborsClassifier
feats = # complete
bright_X = # complete
bright_y = # complete
KNNclf = # complete
Problem 2b
Evaluate the accuracy of the model when applied to the sources in the faint test set.
Does the model perform well?
Hint - you may find sklearn.metrics.accuracy_score useful for this exercise.
In [ ]:
from sklearn.metrics import accuracy_score
faint_X = # complete
faint_y = # complete
faint_preds = # complete
print("The raw features produce a KNN model with accuracy ~{:.4f}".format( # complete
Solution 2b
Write your answer here
Leveraging the same domain knowledge discussed above, namely that galaxies cannot be modeled with a PSF, we can "normalize" the magnitude measurements by taking their difference relative to psfMag_r. This normalization has the added advantage of removing any knowledge of the apparent brightness of the sources, which should help when comparing independent bright and faint sets.
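One possible sketch of that differencing, assuming the bright_set table from the query above (the list name mag_feats is just a placeholder, and the same transformation must also be applied to the faint set):
In [ ]:
# difference each magnitude measurement relative to psfMag_r;
# psfMag_r itself is dropped, leaving 7 engineered features
mag_feats = ['fiberMag_r', 'fiber2Mag_r', 'petroMag_r',
             'deVMag_r', 'expMag_r', 'modelMag_r', 'cModelMag_r']

bright_Xnorm_example = np.vstack([bright_set['psfMag_r'] - bright_set[feat]
                                  for feat in mag_feats]).T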
Problem 2c
Normalize the feature vector relative to psfMag_r, and refit the $k$NN model to the 7 newly engineered features.
Does the accuracy improve when predicting the class of sources in the faint test set?
Hint - be sure you apply the exact same normalization to both the training and test set.
In [ ]:
bright_Xnorm = # complete
KNNclf = # complete
faint_predsNorm = # complete
print("The normalized features produce an accuracy ~{:.4f}".format( # complete
Solution 2c
Wow! Normalizing the features produces a huge ($\sim{35}\%$) increase in accuracy. Clearly, we should be using normalized magnitude features moving forward.
In addition to demonstrating the importance of feature engineering, this exercise teaches another important lesson: contextual features can be dangerous.
Contextual astronomical features can provide very strong priors: stars are more likely to be found close to the galactic plane, supernovae occur next to/on top of galaxies, bluer stars have lower metallicity, etc. Thus, including contextual information may improve overall model performance.
However, all astronomical training sets are heavily biased. Thus, the strong priors associated with contextual features can lead to severely biased model predictions.
Generally, I (AAM) remove all contextual features from my ML models for this reason. If you are building ML models, consider contextual information as it may help overall performance, but... be wary.
After the data have been properly curated, the next important choice in the ML workflow is the selection of ML algorithm. With experience, it is possible to develop intuition for the best ML algorithm given a specific problem.
Short of that? Try three (or four, or five) different models and choose whichever works the best.
For the star-galaxy problem, we will use the Random Forest (RF) algorithm (Breiman 2001) as implemented by scikit-learn. RandomForestClassifier is part of the sklearn.ensemble module.
RF has a number of nice properties for working with astronomical data:
which is why we will adopt it here.
Problem 3a
Build a RF model using the normalized features from the training set.
Include 25 trees in the forest using the n_estimators parameter in RandomForestClassifier.
In [ ]:
import # complete
rs = 626 # area code for Pasadena
train_Xnorm = # complete
RFclf = # complete
scikit-learn really makes it easy to build ML models.
Another nice property of RF is that it naturally provides an estimate of the most important features in the model.
[Once again - feature engineering comes into play, as it may be necessary to remove correlated features or unimportant features during the model construction in order to reduce run time or allow the model to fit in the available memory.]
In this case we don't need to remove any features [RF is relatively immune to correlated or unimportant features], but for completeness we measure the importance of each feature in the model.
RF feature importance is classically measured by randomly shuffling the values of a particular feature and measuring the decrease in the model's overall accuracy. (Note: scikit-learn computes its importances from the impurity decrease at each split rather than by permutation.) The relative feature importances can be accessed using the .feature_importances_ attribute of the RandomForestClassifier class. The higher the value, the more important the feature.
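As a small sketch of how that attribute might be inspected (this assumes a fitted classifier named RFclf and a list of feature names called feats; adjust the names to match your own variables):
In [ ]:
# rank the features from most to least important
importances = RFclf.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print("{}: {:.3f}".format(feats[idx], importances[idx]))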
Problem 3b
Calculate the relative importance of each feature.
Which feature is most important? Can you make sense of the feature ordering?
Hint - do not dwell too long on the final ordering of the features.
In [ ]:
# complete
print("The relative importance of the features is: \n{:s}".format( # complete
Solution 3b
psfMag_r - deVMag_r is the most important feature. This makes sense based on the separation of stars and galaxies in the psfMag_r - deVMag_r plane (see the visualization results above).
Note - the precise ordering of the features can change due to their strong correlation with each other, though the fiberMag features are always the least important.
To evaluate the performance of the model we establish a baseline (or figure of merit) that we would like to exceed. This is, in essence, the "engineering" step of machine learning [and why I (AAM) often caution against ML for scientific measurements and advocate for engineering-like problems instead].
If the model does not improve upon the baseline (or reach the desired figure of merit) then one must iterate on previous steps (feature engineering, algorithm selection, etc) to accomplish the desired goal.
The SDSS photometric pipeline uses a simple parametric model to classify sources as either stars or galaxies. If we are going to the trouble of building a complex ML model, then it stands to reason that its performance should exceed that of the simple model. Thus, we adopt the SDSS photometric classifier as our baseline.
The SDSS photometric classifier uses a single hard cut to separate stars and galaxies in imaging data:
$$\mathtt{psfMag} - \mathtt{cModelMag} > 0.145.$$
Sources that satisfy this criterion are considered galaxies.
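A minimal sketch of applying that cut (this assumes the magnitudes are still available in the sdss_set table from above; whether you evaluate it on the full table or only the training split is up to you):
In [ ]:
# sources above the cut are classified as galaxies, the rest as stars
phot_gal = (sdss_set['psfMag_r'] - sdss_set['cModelMag_r']) > 0.145
phot_labels = np.where(phot_gal, 'GALAXY', 'STAR')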
Problem 4a
Determine the baseline for the ML model by measuring the accuracy of the SDSS photometric classifier on the training set.
Hint - you may need to play around with array values to get accuracy_score to work.
In [ ]:
# complete
print("The SDSS phot model produces an accuracy ~{:.4f}".format( # complete
The simple SDSS model sets a high standard! A $\sim{96}\%$ accuracy following a single hard cut is a phenomenal performance.
Problem 4b Using 10-fold cross validation, estimate the accuracy of the RF model.
In [ ]:
from sklearn.model_selection import # complete
RFpreds = # complete
print("The CV accuracy for the training set is {:.4f}".format( # complete
Phew! Our hard work to build a machine learning model has been rewarded with an improved model: $\sim{96.9}\%$ accuracy vs. $\sim{96.4}\%$.
[But - was our effort worth only a $0.5\%$ improvement in the model?]
All machine-learning models have tuning parameters. In brief, these parameters capture the smoothness of the model in the multidimensional feature space. Whether the model is smooth or coarse is application dependent -- be wary of over-fitting or under-fitting the data. Generally speaking, RF (and most tree-based methods) have 3 flavors of tuning parameter:
1. the number of trees in the forest: n_estimators (default: 10) in sklearn
2. the number of features considered at each split: max_features (default: sqrt(n_features)) in sklearn
3. a limit on the size of the individual trees; sklearn offers several options (my preference is min_samples_leaf (default: 1), which sets the minimum number of sources allowed in a terminal node, or leaf, of the tree)
Just as we previously evaluated the model using CV, we must optimize the tuning parameters via CV. Until we "finalize" the model by fixing all the input parameters, we cannot evaluate the accuracy of the model with the test set, as that would be "snooping."
On Tuesday we were introduced to GridSearchCV, which is an excellent tool for optimizing model parameters.
Before we get to that, let's try to develop some intuition for how the tuning parameters affect the final model predictions.
Problem 5a
Determine the 5-fold CV accuracy for models with $N_\mathrm{tree}$ = 1, 10, 100.
How do you expect changing the number of trees to affect the results?
In [ ]:
rs = 1936 # year JPL was founded
CVpreds1 = # complete
# complete
# complete
print("The CV accuracy for 1, 10, 100 trees is {:.4f}, {:.4f}, {:.4f}".format( # complete
While (in this case) the effect is small, it is clear that $N_\mathrm{tree}$ affects the model output.
Now we will optimize the model over all tuning parameters. How does one actually determine the optimal set of tuning parameters?
Brute force.
This data set is small and there are few tuning parameters, so brute force is appropriate (alternatives exist when this isn't the case). We can optimize the model via a grid search that performs CV at each point in the 3D grid. The final model will adopt the point with the highest accuracy.
It is important to remember two general rules of thumb: (i) if the model is optimized at the edge of the grid, refit a new grid centered on that point, and (ii) the results should be stable in the vicinity of the grid maximum. If this is not the case, the model is likely overfit.
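A minimal sketch of what such a grid search might look like (the grid values below are arbitrary starting points, not recommendations, and train_Xnorm/train_y are the arrays constructed in the problems above):
In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [30, 100, 300],
              'max_features': [2, 4, 6],
              'min_samples_leaf': [1, 10, 30]}

grid = GridSearchCV(RandomForestClassifier(random_state=rs), param_grid, cv=3)
# grid.fit(train_Xnorm, train_y)   # uncomment to run; then inspect grid.best_params_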
Problem 5b
Use GridSearchCV to perform a 3-fold CV grid search to optimize the RF star-galaxy model. Remember the rules of thumb.
What are the optimal tuning parameters for the model?
Hint 1 - think about the computational runtime based on the number of points in the grid. Do not start with a very dense or large grid.
Hint 2 - if the runtime is long, don't repeat the grid search even if the optimal model is on an edge of the grid
In [ ]:
rs = 64 # average temperature in Los Angeles
from sklearn.model_selection import GridSearchCV
grid_results = # complete
print("The optimal parameters are:")
for key, item in grid_results.best_params_.items(): # warning - slightly different meanings in Py2 & Py3
print("{}: {}".format(key, item))
Now that the model is fully optimized - we are ready for the moment of truth!
Problem 5c
Using the optimized model parameters, train a RF model and estimate the model's generalization error using the test set.
How does this compare to the baseline model?
In [ ]:
RFopt_clf = # complete
test_preds = # complete
print('The optimized model produces a generalization error of {:.4f}'.format( # complete
Solution 5c
Write your answer here
We will now examine the performance of the model using some alternative metrics.
Note - if these metrics are essential for judging the model performance, then they should be incorporated into the workflow in the evaluation stage, prior to examination of the test set.
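For reference, a toy example of scikit-learn's confusion-matrix convention (rows correspond to the true class, columns to the predicted class, in the order given by labels):
In [ ]:
from sklearn.metrics import confusion_matrix

true_toy = ['STAR', 'STAR', 'GALAXY', 'GALAXY', 'GALAXY']
pred_toy = ['STAR', 'GALAXY', 'GALAXY', 'GALAXY', 'STAR']
print(confusion_matrix(true_toy, pred_toy, labels=['GALAXY', 'STAR']))
# [[2 1]
#  [1 1]]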
Problem 5d
Calculate the confusion matrix for the model, as determined by the test set.
Is there symmetry to the misclassifications?
In [ ]:
from sklearn.metrics import # complete
# complete
Solution 5d
Write your answer here
Problem 5e
Calculate and plot the ROC curves for both stars and galaxies.
Hint - you'll need probabilities in order to calculate the ROC curve.
In [ ]:
from sklearn.metrics import roc_curve
test_preds_proba = # complete
# complete
fpr, tpr, thresholds = roc_curve( # complete
plt.plot( # complete
plt.legend()
In [ ]:
from sklearn.metrics import roc_curve, roc_auc_score
test_preds_proba = RFopt_clf.predict_proba(test_Xnorm)
test_y_stars = np.zeros(len(test_y), dtype=int)
test_y_stars[test_y == "STAR"] = 1
test_y_galaxies = 1 - test_y_stars  # invert the binary labels for the galaxy curve
fpr, tpr, thresholds = roc_curve(test_y_stars, test_preds_proba[:,1])
plt.plot(fpr, tpr, label = r'$\mathrm{STAR}$', color = "MediumAquaMarine")
fpr, tpr, thresholds = roc_curve(test_y_galaxies, test_preds_proba[:,0])
plt.plot(fpr, tpr, label = r'$\mathrm{GALAXY}$', color = "Tomato")
plt.legend()
Problem 5f
Suppose you want a model that only misclassifies 1% of stars as galaxies.
What classification threshold should be adopted for this model?
What fraction of galaxies does this model miss?
Can you think of a reason to adopt such a threshold?
In [ ]:
# complete
Solution 5f
When building galaxy 2-point correlation functions it is very important to avoid including stars in the statistics as they will bias the final measurement.
Challenge 1
Calculate the accuracy with which the model classifies QSOs based on the 10k QSOs selected with the above command. How does that accuracy compare to that estimated by the test set?
In [ ]:
# complete
Challenge 2
Can you think of any reasons why the performance would be so much worse for the QSOs than it is for the stars?
Can you obtain an accuracy of ~0.97 when classifying QSOs?
In [ ]:
# complete
Challenge 3
Perform an actual test of the model using "field" sources. The SDSS photometric classifier is nearly perfect for sources brighter than $r = 21$ mag. Download a random sample of $r < 21$ mag photometric sources, and classify them using the optimized RF model. Adopting the photometric classifications as ground truth, what is the accuracy of the RF model?
Hint - you'll need to look up the parameter describing photometric classification in SDSS
In [ ]:
# complete