In [1]:
from __future__ import division, print_function, absolute_import, unicode_literals

Supervised Machine Learning Break Out:

Separating Stars and Galaxies from SDSS

Version 0.1

Many (nearly all?) of the science applications for LSST data will rely on the accurate separation of stars and galaxies in the LSST imaging data. As an example, imagine measuring galaxy clustering without knowing which sources are galaxies and which are stars. During this exercise, we will utilize supervised machine-learning methods to separate extended sources (galaxies) and point sources (stars, QSOs) in imaging data. These methods are highly flexible, and as a result can classify sources with higher fidelity than methods that simply make cuts in a low-dimensional space.


By AA Miller (c) 2016 Jul 11


In [2]:
import numpy as np
from astropy.table import Table
import matplotlib.pyplot as plt
%matplotlib inline

Problem 1) Obtain and Examine Training Data

As a reminder, for supervised-learning problems we use a training set (sources with known labels, i.e., sources that have been confirmed as normal stars, QSOs, or galaxies) to build a model to classify new observations for which we do not know the source label.

The training set for this exercise will be pulled from SDSS. For features, we will start with each of the magnitude measurements in the $r$-band. This results in a total of 8 features (more than the Iris data set, but significantly fewer than everything in SDSS).

The first step when pursuing a machine learning problem is to examine the potential training set. A machine-learning model is only as good as its training set. This point cannot be emphasized enough. Machine-learning models are data-driven; they do not capture any physical theory, and thus it is essential that the training set satisfy several criteria. Two of the most important criteria for a good training set are:

  • the training set should be unbiased [this is actually really hard to achieve in astronomy since most surveys are magnitude limited]
  • the training set should be representative of the (unobserved or field) population of sources [a training set with no stars will yield a model incapable of discovering point sources]

As a first step (and this is always a good idea), we are going to examine the training set to see if anything suspicious is going on. We will use astroquery to directly access the SDSS database, and store the results in an astropy Table.

Note The SDSS API for astroquery is not standard for the package, which leads to a warning. This is not, however, a problem for our purposes.


In [3]:
# execute this cell
from astroquery.sdss import SDSS  # enables direct queries to the SDSS database


/Users/adamamiller/anaconda/lib/python2.7/site-packages/astroquery/sdss/__init__.py:28: UserWarning: Experimental: SDSS has not yet been refactored to have its API match the rest of astroquery (but it's nearly there).
  warnings.warn("Experimental: SDSS has not yet been refactored to have its API "

While it is possible to look up each of the names of the $r$-band magnitudes in the SDSS PhotoObjAll schema, the schema list is long, and thus there is a chance of missing one column. Better to identify the desired columns using the database itself:

SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.Columns
WHERE table_name = 'PhotoObjAll' AND 
COLUMN_NAME LIKE '%Mag/_r' ESCAPE '/'

which returns the following list of columns: psfMag_r, fiberMag_r, fiber2Mag_r, petroMag_r, deVMag_r, expMag_r, modelMag_r, cModelMag_r.

We now select these magnitude measurements for 10000 stars and galaxies from SDSS. Additionally, we join these results with the SpecObjAll table to obtain their spectroscopic classifications, which will serve as labels for the machine-learning model.

Note - the SDSS database contains duplicate observations, flagged observations, and non-detections, which we condition the query to exclude (as explained further below). We also exclude quasars, as the precise photometric classification of these objects is ambiguous: low-$z$ AGN have resolvable host galaxies, while high-$z$ QSOs are point-sources. Query conditions:

  • p.mode = 1 select only the primary photometric detection of a source
  • s.sciencePrimary = 1 select only the primary spectroscopic detection of a source (together with above, prevents duplicates)
  • p.clean = 1 the SDSS clean flag excludes flagged observations and sources with non-detections
  • s.class != 'QSO' removes potentially ambiguous QSOs from the training set

In [4]:
TSquery = """SELECT TOP 10000 
             p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r, 
             p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r, 
             s.class
             FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
             WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class != 'QSO'
             ORDER BY p.objid ASC
               """
SDSSts = SDSS.query_sql(TSquery)
SDSSts


Out[4]:
<Table length=10000>
psfMag_r  fiberMag_r  fiber2Mag_r  petroMag_r  deVMag_r  expMag_r  modelMag_r  cModelMag_r  class
float64   float64     float64      float64     float64   float64   float64     float64      str6
18.50319  18.65275    19.33509     17.54539    17.31526  17.58132  17.58132    17.48715     GALAXY
19.02659  19.32441    19.80892     19.05827    19.03468  19.03111  19.03111    19.03111     STAR
19.8809   19.77895    20.46623     19.3534     19.1864   19.35493  19.35493    19.24559     GALAXY
22.03563  22.06141    22.68416     21.51795    21.03554  21.31751  21.31751    21.31751     GALAXY
21.56726  21.57312    22.22178     20.4583     19.93309  20.39825  20.39819    20.20402     GALAXY
18.66813  18.75309    19.36792     17.83372    17.69468  17.89914  17.6947     17.75269     GALAXY
20.19068  20.33947    20.93226     19.77666    19.63458  19.74873  19.74872    19.74873     GALAXY
19.41619  19.38348    20.15486     17.67687    17.16815  17.63962  17.63962    17.63962     GALAXY
18.88878  18.91179    19.57415     17.3317     17.02896  17.43539  17.02898    17.06256     GALAXY
17.73892  17.8784     18.4601      16.81384    16.80538  17.07542  16.80538    16.80538     GALAXY
...       ...         ...          ...         ...       ...       ...         ...          ...
19.24749  19.50776    20.10474     18.66996    18.61142  18.78825  18.61142    18.61683     GALAXY
18.85847  19.0704     19.73915     17.85506    17.58484  17.82618  17.82618    17.73945     GALAXY
17.88114  18.16021    18.73313     17.69177    17.64868  17.64409  17.64409    17.64409     STAR
18.36227  18.63728    19.22842     17.76154    17.72115  17.91659  17.72115    17.72115     GALAXY
19.39928  19.56483    20.29491     17.66843    17.08915  17.55575  17.55575    17.55575     GALAXY
18.05847  18.21004    18.89647     16.85453    16.62304  16.84302  16.84302    16.76329     GALAXY
22.23627  22.52658    23.06609     21.30207    21.25972  21.42846  21.25974    21.26578     GALAXY
19.66558  19.97335    20.53659     19.27357    19.22064  19.36352  19.22064    19.22064     GALAXY
20.70572  21.10135    21.57685     20.69317    20.65731  20.66223  20.65733    20.65731     STAR
21.83704  22.13966    22.7099      21.37834    21.28071  21.44025  21.44045    21.40327     GALAXY

To reiterate a point from above: data-driven models are only as good as the training set. Now that we have a potential training set, it is essential to inspect the data for any peculiarities.

Problem 1a Visualize the training set (an intentionally open-ended task; think back to yesterday's exercises) to inspect the features and whether or not they may be useful for a machine-learning model.

Hint astropy Tables can be converted to pandas DataFrames with the .to_pandas() operator, which may be helpful.


In [190]:
import seaborn as sns
sns.pairplot(SDSSts.to_pandas(), hue = 'class', diag_kind = 'kde')


Out[190]:
<seaborn.axisgrid.PairGrid at 0x11b434750>

Problem 1b Can the acquired training set be used to separate stars and galaxies? [Write your reasoning below]

Type response to 1b here

To test the efficacy of the machine-learning model we need to separate the spectroscopic sample into an independent training and validation set. There is no set number for the precise fraction of the data to include in the validation set, and typical choices vary between $\sim{0.2}-0.4$. For this problem we will adopt 0.3.

sklearn.cross_validation has a handy function train_test_split, which will simplify this process.

Problem 1c Split the 10k spectroscopic sources 70-30 into training and validation sets. Save the results in arrays called: train_X, train_y, val_X, val_y, respectively. Use rs for the random_state when selecting a random portion of the spectroscopic set.

Hint - recall that sklearn utilizes X, a 2D np.array(), and y as the features and labels arrays, respectively.


In [29]:
from sklearn.cross_validation import train_test_split
rs = 23  # we are in Chicago after all

feats = list(SDSSts.columns)
feats.remove('class')

X = np.array(SDSSts[feats].to_pandas())
y = np.array(SDSSts['class'])

train_X, val_X, train_y, val_y = train_test_split( X, y, test_size = 0.3, random_state = rs)

Problem 2) Feature Engineering

It has been said that all machine learning is an exercise in feature engineering. In other words, all machine-learning models are worthless without the correct set of features. Here we will quickly examine the set of features selected above to test how well they separate stars and galaxies. We will use the Random Forest (RF) algorithm Breiman 2001 as implemented by sklearn. RandomForestClassifier is part of the sklearn.ensemble module.

Problem 2a Train a RF model on your training set, and determine the accuracy of the model using the validation set. Set n_estimators, the number of trees in the forest, to 25 in the model.

Hint - you may find sklearn.metrics.accuracy_score useful for this exercise.


In [75]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

RFmod = RandomForestClassifier(n_estimators = 50)
RFmod.fit(train_X, train_y)
val_preds = RFmod.predict(val_X)

print("The raw features produce a model with accuracy ~{:.4f}".format(accuracy_score(val_y, val_preds)))


The raw features produce a model with accuracy ~0.9703

Not bad for a model with no optimization!

For reference, the SDSS photometric classifier uses a single hard cut to separate stars and galaxies in imaging data:

$$\mathtt{psfMag} - \mathtt{cModelMag} > 0.145,$$

sources that satisfy this criterion are considered galaxies.

Problem 2b Determine the accuracy of the SDSS photometric model on the 10k spectroscopic sources obtained above. This accuracy will represent the "benchmark" upon which we hope to improve with the machine learning model.


In [73]:
SDSSpred = SDSSts['psfMag_r'] - SDSSts['cModelMag_r'] > 0.145
SDSSphot_class = np.empty(len(SDSSpred), dtype = '|S6')
SDSSphot_class[SDSSpred] = 'GALAXY'
SDSSphot_class[SDSSpred == False] = 'STAR'


print("The SDSS phot model produces an accuracy ~{:.4f}".format(accuracy_score(SDSSts['class'], SDSSphot_class)))


The SDSS phot model produces an accuracy ~0.9634

One advantage of the SDSS model is that the magnitude measurements have been normalized. The SDSS spectroscopic training set is biased towards stars at the bright end, and towards galaxies at the faint end. (This can be quickly confirmed by plotting histograms, or better yet KDEs, of each class, as sketched below.) While the universe is biased in this way, it is not clear that the SDSS targeting function properly captured this bias. Thus, we will attempt to engineer the features to remove this bias (similar to SDSS).
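
For reference, the quick check described above might look something like the following sketch. It reuses the SDSSts table and seaborn (imported again here for completeness); plotting the psfMag_r KDE for each spectroscopic class is just one illustrative choice, and any of the magnitude columns would serve.

import seaborn as sns

SDSSdf = SDSSts.to_pandas()
for source_class in ['GALAXY', 'STAR']:
    # KDE of the r-band PSF magnitudes for each spectroscopic class
    sns.kdeplot(SDSSdf['psfMag_r'][SDSSdf['class'] == source_class], label = source_class)
plt.xlabel('psfMag_r')
plt.legend()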

Problem 2c Normalize the feature vector relative to psfMag_r, and refit the RF model to see if the accuracy improves when predicting the class of sources in the validation set.

Hint - be sure you apply the exact same normalization to both the training and validation sets.


In [77]:
train_Xnorm = train_X[:,0][:,np.newaxis] - train_X[:,1:]
val_Xnorm = val_X[:,0][:,np.newaxis] - val_X[:,1:]

RFmod = RandomForestClassifier(n_estimators = 50)
RFmod.fit(train_Xnorm, train_y)
val_predsNorm = RFmod.predict(val_Xnorm)

print("The normalized features produce a model with accuracy ~{:.4f}".format(accuracy_score(val_y, val_predsNorm)))


The normalized features produce a model with accuracy ~0.9713

Problem 2d How does the accuracy of each of these 3 models compare?

If normalizing the features did not improve the accuracy of the model, can you think of any reasons why?

Hint - think about the name of the algorithm.

Type response to 2d here

While there are additional feature engineering steps one could take:

  • remove correlated features
  • add contextual features (e.g., galactic latitude? distance to nearest neighbor?)
  • create new features (ellipticity measurements?)

we will stop at this stage. Generally speaking, RF is immune to correlated features (this is not true of all machine-learning algorithms), and contextual features, especially galactic latitude, can bias the model due to the manner in which the training data were obtained.

Finally, we exploit one of the most novel aspects of RF - the ability to measure the relative importance of each feature. This is accomplished by randomly shuffling the values of a particular feature and examining the resulting decrease in the model's overall accuracy. The relative feature importances can be accessed using the .feature_importances_ attribute of the RandomForestClassifier() class. The higher the value, the more important the feature.

Problem 2e Calculate the relative importance of each feature. Which feature is most important? Can you make sense of the feature ordering?

Hint - do not dwell too long on the final ordering of the features.


In [107]:
print(RFmod.feature_importances_)   # print the importances

indices = np.argsort(RFmod.feature_importances_)[::-1]   # sort the features most imp. --> least imp.

# recall that all features are normalized relative to psfMag_r
featStr = ", \n".join(['psfMag_r - {:s}'.format(x) for x in list(np.array(feats)[1:][indices])])

print("The relative importance of the features is: \n{:s}".format(featStr))


[ 0.07642965  0.0167598   0.08676947  0.26527494  0.17775416  0.16487131
  0.21214067]
The relative importance of the features is: 
psfMag_r - deVMag_r, 
psfMag_r - cModelMag_r, 
psfMag_r - expMag_r, 
psfMag_r - modelMag_r, 
psfMag_r - petroMag_r, 
psfMag_r - fiberMag_r, 
psfMag_r - fiber2Mag_r

In some cases, the final step in feature engineering is identifying the useful features and removing those that are noisy and/or provide no information. Again, RF is relatively immune to these problems, and in this case there are only 7 features, so we do not need to remove any. Nevertheless, we see that the psfMag_r - fiber2Mag_r feature has very little importance. This is due to the high correlation between fiberMag_r and fiber2Mag_r. It is likely we could remove one of these features without harming the model. We could even measure the model improvement (if any) via cross-validation or the validation set (a quick version of this check is sketched below).
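
As a quick, hedged sketch of that last check (reusing the normalized training and validation arrays from Problem 2c; the choice of 50 trees is arbitrary), one could drop the psfMag_r - fiber2Mag_r column and compare the validation accuracy:

# column 1 of the normalized arrays corresponds to psfMag_r - fiber2Mag_r
train_Xnorm_drop = np.delete(train_Xnorm, 1, axis = 1)
val_Xnorm_drop = np.delete(val_Xnorm, 1, axis = 1)

RFdrop = RandomForestClassifier(n_estimators = 50)
RFdrop.fit(train_Xnorm_drop, train_y)
val_predsDrop = RFdrop.predict(val_Xnorm_drop)

print("Without fiber2Mag_r the accuracy is ~{:.4f}".format(accuracy_score(val_y, val_predsDrop)))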

Problem 3) Optimize Model Tuning Parameters

All machine-learning models have tuning parameters. In brief, these parameters capture the smoothness of the model in the multidimensional feature space. Whether the model should be smooth or coarse is an application-dependent problem, though one should always be wary of over-fitting or under-fitting the data. Generally speaking, RF (and most tree-based methods) have 3 flavors of tuning parameter:

  1. $N_\mathrm{tree}$ - the number of trees in the forest n_estimators (default: 10) in sklearn
  2. $m_\mathrm{try}$ - the number of (random) features to explore as splitting criteria at each node max_features (default: sqrt(n_features)) in sklearn
  3. Pruning criteria - defined stopping criteria for ending continued growth of the tree, there are many choices for this in sklearn (My preference is min_samples_leaf (default: 1) which sets the minimum number of sources allowed in a terminal node, or leaf, of the tree)

An important lesson at this stage: to avoid overfitting, the validation set cannot be used to optimize the tuning parameters. This leaves two potential options: further split the training set into a training and a test set, or use cross validation (CV). As a reminder, in CV the training set is partitioned, and sources outside each individual partition are used to predict the class of sources within the partition. This procedure is then repeated for each partition, so that every source in the training set has exactly one prediction, allowing a measurement of the cross-validation error. The most common form of CV is known as K-fold, where the training set is divided into $k$ partitions of equal size. In order of increasing computational time, typical choices for $k$ are 2, 3, 10, or $N$, where $N$ is the total number of sources in the training set; this last case is known as leave-one-out CV.
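
To make the partitioning procedure concrete, here is a hand-rolled sketch of 5-fold CV. It uses the (since deprecated) sklearn.cross_validation.KFold class to match the imports elsewhere in this notebook (newer versions of sklearn provide the same functionality in sklearn.model_selection), and the choices of 5 folds and 25 trees are arbitrary.

from sklearn.cross_validation import KFold

kfold_preds = np.empty(len(train_y), dtype = train_y.dtype)
for cv_train, cv_test in KFold(len(train_y), n_folds = 5, shuffle = True, random_state = rs):
    fold_model = RandomForestClassifier(n_estimators = 25)
    fold_model.fit(train_Xnorm[cv_train], train_y[cv_train])
    kfold_preds[cv_test] = fold_model.predict(train_Xnorm[cv_test])   # every source predicted exactly once

print("The hand-rolled 5-fold CV accuracy is ~{:.4f}".format(accuracy_score(train_y, kfold_preds)))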

Fortunately, scikit-learn has a cross validation module that simplifies CV. The CV accuracy can be obtained via cross_validation.cross_val_score(), which requires the model, training features, and training classes as arguments. The default is $k = 3$ folds. Finally, note that this method uses Stratified K-folds, which is slightly different from the procedure described above.

Problem 3a Determine the 5-fold CV accuracy for the training set.


In [124]:
from sklearn.cross_validation import cross_val_score

CVscores = cross_val_score(RFmod, train_Xnorm, train_y, cv = 5)

print(CVscores)
print("The CV accuracy for the training set is {:.4f}".format(np.mean(CVscores)))


[ 0.96288365  0.97073519  0.96714286  0.96926376  0.97069335]
The CV accuracy for the training set is 0.9681

Now, let's try to develop some intuition for the effects of changing the tuning parameters.

Problem 3b Determine the 5-fold CV accuracy for models with $N_\mathrm{tree}$ = 1, 10, 100.


In [125]:
CVscores1 = cross_val_score(RandomForestClassifier(n_estimators = 1), train_Xnorm, train_y, cv = 5)

CVscores10 = cross_val_score(RandomForestClassifier(n_estimators = 10), train_Xnorm, train_y, cv = 5)

CVscores100 = cross_val_score(RandomForestClassifier(n_estimators = 100), train_Xnorm, train_y, cv = 5)

print("The CV accuracy for 1, 10, 100 trees is {:.4f}, {:.4f}, {:.4f}".format(np.mean(CVscores1), np.mean(CVscores10), np.mean(CVscores100)))


The CV accuracy for 1, 10, 100 trees is 0.9451, 0.9671, 0.9683

From the above results, we can see that 1 tree is likely too few, while the results seem to stabilize for $N_\mathrm{tree} \gtrsim 10$. So, how does one actually determine the optimal set of tuning parameters? Brute force. At least in our case we will use brute force, as the data set and the number of tuning parameters are both small (there are alternative approaches that are less time consuming when this isn't the case). We will perform a grid search over the three tuning parameters, measuring the CV accuracy at each point within the 3D grid, and picking the point with the highest accuracy. Two general rules of thumb: (i) if the highest accuracy occurs at the edge of the grid, it is best to re-run the search on a new grid centered on that point, and (ii) the results should be stable in the vicinity of the grid maximum. If this is not the case, then it is likely that the model has been overfit.

Once again, scikit-learn has made our lives significantly easier by providing a convenience function GridSearchCV, which will enable a single line call to optimize the model. One thing to keep in mind (though it likely won't be a problem here): depending on how GridSearchCV is called, it is possible to run into memory issues.

Problem 3c Perform a 3-fold CV grid search to optimize the RF star-galaxy model. Remember the rules of thumb. What are the optimal tuning parameters for the model?

Hint - think about the computational runtime based on the number of points in the grid. Do not start with a very dense or large grid.


In [137]:
from sklearn.grid_search import GridSearchCV

grid_results = GridSearchCV(RandomForestClassifier(), 
                            {'n_estimators': [30, 100, 300], 'max_features': [1, 3, 7], 'min_samples_leaf': [1,10]},
                            cv = 5)
grid_results.fit(train_Xnorm, train_y)


In [138]:
print("The optimal parameters are:")
for key, item in grid_results.best_params_.items(): # warning - slightly different meanings in Py2 & Py3
    print("{}: {}".format(key, item))


The optimal parameters are:
max_features: 3
n_estimators: 300
min_samples_leaf: 10

Problem 4) Evaluating the Accuracy of the Model

Now that the model has been fully optimized, we can estimate its accuracy with the validation set. This estimate provides insight into the accuracy we should expect for photometric data where we do not have spectroscopy, and thus cannot validate the predictions.

Problem 4a Train a RF model using the optimized tuning parameters from 3c, and estimate the classification accuracy of the model using the validation set.


In [160]:
RFmod = RandomForestClassifier(n_estimators=300, max_features=3, min_samples_leaf=10, n_jobs=-1).fit(train_Xnorm, train_y)

val_predsNorm = RFmod.predict(val_Xnorm)

print("The accuracy of the optimized model is ~{:.4f}".format(accuracy_score(val_y, val_predsNorm)))


The accuracy of the optimized model is ~0.9743

A standard method for evaluating classification models is the confusion matrix, which summarizes both the accuracy for each individual class and the most likely misclassifications for each class. [For a 2-class problem, such as ours, these quantities are essentially one and the same, but for multi-class problems the distinction is very helpful.] In a confusion matrix, one axis shows the true class and the other shows the predicted class. For a perfect classifier all of the power will be along the diagonal, while confusion is represented by off-diagonal signal.

Like almost everything else we have encountered during this exercise, scikit-learn makes it easy to compute a confusion_matrix within the sklearn.metrics module.

Problem 4b Calculate (and examine) the confusion matrix for the validation set created in Problem 1c.


In [143]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(val_y, val_predsNorm)
print(cm)


[[1346   64]
 [  23 1567]]

Visual representations are often the most useful means for interpreting the confusion matrix (this is especially true for multiclass problems).

Problem 4c Plot the normalized confusion matrix for the validation set.

Hint - you might find the sklearn confusion matrix tutorial helpful.


In [159]:
cmNorm = cm.astype(float)/cm.sum(axis = 1)[:,np.newaxis]

plt.imshow(cmNorm, interpolation = 'Nearest', cmap = 'Blues', vmin = 0, vmax = 1)
plt.colorbar()
tick_names = ["GALAXY", "STAR"]
tick_marks = np.arange(len(tick_names))
plt.xticks(tick_marks, tick_names, rotation=45)
plt.yticks(tick_marks, tick_names)
plt.ylabel('True label')
plt.xlabel('Predicted label')


Out[159]:
<matplotlib.text.Text at 0x11b7c8810>

Note - if your plot looks a little funny, that is likely because you imported seaborn earlier. Remember, importing seaborn also changes the default plot appearance, but these defaults can be adjusted.

For this problem, where the accuracy is very high and there are only two classes, the confusion matrix is not necessarily that insightful. Another common machine-learning evaluation tool is the Receiver Operating Characteristic (ROC) curve, which is especially useful for threshold tuning.

One of the useful properties of RF models, which has not been mentioned yet, is that RF provides probabilistic predictions. In short, a source that is classified as a star in 95% of trees is much more likely to be a star than a source classified as such in only 53% of trees. [It is better to think of these probabilities as relative rankings based on the training set rather than true probabilities. Biases in the training set, which will almost certainly always exist in astronomy, result in "probabilities" that are not properly calibrated.] These "probabilities" are highly useful because they allow us to tune the purity and completeness of the final model, and they can be accessed for some (not all) sklearn models using the .predict_proba() function.
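
As a brief illustration (a sketch only: RFmod is the optimized model trained in Problem 4a, and the 0.8 threshold below is an arbitrary example):

val_proba = RFmod.predict_proba(val_Xnorm)
print(RFmod.classes_)     # column order of the class "probabilities" (alphabetical)
print(val_proba[:5])      # tree-vote fractions for the first 5 validation sources

# a conservative star sample: require > 80% of the trees to vote STAR
conservative_star = val_proba[:, 1] > 0.8
print("{:d} of {:d} validation sources pass the 0.8 cut".format(int(conservative_star.sum()), len(val_y)))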

The model output is tuned via the ROC curve (or the somewhat related precision-recall curve), which plots the true positive rate (TPR) against the false positive rate (FPR):

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}},$$

where $\mathrm{TP}$ is the number of true positives, $\mathrm{FN}$ is the number of false negatives, $\mathrm{FP}$ is the number of false positives, and $\mathrm{TN}$ is the number of true negatives. The ideal model has high TPR and low FPR. The ROC curve comes from varying the classification decision threshold from 1 to 0.
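
For a concrete sense of these quantities, the confusion matrix from Problem 4b can be unpacked directly (treating STAR as the positive class, which matches the row/column order of cm):

TN, FP, FN, TP = cm.ravel()   # rows and columns are ordered GALAXY, STAR
TPR = TP / (TP + FN)          # here 1567 / (1567 + 23) ~ 0.986
FPR = FP / (FP + TN)          # here 64 / (64 + 1346) ~ 0.045
print("At the default decision threshold: TPR ~{:.3f}, FPR ~{:.3f}".format(TPR, FPR))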

Once again, scikit-learn comes to the rescue by making it easy to calculate ROC curves with roc_curve in the sklearn.metrics module.

Problem 4d Calculate and plot the ROC curve for stars as determined by the validation set.

Hint - for this problem, you'll want to recast y as an integer array, so that the definition of TP and FP is unambiguous.


In [170]:
from sklearn.metrics import roc_curve

train_yInt = np.zeros(len(train_y), dtype = int)
train_yInt[train_y == 'STAR'] = 1
val_yInt = np.zeros(len(val_y), dtype = int)
val_yInt[val_y == 'STAR'] = 1

RFmod = RandomForestClassifier(n_estimators=300, max_features=3, min_samples_leaf=10)
RFmod.fit(train_Xnorm, train_yInt)
val_predsProba = RFmod.predict_proba(val_Xnorm)

fpr, tpr, thresh = roc_curve(val_yInt, val_predsProba[:,1])

plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')


Out[170]:
<matplotlib.text.Text at 0x11b292650>

The ROC curve should come very close to (0, 1), which is the sign of a good model (and with ~97% accuracy we already knew we had a pretty good model). We will now explore the results of thresholding. Suppose you are searching for supernovae in a time domain survey, and it is essential that you examine transients in as many galaxies as possible. While searching, there will be many false positives in the form of stars; thus, you need a model that rejects stars (high TPR) with a very low FPR (i.e., one that does not misclassify galaxies).

Problem 4e Determine the classification threshold and TPR corresponding to a FPR = 0.01.


In [172]:
fpr01_idx = (np.abs(fpr-0.01)).argmin()

tpr01 = tpr[fpr01_idx]
thresh01 = thresh[fpr01_idx]

print("At FPR = 0.01, the TPR = {:.3f} corresponding to decision threshold = {:.3f}".format(tpr01, thresh01))


At FPR = 0.01, the TPR = 0.434 corresponding to decision threshold = 0.985
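
To close the loop, here is a short, purely illustrative sketch of applying this threshold in the supernova-search scenario described above:

# sources whose star "probability" falls below the threshold form the galaxy sample
gal_candidates = val_predsProba[:, 1] < thresh01

frac_gal_retained = np.sum(gal_candidates & (val_yInt == 0)) / np.sum(val_yInt == 0)
print("{:d} of {:d} validation sources pass the galaxy cut".format(int(gal_candidates.sum()), len(val_yInt)))
print("Fraction of true galaxies retained ~{:.3f}".format(frac_gal_retained))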

That's it! You now have the basic outline necessary to create a full-blown, publication-ready machine-learning model.

Now, before you run off importing sklearn left and right and re-writing all your code to use machine learning, remember: you should always, always, always, always (how many times can I copy paste the word "always"?), always be extremely concerned about the bias present in your training set. (Even after the paper is accepted, the community loves it, and you've discovered some new great thing - you should be worried about the training set.) In my (AAM) personal experience, the vast majority of my time as a machine-learning practitioner is spent chasing down unusual things in the training set. So, proceed with caution.

If you have time - try the challenge problem below, which should reinforce my words of caution.

Challenge Problem) Taking the Plunge

Applying the model to field data

QSOs are unresolved sources that look like stars in optical imaging data. We will now download photometric measurements for 10k QSOs from SDSS and see how accurate the RF model performs for these sources.


In [184]:
QSOquery = """SELECT TOP 10000 
             p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r, 
             p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r, 
             s.class
             FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
             WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class = 'QSO'
             ORDER BY s.specobjid ASC
               """
QSOts = SDSS.query_sql(QSOquery)

Challenge 1 Calculate the accuracy with which the model classifies QSOs based on the 10k QSOs selected with the above command. How does that accuracy compare to that estimated by the validation set?


In [185]:
qso_X = np.array(QSOts[feats].to_pandas())
qso_y = np.ones(len(QSOts))    # we are defining QSOs as stars for this exercise
qso_Xnorm = qso_X[:,0][:,np.newaxis] - qso_X[:,1:]

qso_preds = RFmod.predict(qso_Xnorm)

print("The RF model correctly classifies ~{:.4f} of the QSOs".format(accuracy_score(qso_y, qso_preds)))


The RF model correctly classifies ~0.8763 of the QSOs

Challenge 2 Can you think of any reasons why the performance would be so much worse for the QSOs than it is for the stars? Can you obtain a ~0.97 accuracy when classifying QSOs?


In [189]:
# As discussed above, low-z AGN have resolved host galaxies which will confuse the classifier, 
# this can be resolved by only selecting high-z QSOs (z > 1.5)

QSOquery = """SELECT TOP 10000 
             p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r, 
             p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r, 
             s.class
             FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
             WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1 AND s.class = 'QSO'
             AND s.z > 1.5
             ORDER BY s.specobjid ASC
               """
QSOts = SDSS.query_sql(QSOquery)

qso_X = np.array(QSOts[feats].to_pandas())
qso_y = np.ones(len(QSOts))    # we are defining QSOs as stars for this exercise
qso_Xnorm = qso_X[:,0][:,np.newaxis] - qso_X[:,1:]

qso_preds = RFmod.predict(qso_Xnorm)

print("The RF model correctly classifies ~{:.4f} of the QSOs".format(accuracy_score(qso_y, qso_preds)))


The RF model correctly classifies ~0.9898 of the QSOs

Uber Challenge Problem) The 3-class problem

Uber Challenge Create a 3 class RF model that separates stars, galaxies, and QSOs in SDSS imaging data. Limit the size of your training set to $< 25000$ sources, but that's it as far as suggestions and hints are concerned. Can you achieve a classification accuracy around 97%?


In [ ]:
# build your 3-class model here
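
One possible, deliberately hedged starting point is sketched below. It simply reuses the query pattern and the normalization, split, and tuning parameters developed above; none of these choices (25000 sources ordered by objid, the Problem 3c tuning parameters, a 70-30 split) have been re-optimized for the 3-class problem, so treat this as a sketch rather than a solution.

uber_query = """SELECT TOP 25000 
             p.psfMag_r, p.fiberMag_r, p.fiber2Mag_r, p.petroMag_r, 
             p.deVMag_r, p.expMag_r, p.modelMag_r, p.cModelMag_r, 
             s.class
             FROM PhotoObjAll AS p JOIN specObjAll s ON s.bestobjid = p.objid
             WHERE p.mode = 1 AND s.sciencePrimary = 1 AND p.clean = 1
             ORDER BY p.objid ASC
               """
uber_ts = SDSS.query_sql(uber_query)

uber_X = np.array(uber_ts[feats].to_pandas())
uber_y = np.array(uber_ts['class'])
uber_Xnorm = uber_X[:, 0][:, np.newaxis] - uber_X[:, 1:]   # normalize relative to psfMag_r, as before

u_train_X, u_val_X, u_train_y, u_val_y = train_test_split(uber_Xnorm, uber_y, test_size = 0.3, random_state = rs)

uber_RF = RandomForestClassifier(n_estimators = 300, max_features = 3, min_samples_leaf = 10, n_jobs = -1)
uber_RF.fit(u_train_X, u_train_y)

print("The 3-class model has accuracy ~{:.4f}".format(accuracy_score(u_val_y, uber_RF.predict(u_val_X))))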