In [ ]:
from __future__ import division, print_function, absolute_import, unicode_literals
import numpy as np
from astropy.table import Table
import matplotlib.pyplot as plt
%matplotlib inline
A quick note in preparation for this exercise: while it is possible to solve all the problems in this notebook using scikit-learn v0.14 and above, it is recommended to have v0.16 or later installed. This will make your life a little easier. To test which version of scikit-learn you have, you can run the following commands (in Python):
import sklearn
sklearn.__version__
If you are running an older version, and you have installed the anaconda python distribution, then you can update your software by entering the following in a terminal:
conda update scikit-learn
Yesterday we saw the importance of RR Lyrae stars, and earlier today we heard about active galactic nuclei (AGN). The most luminous AGN, often referred to as quasars (or QSOs), can be observed almost to the edge of the visible universe (the most distant known QSO is at $z \approx 7$). They are powered via accretion onto supermassive black holes. Quasars are also high-amplitude variables, meaning they can be readily identified in time-domain surveys. QSO variability has been studied in great detail, and our goal today is to identify both RRL and QSOs using SDSS color information.
Unlike yesterday, our methodology will rely on the use of machine learning techniques. These methods are more flexible than the hard cuts that were placed on the data set yesterday, and as a result, more sophisticated models to separate RRL, QSOs, and normal stars can be developed.
By AA Miller (c) 2016 Jul 14
Today we will be working on a supervised-learning problem: using a training set (sources with known labels, i.e., sources that have been confirmed as normal stars, QSOs, or RRL), we will train a model to classify new observations where we do not know the source label. [This is very different from unsupervised learning, where a model is applied to data without labels and clusters are identified in the multi-dimensional feature space.]
The training set for today's exercise has been curated already, and it is located in ~/data/colors_training_set.csv. The features that we will be using to build the classification model are the same as yesterday, namely the de-reddened SDSS colors. [Recall that features are the measured properties (these can be either categorical or real-valued) that the machine-learning model uses to map from measured information to a final classification. As we have four colors, we will be performing classification in a 4-D space. Tomorrow we will work on a problem with a much higher number of dimensions.]
The first step when pursuing a machine-learning problem is to examine the potential training set. A machine-learning model is only as good as its training set. This point cannot be emphasized enough. Machine-learning models are data-driven; they do not capture any physical theory, and thus it is essential that the training set satisfy several criteria. Two of the most important criteria for a good training set are that it be (i) unbiased and (ii) representative of the field population.
As a first step (and this is always a good idea), we are going to examine the training set to see if anything suspicious is going on. There are many (many, many) ways to read and manipulate comma-separated value (.csv) files in Python. To remain consistent with our earlier work, we will read the data into an astropy Table.
In [ ]:
# execute this cell
ts = Table.read("../data/colors_training_set.csv")
ts
Notice that the IPython notebook has a nice interface for the table data, with our features listed in an easy-to-read format. astropy tables have a couple of especially convenient properties. One is that we can select any individual feature by its column name:
ug = ts["dered_ug"]
Furthermore, it is also possible to slice through the entire table using conditional (boolean) arrays. For instance, suppose we only want the features for the normal stars in the training set; we can access those in the following way:
stars_cond = (ts["source_class"] == 'star')
stars_dat = ts[stars_cond]
There are plenty of additional conveniences for manipulating the data in an astropy table, but we shouldn't need them to complete this problem.
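For instance, although we will not need them below, here is a quick sketch of a couple of those conveniences (assuming the ts table and the stars_dat selection defined above):
print(ts.colnames)                          # every column (feature) available in the table
print(stars_dat["dered_ug", "dered_gr"])    # just two of the color columns, for the normal stars only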
Our goal now, to the degree that this is possible, is to evaluate whether the training set is unbiased and representative of the field population. Given that you do not know the details of how the training set was constructed, it is more or less impossible for you to evaluate any bias. [In short, the answer is that the training set is biased because the sources are selected from a subset of SDSS spectroscopic observations. The details of SDSS spectroscopic observations are beyond the scope of this exercise, but the SDSS selection of spectroscopic targets is biased.] We will assume that the training set is unbiased (for our purposes this assumption is okay). Now we need to test whether the training set is representative of the field. In any wide-field survey, the number of normal stars will greatly exceed the number of RRL, while relatively shallow surveys, such as PTF, will detect far more stars than QSOs.
Problem A1 Determine the total number of normal stars, QSOs, and RRL in the training set, and confirm that the training set is dominated by stars.
In [ ]:
Nstar = sum(ts[
Nqso =
Nrrl =
print("There are {:d} stars, {:d} QSOs, and {:d} RRL.".format( # complete
As stated earlier, a machine-learning model is only as good as its training set. We now need to confirm that our features accurately separate stars from QSOs and RRL. In data sets with an exceptionally large number of features this can be difficult. As we have only four features, we can actually use histograms to evaluate the distribution of the features as a function of class.
Problem B1 Plot the histogram distribution of $(u - g)_0$ colors for the QSOs, RRL, and normal stars.
Hint - it will be easiest to see the different histograms by setting histtype = 'step' within the call to plt.hist().
Second hint - since the number of sources is very different for each class, it helps to normalize each histogram such that the sum of all bars = 1. To do this, use normed = True in the call to plt.hist().
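Putting both hints together, a single call for one class might look like the sketch below (the bin choice here is arbitrary, and note that in recent versions of matplotlib the normed keyword has been replaced by density):
stars = ts["source_class"] == "star"
plt.hist(ts["dered_ug"][stars], bins=50, histtype="step", normed=True, label="star")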
In [ ]:
# complete the code below
plt.hist(ts["dered_ug"][ # complete
plt.xlabel( # complete
plt.legend(fancybox = 'True')
Do you see a difference between the three classes?
Now let's examine the three remaining features.
Problem B2 For the 3 remaining features: $(g - r)_0$, $(r - i)_0$, $(i - z)_0$, plot the histogram distributions for QSOs, RRL, and normal stars.
In [ ]:
# complete the code below
plt.figure()
plt.hist( # complete
# complete
plt.xscale( # complete
plt.legend(fancybox = 'True')
plt.figure()
plt.hist( # complete
# complete
plt.xscale( # complete
plt.legend(fancybox = 'True')
plt.figure()
plt.hist( # complete
# complete
plt.xscale( # complete
plt.legend(fancybox = 'True')
Problem B3 Based on these histograms, do you think it will be possible to separate the QSOs, RRL, and normal stars? From the histograms, which feature do you think is going to be the most important for determining which class a given source belongs to? (Remember your answer to these questions for later.)
Yesterday we found that RRL stars live in a special confined region of the color-color diagram. For feature sets that are extremely large, plotting (and examining) histograms or 2D feature distributions can be quite cumbersome. This data set only has 4 features, and we also know that the $(u - g)_0$, $(g - r)_0$ color-color diagram is useful for separating different types of sources. Thus, we will now explore where the data live within that color-color diagram.
Problem C1 Plot the $(u - g)_0$ and $(g - r)_0$ color-color diagram (CC diagram). Be sure to highlight normal stars, QSOs, and RRL with different colors.
Hint - this will work best if you plot normal stars first. Also, set the marker edge color to nothing to lighten up the plot, mec = "None", and consider using the alpha channel to highlight the density of different source types.
In [ ]:
plt.plot( # complete
# complete
plt.xlabel( # complete
plt.ylabel( # complete
plt.legend(loc = 4, numpoints = 1, fancybox = 'True')
Notice that while there is significant overlap between the three classes in this 2D space, there are clearly defined regions where the individual classes dominate.
Problem C2 At this stage, would you like to reconsider your answer to Problem B3?
We will utilize the scikit-learn library to develop our machine-learning model. While there are many different machine-learning packages available, the scikit-learn implementation of Random Forest is very fast and (relatively) straightforward.
As a brief reminder, Random Forest utilizes an ensemble of decision trees to provide final model predictions with low bias and low variance (see Breiman 2001). Each tree in the ensemble utilizes bagging, wherein a bootstrap sample of the training set is used to define the tree. At each node within the tree, a randomly selected subset of the total number of features is used to identify the best splitting criterion for the node. The total number of randomly selected features, Mtry, and the total number of trees in the ensemble, Ntree, are the two most important tuning parameters for Random Forest. While additional tuning parameters exist, and can be important for some problems, today we will focus only on Mtry and Ntree. This has been just a brief introduction to the algorithm; we recommend reading additional resources (e.g., Breiman 2001, Hastie et al., Richards et al., Brink et al.) if you intend to use Random Forest in your own research.
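For reference, in scikit-learn these tuning parameters map onto keyword arguments of the classifier: Ntree corresponds to n_estimators and Mtry to max_features. A minimal sketch (the particular values below are arbitrary):
from sklearn.ensemble import RandomForestClassifier
# Ntree = 50 trees in the ensemble; Mtry = 2 features considered at each split
rf = RandomForestClassifier(n_estimators=50, max_features=2)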
The scikit-learn implementation of Random Forest, like almost every machine-learning model offered in the package, is simple and straightforward. The basic steps are as follows: first the model is initiated (typically as an object upon which further operations will be applied), then the model is "trained" (which is accomplished using the .fit() method), and finally the model can be applied to new data (using .predict()). There are several additional bells and whistles, and we'll get into those later, but this covers all the basics of model construction.
To construct the model, the training data must include the features and source classes separately. The features are input as a 2-dimensional array, X, where each row corresponds to a different source and each column corresponds to a different feature. The classes are a 1-D array, y, with each entry matching the features in the corresponding row of X. We will now construct the feature and class arrays.
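As a minimal, self-contained sketch of the initiate/fit/predict pattern and of the expected shapes of X and y, here is the workflow on a tiny made-up data set (the numbers and labels below are purely illustrative and have nothing to do with our training set):
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_toy = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])  # 4 sources x 2 features
y_toy = np.array(["star", "qso", "star", "qso"])                    # one label per row of X_toy

toy_model = RandomForestClassifier(n_estimators=10)   # initiate
toy_model.fit(X_toy, y_toy)                           # train
print(toy_model.predict([[0.15, 0.15]]))              # predict the class of a new source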
Problem A1 Create a feature array X and a class array y corresponding to the color features and source class from Problem 1.
Hint - we do not want you to get stuck on this part of the problem. If you do not figure this out in less than 10 min, scroll to the bottom for the answer for making X. Recall that y is just an array with the classification for each source.
In [ ]:
features = ['dered_ug', 'dered_gr', 'dered_ri', 'dered_iz']
X = np.empty( # complete with dimensions of X
for featnum, feat in enumerate(features):
# complete
y = np.array( # complete
print(X)
print(y)
Double check the structure of X and y to make sure they conform to what you expect.
Now that the data is in the proper format, we are going to initiate the Random Forest. This is done using RandomForestClassifier from the sklearn.ensemble library in Python:
from sklearn.ensemble import RandomForestClassifier
RFmod = RandomForestClassifier(n_estimators = 25)
Notice that we have set n_estimators = 25, or in other words Ntree = 25. Otherwise we will be using the default values for the model, but do read the documentation to see all the possible tuning parameters.
Problem A2 Initiate a RF classifier, and train it using the training data from above. (Also - be amazed at how quickly the model is fit.)
Hint - this is easy, there isn't a magic trick here. Just don't forget to fit the model to the training data.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
RFmod = # complete
RFmod.fit( # complete
You have now fit a RF model. While that task is now completed, we do not recommend going home, as there is still additional analysis to be performed. For instance, while we have fit the model, we have not yet made any predictions for the sources in our PTF field of interest, and, perhaps more importantly, we have not made any evaluation of the quality of our model.
One consequence of using RF is that estimates of the accuracy of the model are generated "for free." Each tree in the forest is constructed using bagging, meaning a fraction of sources in the training set are excluded from each tree. These sources are referred to as being "out of bag." Once a tree is constructed, predictions for the out of bag sources can be made and compared to their true classes. This provides a measurement of the out-of-bag error. A crucial part of this assessment is that predictions are only being made for sources (temporarily) outside the training set.
The misclassification rate for sources that are included in a training set is referred to as the training error. The training error rarely provides a good prediction for how the model will generalize as it is applied to newly observed sources. To see why this is the case, consider the following example: given a complex dataset in a large N-dimensional space, I can construct a single decision tree that continually partitions the data until every final node in the tree is pure. This model will produce a training error of 0%. However, unless my training set is completely unbiased and perfectly representative of the field, there will be some (and possibly many) sources from the field for which this model fails. This is an example of over-fitting.
One way to protect against over-fitting is by minimizing the out-of-bag error. Since the out-of-bag sources are not in the training set, the model will have to generalize well to accurately predict the class of those sources. This point is so important, we will emphasize it again: it is essential that tests of model accuracy are performed using sources that are not included in the training set.
Measuring the out-of-bag accuracy is easy in scikit-learn: simply set oob_score = True when initiating the RF. Earlier, we said these estimates come for free. This is not precisely the case, as it requires time and computational power to predict the class of the out-of-bag sources, which is why the default is oob_score = False. In practice, RF model construction is much slower than making predictions, meaning the true overhead for calculating the out-of-bag error is small.
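As a sketch of how this looks in practice (on a synthetic data set from sklearn.datasets, so the accuracy below has nothing to do with our training set):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=500, n_features=4, random_state=0)
rf_demo = RandomForestClassifier(n_estimators=25, oob_score=True)
rf_demo.fit(X_demo, y_demo)
print("OOB accuracy = {:.3f}".format(rf_demo.oob_score_))   # OOB error = 1 - OOB accuracy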
Problem B1 Calculate the out-of-bag error for the training set from Part 1.
Hint - consult the scikit-learn Random Forest documentation if you have absolutely no idea how to do this. Also, note that the error = 1 - accuracy.
In [ ]:
RFmod = RandomForestClassifier( #complete
#complete
oob_error =
print('The out-of-bag error is {:.1f}%'.format(100*oob_error))
Nice work! You have built a model that is pretty accurate.
While the out-of-bag error is a nice feature of RF, there is a standard procedure for assessing model accuracy, which has the benefit of being applicable to any model, known as cross validation (CV). In CV, the training set is partitioned, and sources outside each individual partition are used to predict the class of sources within the partition. This procedure is then repeated for each partition, so that every source in the training set has exactly one prediction, allowing a measurement of the cross-validation error. The most common form of CV is known as K-fold, where the training set is divided into $k$ partitions of equal size. In order of increasing computational time, typical choices for $k$ are 2, 3, 10, or $N$, where $N$ is the total number of sources in the training set.
The cross_validation module in scikit-learn makes it easy to access the CV error and make CV predictions. The CV accuracy can be obtained via cross_validation.cross_val_score(), which requires the model, training features, and training classes as arguments. The default is $k = 3$ folds. Finally, note that this method uses Stratified K-folds, which is slightly different from the procedure described above.
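In code, that call might look like the sketch below (assuming the RFmod, X, and y defined earlier; note that in scikit-learn v0.18 and later this functionality lives in sklearn.model_selection instead):
from sklearn import cross_validation

cv_scores = cross_validation.cross_val_score(RFmod, X, y)   # defaults to k = 3 folds
print(cv_scores)          # one accuracy score per fold
print(cv_scores.mean())   # average CV accuracy; CV error = 1 - accuracy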
Problem B2 Determine the 10-fold CV error for the training set from part 1.
In [ ]:
from sklearn import cross_validation
cv_accuracy = cross_validation. #complete
cv_error =
print('The cross-validation error is {:.1f}%'.format(100*cv_error))
Now that we are familiar with the more general technique of cross-validation, let's determine which class we can most accurately classify.
Problem B3 Using cross_validation.cross_val_predict(), determine which class is easiest to classify for a model trained using only SDSS colors.
In [ ]:
y_cv_preds = # complete
star_acc =
qso_acc =
rrl_acc =
We just found that the classifier does a much better job classifying stars and RRL than it does for QSOs. The main reason for this is that high-redshift QSOs lie far outside the main QSO locus. These high-$z$ QSOs can be seen as the "tail" with $(u - g)_0 > 1$. In addition to knowing the accuracy for the individual classes, it is also useful to know class predictions for the misclassified sources, or in other words where there is "confusion" for the classifier. The best way to summarize this information is with a confusion matrix. In a confusion matrix, one axis shows the true class and the other shows the predicted class. For a perfect classifier all of the power will be along the diagonal, while confusion is represented by off-diagonal signal.
Like almost everything else we have encountered during this exercise, scikit-learn makes it easy to compute a confusion matrix. This can be accomplished with the following:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
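As a toy illustration of how to read the output (the labels below are made up and are not from our training set):
from sklearn.metrics import confusion_matrix

y_true_toy = ["star", "star", "star", "qso", "qso", "rrl"]
y_pred_toy = ["star", "star", "qso", "qso", "qso", "rrl"]
print(confusion_matrix(y_true_toy, y_pred_toy, labels=["qso", "rrl", "star"]))
# rows = true class, columns = predicted class; the single off-diagonal count
# is the one true star that was misclassified as a QSO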
Problem C1 Calculate the confusion matrix for the training set from Part 1.
In [ ]:
# compute the confusion matrix
print(cm)
From this representation, we see right away that most of the QSOs that are being misclassified are being scattered into the normal stars class. However, this representation could still be improved: it'd be helpful to normalize each value relative to the total number of sources in each class, and better still, it'd be good to have a visual representation of the confusion matrix, which will be readily digestible. Now let's normalize the confusion matrix.
Problem C2 Calculate the normalized confusion matrix. Be careful, you have to sum along one axis, and then divide along the other.
Anti-hint: This operation is actually straightforward using some array manipulation that we have not covered up to this point. Thus, we have performed the necessary operations for you below. If you have extra time, you should try to develop an alternate way to arrive at the same normalization.
In [ ]:
normalized_cm = cm.astype('float')/cm.sum(axis = 1)[:,np.newaxis]
normalized_cm
The normalization makes it easier to compare the classes, since each class has a different number of sources. Now we can proceed with a visual representation of the confusion matrix. This is best done using imshow() within pyplot. You will also need to plot a colorbar, and labeling the axes will also be helpful.
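The basic plotting pattern is sketched below for an arbitrary 2-D array (not the confusion matrix itself):
demo_matrix = np.random.rand(3, 3)
plt.imshow(demo_matrix, interpolation='nearest', cmap='Blues')
plt.colorbar()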
Problem C3 Plot the confusion matrix. Be sure to label each of the axes.
Hint - you might find the sklearn confusion matrix tutorial helpful for making a nice plot.
In [ ]:
plt.imshow(normalized_cm, interpolation = 'nearest', # complete
plt.ylabel( # complete
plt.xlabel( # complete
plt.tight_layout()
Nice work. You have now developed and evaluated a machine-learning model. The final step will be to apply the model to unobserved data.
Now that we have a model, and we know that the error rate from this model is ~9%, we can apply the model to the sources in our PTF field that have SDSS colors. After that, we will be able to assess how many RRL and QSO candidates are present in the field.
After a machine-learning model has been fit to training set data, it's possible to predict classes for "unseen" data using the .predict() method, which requires a set of features as input.
Problem A1 Load the SDSS colors for the PTF sources, and predict the class of these sources using a RF model with Ntree = 25.
In [ ]:
ptf_sdss_data = Table.read( # complete
ptf_feats = # complete
# complete
ptf_preds = # complete
Problem A2 How many candidate QSOs, RRL, and stars do you identify?
In [ ]:
print('There are {} QSO, {} RRL, and {} star candidates'.format( #complete
Finally, as a check to make sure our results are reasonable, plot a $(u - g)_0$, $(g - r)_0$ CC diagram showing the locations of sources classified as QSOs, RRL, or normal stars.
Problem A3 Plot a CC diagram of the PTF sources, showing the RF-predicted classifications. Compare this plot to the CC diagram of the training set generated earlier.
In [ ]:
plt.plot( # complete
# complete
plt.xlabel( # complete
plt.ylabel( # complete
plt.legend(loc = 4, numpoints = 1, fancybox = 'True')
If you finish early, work on the following problem, or continue working on this as homework for this evening.
Challenge Problem 1 Another nice feature of RF, which was not mentioned earlier, is that it enables a measurement of the relative importance of each of the features in a machine-learning model. In the classic formulation this is accomplished by randomly shuffling the values of a particular feature and examining the decrease in the model's overall accuracy (note that scikit-learn instead estimates importances from the decrease in node impurity, but the interpretation is similar). The relative feature importances can be accessed using the .feature_importances_ attribute of the trained classifier. Calculate the importance of each of the SDSS colors. Which feature is most important? Does this agree with your previous answer?
Hint - if you get stuck, this feature importance example basically gives away the answer.
In [ ]:
RFmod. #complete
While we achieved a CV error rate of $\sim 9\%$, it is possible that with the same observations we can still significantly improve performance. In particular, we only used a model with 25 trees, and default parameters otherwise.
Challenge Problem 2 Search for the optimal tuning parameters by varying Ntree and Mtry over a grid to minimize the CV error.
Hint - like (almost) everything else from today, scikit-learn has tools to help with a grid search over parameter space to derive an optimal model.
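One possible sketch uses the GridSearchCV helper (found in sklearn.grid_search for the scikit-learn versions assumed by this notebook, and in sklearn.model_selection from v0.18 onward); the grid values below are arbitrary, and X and y are the training arrays constructed earlier:
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {"n_estimators": [25, 100, 300],    # Ntree
              "max_features": [1, 2, 3, 4]}      # Mtry
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
print("CV error = {:.3f}".format(1 - grid.best_score_))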
In [ ]:
Challenge Problem 3 For each of the RRL candidates identified from the RF model, measure the best-fit period for the corresponding PTF light curves of those sources. Plot the phase-folded light curves of the RRL candidates. After this exercise, how many (and which) of these sources do you now think are genuine RRL stars? How does this list compare to the one from yesterday?
In [ ]:
Note - if you needed the hint to answer Problem 2, part A1, the code necessary to finish that portion of the notebook is given below.
In [ ]:
# answer to Problem 2, part A1
#features = ['dered_ug', 'dered_gr', 'dered_ri', 'dered_iz']
#X = np.empty((len(ts), len(features)))
#for featnum, feat in enumerate(features):
# X[:,featnum] = ts[feat]
#print(X)
#print(y)