In [ ]:

    
from __future__ import division, print_function, absolute_import, unicode_literals
import numpy as np
from astropy.table import Table
import matplotlib.pyplot as plt
import astropy.units as u
import astropy.coordinates as coords
from ztf_summerschool import source_lightcurve
import shelve
%matplotlib inline

An **essential** note in preparation for this exercise. We will use scikit-learn to provide classifications of the PTF sources that we developed on the first day of the summer school. Calculating the features for these light curves can be computationally intensive and may require you to run your computer for several hours. It is essential that you complete this portion of the exercise prior to Friday afternoon. Or, in other words, this is homework for Thursday night.

Fortunately, there is an existing library that will calculate all the light curve features for you. It is not included in the anaconda python distribution, but you can easily install the library using pip, from the command line (i.e. outside the notebook):

pip install FATS

After a short download, you should have the FATS (Feature Analysis for Time Series) library loaded and ready to use.

Note within a note The FATS library is not compatible with Python 3, thus is you are using Python 3 feel free to ignore this and we will give you an array with the necessary answers.

Hands-On Exercise 6: Building a Machine Learning Classifier to Identify RRL and QSO Candidates Via Light Curves

Version 0.1

We have spent a lot of time discussing RR Lyrae stars and QSOs. The importance of these sources will not be re-hashed here, instead we will jump right into the exercise.

Today, we will measure a large number of light curve features for each PTF source with a light curve. This summer school has been dedicated to the study of time-variable phenomena, and today, finally, everything will come together. We will use machine learning tools to classify variable sources.

Problem 1) Calculate the features

The training set for today has already been made. The first step is to calculate the features for the PTF sources we are hoping to classify. We will do this using the FATS library in python. The basic steps are simple: the light curve, i.e. time, mag, uncertainty on the mag, is passed to FATS, and features are calculated and returned. Prior to calculating the features, FATS preprocesses the data by removing $5\sigma$ outliers and observations with anomolously large uncertainties. After this, features are calculated.

We begin by reading in the data from the first day.



In [ ]:

    
shelf_file = " " # complete the path to the appropriate shelf file here
shelf = shelve.open(shelf_file)

shelf.keys()

Part A - Calculate features for an individual source

To demonstrate how the FATS library works, we will begin by calculating features for the source with $\alpha_{\rm J2000} = 312.23854988, \delta_{\rm J2000} = -0.89670553$. The data structure for FATS is a little different from how we have structured data in other portions of this class. In short, FATS is looking for a 2-d array that contains time, mag, and mag uncertainty. To get the required formatting, we can preprocess the dats as follows:

import FATS
[mag, time, error] = FATS.Preprocess_LC(lc_mag, lc_mjd, lc_magerr).Preprocess()

where the result from this call is a 2d array ready for feature calculations. lc_mag, lc_mjd, and lc_magerr are individual arrays for the source in question that we will pass to FATS.

Problem A1 Perform preprocessing on the source with $\alpha_{\rm J2000} = 312.23854988, \delta_{\rm J2000} = -0.89670553$ from the shelf file, then plot the light curve both before and after preprocessing using different colors to see which epochs are removed during preprocessing.

Hint - this won't actually affect your code, because FATS properly understands NumPy masks, but recall that each source in the shelf file has a different mask array, while none of the MJDs in that file have a mask.



In [ ]:

    
import FATS

reference_catalog = '../data/PTF_Refims_Files/PTF_d022683_f02_c06_u000114210_p12_sexcat.ctlg'
outfile = reference_catalog.split('/')[-1].replace('ctlg','shlv')

lc_mjd, lc_mag, lc_magerr = source_lightcurve("../data/"+outfile, # complete

[mag, time, error] = FATS.Preprocess_LC(  # complete

plt.errorbar( # complete
plt.errorbar( # complete

Problem A2 What do you notice about the points that are flagged and removed for this source?

Answer type your answer here

This particular source shows the (potential) danger of preprocessing. Is the "flare" from this source real, or is it the result of incorrectly calibrated observations? In practice, determining the answer to this question would typically involve close inspection of the actual PTF image where the flare is detected, and possibly additional observations as well. In this particular case, given that there are other observations showing the decay of the flare, and that $(g - r) = 1.7$, which is consistent with an M-dwarf, the flare is likely real. In sum, preprocessing probably would lead to an incorrect classification for this source. Nevertheless, for most applications it is necessary to produce reasonable classifications.

Part B - Calculate features and check that they are reasonable

Now we will focus on the source with $\alpha_{\rm J2000} = 312.50395, \delta_{\rm J2000} = -0.70654$ , to calculate features using FATS. Once the data have been preprocessed, features can be calcualted using the FeatureSpace module (note - FATS is designed to handle data in multiple passbands, and as such the input arrays passed to FATS must be specified):

lc = np.array([mag, time, error])
feats = FATS.FeatureSpace(Data=['magnitude', 'time', 'error']).calculateFeature(lc)

Following these commands, we now have an object feats that contains the features of our source. As there is only one filter with a light curve for this source, FATS will not be able to calculate the full library of features.

Problem B1 Preprocess the light curve for the source at $\alpha_{\rm J2000} = 312.50395, \delta_{\rm J2000} = -0.70654$, and use FATS to calculate the features for this source.



In [ ]:

    
lc_mjd, lc_mag, lc_magerr = # complete
# complete
# complete

lc = # complete
feats = FATS.FeatureSpace( # complete

The features object, feats can be retrieved in three different ways: dict returns a dictionary with the feature names and their correpsonding values, array returns the feature values, and features returns an array with the names of the individual features.

Execute the cell below to determine how many features there are, and to examine the features calculated by FATS.



In [ ]:

    
# execute this cell
print('There are a total of {:d} features for single band LCs'.format(len(feats.result(method='array'))))

print('Here is a dictionary showing the features:')

feats.result(method='dict')

For now, we will ignore the precise definition of these 59 (59!) features. But, let's focus on the first feature in the dictionary, Amplitude, to perform a quick check that the feature calculation is proceeding as we would expect.

Problem B2 Plot the light curve the source at $\alpha_{\rm J2000} = 312.50395, \delta_{\rm J2000} = -0.70654$, and check if the amplitude agrees with that calculated by FATS.

Note - the amplitude as calculated by FATS is actually the half amplitude, and it is calculated by taking half the difference between the median of the brightest 5% and the median of the faintest 5% of the observations. A quick eyeball test is sufficient.



In [ ]:

    
plt.errorbar( # complete

Now, let's check one other feature to see if these results make sense. The best fit lomb-scargle period is stored as PeriodLS in the FATS feature dictionary. The feature period_fit reports the false alarm probability for a given light curve. This sources has period_fit $\sim 10^{-6}$, so it's fairly safe to say this is a periodic variable, but this should be confirmed.

Problem B3 Plot the phase folded light curve the source at $\alpha_{\rm J2000} = 312.50395, \delta_{\rm J2000} = -0.70654$, using the period determined by FATS.



In [ ]:

    
plt.errorbar( # complete

Does this light curve look familiar at all? Why, it's our favorite star!

Part C - Calculate features for all PTF sources with light curves

Finally, and this is the most important portion of this exercise, we need to calculate features for all of the PTF sources for which we have light curves. Essentially, a for loop needs to be created to cover every light curve in the shelf file, but there are a few things you must keep in mind:

It is essential that the features be stored in a sensible way that can be passed to scikit-learn. Recall that features are represented as 2d arrays where each row corresponds to one source and each column corresponds to a given feature.
Finally, if you can easily figure it out, it would be good to store your data in a file of some kind so the features can be easily read by your machine in the future.

Problem C1 Measure features for all the sources in the PTF field that we have been studying this week. Store the results in an array Xfeats that can be used by scikit-learn.

Note - FATS will produce warnings for every single light curve in the loop, which results in a lot of output text. Thus, we have employed %%capture at the start of this cell to supress that output.

Hint - you may find it helpful to include a progress bar since this loop will take $\sim$2 hr to run. See the Making_a_Lightcurve notebook for an example. This is not necessary, however.



In [ ]:

    
%%capture
# not too many hints this time
Xfeats = # complete or incorporate elsewhere
# for loop goes here

# 2 lines below show an example of how to create an astropy table and then save the feature calculation as a csv file
#    -- if you use these lines, be sure that the variable names match your for loop
#feat_table = Table(Xfeats, names = tuple(feats.result(method='features')))
#feat_table.write('PTF_feats.csv', format='csv')

Problem 2) Build the machine learning model

In many ways, the most difficult steps are now complete. We will now build a machine learning model (one that is significantly more complicated than the model we built yesterday, but the mechanics are nearly identical), and then predict the classification of our sources based on their light curves.

The training set is stored in a csv file that you have already downloaded: ../data/TS_PTF_feats.csv. We will begin by reading in the training set to an astropy Table.



In [ ]:

    
ts = Table.read("../data/TS_PTF_feats.csv")

ts

As is immediately clear - this dataset is more complicated than the one we had yesterday. We are now calcualting 59 different features to characterize the light curves. 59 is a large number, and it would prove cumbersome to actually plot histograms for each of them. Also, if some of the features are uniformative, which often happens in problems like this, plotting everything can actually be a waste of time.

Part A - Construct the Random Forest Model

We will begin by constructing a random forest model from the training set, and then we will infer which features are the most important based on the rankings provided by random forest. [Note - this was the challenge problem from the end of yesterday's session. Refer to that if you do not know how to calculate the feature importances.]

Problem A1 Construct a random forest using the training set, and determine the three most important features as measured by the random forest.

Hint - it may be helpful to figure out the indicies corresponding to the most important features. This can be done using np.argsort() which returns the indicies associated with the sorted argument to the call. The sorting goes from smallest number to highest. We want the most important features, so the indicies corresponding to that can be obtained by using [::-1], which flips the order of a NumPy array. Thus, you can obtain indicies sorting the features from most important to least important with the following command:

np.argsort(RFmod.feature_importances_)[::-1]

This may, or may not depending on your approach, help you identify the 3 most important features.



In [ ]:

    
# the trick here is to remember that both the features and the class labels are included in ts

y = # complete
X = # complete
# complete
# complete

from sklearn.ensemble import RandomForestClassifier
RFmod = RandomForestClassifier(n_estimators = 100)
RFmod.fit( # complete

feat_order = np.array(ts.colnames)[np.argsort(RFmod.feature_importances_)[::-1]]

print('The 3 most important features are: {:s}, {:s}, {:s}'.format( # complete

As before, we are going to ignore the meaning of the three most important features for now (it is highly recommended that you check out Nun et al. 2015 to learn the definition of the features, but we don't have time for that at the moment).

To confirm that these three features actually help to separate the different classes, we will examine the histogram distributions for the three classes and these three features.

Problem A2 Plot the histogram distribution of each class for the three most important features, as determined by random forest.

Hint - this is very similar to the histogram plots that were created yesterday, consult your answer there for helpful hints about weighting the entries so that the sum of all entries = 1. It also helps to plot the x-axis on a log scale.



In [ ]:

    
Nqso = 
Nrrl = 
Nstar = 

for feat in feat_order[0:3]:
    plt.figure()
    plt.hist(  # complete
        # complete
        # complete
    plt.xscale('log')
    plt.legend(fancybox = True)

Part B - Evaluate the accuracy of the model

Like yesterday, we are going to evaluate how well the model performs via cross validation. While we have looked at the feature distribution for this data set, we have not closely examined the classes thus far. We will begin with that prior to estimating the accuracy of the model via cross validation.

Recall that a machine learning model is only as good as the model training set. Since you were handed the training set without details as to how it was constructed, you cannot easily evaluate whether or not it is biased (that being said, the training set is almost certainly biased). We can attempt to determine if this training set is representative of the field, however. In particular, yesterday we constructed a model to determine whether sources were stars, RRL variables, or QSOs based on their SDSS colors. This model provides a rough, but certainly not perfect, estimate of the ratio of these three classes to each other.

Problem B1 Calculate the number of RRL, QSOs, and stars in the training set and compare these numbers to the ratios for these classes as determined by the predictions from the SDSS colors-based classifier that was constructed yseterday.



In [ ]:

    
print('There are {:d} QSOs, {:d} RRL, and {:d} stars in the training set'.format(# complete

While the ratios between the classes are far from identical to the distribution we might expect based on yesterday's classification, it is fair to say that the distribution is reasonable. In particular, there are more QSOs than RRL, and there are significantly more stars than either of the other two classes. Nevertheless, the training set is not large. Note - large in this sense is very relative, if you are searching for something that is extremely rare, such as the EM counterpart to a gravitational wave event, a training set of 2 might be large, but there are thousands of known QSOs and RRL, and a $\sim$billion known stars. Thus, this training set is small.

Problem B2 Do you think the training set is representative of the diversity in each of the three classes?

Answer type your response to this question here

At this point, it should be clear that the training set for this exercise is far from perfect. But guess what? That means this training set has something in common with every other astronomical study that utilizes machine learning. At the risk of beating the horse well beyond the grave, it is important to highlight once again that building a training set is extremely difficult work. There are almost always going to be (known and unknown) biases present in any sample used to train a model, and as a result predicted accuracies from cross validation are typically going to be over-optimistic. Nevertheless, cross validation remains one of the best ways to quantitatively examine the quality of a model.

Problem B3 Determine the overall accuracy of the time-domain machine learning model using 5-fold cross validation.



In [ ]:

    
from sklearn import cross_validation  # recall that this will only work with sklearn v0.16+
RFmod = RandomForestClassifier( # complete
cv_accuracy = # complete

print("The cross-validation accuracy is {:.1f}%%".format(100*np.mean(cv_accuracy)))

As noted earlier - this accuracy is quite high, and it is likely over-optimistic. While the overall accuracy provides a nice summary statistic, it is always useful to know where the classifier is making mistakes. As noted yesterday, this is most easily accomplished with a confusion matrix.

Problem B4 Plot the normalized confusion matrix for the new time-domain classifier. Identify the most common error made by the classifier.

Hint - you will (likely) need to run cross-validation again so you can produce cross-validated class predictions for each source in the training set.



In [ ]:

    
from sklearn.metrics import confusion_matrix

y_cv_preds = # complete
cm =  # complete

plt.imshow( # complete
plt.colorbar()

plt.ylabel( # complete
plt.xlabel( # complete
plt.tight_layout()

Problem 3) Use the model to identify RRL and QSO candidates in the PTF field

Now, virtually all of the hard work is done, we simply need to make some final predictions on our dataset.

Part A - Apply the machine learning model

Problem A1 Using the features that you calculated in Problem 1, predict the class of all the PTF sources.



In [ ]:

    
RFmod = # complete
# complete
PTF_classes = # complete

Problem A2 Determine the number of candidate sources belonging to each class.



In [ ]:

    
print('There are {:d} candidate QSOs, {:d} candidate RRL, and {:d} candidate stars.'.format(Nqso_cand, Nrrl_cand, Nstar_cand))

These numbers are quite different from what we found yesterday.

Problem A3 Can you identify any potential reasons why the number of sources in each class would be so different from yesterday?

Problem 4 - Challenge

If you finish early, work on the following problem, or continue working on this as homework for this evening.

Challenge Problem 1 Download some (publically available) PTF data, build a machine learning model, and write a scientific paper that summarizes your findings.

Hint - you probably need more than 2 hours to finish this problem, but if you have an idea, you should give it a try.

References

[1] Nun et al. 2015, arXiv:1506.00010