Building a Supervised Machine Learning Model

The objective of this hands-on activity is to create and evaluate a Real-Bogus classifier using ZTF alert data. We will use the same data from the Day 2 clustering exercise (see that notebook to download the data).

  1. Load data
  2. Examine features
  3. Curate a test and training set
  4. Train classifiers
  5. Compare the performance of different learning algorithms

What's Not Covered

There are many relevant topics that we cannot cover due to time constraints. Omitted is any discussion of cross-validation and hyperparameter tuning; I encourage you to click through and read the sklearn articles on both.

0a. Imports

These are all the imports that will be used in this notebook. All should be available in the DSFP conda environment.


In [ ]:
import numpy as np
from sklearn.impute import SimpleImputer  # replaces the deprecated sklearn.preprocessing.Imputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

0b. Data Location

Please specify paths for the following:


In [ ]:
F_META = '../Day2/dsfp_ztf_meta.npy'
F_FEATS = '../Day2/dsfp_ztf_feats.npy'
D_STAMPS = '../Day2/dsfp_ztf_png_stamps'

1. Load Data

Please note that I have made some adjustments to the data.

  • I have created a dataframe (feats_df) from feats_np
  • I have also dropped columns from feats_df. Some were irrelevant to classification, and some contained a lot of NaNs.

In [ ]:
meta_np = np.load(F_META)
feats_np = np.load(F_FEATS)

COL_NAMES = ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap',
             'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky',
             'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio',
             'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg',
             'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig',
             'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']

# NOTE FROM Umaa: I've decided to eliminate the following features. Dropping them.
#
COL_TO_DROP = ['ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 
               'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 
               'classtar', 'ssdistnr', 'ssmagnr', 'aimagerat', 'bimagerat', 
               'magapbig', 'sigmagapbig', 'scorr']
feats_df = pd.DataFrame(data=feats_np, index=meta_np['candid'], columns=COL_NAMES)
feats_df.drop(columns=COL_TO_DROP, inplace=True)
print("There are {} columns left.".format(len(feats_df.columns)))
print("They are: {}".format(list(feats_df.columns)))
feats_df.describe()

In [ ]:
# INSTRUCTION: How many real and bogus examples are in this labeled set?
#
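# One possible sketch, assuming 'label' uses 1 for real and 0 for bogus
# (as in the plotting helper below):
n_real = int((feats_df['label'] == 1).sum())
n_bogus = int((feats_df['label'] == 0).sum())
print("Reals: {}, Boguses: {}".format(n_real, n_bogus))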

2. Plot Features

Examine each individual feature. You may use the subroutine below, or code of your own devising. Note which features are continuous vs. categorical, and which have outliers. Certain features have sentinel values. You may wish to view some features on a log scale.

Question: Which features seem to have pathologies? Which features should we exclude?


In [ ]:
# Histogram a Single Feature
#
def plot_rb_hists(df, colname, bins=100, xscale='linear', yscale='linear'):
    # Masks selecting the labeled real (label == 1) and bogus examples
    mask_real = df['label'] == 1
    mask_bogus = ~mask_real
    plt.hist(df[colname][mask_real], bins, color='b', alpha=0.5, density=True, label='real')
    plt.hist(df[colname][mask_bogus], bins, color='r', alpha=0.5, density=True, label='bogus')
    plt.xscale(xscale)
    plt.yscale(yscale)
    plt.title(colname)
    plt.legend()
    plt.show()

# INSTRUCTION: Plot the individual features.  
#
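# One possible sketch: plot every feature column with the helper above;
# switch xscale/yscale to 'log' for features with large ranges or outliers.
for col in feats_df.columns.drop('label'):
    plot_rb_hists(feats_df, col)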

Answer:

  • 'nbad' and 'nneg' are discrete features
  • 'seeratio' has a sentinel value of -999
  • Some features span much larger ranges than others. For classifiers that are sensitive to scaling, we will need to scale the data.
  • There does not appear to be good separability between real and bogus on any individual feature.

3. Curate a Test and Training Set

We need to reserve some of our labeled data for evaluation. This means we must split our labeled data into a set used for training (the training set) and a set used for evaluation (the test set). Ideally, the distributions of real and bogus examples in the training and test sets are roughly identical. One can use sklearn.model_selection.train_test_split with the stratify option, as sketched below.
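
For reference, a stratified random split might look like the sketch below; the 0.2 test fraction and random_state are only illustrative choices, and X / y are built here from feats_df.


In [ ]:
from sklearn.model_selection import train_test_split

# Stratify on the label so train and test keep the same real/bogus proportions.
X = feats_df.drop(columns=['label']).values
y = feats_df['label'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)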

For ZTF data, we split the training and test data by date. That way, repeat observations from the same night (which may be nearly identical) cannot end up in both the training and test sets and artificially inflate test performance. Also, because the survey objectives changed over time, it is possible that the test set features have drifted away from those of the training set.

The ZTF Night ID (nid) for each alert is provided in the metadata array (meta_np['nid']). Split on nid=550 (July 5, 2018). This should leave you with roughly 500 examples in your test set.


In [ ]:
feats_plus_label = np.array(feats_df)
nids = meta_np['nid']

# INSTRUCTION: meta_np['nid'] contains the Night ID (nid) for each labeled example.
# Split the data into separate data structures for train/test data at nid=550.
# Verify how many real examples end up in your test set.
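
# A possible sketch (variable names are assumptions, not the official solution):
# put nights up to nid=550 in the training set and later nights in the test set.
# The label is assumed to be the last column of feats_plus_label.
train_mask = nids <= 550
test_mask = ~train_mask
train_data = feats_plus_label[train_mask]
test_data = feats_plus_label[test_mask]
print("Train: {}, Test: {}, Test reals: {}".format(
    len(train_data), len(test_data), int(test_data[:, -1].sum())))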

4. Train a Classifier

Part 1: Separate Labels from the Features

Now store the labels separately from the features.


In [ ]:
# INSTRUCTION: Separate the labels from the features
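# A minimal sketch, assuming the train_data/test_data arrays from the previous
# step, with the label stored in the last column:
train_feats, train_labels = train_data[:, :-1], train_data[:, -1]
test_feats, test_labels = test_data[:, :-1], test_data[:, -1]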

Part 2. Scaling Data

With missing values handled, you're close to training your classifiers. However, because distance metrics can be sensitive to the scale of your data (e.g., some features span large numeric ranges while others don't), it is important to normalize the data to a standard range such as (0, 1), or with z-score normalization (scaling to zero mean and unit variance). Fortunately, sklearn makes this quite easy. Please review sklearn's preprocessing module options, specifically MinMaxScaler and StandardScaler (which performs z-score normalization), and implement one of them.

FYI - Neural networks and Support Vector Machines (SVM) are sensitive to the scale of the data. Decision trees (and therefore Random Forests) are not (but it doesn't hurt to use scaled data).


In [ ]:
# INSTRUCTION: Re-scale your data using either the MinMaxScaler or StandardScaler from sklearn
#
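
# A possible sketch using StandardScaler (z-score normalization). Fit on the
# training set only, then apply the same transformation to the test set:
scaler = StandardScaler()
train_feats = scaler.fit_transform(train_feats)
test_feats = scaler.transform(test_feats)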

Part 3. Train Classifiers

Import a few classifiers and build models on your training data. Some suggestions: Support Vector Machine, Random Forest, Neural Network, Naive Bayes, and K-Nearest Neighbors.


In [ ]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

knn3 = KNeighborsClassifier(n_neighbors=3)
svml = SVC(kernel="linear", C=0.025, probability=True)
svmr = SVC(gamma=2, C=1, probability=True)
dtre = DecisionTreeClassifier()
rafo = RandomForestClassifier(n_estimators=100)
nnet = MLPClassifier(alpha=1)
naiv = GaussianNB()

# INSTRUCTION: Train at least three classifiers and run on your test data. Here's an example to get you started.
# 


rafo.fit(train_feats, train_labels)
rafo_scores = rafo.predict_proba(test_feats)[:,1]


# INSTRUCTION: Print out the following metrics per classifier into a table: accuracy, auc, f1_score, etc.
#
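
# A possible sketch, assuming the train/test arrays from the previous steps and
# a 0.5 threshold to turn probabilities into hard predictions:
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

summary = {}
for name, clf in [('KNN', knn3), ('RandomForest', rafo), ('GaussianNB', naiv)]:
    clf.fit(train_feats, train_labels)
    scores = clf.predict_proba(test_feats)[:, 1]
    preds = (scores >= 0.5).astype(int)
    summary[name] = [accuracy_score(test_labels, preds),
                     roc_auc_score(test_labels, scores),
                     f1_score(test_labels, preds)]
pd.DataFrame(summary, index=['accuracy', 'auc', 'f1']).T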

Part 4. Plot Real and Bogus Score Distributions

Another way to assess performance is to plot a histogram of the RB scores on the test set, comparing the distributions for the labeled reals vs. the labeled boguses. The scores of the reals should be close to 1, while the scores of the boguses should be close to 0. The more separated the two score distributions are, the better your classifier performs.

Compare the score distributions of the classifiers you've trained. Try displaying them as a cumulative distribution rather than a straight histogram.

Optional: What would the decision threshold be at a 5% false negative rate (FNR)? What about at a 1% false positive rate (FPR)?


In [ ]:
# INSTRUCTION: create masks for the real and bogus examples of the test set
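# A possible sketch, assuming test_labels from the earlier split:
mask_test_real = test_labels == 1
mask_test_bogus = ~mask_test_real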


# INSTRUCTION: First compare the classifiers' scores on the test reals only
#

scores_list = []
legends = ['Classifier 1', 'Classifier 2', 'Classifier 3'] # REPLACE ME
colors = ['g', 'b', 'r'] 
rbbins = np.arange(0,1,0.001)

# Comparison on real examples

plt.figure()
ax = plt.subplot(111)
for i, scores in enumerate(scores_list):
    # COMPLETE ME: for example, a step histogram per classifier
    ax.hist(scores, bins=rbbins, color=colors[i], alpha=0.5, histtype='step')
ax.set_xlabel('RB Score')
ax.set_ylabel('Count')
ax.set_xbound(0, 1)
ax.legend(legends, loc=4)
plt.show()


# Now perform the same comparison on bogus examples
#
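
# Optional sketch: approximate decision thresholds from the score
# distributions, assuming rafo_scores on the test set and the masks above.
# 5% FNR: the score below which 5% of the true reals fall.
thresh_fnr5 = np.percentile(rafo_scores[mask_test_real], 5)
# 1% FPR: the score above which only 1% of the true boguses fall.
thresh_fpr1 = np.percentile(rafo_scores[mask_test_bogus], 99)
print("5% FNR threshold: {:.3f}, 1% FPR threshold: {:.3f}".format(thresh_fnr5, thresh_fpr1))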

Refining

Now it's time to go back and experiment with different parameters. If you have time, create a validation set and perform a grid search over a set of possible hyperparameters. Compare the different classifiers' results using the histograms above.
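
As a starting point, a minimal grid-search sketch for the random forest is shown below; the parameter grid is only an illustrative assumption, and GridSearchCV handles the cross-validation folds internally.


In [ ]:
from sklearn.model_selection import GridSearchCV

# Illustrative parameter grid (an assumption, not a recommendation):
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=3, scoring='roc_auc')
grid.fit(train_feats, train_labels)
print(grid.best_params_)
print(grid.best_score_)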