Building a Supervised Machine Learning Model

The objective of this hands-on activity is to create and evaluate a Real-Bogus classifier using ZTF alert data. We will be using the same data from Day 2's clustering exercise.

Load data
Examine features
Curate a test and training set
Train classifiers
Compare the performance of different learning algorithms

What's Not Covered

There are many topics to cover, and due to time constraints, we cannot cover them all. Omitted is a discussion of cross validation and hyperparameter tuning. I encourage you to click through and read those articles by sklearn.

0a. Imports

These are all the imports that will be used in this notebook. All should be available in the DSFP conda environment.



In [1]:

    
import numpy as np
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

0b. Data Location

Please specify paths for the following:



In [2]:

    
F_META = '../Day2/dsfp_ztf_meta.npy'
F_FEATS = '../Day2/dsfp_ztf_feats.npy'
D_STAMPS = '../Day2/dsfp_ztf_png_stamps'

1. Load Data

Please note that I have made some adjustments to the data.

I have created a dataframe (feats_df) from feats_np
I have also dropped columns from feats_df. Some were irrelevant to classification, and some contained a lot of NaNs.



In [3]:

    
meta_np = np.load(F_META)
feats_np = np.load(F_FEATS)

COL_NAMES = ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap',
             'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky',
             'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio',
             'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg',
             'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig',
             'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']

# NOTE FROM Umaa: I've decided to eliminate the following features. Dropping them.
#
COL_TO_DROP = ['ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 
               'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 
               'classtar', 'ssdistnr', 'ssmagnr', 'aimagerat', 'bimagerat', 
               'magapbig', 'sigmagapbig', 'scorr']
feats_df = pd.DataFrame(data=feats_np, index=meta_np['candid'], columns=COL_NAMES)
print("There are {} columns left.".format(len(feats_df.columns)))
print("They are: {}".format(list(feats_df.columns)))
feats_df.drop(columns=COL_TO_DROP, inplace=True) 
feats_df.describe()









    



There are 36 columns left.
They are: ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap', 'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky', 'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio', 'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg', 'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig', 'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']






    Out[3]:







  
    
      
      diffmaglim
      magpsf
      sigmapsf
      chipsf
      magap
      sigmagap
      sky
      magdiff
      fwhm
      mindtoedge
      magfromlim
      seeratio
      aimage
      bimage
      elong
      nneg
      nbad
      sumrat
      label
    
  
  
    
      count
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
      3190.000000
    
    
      mean
      20.151915
      19.038766
      0.114721
      29.171392
      18.882236
      0.133377
      1.759273
      -0.156530
      5.762724
      535.662895
      1.269680
      -19.033703
      1.447092
      1.121355
      1.277802
      3.702508
      0.827586
      0.942922
      0.263636
    
    
      std
      0.600403
      1.182225
      0.059628
      184.693580
      1.146221
      0.090430
      11.922192
      0.361237
      8.082798
      358.128782
      1.056042
      140.283975
      1.359441
      1.081769
      0.233597
      2.640393
      1.852272
      0.093331
      0.440673
    
    
      min
      16.387886
      11.861366
      0.000729
      0.145480
      11.904000
      0.000900
      -62.988770
      -0.999049
      -0.840000
      10.071900
      -0.689344
      -999.000000
      0.289000
      0.256000
      1.000000
      0.000000
      0.000000
      0.400054
      0.000000
    
    
      25%
      19.816405
      18.404491
      0.066442
      1.752059
      18.243625
      0.062925
      -0.078796
      -0.387930
      2.080000
      233.525776
      0.485384
      0.392570
      0.771250
      0.667000
      1.096764
      2.000000
      0.000000
      0.921823
      0.000000
    
    
      50%
      20.221339
      19.230843
      0.111248
      3.588183
      19.093750
      0.121900
      0.326771
      -0.088569
      3.170000
      475.807602
      1.019777
      0.794723
      1.018000
      0.841000
      1.201702
      3.000000
      0.000000
      0.990973
      0.000000
    
    
      75%
      20.608219
      19.917913
      0.167752
      11.765462
      19.744100
      0.185475
      0.891043
      0.057932
      6.200000
      782.824844
      1.860215
      1.171276
      1.726750
      1.268000
      1.395918
      5.000000
      0.000000
      1.000000
      1.000000
    
    
      max
      21.169477
      21.312767
      0.217133
      7037.184570
      21.160299
      0.733300
      321.040314
      0.971744
      118.790001
      1507.777222
      8.407010
      116.662155
      23.407000
      15.154000
      2.000000
      13.000000
      7.000000
      1.000000
      1.000000



In [4]:

    
# INSTRUCTION: How many real and bogus examples are in this labeled set
#
real_mask = feats_df['label'] == 1
bogus_mask = ~real_mask
print("Number of Real Examples: {}".format(np.sum(real_mask)))
print("Number of Bogus Examples: {}".format(np.sum(bogus_mask)))









    



Number of Real Examples: 841
Number of Bogus Examples: 2349

2. Plot Features

Examine each individual feature. You may use the subroutine below, or code of your own devising. Note the features that are continuous vs categorial, and those that have outliers. There are certain features that have sentinel values. You may wish to view some features on a log scale.

Question: Which features seem to have pathologies? Which features should we exclude?



In [5]:

    
# Histogram a Single Feature
#
def plot_rb_hists(df, colname, bins=100, xscale='linear', yscale='linear'):
    mask_real = feats_df['label'] == 1
    mask_bogus = ~mask_real
    plt.hist(df[colname][mask_real], bins, color='b', alpha=0.5, density=True)
    plt.hist(df[colname][mask_bogus], bins, color='r', alpha=0.5, density=True)
    plt.xscale(xscale)
    plt.yscale(yscale)
    plt.title(colname)
    plt.show()

# INSTRUCTION: Plot the individual features.  
#
for col in feats_df.columns:
    if col in ['chipsf', 'sky', 'fwhm']:
        plot_rb_hists(feats_df, col, bins=np.logspace(np.log10(0.01),np.log10(10000.0), 50), xscale='log')
    else:
        plot_rb_hists(feats_df, col)

Answer:

'nbad' and 'nneg' are discrete features
'seeratio' has a sentinel value of -999
There are some features that have high ranges than others. For classifiers that are sensitive to scaling, we will need to scale the data
There does not appear to be good separability between real and bogus on any individual features.

3. Curate a Test and Training Set

We need to reserve some of our labeled data for evaluation. This means we must split up the labeled data we have into the set used for training (training set), and the set used for evaluation (test set). Ideally, the distribution of real and bogus examples in both the training and test sets are roughly identical. One can use sklearn.model_selection.train_test_split and use the stratify option.

For ZTF data, we split the training and test data by date. That way repeat observations from the same night (which might be nearly identical) cannot be split into the training and test set, and artificially inflate test performance. Also, due to the change in survey objectives, it's possible that the test set features have drifted away from the training sets.

Provided is nid.npy which contains the Night IDs for ZTF. Split on nid=550 (July 5, 2018). This should leave you with roughly 500 examples in your test set.



In [6]:

    
feats_plus_label = np.array(feats_df)
nids = meta_np['nid']

# INSTRUCTION: nid.npy contains the nids for this labeled data.
# Split the data into separate data structures for train/test data at nid=500.
# Verify that you have at least 500 reals in your test set.

nid_mask_train = nids <= 550
nid_mask_test = ~nid_mask_train

train_plus_label = feats_plus_label[nid_mask_train,:]
test_plus_label = feats_plus_label[nid_mask_test,:]

nreals_train = np.sum(train_plus_label[:,-1] == 1)
nbogus_train = np.sum(train_plus_label[:,-1] == 0)
nreals_test = np.sum(test_plus_label[:,-1] == 1)
nbogus_test = np.sum(test_plus_label[:,-1] == 0)


print("TRAIN Num Real={}, Bogus={}".format(nreals_train, nbogus_train))    
print("TEST Num Real={}, Bogus={}".format(nreals_test, nbogus_test))









    



TRAIN Num Real=573, Bogus=2013
TEST Num Real=268, Bogus=336

4. Train a Classifier

Part 1: Separate Labels from the Features

Now store the labels separately from the features.



In [7]:

    
# INSTRUCTION: Separate the labels from the features

train_feats = train_plus_label[:,:-1]
train_labels = train_plus_label[:,-1]

test_feats = test_plus_label[:,:-1]
test_labels = test_plus_label[:,-1]

Part 2. Scaling Data

With missing values handled, you're closing to training your classifiers. However, because distance metrics can be sensitive to the scale of your data (e.g., some features span large numeric ranges, but others don't), it is important to normalize data within a standard range such as (0, 1) or with z-score normalization (scaling to unit mean and variance). Fortunately, sklearn also makes this quite easy. Please review sklearn's preprocessing module options, specifically StandardScaler which corresponds to z-score normalization and MinMaxScaler. Please implement one.

FYI - Neural networks and Support Vector Machines (SVM) are sensitive to the scale of the data. Decision trees (and therefore Random Forests) are not (but it doesn't hurt to use scaled data).



In [8]:

    
# INSTRUCTION: Re-scale your data using either the MinMaxScaler or StandardScaler from sklearn
#

scaler = StandardScaler()
train_feats = scaler.fit_transform(train_feats)
test_feats = scaler.fit_transform(test_feats)

Part 3. Train Classifiers

Import a few classifiers and build models on your training data. Some suggestions include a Support Vector Machine, Random Forest, Neural Net, NaiveBayes and K-Nearest Neighbor.



In [9]:

    
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

knn3 = KNeighborsClassifier(3),
svml = SVC(kernel="linear", C=0.025, probability=True)
svmr = SVC(gamma=2, C=1, probability=True)
dtre = DecisionTreeClassifier()
rafo = RandomForestClassifier(n_estimators=100)
nnet = MLPClassifier(alpha=1)
naiv = GaussianNB()

# INSTRUCTION: Train three classifiers and run on your test data. Here's an example to get your started.  
# Which ones seems to take longer to train?
# 


rafo.fit(train_feats, train_labels)
rafo_scores = rafo.predict_proba(test_feats)[:,1]

nnet.fit(train_feats, train_labels)
nnet_scores = nnet.predict_proba(test_feats)[:,1]

svmr.fit(train_feats, train_labels)
svmr_scores = svmr.predict_proba(test_feats)[:,1]


# INSTRUCTION: Print out the following metrics per classifier into a table: accuracy, auc, f1_score, etc.
#

Part 4. Plot Real and Bogus Score Distributions

Another way to test performance is to plot a histogram of the test set RB scores, comparing the distributions of the labeled reals vs. boguses. The scores of the reals should be close to 1, while the scores of the boguses should be closer to 0. The more separated the distribution of scores, the better performing your classifier is.

Compare the score distributions of the classifiers you've trained. Trying displaying as a cumulative distribution rather than straight histogram.

Optional: What would the decision thresholds be at the 5, 10 and 20% false negative rate (FNR)? What would the decision threshold be at the 1, 10, and 20% false positive rate?



In [10]:

    
# INSTRUCTION: create masks for the real and bogus examples of the test set

real_mask_test = test_labels == 1
bogus_mask_test = test_labels == 0

# # INSTRUCTION: First compare the classifiers' scores on the test reals only
# #

scores_list = [rafo_scores, nnet_scores, svmr_scores]
legends = ['Random Forest', 'Neural Net', 'SVM-RBF'] 
colors = ['g', 'b', 'r'] 
rbbins = np.arange(0,1,0.001)

# Comparison on Reals

plt.figure()
ax = plt.subplot(111)
for i, scores in enumerate(scores_list):
    ax.hist(scores[real_mask_test], rbbins, histtype='step', cumulative=True, density=False, color=colors[i])
ax.set_xlabel('RB Score')
ax.set_ylabel('Count')
ax.set_xbound(0, 1)
ax.legend(legends, loc=4)
plt.show()


# Comparison on Bogus
#
plt.figure()
ax = plt.subplot(111)
rbbins = np.arange(0,1,0.001)
for i, scores in enumerate(scores_list):
     ax.hist(scores[bogus_mask_test], rbbins, histtype='step', cumulative=True, density=False, color=colors[i])
ax.set_xlabel('RB Score')
ax.set_ylabel('Count')
ax.set_xbound(0, 1)
ax.legend(legends, loc=4)
plt.show()

Refining

Now it's time to go back and start experimenting with different parameters. If you have time, create a validation set and perform a Grid Search over a set of possible hyperparameters. Compare different classifier results using the histograms above.

	diffmaglim	magpsf	sigmapsf	chipsf	magap	sigmagap	sky	magdiff	fwhm	mindtoedge	magfromlim	seeratio	aimage	bimage	elong	nneg	nbad	sumrat	label
count	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000	3190.000000
mean	20.151915	19.038766	0.114721	29.171392	18.882236	0.133377	1.759273	-0.156530	5.762724	535.662895	1.269680	-19.033703	1.447092	1.121355	1.277802	3.702508	0.827586	0.942922	0.263636
std	0.600403	1.182225	0.059628	184.693580	1.146221	0.090430	11.922192	0.361237	8.082798	358.128782	1.056042	140.283975	1.359441	1.081769	0.233597	2.640393	1.852272	0.093331	0.440673
min	16.387886	11.861366	0.000729	0.145480	11.904000	0.000900	-62.988770	-0.999049	-0.840000	10.071900	-0.689344	-999.000000	0.289000	0.256000	1.000000	0.000000	0.000000	0.400054	0.000000
25%	19.816405	18.404491	0.066442	1.752059	18.243625	0.062925	-0.078796	-0.387930	2.080000	233.525776	0.485384	0.392570	0.771250	0.667000	1.096764	2.000000	0.000000	0.921823	0.000000
50%	20.221339	19.230843	0.111248	3.588183	19.093750	0.121900	0.326771	-0.088569	3.170000	475.807602	1.019777	0.794723	1.018000	0.841000	1.201702	3.000000	0.000000	0.990973	0.000000
75%	20.608219	19.917913	0.167752	11.765462	19.744100	0.185475	0.891043	0.057932	6.200000	782.824844	1.860215	1.171276	1.726750	1.268000	1.395918	5.000000	0.000000	1.000000	1.000000
max	21.169477	21.312767	0.217133	7037.184570	21.160299	0.733300	321.040314	0.971744	118.790001	1507.777222	8.407010	116.662155	23.407000	15.154000	2.000000	13.000000	7.000000	1.000000	1.000000