The objective of this hands-on activity is to create and evaluate a Real-Bogus classifier using ZTF alert data. We will be using the same data from Day 2's clustering exercise.
There are many topics to cover, and due to time constraints we cannot cover them all. In particular, cross validation and hyperparameter tuning are omitted; I encourage you to read the corresponding sklearn documentation.
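For reference only, here is a minimal sketch of k-fold cross validation with sklearn. It is not part of the exercise, and it uses a small synthetic dataset from make_classification rather than the ZTF data (an assumption purely for illustration):
# Not part of the exercise: a minimal cross-validation sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=100)
cv_scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5, scoring='roc_auc')  # AUC per fold
print("Mean AUC: {:.3f} +/- {:.3f}".format(cv_scores.mean(), cv_scores.std()))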
In [1]:
import numpy as np
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.preprocessing import MinMaxScaler, StandardScaler
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
In [2]:
F_META = '../Day2/dsfp_ztf_meta.npy'
F_FEATS = '../Day2/dsfp_ztf_feats.npy'
D_STAMPS = '../Day2/dsfp_ztf_png_stamps'
In [3]:
meta_np = np.load(F_META)
feats_np = np.load(F_FEATS)
COL_NAMES = ['diffmaglim', 'magpsf', 'sigmapsf', 'chipsf', 'magap', 'sigmagap',
'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr', 'sky',
'magdiff', 'fwhm', 'classtar', 'mindtoedge', 'magfromlim', 'seeratio',
'aimage', 'bimage', 'aimagerat', 'bimagerat', 'elong', 'nneg',
'nbad', 'ssdistnr', 'ssmagnr', 'sumrat', 'magapbig', 'sigmagapbig',
'ndethist', 'ncovhist', 'jdstarthist', 'jdendhist', 'scorr', 'label']
# NOTE FROM Umaa: I've decided to eliminate the following features. Dropping them.
#
COL_TO_DROP = ['ndethist', 'ncovhist', 'jdstarthist', 'jdendhist',
'distnr', 'magnr', 'sigmagnr', 'chinr', 'sharpnr',
'classtar', 'ssdistnr', 'ssmagnr', 'aimagerat', 'bimagerat',
'magapbig', 'sigmagapbig', 'scorr']
feats_df = pd.DataFrame(data=feats_np, index=meta_np['candid'], columns=COL_NAMES)
feats_df.drop(columns=COL_TO_DROP, inplace=True)
print("There are {} columns left.".format(len(feats_df.columns)))
print("They are: {}".format(list(feats_df.columns)))
feats_df.describe()
Out[3]:
In [4]:
# INSTRUCTION: How many real and bogus examples are in this labeled set?
#
real_mask = feats_df['label'] == 1
bogus_mask = ~real_mask
print("Number of Real Examples: {}".format(np.sum(real_mask)))
print("Number of Bogus Examples: {}".format(np.sum(bogus_mask)))
Examine each individual feature. You may use the subroutine below, or code of your own devising. Note which features are continuous vs. categorical, and which have outliers. Certain features have sentinel values; a quick check for these appears after the plots below. You may wish to view some features on a log scale.
In [5]:
# Histogram a Single Feature
#
def plot_rb_hists(df, colname, bins=100, xscale='linear', yscale='linear'):
    mask_real = df['label'] == 1
    mask_bogus = ~mask_real
    plt.hist(df[colname][mask_real], bins, color='b', alpha=0.5, density=True)
    plt.hist(df[colname][mask_bogus], bins, color='r', alpha=0.5, density=True)
    plt.xscale(xscale)
    plt.yscale(yscale)
    plt.title(colname)
    plt.show()
# INSTRUCTION: Plot the individual features.
#
for col in feats_df.columns:
    if col in ['chipsf', 'sky', 'fwhm']:
        plot_rb_hists(feats_df, col, bins=np.logspace(np.log10(0.01), np.log10(10000.0), 50), xscale='log')
    else:
        plot_rb_hists(feats_df, col)
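To spot the sentinel values mentioned above, one simple check is to count how often each feature takes a suspicious placeholder value. The specific sentinel values below (-999, -99, 99) are an assumption for illustration; adjust them to whatever the plots suggest.
# Rough check for assumed sentinel values per feature; extreme percentiles also expose outliers.
SENTINELS = [-999.0, -99.0, 99.0]
for col in feats_df.columns:
    counts = {s: int((feats_df[col] == s).sum()) for s in SENTINELS}
    if any(counts.values()):
        print(col, counts)
feats_df.describe(percentiles=[0.01, 0.5, 0.99])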
We need to reserve some of our labeled data for evaluation. This means we must split our labeled data into a set used for training (the training set) and a set used for evaluation (the test set). Ideally, the distribution of real and bogus examples is roughly the same in both. One way to achieve this is sklearn.model_selection.train_test_split with the stratify option; a minimal sketch follows.
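For reference only (the exercise below splits by date instead), a stratified random split might look roughly like this; the 20% test fraction and random seed are arbitrary choices:
# Stratified random split -- shown for reference; the exercise uses the nid-based split below.
from sklearn.model_selection import train_test_split

X_all = np.array(feats_df.drop(columns=['label']))
y_all = np.array(feats_df['label'])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, stratify=y_all, random_state=42)
print(X_tr.shape, X_te.shape)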
For ZTF data, however, we split the training and test data by date. That way repeat observations from the same night (which may be nearly identical) cannot end up in both the training and test sets and artificially inflate test performance. Also, because the survey objectives changed over time, it's possible that the test set features have drifted away from the training set's distribution.
Provided is nid.npy which contains the Night IDs for ZTF. Split on nid=550 (July 5, 2018). This should leave you with roughly 500 examples in your test set.
In [6]:
feats_plus_label = np.array(feats_df)
nids = meta_np['nid']
# INSTRUCTION: nid.npy contains the nids for this labeled data.
# Split the data into separate data structures for train/test data at nid=550.
# Verify that you have at least 500 reals in your test set.
nid_mask_train = nids <= 550
nid_mask_test = ~nid_mask_train
train_plus_label = feats_plus_label[nid_mask_train,:]
test_plus_label = feats_plus_label[nid_mask_test,:]
nreals_train = np.sum(train_plus_label[:,-1] == 1)
nbogus_train = np.sum(train_plus_label[:,-1] == 0)
nreals_test = np.sum(test_plus_label[:,-1] == 1)
nbogus_test = np.sum(test_plus_label[:,-1] == 0)
print("TRAIN Num Real={}, Bogus={}".format(nreals_train, nbogus_train))
print("TEST Num Real={}, Bogus={}".format(nreals_test, nbogus_test))
In [7]:
# INSTRUCTION: Separate the labels from the features
train_feats = train_plus_label[:,:-1]
train_labels = train_plus_label[:,-1]
test_feats = test_plus_label[:,:-1]
test_labels = test_plus_label[:,-1]
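The notebook imports an imputer at the top but no imputation cell appears in this transcript. If the feature matrix contains NaNs (an assumption), a minimal sketch using SimpleImputer, fit on the training set only, would be:
# Hypothetical missing-value handling: fit on the training data only, then apply to both sets.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
train_feats = imputer.fit_transform(train_feats)
test_feats = imputer.transform(test_feats)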
With missing values handled, you're getting close to training your classifiers. However, because distance metrics can be sensitive to the scale of your data (e.g., some features span large numeric ranges while others don't), it is important to normalize the data to a standard range such as (0, 1) or with z-score normalization (scaling to zero mean and unit variance). Fortunately, sklearn makes this quite easy. Please review sklearn's preprocessing module options, specifically StandardScaler (which corresponds to z-score normalization) and MinMaxScaler, and implement one.
FYI - Neural networks and Support Vector Machines (SVM) are sensitive to the scale of the data. Decision trees (and therefore Random Forests) are not (but it doesn't hurt to use scaled data).
In [8]:
# INSTRUCTION: Re-scale your data using either the MinMaxScaler or StandardScaler from sklearn
#
scaler = StandardScaler()
train_feats = scaler.fit_transform(train_feats)
test_feats = scaler.transform(test_feats)  # fit on training data only, to avoid leaking test-set statistics
Import a few classifiers and build models on your training data. Some suggestions include a Support Vector Machine, Random Forest, Neural Net, NaiveBayes and K-Nearest Neighbor.
In [9]:
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
knn3 = KNeighborsClassifier(n_neighbors=3)
svml = SVC(kernel="linear", C=0.025, probability=True)
svmr = SVC(gamma=2, C=1, probability=True)
dtre = DecisionTreeClassifier()
rafo = RandomForestClassifier(n_estimators=100)
nnet = MLPClassifier(alpha=1)
naiv = GaussianNB()
# INSTRUCTION: Train three classifiers and run them on your test data. Here's an example to get you started.
# Which ones seem to take longer to train?
#
rafo.fit(train_feats, train_labels)
rafo_scores = rafo.predict_proba(test_feats)[:,1]
nnet.fit(train_feats, train_labels)
nnet_scores = nnet.predict_proba(test_feats)[:,1]
svmr.fit(train_feats, train_labels)
svmr_scores = svmr.predict_proba(test_feats)[:,1]
# INSTRUCTION: Print out the following metrics per classifier into a table: accuracy, auc, f1_score, etc.
#
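# One possible way to tabulate these metrics (a sketch; the 0.5 threshold used to
# turn scores into hard predictions for accuracy/F1 is an arbitrary choice):
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score

print("{:<15} {:>8} {:>8} {:>8}".format("Classifier", "Acc", "AUC", "F1"))
for name, scores in zip(['Random Forest', 'Neural Net', 'SVM-RBF'],
                        [rafo_scores, nnet_scores, svmr_scores]):
    preds = (scores >= 0.5).astype(int)
    print("{:<15} {:>8.3f} {:>8.3f} {:>8.3f}".format(
        name,
        accuracy_score(test_labels, preds),
        roc_auc_score(test_labels, scores),
        f1_score(test_labels, preds)))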
Another way to assess performance is to plot a histogram of the test set RB scores, comparing the distributions of the labeled reals vs. boguses. The scores of the reals should be close to 1, while the scores of the boguses should be closer to 0. The more separated the two score distributions are, the better your classifier performs.
Compare the score distributions of the classifiers you've trained. Try displaying them as cumulative distributions rather than straight histograms.
Optional: What would the decision thresholds be at the 5, 10, and 20% false negative rate (FNR)? What would the decision thresholds be at the 1, 10, and 20% false positive rate (FPR)? (A sketch addressing this appears after the plots below.)
In [10]:
# INSTRUCTION: create masks for the real and bogus examples of the test set
real_mask_test = test_labels == 1
bogus_mask_test = test_labels == 0
# INSTRUCTION: First compare the classifiers' scores on the test reals only
#
scores_list = [rafo_scores, nnet_scores, svmr_scores]
legends = ['Random Forest', 'Neural Net', 'SVM-RBF']
colors = ['g', 'b', 'r']
rbbins = np.arange(0,1,0.001)
# Comparison on Reals
plt.figure()
ax = plt.subplot(111)
for i, scores in enumerate(scores_list):
    ax.hist(scores[real_mask_test], rbbins, histtype='step', cumulative=True, density=False, color=colors[i])
ax.set_xlabel('RB Score')
ax.set_ylabel('Count')
ax.set_xbound(0, 1)
ax.legend(legends, loc=4)
plt.show()
# Comparison on Bogus
#
plt.figure()
ax = plt.subplot(111)
rbbins = np.arange(0,1,0.001)
for i, scores in enumerate(scores_list):
    ax.hist(scores[bogus_mask_test], rbbins, histtype='step', cumulative=True, density=False, color=colors[i])
ax.set_xlabel('RB Score')
ax.set_ylabel('Count')
ax.set_xbound(0, 1)
ax.legend(legends, loc=4)
plt.show()
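For the optional question above, one way to read off decision thresholds at target FNR or FPR values is to use sklearn's roc_curve. This sketch uses the random forest scores; swapping in another classifier's scores works the same way.
# Thresholds at target FPR / FNR for the random forest scores.
# roc_curve returns the fpr, tpr, and the thresholds that produce them; fnr = 1 - tpr.
from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(test_labels, rafo_scores)
fnr = 1.0 - tpr
for target in [0.05, 0.10, 0.20]:
    idx = np.argmin(np.abs(fnr - target))
    print("FNR ~{:.0%}: threshold ~{:.3f}".format(target, thresholds[idx]))
for target in [0.01, 0.10, 0.20]:
    idx = np.argmin(np.abs(fpr - target))
    print("FPR ~{:.0%}: threshold ~{:.3f}".format(target, thresholds[idx]))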