5 December 2016
Notebook for testing sklearn algorithms for selection of $3.5<z<5$ quasars from SDSS+SpIES.
Parallels the original SpIESHighzCandidateSelection, except that I included imag and extinctu as features.
Training data sets were made as follows (see 12 April 2016 README entry):
maketrainhighz2flup_2016.py
Input:
/Users/gtr/Work/data/ALLWISE/fromadam2016/GTR-ADM-QSO-ir-master_matches2016.fits
/Users/gtr/Work/data/ALLWISE/fromadam2016/GTR-ADM-QSO-ir-master_nonmatches2016.every100.fits
Output:
GTR-ADM-QSO-ir-master-sweeps_qsomatches_lup_train.photoz_in.2016.dat
GTR-ADM-QSO-ir_labels-nonflagged_qsohighz_2016.dat
GTR-ADM-QSO-ir_classifiers-nonflagged_qsohighz_2016.dat
GTR-ADM-QSO-ir_others-nonflagged_qsohighz_2016.dat
GTR-ADM-QSO-ir_all-nonflagged_qsohighz_2016.dat
highz_findbw_lup_2016.csh
computes bandwidths (a sklearn sketch follows below) and produces the "starclean" files:
GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_colors_starclean.dat
GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_others_starclean.dat
GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_labels_starclean.dat
693559 objects in all
4193 quasars
689365 stars
Merged these together (15 November 2016) with TOPCAT to produce
GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_starclean.fits
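For reference, a minimal sketch of how a kernel bandwidth search like the one in highz_findbw_lup_2016.csh might look in sklearn; the bandwidth grid and the Xquasars array are illustrative assumptions, not what the script actually does:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity
# Cross-validated bandwidth selection for a Gaussian KDE; Xquasars is a
# hypothetical (nobjects, ncolors) array for the quasar training set.
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.logspace(-2, 0, 20)}, cv=5)
grid.fit(Xquasars)
print(grid.best_params_['bandwidth'])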
In [1]:
# Read in training data file
%matplotlib inline
from astropy.table import Table
import numpy as np
import matplotlib.pyplot as plt
data = Table.read('GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_starclean.fits')
# X is in the format needed for all of the sklearn tools; it has the colors plus imag and extinctu
X = np.vstack([ data['ug'], data['gr'], data['ri'], data['iz'], data['zs1'], data['s1s2'], data['imag'], data['extinctu']]).T
y = np.array(data['labels'])
#y = np.array(data['zspec'])
In [3]:
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.metrics import classification_report
# Split the training data into training and test sets for cross-validation
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=42)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(Xtrain) # Don't cheat - fit only on training data
XS = scaler.transform(X)
XStrain = scaler.transform(Xtrain)
XStest = scaler.transform(Xtest) # apply same transformation to test data
In [4]:
from sklearn.svm import SVC
svm = SVC(random_state=42)
svm.fit(Xtrain, ytrain)
ypredSVM = svm.predict(Xtest)
print(classification_report(ytest, ypredSVM, target_names=['QSOs', 'stars']))
In [19]:
print(classification_report(ytest, ypredSVM, target_names=['QSOs', 'stars']))
Pretty good! 73% completeness and 95% efficiency. But not as good as colors only.
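For clarity, "completeness" here is the recall of the QSO class and "efficiency" is its precision; a minimal sketch of pulling both straight from the confusion matrix (QSOs are label 0):
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(ytest, ypredSVM)            # rows = true class, cols = predicted class
completeness = cm[0, 0] / float(cm[0, :].sum())   # fraction of true QSOs recovered
efficiency = cm[0, 0] / float(cm[:, 0].sum())     # purity of the QSO candidates
print(completeness, efficiency)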
In [20]:
from sklearn.svm import SVC
svm = SVC(random_state=42)
svm.fit(XStrain, ytrain)
ypredSVM2 = svm.predict(XStest)
print(classification_report(ytest, ypredSVM2, target_names=['QSOs', 'stars']))
Scaled results are better, however: 81% completeness and 95% efficiency. But I would like even better efficiency if possible.
Play with some of the options. Linear and poly (with other defaults) are bad. Sigmoid is terrible!
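A more systematic way to play with the options would be a small grid search; this is just a sketch, and the parameter grid is an illustrative assumption rather than what was actually run:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Try a few kernels and regularization strengths on the scaled training data.
param_grid = {'kernel': ['rbf', 'linear', 'poly'], 'C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(SVC(random_state=42), param_grid, cv=3, n_jobs=-1)
grid.fit(XStrain, ytrain)
print(grid.best_params_, grid.best_score_)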
In [58]:
from sklearn.svm import SVC
svm = SVC(kernel='sigmoid', random_state=42)
svm.fit(XStrain, ytrain)
ypredSVM2 = svm.predict(XStest)
print(classification_report(ytest, ypredSVM2, target_names=['QSOs', 'stars']))
In [21]:
# Random Forests, etc.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
clfDTC = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=42)
clfRFC = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=42, n_jobs=-1)
clfETC = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=42)
clfDTC.fit(Xtrain, ytrain)
clfRFC.fit(Xtrain, ytrain)
clfETC.fit(Xtrain, ytrain)
ypredDTC = clfDTC.predict(Xtest)
ypredRFC = clfRFC.predict(Xtest)
ypredETC = clfETC.predict(Xtest)
In [22]:
print(clfDTC.feature_importances_)
print(clfRFC.feature_importances_)
print(clfETC.feature_importances_)
print(classification_report(ytest, ypredDTC, target_names=['QSOs', 'stars']))
print(classification_report(ytest, ypredRFC, target_names=['QSOs', 'stars']))
print(classification_report(ytest, ypredETC, target_names=['QSOs', 'stars']))
None of these is as good as the SVM.
In [24]:
# Now scaled
clfDTC.fit(XStrain, ytrain)
clfRFC.fit(XStrain, ytrain)
clfETC.fit(XStrain, ytrain)
ypredDTC2 = clfDTC.predict(XStest)
ypredRFC2 = clfRFC.predict(XStest)
ypredETC2 = clfETC.predict(XStest)
In [26]:
print(clfDTC.feature_importances_)
print(clfRFC.feature_importances_)
print(clfETC.feature_importances_)
print(classification_report(ytest, ypredDTC2, target_names=['QSOs', 'stars']))
print(classification_report(ytest, ypredRFC2, target_names=['QSOs', 'stars']))
print(classification_report(ytest, ypredETC2, target_names=['QSOs', 'stars']))
Even with scaled data, none of these are as good as SVM.
In [27]:
# Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(), max_samples=0.5, max_features=1.0, random_state=42, n_jobs=-1)
bagging.fit(Xtrain, ytrain)
ypredBag = bagging.predict(Xtest)
print(classification_report(ytest, ypredBag, target_names=['QSOs', 'stars']))
Efficiency is the best so far, but completeness is relatively low. Try scaled data.
In [28]:
bagging.fit(XStrain, ytrain)
ypredBag2 = bagging.predict(XStest)
print(classification_report(ytest, ypredBag2, target_names=['QSOs', 'stars']))
Better with scaled data. Not as complete as SVM, but more efficient. Try changing the number of neighbors.
In [29]:
# Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(n_neighbors=7), max_samples=0.5, max_features=1.0, random_state=42, n_jobs=-1)
bagging.fit(XStrain, ytrain)
ypredBag3 = bagging.predict(XStest)
print(classification_report(ytest, ypredBag3, target_names=['QSOs', 'stars']))
Yes! This is the best so far: 97% efficiency. Still not 81% complete, but pretty good.
In [30]:
# Boosting
from sklearn.metrics import f1_score
from sklearn.ensemble import GradientBoostingClassifier
boost = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42)
boost.fit(Xtrain, ytrain)
ypredBoost = boost.predict(Xtest)
print(f1_score(ytest, ypredBoost))  # note: default pos_label=1 scores the star class
print(classification_report(ytest, ypredBoost, target_names=['QSOs', 'stars']))
print(boost.feature_importances_)
Good, but not great. The importances show that i-mag adds more than the last three colors, but extinction does little.
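A quick sketch for plotting those importances with the feature names attached; the feature list simply mirrors the columns stacked into X above:
# Bar chart of the boosted-tree feature importances, most important first.
features = ['ug', 'gr', 'ri', 'iz', 'zs1', 's1s2', 'imag', 'extinctu']
order = np.argsort(boost.feature_importances_)[::-1]
plt.bar(np.arange(len(features)), boost.feature_importances_[order])
plt.xticks(np.arange(len(features)), [features[i] for i in order], rotation=45)
plt.ylabel('feature importance')
plt.show()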
In [31]:
# With scaled data
boost.fit(XStrain, ytrain)
ypredBoost2 = boost.predict(XStest)
print(f1_score(ytest, ypredBoost2))
print(classification_report(ytest, ypredBoost2, target_names=['QSOs', 'stars']))
print(boost.feature_importances_)
Still not as good.
Continue to explore bagging parameters to see if we can do better, then use CV to determine final efficiency and contamination.
In [47]:
# Bagging study
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(n_neighbors=7), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
bagging.fit(Xtrain, ytrain)
ypredBag = bagging.predict(Xtest)
print(classification_report(ytest, ypredBag, target_names=['QSOs', 'stars']))
In [48]:
# Bagging study
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(n_neighbors=7), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
bagging.fit(XStrain, ytrain)
ypredBag = bagging.predict(XStest)
print(classification_report(ytest, ypredBag, target_names=['QSOs', 'stars']))
In [52]:
# Neighbors-optimized Bagging
depth = np.arange(3, 10)  # actually the number of neighbors, not a depth; just reusing the name
score_test = np.zeros(len(depth))
score_train = np.zeros(len(depth))
i_best = 0
# Loop over the numbers of neighbors
for i, d in enumerate(depth):
    clf = BaggingClassifier(KNeighborsClassifier(n_neighbors=d), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
    clf.fit(XStrain, ytrain)
    ypredtrain = clf.predict(XStrain)
    # We need some way to decide what is best.
    # Try using the f1_score (default pos_label=1, i.e., the star class here)
    # and comparing the values for the different numbers of neighbors.
    score_train[i] = f1_score(ytrain, ypredtrain)
    ypredtest = clf.predict(XStest)
    score_test[i] = f1_score(ytest, ypredtest)
    if score_test[i] >= score_test[i_best]:
        i_best = i
        ybest = ypredtest
best_depth = depth[i_best]
# Give the best number of neighbors and the corresponding classification report.
print(best_depth)
print(classification_report(ytest, ybest, target_names=['QSOs', 'stars']))
In [53]:
# Plot the f1 score vs. number of neighbors
fig = plt.figure(figsize=(6, 6))
plt.plot(depth, score_test, label='test')
plt.plot(depth, score_train, label='train')
plt.xlabel("neighbors")
plt.ylabel("score")
plt.legend()
plt.show()
Note that I think for this classifier a "good" score is lower. So the best number of neighbors is $\gtrsim 9$.
In [61]:
# Bagging study
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier(n_neighbors=9), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
bagging.fit(XStrain, ytrain)
ypredBagKNN = bagging.predict(XStest)
print(classification_report(ytest, ypredBagKNN, target_names=['QSOs', 'stars']))
In [55]:
# Bagging with RF
from sklearn.ensemble import BaggingClassifier
bagging = BaggingClassifier(RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
bagging.fit(XStrain, ytrain)
ypredBag = bagging.predict(XStest)
print(classification_report(ytest, ypredBag, target_names=['QSOs', 'stars']))
In [59]:
# Bagging with SVM
from sklearn.ensemble import BaggingClassifier
bagging = BaggingClassifier(SVC(), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
bagging.fit(XStrain, ytrain)
ypredBag = bagging.predict(XStest)
print(classification_report(ytest, ypredBag, target_names=['QSOs', 'stars']))
Is the bagging version any different for the same random state?
In [60]:
from sklearn.svm import SVC
svm = SVC(random_state=42)
svm.fit(XStrain, ytrain)
ypredSVM2 = svm.predict(XStest)
print(classification_report(ytest, ypredSVM2, target_names=['QSOs', 'stars']))
OK, SVM alone gets the highest completeness and bagging with KNN gets the highest efficiency. Do CV for both to classify the full data set and look at the results.
In [64]:
clf = SVC(random_state=42)
N=5
ypredCVSVM = cross_val_predict(clf, XS, y, cv=N)
print(classification_report(y, ypredCVSVM, target_names=['QSOs', 'stars']))
In [65]:
clf = BaggingClassifier(KNeighborsClassifier(n_neighbors=7), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
N=5
ypredCVbagKNN = cross_val_predict(clf, XS, y, cv=N)
print(classification_report(y, ypredCVbagKNN, target_names=['QSOs', 'stars']))
In [66]:
data['ypredSVM'] = ypredCVSVM
data['ypredBAGKNN'] = ypredCVbagKNN
In [67]:
qq = ((data['labels']==0) & (data['ypredSVM']==0))
ss = ((data['labels']==1) & (data['ypredSVM']==1))
qs = ((data['labels']==0) & (data['ypredSVM']==1))
sq = ((data['labels']==1) & (data['ypredSVM']==0))
dataqq = data[qq]
datass = data[ss]
dataqs = data[qs]
datasq = data[sq]
print(len(dataqq), "quasars selected as quasars")
print(len(dataqs), "quasars selected as stars")
print(len(datasq), "stars selected as quasars")
In [68]:
qq2 = ((data['labels']==0) & (data['ypredBAGKNN']==0))
ss2 = ((data['labels']==1) & (data['ypredBAGKNN']==1))
qs2 = ((data['labels']==0) & (data['ypredBAGKNN']==1))
sq2 = ((data['labels']==1) & (data['ypredBAGKNN']==0))
dataqq2 = data[qq2]
datass2 = data[ss2]
dataqs2 = data[qs2]
datasq2 = data[sq2]
print(len(dataqq2), "quasars selected as quasars")
print(len(dataqs2), "quasars selected as stars")
print(len(datasq2), "stars selected as quasars")
In [71]:
data.write('GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_starclean_class2.fits', format='fits')
Don't forget that many of the "stars" are quasars with $z\lesssim3.5$. The bagged KNN looks to be pretty clean of stars even though the contamination rate quoted above is high. See the redshift histograms below.
Similarly, most of the failed quasars are at $z\gtrsim3$, which isn't all that surprising.
This is about as good as the KDE results (at least for self-classification), which got 78% completeness with 97% efficiency (where some of those stars classified as quasars are also likely $z\sim3.5$). The combination of these suggests that the candidate list that John is using should have 3% or less contamination. We can check that by also testing the photo-z's.
Make a few plots illustrating that not all of the contamination is from stars (but is instead from $z\lesssim3.5$).
In [72]:
fig = plt.figure(figsize=(12, 4))
#fig.subplots_adjust(vspace=0.05)
ax1 = fig.add_subplot(131)
plt.scatter(datass['ug'],datass['s1s2'],label='ss',edgecolor='None',color='m')
plt.scatter(dataqq['ug'],dataqq['s1s2'],label='qq',edgecolor='None',color='b',alpha=0.5)
plt.scatter(dataqs['ug'],dataqs['s1s2'],label='qs',edgecolor='None',color='k',alpha=0.5)
plt.scatter(datasq['ug'],datasq['s1s2'],label='sq',edgecolor='None',color='g',alpha=0.5)
plt.legend(scatterpoints=1)
plt.xlim([-4,8])
plt.ylim([-1.5,1.5])
plt.xlabel('u-g')
plt.ylabel('ch1-ch2')
ax2 = fig.add_subplot(132)
ax2.yaxis.set_major_formatter(plt.NullFormatter())
plt.scatter(datass2['ri'],datass2['s1s2'],label='ss',edgecolor='None',color='m')
plt.scatter(dataqq2['ri'],dataqq2['s1s2'],label='qq',edgecolor='None',color='b',alpha=0.5)
plt.scatter(dataqs2['ri'],dataqs2['s1s2'],label='qs',edgecolor='None',color='k',alpha=0.5)
plt.scatter(datasq2['ri'],datasq2['s1s2'],label='sq',edgecolor='None',color='g',alpha=0.5)
plt.legend(scatterpoints=1)
plt.xlim([-2,4.5])
plt.ylim([-1.5,1.5])
plt.xlabel('r-i')
plt.ylabel('ch1-ch2')
ax3 = fig.add_subplot(133)
ax3.yaxis.set_major_formatter(plt.NullFormatter())
plt.scatter(datass2['zs1'],datass2['s1s2'],label='ss',edgecolor='None',color='m')
plt.scatter(dataqq2['zs1'],dataqq2['s1s2'],label='qq',edgecolor='None',color='b',alpha=0.5)
plt.scatter(dataqs2['zs1'],dataqs2['s1s2'],label='qs',edgecolor='None',color='k',alpha=0.5)
plt.scatter(datasq2['zs1'],datasq2['s1s2'],label='sq',edgecolor='None',color='g',alpha=0.5)
plt.legend(scatterpoints=1)
plt.xlim([-3,6])
plt.ylim([-1.5,1.5])
plt.xlabel('z-ch1')
plt.ylabel('ch1-ch2')
plt.show()
These are more revealing as contours or density plots (which I examined in TOPCAT), but I'm going to leave them as they are for now.
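If I wanted a density version in matplotlib rather than TOPCAT, a hexbin sketch of the first panel might look like the following (the gridsize is an arbitrary choice):
# Star density as a log-scaled hexbin, with the recovered quasars overplotted.
fig = plt.figure(figsize=(6, 6))
plt.hexbin(datass['ug'], datass['s1s2'], gridsize=60, bins='log', cmap='Greys')
plt.scatter(dataqq['ug'], dataqq['s1s2'], s=4, color='b', alpha=0.5, label='qq')
plt.xlim([-4, 8])
plt.ylim([-1.5, 1.5])
plt.xlabel('u-g')
plt.ylabel('ch1-ch2')
plt.legend(scatterpoints=1)
plt.show()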
Histogram of SVM selection.
In [69]:
fig = plt.figure(figsize=(6, 6))
bins=np.linspace(0,5.5,100)
plt.hist(dataqq['zspec'],bins=bins,label='qq',color='b',alpha=0.5)
plt.hist(dataqs['zspec'],bins=bins,label='qs',color='r',alpha=0.5)
plt.hist(datasq['zspec'],bins=bins,label='sq',color='g',alpha=0.5)
#plt.xlim([-4,8])
#plt.ylim([-1.5,1.5])
plt.legend()
plt.xlabel('zspec')
plt.ylabel('N')
Star contamination appears to be lower than for the RF in SpIESHighzCandidateSelection.
Histogram of Bagging selection.
In [70]:
fig = plt.figure(figsize=(6, 6))
bins=np.linspace(0,5.5,100)
plt.hist(dataqq2['zspec'],bins=bins,label='qq',color='b',alpha=0.5)
plt.hist(dataqs2['zspec'],bins=bins,label='qs',color='r',alpha=0.5)
plt.hist(datasq2['zspec'],bins=bins,label='sq',color='g',alpha=0.5)
#plt.xlim([-4,8])
#plt.ylim([-1.5,1.5])
plt.legend()
plt.xlabel('zspec')
plt.ylabel('N')
Star contamination here is the lowest of all. So, try to implement this in SpIESHighzQuasars2. But we'll find that this causes trouble given how slow it is.
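A rough sketch for quantifying that slowness before wiring this into SpIESHighzQuasars2; KNN prediction cost grows with the size of the stored training sample, which is the root of the problem:
import time
# Refit the bagged-KNN classifier and time prediction on the test set;
# scaling the per-object time to the full SpIES candidate list gives a
# sense of the runtime.
clf = BaggingClassifier(KNeighborsClassifier(n_neighbors=9), max_samples=0.5, max_features=1.0, n_jobs=-1, random_state=42)
clf.fit(XStrain, ytrain)
t0 = time.time()
_ = clf.predict(XStest)
print(time.time() - t0, 'seconds for', len(XStest), 'objects')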