Notebook performing selection of $3.5<z<5$ quasars from SDSS+SpIES data.
First load the training data, then instantiate and train the algorithm; see https://github.com/gtrichards/QuasarSelection/blob/master/SpIESHighzCandidateSelection.ipynb
In [1]:
%matplotlib inline
from astropy.table import Table
import numpy as np
import matplotlib.pyplot as plt
data = Table.read('GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_starclean.fits')
# X is in the format need for all of the sklearn tools, it just has the colors
X = np.vstack([ data['ug'], data['gr'], data['ri'], data['iz'], data['zs1'], data['s1s2']]).T
y = np.array(data['labels'])
In [2]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, max_depth=15, min_samples_split=2, n_jobs=-1, random_state=42)
clf.fit(X, y)
Out[2]:
Second, load the test data
Test set data set was made as follows (see 18 April 2016 README entry):
maketest_2016.py
Output is:
classifiers_out = open('GTR-ADM-QSO-ir_classifiers_good_test_2016.dat','w')
others_out= open('GTR-ADM-QSO-ir_others_good_test_2016.dat','w')
czr_out = open('GTR-ADM-QSO-ir_photoz_in7_good_test_2016.dat','w')
Really need the first two files combined (so that we have both RA/Dec and colors in one place).
But couldn't merge them with TOPCAT or STILTS. So had to break them into 3 pieces (with TOPCAT),
then used combine_test_files_STILTS.py to merge them together (just changing the input/output file names by hand).
Actually ran this on dirac so that I'd have more memory than on quasar. Copied the output files back to quasar and merged them together with TOPCAT.
So
GTR-ADM-QSO-ir_others_good_test_2016a.dat + GTR-ADM-QSO-ir_classifiers_good_test_2016a.dat
gives
GTR-ADM-QSO-ir_good_test_2016a.dat
(and so on for "b" and "c").
Then
GTR-ADM-QSO-ir_good_test_2016a.dat + GTR-ADM-QSO-ir_good_test_2016b.dat + GTR-ADM-QSO-ir_good_test_2016c.dat
gives
GTR-ADM-QSO-ir_good_test_2016.dat
N.B. Had to delete the intermediate files from quasar to save disk space.
GTR: Re-did this with combine_test_files_STILTSn.py (on quasar) since I wanted to include the i-band and ch2 magnitudes. The output file is still GTR-ADM-QSO-ir_good_test_2016.dat.
Now read in the test file and convert it to an appropriate array format for sklearn.
In [4]:
data2 = Table.read('GTR-ADM-QSO-ir_good_test_2016.fits')
#data2 = Table.read('GTR-ADM-QSO-ir_classifiers_good_test_2016.fits')
#data2 = Table.read('gtr-adm-qso-ir_photoz_in7_good_test_2016.fits')
Xtest = np.vstack([ data2['ug'], data2['gr'], data2['ri'], data2['iz'], data2['zs1'], data2['s1s2']]).T
print len(Xtest),Xtest[:5]
In [9]:
ypred = clf.predict(Xtest)
In [10]:
print ypred[:5]
print len(ypred)
In [12]:
data2['ypred'] = ypred
data2.write('GTR-ADM-QSO-ir_good_test_2016_out.fits', format='fits')
This gives about 27,000 candidates in all and about 3000 on Stripe 82.
You can get the output file from http://www.physics.drexel.edu/~gtr/outgoing/spies/GTR-ADM-QSO-ir_good_test_2016_out.fits.bz2
In [14]:
mask = (data2['ypred']==0)
quasars = data2[mask]
print len(quasars)
The main test set classification ends here. But see below for some experimenting with bagging.
Also, see
SpIESHighzQuasarsS82all.py
for some stand-alone code that classifies Stripe 82 candidates by RF, SVM and bagging.
In [5]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X) # Use the full training set now
XS = scaler.transform(X)
XStest = scaler.transform(Xtest)
In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=5), max_samples=0.5, max_features=1.0)
In [7]:
bag.fit(XS, y)
Out[7]:
In [8]:
from dask import compute, delayed
def processBAG(Xin):
return bag.predict(Xin)
# Create dask objects
# Reshape is necessary because the format of x as drawm from Xtest
# is not what sklearn wants.
dobjsBAG = [delayed(processBAG)(x.reshape(1,-1)) for x in XStest]
#print dobjs
See if we can even handle the first million objects, let alone the full data set.
In [10]:
print dobjsBAG[:5]
dobjsBAG1 = dobjsBAG[:1000000]
In [ ]:
import dask.threaded
ypredBAG1 = compute(*dobjsBAG1, get=dask.threaded.get)
In [23]:
ypredBAG1 = np.array(ypredBAG1).reshape(1,-1)[0]
In [24]:
print ypredBAG1[:100]
IFF we could get this to work, then we'd produce output as follows. But this always crashes. Tried running on dirac and it crashes there too. So, had to concentrate on Stripe 82. See SpIESHighzQuasarsS82all.py.
In [ ]:
data2['ypred'] = ypredBAG
data2.write('GTR-ADM-QSO-ir_good_test_2016_out_bag.fits', format='fits')