Final SpIES High-z Quasar Selection

Notebook performing selection of $3.5<z<5$ quasars from SDSS+SpIES data.

First load the training data, then instantiate and train the algorithm; see https://github.com/gtrichards/QuasarSelection/blob/master/SpIESHighzCandidateSelection.ipynb


In [1]:
%matplotlib inline
from astropy.table import Table
import numpy as np
import matplotlib.pyplot as plt
data = Table.read('GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_starclean.fits')

# X is in the format needed by the sklearn tools; it contains just the colors
X = np.vstack([ data['ug'], data['gr'], data['ri'], data['iz'], data['zs1'], data['s1s2']]).T
y = np.array(data['labels'])


/Users/gtr/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

In [2]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10, max_depth=15, min_samples_split=2, n_jobs=-1, random_state=42)
clf.fit(X, y)


Out[2]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=15, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=-1, oob_score=False, random_state=42,
            verbose=0, warm_start=False)
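As a quick sanity check (not part of the original selection), the trained forest's feature importances can be inspected. A minimal sketch on toy data standing in for the six-color training array:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for the six-color training array X and labels y above
rng = np.random.RandomState(42)
X_toy = rng.rand(200, 6)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

clf_toy = RandomForestClassifier(n_estimators=10, max_depth=15,
                                 min_samples_split=2, random_state=42)
clf_toy.fit(X_toy, y_toy)

# One importance per input color (ug, gr, ri, iz, zs1, s1s2 in the real
# data, in the order the columns were stacked); they sum to 1.
names = ['ug', 'gr', 'ri', 'iz', 'zs1', 's1s2']
for name, imp in zip(names, clf_toy.feature_importances_):
    print(name, round(imp, 3))
```

On the real `clf`, `clf.feature_importances_` gives one weight per color in the order the columns were stacked into `X`.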

Second, load the test data

Test Data

The test data set was made as follows (see the 18 April 2016 README entry):

maketest_2016.py

Output is:
classifiers_out = open('GTR-ADM-QSO-ir_classifiers_good_test_2016.dat','w') 
others_out= open('GTR-ADM-QSO-ir_others_good_test_2016.dat','w')
czr_out = open('GTR-ADM-QSO-ir_photoz_in7_good_test_2016.dat','w')

Really need the first two files combined (so that we have both RA/Dec and colors in one place).

But I couldn't merge them with TOPCAT or STILTS directly, so I had to break them into 3 pieces (with TOPCAT), then used combine_test_files_STILTS.py to merge them together (just changing the input/output file names by hand).

Actually ran this on dirac so that I'd have more memory than on quasar, then copied the output files back to quasar and merged them together with TOPCAT.

So
GTR-ADM-QSO-ir_others_good_test_2016a.dat + GTR-ADM-QSO-ir_classifiers_good_test_2016a.dat
gives
GTR-ADM-QSO-ir_good_test_2016a.dat
(and so on for "b" and "c").

Then
GTR-ADM-QSO-ir_good_test_2016a.dat + GTR-ADM-QSO-ir_good_test_2016b.dat + GTR-ADM-QSO-ir_good_test_2016c.dat
gives
GTR-ADM-QSO-ir_good_test_2016.dat

N.B. Had to delete the intermediate files from quasar to save disk space.

GTR: Re-did this with combine_test_files_STILTSn.py (on quasar) since I wanted to include the i-band and ch2 magnitudes. The output file is still GTR-ADM-QSO-ir_good_test_2016.dat.

Now read in the test file and convert it to an appropriate array format for sklearn.


In [4]:
data2 = Table.read('GTR-ADM-QSO-ir_good_test_2016.fits')
#data2 = Table.read('GTR-ADM-QSO-ir_classifiers_good_test_2016.fits')
#data2 = Table.read('gtr-adm-qso-ir_photoz_in7_good_test_2016.fits')

Xtest = np.vstack([ data2['ug'], data2['gr'], data2['ri'], data2['iz'], data2['zs1'], data2['s1s2']]).T

print len(Xtest),Xtest[:5]


50939704 [[  3.81193849e-02   3.65796465e-01   5.31836570e-02  -4.89434865e-02
    6.28884801e-01   6.24938477e-01]
 [  5.08303480e-01   1.11420093e-01   8.65531714e-02   1.03706865e-01
    4.21802021e-01   3.36181068e-01]
 [  1.65061871e+00   1.13939853e-01   1.54225782e-01   4.41376234e-01
    4.62311618e-01   4.68052444e-01]
 [  3.68773847e-01   2.59072437e-01   1.81469411e-01   1.44563753e-02
    7.14639858e-01   7.00231807e-01]
 [ -1.32032589e-02   1.96339791e-01   7.44004894e-02   5.95867945e-04
    1.46086353e+00   3.08763164e-01]]

Quasar Candidates

Finally, do the classification and output the test file, including the predicted labels.


In [9]:
ypred = clf.predict(Xtest)

In [10]:
print ypred[:5]
print len(ypred)


[1 1 1 1 1]
50939704

In [12]:
data2['ypred'] = ypred
data2.write('GTR-ADM-QSO-ir_good_test_2016_out.fits', format='fits')

This gives about 27,000 candidates in all and about 3000 on Stripe 82.

You can get the output file from http://www.physics.drexel.edu/~gtr/outgoing/spies/GTR-ADM-QSO-ir_good_test_2016_out.fits.bz2


In [14]:
mask = (data2['ypred']==0)
quasars = data2[mask]
print len(quasars)


27225
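The Stripe 82 count quoted above comes from a positional cut. A hypothetical sketch, assuming the output table carries `ra`/`dec` columns in degrees and using the nominal Stripe 82 footprint (|Dec| < 1.25 deg, RA between roughly 300 and 60 deg, wrapping through RA = 0):

```python
import numpy as np

# Hypothetical candidate positions in degrees (stand-ins for the
# 'ra'/'dec' columns of the quasars table above)
ra = np.array([15.0, 355.0, 120.0, 40.0])
dec = np.array([0.3, -1.0, 0.2, 5.0])

# Nominal Stripe 82 footprint: |Dec| < 1.25 deg, and RA either above
# 300 deg or below 60 deg (the stripe wraps through RA = 0)
on_s82 = (np.abs(dec) < 1.25) & ((ra > 300.0) | (ra < 60.0))
print(on_s82.sum())  # number of candidates on Stripe 82
```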

The main test set classification ends here. But see below for some experimenting with bagging.

Also, see
SpIESHighzQuasarsS82all.py
for some stand-alone code that classifies Stripe 82 candidates by RF, SVM and bagging.


Bagging classifier

If we want to try bagging instead, here is how we'd do that. But it takes a really long time unless we restrict ourselves to just Stripe 82. We'll use dask to schedule the processes; otherwise it kills the kernel.


In [5]:
from sklearn.preprocessing import StandardScaler 
scaler = StandardScaler()
scaler.fit(X)  # Use the full training set now
XS = scaler.transform(X)
XStest = scaler.transform(Xtest)

In [6]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=5), max_samples=0.5, max_features=1.0)

In [7]:
bag.fit(XS, y)


Out[7]:
BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=0.5, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=None, verbose=0, warm_start=False)

In [8]:
from dask import compute, delayed

def processBAG(Xin):
    return bag.predict(Xin)

# Create dask objects.
# The reshape is necessary because each row x drawn from XStest
# is 1D, while sklearn's predict expects a 2D array.
dobjsBAG = [delayed(processBAG)(x.reshape(1,-1)) for x in XStest]
#print dobjs
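Note that one delayed call per object means tens of millions of tiny tasks, which is much of why this struggles. A sketch of an alternative, predicting on chunks instead of single rows (toy data stands in for the scaled arrays; each `predict(chunk)` call could equally be wrapped in `delayed()`):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for the scaled training set (XS, y) and test set (XStest)
rng = np.random.RandomState(0)
Xtrain_toy = rng.rand(100, 6)
ytrain_toy = rng.randint(0, 2, 100)
Xtest_toy = rng.rand(1000, 6)

knn = KNeighborsClassifier(n_neighbors=5).fit(Xtrain_toy, ytrain_toy)

# Split the test array into a modest number of chunks and predict per
# chunk; this amortizes the per-call overhead over many rows.
chunks = np.array_split(Xtest_toy, 10)
ypred_toy = np.concatenate([knn.predict(chunk) for chunk in chunks])
print(len(ypred_toy))  # 1000
```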

See if we can even handle the first million objects, let alone the full data set.


In [10]:
print dobjsBAG[:5]
dobjsBAG1 = dobjsBAG[:1000000]


[Delayed('processBAG-6d7253d1-9dd0-4a7f-b328-d18caa8cf2b8'), Delayed('processBAG-f1987c0b-131a-4686-882a-0bd7f8e38f53'), Delayed('processBAG-030c344d-88f9-4600-b181-924342a29a90'), Delayed('processBAG-79eecb00-b136-4983-b21e-3f8f54cead4f'), Delayed('processBAG-ece8d4ea-41a5-4ade-ab9c-d5078f8df356')]

In [ ]:
import dask.threaded
ypredBAG1 = compute(*dobjsBAG1, get=dask.threaded.get)

In [23]:
ypredBAG1 = np.array(ypredBAG1).reshape(1,-1)[0]

In [24]:
print ypredBAG1[:100]


[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0]

If we could get this to work, we'd produce output as follows. But this always crashes; I tried running on dirac and it crashes there too. So I had to concentrate on Stripe 82 instead. See SpIESHighzQuasarsS82all.py.


In [ ]:
data2['ypred'] = ypredBAG  # ypredBAG would come from computing all of dobjsBAG, not just the first million
data2.write('GTR-ADM-QSO-ir_good_test_2016_out_bag.fits', format='fits')