Final SpIES High-z Quasar Selection

Notebook performing selection of $3.5<z<5$ quasars from SDSS+SpIES data.

Largely the same as the SpIESHighzQuasars notebook, except using the algorithm(s) from SpIESHighzCandidateSelection2. See the notes below for creating a version of the test set that includes the i-band magnitude and extinctu. (This wasn't easy.)

First load the training data, then instantiate and train the algorithm; see https://github.com/gtrichards/QuasarSelection/blob/master/SpIESHighzCandidateSelection2.ipynb


In [1]:
%matplotlib inline
from astropy.table import Table
import numpy as np
import matplotlib.pyplot as plt
data = Table.read('GTR-ADM-QSO-ir-testhighz_findbw_lup_2016_starclean.fits')

# X is in the format needed for all of the sklearn tools; it just has the colors.
# X = np.vstack([ data['ug'], data['gr'], data['ri'], data['iz'], data['zs1'], data['s1s2'], data['imag'], data['extinctu']]).T
# Don't use imag and extinctu since they don't contribute much to the accuracy and they add a lot to the data volume.
X = np.vstack([ data['ug'], data['gr'], data['ri'], data['iz'], data['zs1'], data['s1s2'] ]).T
y = np.array(data['labels'])



In [2]:
# For algorithms that need scaled data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)  # Use the full training set now
XStrain = scaler.transform(X)

In [3]:
# SVM
from sklearn.svm import SVC
svm = SVC(random_state=42)
svm.fit(XStrain,y)


Out[3]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=42, shrinking=True,
  tol=0.001, verbose=False)

In [4]:
# Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=7), max_samples=0.5, max_features=1.0, random_state=42)
bag.fit(XStrain, y)


Out[4]:
BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=7, p=2,
           weights='uniform'),
         bootstrap=True, bootstrap_features=False, max_features=1.0,
         max_samples=0.5, n_estimators=10, n_jobs=1, oob_score=False,
         random_state=42, verbose=0, warm_start=False)
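
As a quick sanity check (a sketch only, not run in this notebook), one could estimate the training-set performance of both classifiers with cross-validation. This assumes a scikit-learn version that has sklearn.model_selection (0.18+); in older versions the same function lives in sklearn.cross_validation.

# Sketch only (not executed here): 5-fold cross-validated accuracy on the
# scaled training set for both classifiers.
from sklearn.model_selection import cross_val_score
print(cross_val_score(svm, XStrain, y, cv=5).mean())
print(cross_val_score(bag, XStrain, y, cv=5).mean())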

Second, load the test data

Test Data

The test data set was made as follows (see the 18 April 2016 README entry):

maketest_2016.py

Output is:
classifiers_out = open('GTR-ADM-QSO-ir_classifiers_good_test_2016.dat','w') 
others_out= open('GTR-ADM-QSO-ir_others_good_test_2016.dat','w')
czr_out = open('GTR-ADM-QSO-ir_photoz_in7_good_test_2016.dat','w')

We really need the first two files combined (so that we have both RA/Dec and colors in one place).

However, they couldn't be merged in one go with TOPCAT or STILTS, so I had to break them into 3 pieces (with TOPCAT) and then use combine_test_files_STILTS.py to merge the pieces (just changing the input/output file names by hand).

Actually ran this on dirac so that I'd have more memory than on quasar. Copied the output files back to quasar and merged them together with TOPCAT.

So
GTR-ADM-QSO-ir_others_good_test_2016a.dat + GTR-ADM-QSO-ir_classifiers_good_test_2016a.dat
gives
GTR-ADM-QSO-ir_good_test_2016a.dat
(and so on for "b" and "c").

Then
GTR-ADM-QSO-ir_good_test_2016a.dat + GTR-ADM-QSO-ir_good_test_2016b.dat + GTR-ADM-QSO-ir_good_test_2016c.dat
gives
GTR-ADM-QSO-ir_good_test_2016.dat
and similarly for the fits output file.
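
For reference, the merge logic is just a column-wise join of the row-aligned "others" and "classifiers" files followed by a row-wise concatenation of the three pieces. Below is a minimal astropy sketch of that logic; the assumption that the pieces are row-aligned ASCII tables readable with format='ascii' is mine, and in practice this is exactly where the memory limits forced the STILTS/TOPCAT route described above.

# Sketch only: equivalent merge in astropy. Not what was actually run.
from astropy.table import Table, hstack, vstack

pieces = []
for s in ['a', 'b', 'c']:
    others = Table.read('GTR-ADM-QSO-ir_others_good_test_2016%s.dat' % s, format='ascii')
    classifiers = Table.read('GTR-ADM-QSO-ir_classifiers_good_test_2016%s.dat' % s, format='ascii')
    pieces.append(hstack([others, classifiers]))  # RA/Dec + colors side by side
merged = vstack(pieces)                           # stack the a/b/c pieces
merged.write('GTR-ADM-QSO-ir_good_test_2016.fits', format='fits')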


Since I wanted to use imag and extinctu, I also had to make a version of the test file with combine_test_files_STILTSn.py (on quasar). This was fairly involved because of memory issues. The new output file is GTR-ADM-QSO-ir_good_test_2016n.dat. In the end I didn't use that file, and this notebook is more of an exploration of SVM and bagging as alternatives to RF.

Now read in the test file and convert it to an appropriate array format for sklearn.


In [5]:
#data2 = Table.read('GTR-ADM-QSO-ir_good_test_2016n.fits')
data2 = Table.read('GTR-ADM-QSO-ir_good_test_2016.fits')

In [8]:
print data2.keys()


['ra', 'dec', 'iflux', 'morph', 'knownqso', 'extinctu', 'ug', 'gr', 'ri', 'iz', 'zs1', 's1s2', 'i', 's2', 'ierr', 's2err']

I had some problems with GTR-ADM-QSO-ir_good_test_2016n.fits because astropy thought that there were blank entries among the attributes. There actually weren't any (as far as I could tell), but I found that I could use filled() to work around the problem. However, that just caused problems later!


In [10]:
# Not sure why I need to do this because there don't appear to be any unfilled columns,
# but the code segment below won't run without it.
# Only need to do this for the file with imag and extinctu.
# data2 = data2.filled()
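
A quick way to see whether the table is actually masked (a sketch, assuming the blank-entry complaint comes from masked columns in the FITS read):

# Sketch: list any columns that actually contain masked entries before
# resorting to data2.filled().
if data2.masked:
    print [name for name in data2.colnames if np.any(data2[name].mask)]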

It takes too long to run on all of the objects, so just do Stripe 82, which is all that we really care about anyway.


In [6]:
ramask = ( ( (data2['ra']>=300.0) & (data2['ra']<=360.0) ) | ( (data2['ra']>=0.0) & (data2['ra']<=60.0) ) )
decmask = ((data2['dec']>=-1.5) & (data2['dec']<=1.5))

In [7]:
dataS82 = data2[ramask & decmask]

In [8]:
print len(dataS82)


2029149

In [9]:
#Xtest = np.vstack([dataS82['ug'], dataS82['gr'], dataS82['ri'], dataS82['iz'], dataS82['zs1'], dataS82['s1s2'], dataS82['i'], dataS82['extinctu']]).T
Xtest = np.vstack([dataS82['ug'], dataS82['gr'], dataS82['ri'], dataS82['iz'], dataS82['zs1'], dataS82['s1s2'] ]).T

In [10]:
XStest = scaler.transform(Xtest)

Quasar Candidates

Finally, do the classification and output the test file, including the predicted labels.


In [11]:
from dask import compute, delayed

def processSVM(Xin):
    return svm.predict(Xin)

# Create dask objects.
# Reshape is necessary because the format of x as drawn from XStest
# is not what sklearn wants (predict expects a 2-D array).
dobjsSVM = [delayed(processSVM)(x.reshape(1,-1)) for x in XStest]

In [ ]:
import dask.threaded
ypredSVM = compute(*dobjsSVM, get=dask.threaded.get)

In [13]:
ypredSVM = np.array(ypredSVM).reshape(1,-1)[0]
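
The cells above build one delayed task per object (roughly two million for Stripe 82), which adds a lot of scheduler overhead. Below is a sketch of an alternative that predicts in larger chunks instead; the chunk count of 100 and the *chunk variable names are arbitrary choices of mine, not what was actually run.

# Sketch only: chunked prediction rather than one delayed task per object.
from dask import compute, delayed
import dask.threaded

chunksSVM = np.array_split(XStest, 100)   # 100 chunks is an arbitrary choice
dobjsSVMchunk = [delayed(svm.predict)(chunk) for chunk in chunksSVM]
ypredSVMchunk = np.concatenate(compute(*dobjsSVMchunk, get=dask.threaded.get))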

In [9]:
from dask import compute, delayed

def processBAG(Xin):
    return bag.predict(Xin)

# Create dask objects.
# Reshape is necessary because the format of x as drawn from XStest
# is not what sklearn wants (predict expects a 2-D array).
dobjsBAG = [delayed(processBAG)(x.reshape(1,-1)) for x in XStest]

In [ ]:
import dask.threaded
ypredBAG = compute(*dobjsBAG, get=dask.threaded.get)

In [ ]:
ypredBAG = np.array(ypredBAG).reshape(1,-1)[0]

Now write the results to an output file. Didn't do the bagging prediction here because it takes too long; see SpIESHighzQuasarsS82all.py, which I ran on dirac.


In [17]:
dataS82['ypredSVM'] = ypredSVM
dataS82['ypredBAG'] = ypredBAG
#dataS82.write('GTR-ADM-QSO-ir_good_test_2016_Stripe82svm.fits', format='fits')
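
If the full table is too large to write comfortably, a lighter-weight alternative (a sketch; the output file name here is hypothetical) is to write just the positions and predicted labels:

# Sketch: write only positions and predicted labels.
out = dataS82['ra', 'dec', 'ypredSVM', 'ypredBAG']
out.write('GTR-ADM-QSO-ir_good_test_2016_Stripe82_predlabels.fits', format='fits')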