Accuracy vs Mag DEIMOS Spec Test Set

In this notebook we examine the accuracy as a function of magnitude for sources with spectroscopic classifications from DEIMOS COSMOS survey. The DEIMOS set contains $\sim$ 10K sources, and $\sim$ 2.7K sources are crossmatched with PS1 catalog.

The overall accuracy for the classification by the ML model we developed is $\sim$ 95%, but the FoM @FPR=0.05 is lower than 0.4, which is worse than the FoM obtained with HSTxPS1 catalog.

We found the accuracy of the HST classification is also $\sim$ 95%. The performance of the ML model, therefore, is reasonable because the ML model is trained with the HST classification.



In [1]:

    
import numpy as np 
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline



In [2]:

    
_df = pd.read_table('DEIMOS/deimos_10K_March2018/deimos.tbl', header=None)



In [3]:

    
arr = np.empty((len(_df), len(_df.iloc[0][0].split())), dtype='<U50')

for i in range(len(_df)):
    i_row = [k for k in _df.iloc[i][0].split('  ') if (k != '')and(k != ' ')]
    for j in range(len(_df.iloc[0][0].split())):
        arr[i][j] = i_row[j]



In [4]:

    
df = pd.DataFrame(arr)



In [5]:

    
ra = np.array(df[1], dtype=float)
dec = np.array(df[2], dtype=float)
sel = np.array(df[3], dtype=int)
imag = np.array(df[4].replace('null', '-999').replace(' null', '-999'), dtype=float)
kmag = np.array(df[5].replace('null', '-999').replace(' null', '-999'), dtype=float)
zspec = np.array(df[6].replace('null', '-999').replace(' null', '-999'), dtype=float)
Qflag = np.array(df[7].replace('null', '-999').replace(' null', '-999'), dtype=int)
Q  = np.array(df[8].replace('null', '-999').replace(' null', '-999'), dtype=float)



In [6]:

    
np.array(df[9][0:20])









    Out[6]:





array([' H+K?', ' [OII],[OIII]', ' [OIII]d', ' M star', ' [OII],H+K?',
       ' star', ' Hg,[OIII]?', ' star', ' [OII],[OIII]d', ' [OII],Hb',
       ' star', ' Lya,CIV(br),CIII]', ' [OII],K,Hb,MgI', ' [OII]d,K',
       ' [OII]d,H+H,Hb,[OIII]d', ' [OII]d,H+K', ' [OII],Hb,[OIII]d',
       ' [OII],K,Hb', ' Hb(abs),Ha?', ' [OII]d'], dtype=object)



In [7]:

    
sgFlag = np.empty(len(df), dtype=int)

for i in range(len(df[9])):
    if 'star' in df[9][i]:
        sgFlag[i] = 1 # star
    elif 'null' in df[9][i]:
        sgFlag[i] = -999 # null
    else:
        sgFlag[i] = 0 # galaxy

if "Remarks" contains "star", the source is classifyed star.



In [41]:

    
plt.hist(imag[sgFlag!=-999], bins=np.arange(15, 28, 0.2), color='0.8', label='All')
plt.hist(imag[sgFlag==0], bins=np.arange(15, 28, 0.2), alpha=0.5, label='GALAXY')
plt.hist(imag[sgFlag==1], bins=np.arange(15, 28, 0.2), alpha=0.5, label='STAR')
plt.yscale('log')
plt.xlabel('i mag'); plt.ylabel('#')
plt.legend(loc='best')
plt.show()

The distribution of galaxies looks similar to that of the HST COSMOS catalog, but that of stars has a peak at i-mag$\sim$22, which is not shown in that of the HSTxPS1 catalog.



In [9]:

    
df = pd.DataFrame()
df['ra'] = ra; df['dec'] = dec
df['sel'] = sel 
df['imag'] = imag; df['kmag'] = kmag
df['zspec'] = zspec
df['Qflag'] = Qflag; df['Q'] = Q
df['class'] = sgFlag



In [42]:

    
df[0:10]



In [11]:

    
df.to_csv('./DEIMOS/DEIMOS.csv', index=None)



In [12]:

    
import star_galaxy_models
rf_obj = star_galaxy_models.RandomForestModel()
rf_obj.read_rf_from_pickle()



In [13]:

    
features = ['wwpsfChiSq', 'wwExtNSigma', 'wwpsfLikelihood',
            'wwPSFKronRatio', 'wwPSFKronDist',  'wwPSFApRatio',
            'wwmomentRH', 'wwmomentXX', 'wwmomentXY', 'wwmomentYY', 
            'wwKronRad']



In [14]:

    
from sklearn.metrics import roc_curve, accuracy_score, auc, make_scorer

ROC curve and Accuracy



In [13]:

    
ps1_dei = pd.read_csv('./DEIMOS/PS1_DEIMOS_features.csv').drop_duplicates(subset='objid')
print("PS1xDEIMOS catalog constains %i sources."%len(ps1_dei))









    



PS1xDEIMOS catalog constains 4481 sources.



In [14]:

    
ps1_dei_det_mask = np.logical_and(ps1_dei['class'] != -999, (ps1_dei.nDetections>0)&(ps1_dei.wwKronFlux>0))
ps1_dei = ps1_dei[ps1_dei_det_mask]
print("%i sources are classified by both of the DEIMOS and the ML model."%len(ps1_dei))









    



2709 sources are classified by both of the DEIMOS and the ML model.



In [15]:

    
ps1_df = pd.read_csv('./DEIMOS/HST_COSMOS_features.csv')



In [16]:

    
dupl_mask = np.empty(len(ps1_dei), dtype=bool)
for i in range(len(dupl_mask)):
    dupl_mask[i] = ps1_dei.objid.iloc[i] in np.array(ps1_df.objid)
    
print("Only %i sources are included both of the PS1xDEIMOS and the PS1xHST catalog..."%np.sum(dupl_mask))
ps1_dei = ps1_dei[~dupl_mask]
#print("%i sources are not contained in PS1xHST catalog."%len(ps1_dei))









    



Only 35 sources are included both of the PS1xDEIMOS and the PS1xHST catalog...



In [19]:

    
kron_mag = -2.5*np.log10(ps1_dei.wwKronFlux/3631)
ps1_dei_features = ps1_dei[features]
ps1_dei_class = ps1_dei['class']



In [20]:

    
ps1_dei_score = rf_obj.rf_clf_.predict_proba(ps1_dei_features)
ps1_dei_pred = rf_obj.rf_clf_.predict(ps1_dei_features)



In [21]:

    
print("Overall accuracy of the classification by the ML model is %f"%accuracy_score(ps1_dei_class, ps1_dei_pred))









    



Overall accuracy of the classification by the ML model is 0.949140



In [22]:

    
fpr, tpr, thre = roc_curve(ps1_dei_class, ps1_dei_score[:,1])
plt.grid(linestyle='dotted')
plt.plot(fpr, tpr, 'k-')
#plt.xscale('log'); plt.yscale('log')
plt.xlim(1e-3, 1e-1); plt.ylim(0.1, 1.01)
plt.xlabel('FPR'); plt.ylabel('TPR')
plt.show()



In [23]:

    
ps1_dei_class = np.array(ps1_dei_class)
ps1_dei_score = np.array(ps1_dei_score)
kron_mag = np.array(kron_mag)

binwidth = 1.5
Nboot = 100
mag_array = np.arange(14 , 23+binwidth, binwidth)
kron_mag = np.array(-2.5*np.log10(ps1_dei['wwKronFlux']/3631))

ml_acc_arr = np.zeros_like(mag_array, dtype=float)
ml_boot_scatt = np.vstack((np.zeros_like(mag_array, dtype=float), np.zeros_like(mag_array, dtype=float)))

for bin_num, binedge in enumerate(mag_array):
    bin_sources = np.where((kron_mag >= binedge) & (kron_mag < binedge + binwidth))
    ml_acc_arr[bin_num] = accuracy_score(ps1_dei_class[bin_sources], 
                                         ps1_dei_pred[bin_sources])
    ml_boot_acc = np.empty(Nboot)
    for i in range(Nboot):
        boot_sources = np.random.choice(bin_sources[0], len(bin_sources[0]), 
                                        replace=True)
        ml_boot_acc[i] = accuracy_score(ps1_dei_class[boot_sources], 
                                         ps1_dei_pred[boot_sources])

    ml_boot_scatt[:,bin_num] = np.percentile(ml_boot_acc, [16, 84])



In [24]:

    
from sklearn.neighbors import KernelDensity
kde_grid = np.linspace(10,26,200)
deimos_stars = np.where(ps1_dei_class == 1)
deimos_gal = np.where(ps1_dei_class == 0)
deimos_kde_gal_norm = len(deimos_gal[0])/len(ps1_dei_class)
deimos_kde_star_norm = 1 - deimos_kde_gal_norm

kde_deimos = KernelDensity(bandwidth=1.059*np.std(kron_mag, ddof=1)*len(kron_mag)**(-0.2),
                         rtol=1E-4)
kde_deimos.fit(kron_mag[:, np.newaxis])

kde_deimos_stars = KernelDensity(bandwidth=1.059*np.std(kron_mag[deimos_stars], ddof=1)*len(kron_mag[deimos_stars])**(-0.2),
                               rtol=1E-4)
kde_deimos_stars.fit(kron_mag[deimos_stars[0], np.newaxis])

kde_deimos_gal = KernelDensity(bandwidth=1.059*np.std(kron_mag[deimos_gal], ddof=1)*len(kron_mag[deimos_gal])**(-0.2),
                             rtol=1E-4)
kde_deimos_gal.fit(kron_mag[deimos_gal[0], np.newaxis])

pdf_deimos = np.exp(kde_deimos.score_samples(kde_grid[:, np.newaxis]))
pdf_deimos_stars = np.exp(kde_deimos_stars.score_samples(kde_grid[:, np.newaxis]))
pdf_deimos_gal = np.exp(kde_deimos_gal.score_samples(kde_grid[:, np.newaxis]))



In [35]:

    
from matplotlib.ticker import MultipleLocator
#import seaborn as sns
color_dict = {'ml': "black"}

mag_bin_centers = mag_array + binwidth/2
#cmap_star = sns.cubehelix_palette(rot=0.5, light=0.7,dark=0.3,as_cmap=True)
#cmap_gal = sns.cubehelix_palette(start=0.3,rot=-0.5,light=0.7,dark=0.3,as_cmap=True)

fig, ax = plt.subplots(figsize=(8, 5))

ax.grid(linestyle='dotted', zorder=1)
ax.errorbar(mag_bin_centers, ml_acc_arr, 
            yerr=np.abs(ml_boot_scatt - ml_acc_arr), 
            ls='-', lw=.75, fmt='o',
            color=color_dict['ml'], label="ML model", 
           linewidth=1.5, markersize=7.5, zorder=5)

# add KDE plots
ax.fill(kde_grid, pdf_deimos + 0.5, alpha=0.4, color="0.7", zorder=2)
ax.fill(kde_grid, pdf_deimos_gal*deimos_kde_gal_norm + 0.5, alpha=0.7, zorder=3)#, color=cmap_gal(0.25))
ax.fill(kde_grid, pdf_deimos_stars*deimos_kde_star_norm + 0.5, alpha=0.7, zorder=4)#, color=cmap_star(0.25))

ax.set_ylim(0.5,1.01)
ax.set_xlim(14, 24)
ax.tick_params(which="both", top=True, right=True, labelsize=15)
ax.set_xlabel('whiteKronMag', fontsize=15)
ax.set_ylabel('Accuracy', fontsize=15)

ax.yaxis.set_minor_locator(MultipleLocator(0.025))
ax.xaxis.set_major_locator(MultipleLocator(2))
ax.xaxis.set_minor_locator(MultipleLocator(0.5))

#ax.legend(bbox_to_anchor=(0.01, 0.3, 1., 0.102), loc=3, fontsize=13)

fig.subplots_adjust(top=0.98,right=0.98,left=0.1,bottom=0.12)









    



/Users/tachibana/.pyenv/versions/anaconda2-4.0.0/envs/py36/lib/python3.6/site-packages/matplotlib/cbook/__init__.py:2446: UserWarning: Saw kwargs ['lw', 'linewidth'] which are all aliases for 'linewidth'.  Kept value from 'linewidth'
  seen=seen, canon=canonical, used=seen[-1]))

Accuracy v.s. MAG with DEIMOSxHST



In [201]:

    
from astropy.table import Table



In [202]:

    
deimos = pd.read_csv('./DEIMOS/DEIMOS.csv')



In [203]:

    
hst = Table.read('./DEIMOS/HST_COSMOS.fit').to_pandas()



In [204]:

    
hstX = np.empty((len(hst), 2), dtype=np.float64)
hstX[:, 0] = hst['ALPHA_J2000']
hstX[:, 1] = hst['DELTA_J2000']
deiX = np.empty((len(deimos), 2), dtype=np.float64)
deiX[:, 0] = deimos['ra']
deiX[:, 1] = deimos['dec']

Cross-matching the sources in the DEIMOS catalog within radius = 0.5 arcsec around those in the HST catalog



In [205]:

    
from astroML.crossmatch import crossmatch_angular
max_radius = 0.5 / 3600  # 0.5 arcsec
dist, ind = crossmatch_angular(hstX, deiX, max_radius)
match = ~np.isinf(dist)



In [206]:

    
print("The number of sources cross-matched is %i"%np.sum(match))









    



The number of sources cross-matched is 9233



In [207]:

    
plt.hist(dist[match]*3600, bins=np.arange(0, 0.5,0.01))
plt.xlabel('Distance')
plt.show()

The distribution of the distance has a peak at 0.1 arcsec. Changes the cross-matching radius to 0.3 arcsec.



In [208]:

    
from astroML.crossmatch import crossmatch_angular
max_radius = 0.3 / 3600  # 0.3 arcsec
dist, ind = crossmatch_angular(hstX, deiX, max_radius)
match = ~np.isinf(dist)



In [209]:

    
print("The number of sources cross-matched is %i"%np.sum(match))









    



The number of sources cross-matched is 7897



In [210]:

    
plt.hist(dist[match]*3600, bins=np.arange(0, 0.5,0.01))
plt.xlabel('Distance')
plt.show()



In [211]:

    
hst_match = hst[match]
deimos_match = deimos.loc[ind[match]]

Remove duplicated sources.



In [212]:

    
dupl_mask = deimos_match.duplicated('ra')



In [213]:

    
deimos_match_uniq = deimos_match[~dupl_mask.values]
hst_match_uniq = hst_match[~dupl_mask.values]

Remove the sources which are not able to be classified to star or galaxy by the DEIMOS catalog.



In [214]:

    
good_mask = deimos_match_uniq["class"] != -999



In [215]:

    
deimos_match_uniq_good = deimos_match_uniq[good_mask.values]
hst_match_uniq_good = hst_match_uniq[good_mask.values]



In [216]:

    
print("The number of sources used to verify the classification accuracy is %i"%len(deimos_match_uniq))









    



The number of sources used to verify the classification accuracy is 7589



In [217]:

    
xlims = [12, 29]
ylims = [12, 29]
plt.hexbin(hst_match_uniq["MAG_BEST"], deimos_match_uniq["imag"], 
           extent=[xlims[0], xlims[1], ylims[0], ylims[1]], 
          bins='log', cmap='viridis')
plt.xlim(xlims); plt.ylim(ylims)
plt.xlabel('MAG_BEST(HST)')
plt.ylabel('imag(DEIMOS)')









    Out[217]:





Text(0,0.5,'imag(DEIMOS)')



In [218]:

    
from sklearn.metrics import accuracy_score



In [219]:

    
print("The overall accuracy od the crassification of the HST catalog is %0.4f"\
      %accuracy_score(deimos_match_uniq_good["class"], hst_match_uniq_good["MU_CLASS"]-1))









    



The overall accuracy od the crassification of the HST catalog is 0.9489

Accuracy v.s. MAG



In [220]:

    
dei_class = np.array(deimos_match_uniq_good["class"], dtype=int)
hst_class = np.array(hst_match_uniq_good["MU_CLASS"]-1, dtype=int)
kron_mag = np.array(hst_match_uniq_good["MAG_BEST"])

binwidth = 1
Nboot = 100
mag_array = np.arange(14 , 26+binwidth, binwidth)

ml_acc_arr = np.zeros_like(mag_array, dtype=float)
ml_boot_scatt = np.vstack((np.zeros_like(mag_array, dtype=float), np.zeros_like(mag_array, dtype=float)))

for bin_num, binedge in enumerate(mag_array):
    bin_sources = np.where((kron_mag >= binedge) & (kron_mag < binedge + binwidth))
    ml_acc_arr[bin_num] = accuracy_score(dei_class[bin_sources], 
                                         hst_class[bin_sources])
    ml_boot_acc = np.empty(Nboot)
    for i in range(Nboot):
        boot_sources = np.random.choice(bin_sources[0], len(bin_sources[0]), 
                                        replace=True)
        ml_boot_acc[i] = accuracy_score(dei_class[boot_sources], 
                                         hst_class[boot_sources])

    ml_boot_scatt[:,bin_num] = np.percentile(ml_boot_acc, [16, 84])



In [221]:

    
from sklearn.neighbors import KernelDensity
kde_grid = np.linspace(10,29,200)
deimos_stars = np.where(dei_class == 1)
deimos_gal = np.where(dei_class == 0)
deimos_kde_gal_norm = len(deimos_gal[0])/len(dei_class)
deimos_kde_star_norm = 1 - deimos_kde_gal_norm

kde_deimos = KernelDensity(bandwidth=1.059*np.std(kron_mag, ddof=1)*len(kron_mag)**(-0.2),
                         rtol=1E-4)
kde_deimos.fit(kron_mag[:, np.newaxis])

kde_deimos_stars = KernelDensity(bandwidth=1.059*np.std(kron_mag[deimos_stars], ddof=1)*len(kron_mag[deimos_stars])**(-0.2),
                               rtol=1E-4)
kde_deimos_stars.fit(kron_mag[deimos_stars[0], np.newaxis])

kde_deimos_gal = KernelDensity(bandwidth=1.059*np.std(kron_mag[deimos_gal], ddof=1)*len(kron_mag[deimos_gal])**(-0.2),
                             rtol=1E-4)
kde_deimos_gal.fit(kron_mag[deimos_gal[0], np.newaxis])

pdf_deimos = np.exp(kde_deimos.score_samples(kde_grid[:, np.newaxis]))
pdf_deimos_stars = np.exp(kde_deimos_stars.score_samples(kde_grid[:, np.newaxis]))
pdf_deimos_gal = np.exp(kde_deimos_gal.score_samples(kde_grid[:, np.newaxis]))



In [222]:

    
from matplotlib.ticker import MultipleLocator
#import seaborn as sns
color_dict = {'ml': "black"}

mag_bin_centers = mag_array + binwidth/2
#cmap_star = sns.cubehelix_palette(rot=0.5, light=0.7,dark=0.3,as_cmap=True)
#cmap_gal = sns.cubehelix_palette(start=0.3,rot=-0.5,light=0.7,dark=0.3,as_cmap=True)

fig, ax = plt.subplots(figsize=(8, 5))

ax.grid(linestyle='dotted', zorder=1)
ax.errorbar(mag_bin_centers, ml_acc_arr, 
            yerr=np.abs(ml_boot_scatt - ml_acc_arr), 
            ls='-', lw=.75, fmt='o',
            color=color_dict['ml'], label="ML model", 
           linewidth=1.5, markersize=7.5, zorder=5)

# add KDE plots
ax.fill(kde_grid, pdf_deimos + 0.5, alpha=0.4, color="0.7", zorder=2)
ax.fill(kde_grid, pdf_deimos_gal*deimos_kde_gal_norm + 0.5, alpha=0.7, zorder=3)#, color=cmap_gal(0.25))
ax.fill(kde_grid, pdf_deimos_stars*deimos_kde_star_norm + 0.5, alpha=0.7, zorder=4)#, color=cmap_star(0.25))

ax.set_ylim(0.5,1.01)
ax.set_xlim(14, 27)
ax.tick_params(which="both", top=True, right=True, labelsize=15)
ax.set_xlabel('MAG_BEST', fontsize=15)
ax.set_ylabel('Accuracy', fontsize=15)

ax.yaxis.set_minor_locator(MultipleLocator(0.025))
ax.xaxis.set_major_locator(MultipleLocator(2))
ax.xaxis.set_minor_locator(MultipleLocator(0.5))

#ax.legend(bbox_to_anchor=(0.01, 0.3, 1., 0.102), loc=3, fontsize=13)

fig.subplots_adjust(top=0.98,right=0.98,left=0.1,bottom=0.12)









    



/Users/tachibana/.pyenv/versions/anaconda2-4.0.0/envs/py36/lib/python3.6/site-packages/matplotlib/cbook/__init__.py:2446: UserWarning: Saw kwargs ['lw', 'linewidth'] which are all aliases for 'linewidth'.  Kept value from 'linewidth'
  seen=seen, canon=canonical, used=seen[-1]))



In [ ]:

	ra	dec	sel	imag	kmag	zspec	Qflag	Q	class
0	149.358553	2.750301	512	20.11	20.19	0.5655	2	1.5	0
1	149.366394	2.376694	128	22.89	21.73	0.7080	3	2.0	0
2	149.369095	2.344088	128	21.73	19.99	0.7080	3	2.0	0
3	149.373750	2.776508	2	18.17	17.13	0.0000	4	2.0	1
4	149.374100	2.736905	2	21.37	19.37	0.5490	2	1.5	0
5	149.374496	2.415250	2	15.41	15.14	0.0000	4	2.0	1
6	149.375504	2.423475	128	24.28	21.87	1.0790	1	1.5	0
7	149.377594	2.363211	2	18.49	18.14	0.0000	4	2.0	1
8	149.381103	2.361963	512	21.84	20.43	0.7720	4	2.0	0
9	149.382095	2.409508	128	22.50	20.64	0.9300	3	2.0	0