How to use extreme deconvolution (XD) as a classifier

XD is a code for estimating an underlying probability density of some physical features in the presence of (understood!) experimental covariances. In short, given data with known noise properties, XD "deconvolves" the known covariances from the noisy distribution, producing an estimate of data that would have been obtained with an "ideal" apparatus.
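
Concretely, XD models the underlying (noise-free) density as a mixture of Gaussians and scores each observed point $x_i$ against the mixture after convolving every component with that object's noise covariance $S_i$ (this is the formulation of Bovy, Hogg & Roweis 2011):

\begin{align} p(x_i) = \sum_j \alpha_j \, \mathcal{N}(x_i \,|\, \mu_j, V_j + S_i) \end{align}

where $\alpha_j$, $\mu_j$, and $V_j$ are the amplitudes, means, and covariances of the deconvolved mixture, estimated by maximizing the likelihood of the observed (noisy) data.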

In the lQSO classification problem, we want to estimate the following using Bayes' rule:

\begin{align} P(lQSO|features) = \frac{P(features|lQSO)P(lQSO)}{P(features|lQSO)P(lQSO) + P(features| \neg lQSO)P( \neg lQSO)} \end{align}

$P(lQSO)$ is the prior probability that an object in our catalog is a lensed quasar.

  • Short term: use the expected number of lQSOs per square degree from the simulation.
  • Long term: account for PS1 sensitivity, etc.

The likelihood function, $P(features|lQSO)$, depends on both the underlying physics and the noise properties of our experiment. Because many of the lQSOs we expect from PS1 will have noisy measurements, and because this noise may lead to fake detections, we need to use XD to evaluate this likelihood function in terms of the underlying physics only.

Unfortunately, XD is fundamentally a Gaussian mixture model (GMM), and is therefore most efficient at modeling distributions that are close to multivariate Gaussians. This means that the number counts of astronomical objects (which follow power-law scalings), though physically important, would force the XD model to use many (>20) Gaussian components, which is highly inefficient. We have two options for dealing with non-Gaussian features (a quick illustration of the first option follows this list):

  • Gaussian-ize certain features by taking some transformation (log, etc.)
  • Bin the data in terms of the power law variable, and express remaining features in relative terms (colors, etc.)
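
To illustrate the first option: a heavy-tailed, power-law-like feature becomes much closer to Gaussian after a log transform, so one mixture component can do the work of many. This is a minimal sketch with illustrative variable names only (the fake 'flux' is drawn from a lognormal, for which the log is exactly Gaussian):


In [ ]:
import numpy as np
import matplotlib.pyplot as plt

flux = np.random.lognormal(mean=0., sigma=1., size=5000) # skewed, heavy-tailed feature
logflux = np.log10(flux)                                 # Gaussian after the transform

ax1 = plt.subplot(211)
_ = plt.hist(flux, bins=50)    # heavy right tail: would need many GMM components
plt.subplot(212)
_ = plt.hist(logflux, bins=50) # close to Gaussian: one component suffices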

The second is the choice made by Bovy et al. in the SDSS quasar targeting paper. Within each bin, the likelihood function is then represented as:

\begin{align} P(features|lQSO) = P(\text{gaussian features} \,|\, \text{i-band mag}, lQSO)\, P(\text{i-band mag} \,|\, lQSO) \end{align}

where the first term can now be estimated using XD, and the second term is another "prior", estimated from (a short sketch of the short-term option follows this list):

  • Short term: directly estimate from histogramming simulation objects.
  • Long term: literature search for underlying lQSO luminosity function, adjusted for PS1 sensitivity, etc.
  • Other long term: infer number counts in bins directly from PS1 data (separate part of likelihood model)
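
A minimal sketch of the short-term option, using the simulated lQSO catalog loaded in the first code cell below (column 1 is mag_model_i); the bin edges here are an assumption:


In [ ]:
import numpy as np

# histogram the simulated i-band magnitudes to estimate P(i-band bin | lQSO)
i_edges = np.linspace(16., 23., 15)               # assumed bin edges
counts, _ = np.histogram(lQSO[:,1], bins=i_edges)
p_imag = counts / float(counts.sum())             # normalized prior per bin

# the full likelihood is then the per-bin XD density times this prior:
# P(features | lQSO) = P(gaussian features | bin, lQSO) * p_imag[bin]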

Similar reasoning applies to the second term in the denominator, where $P( \neg lQSO)$ is just $1-P(lQSO)$ and $P(features| \neg lQSO)$ is estimated from (a minimal posterior sketch follows this list):

  • Short term: XD density estimation fitting to Adri's fakes
  • Long term: XD fit to known duds in PS1 data.
  • Question: Can we just use everything as the sample of duds, given that lQSOs are rare?
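
Putting the pieces together, a minimal sketch of the posterior from Bayes' rule above, assuming two already-trained XD density models xd_lqso and xd_dud (hypothetical names) and a scalar prior:


In [ ]:
import numpy as np

def p_lqso(X, Xcov, xd_lqso, xd_dud, prior):
    """Posterior P(lQSO | features) for features X with per-object covariances Xcov."""
    like_l = np.sum(np.exp(xd_lqso.logprob_a(X, Xcov)), axis=1) # P(features | lQSO)
    like_d = np.sum(np.exp(xd_dud.logprob_a(X, Xcov)), axis=1)  # P(features | ~lQSO)
    return like_l*prior / (like_l*prior + like_d*(1.-prior))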

Let's noisify Adri's data and re-clean it with XD


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from IPython.html.widgets import interact, interactive, fixed
from IPython.html import widgets
dataLoc = '../../'

labels = ['ID','mag_model_i','g-r', 'r-i', 'i-z', 'WISE1', 'WISE2' ]
pQSO = np.loadtxt(dataLoc+'pQSO/pSDSScolmag.txt')
lQSO = np.loadtxt(dataLoc+'lQSO/SDSScolmag.txt')
sinQSO = np.loadtxt(dataLoc+'sinQSO/sSDSScolmag.txt')
unlQSO = np.loadtxt(dataLoc+'unlQSO/nlSDSScolmag.txt')
unlQSO[:,5:7] = -unlQSO[:,5:7] #bug in WISE magnitudes for this file

duds = np.concatenate((pQSO,unlQSO,sinQSO),axis=0)
data = np.concatenate((lQSO,duds),axis=0)
print lQSO.shape, duds.shape, data.shape
numPts = data.shape[0]
data = data[:,2:] #don't use IDs or i-band magnitudes


(1015, 7) (4596, 7) (5611, 7)

Generate noisified and XD-corrected data


In [6]:
from astroML.density_estimation import XDGMM
from astroML.plotting.tools import draw_ellipse
import triangle

nFeatures = data.shape[1]

#generate a random covariance matrix and use it to perturb the data
#(note: built from only two random draws, so covmat is rank-1/singular)
perturbingSigma = .2
z = zip(np.random.normal(scale=perturbingSigma,size=nFeatures),
        np.random.normal(scale=perturbingSigma,size=nFeatures))
covmat = np.cov(z)
perturbations = np.random.multivariate_normal(np.zeros(nFeatures),covmat,size=data.shape[0])
noisyData = data + perturbations

#XD needs one covariance matrix per data point; stack N copies of covmat
covmatlist = np.dstack([covmat]*data.shape[0])
    
# fit the XD model to the noisy data and its per-object covariances
xdgmm = XDGMM(n_components=7)
xdgmm.fit(noisyData, covmatlist.transpose())
logp = xdgmm.logprob_a(noisyData, covmatlist.transpose()) # per-component log densities
resampledXD = xdgmm.sample(data.shape[0]) # draw "corrected" samples from the deconvolved model

In [25]:
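# sum of exp(logprob_a) over components = total model density at each data point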
_ = plt.hist(np.sum(np.exp(logp),axis=1),bins=20)
print data.shape
print np.max(np.exp(logp))


(5611, 5)
54.7850119842

Make cornerplots of "truth", noisy "data", and corrected data


In [10]:
#corner plots comparing 'truth', the noisified data, and the XD-resampled data
text = ['g-r', 'r-i', 'i-z', 'WISE1', 'WISE2' ]
fig1 = triangle.corner(data,labels=text)
fig1.suptitle('(Noisy) Data plotted over \'Truth\'')
_ = triangle.corner(noisyData, fig=fig1,color='r')

fig2 = triangle.corner(noisyData,color='r',labels=text)
fig2.suptitle('Corrected Data plotted over Noisy data')
_ = triangle.corner(resampledXD,fig=fig2,color='b')

fig3 = triangle.corner(data,labels=text)
fig3.suptitle('Corrected Data plotted over \'Truth\'')
_ = triangle.corner(resampledXD,fig=fig3,color='b')

fig4 = triangle.corner(data,labels=text)
fig4.suptitle('Everything')
_ = triangle.corner(noisyData, fig=fig4,color='r')
_ = triangle.corner(resampledXD,fig=fig4,color='b')


XD Classification (Sequential)

Now that we have our fake data, let's classify...

To-do:

  • noisify corrected data to recover noisified data (sketched below)
  • how does runtime of XDGMM scale with n_components, dataset size
  • get some lQSO probs
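
A minimal sketch of the first to-do item, re-noisifying the XD resampling with the same covmat used above and overplotting it on the noisified data:


In [ ]:
# add the known noise back onto the deconvolved samples...
renoised = resampledXD + np.random.multivariate_normal(
    np.zeros(nFeatures), covmat, size=resampledXD.shape[0])

# ...and compare against the noisy data we started from
fig = triangle.corner(noisyData, color='r', labels=text)
_ = triangle.corner(renoised, fig=fig, color='g')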

XD Classification (Object-oriented)

The same as above, but using the modular code.


In [26]:
"""
class Dataset(object):

    __init__(self)
        #fits header properties
    
    plot(self):
        #make triangle plot; overlay on current triangle plot, if any.
        return
"""
# import PS1QLS
from astroML.density_estimation import XDGMM
from astroML.plotting.tools import draw_ellipse
import triangle

class Classifier(object):

    def __init__(self,algorithm='XD'):
        if algorithm == 'XD':
            self.xdmodel = XDGMM(n_components=7)
        else: 
            print 'Not implemented yet!'
        
        return
    
    def train(self,training,covmatlist):
        self.xdmodel.fit(training, covmatlist.transpose())
        return
    
    def test(self,test,covmatlist):
        #total model density per object: sum of exp(logprob_a) over components
        self.xdprobs=np.sum(np.exp(self.xdmodel.logprob_a(test, covmatlist.transpose())),axis=1)
        return
    
    def report(self,truth):
        return
        
    
RF = Classifier(algorithm='XD')
RF.train(noisyData,covmatlist)           # fits the XD density model
RF.test(noisyData,covmatlist)            # calculates per-object densities for classification
plt.hist(RF.xdprobs,bins=20)
#RF.report()                  # make plots, report statistics etc.
#RF.save()                    # pickles itself
#RF.score(testtruth)          # plots an ROC curve


Out[26]:
(array([  3.64300000e+03,   9.37000000e+02,   3.38000000e+02,
          1.96000000e+02,   1.24000000e+02,   9.20000000e+01,
          7.40000000e+01,   5.50000000e+01,   3.70000000e+01,
          2.30000000e+01,   3.20000000e+01,   1.40000000e+01,
          1.30000000e+01,   9.00000000e+00,   1.20000000e+01,
          3.00000000e+00,   4.00000000e+00,   3.00000000e+00,
          1.00000000e+00,   1.00000000e+00]),
 array([  1.49011953e-05,   2.59272715e+00,   5.18543940e+00,
          7.77815165e+00,   1.03708639e+01,   1.29635761e+01,
          1.55562884e+01,   1.81490006e+01,   2.07417129e+01,
          2.33344251e+01,   2.59271374e+01,   2.85198496e+01,
          3.11125619e+01,   3.37052741e+01,   3.62979864e+01,
          3.88906986e+01,   4.14834109e+01,   4.40761231e+01,
          4.66688354e+01,   4.92615476e+01,   5.18542599e+01]),
 <a list of 20 Patch objects>)

In [28]:
print RF.xdprobs[100:200]


[  1.54865582e+00   2.76151527e+00   1.56564172e-02   5.08603855e-03
   4.92101567e+00   1.55083463e+01   2.61857002e-01   8.52920748e-01
   5.78535629e+00   2.15676650e+01   1.89916347e+00   9.28785735e+00
   1.46619692e+00   6.10502454e-01   3.65396751e+00   2.02173380e+00
   1.62027292e+01   4.36045205e-02   3.36097990e-02   1.34895043e+00
   4.84997076e+00   2.31457819e+00   4.71579677e-01   1.61676053e+00
   9.76581738e-01   2.24022091e+00   2.05510403e+00   1.28213675e+00
   3.31971528e+00   6.80718669e-01   6.74321653e-01   4.21843868e-01
   2.57595327e+00   1.96999830e+01   5.73538702e+00   1.49031644e-01
   1.90359651e+00   2.53732173e+00   3.36480854e-02   4.10099842e-01
   2.27059255e-01   9.71060940e+00   1.61694507e+00   7.59078966e-02
   3.10593701e+00   7.44165567e+00   4.19153321e-01   1.87634999e+00
   4.60862779e-01   1.41829569e-02   1.55529765e+00   1.39401089e+00
   5.18510695e-01   1.74380681e+00   1.94485111e+00   4.71591975e-01
   1.26891368e+00   1.27103395e+01   1.03425470e+00   2.41276717e+00
   4.64161515e-01   4.40804472e+00   4.31213697e-02   2.40340293e-01
   2.60914565e+00   3.15309050e+00   5.12207233e-02   1.16958998e+01
   1.07867140e+01   1.03972809e+00   2.96584755e-01   2.16007367e+00
   2.04022921e-01   4.94977929e-02   3.04168270e-01   3.67770014e+00
   1.16653932e+00   7.31071686e+00   9.08478613e-01   3.96021973e-01
   5.51384433e+00   1.83482538e+01   1.54181242e-02   3.02159096e-01
   5.07279617e-01   1.38485626e+00   2.10417776e-01   7.94530582e-01
   1.15725813e+00   4.40162852e+00   6.36428698e+00   3.19453475e-01
   1.38819171e+01   4.81945064e-01   9.84663951e-01   3.95379187e-02
   1.53203281e+00   2.20652475e+00   1.18030643e+00   1.09933674e+01]

XD: 1-D test


In [2]:
from astroML.density_estimation import XDGMM
X = np.random.normal(size=(1000, 1)) # 1000 points in 1 dimension
ax1 = plt.subplot(211)
plt.title('XD Before and After')
_ = plt.hist(X,bins=20)
Xerr = np.random.random((1000, 1, 1)) # 1000 1x1 covariance matrices
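# NB: these claimed errors were never actually added to X, so XD will deconvolve
# noise that is not there and the recovered width shrinks (see the sigmas below)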
xdgmm = XDGMM(n_components=2)
xdgmm.fit(X, Xerr) # fit the model
logp=xdgmm.logprob_a(X, Xerr) # evaluate probability
X_new =  xdgmm.sample(1000)
plt.subplot(212,sharex=ax1)
_ = plt.hist(X_new,bins=20)
print 'Sigma before = ', np.std(X), 'Sigma after = ', np.std(X_new)


Sigma before =  0.96744447552 Sigma after =  0.777544605608

In [4]:
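# per-component densities at x=0 for one point with claimed variance 0.1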
print np.exp(xdgmm.logprob_a(np.zeros((1,1)),.1*np.ones((1,1,1))))


[[ 0.48889569  0.47047499]]

In [3]:
import brewer2mpl
from IPython.display import display
bmap = brewer2mpl.get_map('Set1', 'qualitative', 4)
col = bmap.mpl_colors

def compare(x,y):
    plt.figure()
    plt.scatter(pQSO[:,x],pQSO[:,y],c=col[1],alpha=.75,label='QSO+QSO Alignment')
    plt.scatter(lQSO[:,x],lQSO[:,y],c=col[0],alpha=.75,label='Lensed QSO')
    plt.scatter(sinQSO[:,x],sinQSO[:,y],c=col[2],alpha=.75,label='Single QSO')
    plt.scatter(unlQSO[:,x],unlQSO[:,y],c=col[3],alpha=.75,label='QSO+LRG Alignment')
    plt.xlabel(labels[x])
    plt.ylabel(labels[y])
    plt.legend(numpoints=1,bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

In [13]:
interact(compare,x=(0,6),y=(0,6))