Baseline Flagging: Test on real data

In a previous document (BaselineFlagging), I examined the possibility of mitigating bad baselines by "flagging out" the worst offenders. In this document, I'll go into more detail on a specific data set to see how it performs.

I'll check the following criteria:

Do the baselines look better when the "bad" ones are flagged out? This is subjective and therefore difficult to generalize.
Are the bad baselines localized to a particular region? If so, flagging is not a good option: it does not help the whole data set, and removes data (no matter how bad) from a region of interest
Are the criteria correct? i.e., can better results be achieved by flagging more or less?



In [1]:

    
from pyspeckit.spectrum.readers import read_class
cl = read_class.ClassObject('/Users/adam/work/gc/apex/april2014/M-093.F-0009-2014-2014-04/M-093.F-0009-2014-2014-05-04.apex')
cl









    Out[1]:





ClassObject(4430161552) with 17438 entries
sources: set(['RAFGL1922   ', 'SATURN      ', 'SGRB2(N)    ', 'MAP_115     ', 'MAP_116     '])
tels: set(['AP-H201-X202', 'AP-H201-X201', 'AP-H201-PA01', 'AP-H201-PA02'])
lines: set(['CO(2-1)     ', 'shfi219ghz  '])
scans: set([20960, 20961, 20963, 20964, 20943, 20946, 20947, 20950, 20951, 20953, 20954, 20956, 20957])



In [2]:

    
spectra = cl.get_spectra(sourcere='^MAP_115', telescope='AP-H201-X201')



In [3]:

    
# Get the data as an array ("spectra" is a list of spectrum,header pairs)
data = np.array([s for s,h in spectra])
header = spectra[0][1]
data.shape









    Out[3]:





(5686, 32768)



In [4]:

    
mean_spectrum = data.mean(axis=0)
dft = np.fft.fft(data, axis=1)
mode = 3 # the fourier mode to flag
mode_data = np.abs(dft[:, mode])
fraction = 0.05
threshold = np.percentile(mode_data, 100*(1-fraction))
good = mode_data < threshold
mean_spectrum_flagged = data[good].mean(axis=0)



In [20]:

    
import pylab as pl
pl.rcParams['figure.figsize'] = [16,8]



In [21]:

    
h,l,p = pl.hist(mode_data, bins=50)
lines = pl.vlines(threshold,0,h.max(),color='k',linestyle='--')



In [22]:

    
import pylab as pl
pl.plot(mean_spectrum, 'k,', alpha=0.5)
pl.plot(mean_spectrum_flagged, 'r,', alpha=0.5)









    Out[22]:





[<matplotlib.lines.Line2D at 0x27c99a390>]



In [23]:

    
pl.plot(mean_spectrum-mean_spectrum_flagged, 'k,')









    Out[23]:





[<matplotlib.lines.Line2D at 0x27a39a4d0>]

The 5% worst spectra have significant baseline ripples, so perhaps it's worth mitigating those.



In [25]:

    
frq = np.fft.fftfreq(data.shape[1])
vres = header['VRES']
fvr = data.shape[1]*vres
fvelo = fvr/(frq/frq[1])/2.
meanpsd = np.abs(dft).mean(axis=0)
meanpsdflagged = np.abs(dft[good,:]).mean(axis=0)



In [32]:

    
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.xlim(frq[meanpsd>0].min(), frq[meanpsd>0].max())
pl.ylim(1e2,2e3)









    Out[32]:





(100.0, 2000.0)

This PSD shows that removing the top 5% of the mode #3 spectra didn't actually affect the power spectrum that much. Also, though, it lets us see that there are two narrow, high-frequency regions that have high amplitude.



In [27]:

    
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.xlim(3e-3,9e-3)
pl.ylim(1.5e2,4e2)
from matplotlib.ticker import FormatStrFormatter
pl.gca().xaxis.set_minor_formatter(FormatStrFormatter("%.2g"))
pl.gca().yaxis.set_minor_formatter(FormatStrFormatter("%.2g"))



In [28]:

    
ohohfour = np.argmin(np.abs(frq-0.004))
ohohsix = np.argmin(np.abs(frq-0.006))
ohoheight = np.argmin(np.abs(frq-0.008))
peak1 = np.argmax(meanpsd[ohohfour:ohohsix]) + ohohfour
peak2 = np.argmax(meanpsd[ohohsix:ohoheight]) + ohohsix
print "Fourier frequencies of the peaks: ",frq[peak1],frq[peak2]
print "Velocity scales: ",fvelo[peak1],fvelo[peak2]









    



Fourier frequencies of the peaks:  0.00411987304688 0.0067138671875
Velocity scales:  12.6738208912 7.77711736506



In [29]:

    
# See what happens if we flag these modes
mode_data1 = np.abs(dft[:, peak1])
threshold1 = np.percentile(mode_data1, 100*(1-fraction))
good1 = mode_data1 < threshold1
mean_spectrum_flagged1 = data[good1].mean(axis=0)

mode_data2 = np.abs(dft[:, peak2])
threshold2 = np.percentile(mode_data2, 100*(1-fraction))
good2 = mode_data2 < threshold2
mean_spectrum_flagged2 = data[good2].mean(axis=0)



In [30]:

    
h,l,p = pl.hist(np.abs(dft[:,peak1]), bins=50, histtype='step', color='b')
lines = pl.vlines(threshold1,0,h.max(),color='b',linestyle='--')
h,l,p = pl.hist(np.abs(dft[:,peak2]), bins=50, histtype='step', color='g')
lines = pl.vlines(threshold2,0,h.max(),color='g',linestyle='--')



In [39]:

    
pl.plot(mean_spectrum, 'k,', alpha=0.5)
pl.plot(mean_spectrum_flagged1, 'b,', alpha=0.5)
#pl.xlim(5000,10000)
pl.figure()
pl.plot(mean_spectrum-mean_spectrum_flagged1, 'b.', alpha=0.5)
#pl.xlim(5000,10000)
pl.ylim(-0.005, 0.025)









    Out[39]:





(-0.005, 0.025)



In [38]:

    
pl.plot(mean_spectrum, 'k,', alpha=0.5)
pl.plot(mean_spectrum_flagged2, 'g,', alpha=0.5)
#pl.xlim(5000,15000)
pl.figure()
pl.plot(mean_spectrum-mean_spectrum_flagged2, 'g.', alpha=0.5)
#pl.xlim(5000,15000)
pl.ylim(0.0,0.025)









    Out[38]:





(0.0, 0.025)

Despite the power in these fourier components, there is no really obvious effect on the spectra. There SHOULD be an effect on the PSDs though...



In [41]:

    
meanpsdflagged1 = np.abs(dft[good1,:]).mean(axis=0)
meanpsdflagged2 = np.abs(dft[good2,:]).mean(axis=0)



In [43]:

    
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.loglog(frq, meanpsdflagged1, 'b')
pl.loglog(frq, meanpsdflagged2, 'g')
pl.xlim(3e-3,9e-3)
pl.ylim(1.5e2,4e2)
from matplotlib.ticker import FormatStrFormatter
pl.gca().xaxis.set_minor_formatter(FormatStrFormatter("%.2g"))
pl.gca().yaxis.set_minor_formatter(FormatStrFormatter("%.2g"))

The fact that we see no effect here indicates that these fourier components are present at a comparable level in all the spectra.



In [15]:

    
lon = np.array([h['OFF1'] for s,h in spectra])
lat = np.array([h['OFF2'] for s,h in spectra])



In [16]:

    
pl.plot(lon, lat, 'ks', markeredgecolor='none')
pl.plot(lon[~good], lat[~good], 'r.', alpha=0.8)









    Out[16]:





[<matplotlib.lines.Line2D at 0x27b565950>]



In [17]:

    
pl.plot(lon, lat, 'ks', markeredgecolor='none')
pl.plot(lon[~good1], lat[~good1], 'g.', alpha=0.8)









    Out[17]:





[<matplotlib.lines.Line2D at 0x27dba5cd0>]



In [18]:

    
pl.plot(lon, lat, 'ks', markeredgecolor='none')
pl.plot(lon[~good2], lat[~good2], 'b.', alpha=0.8)









    Out[18]:





[<matplotlib.lines.Line2D at 0x27dc71550>]

The spatial distribution of the bad spectra is mostly random, with occasional peaks on individual scans, so there is no obvious harm in flagging them.



In [18]: