Baseline Flagging: Test on real data

In a previous document (BaselineFlagging), I examined the possibility of mitigating bad baselines by "flagging out" the worst offenders. In this document, I'll go into more detail on a specific data set to see how it performs.

I'll check the following criteria:

  1. Do the baselines look better when the "bad" ones are flagged out? This is subjective and therefore difficult to generalize.
  2. Are the bad baselines localized to a particular region? If so, flagging is not a good option: it does not help the whole data set, and removes data (no matter how bad) from a region of interest
  3. Are the criteria correct? i.e., can better results be achieved by flagging more or less?

In [1]:
from pyspeckit.spectrum.readers import read_class
cl = read_class.ClassObject('/Users/adam/work/gc/apex/april2014/M-093.F-0009-2014-2014-04/M-093.F-0009-2014-2014-05-04.apex')
cl


Out[1]:
ClassObject(4430161552) with 17438 entries
sources: set(['RAFGL1922   ', 'SATURN      ', 'SGRB2(N)    ', 'MAP_115     ', 'MAP_116     '])
tels: set(['AP-H201-X202', 'AP-H201-X201', 'AP-H201-PA01', 'AP-H201-PA02'])
lines: set(['CO(2-1)     ', 'shfi219ghz  '])
scans: set([20960, 20961, 20963, 20964, 20943, 20946, 20947, 20950, 20951, 20953, 20954, 20956, 20957])

In [2]:
spectra = cl.get_spectra(sourcere='^MAP_115', telescope='AP-H201-X201')

In [3]:
# Get the data as an array ("spectra" is a list of spectrum,header pairs)
data = np.array([s for s,h in spectra])
header = spectra[0][1]
data.shape


Out[3]:
(5686, 32768)

In [4]:
mean_spectrum = data.mean(axis=0)
dft = np.fft.fft(data, axis=1)
mode = 3 # the fourier mode to flag
mode_data = np.abs(dft[:, mode])
fraction = 0.05
threshold = np.percentile(mode_data, 100*(1-fraction))
good = mode_data < threshold
mean_spectrum_flagged = data[good].mean(axis=0)

In [20]:
import pylab as pl
pl.rcParams['figure.figsize'] = [16,8]

In [21]:
h,l,p = pl.hist(mode_data, bins=50)
lines = pl.vlines(threshold,0,h.max(),color='k',linestyle='--')



In [22]:
import pylab as pl
pl.plot(mean_spectrum, 'k,', alpha=0.5)
pl.plot(mean_spectrum_flagged, 'r,', alpha=0.5)


Out[22]:
[<matplotlib.lines.Line2D at 0x27c99a390>]

In [23]:
pl.plot(mean_spectrum-mean_spectrum_flagged, 'k,')


Out[23]:
[<matplotlib.lines.Line2D at 0x27a39a4d0>]

The 5% worst spectra have significant baseline ripples, so perhaps it's worth mitigating those.


In [25]:
frq = np.fft.fftfreq(data.shape[1])
vres = header['VRES']
fvr = data.shape[1]*vres
fvelo = fvr/(frq/frq[1])/2.
meanpsd = np.abs(dft).mean(axis=0)
meanpsdflagged = np.abs(dft[good,:]).mean(axis=0)

In [32]:
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.xlim(frq[meanpsd>0].min(), frq[meanpsd>0].max())
pl.ylim(1e2,2e3)


Out[32]:
(100.0, 2000.0)

This PSD shows that removing the top 5% of the mode #3 spectra didn't actually affect the power spectrum that much. Also, though, it lets us see that there are two narrow, high-frequency regions that have high amplitude.


In [27]:
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.xlim(3e-3,9e-3)
pl.ylim(1.5e2,4e2)
from matplotlib.ticker import FormatStrFormatter
pl.gca().xaxis.set_minor_formatter(FormatStrFormatter("%.2g"))
pl.gca().yaxis.set_minor_formatter(FormatStrFormatter("%.2g"))



In [28]:
ohohfour = np.argmin(np.abs(frq-0.004))
ohohsix = np.argmin(np.abs(frq-0.006))
ohoheight = np.argmin(np.abs(frq-0.008))
peak1 = np.argmax(meanpsd[ohohfour:ohohsix]) + ohohfour
peak2 = np.argmax(meanpsd[ohohsix:ohoheight]) + ohohsix
print "Fourier frequencies of the peaks: ",frq[peak1],frq[peak2]
print "Velocity scales: ",fvelo[peak1],fvelo[peak2]


Fourier frequencies of the peaks:  0.00411987304688 0.0067138671875
Velocity scales:  12.6738208912 7.77711736506

In [29]:
# See what happens if we flag these modes
mode_data1 = np.abs(dft[:, peak1])
threshold1 = np.percentile(mode_data1, 100*(1-fraction))
good1 = mode_data1 < threshold1
mean_spectrum_flagged1 = data[good1].mean(axis=0)

mode_data2 = np.abs(dft[:, peak2])
threshold2 = np.percentile(mode_data2, 100*(1-fraction))
good2 = mode_data2 < threshold2
mean_spectrum_flagged2 = data[good2].mean(axis=0)

In [30]:
h,l,p = pl.hist(np.abs(dft[:,peak1]), bins=50, histtype='step', color='b')
lines = pl.vlines(threshold1,0,h.max(),color='b',linestyle='--')
h,l,p = pl.hist(np.abs(dft[:,peak2]), bins=50, histtype='step', color='g')
lines = pl.vlines(threshold2,0,h.max(),color='g',linestyle='--')



In [39]:
pl.plot(mean_spectrum, 'k,', alpha=0.5)
pl.plot(mean_spectrum_flagged1, 'b,', alpha=0.5)
#pl.xlim(5000,10000)
pl.figure()
pl.plot(mean_spectrum-mean_spectrum_flagged1, 'b.', alpha=0.5)
#pl.xlim(5000,10000)
pl.ylim(-0.005, 0.025)


Out[39]:
(-0.005, 0.025)

In [38]:
pl.plot(mean_spectrum, 'k,', alpha=0.5)
pl.plot(mean_spectrum_flagged2, 'g,', alpha=0.5)
#pl.xlim(5000,15000)
pl.figure()
pl.plot(mean_spectrum-mean_spectrum_flagged2, 'g.', alpha=0.5)
#pl.xlim(5000,15000)
pl.ylim(0.0,0.025)


Out[38]:
(0.0, 0.025)

Despite the power in these fourier components, there is no really obvious effect on the spectra. There SHOULD be an effect on the PSDs though...


In [41]:
meanpsdflagged1 = np.abs(dft[good1,:]).mean(axis=0)
meanpsdflagged2 = np.abs(dft[good2,:]).mean(axis=0)

In [43]:
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.loglog(frq, meanpsdflagged1, 'b')
pl.loglog(frq, meanpsdflagged2, 'g')
pl.xlim(3e-3,9e-3)
pl.ylim(1.5e2,4e2)
from matplotlib.ticker import FormatStrFormatter
pl.gca().xaxis.set_minor_formatter(FormatStrFormatter("%.2g"))
pl.gca().yaxis.set_minor_formatter(FormatStrFormatter("%.2g"))


The fact that we see no effect here indicates that these fourier components are present at a comparable level in all the spectra.


In [15]:
lon = np.array([h['OFF1'] for s,h in spectra])
lat = np.array([h['OFF2'] for s,h in spectra])

In [16]:
pl.plot(lon, lat, 'ks', markeredgecolor='none')
pl.plot(lon[~good], lat[~good], 'r.', alpha=0.8)


Out[16]:
[<matplotlib.lines.Line2D at 0x27b565950>]

In [17]:
pl.plot(lon, lat, 'ks', markeredgecolor='none')
pl.plot(lon[~good1], lat[~good1], 'g.', alpha=0.8)


Out[17]:
[<matplotlib.lines.Line2D at 0x27dba5cd0>]

In [18]:
pl.plot(lon, lat, 'ks', markeredgecolor='none')
pl.plot(lon[~good2], lat[~good2], 'b.', alpha=0.8)


Out[18]:
[<matplotlib.lines.Line2D at 0x27dc71550>]

The spatial distribution of the bad spectra is mostly random, with occasional peaks on individual scans, so there is no obvious harm in flagging them.


In [18]: