Baseline Flagging: Test on real data

In a previous document (BaselineFlagging), I examined the possibility of mitigating bad baselines by "flagging out" the worst offenders. In this document, I'll go into more detail on a specific data set to see how it performs.

I'll check the following criteria:

  1. Do the baselines look better when the "bad" ones are flagged out? This is subjective and therefore difficult to generalize.
  2. Are the bad baselines localized to a particular region? If so, flagging is not a good option: it does not help the whole data set, and removes data (no matter how bad) from a region of interest
  3. Are the criteria correct? i.e., can better results be achieved by flagging more or less?

In [1]:
from pyspeckit.spectrum.readers import read_class
cl = read_class.ClassObject('/Users/adam/work/gc/apex/april2014/M-093.F-0009-2014-2014-04/M-093.F-0009-2014-2014-05-04.apex')
cl


Out[1]:
ClassObject(4430251536) with 17438 entries
sources: set(['RAFGL1922   ', 'SATURN      ', 'SGRB2(N)    ', 'MAP_115     ', 'MAP_116     '])
tels: set(['AP-H201-X202', 'AP-H201-X201', 'AP-H201-PA01', 'AP-H201-PA02'])
lines: set(['CO(2-1)     ', 'shfi219ghz  '])
scans: set([20960, 20961, 20963, 20964, 20943, 20946, 20947, 20950, 20951, 20953, 20954, 20956, 20957])

In [2]:
spectra = cl.get_spectra(sourcere='^MAP_115', telescope='AP-H201-X202')

In [3]:
# Get the data as an array ("spectra" is a list of spectrum,header pairs)
data = np.array([s for s,h in spectra])
header = spectra[0][1]
data.shape


Out[3]:
(5686, 32768)

In [4]:
mean_spectrum = data.mean(axis=0)
dft = np.fft.fft(data, axis=1)
mode = 3 # the fourier mode to flag
mode_data = np.abs(dft[:, mode])
fraction = 0.05
threshold = np.percentile(mode_data, 100*(1-fraction))
good = mode_data < threshold
mean_spectrum_flagged = data[good].mean(axis=0)

In [24]:
import pylab as pl
pl.rcParams['figure.figsize'] = [16,8]

In [25]:
h,l,p = pl.hist(mode_data, bins=50)
lines = pl.vlines(threshold,0,h.max(),color='k',linestyle='--')



In [26]:
pl.plot(mean_spectrum, 'k,', alpha=0.5)
pl.plot(mean_spectrum_flagged, 'r,', alpha=0.5)


Out[26]:
[<matplotlib.lines.Line2D at 0x27b14ce10>]

In [27]:
pl.plot(mean_spectrum-mean_spectrum_flagged, 'k,')


Out[27]:
[<matplotlib.lines.Line2D at 0x27acacc50>]

The 5% worst spectra have significant baseline ripples, so perhaps it's worth mitigating those.


In [39]:
frq = np.fft.fftfreq(data.shape[1])
vres = header['VRES']
fvr = data.shape[1]*vres
fvelo = fvr/(frq/frq[1])/2.
meanpsd = np.abs(dft).mean(axis=0)
meanpsdflagged = np.abs(dft[good,:]).mean(axis=0)

In [40]:
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.xlim(frq[meanpsd>0].min(), frq[meanpsd>0].max())
pl.ylim(1e2,1e3)


Out[40]:
(100.0, 1000.0)

This PSD shows that removing the top 5% of the mode #3 spectra didn't actually affect the power spectrum that much.

Unlike the high frequency spectra, though, there are no obvious peaks at high (fourier) frequency to flag out.


In [29]:
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.xlim(3e-3,9e-3)
pl.ylim(1.5e2,4e2)
from matplotlib.ticker import FormatStrFormatter
pl.gca().xaxis.set_minor_formatter(FormatStrFormatter("%.2g"))
pl.gca().yaxis.set_minor_formatter(FormatStrFormatter("%.2g"))



In [30]:
ohohfour = np.argmin(np.abs(frq-0.004))
ohohsix = np.argmin(np.abs(frq-0.006))
ohoheight = np.argmin(np.abs(frq-0.008))
peak1 = np.argmax(meanpsd[ohohfour:ohohsix]) + ohohfour
peak2 = np.argmax(meanpsd[ohohsix:ohoheight]) + ohohsix
print "Fourier frequencies of the peaks: ",frq[peak1],frq[peak2]
print "Velocity scales: ",fvelo[peak1],fvelo[peak2]


Fourier frequencies of the peaks:  0.0040283203125 0.0067138671875
Velocity scales:  -12.9618622751 -7.77711736506

In [58]:
# See what happens if we flag these modes
mode_data1 = np.abs(dft[:, peak1])
threshold1 = np.percentile(mode_data1, 100*(1-fraction))
good1 = mode_data1 < threshold1
mean_spectrum_flagged1 = data[good1].mean(axis=0)

mode_data2 = np.abs(dft[:, peak2])
threshold2 = np.percentile(mode_data2, 100*(1-fraction))
good2 = mode_data2 < threshold2
mean_spectrum_flagged2 = data[good2].mean(axis=0)


Out[58]:
(5401, 5401, 5401)

In [32]:
h,l,p = pl.hist(np.abs(dft[:,peak1]), bins=50, histtype='step', color='b')
lines = pl.vlines(threshold1,0,h.max(),color='b',linestyle='--')
h,l,p = pl.hist(np.abs(dft[:,peak2]), bins=50, histtype='step', color='g')
lines = pl.vlines(threshold2,0,h.max(),color='g',linestyle='--')



In [56]:
pl.title("Unflagged")
pl.plot(mean_spectrum, 'k.', alpha=0.5)
pl.xlim(10000,15000)
pl.ylim(0.25,0.5)
pl.figure()
pl.title("Flagged")
pl.plot(mean_spectrum_flagged1, 'b.', alpha=0.5)
pl.xlim(10000,15000)
pl.ylim(0.25,0.5)
pl.figure()
pl.plot(mean_spectrum-mean_spectrum_flagged1, 'b.', alpha=0.5)
pl.xlim(10000,15000)
pl.ylim(-0.03, 0.04)


Out[56]:
(-0.03, 0.04)

This is a very bad result: flagging out the 5% "worst" spectra in this bin made the result look terrible! This is disturbing, and probably indicates that the mean looks OK only because these gigantic and terrible looking features are getting averaged out. But, of course, they aren't getting averaged out in the maps, so the map spectra look like panel 3 here.


In [57]:
pl.plot(mean_spectrum, 'k.', alpha=0.5)
pl.xlim(10000,15000)
pl.ylim(0.25,0.5)
pl.figure()
pl.plot(mean_spectrum_flagged2, 'g.', alpha=0.5)
pl.xlim(10000,15000)
pl.ylim(0.25,0.5)
pl.figure()
pl.plot(mean_spectrum-mean_spectrum_flagged2, 'g.')
pl.xlim(10000,15000)
pl.ylim(-0.02,0.03)


Out[57]:
(-0.02, 0.03)

Now let's examine the effect on the PSDS:


In [64]:
meanpsdflagged1 = np.abs(dft[good1,:]).mean(axis=0)
meanpsdflagged2 = np.abs(dft[good2,:]).mean(axis=0)

In [65]:
pl.loglog(frq, meanpsd, 'k')
pl.loglog(frq, meanpsdflagged, 'r')
pl.loglog(frq, meanpsdflagged1, 'b')
pl.loglog(frq, meanpsdflagged2, 'g')
pl.xlim(3e-3,9e-3)
pl.ylim(1.5e2,4e2)
from matplotlib.ticker import FormatStrFormatter
pl.gca().xaxis.set_minor_formatter(FormatStrFormatter("%.2g"))
pl.gca().yaxis.set_minor_formatter(FormatStrFormatter("%.2g"))


There's nothing remarkable about the PSD; the suppression is spread over frequency.


In [35]:
lon = np.array([h['OFF1'] for s,h in spectra])
lat = np.array([h['OFF2'] for s,h in spectra])

In [49]:
pl.plot(lon, lat, 'ks', markeredgecolor='none', markersize=20)
pl.plot(lon[~good], lat[~good], 'r.', alpha=0.8)


Out[49]:
[<matplotlib.lines.Line2D at 0x27aec7590>]

In [50]:
pl.plot(lon, lat, 'ks', markeredgecolor='none', markersize=20)
pl.plot(lon[~good1], lat[~good1], 'g.', alpha=0.8)


Out[50]:
[<matplotlib.lines.Line2D at 0x27b576e50>]

In [51]:
pl.plot(lon, lat, 'ks', markeredgecolor='none', markersize=20)
pl.plot(lon[~good2], lat[~good2], 'b.', alpha=0.8)


Out[51]:
[<matplotlib.lines.Line2D at 0x27f794a90>]

The spatial distribution of the bad spectra is mostly random, with occasional peaks on individual scans, so there is no obvious harm in flagging them.

The flagging above was based primarily on work done in another notebook on a different data set, so I want to assess whether there might be more logical fourier modes to flag on


In [63]:
for ii in range(1,13):
    if (ii-1)%3 == 0: pl.figure()
    pl.hist(np.abs(dft[:,ii]), bins=50, histtype='step', label=str(ii))
    pl.legend(loc='best')



In [ ]: