By James Keaveney (james.keaveney@durham.ac.uk)
In this tutorial, we show two commonly used data processing techniques. The first is a moving-average smoothing routine, and the second is a data-binning routine. We then give an example script to show the differences between the two.
Let's set up the python environments and import everything we'll need later on:
In [1]:
%matplotlib inline
from IPython.display import display
from numpy import array,append,arange,zeros,exp,sin,random,std
from pylab import plot,show,figure,clf
We first define our moving average smoothing. It is based on a triangular weighting function of a specified width. For convenience, the output array is the same length as the input, but the smoothing is only applied over N - 2n data points (removing n from each end of the array), where N is the length of the data array and n is the width of the smoothing window.
In [2]:
def smooth_triangle(data,npts):
""" performs moving average smoothing with a triangular weighting.
Effective length of smoothed array is shorter by 'npts' points.
Inputs data: 1D array or list
Outputs: smoothed: 1D array
"""
# if not an array already, make it an array
data = array(data)
# define triangular weighting - twice the length of npts
# [::-1] means reverse order
triangle = array(range(npts) + [npts] + range(npts)[::-1]) + 1
smoothed=[]
for i in range(0,len(data)-len(triangle)):
point=data[i:i+len(triangle)]*triangle
smoothed.append(sum(point)/sum(triangle))
j = len(data)-len(smoothed)
# add half of j (rounded) data points back on to the smoothed array at either end
# slightly different for odd and even array length
if j%2==1:
for i in range(0,(j-1)/2):
smoothed.append(data[-1-(j-1)/2+i])
smoothed.insert(0,data[(j-1)/2-i])
smoothed.append(data[-1])
else:
for i in range(0,j/2):
smoothed.append(data[-1-i])
smoothed.insert(0,data[i])
return array(smoothed)
Binning data is useful where the initial data size is very large. The output from this function is smaller by a specified factor n, which also removes some noise since every n data points are averaged into one.
In [3]:
def bin_data(x,y,blength):
""" Takes 2 arrays x and y and bins them into groups of blength.
blength must be odd, if not it's made odd.
Inputs: x, y: 1D lists or numpy arrays
Outputs: xout, yout, yerrout: 1D numpy arrays
"""
# convert to arrays if necessary
x = array(x)
y = array(y)
# check oddness
if blength % 2 == 0:
blength += 1
nobins = len(x)/blength
xmid = (blength-1)/2
xbinmax = nobins*blength - xmid
xout,yout,yerrout = [],[],[]
for i in range(int(xmid),int(xbinmax),int(blength)):
xmin = i-int(xmid)
xmax = i+int(xmid)
xout.append(sum(x[xmin:xmax+1])/blength)
yout.append(sum(y[xmin:xmax+1])/blength)
yerrout.append(std(y[xmin:xmax+1]))
return array(xout),array(yout),array(yerrout)
First, here's an example where both functions do just about the same thing. The data length is reasonable at 10k points (typical of an oscilloscope output file, for example), and the data is fairly noisy.
In [4]:
# Start with a 10k array
x = arange(0,300,0.03)
print 'Initial array length:',len(x)
# random data
A = 200; B = 0.3; C = 4
y = exp(-x/A) * sin(B*x) * C
# Add Gaussian noise
y_noise = random.randn(len(x))*0.4
y += y_noise
# Plot raw data - two separate figures for saving, one for viewing
fig1 = figure(1)
clf()
ax1 = fig1.add_subplot(111)
ax1.plot(x,y,'k',lw=2)
#display(fig1)
Out[4]:
Now let's smooth this data - we start with the moving average, taking 100 points as our smoothing window (play around with this number to see it's effect).
In [5]:
fig2 = figure(2)
clf()
ax2 = fig2.add_subplot(111)
# smooth with moving average
window = 100
y_smoothed = smooth_triangle(y,window)
ax2.plot(x,y,'k',lw=2)
ax2.plot(x,y_smoothed,'r.',ms=3)
#display(fig2)
Out[5]:
This looks pretty good, the data is much smoother without losing much information. Note the ends of the data where the smoothing isn't applied (only really visible on the right side in this example) - it's easy to crop this off if desired, just do
x_smoothed = x[window:-window]
y_smoothed = y_smoothed[window:-window]
Next, let's bin the data instead.
In [6]:
fig3 = figure(3)
clf()
ax3 = fig3.add_subplot(111)
# bin and plot
window = 25
x_binned,y_binned,y_error = bin_data(x,y,window)
ax3.plot(x,y,'k',lw=2)
ax3.plot(x_binned,y_binned,'r.',ms=9)
Out[6]:
In both these examples, we've only plotted the data points in red - with no line between them. The length of the binned data arrays is much smaller than the smoothed data. There are a couple of reasons to choose one over the other:
We'll look at the last case in more detail now.
Let's take the same data we had in the previous example and plot it as a line, instead of just the points.
In [7]:
#clear the axes...
ax3.cla()
#...and redraw
ax3.plot(x,y,'k',lw=1.5)
ax3.plot(x_binned,y_binned,'r-',lw=1.5)
display(fig3)
Now let's bin the data with a very large window and add it to the plot:
In [8]:
#select large binning window, relative to the data size
window = 400
x_binned, y_binned, y_error = bin_data(x,y,window)
ax3.plot(x_binned,y_binned,'g-',lw=3)
display(fig3)
Now we can see that bad things are happening... The data has been binned far too much and aliasing has resulted.
As a rule of thumb, the binning window should never be bigger than some fraction of the data array length, otherwise aliasing could occur. Of course, there's nothing to stop running both smoothing and binning - just run the moving average smoothing first, for the best results.