Smoothing and Binning Data

By James Keaveney (james.keaveney@durham.ac.uk)

In this tutorial, we show two commonly used data processing techniques. The first is a moving-average smoothing routine, and the second is a data-binning routine. We then give an example script to show the differences between the two.

Let's set up the python environments and import everything we'll need later on:


In [1]:
%matplotlib inline
from IPython.display import display

from numpy import array,append,arange,zeros,exp,sin,random,std
from pylab import plot,show,figure,clf

Moving-average smoothing

We first define our moving-average smoothing function. It is based on a triangular weighting function of a specified width. For convenience, the output array is the same length as the input, but the smoothing is only applied over N - 2n data points (n points are removed from each end of the array), where N is the length of the data array and n is the half-width of the smoothing window.
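To see what the triangular weighting looks like before defining the full function, here's a quick sketch for a half-width of npts = 3 (the construction mirrors the line inside the function below):

```python
import numpy as np

npts = 3
# weights rise linearly to a peak of npts + 1 at the centre, then fall back to 1
triangle = np.array(list(range(npts)) + [npts] + list(range(npts))[::-1]) + 1
print(triangle)        # [1 2 3 4 3 2 1]
print(triangle.sum())  # 16
```

Dividing by the sum of the weights (16 here) turns this into a normalised weighted average.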


In [2]:
def smooth_triangle(data, npts):
    """ Performs moving-average smoothing with a triangular weighting.
        The effective length of the smoothed array is shorter by 2*npts
        points (npts removed from each end).
        Inputs:     data:       1D array or list
                    npts:       int, half-width of the smoothing window
        Outputs:    smoothed:   1D array
    """

    # if not an array already, make it an array
    data = array(data)

    # define triangular weighting - twice the length of npts
    # [::-1] means reverse order
    triangle = array(list(range(npts)) + [npts] + list(range(npts))[::-1]) + 1

    smoothed = []
    for i in range(len(data) - len(triangle)):
        point = data[i:i+len(triangle)] * triangle
        smoothed.append(sum(point) / sum(triangle))

    j = len(data) - len(smoothed)
    # pad half of j (rounded) raw data points back on to the smoothed array
    # at either end; slightly different for odd and even j
    if j % 2 == 1:
        for i in range((j-1)//2):
            smoothed.append(data[-1-(j-1)//2+i])
            smoothed.insert(0, data[(j-1)//2-i])
        smoothed.append(data[-1])
    else:
        for i in range(j//2):
            smoothed.append(data[-1-i])
            smoothed.insert(0, data[i])

    return array(smoothed)
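As an aside, the same triangular moving average can be written with numpy.convolve. This is a sketch of an equivalent approach (not the function used in the rest of this tutorial), with the caveat that mode='same' zero-pads at the edges rather than copying the raw data back in as smooth_triangle does:

```python
import numpy as np

# a hypothetical noisy test signal
rng = np.random.default_rng(0)
data = np.sin(np.linspace(0, 10, 500)) + rng.normal(0, 0.3, 500)

npts = 20
triangle = np.array(list(range(npts)) + [npts] + list(range(npts))[::-1]) + 1.0
weights = triangle / triangle.sum()

# mode='same' keeps the output the same length as the input,
# but zero-pads at the edges instead of reusing the raw data
smoothed = np.convolve(data, weights, mode='same')

print(len(smoothed))  # 500
```

The convolution version is much faster for large arrays, since the loop runs in compiled code.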

Data binning

Binning data is useful where the initial data size is very large. The output from this function is smaller by a specified factor n, which also removes some noise since every n data points are averaged into one.


In [3]:
def bin_data(x, y, blength):
    """ Takes two arrays x and y and bins them into groups of blength.
        blength must be odd; if it isn't, it is made odd.
        Inputs:     x, y:                   1D lists or numpy arrays
        Outputs:    xout, yout, yerrout:    1D numpy arrays
    """

    # convert to arrays if necessary
    x = array(x)
    y = array(y)

    # make sure blength is odd
    if blength % 2 == 0:
        blength += 1

    nobins = len(x) // blength
    xmid = (blength - 1) // 2
    xbinmax = nobins * blength - xmid
    xout, yout, yerrout = [], [], []
    for i in range(xmid, xbinmax, blength):
        xmin = i - xmid
        xmax = i + xmid
        xout.append(sum(x[xmin:xmax+1]) / blength)
        yout.append(sum(y[xmin:xmax+1]) / blength)
        yerrout.append(std(y[xmin:xmax+1]))

    return array(xout), array(yout), array(yerrout)
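For comparison, when any leftover points beyond a whole number of bins can simply be dropped, the same binning can be done in a vectorised way with reshape. This is a sketch of the idea (using a hypothetical linear test signal), not the function above:

```python
import numpy as np

# hypothetical test data: a line plus Gaussian noise
x = np.arange(100, dtype=float)
y = 2 * x + np.random.default_rng(1).normal(0, 1, 100)

blength = 5
nobins = len(x) // blength  # 20 bins; any leftover points are dropped

# reshape into (nobins, blength) rows, then average each row
xb = x[:nobins * blength].reshape(nobins, blength).mean(axis=1)
yb = y[:nobins * blength].reshape(nobins, blength).mean(axis=1)
yerr = y[:nobins * blength].reshape(nobins, blength).std(axis=1)

print(len(xb), xb[0])  # 20 bins; the first bin is centred on x = 2.0
```

As with the smoothing example, the vectorised version avoids an explicit Python loop.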

Comparing the two

First, here's an example where both functions do just about the same thing. The data length is reasonable at 10k points (typical of an oscilloscope output file, for example), and the data is fairly noisy.


In [4]:
# Start with a 10k array
x = arange(0,300,0.03)
print('Initial array length:', len(x))

# generate a smooth underlying signal
A = 200; B = 0.3; C = 4
y = exp(-x/A) * sin(B*x) * C

# Add Gaussian noise
y_noise = random.randn(len(x))*0.4
y += y_noise

# Plot the raw data
fig1 = figure(1)
clf()
ax1 = fig1.add_subplot(111)
ax1.plot(x,y,'k',lw=2)

#display(fig1)


Initial array length: 10000
Out[4]:
[<matplotlib.lines.Line2D at 0x107013250>]

Now let's smooth this data. We start with the moving average, taking 100 points as our smoothing window (play around with this number to see its effect).


In [5]:
fig2 = figure(2)
clf()
ax2 = fig2.add_subplot(111)

# smooth with moving average
window = 100
y_smoothed = smooth_triangle(y,window)

ax2.plot(x,y,'k',lw=2)
ax2.plot(x,y_smoothed,'r.',ms=3)

#display(fig2)


Out[5]:
[<matplotlib.lines.Line2D at 0x107180850>]

This looks pretty good: the data is much smoother without losing much information. Note the ends of the data, where the smoothing isn't applied (only really visible on the right side in this example). It's easy to crop this off if desired; just do

x_smoothed = x[window:-window]
y_smoothed = y_smoothed[window:-window]

Next, let's bin the data instead.


In [6]:
fig3 = figure(3)
clf()
ax3 = fig3.add_subplot(111)

# bin and plot
window = 25
x_binned,y_binned,y_error = bin_data(x,y,window)
ax3.plot(x,y,'k',lw=2)
ax3.plot(x_binned,y_binned,'r.',ms=9)


Out[6]:
[<matplotlib.lines.Line2D at 0x10766b110>]

In both these examples, we've only plotted the data points in red, with no line between them. The binned data arrays are much shorter than the smoothed data. There are a few reasons to choose one method over the other:

  • If fitting the data, the binned data will process faster since there are fewer points
  • If the data array is large, the plot may be slow and resulting figure file size may be large
  • However, if the data array is already short, binning the data may introduce aliasing

We'll look at the last case in more detail now.

Bad binning - aliasing

Let's take the same data we had in the previous example and plot it as a line, instead of just the points.


In [7]:
#clear the axes...
ax3.cla()

#...and redraw
ax3.plot(x,y,'k',lw=1.5)
ax3.plot(x_binned,y_binned,'r-',lw=1.5)

display(fig3)


Now let's bin the data with a very large window and add it to the plot:


In [8]:
#select large binning window, relative to the data size
window = 400
x_binned, y_binned, y_error = bin_data(x,y,window)

ax3.plot(x_binned,y_binned,'g-',lw=3)

display(fig3)


Now we can see that bad things are happening: the data has been binned far too much, and aliasing has resulted.

As a rule of thumb, the binning window should never be bigger than a small fraction of the data array length, otherwise aliasing can occur. Of course, there's nothing to stop you running both smoothing and binning; for the best results, run the moving-average smoothing first.
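Putting that rule of thumb into practice, a smooth-then-bin pipeline might look like the following sketch. It uses a convolution-based triangular smooth and reshape-based binning in place of the two functions above (the signal parameters match the earlier example):

```python
import numpy as np

# the same kind of signal as in the earlier example: decaying sine plus noise
rng = np.random.default_rng(2)
x = np.arange(0, 300, 0.03)
y = np.exp(-x / 200) * np.sin(0.3 * x) * 4 + rng.normal(0, 0.4, len(x))

# step 1: triangular moving-average smooth (edges are zero-padded here)
npts = 100
tri = np.array(list(range(npts)) + [npts] + list(range(npts))[::-1]) + 1.0
y_smoothed = np.convolve(y, tri / tri.sum(), mode='same')

# step 2: bin the smoothed data with a modest window
blength = 25
nobins = len(x) // blength
x_binned = x[:nobins * blength].reshape(nobins, blength).mean(axis=1)
y_binned = y_smoothed[:nobins * blength].reshape(nobins, blength).mean(axis=1)

print(len(x), '->', len(x_binned))  # 10000 -> 400
```

Here the binning window (25) is small compared with both the data length and the period of the oscillation, so no aliasing is introduced.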