Fast matrix biquad

This notebook shows the math behind a very fast NEON implementation of biquad IIR filters. One of the primitives NEON is highly optimized for is multiplying a matrix by a vector, especially a 4x4 matrix by a 4-vector.

A biquad filter has two state variables (this is most apparent in transposed direct form II). For this implementation, the input vector consists of two input samples and the two state variables, and the output vector consists of two output samples and the updated state. The relationship is linear, i.e. the transformation from input vector to output vector is simply a matrix-vector multiplication. Each iteration therefore computes two samples using a single 4x4 matrix-vector multiply.
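Schematically (my notation; in the code further below the whole 4-vector is simply called y):

$$\begin{pmatrix} y_n \\ y_{n+1} \\ s_1' \\ s_2' \end{pmatrix} = M \begin{pmatrix} x_n \\ x_{n+1} \\ s_1 \\ s_2 \end{pmatrix}$$

where $x$ are inputs, $y$ outputs, $s_1, s_2$ the two state variables, and $M$ the 4x4 matrix constructed below.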


In [117]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

We start with a simple implementation of biquad for reference; we'll compare the error of the matrix approach against this scalar implementation. This is in direct form I.
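The code below implements the standard direct form I difference equation

$$y_n = b_0 x_n + b_1 x_{n-1} + b_2 x_{n-2} - a_1 y_{n-1} - a_2 y_{n-2}.$$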


In [118]:
def biquad(coefs, inp):
    b0, b1, b2, a1, a2 = coefs
    out = zeros(len(inp))
    x1 = 0
    x2 = 0
    y1 = 0
    y2 = 0
    for i in range(len(inp)):
        x = inp[i]
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out[i] = y
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

As a starting step, here's the same biquad in a 2x2 matrix formulation. This version is derived from transposed direct form II, with the effect of the current input sample on the next state vector expanded out (this is the B array below). The A matrix describes the evolution of the IIR state.
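As a sketch of the derivation (my notation, with $s_1, s_2$ the two transposed direct form II state variables), the per-sample update is

$$y_n = b_0 x_n + s_1, \qquad s_1' = b_1 x_n - a_1 y_n + s_2, \qquad s_2' = b_2 x_n - a_2 y_n,$$

and substituting $y_n$ into the state update gives

$$\begin{pmatrix} s_1' \\ s_2' \end{pmatrix} = \begin{pmatrix} -a_1 & 1 \\ -a_2 & 0 \end{pmatrix} \begin{pmatrix} s_1 \\ s_2 \end{pmatrix} + \begin{pmatrix} b_1 - a_1 b_0 \\ b_2 - a_2 b_0 \end{pmatrix} x_n,$$

which is exactly the A and B below.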


In [119]:
def bqmat(coefs, inp):
    b0, b1, b2, a1, a2 = coefs
    A = array([[-a1, 1], [-a2, 0]])
    B = array([b1 - a1 * b0, b2 - a2 * b0])
    y = array([0, 0])
    out = np.zeros(len(inp))
    for i in range(len(inp)):
        x = inp[i]
        out[i] = y[0] + x * b0
        y = dot(A, y) + B * x
    return out

Let's test it out.


In [120]:
impulse = zeros(100)
impulse[10] = 1
coefs = 1.0207, -1.7719, .9376, -1.7719, 0.9583  # peaking
plot(biquad(coefs, impulse))
plot(bqmat(coefs, impulse))


Out[120]:
[<matplotlib.lines.Line2D at 0x10cecce10>]

In [121]:
plot(biquad(coefs, impulse) - bqmat(coefs, impulse))


Out[121]:
[<matplotlib.lines.Line2D at 0x10d3a9e50>]

Note the scale of the error -- it's floating point rounding.

Now for the 4x4 matrix approach. It's really four 2x2 matrices tiled. The top left corner is the influence of the two input samples on the two output samples; it's the first two values of the biquad's impulse response, banded. The top right corner is the contribution of the state vector to the output. The bottom left is the effect of the input on the state, and the bottom right is the evolution of the IIR. Since each iteration is two samples, this matrix is simply the square of the A matrix above.
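In block form (my labels; A and B are the matrices of the 2x2 version, and $h_0 = b_0$, $h_1 = b_1 - a_1 b_0$ are the first two impulse response samples):

$$M = \begin{pmatrix} H & C \\ D & A^2 \end{pmatrix}, \qquad H = \begin{pmatrix} h_0 & 0 \\ h_1 & h_0 \end{pmatrix}, \quad C = \begin{pmatrix} 1 & 0 \\ -a_1 & 1 \end{pmatrix}, \quad D = \begin{pmatrix} AB & B \end{pmatrix},$$

where $D$ has columns $AB$ and $B$: the earlier of the two inputs passes through one extra state update before the new state is read out.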

Note that each iteration processes two samples. The matrix-vector product is 16 multiply-accumulates, so it amortizes to 8 per sample. That's more than the 5 multiply-accumulates per sample of the direct form, but the work is organized for extremely fast evaluation in SIMD, and on NEON in particular.


In [122]:
def biquadm(coefs, inp):
    b0, b1, b2, a1, a2 = coefs
    c1 = b1 - a1 * b0
    c2 = b2 - a2 * b0
    A = array([[b0, 0, 1, 0],
               [c1, b0, -a1, 1],
               [-a1 * c1 + c2, c1, -a2 + a1*a1, -a1],
               [-a2 * c1, c2, a1*a2, -a2]])
    out = np.zeros(len(inp))
    y = zeros(4)
    for i in range(0, len(inp), 2):
        y[0:2] = inp[i:i+2]
        y = dot(A, y)
        out[i:i+2] = y[0:2]
    return out
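As a sanity check on this construction, here's a small sketch (the check_blocks helper is mine, not part of the original notebook; it relies on the same %pylab namespace and the biquad reference above) that rebuilds the matrix and compares each 2x2 block against the piece it's claimed to be:

def check_blocks(coefs):
    b0, b1, b2, a1, a2 = coefs
    A2 = array([[-a1, 1], [-a2, 0]])          # 2x2 state evolution from bqmat
    B2 = array([b1 - a1 * b0, b2 - a2 * b0])  # input-to-state vector from bqmat
    c1, c2 = B2
    M = array([[b0, 0, 1, 0],
               [c1, b0, -a1, 1],
               [-a1 * c1 + c2, c1, -a2 + a1 * a1, -a1],
               [-a2 * c1, c2, a1 * a2, -a2]])
    h = biquad(coefs, array([1.0, 0.0]))      # first two impulse response samples
    # top left: banded matrix of the first two impulse response samples
    assert allclose(M[:2, :2], array([[h[0], 0], [h[1], h[0]]]))
    # top right: contribution of the old state to the two outputs
    assert allclose(M[:2, 2:], vstack(([1, 0], A2[0])))
    # bottom left: effect of the two inputs on the new state, columns A*B and B
    assert allclose(M[2:, :2], column_stack([dot(A2, B2), B2]))
    # bottom right: state evolution over two samples is A squared
    assert allclose(M[2:, 2:], dot(A2, A2))

check_blocks(coefs)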

Again, let's test it out.


In [123]:
plot(biquad(coefs, impulse))
plot(biquadm(coefs, impulse))


Out[123]:
[<matplotlib.lines.Line2D at 0x10d3b9510>]

The error is similar to that of the 2x2 matrix version, but not quite the same.


In [124]:
plot(biquadm(coefs, impulse) - biquad(coefs, impulse))
plot(biquadm(coefs, impulse) - bqmat(coefs, impulse))


Out[124]:
[<matplotlib.lines.Line2D at 0x10d70b550>]
