Filters

By Evgenia "Jenny" Nitishinskaya, Dr. Aidan O'Mahony, and Delaney Granizo-Mackenzie. Algorithms by David Edwards.

Kalman Filter Beta Estimation Example from Dr. Aidan O'Mahony's blog.

Part of the Quantopian Lecture Series:

Notebook released under the Creative Commons Attribution 4.0 License.



In [1]:
# Enable inline plotting
%matplotlib inline
# Import a Kalman filter and other useful libraries
from pykalman import KalmanFilter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy import poly1d

Toy example: falling ball

Imagine we have a falling ball whose motion we are tracking with a camera. The state of the ball consists of its position and velocity. We know that we have the relationship $x_t = x_{t-1} + v_{t-1}\tau - \frac{1}{2} g \tau^2$, where $\tau$ is the time (in seconds) elapsed between $t-1$ and $t$ and $g$ is gravitational acceleration. Meanwhile, our camera reports the position of the ball at every time step, but we know from the manufacturer that the camera accuracy, translated into the position of the ball, implies a variance in the position estimate of about 3 square meters.

In order to use a Kalman filter, we need to give it transition and observation matrices, transition and observation covariance matrices, and the initial state. The state of the system is (position, velocity), so it follows the transition matrix $$ \left( \begin{array}{cc} 1 & \tau \\ 0 & 1 \end{array} \right) $$

with offset $(-\tau^2 \cdot g/2, -\tau\cdot g)$. The observation matrix just extracts the position coordinate, (1 0), since we are measuring position. We know that the observation variance is 3, and the transition covariance is 0 since we will be simulating the data the same way we specified our model. For the initial state, let's feed our model something bogus like (30, 10) and see how our system evolves.
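As a quick sanity check on the transition matrix and offset described above, the following sketch steps a noise-free state forward and compares it with the closed-form free-fall trajectory. It assumes a ball released at rest from height 0 and uses $\tau = 0.1$ and $g = 9.8$ purely for illustration; the names `A`, `b`, and `positions` are ours, not part of pykalman.

```python
import numpy as np

tau = 0.1   # time step in seconds (illustrative value)
g = 9.8     # gravitational acceleration in m/s^2

# Transition matrix and offset for the state (position, velocity)
A = np.array([[1.0, tau],
              [0.0, 1.0]])
b = np.array([-0.5 * g * tau**2, -g * tau])

# Step the state forward with no noise, starting at rest at height 0
state = np.array([0.0, 0.0])
positions = [state[0]]
for _ in range(39):
    state = A @ state + b
    positions.append(state[0])
positions = np.array(positions)

# Closed-form free fall, x(t) = -g t^2 / 2, sampled at the same times
t = tau * np.arange(40)
closed_form = -0.5 * g * t**2

print(np.allclose(positions, closed_form))
```

Because the offset carries the $-\frac{1}{2} g \tau^2$ term, the discrete recursion reproduces the continuous trajectory at the sample times rather than merely approximating it.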


In [2]:
tau = 0.1

# Set up the filter
kf = KalmanFilter(n_dim_obs=1, n_dim_state=2, # position is 1-dimensional, (x,v) is 2-dimensional
                  initial_state_mean=[30,10],
                  initial_state_covariance=np.eye(2),
                  transition_matrices=[[1,tau], [0,1]],
                  observation_matrices=[[1,0]],
                  observation_covariance=3,
                  transition_covariance=np.zeros((2,2)),
                  transition_offsets=[-4.9*tau**2, -9.8*tau])

In [3]:
# Create a simulation of a ball falling for 40 units of time (each of length tau)
times = np.arange(40)
actual = -4.9*tau**2*times**2

# Simulate the noisy camera data (noise variance 3, matching observation_covariance)
sim = actual + np.sqrt(3)*np.random.randn(40)

# Run filter on camera data
state_means, state_covs = kf.filter(sim)

In [4]:
plt.plot(times, state_means[:,0])
plt.plot(times, sim)
plt.plot(times, actual)
plt.legend(['Filter estimate', 'Camera data', 'Actual'])
plt.xlabel('Time')
plt.ylabel('Height');



In [5]:
print(times)


[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]

In [6]:
print(state_means[:,0])


[ 22.77694155  19.31383258  16.07577534  13.65375221  11.8251647
   9.47676563   7.84310147   6.01726754   3.84437544   1.65625667
  -0.08105403  -1.55160746  -3.59383136  -5.60392698  -7.51079721
  -9.79454663 -11.28419802 -13.62575638 -16.09311713 -17.90308474
 -19.95469156 -21.95532882 -24.49134584 -26.88245639 -29.12614224
 -31.31219124 -34.25896255 -37.21886395 -39.99598751 -42.90109416
 -45.8496959  -48.96583521 -52.33790237 -55.16109531 -58.44104691
 -61.63663957 -65.52775601 -69.2596826  -72.73449013 -76.69023641]

At each point in time we plot the state estimate after accounting for the most recent measurement, which is why we are not at position 30 at time 0. The filter's attentiveness to the measurements allows it to correct for the initial bogus state we gave it. Then, by weighing its model and knowledge of the physical laws against new measurements, it is able to filter out much of the noise in the camera data. Meanwhile the confidence in the estimate increases with time, as shown by the graph below:


In [7]:
# Plot variances of x and v, extracting the appropriate values from the covariance matrix
plt.plot(times, state_covs[:,0,0])
plt.plot(times, state_covs[:,1,1])
plt.legend(['Var(x)', 'Var(v)'])
plt.ylabel('Variance')
plt.xlabel('Time');
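To see where these variances come from, here is a minimal hand-written Kalman filter in NumPy. It is a sketch under stated assumptions, not pykalman's implementation: the function name `kalman_filter` is hypothetical, the first observation is assumed to update the initial state directly (mirroring the remark above that the estimate at time 0 already reflects a measurement), and the simulated noise is given variance 3.

```python
import numpy as np

def kalman_filter(observations, A, b, H, Q, R, m0, P0):
    """Hand-rolled linear Kalman filter; returns filtered means and covariances."""
    m = np.asarray(m0, dtype=float)
    P = np.asarray(P0, dtype=float)
    means, covs = [], []
    for k, z in enumerate(observations):
        if k > 0:
            # Predict: push the previous estimate through the motion model
            m = A @ m + b
            P = A @ P @ A.T + Q
        # Update: blend the prediction with the new measurement
        S = H @ P @ H.T + R              # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)   # Kalman gain
        m = m + K @ (np.atleast_1d(z) - H @ m)
        P = (np.eye(len(m)) - K @ H) @ P
        means.append(m)
        covs.append(P)
    return np.array(means), np.array(covs)

tau = 0.1
A = np.array([[1.0, tau], [0.0, 1.0]])
b = np.array([-4.9 * tau**2, -9.8 * tau])
H = np.array([[1.0, 0.0]])
Q = np.zeros((2, 2))
R = np.array([[3.0]])

# Simulated camera data (assumed noise variance of 3)
rng = np.random.default_rng(0)
times = np.arange(40)
actual = -4.9 * tau**2 * times**2
sim = actual + np.sqrt(3.0) * rng.standard_normal(40)

means, covs = kalman_filter(sim, A, b, H, Q, R, m0=[30.0, 10.0], P0=np.eye(2))
print(covs[0, 0, 0], covs[-1, 0, 0])   # Var(x) after the first and last update
```

Note that the covariance recursion never touches the observations themselves, so the variance curves depend only on the model matrices and the initial covariance, not on the particular noise draw.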


The Kalman filter can also do smoothing, which takes in all of the input data at once and then constructs its best guess for the state of the system in each period post factum. That is, it does not provide online, running estimates, but instead uses all of the data to estimate the historical state, which is useful if we only want to use the data after we have collected all of it.


In [8]:
# Use smoothing to estimate what the state of the system has been
smoothed_state_means, _ = kf.smooth(sim)

# Plot results
plt.plot(times, smoothed_state_means[:,0])
plt.plot(times, sim)
plt.plot(times, actual)
plt.legend(['Smoothed estimate', 'Camera data', 'Actual'])
plt.xlabel('Time')
plt.ylabel('Height');


Example: moving average

Because the Kalman filter updates its estimates at every time step and tends to weigh recent observations more than older ones, a particularly useful application is estimation of rolling parameters of the data. When using a Kalman filter, there's no window length that we need to specify. This is useful for computing the moving average if that's what we are interested in, or for smoothing out estimates of other quantities. For instance, if we have already computed the moving Sharpe ratio, we can smooth it using a Kalman filter.

Below, we'll use both a Kalman filter and an n-day moving average to estimate the rolling mean of a dataset. We hope that the mean describes our observations well, so it shouldn't change too much when we add an observation; therefore, we assume that it evolves as a random walk with a small error term. The mean is the model's guess for the mean of the distribution from which measurements are drawn, so our prediction of the next value is simply equal to our estimate of the mean. We assume that the observations have variance 1 around the rolling mean, for lack of a better estimate. Our initial guess for the mean is the first observed value, 39.3; had we started from a poor guess, the filter would quickly correct it, as it did in the falling-ball example.
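For a one-dimensional random-walk model the filter reduces to a few lines of arithmetic, which makes it easier to see why no window length is needed. The sketch below is an illustration with hypothetical names (`kalman_rolling_mean`, `q`, `r`) and synthetic data, not the pykalman code used in this notebook. Once the gain settles to a constant, each update has the same form as an exponentially weighted moving average.

```python
import numpy as np

def kalman_rolling_mean(values, q=1.0, r=1.0, m0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk mean; returns estimates and gains."""
    m, p = m0, p0
    estimates, gains = [], []
    for k, z in enumerate(values):
        if k > 0:
            p = p + q             # predict: a random walk only adds uncertainty
        gain = p / (p + r)        # weight given to the new observation
        m = m + gain * (z - m)    # update the mean estimate
        p = (1.0 - gain) * p      # update its variance
        estimates.append(m)
        gains.append(gain)
    return np.array(estimates), np.array(gains)

# Synthetic data with true mean 5 (made up for this example)
rng = np.random.default_rng(1)
data = 5.0 + rng.standard_normal(200)
est, gains = kalman_rolling_mean(data, q=0.01, r=1.0)

# With q = r = 1 the gain settles at (sqrt(5) - 1) / 2, about 0.618
_, gains_unit = kalman_rolling_mean(np.zeros(50), q=1.0, r=1.0)
print(gains_unit[-1])
```

A larger `q` relative to `r` raises the steady-state gain, so the estimate tracks the data more closely; a smaller `q` gives a smoother, slower-moving mean.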


In [9]:
df = pd.read_csv("../data/ChungCheonDC/CompositeETCdata.csv")
df_DC = pd.read_csv("../data/ChungCheonDC/CompositeDCdata.csv")
df_DCstd = pd.read_csv("../data/ChungCheonDC/CompositeDCstddata.csv")

In [10]:
ax1 = plt.subplot(111)
ax1_1 = ax1.twinx()
df.plot(figsize=(12,3), x='date', y='reservoirH', ax=ax1_1, color='k', linestyle='-', lw=2)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xae015c0>

In [11]:
# Use the reservoir water level as the series to track
x = df.reservoirH
# Construct a Kalman filter
kf = KalmanFilter(transition_matrices = [1],
                  observation_matrices = [1],
                  initial_state_mean = 39.3,
                  initial_state_covariance = 1,
                  observation_covariance=1,
                  transition_covariance=1)

# Use the observed reservoir levels to get a rolling mean
state_means, _ = kf.filter(x.values)

# Compute the rolling mean with various lookback windows
# (pd.rolling_mean was removed from pandas; Series.rolling replaces it)
mean10 = x.rolling(6).mean()    # 6-day window; the name is kept because later cells reuse it
mean20 = x.rolling(20).mean()
mean30 = x.rolling(30).mean()

# Plot original data and estimated mean
plt.plot(state_means)
plt.plot(x, 'k.', ms=2)
plt.plot(mean10)
plt.plot(mean20)
plt.plot(mean30)
plt.title('Kalman filter estimate of average')
plt.legend(['Kalman Estimate', 'Reservoir H', '6-day Moving Average', '20-day Moving Average','30-day Moving Average'])
plt.xlabel('Day')
plt.ylabel('Reservoir Level');



In [12]:
plt.plot(state_means)
plt.plot(x)
plt.title('Kalman filter estimate of average')
plt.legend(['Kalman Estimate', 'Reservoir H'])
plt.xlabel('Day')
plt.ylabel('Reservoir Level');


This is a little hard to see, so we'll plot a subsection of the graph.


In [13]:
plt.plot(state_means[-400:])
plt.plot(x[-400:])
plt.plot(mean10[-400:])
plt.title('Kalman filter estimate of average')
plt.legend(['Kalman Estimate', 'Reservoir H', '6-day Moving Average'])
plt.xlabel('Day')
plt.ylabel('Reservoir Level');




In [14]:
# Use the upperH_med column as the series to track
xH = df.upperH_med
# Construct a Kalman filter
kf = KalmanFilter(transition_matrices = [1],
                  observation_matrices = [1],
                  initial_state_mean = 35.5,
                  initial_state_covariance = 1,
                  observation_covariance=1,
                  transition_covariance=.01)

# Use the observed values to get a rolling mean
state_means, _ = kf.filter(xH.values)

# Compute the rolling mean with various lookback windows
# (pd.rolling_mean was removed from pandas; Series.rolling replaces it)
mean10 = xH.rolling(10).mean()
mean20 = xH.rolling(20).mean()
mean30 = xH.rolling(30).mean()

# Plot original data and estimated mean
plt.plot(state_means)
plt.plot(xH)
plt.plot(mean10)
plt.plot(mean20)
plt.plot(mean30)
plt.title('Kalman filter estimate of average')
# plt.legend(['Kalman Estimate', 'upperH_med', '10-day Moving Average', '20-day Moving Average','30-day Moving Average'])
plt.xlabel('Day')
plt.ylabel('upperH_med');



In [15]:
txrxID =  df_DC.keys()[1:-1]
xmasking = lambda x: np.ma.masked_where(np.isnan(x.values), x.values)
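The masking helper above matters because a NaN fed into a filter update would contaminate every later estimate. As we understand pykalman's documentation, masked entries are treated as missing observations, so the filter predicts through them without updating. The sketch below illustrates that idea with a scalar random-walk filter; the readings are made up and `filter_with_gaps` is a hypothetical helper, not a pykalman function.

```python
import numpy as np

# Made-up readings with two gaps
raw = np.array([67.6, np.nan, 67.9, 68.1, np.nan, 68.0])
masked = np.ma.masked_invalid(raw)   # masks the NaN entries

def filter_with_gaps(values, q=1.0, r=1.0, m0=0.0, p0=1.0):
    """Scalar random-walk Kalman filter that skips the update on masked samples."""
    data = np.ma.getdata(values)
    missing = np.ma.getmaskarray(values)
    m, p = m0, p0
    estimates = []
    for k in range(len(data)):
        if k > 0:
            p = p + q                    # predict step always runs
        if not missing[k]:
            gain = p / (p + r)
            m = m + gain * (data[k] - m)
            p = (1.0 - gain) * p         # update only when a reading exists
        estimates.append(m)
    return np.array(estimates)

est = filter_with_gaps(masked, m0=67.6)
print(est)
```

During a gap the estimate stays where it was while its variance grows, so the first reading after the gap receives more weight than it otherwise would.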

In [16]:
x = df_DC[txrxID[2]]
# Rolling statistics for comparison with the Kalman estimate
# (pd.rolling_median and pd.rolling_max were removed from pandas)
median10 = x.rolling(6).median()   # 6-sample rolling median
max10 = x.rolling(10).max()        # 10-sample rolling maximum
x1 = median10
x2 = max10
# Mask the NaN entries so the filter treats them as missing
xm = xmasking(x)
# Construct a Kalman filter
kf = KalmanFilter(transition_matrices = [1],
                  observation_matrices = [1],
                  initial_state_mean = 67.6,
                  initial_state_covariance = 1,
                  observation_covariance=1,
                  transition_covariance=1)
# Use the observed values to get a rolling mean
state_means, _ = kf.filter(xm)

In [17]:
plt.plot(x)
plt.plot(x1)
plt.plot(x2)
plt.plot(state_means)
plt.legend(['Original x', 'Rolling median x1', 'Rolling max x2', 'Kalman Estimate'])


Out[17]:
<matplotlib.legend.Legend at 0xb198b00>

In [18]:
plt.plot(x)
plt.plot(state_means)


Out[18]:
[<matplotlib.lines.Line2D at 0xb617d30>]

In [19]:
# Mask the NaN entries so the filter treats them as missing
upperH_med = xmasking(df.upperH_med)
state_means, _ = kf.filter(upperH_med)

plt.plot(df.upperH_med)
plt.plot(state_means)

plt.title('Kalman filter estimate of average')
plt.legend(['upperH_med', 'Kalman Estimate'])
plt.xlabel('Day')
plt.ylabel('upperH_med');



In [20]:
# Import libraries
%matplotlib inline
import pandas as pd
import sys

In [21]:
import matplotlib.pyplot as plt
import numpy as np
import scipy as sc

plt.style.use('ggplot')
np.random.seed(20)

In [30]:
x= df.reservoirH

In [31]:
print(x)


0      39.30
1      39.34
2      39.34
3      39.34
4      39.42
5      39.47
6      39.53
7      39.56
8      39.59
9      39.59
10     39.59
11     39.68
12     39.70
13     39.72
14     39.74
15     39.76
16     39.76
17     39.76
18     39.82
19     39.84
20     39.86
21     39.89
22     39.92
23     39.92
24     39.92
25     39.97
26     40.00
27     40.02
28     40.04
29     40.06
       ...  
335    33.88
336    33.99
337    34.09
338    34.18
339    34.24
340    34.29
341    34.30
342    34.40
343    34.43
344    34.50
345    34.50
346    34.57
347    34.60
348    34.60
349    34.68
350    34.80
351    34.80
352    34.86
353    34.90
354    35.00
355    35.00
356    35.00
357    35.10
358    35.10
359    35.10
360    35.20
361    35.20
362    35.20
363    35.30
364    35.28
Name: reservoirH, dtype: float64

In [43]:
#-------------------------------------------------------------------------------
# Set up

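# Time axis: one point per sample of the reservoir series
t = np.linspace(0,1,len(x))

# Frequencies in the synthetic test signal (unused while y is the reservoir series)
f1 = 20
f2 = 30

# Some random noise to add to the synthetic signal
noise = np.random.random_sample(len(t))

# Complete signal: the reservoir series (the synthetic alternative is commented out)
y = x #2*np.sin(2*np.pi*f1*t+0.2) + 3*np.cos(2*np.pi*f2*t+0.3) + noise*5

# The part of the signal we want to isolate
y1 = x #2*np.sin(2*np.pi*f1*t+0.2)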


In [45]:
# FFT of the signal (scipy.fft is a module in current SciPy, so use NumPy's FFT)
F = np.fft.fft(np.asarray(y))

# Other specs
N = len(y)                              # number of samples
dt = 0.001                              # inter-sample time difference
w = np.fft.fftfreq(N, dt)               # list of frequencies for the FFT
pFrequency = np.where(w>=0)[0]          # indices of the non-negative frequencies
magnitudeF = abs(F[:len(pFrequency)])   # magnitude of F for the non-negative frequencies

#-------------------------------------------------------------------------------
# Some functions we will need
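The ordering of the FFT output is easy to get wrong, so here is a small illustration of what `np.fft.fftfreq` returns. The sample count and spacing are chosen only to make the numbers round; they are not the values used in this notebook.

```python
import numpy as np

n = 8          # number of samples (illustrative)
dt = 0.125     # spacing between samples in seconds, so the record lasts 1 s

# Frequency in Hz associated with each FFT bin
freqs = np.fft.fftfreq(n, d=dt)
print(freqs)   # 0, 1, 2, 3, then -4, -3, -2, -1

# Indices of the non-negative frequencies: the first half of the array
positive = np.where(freqs >= 0)[0]
print(positive)
```

The non-negative frequencies occupy the leading bins, which is why slicing the FFT with `[:len(positive)]` picks out their magnitudes. Note that the bin index equals the frequency in Hz only when the record lasts exactly one second.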

In [46]:
# Plots the FFT
def pltfft():
    plt.plot(pFrequency,magnitudeF)
    plt.xlabel('Hz')
    plt.ylabel('Magnitude')
    plt.title('FFT of the full signal')
    plt.grid(True)
    plt.show()

# Plots the full signal
def pltCompleteSignal():
    plt.plot(t,y,'b')
    plt.xlabel('Time (s)')
    plt.ylabel('Amplitude')
    plt.title('Full signal')
    plt.grid(True)
    plt.show()

In [47]:
# Filter function:
# zeroes every FFT bin with index >= fmax or <= fmin and returns the cleaned FT
# (note: FT is modified in place)
def blockHigherFreq(FT,fmin,fmax,plot=False):
    for i in range(len(FT)):
        if (i>= fmax) or (i<=fmin):
            FT[i] = 0
    if plot:
        plt.plot(pFrequency,abs(FT[:len(pFrequency)]))
        plt.xlabel('Hz')
        plt.ylabel('Magnitude')
        plt.title('Cleaned FFT')
        plt.grid(True)
        plt.show()
    return FT

# Normalising function (scales the signal so that its maximum is 1)
def normalise(signal):
    M = max(signal)
    normalised = signal/M
    return normalised
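To check that this kind of frequency-domain filtering can isolate one component, the sketch below applies a band-pass mask to a synthetic two-tone signal modelled on the commented-out test signal (a 20 Hz sine plus a 30 Hz cosine, without noise). It is an alternative formulation rather than the function above: it uses `np.fft.rfft` with a mask expressed in Hz instead of bin indices, and the sampling rate `fs` is an assumption chosen so that both tones fall exactly on FFT bins.

```python
import numpy as np

fs = 1000                      # assumed sampling rate in Hz
t = np.arange(fs) / fs         # one second of samples

# Component we want to recover, plus an unwanted 30 Hz component
target = 2 * np.sin(2 * np.pi * 20 * t + 0.2)
signal = target + 3 * np.cos(2 * np.pi * 30 * t + 0.3)

# Real-input FFT and the frequency (in Hz) of each bin
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# Keep only the bins between 18 Hz and 22 Hz, zero everything else
keep = (freqs >= 18) & (freqs <= 22)
cleaned = np.fft.irfft(np.where(keep, spectrum, 0), n=len(signal))

print(np.max(np.abs(cleaned - target)))   # largest reconstruction error
```

Working with `rfft` and `irfft` keeps the result real-valued, so there is no imaginary part to discard after the inverse transform. With noise or tones that fall between bins, the recovery would only be approximate.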

In [49]:
plt.plot(y)
#plt.plot(y1)


Out[49]:
[<matplotlib.lines.Line2D at 0xbfa5da0>]

In [50]:
#-------------------------------------------------------------------------------
# Processing

# Rebuild the time axis so it has one point per sample of y
t = np.linspace(0,1,len(y))

# Cleaning the FT by keeping only the FFT bins strictly between 18 and 22
newFT = blockHigherFreq(F,18,22)

# Getting back the cleaned signal (real part; the imaginary part is discarded)
cleanedSignal = np.real(np.fft.ifft(newFT))

# Error
error = normalise(y1) - normalise(cleanedSignal)

#-------------------------------------------------------------------------------
# Plot the findings

#pltCompleteSignal()         #Plot the full signal
#pltfft()                    #Plot fft

plt.figure()

plt.subplot(3,1,1)          #Subplot 1
plt.title('Original signal')
plt.plot(t,y,'g')

plt.subplot(3,1,2)          #Subplot 2
plt.plot(t,normalise(cleanedSignal),label='Cleaned signal',color='b')
plt.plot(t,normalise(y1),label='Signal to find',ls='-',color='r')
plt.title('Cleaned signal and signal to find')
plt.legend()

plt.subplot(3,1,3)          #Subplot 3
plt.plot(t,error,color='r',label='error')
plt.show()
