Let's admit it, bayesian modeling on time series is slow. In pymc3, it typically implies using theano scan operation. Here, we will show how to profile one step of the kalman filter, as well as the scan operation over the time series.

First, load the required packages:


In [1]:
import numpy              as np
import theano
import theano.tensor      as tt

import kalman

We will use the same data as in the 01_RandomWalkPlusObservation notebook.


In [2]:
# True values
T = 500                 # Time steps
sigma2_eps0 = 3         # Variance of the observation noise
sigma2_eta0 = 10        # Variance in the update of the mean

# Simulate data
np.random.seed(12345)
eps = np.random.normal(scale=sigma2_eps0**0.5, size=T)
eta = np.random.normal(scale=sigma2_eta0**0.5, size=T)
mu = np.cumsum(eta)
y = mu + eps

Next, we create all the tensors required to describe our model:


In [3]:
# Upon using pymc3, the following theano configuration flag is changed,
# leading to tensors being required to have test values
#theano.config.compute_test_value = 'ignore'

# Tensors for the measurement equation
Z = tt.dmatrix(name='Z')
d = tt.dvector(name='d')
H = tt.dmatrix(name='H')

# Tensors for the transition equation
T = tt.dmatrix(name='T')
c = tt.dvector(name='c')
R = tt.dmatrix(name='R')
Q = tt.dmatrix(name='Q')

# Initial position and uncertainty
a0 = tt.dvector(name='a0')
P0 = tt.dmatrix(name='P0')

We will also create some actual values for them:


In [4]:
ɛ_σ2 = 3.
η_σ2 = 10.

args = dict(Z = np.array([[1.]]),
            d = np.array([0.]),
            H = np.array([[ɛ_σ2]]),
            T = np.array([[1.]]),
            c = np.array([0.]),
            R = np.array([[1.]]),
            Q = np.array([[η_σ2]]),
            a0 = np.array([0.]),
            P0 = np.array([[1e6]]))

Let's calculate the likelihood of the observed values, given the parameters above:


In [5]:
kalmanTheano = kalman.KalmanTheano(Z, d, H, T, c, R, Q, a0, P0)
(at, Pt, lliks), updates = kalmanTheano.filter(y[:,None])

f = theano.function([Z, d, H, T, c, R, Q, a0, P0], lliks)

llik = f(**args)
llik[1:].sum()


Out[5]:
-1369.7346722999789

Time required for the log-likelihood calculation:


In [6]:
print('Measuring time...')
%timeit f(**args)


Measuring time...
12.4 ms ± 96 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Profiling a non-scan operation is relatively simple. As an example, let's create a function to calculate the first time step of the Kalman filter:


In [7]:
Y0 = tt.dvector(name='Y0')
_,_,llik = kalman.core._oneStep(Y0, Z, d, H, T, c, R, Q, a0, P0)

profiler = theano.compile.ScanProfileStats()
f = theano.function([Y0, Z, d, H, T, c, R, Q, a0, P0], llik, profile=profiler)

f(y[0,None], **args);

profiler.summary()


Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  33.0%    33.0%       0.000s       4.05e-06s     C        4       4   theano.tensor.blas_c.CGemv
  32.5%    65.5%       0.000s       3.99e-06s     C        4       4   theano.tensor.blas.Dot22
  14.1%    79.6%       0.000s       2.30e-06s     C        3       3   theano.tensor.elemwise.Elemwise
  12.6%    92.2%       0.000s       3.10e-06s     C        2       2   theano.tensor.blas.Gemm
   3.9%    96.1%       0.000s       3.81e-07s     C        5       5   theano.tensor.elemwise.DimShuffle
   3.9%   100.0%       0.000s       9.54e-07s     C        2       2   theano.tensor.basic.AllocEmpty
   0.0%   100.0%       0.000s       0.00e+00s     C        1       1   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  32.5%    32.5%       0.000s       3.99e-06s     C        4        4   Dot22
  26.7%    59.2%       0.000s       6.56e-06s     C        2        2   CGemv{no_inplace}
   7.8%    67.0%       0.000s       3.81e-06s     C        1        1   Elemwise{Composite{(i0 * (i1 + log(i2) + i3))}}[(0, 2)]
   6.3%    73.3%       0.000s       1.55e-06s     C        2        2   CGemv{inplace}
   6.3%    79.6%       0.000s       3.10e-06s     C        1        1   Gemm{inplace}
   6.3%    85.9%       0.000s       3.10e-06s     C        1        1   Gemm{no_inplace}
   4.4%    90.3%       0.000s       2.15e-06s     C        1        1   Elemwise{inv,no_inplace}
   3.9%    94.2%       0.000s       9.54e-07s     C        2        2   AllocEmpty{dtype='float64'}
   1.9%    96.1%       0.000s       3.18e-07s     C        3        3   InplaceDimShuffle{1,0}
   1.9%    98.1%       0.000s       9.54e-07s     C        1        1   Elemwise{Composite{((-i0) + i1)}}[(0, 1)]
   1.9%   100.0%       0.000s       9.54e-07s     C        1        1   InplaceDimShuffle{x,0}
   0.0%   100.0%       0.000s       0.00e+00s     C        1        1   Shape_i{0}
   0.0%   100.0%       0.000s       0.00e+00s     C        1        1   InplaceDimShuffle{x,0}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  24.8%    24.8%       0.000s       1.22e-05s      1     5   Dot22(T, P0)
  18.4%    43.2%       0.000s       9.06e-06s      1     6   CGemv{no_inplace}(c, TensorConstant{1.0}, T, a0, TensorConstant{1.0})
   8.3%    51.5%       0.000s       4.05e-06s      1    10   CGemv{no_inplace}(Y0, TensorConstant{-1.0}, Z, CGemv{no_inplace}.0, TensorConstant{1.0})
   7.8%    59.2%       0.000s       3.81e-06s      1    20   Elemwise{Composite{(i0 * (i1 + log(i2) + i3))}}[(0, 2)](TensorConstant{(1, 1) of -0.5}, TensorConstant{(1, 1) of ..3787706641}, Gemm{no_inplace}.0, InplaceDimShuffle{x,0}.0)
   6.3%    65.5%       0.000s       3.10e-06s      1    14   Gemm{no_inplace}(H, TensorConstant{1.0}, Dot22.0, Z.T, TensorConstant{1.0})
   6.3%    71.8%       0.000s       3.10e-06s      1    11   Gemm{inplace}(Dot22.0, TensorConstant{1.0}, Dot22.0, R.T, TensorConstant{1.0})
   4.4%    76.2%       0.000s       2.15e-06s      1    16   CGemv{inplace}(AllocEmpty{dtype='float64'}.0, TensorConstant{1.0}, Elemwise{inv,no_inplace}.0, Elemwise{Composite{((-i0) + i1)}}[(0, 1)].0, TensorConstant{0.0})
   4.4%    80.6%       0.000s       2.15e-06s      1    15   Elemwise{inv,no_inplace}(Gemm{no_inplace}.0)
   3.9%    84.5%       0.000s       1.91e-06s      1     9   Dot22(Dot22.0, T.T)
   1.9%    86.4%       0.000s       9.54e-07s      1    18   CGemv{inplace}(AllocEmpty{dtype='float64'}.0, TensorConstant{1.0}, InplaceDimShuffle{x,0}.0, Elemwise{Composite{((-i0) + i1)}}[(0, 1)].0, TensorConstant{0.0})
   1.9%    88.3%       0.000s       9.54e-07s      1    17   InplaceDimShuffle{x,0}(CGemv{inplace}.0)
   1.9%    90.3%       0.000s       9.54e-07s      1    13   Dot22(Z, Gemm{inplace}.0)
   1.9%    92.2%       0.000s       9.54e-07s      1    12   Elemwise{Composite{((-i0) + i1)}}[(0, 1)](d, CGemv{no_inplace}.0)
   1.9%    94.2%       0.000s       9.54e-07s      1     8   AllocEmpty{dtype='float64'}(Shape_i{0}.0)
   1.9%    96.1%       0.000s       9.54e-07s      1     7   AllocEmpty{dtype='float64'}(TensorConstant{1})
   1.9%    98.1%       0.000s       9.54e-07s      1     4   InplaceDimShuffle{1,0}(T)
   1.9%   100.0%       0.000s       9.54e-07s      1     3   Dot22(R, Q)
   0.0%   100.0%       0.000s       0.00e+00s      1    19   InplaceDimShuffle{x,0}(CGemv{inplace}.0)
   0.0%   100.0%       0.000s       0.00e+00s      1     2   InplaceDimShuffle{1,0}(R)
   0.0%   100.0%       0.000s       0.00e+00s      1     1   Shape_i{0}(Z)
   ... (remaining 1 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.

Repeating the procedure with a scan procedure, we can see that the code inside it is not profiled. It took me a while to make it work (not even stackoverflow helped!!!). In the end, this is how I made it work:


In [8]:
profiler = theano.compile.ScanProfileStats()
(_,_,llik),_ = kalmanTheano.filter(y[:,None], profile=profiler)

f = theano.function([Z, d, H, T, c, R, Q, a0, P0], llik, profile=profiler)

f(**args);

# Select the node corresponding to the scan operation
scan_op = next(k for k in profiler.op_nodes()
                     if isinstance(k, theano.scan_module.scan_op.Scan))
scan_op.profile.summary()


Scan Op profiling
==================
  Message: None
  Time in 1 calls of the op (for a total of 500 steps) 1.468205e-02s

  Total time spent in calling the VM 1.024890e-02s (69.806%)
  Total overhead (computing slices..) 4.433155e-03s (30.194%)

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  36.3%    36.3%       0.003s       1.01e-06s     C     2500       5   theano.tensor.blas_c.CGemv
  28.5%    64.8%       0.002s       6.62e-07s     C     3000       6   theano.tensor.blas.Dot22
  13.7%    78.5%       0.001s       4.77e-07s     C     2000       4   theano.tensor.elemwise.Elemwise
   9.1%    87.6%       0.001s       6.37e-07s     C     1000       2   theano.tensor.blas.Gemm
   9.0%    96.6%       0.001s       6.29e-07s     C     1000       2   theano.tensor.elemwise.DimShuffle
   2.4%    99.1%       0.000s       1.70e-07s     C     1000       2   theano.tensor.basic.AllocEmpty
   0.9%   100.0%       0.000s       1.27e-07s     C      500       1   theano.compile.ops.Shape_i
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  28.5%    28.5%       0.002s       6.62e-07s     C     3000        6   Dot22
  28.0%    56.5%       0.002s       1.30e-06s     C     1500        3   CGemv{no_inplace}
   9.1%    65.6%       0.001s       6.37e-07s     C     1000        2   Gemm{no_inplace}
   8.3%    73.9%       0.001s       5.77e-07s     C     1000        2   CGemv{inplace}
   6.4%    80.3%       0.000s       8.96e-07s     C      500        1   InplaceDimShuffle{x,0}
   4.5%    84.9%       0.000s       6.30e-07s     C      500        1   Elemwise{sub,no_inplace}
   4.3%    89.2%       0.000s       6.05e-07s     C      500        1   Elemwise{Composite{(i0 * (i1 + log(i2) + i3))}}
   2.8%    92.1%       0.000s       3.95e-07s     C      500        1   Elemwise{inv,no_inplace}
   2.6%    94.6%       0.000s       3.61e-07s     C      500        1   InplaceDimShuffle{x,0}
   2.4%    97.1%       0.000s       1.70e-07s     C     1000        2   AllocEmpty{dtype='float64'}
   2.0%    99.1%       0.000s       2.78e-07s     C      500        1   Elemwise{Add}[(0, 1)]
   0.9%   100.0%       0.000s       1.27e-07s     C      500        1   Shape_i{0}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  11.8%    11.8%       0.001s       1.64e-06s    500     0   CGemv{no_inplace}(c_copy, TensorConstant{1.0}, T_copy, a0[t-1], TensorConstant{1.0})
   9.1%    20.9%       0.001s       1.27e-06s    500     2   Dot22(T_copy, P0[t-1])
   8.3%    29.2%       0.001s       1.16e-06s    500    16   CGemv{no_inplace}(CGemv{no_inplace}.0, TensorConstant{1.0}, Dot22.0, Elemwise{Add}[(0, 1)].0, TensorConstant{1.0})
   7.9%    37.1%       0.001s       1.10e-06s    500     4   CGemv{no_inplace}(d_copy, TensorConstant{-1.0}, Z_copy, CGemv{no_inplace}.0, TensorConstant{-1.0})
   6.4%    43.5%       0.000s       8.96e-07s    500    14   InplaceDimShuffle{x,0}(CGemv{inplace}.0)
   6.2%    49.7%       0.000s       8.63e-07s    500    15   Dot22(Dot22.0, Z_copy)
   5.4%    55.1%       0.000s       7.56e-07s    500     6   Gemm{no_inplace}(<TensorType(float64, matrix)>, TensorConstant{1.0}, Dot22.0, T.T_replace, TensorConstant{1.0})
   4.9%    60.0%       0.000s       6.82e-07s    500    12   CGemv{inplace}(AllocEmpty{dtype='float64'}.0, TensorConstant{1.0}, Elemwise{inv,no_inplace}.0, Elemwise{Add}[(0, 1)].0, TensorConstant{0.0})
   4.5%    64.6%       0.000s       6.30e-07s    500    20   Elemwise{sub,no_inplace}(Gemm{no_inplace}.0, Dot22.0)
   4.3%    68.9%       0.000s       6.05e-07s    500    21   Elemwise{Composite{(i0 * (i1 + log(i2) + i3))}}(TensorConstant{(1, 1) of -0.5}, TensorConstant{(1, 1) of ..3787706641}, Gemm{no_inplace}.0, InplaceDimShuffle{x,0}.0)
   3.9%    72.8%       0.000s       5.45e-07s    500    13   Dot22(Dot22.0, Elemwise{inv,no_inplace}.0)
   3.7%    76.5%       0.000s       5.18e-07s    500    10   Gemm{no_inplace}(H_copy, TensorConstant{1.0}, Dot22.0, Z.T_replace, TensorConstant{1.0})
   3.4%    79.9%       0.000s       4.73e-07s    500    17   CGemv{inplace}(AllocEmpty{dtype='float64'}.0, TensorConstant{1.0}, InplaceDimShuffle{x,0}.0, Elemwise{Add}[(0, 1)].0, TensorConstant{0.0})
   3.3%    83.2%       0.000s       4.55e-07s    500     9   Dot22(Gemm{no_inplace}.0, Z.T_replace)
   3.2%    86.4%       0.000s       4.41e-07s    500     8   Dot22(Z_copy, Gemm{no_inplace}.0)
   2.9%    89.2%       0.000s       3.99e-07s    500    18   Dot22(Dot22.0, Gemm{no_inplace}.0)
   2.8%    92.1%       0.000s       3.95e-07s    500    11   Elemwise{inv,no_inplace}(Gemm{no_inplace}.0)
   2.6%    94.6%       0.000s       3.61e-07s    500    19   InplaceDimShuffle{x,0}(CGemv{inplace}.0)
   2.0%    96.6%       0.000s       2.78e-07s    500     7   Elemwise{Add}[(0, 1)](Y[t], CGemv{no_inplace}.0)
   1.3%    98.0%       0.000s       1.87e-07s    500     5   AllocEmpty{dtype='float64'}(Shape_i{0}.0)
   ... (remaining 2 Apply instances account for 2.01%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.