Expectation Maximization for Linear Dynamical System Identification

In this notebook, we present a high-level summary of the paper An M-Estimator for Reduced-Rank High-Dimensional Linear Dynamical System Identification, as well as a univariate implementation of the algorithm in Python.

High level summary of solution

  • Generalization of the classical Kalman Filter-Smoother Expectation-Maximization algorithm
  • Goal: fit a statistical model to time series data

Motivation

  • There is an abundance of high-dimensional time series data to which we want to fit statistical models
  • Existing methods cannot cope with the high-dimensional nature of these problems, for both computational and statistical reasons
  • Parameters for Kalman filters are often unknown, which creates an opportunity to use unsupervised learning
  • 3 main issues: dense high-dimensional matrices, numerically unstable estimators, and computation time

The model

Observed variables: $Y=(y_1,\dots,y_T)$

Latent variables: $X=(x_1,\dots,x_T)$

$$P(Y|X) = \prod_{t=1}^T P(y_t | y_{0:t-1}, x_{0:t})$$$$P(X) = P(x_0) \prod_{t=1}^T P(x_t | x_{0:t-1})$$

Time invariant state space model makes the following simplifications:

$$P(y_t | y_{0:t-1}, x_{0:t}) \approx P(y_t | x_t)$$$$P(x_t | x_{0:t-1}) \approx P(x_t | x_{t-1})$$

Linear dynamical systems (resembles Kalman filters):

$$ x_{t+1}= Ax_t+\mathbf{w}_t, \quad \mathbf{w}_t\sim N(\mathbf{0},Q),\quad x_0 \sim N(\mathbf{\pi}_0,V_0)$$$$y_t=Cx_t+\mathbf{v}_t,\qquad \mathbf{v}_t\sim N(\mathbf{0},R)$$

Notation:

$A \in \mathbb{R^{d\times d}}$ is state transformation matrix

$C \in \mathbb{R^{p\times d}}$ is generative matrix

$x_t \in \mathbb{R^{d}}$

$y_t \in \mathbb{R^{p}}$

$R \in \mathbb{R^{p \times p}}$ is output noise covariance matrix

$Q \in \mathbb{R^{d \times d}}$ is state noise covariance matrix

$\pi_0 \in \mathbb{R^{d}}$ is initial state mean

$V_0 \in \mathbb{R^{d \times d}}$ is initial state covariance

The model has the following constraints to make the system identifiable and the model useful:

  1. $Q = I_{d \times d}$
  2. ordering of columns of $C$ is based on their norms
  3. $V_0 = 0_{d \times d}$
  4. $R$ is diagonal matrix
  5. $A$ is sparse
  6. $C$ has smooth columns (what does this actually mean intuitively aside from the $\ell_2$ norm?)

With these constraints, the model simplifies further to:

$$ x_{t+1}= Ax_t+\mathbf{w}_t, \quad \mathbf{w}_t\sim N(\mathbf{0},I),\quad x_0 = \mathbf{\pi}_0$$$$y_t=Cx_t+\mathbf{v}_t,\qquad \mathbf{v}_t\sim N(\mathbf{0},R)$$
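To make the generative process concrete, here is a minimal simulation sketch of this simplified model in the univariate case ($d = p = 1$); the function name `simulate_lds` and the specific parameter values are illustrative choices on my part, not from the paper:

```python
import numpy as np

def simulate_lds(a, c, r, pi0, T, seed=0):
    """Simulate the simplified univariate model:
        x_t = a * x_{t-1} + w_t,  w_t ~ N(0, 1)   (Q = I reduces to unit variance)
        y_t = c * x_t     + v_t,  v_t ~ N(0, r)
    with a deterministic initial state x_0 = pi0 (V_0 = 0)."""
    rng = np.random.default_rng(seed)
    x = np.empty(T + 1)
    y = np.empty(T)
    x[0] = pi0
    for t in range(1, T + 1):
        x[t] = a * x[t - 1] + rng.normal(0.0, 1.0)
        y[t - 1] = c * x[t] + rng.normal(0.0, np.sqrt(r))
    return x, y

# illustrative parameter values only
x_true, y_obs = simulate_lds(a=0.9, c=1.5, r=0.25, pi0=0.0, T=200)
```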

The main optimization problem

Let $\theta =\{A,C,R,\mathbf{\pi}_0\}$ represent all unknown parameters, and $P(X,Y)$ be the likelihood for a generic LDS model. We arrive at the optimization problem:

$$\hat{\theta}= \underset{\theta}{\operatorname{argmin}}\left\{-\log P_\theta(X,Y)+\lambda_1\|A\|_1+\lambda_2\|C\|_2^2\right\}$$

where the lambdas are tuning parameters. This can be equivalently stated as:

$$\underset{\theta}{\text{minimize}} \left\{-\log P_\theta(X,Y)\right\}$$$$ \text{subject to}\quad \alpha\|A\|_1+ (1-\alpha)\|C\|_2^2 \leq t \text{ for some }t; $$$$ A\in \mathcal{A}_{d\times d},\ C \in \mathcal{C}_{p \times d},\ R \in \mathcal{R}_{p\times p},\ \pi_0 \in \Pi_{d\times 1}.$$

Parameter Estimation

Main idea: we can use the EM algorithm to maximize likelihood function of observed data.

The likelihood in the model is: $$ P(X,Y)=\prod\limits_{t=1}^{T}P(x_t|x_{t-1})\prod\limits_{t=1}^{T} P(y_t|x_t)\mathbb{1}_{\mathbf{\pi}_0}(x_0) $$

where $\mathbb{1}_{\mathbf{\pi}_0}(x_0)$ is the indicator function and conditional likelihoods are:

$$P(y_t|x_t)= (2\pi)^{-\frac{p}{2}}|R|^{-\frac{1}{2}}\ \text{exp}\left\{-\frac{1}{2}[y_t-Cx_t]^{T}R^{-1}[y_t-Cx_t]\right\}$$$$P(x_t|x_{t-1}) =\text{exp}\left\{-\frac{1}{2}[\mathbf{x_t}-A\mathbf{x_{t-1}}]^{T}Q^{-1}[\mathbf{x_t}-A\mathbf{x_{t-1}}]\right\}(2\pi)^{-d/2}|Q|^{-1/2} =(2\pi)^{-\frac{d}{2}}\ \text{exp}\left\{-\frac{1}{2}[x_t-Ax_{t-1}]^{T}[x_t-Ax_{t-1}]\right\}$$

which allows us to simplify the parameter estimation further to:

$$\hat{\theta}= \underset{\theta}{\operatorname{argmin}}\biggl\{\sum\limits_{t=1}^{T}\big(\frac{1}{2}[y_t-Cx_t]^{T}R^{-1}[y_t-Cx_t]\big)+\frac{T}{2}\log|R|\\ +\sum\limits_{t=1}^{T}\big(\frac{1}{2}[x_t-Ax_{t-1}]^{T}[x_t-Ax_{t-1}]\big)+\frac{T}{2}\log|\mathbf{I}| - \log(\mathbb{1}_{\mathbf{\pi}_0}(x_0))\\ +\lambda_1\|A\|_1+\lambda_2\|C\|_2^2\biggr\}.$$

Let $\mathbf{\Phi}(\theta,Y,X)$ denote this objective function in the curly brackets.
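For the univariate case, a direct translation of $\mathbf{\Phi}$ into code might look like the sketch below; `phi_objective` is my own helper name, and the $(2\pi)$ constants and the $\log|\mathbf{I}| = 0$ term are dropped:

```python
import numpy as np

def phi_objective(a, c, r, x, y, pi0, lam1, lam2):
    """Univariate Phi(theta, Y, X): penalized negative log-likelihood,
    dropping additive constants.  x = [x_0, x_1, ..., x_T], y = [y_1, ..., y_T]."""
    assert x[0] == pi0, "indicator term: x_0 must equal pi_0"
    T = len(y)
    obs = 0.5 * np.sum((y - c * x[1:]) ** 2) / r + 0.5 * T * np.log(r)  # observation terms
    state = 0.5 * np.sum((x[1:] - a * x[:-1]) ** 2)                     # state transition terms
    return obs + state + lam1 * abs(a) + lam2 * c ** 2                  # penalties
```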

E and M steps

E step

Calculate the expected value of the complete-data log-likelihood, $\Gamma = E[\log P(X,Y) \mid Y]$ (i.e. exactly the same as the regular E step in EM).
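In practice this conditional expectation is computed with a forward Kalman filter followed by a backward (Rauch-Tung-Striebel) smoothing pass. Below is a minimal univariate filter-smoother sketch (standard textbook recursions, not code from the paper) that returns the smoothed means $\hat{x}_t = E[x_t \mid Y]$ and variances used in the M step:

```python
import numpy as np

def kalman_smoother(y, a, c, r, pi0):
    """Univariate Kalman filter + RTS smoother.  State noise variance is fixed
    at Q = 1 and x_0 = pi0 with V_0 = 0, matching the model constraints above.
    Returns smoothed means E[x_t | Y] and variances Var[x_t | Y] for t = 0..T."""
    T = len(y)
    x_pred = np.zeros(T + 1); P_pred = np.zeros(T + 1)   # one-step predictions
    x_filt = np.zeros(T + 1); P_filt = np.zeros(T + 1)   # filtered estimates
    x_filt[0], P_filt[0] = pi0, 0.0                      # deterministic x_0

    # forward pass (filter)
    for t in range(1, T + 1):
        x_pred[t] = a * x_filt[t - 1]
        P_pred[t] = a ** 2 * P_filt[t - 1] + 1.0         # Q = 1
        K = P_pred[t] * c / (c ** 2 * P_pred[t] + r)     # Kalman gain
        x_filt[t] = x_pred[t] + K * (y[t - 1] - c * x_pred[t])
        P_filt[t] = (1.0 - K * c) * P_pred[t]

    # backward pass (RTS smoother)
    x_smooth = x_filt.copy(); P_smooth = P_filt.copy()
    for t in range(T - 1, -1, -1):
        J = P_filt[t] * a / P_pred[t + 1]                # smoother gain
        x_smooth[t] = x_filt[t] + J * (x_smooth[t + 1] - x_pred[t + 1])
        P_smooth[t] = P_filt[t] + J ** 2 * (P_smooth[t + 1] - P_pred[t + 1])
    return x_smooth, P_smooth
```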

M step

$$ R^{\text{new}} = \text{diag} \biggl\{\frac{1}{T}\sum\limits_{t=1}^{T}(y_ty_t^{T}-C\hat{x}_ty_t^{T})\biggr\} $$$$\mathbf{\pi}_0^{\text{new}} = \hat{x}_0$$$$\mathbf{c}^{\text{new}} = (X'^{T}X' + \lambda_2\mathbf{I})^{-1}X'^{T}Y'$$

where $$ \mathbf{c}^{\text{new}} = (C_{11}^{\text{new}},\ldots,C_{1d}^{\text{new}},C_{21}^{\text{new}},\ldots,C_{2d}^{\text{new}},\ldots,C_{p1}^{\text{new}},\ldots,C_{pd}^{\text{new}})^{T} $$

$$Y' = (y_{11},\ldots,y_{T1},y_{12},\ldots,y_{T2},\ldots,y_{1p},\ldots,y_{Tp})^{T}$$$$ X' = \begin{pmatrix} X^{T} & & \\ & \ddots & \\ & & X^{T} \end{pmatrix}_{pT\times pd} $$

so we rearrange the vectorized version of $C^{\text{new}}$ to get it into matrix form.
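For example, since $\mathbf{c}^{\text{new}}$ stacks the rows of $C^{\text{new}}$, recovering the matrix is just a row-major reshape (illustrative numpy snippet with made-up dimensions):

```python
import numpy as np

p, d = 3, 2
c_vec = np.arange(p * d, dtype=float)   # stand-in for the solved vector c^new
C_new = c_vec.reshape(p, d)             # row-major: (C_11, ..., C_1d) becomes row 1, etc.
```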

$f_{\lambda_1}(A;X,Y)$ does not have a closed form solution, but can be solved numerically with a Fast Iterative Shrinkage-Thresholding Algorithm (FISTA).

$$A^{\text{new}} = \text{FISTA}(\|\mathbf{Z}^{T}\mathbf{a}^{\text{old}} -\mathbf{z}\|_2^2,\quad \lambda_1)$$
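A generic textbook FISTA routine for this lasso-type subproblem might look like the sketch below (Beck and Teboulle's recursion; `fista_lasso` and its arguments are my own naming, and one would pass the design matrix $\mathbf{Z}^{T}$ and response $\mathbf{z}$ from the equation above). In the univariate case the same subproblem has a closed-form soft-thresholding solution, which the EM loop later in this notebook uses.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(D, z, lam1, n_iter=200):
    """FISTA for  min_a  0.5 * ||D a - z||_2^2 + lam1 * ||a||_1,
    where D is the design matrix (Z^T in the notation above)."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    w, s = a.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ w - z)               # gradient of 0.5 * ||D w - z||^2
        a_next = soft_threshold(w - grad / L, lam1 / L)
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s ** 2)) / 2.0
        w = a_next + ((s - 1.0) / s_next) * (a_next - a)   # momentum extrapolation
        a, s = a_next, s_next
    return a
```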

Initialization

$$R = I$$$$\mathbf{\pi}_0 = \mathbf{0}$$

$C = \mathbf{U}_{p\times d}$ where $\mathbf{U}_{p\times d}$ is from the compact SVD of $Y$

Columns of $X_{d \times T}$ are used as input for a vector autoregressive (VAR) model to estimate the initial value for $A$.
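A rough sketch of this initialization (my own helper; the choice of $X_0 = U^{T}Y$ for the initial latent trajectories is an assumption on my part, the description above only says the columns of $X$ feed a VAR fit):

```python
import numpy as np

def initialize(Y, d):
    """Initialization sketch.  Y is the p x T data matrix, d the latent dimension."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C0 = U[:, :d]                            # C = U_{p x d} from the compact SVD of Y
    X0 = C0.T @ Y                            # assumed initial latent trajectories (d x T)
    # lag-1 VAR fit by least squares: x_{t+1} ~ A x_t
    A0 = X0[:, 1:] @ np.linalg.pinv(X0[:, :-1])
    R0 = np.eye(Y.shape[0])                  # R = I
    pi0 = np.zeros(d)                        # pi_0 = 0
    return A0, C0, R0, pi0
```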

Implementation

I got stuck on how the E step actually works in practice with the Kalman Filter-Smoother; I will ask about this in today's meeting.

We implement the algorithm in the univariate setting below:


In [ ]:
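Below is a minimal univariate ($d = p = 1$) sketch of the full EM loop, reusing the `kalman_smoother` helper from the E-step section above. As a simplification it plugs the smoothed means $\hat{x}_t$ directly into the M-step formulas (ignoring the posterior second-moment corrections a full EM would carry), and it uses the closed-form soft-threshold solution of the scalar lasso subproblem for $A$ instead of running FISTA:

```python
import numpy as np

def em_lds_univariate(y, n_iter=50, lam1=0.1, lam2=0.1):
    """Sketch of the univariate EM loop for the penalized LDS model."""
    a, c, r, pi0 = 0.5, 1.0, 1.0, 0.0              # simple starting values
    for _ in range(n_iter):
        # E step: smoothed state estimates x_hat[0..T]
        x_hat, _ = kalman_smoother(y, a, c, r, pi0)

        # M step
        pi0 = x_hat[0]
        # C update: ridge regression of y_t on x_hat_t
        c = np.sum(x_hat[1:] * y) / (np.sum(x_hat[1:] ** 2) + lam2)
        # R update (kept strictly positive)
        r = max(np.mean(y ** 2 - c * x_hat[1:] * y), 1e-8)
        # A update: scalar lasso via soft-thresholding
        num = np.sum(x_hat[1:] * x_hat[:-1])
        a = np.sign(num) * max(abs(num) - lam1, 0.0) / np.sum(x_hat[:-1] ** 2)
    return a, c, r, pi0

# illustrative usage with the simulated data from earlier in the notebook
# a_hat, c_hat, r_hat, pi0_hat = em_lds_univariate(y_obs, n_iter=100)
```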

N4SID notes

Link to paper: http://people.duke.edu/~hpgavin/SystemID/References/VanOverschee-Automatica-1994.pdf

At a high level: N4SID stands for Numerical algorithms for Subspace State Space System IDentification. The paper presents 2 algorithms:

  • first one calculates unbiased results
  • second one is a simpler but biased approximation

Preliminary background

Linear subspace identification methods are concerned with systems and models of the form:

$$ x_{k+1} = Ax_k + Bu_k + w_k $$$$ y_k = Cx_k + Du_k + v_k$$

with $$E\left[\begin{pmatrix}w_p\\v_p\end{pmatrix} \begin{pmatrix} w_q^T & v_q^T \end{pmatrix}\right] = \begin{pmatrix} Q & S \\ S^T & R \end{pmatrix} \delta_{qp} \geq 0$$

There is a lot of control-theory mathematics involved (observability matrices, etc.)

Two basic steps:

  • Finding the state sequence and/or the extended observability matrix
  • Finding the state space model

Algorithm 1 (unbiased)

  1. Determine the projections:
$$ Z_i = Y_{i|2i-1} \,\backslash \begin{pmatrix} U_{0 | i-1} \\ U_{i | 2i-1} \\ Y_{0 | i-1} \end{pmatrix} $$$$ Z_{i+1} = Y_{i+1|2i-1} \,\backslash \begin{pmatrix} U_{0 | i} \\ U_{i+1 | 2i-1} \\ Y_{0 | i} \end{pmatrix} $$
  2. Determine the SVD:
$$ \begin{pmatrix} L_i^1 \;|\; L_i^3 \end{pmatrix} \begin{pmatrix} U_{0 | i-1} \\ Y_{0 | i-1} \end{pmatrix} = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} V^T$$
  3. Determine the least squares solution:
$$ \begin{pmatrix} \Gamma_{i-1}Z_{i+1} \\ Y_{i|i} \end{pmatrix} = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} \begin{pmatrix} \Gamma_{i}Z_{i} \\ U_{i|2i-1} \end{pmatrix} + \begin{pmatrix} \rho_1 \\ \rho_{2} \end{pmatrix} $$
  4. The system matrices are determined by:
$$ A = K_{11} $$$$ C = K_{21} $$

and $B$, $D$ follow from a set of linear equations for the system.
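As a rough illustration of the linear algebra behind steps 1 and 2 (not the full algorithm), the kind of row-space projection used in step 1 and the SVD-based order selection in step 2 can be sketched as follows; the helper names and the thresholding rule are my own choices:

```python
import numpy as np

def row_space_projection(A, B):
    """Project the row space of A onto the row space of B:  A * pinv(B) * B."""
    return A @ np.linalg.pinv(B) @ B

def estimate_order(M, tol=1e-8):
    """Choose the system order as the number of dominant singular values of the
    projected matrix, mirroring the SVD step of Algorithm 1 (simple thresholding)."""
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))
```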


In [ ]: