In this notebook, we present a high-level summary of the paper *An M-Estimator for Reduced-Rank High-Dimensional Linear Dynamical System Identification*, as well as a univariate implementation of the algorithm in Python.
Observed variables: $Y=(y_1,\dots,y_T)$
Latent variables: $X=(x_1,\dots,x_T)$
$$P(Y|X) = \prod_{t=1}^T P(y_t | y_{0:t-1}, x_{0:t})$$

$$P(X) = P(x_0) \prod_{t=1}^T P(x_t | x_{0:t-1})$$

A time-invariant state space model makes the following simplifications:

$$P(y_t | y_{0:t-1}, x_{0:t}) \approx P(y_t | x_t)$$

$$P(x_t | x_{0:t-1}) \approx P(x_t | x_{t-1})$$

Linear dynamical systems (resembling Kalman filters):
$$x_{t+1} = Ax_t + \mathbf{w}_t, \qquad \mathbf{w}_t \sim N(\mathbf{0}, Q), \qquad x_0 \sim N(\mathbf{\pi}_0, V_0)$$

$$y_t = Cx_t + \mathbf{v}_t, \qquad \mathbf{v}_t \sim N(\mathbf{0}, R)$$

Notation:
$A \in \mathbb{R}^{d \times d}$ is the state transition matrix
$C \in \mathbb{R}^{p \times d}$ is the generative matrix
$x_t \in \mathbb{R}^{d}$ is the latent state
$y_t \in \mathbb{R}^{p}$ is the observation
$R \in \mathbb{R}^{p \times p}$ is the output noise covariance matrix
$Q \in \mathbb{R}^{d \times d}$ is the state noise covariance matrix
$\pi_0 \in \mathbb{R}^{d}$ is the initial state mean
$V_0 \in \mathbb{R}^{d \times d}$ is the initial state covariance
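To make the setup concrete, here is a minimal simulation sketch of this generative model in NumPy; all dimensions and parameter values are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, T = 3, 5, 200                                # illustrative dimensions

A = rng.standard_normal((d, d))
A /= 1.1 * np.max(np.abs(np.linalg.eigvals(A)))    # scale A to be stable
C = rng.standard_normal((p, d))
Q = np.eye(d)                                      # state noise covariance
R = 0.1 * np.eye(p)                                # output noise covariance
pi0, V0 = np.zeros(d), np.eye(d)                   # initial state mean/covariance

x = rng.multivariate_normal(pi0, V0)               # x_0 ~ N(pi0, V0)
X, Y = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(d), Q)  # x_{t+1} = A x_t + w_t
    y = C @ x + rng.multivariate_normal(np.zeros(p), R)  # y_t = C x_t + v_t
    X.append(x)
    Y.append(y)
X, Y = np.array(X), np.array(Y)                    # shapes (T, d) and (T, p)
```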
The model has the following constraints to make the system identifiable and the model useful:

$Q = I$, i.e. the state noise covariance is fixed to the identity
$V_0 = 0$, so that $x_0 = \mathbf{\pi}_0$ is deterministic
These constraints simplify the model even further, as:
$$x_{t+1} = Ax_t + \mathbf{w}_t, \qquad \mathbf{w}_t \sim N(\mathbf{0}, I), \qquad x_0 = \mathbf{\pi}_0$$

$$y_t = Cx_t + \mathbf{v}_t, \qquad \mathbf{v}_t \sim N(\mathbf{0}, R)$$

Let $\theta = \{A, C, R, \mathbf{\pi}_0\}$ represent all unknown parameters, and $P(X,Y)$ be the likelihood for a generic LDS model. We arrive at the optimization problem:
$$\hat{\theta} = \operatorname*{argmin}_{\theta}\left\{-\log P_\theta(X,Y) + \lambda_1\|A\|_1 + \lambda_2\|C\|_2^2\right\}$$

where the $\lambda$s are tuning parameters. This can be equivalently stated as:
$$\text{minimize} \left\{-\log P_\theta(X,Y)\right\}$$

$$\text{subject to } \alpha\|A\|_1 + (1-\alpha)\|C\|_2^2 \leq t \text{ for some } t;$$

$$A \in \mathcal{A}_{d \times d},\ C \in \mathcal{C}_{p \times d},\ R \in \mathcal{R}_{p \times p},\ \pi_0 \in \Pi_{d \times 1}.$$

Main idea: we can use the EM algorithm to maximize the likelihood function of the observed data.
The likelihood in the model is: $$ P(X,Y)=\prod\limits_{t=1}^{T}P(x_t|x_{t-1})\prod\limits_{t=1}^{T} P(y_t|x_t)\mathbb{1}_{\mathbf{\pi}_0}(x_0) $$
where $\mathbb{1}_{\mathbf{\pi}_0}(x_0)$ is the indicator function and conditional likelihoods are:
$$P(y_t|x_t) = (2\pi)^{-\frac{p}{2}}|R|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}[y_t - Cx_t]^{T}R^{-1}[y_t - Cx_t]\right\}$$

$$P(x_t|x_{t-1}) = (2\pi)^{-\frac{d}{2}}|Q|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}[x_t - Ax_{t-1}]^{T}Q^{-1}[x_t - Ax_{t-1}]\right\} = (2\pi)^{-\frac{d}{2}}\exp\left\{-\frac{1}{2}[x_t - Ax_{t-1}]^{T}[x_t - Ax_{t-1}]\right\}$$

where the second equality uses the constraint $Q = I$. This allows us to further simplify the parameter estimation to:
$$\hat{\theta} = \operatorname*{argmin}_{\theta}\biggl\{\sum\limits_{t=1}^{T}\frac{1}{2}[y_t - Cx_t]^{T}R^{-1}[y_t - Cx_t] + \frac{T}{2}\log|R| + \sum\limits_{t=1}^{T}\frac{1}{2}[x_t - Ax_{t-1}]^{T}[x_t - Ax_{t-1}] + \frac{T}{2}\log|\mathbf{I}| - \log(\mathbb{1}_{\mathbf{\pi}_0}(x_0)) + \lambda_1\|A\|_1 + \lambda_2\|C\|_2^2\biggr\}.$$

Let $\mathbf{\Phi}(\theta, Y, X)$ denote the objective function in the curly brackets.
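As a sanity check on the algebra, here is a sketch of $\mathbf{\Phi}$ as a Python function, under the $Q = I$ constraint above; the function name and data layout are our own choices:

```python
import numpy as np

def phi(A, C, R, Y, X, lam1, lam2):
    """Sketch of the penalized objective Phi(theta, Y, X) (additive constants
    dropped). Assumptions: Q = I, and X[0] = pi_0 so the indicator term and
    the log|I| term both vanish; Y is (T, p), X is (T+1, d)."""
    T = Y.shape[0]
    Rinv = np.linalg.inv(R)
    # Observation term: sum_t 0.5 (y_t - C x_t)^T R^{-1} (y_t - C x_t)
    obs = sum(0.5 * (y - C @ x) @ Rinv @ (y - C @ x) for y, x in zip(Y, X[1:]))
    # State term: sum_t 0.5 ||x_t - A x_{t-1}||^2  (since Q = I)
    state = sum(0.5 * (x1 - A @ x0) @ (x1 - A @ x0)
                for x0, x1 in zip(X[:-1], X[1:]))
    return (obs + 0.5 * T * np.log(np.linalg.det(R)) + state
            + lam1 * np.abs(A).sum() + lam2 * (C ** 2).sum())
```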
Calculate the expected value of the complete-data log-likelihood, $\Gamma = E[\log P(X,Y) \mid Y]$ (i.e., exactly the same as the regular E step in EM).
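For an LDS, this E step is computed with a Kalman filter followed by a Rauch-Tung-Striebel (RTS) smoother. Below is a sketch for the univariate case ($d = p = 1$) under the constraints above; the function name and conventions are ours, not the paper's:

```python
import numpy as np

def e_step_univariate(y, a, c, r, pi0):
    """Sketch of the E step for d = p = 1: Kalman filter + RTS smoother,
    assuming Q = 1 and deterministic x_0 = pi0. Returns the smoothed
    means and variances of x_1, ..., x_T given all observations."""
    T = len(y)
    mu_f, v_f = np.zeros(T), np.zeros(T)           # filtered moments
    for t in range(T):
        if t == 0:
            mu_p, v_p = a * pi0, 1.0               # predict from deterministic x_0
        else:
            mu_p, v_p = a * mu_f[t - 1], a**2 * v_f[t - 1] + 1.0
        k = v_p * c / (c**2 * v_p + r)             # Kalman gain
        mu_f[t] = mu_p + k * (y[t] - c * mu_p)
        v_f[t] = (1.0 - k * c) * v_p
    mu_s, v_s = mu_f.copy(), v_f.copy()            # RTS backward pass
    for t in range(T - 2, -1, -1):
        v_pred = a**2 * v_f[t] + 1.0
        j = v_f[t] * a / v_pred                    # smoother gain
        mu_s[t] += j * (mu_s[t + 1] - a * mu_f[t])
        v_s[t] += j**2 * (v_s[t + 1] - v_pred)
    return mu_s, v_s
```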
In the M step, the update for $C$ is computed in vectorized form:

$$\mathbf{c}^{\text{new}} = (C_{11}^{\text{new}},\ldots,C_{1d}^{\text{new}},C_{21}^{\text{new}},\ldots,C_{2d}^{\text{new}},\ldots,C_{p1}^{\text{new}},\ldots,C_{pd}^{\text{new}})^{T}$$
$$Y' = (y_{11},\ldots,y_{T1},y_{12},\ldots,y_{T2},\ldots,y_{1p},\ldots,y_{Tp})^{T}$$

$$X' = \begin{pmatrix} X^{T} & & \\ & \ddots & \\ & & X^{T} \end{pmatrix}_{pT \times pd},$$

so we rearrange the vectorized version of $C^{\text{new}}$ to get it back into matrix form.
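The bookkeeping here is easy to get wrong, so the following snippet sketches it with NumPy: the block-diagonal $X'$ is $\mathrm{kron}(I_p, X^T)$, and rearranging $\mathbf{c}$ back into matrix form is just a reshape. Variable names are ours, and we use noiseless observations purely for the consistency check:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, T = 4, 2, 10
X = rng.standard_normal((d, T))           # latent states, columns are x_t
Xp = np.kron(np.eye(p), X.T)              # X' as above, shape (p*T, p*d)
C = rng.standard_normal((p, d))
c = C.reshape(-1)                         # (C_11,...,C_1d, C_21,...,C_pd)^T
Yp = Xp @ c                               # stacks row_i(C) . x_t in Y' order
assert np.allclose(Yp.reshape(p, T).T, (C @ X).T)  # matches the Y' ordering
assert np.allclose(c.reshape(p, d), C)    # un-vectorizing is just a reshape
```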
The update for $A$, obtained by minimizing $f_{\lambda_1}(A;X,Y)$, does not have a closed-form solution, but it can be computed numerically with the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA).
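As an illustration, here is a minimal FISTA sketch for the generic lasso problem $\min_a \frac{1}{2}\|Za - z\|_2^2 + \lambda\|a\|_1$; this is the standard formulation, and the names `Z`, `z` only loosely mirror the update written below:

```python
import numpy as np

def fista_lasso(Z, z, lam, n_iter=500):
    """Minimal FISTA sketch for min_a 0.5*||Z a - z||_2^2 + lam*||a||_1."""
    L = np.linalg.norm(Z, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(Z.shape[1])
    w, t = a.copy(), 1.0                   # momentum iterate and step parameter
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - z)           # gradient of the smooth part at w
        u = w - grad / L                   # gradient step
        a_next = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft-threshold
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        w = a_next + ((t - 1) / t_next) * (a_next - a)              # momentum
        a, t = a_next, t_next
    return a
```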
$$A^{\text{new}} = \text{FISTA}\left(\|\mathbf{Z}^{T}\mathbf{a}^{\text{old}} - \mathbf{z}\|_2^2,\ \lambda_1\right)$$

Initialization: $C = \mathbf{U}_{p \times d}$, where $\mathbf{U}_{p \times d}$ comes from the compact SVD of $Y$.
The columns of $X_{d \times T}$ are used as input to a vector autoregressive (VAR) model to estimate the initial value of $A$.
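Putting the two initialization steps together, here is a sketch; treating $X = SV^T$ from the SVD factors as the initial state estimate is our assumption about how the factors are split:

```python
import numpy as np

def initialize(Y, d):
    """Sketch of the initialization above: C from the compact SVD of Y,
    then a least-squares VAR(1) fit on the recovered states for A.
    Y is (p, T); d is the chosen latent dimension."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C0 = U[:, :d]                          # C = U_{p x d}
    X0 = np.diag(s[:d]) @ Vt[:d]           # states X_{d x T} (our convention)
    # VAR(1): A = argmin ||X_next - A X_past||_F^2 via the pseudoinverse
    Xpast, Xnext = X0[:, :-1], X0[:, 1:]
    A0 = Xnext @ np.linalg.pinv(Xpast)
    return A0, C0, X0
```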
Link to paper: http://people.duke.edu/~hpgavin/SystemID/References/VanOverschee-Automatica-1994.pdf
At a high level, the paper presents two numerical algorithms for subspace state-space system identification.
Linear subspace identification methods are concerned with systems and models of the form:
$$x_{k+1} = Ax_k + Bu_k + w_k$$

$$y_k = Cx_k + Du_k + v_k$$

with

$$E\left[\begin{pmatrix} w_p \\ v_p \end{pmatrix}\begin{pmatrix} w_q^T & v_q^T \end{pmatrix}\right] = \begin{pmatrix} Q & S \\ S^T & R \end{pmatrix}\delta_{pq} \geq 0$$
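To make this model class concrete, here is a minimal simulation sketch with jointly drawn process and measurement noise; all matrices and dimensions are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, l, N = 2, 1, 1, 500                  # state, input, output dims; samples

A = np.array([[0.9, 0.1], [0.0, 0.8]])     # stable state matrix
B = rng.standard_normal((n, m))
C = rng.standard_normal((l, n))
D = rng.standard_normal((l, m))

QSR = 0.01 * np.eye(n + l)                 # joint covariance [[Q, S], [S^T, R]]
x = np.zeros(n)
U = rng.standard_normal((N, m))            # exogenous input u_k
Y = np.zeros((N, l))
for k in range(N):
    wv = rng.multivariate_normal(np.zeros(n + l), QSR)  # (w_k, v_k) jointly
    Y[k] = C @ x + D @ U[k] + wv[n:]       # y_k = C x_k + D u_k + v_k
    x = A @ x + B @ U[k] + wv[:n]          # x_{k+1} = A x_k + B u_k + w_k
```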
The paper involves a lot of control-theoretic mathematics (observability matrices, etc.).
Two basic steps:

1. Estimate the state sequence directly from the input-output data (via projections of block Hankel data matrices and an SVD).
2. Solve a least-squares problem for $A$ and $C$; $B$ and $D$ then follow from a set of linear equations for the system.