In this notebook, we present a high-level summary of the paper *An M-Estimator for Reduced-Rank High-Dimensional Linear Dynamical System Identification*, as well as a univariate implementation of the algorithm in Python.
Observed variables: $Y=(y_1,\dots,y_T)$
Latent variables: $X=(x_1,\dots,x_T)$
$$P(Y|X) = \prod_{t=1}^T P(y_t | y_{0:t-1}, x_{0:t})$$

$$P(X) = P(x_0) \prod_{t=1}^T P(x_t | x_{0:t-1})$$

A time-invariant state space model makes the following simplifications:

$$P(y_t | y_{0:t-1}, x_{0:t}) \approx P(y_t | x_t)$$

$$P(x_t | x_{0:t-1}) \approx P(x_t | x_{t-1})$$

Linear dynamical systems (resembling Kalman filters):
$$x_{t+1} = Ax_t + \mathbf{w}_t, \qquad \mathbf{w}_t \sim N(\mathbf{0}, Q), \qquad x_0 \sim N(\mathbf{\pi}_0, V_0)$$

$$y_t = Cx_t + \mathbf{v}_t, \qquad \mathbf{v}_t \sim N(\mathbf{0}, R)$$

Notation:
$A \in \mathbb{R}^{d \times d}$ is the state transition matrix
$C \in \mathbb{R}^{p \times d}$ is the generative matrix
$x_t \in \mathbb{R}^{d}$ is the latent state
$y_t \in \mathbb{R}^{p}$ is the observation
$R \in \mathbb{R}^{p \times p}$ is the output noise covariance matrix
$Q \in \mathbb{R}^{d \times d}$ is the state noise covariance matrix
$\pi_0 \in \mathbb{R}^{d}$ is the initial state mean
$V_0 \in \mathbb{R}^{d \times d}$ is the initial state covariance
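To make the setup concrete, here is a minimal simulation sketch of this generative model in NumPy; all dimensions and parameter values are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, p, T = 3, 5, 200                                # illustrative dimensions

A = rng.standard_normal((d, d))
A /= 1.1 * np.max(np.abs(np.linalg.eigvals(A)))    # scale A to be stable
C = rng.standard_normal((p, d))
Q = np.eye(d)                                      # state noise covariance
R = 0.1 * np.eye(p)                                # output noise covariance
pi0, V0 = np.zeros(d), np.eye(d)                   # initial state mean/covariance

x = rng.multivariate_normal(pi0, V0)               # x_0 ~ N(pi0, V0)
X, Y = [], []
for t in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(d), Q)  # x_{t+1} = A x_t + w_t
    y = C @ x + rng.multivariate_normal(np.zeros(p), R)  # y_t = C x_t + v_t
    X.append(x)
    Y.append(y)
X, Y = np.array(X), np.array(Y)                    # shapes (T, d) and (T, p)
```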
The model has the following constraints to make the system identifiable and the model useful:

$Q = I$, i.e. the state noise covariance is fixed to the identity
$V_0 = 0$, so that $x_0 = \mathbf{\pi}_0$ is deterministic
These constraints simplify the model even further, as:
$$x_{t+1} = Ax_t + \mathbf{w}_t, \qquad \mathbf{w}_t \sim N(\mathbf{0}, I), \qquad x_0 = \mathbf{\pi}_0$$

$$y_t = Cx_t + \mathbf{v}_t, \qquad \mathbf{v}_t \sim N(\mathbf{0}, R)$$

Let $\theta = \{A, C, R, \mathbf{\pi}_0\}$ represent all unknown parameters, and $P(X,Y)$ be the likelihood for a generic LDS model. We arrive at the optimization problem:
$$\hat{\theta} = \operatorname*{argmin}_{\theta}\left\{-\log P_\theta(X,Y) + \lambda_1\|A\|_1 + \lambda_2\|C\|_2^2\right\}$$

where the $\lambda$s are tuning parameters. This can be equivalently stated as:
$$\text{minimize} \left\{-\log P_\theta(X,Y)\right\}$$

$$\text{subject to } \alpha\|A\|_1 + (1-\alpha)\|C\|_2^2 \leq t \text{ for some } t;$$

$$A \in \mathcal{A}_{d \times d},\ C \in \mathcal{C}_{p \times d},\ R \in \mathcal{R}_{p \times p},\ \pi_0 \in \Pi_{d \times 1}.$$

Main idea: we can use the EM algorithm to maximize the likelihood function of the observed data.
The likelihood in the model is: $$ P(X,Y)=\prod\limits_{t=1}^{T}P(x_t|x_{t-1})\prod\limits_{t=1}^{T} P(y_t|x_t)\mathbb{1}_{\mathbf{\pi}_0}(x_0) $$
where $\mathbb{1}_{\mathbf{\pi}_0}(x_0)$ is the indicator function and conditional likelihoods are:
$$P(y_t|x_t) = (2\pi)^{-\frac{p}{2}}|R|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}[y_t - Cx_t]^{T}R^{-1}[y_t - Cx_t]\right\}$$

$$P(x_t|x_{t-1}) = (2\pi)^{-\frac{d}{2}}|Q|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}[x_t - Ax_{t-1}]^{T}Q^{-1}[x_t - Ax_{t-1}]\right\} = (2\pi)^{-\frac{d}{2}}\exp\left\{-\frac{1}{2}[x_t - Ax_{t-1}]^{T}[x_t - Ax_{t-1}]\right\}$$

where the second equality uses the constraint $Q = I$. This allows us to further simplify the parameter estimation to:
$$\hat{\theta} = \operatorname*{argmin}_{\theta}\biggl\{\sum\limits_{t=1}^{T}\frac{1}{2}[y_t - Cx_t]^{T}R^{-1}[y_t - Cx_t] + \frac{T}{2}\log|R| + \sum\limits_{t=1}^{T}\frac{1}{2}[x_t - Ax_{t-1}]^{T}[x_t - Ax_{t-1}] + \frac{T}{2}\log|\mathbf{I}| - \log(\mathbb{1}_{\mathbf{\pi}_0}(x_0)) + \lambda_1\|A\|_1 + \lambda_2\|C\|_2^2\biggr\}.$$

Let $\mathbf{\Phi}(\theta, Y, X)$ denote the objective function in the curly brackets.
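As a sanity check on the algebra, here is a sketch of $\mathbf{\Phi}$ as a Python function, under the $Q = I$ constraint above; the function name and data layout are our own choices:

```python
import numpy as np

def phi(A, C, R, Y, X, lam1, lam2):
    """Sketch of the penalized objective Phi(theta, Y, X) (additive constants
    dropped). Assumptions: Q = I, and X[0] = pi_0 so the indicator term and
    the log|I| term both vanish; Y is (T, p), X is (T+1, d)."""
    T = Y.shape[0]
    Rinv = np.linalg.inv(R)
    # Observation term: sum_t 0.5 (y_t - C x_t)^T R^{-1} (y_t - C x_t)
    obs = sum(0.5 * (y - C @ x) @ Rinv @ (y - C @ x) for y, x in zip(Y, X[1:]))
    # State term: sum_t 0.5 ||x_t - A x_{t-1}||^2  (since Q = I)
    state = sum(0.5 * (x1 - A @ x0) @ (x1 - A @ x0)
                for x0, x1 in zip(X[:-1], X[1:]))
    return (obs + 0.5 * T * np.log(np.linalg.det(R)) + state
            + lam1 * np.abs(A).sum() + lam2 * (C ** 2).sum())
```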
Calculate the expected value of the complete-data log-likelihood, $\Gamma = E[\log P(X,Y) \mid Y]$ (i.e., exactly the same as the regular E step in EM).
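For an LDS, this E step is computed with a Kalman filter followed by a Rauch-Tung-Striebel (RTS) smoother. Below is a sketch for the univariate case ($d = p = 1$) under the constraints above; the function name and conventions are ours, not the paper's:

```python
import numpy as np

def e_step_univariate(y, a, c, r, pi0):
    """Sketch of the E step for d = p = 1: Kalman filter + RTS smoother,
    assuming Q = 1 and deterministic x_0 = pi0. Returns the smoothed
    means and variances of x_1, ..., x_T given all observations."""
    T = len(y)
    mu_f, v_f = np.zeros(T), np.zeros(T)           # filtered moments
    for t in range(T):
        if t == 0:
            mu_p, v_p = a * pi0, 1.0               # predict from deterministic x_0
        else:
            mu_p, v_p = a * mu_f[t - 1], a**2 * v_f[t - 1] + 1.0
        k = v_p * c / (c**2 * v_p + r)             # Kalman gain
        mu_f[t] = mu_p + k * (y[t] - c * mu_p)
        v_f[t] = (1.0 - k * c) * v_p
    mu_s, v_s = mu_f.copy(), v_f.copy()            # RTS backward pass
    for t in range(T - 2, -1, -1):
        v_pred = a**2 * v_f[t] + 1.0
        j = v_f[t] * a / v_pred                    # smoother gain
        mu_s[t] += j * (mu_s[t + 1] - a * mu_f[t])
        v_s[t] += j**2 * (v_s[t + 1] - v_pred)
    return mu_s, v_s
```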
In the M step, the update for $C$ is computed in vectorized form:

$$\mathbf{c}^{\text{new}} = (C_{11}^{\text{new}},\ldots,C_{1d}^{\text{new}},C_{21}^{\text{new}},\ldots,C_{2d}^{\text{new}},\ldots,C_{p1}^{\text{new}},\ldots,C_{pd}^{\text{new}})^{T}$$
$$Y' = (y_{11},\ldots,y_{T1},y_{12},\ldots,y_{T2},\ldots,y_{1p},\ldots,y_{Tp})^{T}$$

$$X' = \begin{pmatrix} X^{T} & & \\ & \ddots & \\ & & X^{T} \end{pmatrix}_{pT \times pd},$$

so we rearrange the vectorized version of $C^{\text{new}}$ to get it back into matrix form.
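The bookkeeping here is easy to get wrong, so the following snippet sketches it with NumPy: the block-diagonal $X'$ is $\mathrm{kron}(I_p, X^T)$, and rearranging $\mathbf{c}$ back into matrix form is just a reshape. Variable names are ours, and we use noiseless observations purely for the consistency check:

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, T = 4, 2, 10
X = rng.standard_normal((d, T))           # latent states, columns are x_t
Xp = np.kron(np.eye(p), X.T)              # X' as above, shape (p*T, p*d)
C = rng.standard_normal((p, d))
c = C.reshape(-1)                         # (C_11,...,C_1d, C_21,...,C_pd)^T
Yp = Xp @ c                               # stacks row_i(C) . x_t in Y' order
assert np.allclose(Yp.reshape(p, T).T, (C @ X).T)  # matches the Y' ordering
assert np.allclose(c.reshape(p, d), C)    # un-vectorizing is just a reshape
```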
The update for $A$, obtained by minimizing $f_{\lambda_1}(A;X,Y)$, does not have a closed-form solution, but it can be computed numerically with the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA).
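As an illustration, here is a minimal FISTA sketch for the generic lasso problem $\min_a \frac{1}{2}\|Za - z\|_2^2 + \lambda\|a\|_1$; this is the standard formulation, and the names `Z`, `z` only loosely mirror the update written below:

```python
import numpy as np

def fista_lasso(Z, z, lam, n_iter=500):
    """Minimal FISTA sketch for min_a 0.5*||Z a - z||_2^2 + lam*||a||_1."""
    L = np.linalg.norm(Z, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(Z.shape[1])
    w, t = a.copy(), 1.0                   # momentum iterate and step parameter
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - z)           # gradient of the smooth part at w
        u = w - grad / L                   # gradient step
        a_next = np.sign(u) * np.maximum(np.abs(u) - lam / L, 0.0)  # soft-threshold
        t_next = (1 + np.sqrt(1 + 4 * t**2)) / 2
        w = a_next + ((t - 1) / t_next) * (a_next - a)              # momentum
        a, t = a_next, t_next
    return a
```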
$$A^{\text{new}} = \text{FISTA}\left(\|\mathbf{Z}^{T}\mathbf{a}^{\text{old}} - \mathbf{z}\|_2^2,\ \lambda_1\right)$$

Initialization: $C = \mathbf{U}_{p \times d}$, where $\mathbf{U}_{p \times d}$ comes from the compact SVD of $Y$.
The columns of $X_{d \times T}$ are used as input to a vector autoregressive (VAR) model to estimate the initial value of $A$.
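Putting the two initialization steps together, here is a sketch; treating $X = SV^T$ from the SVD factors as the initial state estimate is our assumption about how the factors are split:

```python
import numpy as np

def initialize(Y, d):
    """Sketch of the initialization above: C from the compact SVD of Y,
    then a least-squares VAR(1) fit on the recovered states for A.
    Y is (p, T); d is the chosen latent dimension."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C0 = U[:, :d]                          # C = U_{p x d}
    X0 = np.diag(s[:d]) @ Vt[:d]           # states X_{d x T} (our convention)
    # VAR(1): A = argmin ||X_next - A X_past||_F^2 via the pseudoinverse
    Xpast, Xnext = X0[:, :-1], X0[:, 1:]
    A0 = Xnext @ np.linalg.pinv(Xpast)
    return A0, C0, X0
```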
Link to paper: http://people.duke.edu/~hpgavin/SystemID/References/VanOverschee-Automatica-1994.pdf
At a high level, the paper presents two numerical algorithms for subspace state-space system identification.
Linear subspace identification methods are concerned with systems and models of the form:
$$x_{k+1} = Ax_k + Bu_k + w_k$$

$$y_k = Cx_k + Du_k + v_k$$

with

$$E\left[\begin{pmatrix} w_p \\ v_p \end{pmatrix}\begin{pmatrix} w_q^T & v_q^T \end{pmatrix}\right] = \begin{pmatrix} Q & S \\ S^T & R \end{pmatrix}\delta_{pq} \geq 0$$
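To make this model class concrete, here is a minimal simulation sketch with jointly drawn process and measurement noise; all matrices and dimensions are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, l, N = 2, 1, 1, 500                  # state, input, output dims; samples

A = np.array([[0.9, 0.1], [0.0, 0.8]])     # stable state matrix
B = rng.standard_normal((n, m))
C = rng.standard_normal((l, n))
D = rng.standard_normal((l, m))

QSR = 0.01 * np.eye(n + l)                 # joint covariance [[Q, S], [S^T, R]]
x = np.zeros(n)
U = rng.standard_normal((N, m))            # exogenous input u_k
Y = np.zeros((N, l))
for k in range(N):
    wv = rng.multivariate_normal(np.zeros(n + l), QSR)  # (w_k, v_k) jointly
    Y[k] = C @ x + D @ U[k] + wv[n:]       # y_k = C x_k + D u_k + v_k
    x = A @ x + B @ U[k] + wv[:n]          # x_{k+1} = A x_k + B u_k + w_k
```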
The paper involves a lot of control-theoretic mathematics (observability matrices, etc.).
Two basic steps:

1. Estimate the state sequence directly from the input-output data (via projections of block Hankel data matrices and an SVD).
2. Solve a least-squares problem for $A$ and $C$; $B$ and $D$ then follow from a set of linear equations for the system.