$N$ is the number of timesteps
$F$ is the number of features (perhaps spectrogram bins)
$X$ is a data matrix with dimension $N \times F$
In the simplest (non-convolutive) case, we will decompose the data matrix $X$ using $K$ basis vectors
$A$ is an activation matrix with dimension $K \times N$
$D$ is a dictionary matrix with dimension $K \times F$
In the convolutive case, we will instead decompose the data matrix $X$ using $K$ basis windows, each of length $W$ timesteps
$A$ is an activation matrix with dimension $K \times N$
$D$ is a dictionary tensor with dimension $K \times W \times F$
So $D_0$ will be the first basis window, with dimension $W \times F$
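As a concrete sketch of these shapes in numpy (the specific sizes, and the name `D_conv` for the convolutive dictionary, are placeholders chosen just for illustration):

```python
import numpy as np

N, F = 1000, 256    # timesteps, features (e.g. spectrogram bins)
K, W = 8, 20        # number of bases, window length

X = np.abs(np.random.randn(N, F))          # data matrix, N x F (non-negative)

# Non-convolutive case: one basis vector per component
A = np.abs(np.random.randn(K, N))          # activations, K x N
D = np.abs(np.random.randn(K, F))          # dictionary,  K x F

# Convolutive case: one basis window per component
D_conv = np.abs(np.random.randn(K, W, F))  # dictionary tensor, K x W x F
first_window = D_conv[0]                   # W x F, the first basis window (D_0 above)
```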
We will simultaneously try to find a good activation matrix and a good dictionary. In other words, we try to find a good set of basis vectors (or basis windows), and at the same time the coefficients that approximate the original data matrix using them.
In choosing a cost function to optimize, we'll be picking a noise model. The connection is straightforward in the non-convolutive case. Optimizing the squared error is equivalent to an MLE estimate under Gaussian noise. Optimizing the generalized KL-Divergence is equivalent to an MLE estimate under Poisson noise.
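As a quick sketch of why this holds in the Poisson case: if each entry $x_{ij}$ is modeled as drawn from $\text{Poisson}(\tilde{x}_{ij})$, the negative log-likelihood is

$-\log p(X \mid \widetilde{X}) = \sum_{ij}\left(\tilde{x}_{ij} - x_{ij}\log\tilde{x}_{ij} + \log x_{ij}!\right)$

which differs from the generalized KL-divergence below only by terms that do not depend on $\widetilde{X}$, so minimizing one minimizes the other.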
In the non-convolutive case, the approximation is constructed by summing up the basis vectors in $D$, weighted by the activations in $A$:
$\widetilde{X} = A^TD$
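In numpy, this non-convolutive reconstruction is just a matrix product (a minimal sketch with placeholder sizes):

```python
import numpy as np

N, F, K = 1000, 256, 8
A = np.abs(np.random.randn(K, N))   # activations, K x N
D = np.abs(np.random.randn(K, F))   # dictionary,  K x F

# X_tilde[n, f] = sum over k of A[k, n] * D[k, f]
X_tilde = A.T @ D                   # N x F approximation
assert X_tilde.shape == (N, F)
```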
If our original data matrix $X$ arose through measurements of some real-world process, it is often the case that $X$ is a noisy combination of a small (compared to $N$) number of signals. Part of our motivation for decomposing $X$ into $A$ and $D$ is to remove some of that noise. If the noise is Poisson, its variance grows with the magnitude of the underlying signal. In other words, the greater $X_{ij}$, the more uncertainty we should have about the magnitude of the underlying signal we would like to extract.
In creating the approximation we'll attempt to capture the parts of the data we care about, and leave out the rest. This is done by choosing an appropriate cost function, or, looking at it from a different angle, by choosing a noise model. We'll start with the generalized KL-divergence, which, at least in the non-convolutive case, corresponds to a Poisson noise model.
So we'll attempt to minimize $D_{KL}(X \,\|\, \widetilde{X}) = \sum_{ij}\left(x_{ij}\log\frac{x_{ij}}{\tilde{x}_{ij}} - x_{ij} + \tilde{x}_{ij}\right)$
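Here's a small helper for that cost, as a sketch; the tiny `eps` guarding against log-of-zero and divide-by-zero is my own addition:

```python
import numpy as np

def generalized_kl(X, X_tilde, eps=1e-12):
    """Generalized KL-divergence between data X and its approximation X_tilde."""
    X = np.asarray(X, dtype=float)
    X_tilde = np.asarray(X_tilde, dtype=float)
    return np.sum(X * np.log((X + eps) / (X_tilde + eps)) - X + X_tilde)
```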
This model is convex in either $A$ or $D$, but not in both. To exploit this convexity we alternate between updating $A$ with $D$ held fixed, and updating $D$ with $A$ held fixed.
As it turns out, using the simple multiplicative updates from this paper, the updates to both $A$ and $D$ depend on the current approximation only through the ratio...
$R=\frac{X}{\widetilde{X}}$
This can be seen as a multiplicative residual. If we are trying to update $A$ with $D$ held fixed, we need to decide how to distribute this residual to the parameters of $A$. In other words, how to change the activations to bring $\widetilde{X}$ closer to $X$
To do this we take each activation coefficient independently, and ask whether the corresponding basis covaries with the parts of $R$ that it can influence. So for the activation at time $t$ and basis $j$, we want to know...
$\widehat{D_{j}} \cdot R_{t}$
In words, we want to know how much of the variation in $R$ at time $t$ can be explained by a particular basis
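As a sketch of that quantity for one time step $t$ and one basis $j$, reading the hat as "normalized to sum to one" (that reading of the notation is my assumption):

```python
import numpy as np

K, N, F = 8, 1000, 256              # placeholder sizes
X = np.abs(np.random.randn(N, F))
A = np.abs(np.random.rand(K, N))
D = np.abs(np.random.rand(K, F))

R = X / (A.T @ D)                   # multiplicative residual, N x F

j, t = 0, 0
D_hat_j = D[j] / D[j].sum()         # basis j, normalized to sum to one
factor = D_hat_j @ R[t]             # how strongly basis j lines up with R at time t
```

A factor above 1 suggests basis $j$ is under-represented at time $t$ and its activation should grow; below 1, that it should shrink.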
Similarly, when updating the dictionaries, we can take each dictionary parameter and ask whether its corresponding activations covary with $R$
$\widehat{A_j} \cdot R$
We can use this info to update both $A$ and $D$
$ A_{j} *= \widehat{D_{j}} \cdot R_{t}$
$D_{j} *= \widehat{A_j} \cdot R$
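Putting the simple (whole-matrix) case together as a numpy sketch. The hat is again read as normalization, and the `eps` guards and the name `kl_nmf_step` are my own; this is the standard form of the KL multiplicative updates rather than anything specific to this notebook:

```python
import numpy as np

def kl_nmf_step(X, A, D, eps=1e-12):
    """One round of multiplicative updates for X ~ A.T @ D under generalized KL."""
    # Update A with D held fixed
    R = X / (A.T @ D + eps)                                  # multiplicative residual, N x F
    A *= (D @ R.T) / (D.sum(axis=1, keepdims=True) + eps)    # A_jt *= D_hat_j . R_t

    # Update D with A held fixed (recompute R with the new A)
    R = X / (A.T @ D + eps)
    D *= (A @ R) / (A.sum(axis=1, keepdims=True) + eps)      # D_j *= A_hat_j . R
    return A, D
```

Iterating this step while monitoring the generalized KL cost should drive the cost down.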
When the basis windows are shorter than $X$ (that is, $W < N$), and we want activations for different offsets, this gets a bit more complex. It seems to work well to treat each offset independently in the updates.
$ A_{jt} *= \widehat{D_{jt'}} \cdot R_{t+t'}$
$D_{jt'} *= \widehat{A_{jt}} \cdot R_{t+t'}$
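A rough numpy sketch of the convolutive case, with the same normalization reading of the hat. The function names, the `eps` guards, and the way all offsets are accumulated into a single update factor (rather than applied one offset at a time, as suggested above) are my own choices:

```python
import numpy as np

def conv_reconstruct(A, D):
    """X_tilde[t, f] = sum over j, t' of A[j, t - t'] * D[j, t', f]."""
    _, N = A.shape
    _, W, F = D.shape
    X_tilde = np.zeros((N, F))
    for tp in range(W):
        # an activation at time t contributes D[:, tp, :] at time t + tp
        X_tilde[tp:] += A[:, :N - tp].T @ D[:, tp, :]
    return X_tilde

def conv_kl_step(X, A, D, eps=1e-12):
    """One round of multiplicative updates for the convolutive model under generalized KL."""
    _, N = A.shape
    W = D.shape[1]

    # Update A: offset tp pairs the basis slice D[:, tp, :] with the residual at t + tp
    R = X / (conv_reconstruct(A, D) + eps)
    numer = np.zeros_like(A)
    denom = np.zeros_like(A)
    for tp in range(W):
        numer[:, :N - tp] += D[:, tp, :] @ R[tp:].T
        denom[:, :N - tp] += D[:, tp, :].sum(axis=1, keepdims=True)
    A *= numer / (denom + eps)

    # Update D: the activation at time t pairs with the residual at t + tp
    R = X / (conv_reconstruct(A, D) + eps)
    for tp in range(W):
        numer_d = A[:, :N - tp] @ R[tp:]                    # K x F
        denom_d = A[:, :N - tp].sum(axis=1, keepdims=True)  # K x 1
        D[:, tp, :] *= numer_d / (denom_d + eps)
    return A, D
```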