The GitHub native renderer for Jupyter Notebooks doesn't seem to cope with the LaTeX in this document. You can try nbviewer instead:

Bayesian and Information Theoretic ideas

Here we take an approach motivated by Bayesian Inference.

References

  1. Mackay, "Information Theory, Inference, and Learning Algorithms" (Cambridge University Press, 2003) Available for on-screen reading at http://www.inference.org.uk/itprnn/book.html

  2. Gelman et al. "Bayesian Data Analysis. 2nd Edition" (Chapman & Hall/CRC, 2004)

Point process background

As before, we assume that the events occurring in the "validation period" (e.g. the day after the prediction) are well approximated by an Inhomogeneous Poisson point process with overall rate $\lambda$ (the expected number of events in the validation period) and a probability density function $f$, defined on the region of interest $\Omega$, which gives the local intensity. This means that in a region $A$, the expected number of events is $\lambda \int_A f$. The overall likelihood function is $$ l((x_i)_{i=1}^n) = \frac{\lambda^n e^{-\lambda}}{n!} \prod_{i=1}^n f(x_i), $$ where the first term comes from the Poisson distribution. In particular, conditioned upon $n$, the locations of the events $(x_i)$ are independent of one another.
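For concreteness, the log-likelihood is straightforward to evaluate numerically. The following is a minimal sketch (the function name and inputs are ours, not part of the text), assuming we already have $\lambda$ and the density values $f(x_i)$ at the observed events:

```python
import numpy as np
from scipy.special import gammaln

def poisson_process_log_likelihood(lam, f_values):
    """Log-likelihood of the inhomogeneous Poisson process described above.

    lam      : overall rate, the expected number of events in the validation period
    f_values : array of f(x_i), the density evaluated at each observed event
    """
    n = len(f_values)
    # log( lam^n e^{-lam} / n! ) + sum_i log f(x_i)
    return n * np.log(lam) - lam - gammaln(n + 1) + np.sum(np.log(f_values))
```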

Suppose we have partitioned the region $\Omega$ into disjoint regions $(\Omega_k)_{k=1}^K$. For each $k$ set $$ q_k = \int_{\Omega_k} f(x) \ dx. $$ Then, conditional on $n$, each $x_i$ will occur in region $\Omega_k$ with probability $q_k$, independently. If $n_k$ denotes the number of events we see in $\Omega_k$, then $(n_k)_{k=1}^K$ will be distributed as a Multinomial distribution, so $$ \mathbb P((n_k) \mid \textstyle{\sum_k} n_k=n) = \displaystyle\frac{n!}{\textstyle{\prod_k}n_k!} \prod_k q_k^{n_k}$$
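As an illustration, conditional on the total count $n$ the per-region counts $(n_k)$ can be simulated directly from a Multinomial. This sketch uses made-up values for $\lambda$ and $(q_k)$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical cell probabilities q_k (in practice, obtained by integrating f
# over each region Omega_k) and overall rate lambda.
q = np.array([0.1, 0.2, 0.3, 0.4])
lam = 10.0

n = rng.poisson(lam)          # total number of events in the validation period
n_k = rng.multinomial(n, q)   # counts per region: Multinomial(n, (q_k))
print(n, n_k)
```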

Fully Bayesian approach

We regard $\lambda$ as a Nuisance parameter, which in a Bayesian framework we can average over [2, Chapter 3]. We now make a choice of prior distribution for $(q_k)$ and $\lambda$, assuming that they are independent.

  • For $\lambda$ we choose the conjugate prior, which is a Gamma distribution. The Jeffreys prior is the improper distribution $\Gamma(0,0)$, so we might consider $\Gamma(\alpha, \beta)$ for small $\alpha, \beta$.
  • For $(q_k)$ we again choose the conjugate prior, which is a Dirichlet distribution $\operatorname{Dir}(K,(\alpha_k))$. Again, an uninformative prior is given by small $\alpha_k$. Another choice might be "climatic": that is, take $\alpha_k$ to reflect the long-term average rate in region $\Omega_k$. (A short sampling sketch follows this list.)
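Here is a minimal sketch of drawing from these priors with scipy; the values of $\alpha$, $\beta$ and $(\alpha_k)$ are illustrative only (small, but not so small that the samples degenerate numerically):

```python
import numpy as np
from scipy.stats import gamma, dirichlet

rng = np.random.default_rng(0)

K = 4
a, b = 0.1, 0.1               # Gamma(alpha, beta) prior for lambda
alpha_k = np.full(K, 0.1)     # Dirichlet prior for (q_k)

lam_sample = gamma(a, scale=1 / b).rvs(random_state=rng)   # scipy parametrises by shape/scale
q_sample = dirichlet(alpha_k).rvs(random_state=rng)[0]     # one draw of (q_k)
```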

After observing $(n_k)$ we get the posterior distribution \begin{align*} p(\lambda, (q_k) \mid (n_k)) &\propto p( (n_k) \mid \lambda, (q_k) ) p(\lambda) p((q_k)) = \frac{\lambda^n e^{-\lambda}}{n!} \frac{n!}{\prod_k n_k!} \prod_k q_k^{n_k} \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta \lambda} \frac{1}{B((\alpha_k))} \prod_k q_k^{\alpha_k-1}\\ &\propto \frac{1}{\prod_k n_k!} \lambda^{n+\alpha-1} e^{-\lambda(\beta+1)} \prod_k q_k^{(n_k+\alpha_k-1)}. \end{align*} Integrating out $\lambda$ we obtain \begin{align*} p((q_k) \mid (n_k)) &\propto (\beta+1)^{-n-\alpha} \Gamma(n+\alpha) \frac{1}{\prod_k n_k!} \prod_k q_k^{(n_k+\alpha_k-1)} \\ &\propto \prod_k q_k^{(n_k+\alpha_k-1)} \end{align*} Normalising this (so that the integral over the simplex is 1) just gives a Dirichlet distribution with parameters $(\alpha_k+n_k)$.
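In code, the conjugate update is just adding the observed counts to the prior parameters; a minimal sketch with illustrative numbers:

```python
import numpy as np

alpha_k = np.array([0.1, 0.1, 0.1, 0.1])   # prior Dirichlet parameters
n_k = np.array([3, 0, 5, 2])               # observed counts per region

posterior_alpha = alpha_k + n_k            # posterior is Dir(alpha_k + n_k)
posterior_mean = posterior_alpha / posterior_alpha.sum()

rng = np.random.default_rng(1)
posterior_samples = rng.dirichlet(posterior_alpha, size=1000)  # draws of (q_k)
```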

Footnote

That we get no dependence on $\lambda$ should not surprise us here. In a general setup, suppose we have data $x$ and parameters $\theta_1,\theta_2$, and that $$ p(x\mid \theta_1,\theta_2) = p(y\mid\theta_2) p(x\mid \theta_1,y) $$ where $y$ is a function of $x$. (In our case, $x=(n_k)$ and $y=n=\sum_k n_k$.) Then \begin{align*} p(\theta_1\mid x) &= \int p(\theta_1,\theta_2\mid x) \ d\theta_2 \propto \int p(x\mid\theta_1,\theta_2) p(\theta_1,\theta_2) \ d\theta_2 = \int p(x\mid\theta_1,\theta_2) p(\theta_1) p(\theta_2) \ d\theta_2 \\ &= \int p(y\mid\theta_2) p(x\mid \theta_1,y) p(\theta_1) p(\theta_2) \ d\theta_2 = p(x\mid \theta_1,y) p(\theta_1) \int p(y\mid\theta_2) p(\theta_2) \ d\theta_2 \\ &\propto p(x\mid \theta_1,y) p(\theta_1) \int p(\theta_2\mid y) \ d\theta_2 = p(x\mid \theta_1,y) p(\theta_1), \end{align*} and so there is no dependence on $\theta_2$ nor on the prior $p(\theta_2)$.

Comparing against a prediction

We are given a prediction of the values $q_k$, say $(\hat p_k)_{k=1}^K$. Our idea is to treat this prediction as a strongly informative Bayesian prior, form the posterior as above, and then to compare the posterior to the prior. A "good" prediction should give a close match between the prior and posterior.

  • An alternative way to think about this is that a "good" prediction should mean that on learning the real positions (the data $(n_k)$) we do not gain much information.

Our prior will be a Dirichlet distribution with $\alpha_k = t \hat p_k$ for large $t$. As above, the posterior will also be a Dirichlet distribution, with $\alpha_k = t \hat p_k + n_k$. This means that the prior mean is $q_k=\hat p_k$ (that is, the prediction) while the posterior mean is $$ q_k = \frac{t}{t+n} \hat p_k + \frac{n}{t+n} \frac{n_k}{n}, $$ that is, a convex combination of $\hat p_k$ and $n_k/n$, which is the MLE for the Multinomial given the data. The parameter $t$ controls how much weight we give to the prediction.
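A quick check of this convex-combination identity for the posterior mean (the function name and numbers are ours):

```python
import numpy as np

def posterior_mean(p_hat, n_k, t):
    """Posterior mean of (q_k) under the prior Dir(t * p_hat)."""
    n = n_k.sum()
    return (t * p_hat + n_k) / (t + n)

p_hat = np.array([0.25, 0.25, 0.5])
n_k = np.array([2, 3, 5])
t, n = 20.0, n_k.sum()

# Agrees with  t/(t+n) * p_hat + n/(t+n) * (n_k / n)
assert np.allclose(posterior_mean(p_hat, n_k, t),
                   t / (t + n) * p_hat + n / (t + n) * n_k / n)
```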

We wish to compare these distributions. A common approach is to use the Kullback–Leibler divergence $D_{KL}(P\|Q)$. To quote,

$D_{KL}(P \| Q)$ is a measure of the information gained when one revises one's beliefs from the prior probability distribution $Q$ to the posterior probability distribution $P$.

In our case, $Q \sim Dir(\alpha_k) = Dir(t\hat p_k)$ and $P\sim Dir(\alpha_k+n_k)$. Thus $$ D_{KL}(P\|Q) = \mathbb E_P( \log(p/q) ) $$ where $$ p((q_k)) = \frac{1}{B((\alpha_k+n_k))} \prod_k q_k^{\alpha_k+n_k-1} $$ and similarly for $q$, so that $$ p/q = \frac{B((\alpha_k))}{B((\alpha_k+n_k))} \prod_k q_k^{n_k}. $$ Using the linearity of the expectation, $$ D_{KL}(P\|Q) = \log\frac{B((\alpha_k))}{B((\alpha_k+n_k))} + \sum_k n_k \mathbb E_P(\log q_k) $$ Using that the marginal distribution of $q_k$ for $P$ is $Beta(\alpha_k+n_k, \sum_j \alpha_j + n - \alpha_k-n_k)$ we find that $$ \mathbb E_P(\log q_k) = \psi(\alpha_k+n_k) - \psi\big(\sum_j \alpha_j + n\big), $$ where $\psi$ is the Digamma function.

Thus $$ D_{KL}(P\|Q) = \log\frac{B((\alpha_k))}{B((\alpha_k+n_k))} + \sum_k n_k \psi(\alpha_k+n_k) - n \psi\big(\sum_j \alpha_j + n\big) $$

With $\alpha_k = t\hat p_k$ this reduces to $$ D_{KL}(P\|Q) = \log\frac{B((\alpha_k))}{B((\alpha_k+n_k))} + \sum_k n_k \psi(t\hat p_k+n_k) - n \psi\big(t + n\big). $$
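This is easy to evaluate with scipy's gammaln and digamma; a sketch, with names of our choosing:

```python
import numpy as np
from scipy.special import gammaln, digamma

def log_beta(alpha):
    """log of the multivariate Beta function B((alpha_k))."""
    return np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))

def dirichlet_kl_posterior_prior(alpha, n_k):
    """D_KL( Dir(alpha + n_k) || Dir(alpha) ), following the formula above."""
    n = np.sum(n_k)
    post = alpha + n_k
    return (log_beta(alpha) - log_beta(post)
            + np.sum(n_k * digamma(post))
            - n * digamma(np.sum(alpha) + n))

# Example: alpha = t * p_hat, with made-up values
t, p_hat = 50.0, np.array([0.2, 0.3, 0.5])
n_k = np.array([4, 1, 7])
print(dirichlet_kl_posterior_prior(t * p_hat, n_k))
```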

Efficient computation

The normalising constant is $$ B((\alpha_k)) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma(\alpha_0)} $$ where $\alpha_0 = \sum_k \alpha_k$ and $\Gamma$ is the Gamma function. We know that $\Gamma(x+1) = x\Gamma(x)$ and so for any integer $n\geq0$, $$ \frac{\Gamma(\alpha+n)}{\Gamma(\alpha)} = (\alpha+n-1)(\alpha+n-2)\cdots \alpha = \prod_{i=0}^{n-1} (\alpha+i), $$ where as usual the empty product is $1$.

Thus \begin{align*} \frac{B((\alpha_k))}{B((\alpha_k+n_k))} &= \frac{\prod_k \Gamma(\alpha_k)}{\prod_k \Gamma(\alpha_k+n_k)} \frac{\Gamma(\alpha_0+n)}{\Gamma(\alpha_0)} = \frac{\prod_{i=0}^{n-1} (\alpha_0+i)}{\prod_k \prod_{i=0}^{n_k-1} (\alpha_k+i)} \end{align*} and so $$ \log \frac{B((\alpha_k))}{B((\alpha_k+n_k))} = \sum_{i=0}^{n-1} \log(\alpha_0+i) - \sum_k \sum_{i=0}^{n_k-1} \log(\alpha_k+i), $$ with the convention that if $n_k=0$ the corresponding inner sum is $0$.
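A sketch of this sum-of-logs computation, checked against the direct gammaln evaluation (illustrative values):

```python
import numpy as np
from scipy.special import gammaln

def log_beta_ratio(alpha, n_k):
    """log( B((alpha_k)) / B((alpha_k + n_k)) ) via the sums of logs above."""
    alpha0, n = np.sum(alpha), int(np.sum(n_k))
    total = np.sum(np.log(alpha0 + np.arange(n)))
    for a, nk in zip(alpha, n_k):
        total -= np.sum(np.log(a + np.arange(int(nk))))  # empty sum when n_k == 0
    return total

alpha = np.array([0.5, 1.5, 2.0])
n_k = np.array([3, 0, 4])
direct = (np.sum(gammaln(alpha)) - gammaln(np.sum(alpha))
          - np.sum(gammaln(alpha + n_k)) + gammaln(np.sum(alpha) + np.sum(n_k)))
assert np.isclose(log_beta_ratio(alpha, n_k), direct)
```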

We do not currently see any cancellation in the terms involving the Digamma function.

Frankly, this does not work well, so we shall explore other ideas.

Predictive distributions

Instead, we can compare the prior and posterior predictive distributions. Again, we use a Dirichlet prior, which is conjugate to the Multinomial, leading to the prior and posterior predictive distributions being Dirichlet-multinomial distributions with different hyper-parameters, as above.

It is hard to compare two Dirichlet-multinomial distributions, however, as we must condition on the total number of observations. Instead, as we wish to compare the distributions, and not samples, we can work with the Categorical distribution, equivalently, the Dirichlet-multinomial with a single observation. This has probability mass function $$ \mathbb P(k) = \frac{\Gamma(\alpha_0)}{\Gamma(1+\alpha_0)} \frac{\Gamma(\alpha_k+1)}{\Gamma(\alpha_k)} = \frac{\alpha_k}{\alpha_0}, $$ where as usual $\alpha_0 = \sum_i \alpha_i$.

Thus, the posterior predictive distribution $P$ and the prior predictive $Q$ satisfy $$ p(k) = \frac{\alpha_k + n_k}{\alpha_0 + n}, \qquad q(k) = \frac{\alpha_k}{\alpha_0}. $$ The Kullback-Leibler divergence is thus $$ \sum_{k=1}^K p(k) \log\big(p(k) / q(k)\big) = \sum_{k=1}^K \frac{\alpha_k + n_k}{\alpha_0 + n} \log\Big(\frac{\alpha_0(\alpha_k+n_k)}{(\alpha_0+n)\alpha_k}\Big). $$ With $\alpha_k = t\hat p_k$ this becomes $$ \sum_{k=1}^K \frac{t\hat p_k + n_k}{t + n} \log\Big(\frac{t\hat p_k+n_k}{(t+n)\hat p_k}\Big). $$
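This final quantity is simple to compute; a minimal sketch (the function name and numbers are ours):

```python
import numpy as np

def predictive_kl(p_hat, n_k, t):
    """KL divergence between the posterior and prior predictive Categorical
    distributions, with prior Dir(t * p_hat)."""
    n = n_k.sum()
    p = (t * p_hat + n_k) / (t + n)   # posterior predictive
    q = p_hat                         # prior predictive
    return np.sum(p * np.log(p / q))

p_hat = np.array([0.2, 0.3, 0.5])     # the prediction (hat p_k)
n_k = np.array([4, 1, 7])             # observed counts
print(predictive_kl(p_hat, n_k, t=50.0))
```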

