The GitHub native renderer for Jupyter Notebooks doesn't seem to cope with the LaTeX in this document. You can try nbviewer instead:
Here we take an approach motivated by Bayesian Inference.
Mackay, "Information Theory, Inference, and Learning Algorithms" (Cambridge University Press, 2003) Available for on-screen reading at http://www.inference.org.uk/itprnn/book.html
Gelman et al. "Bayesian Data Analysis. 2nd Edition" (Chapman & Hall/CRC, 2004)
As before, we assume that the events occurring in the "validation period" (e.g. the next day after the prediction) are well approximated by an Inhomogeneous Poisson point process with overall rate $\lambda$ (the expected number of events in the validation period) and a probability density function $f$, defined on the region of interest $\Omega$, which gives the local intensity. This means that in a region $A$, the expected number of events is $\lambda \int_A f$. The overall likelihood function is $$ l((x_i)_{i=1}^n) = \frac{\lambda^n e^{-\lambda}}{n!} \prod_{i=1}^n f(x_i), $$ where the first term comes from the Poisson distribution. In particular, conditioned upon $n$, the locations of the events $(x_i)$ are independent of one another.
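As a concrete (and entirely hypothetical) illustration of this model, the sketch below simulates such a process on the unit square: the intensity `f`, the rate `lam` and the seed are made-up choices, and `f` is left unnormalised since rejection sampling of the locations only needs it up to a constant.

```python
import numpy as np

rng = np.random.default_rng(42)
lam = 20.0  # expected number of events in the validation period (hypothetical)

def f(x, y):
    # hypothetical unnormalised intensity: a bump centred at (0.3, 0.7), bounded by 1
    return np.exp(-((x - 0.3) ** 2 + (y - 0.7) ** 2) / 0.02)

def sample_events():
    n = rng.poisson(lam)              # number of events ~ Poisson(lam)
    points = []
    while len(points) < n:            # locations i.i.d. with density proportional to f
        x, y = rng.random(2)
        if rng.random() <= f(x, y):   # rejection sampling, using f <= 1
            points.append((x, y))
    return np.array(points).reshape(-1, 2)

events = sample_events()
```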
Suppose we have partitioned the region $\Omega$ into disjoint regions $(\Omega_k)_{k=1}^K$. For each $k$ set $$ q_k = \int_{\Omega_k} f(x) \ dx. $$ Then, conditional on $n$, each $x_i$ will occur in region $\Omega_k$ with probability $q_k$, independently. If $n_k$ denotes the number of events we see in $\Omega_k$, then $(n_k)_{k=1}^K$ will be distributed as a Multinomial distribution, so $$ \mathbb P((n_k) \mid \textstyle{\sum_k} n_k=n) = \displaystyle\frac{n!}{\textstyle{\prod_k}n_k!} \prod_k q_k^{n_k}$$
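Continuing the sketch above (so still entirely hypothetical), we can partition the unit square into a regular grid of cells $\Omega_k$, approximate each $q_k$ by a midpoint rule, and count the simulated events falling in each cell.

```python
K_side = 10                                   # a 10 x 10 grid, so K = 100 regions
edges = np.linspace(0.0, 1.0, K_side + 1)
mids = (edges[:-1] + edges[1:]) / 2
X, Y = np.meshgrid(mids, mids, indexing="ij")

# midpoint-rule approximation of q_k = integral of the normalised f over Omega_k
# (the equal cell areas cancel in the normalisation)
f_vals = f(X, Y)
q = (f_vals / f_vals.sum()).ravel()

# counts n_k of simulated events in each cell
n_k, _, _ = np.histogram2d(events[:, 0], events[:, 1], bins=[edges, edges])
n_k = n_k.ravel()
n = int(n_k.sum())
```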
We regard $\lambda$ as a Nuisance parameter, which in a Bayesian framework we can average over [2, Chapter 3]. We now choose prior distributions for $(q_k)$ and $\lambda$, assuming that they are independent: a Gamma prior with parameters $\alpha, \beta$ for $\lambda$, and a Dirichlet prior with parameters $(\alpha_k)$ for $(q_k)$.
After observing $(n_k)$ we get the posterior distribution \begin{align*} p(\lambda, (q_k) \mid (n_k)) &\propto p( (n_k) \mid \lambda, (q_k) ) p(\lambda) p((q_k)) = \frac{\lambda^n e^{-\lambda}}{n!} \frac{n!}{\prod_k n_k!} \prod_k q_k^{n_k} \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha-1} e^{-\beta \lambda} \frac{1}{B((\alpha_k))} \prod_k q_k^{\alpha_k-1}\\ &\propto \frac{1}{\prod_k n_k!} \lambda^{n+\alpha-1} e^{-\lambda(\beta+1)} \prod_k q_k^{(n_k+\alpha_k-1)}. \end{align*} Integrating out $\lambda$ we obtain \begin{align*} p((q_k) \mid (n_k)) &\propto (\beta+1)^{-n-\alpha} \Gamma(n+\alpha) \frac{1}{\prod_k n_k!} \prod_k q_k^{(n_k+\alpha_k-1)} \\ &\propto \prod_k q_k^{(n_k+\alpha_k-1)} \end{align*} Normalising this (so that the integral over the simplex is 1) just gives a Dirichlet distribution with parameters $(\alpha_k+n_k)$.
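The conjugate update itself is just `alpha + n_k`. The toy sketch below (made-up numbers, $K=3$ regions) draws from the posterior with `numpy.random.Generator.dirichlet` and compares the Monte Carlo mean with the exact posterior mean.

```python
import numpy as np

alpha = np.array([2.0, 5.0, 3.0])     # hypothetical Dirichlet prior parameters
n_k = np.array([4, 1, 0])             # hypothetical observed counts per region

posterior_alpha = alpha + n_k         # posterior is Dirichlet(alpha_k + n_k)

rng = np.random.default_rng(0)
samples = rng.dirichlet(posterior_alpha, size=10_000)
print(samples.mean(axis=0))                       # Monte Carlo posterior mean
print(posterior_alpha / posterior_alpha.sum())    # exact posterior mean
```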
That we get no dependence on $\lambda$ should not surprise us here. In a general setup, suppose we have data $x$ and parameters $\theta_1,\theta_2$, and that $$ p(x\mid \theta_1,\theta_2) = p(y\mid\theta_2) p(x\mid \theta_1,y) $$ where $y$ is a function of $x$. (In our case, $x=(n_k)$ and $y=n=\sum_k n_k$). Then \begin{align*} p(\theta_1\mid x) &= \int p(\theta_1,\theta_2\mid x) \ d\theta_2 \propto \int p(x\mid\theta_1,\theta_2) p(\theta_1,\theta_2) \ d\theta_2 = \int p(x\mid\theta_1,\theta_2) p(\theta_1) p(\theta_2) \ d\theta_2 \\ &= \int p(y\mid\theta_2) p(x\mid \theta_1,y) p(\theta_1) p(\theta_2) \ d\theta_2 = p(x\mid \theta_1,y) p(\theta_1) \int p(y\mid\theta_2) p(\theta_2) \ d\theta_2 \\ &\propto p(x\mid \theta_1,y) p(\theta_1) \int p(\theta_2\mid y) \ d\theta_2 = p(x\mid \theta_1,y) p(\theta_1), \end{align*} and so there is no dependence on $\theta_2$ nor on the prior $p(\theta_2)$.
We are given a prediction of the values $q_k$, say $(\hat p_k)_{k=1}^K$. Our idea is to treat this prediction as a strongly informative Bayesian prior, form the posterior as above, and then compare the posterior to the prior. A "good" prediction should give a close match between the prior and posterior.
Our prior will be a Dirichlet distribution with $\alpha_k = t \hat p_k$ for large $t$. As above, the posterior will also be a Dirichlet distribution with $\alpha_k = t \hat p_k + n_k$. This means that the prior mean is $q_k=\hat p_k$ (that is, the prediction) while the posterior mean is $$ q_k = \frac{t}{t+n} \hat p_k + \frac{n}{t+n} \frac{n_k}{n}, $$ that is, a convex combination of $\hat p_k$ and $n_k/n$, which is the MLE for the Multinomial given the data. The parameter $t$ controls how much weight we give to the prediction.
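A quick check, with hypothetical numbers, that the posterior mean really is this convex combination:

```python
import numpy as np

p_hat = np.array([0.5, 0.3, 0.2])   # predicted probabilities (hypothetical)
n_k = np.array([12, 3, 5])          # observed counts (hypothetical)
n = n_k.sum()
t = 50.0                            # weight given to the prediction

alpha = t * p_hat
posterior_mean = (alpha + n_k) / (t + n)                          # mean of Dir(alpha_k + n_k)
convex_combo = (t / (t + n)) * p_hat + (n / (t + n)) * (n_k / n)  # stated combination
assert np.allclose(posterior_mean, convex_combo)
```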
We wish to compare these distributions. A common approach is to use the Kullback–Leibler divergence $D_{KL}(P\|Q)$. To quote,
$D_{KL}(P \| Q)$ is a measure of the information gained when one revises one's beliefs from the prior probability distribution $Q$ to the posterior probability distribution $P$.
In our case, $Q \sim Dir(\alpha_k) = Dir(t\hat p_k)$ and $P\sim Dir(\alpha_k+n_k)$. Thus $$ D_{KL}(P\|Q) = \mathbb E_P( \log(p/q) ) $$ where $$ p((q_k)) = \frac{1}{B((\alpha_k+n_k))} \prod_k q_k^{\alpha_k+n_k-1} $$ and similarly for $q$, so that $$ p/q = \frac{B((\alpha_k))}{B((\alpha_k+n_k))} \prod_k q_k^{n_k}. $$ Using the linearity of the expectation, $$ D_{KL}(P\|Q) = \log\frac{B((\alpha_k))}{B((\alpha_k+n_k))} + \sum_k n_k \mathbb E_P(\log q_k) $$ Using that the marginal distribution of $q_k$ for $P$ is $Beta(\alpha_k+n_k, \sum_j \alpha_j + n - \alpha_k-n_k)$ we find that $$ \mathbb E_P(\log q_k) = \psi(\alpha_k+n_k) - \psi\big(\sum_j \alpha_j + n\big), $$ where $\psi$ is the Digamma function.
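The digamma identity for $\mathbb E_P(\log q_k)$ can be checked by Monte Carlo (again with made-up parameters):

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
alpha = np.array([3.0, 1.5, 1.0])   # hypothetical prior parameters
n_k = np.array([2, 7, 1])           # hypothetical counts

post = alpha + n_k                  # P is Dirichlet(alpha_k + n_k)
samples = rng.dirichlet(post, size=100_000)
print(np.log(samples).mean(axis=0))          # Monte Carlo estimate of E_P[log q_k]
print(digamma(post) - digamma(post.sum()))   # psi(alpha_k + n_k) - psi(sum_j alpha_j + n)
```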
Thus $$ D_{KL}(P\|Q) = \log\frac{B((\alpha_k))}{B((\alpha_k+n_k))} + \sum_k n_k \psi(\alpha_k+n_k) - n \psi\big(\sum_j \alpha_j + n\big) $$
With $\alpha_k = t\hat p_k$ this becomes $$ D_{KL}(P\|Q) = \log\frac{B((t\hat p_k))}{B((t\hat p_k+n_k))} + \sum_k n_k \psi(t\hat p_k+n_k) - n \psi(t + n). $$
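A sketch of this divergence in code, assuming `scipy` is available; the function name `dirichlet_prior_posterior_kl` is ours, and it takes the general parameters $(\alpha_k)$, so $\alpha_k = t\hat p_k$ is just a particular input.

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_prior_posterior_kl(alpha, n_k):
    """D_KL(P || Q) for Q = Dir(alpha) and P = Dir(alpha + n_k), as derived above."""
    alpha = np.asarray(alpha, dtype=float)
    n_k = np.asarray(n_k, dtype=float)
    n = n_k.sum()
    # log B((alpha_k)) - log B((alpha_k + n_k)), written via log-Gamma
    log_beta_ratio = (gammaln(alpha).sum() - gammaln(alpha.sum())
                      - gammaln(alpha + n_k).sum() + gammaln(alpha.sum() + n))
    return (log_beta_ratio
            + np.sum(n_k * digamma(alpha + n_k))
            - n * digamma(alpha.sum() + n))

print(dirichlet_prior_posterior_kl([2.0, 5.0, 3.0], [4, 1, 0]))   # hypothetical numbers
print(dirichlet_prior_posterior_kl([2.0, 5.0, 3.0], [0, 0, 0]))   # no data: divergence is 0
```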
The normalising constant is $$ B((\alpha_k)) = \frac{\prod_k \Gamma(\alpha_k)}{\Gamma(\alpha_0)} $$ where $\alpha_0 = \sum_k \alpha_k$ and $\Gamma$ is the Gamma function. We know that $\Gamma(x+1) = x\Gamma(x)$ and so for any integer $n\geq0$, $$ \frac{\Gamma(\alpha+n)}{\Gamma(\alpha)} = (\alpha+n-1)(\alpha+n-2)\cdots \alpha = \prod_{i=0}^{n-1} (\alpha+i), $$ where as usual the empty product is $1$.
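As a sanity check, this ratio of Gamma functions is the rising Pochhammer symbol, which `scipy.special.poch` computes directly; the numbers below are arbitrary.

```python
import numpy as np
from scipy.special import gamma, poch

alpha, n = 2.7, 5                             # arbitrary illustrative values
ratio = gamma(alpha + n) / gamma(alpha)       # Gamma(alpha + n) / Gamma(alpha)
product = np.prod(alpha + np.arange(n))       # alpha (alpha + 1) ... (alpha + n - 1)
print(ratio, product, poch(alpha, n))         # all three agree
```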
Thus \begin{align*} \frac{B((\alpha_k))}{B((\alpha_k+n_k))} &= \frac{\prod_k \Gamma(\alpha_k)}{\prod_k \Gamma(\alpha_k+n_k)} \frac{\Gamma(\alpha_0+n)}{\Gamma(\alpha_0)} = \frac{\prod_{i=0}^{n-1} (\alpha_0+i)}{\prod_k \prod_{i=0}^{n_k-1} (\alpha_k+i)} \end{align*} and so $$ \log \frac{B((\alpha_k))}{B((\alpha_k+n_k))} = \sum_{i=0}^{n-1} \log(\alpha_0+i) - \sum_k \sum_{i=0}^{n_k-1} \log(\alpha_k+i), $$ with the convention that if $n_k=0$ the corresponding inner sum is $0$.
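A small sketch (with made-up parameters) evaluating this log-ratio directly as sums of logarithms, cross-checked against the log-Gamma form used above:

```python
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.2, 0.8, 3.0])   # hypothetical prior parameters
n_k = np.array([4, 0, 2])           # hypothetical counts (note the empty inner sum for n_k = 0)
alpha0, n = alpha.sum(), n_k.sum()

via_logs = (sum(np.log(alpha0 + i) for i in range(n))
            - sum(np.log(alpha[k] + i) for k in range(len(alpha)) for i in range(n_k[k])))

via_gammaln = (gammaln(alpha).sum() - gammaln(alpha0)
               - gammaln(alpha + n_k).sum() + gammaln(alpha0 + n))
print(via_logs, via_gammaln)        # should agree to floating-point accuracy
```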
We don't currently see any cancellation in the terms involving the Digamma function.
Frankly, this does not work well, so we shall explore other ideas.
Instead, we can compare the prior and posterior predictive distributions. Again, we use a Dirichlet prior, which is conjugate to the Multinomial, leading to the prior and posterior predictive distributions being Dirichlet-multinomial distributions with different hyper-parameters, as above.
It is hard to compare two Dirichlet-multinomial distributions, however, as we must condition on the total number of observations. As we wish to compare the distributions, and not samples, we can instead work with the Categorical distribution, equivalently, the Dirichlet-multinomial with a single observation. This has probability mass function $$ \mathbb P(k) = \frac{\Gamma(\alpha_0)}{\Gamma(1+\alpha_0)} \frac{\Gamma(\alpha_k+1)}{\Gamma(\alpha_k)} = \frac{\alpha_k}{\alpha_0}, $$ where as usual $\alpha_0 = \sum_i \alpha_i$.
Thus, the posterior predictive distribution $P$ and the prior predictive $Q$ satisfy $$ p(k) = \frac{\alpha_k + n_k}{\alpha_0 + n}, \qquad q(k) = \frac{\alpha_k}{\alpha_0}. $$ The Kullback-Leibler divergence is thus $$ \sum_{k=1}^K p(k) \log\big(p(k) / q(k)\big) = \sum_{k=1}^K \frac{\alpha_k + n_k}{\alpha_0 + n} \log\Big(\frac{\alpha_0(\alpha_k+n_k)}{(\alpha_0+n)\alpha_k}\Big). $$ With $\alpha_k = t\hat p_k$ this becomes $$ \sum_{k=1}^K \frac{t\hat p_k + n_k}{t + n} \log\Big(\frac{t\hat p_k+n_k}{(t+n)\hat p_k}\Big). $$
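Putting this together as a scoring function: the sketch below (the name `predictive_kl_score` and the example numbers are ours) returns the divergence between the posterior and prior predictive distributions for a prediction $(\hat p_k)$, counts $(n_k)$ and weight $t$.

```python
import numpy as np

def predictive_kl_score(p_hat, n_k, t):
    """KL divergence between the posterior and prior (categorical) predictive
    distributions, with alpha_k = t * p_hat_k, as in the formula above."""
    p_hat = np.asarray(p_hat, dtype=float)
    n_k = np.asarray(n_k, dtype=float)
    n = n_k.sum()
    post = (t * p_hat + n_k) / (t + n)   # posterior predictive p(k)
    prior = p_hat                        # prior predictive q(k) = t p_hat_k / t = p_hat_k
    mask = post > 0                      # 0 log 0 terms contribute nothing
    return np.sum(post[mask] * np.log(post[mask] / prior[mask]))

n_k = np.array([8, 3, 1])
print(predictive_kl_score([0.6, 0.3, 0.1], n_k, t=100.0))   # prediction close to the data
print(predictive_kl_score([0.1, 0.3, 0.6], n_k, t=100.0))   # poor prediction: larger divergence
```

Smaller values indicate a closer match between the prior (the prediction) and the posterior, which is what we wanted a "good" prediction to achieve.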