Likelihood

References

  1. Rosser et al. "Predictive Crime Mapping: Arbitrary Grids or Street Networks?"

This technique has been used in the crime prediction literature to select optimal bandwidths, typically at a stage before forming a grid or "hot spots"; see, e.g., [1].

It is thus worth spending some time developing the mathematical model upon which this idea is based. We assume that the events occurring in the "validation period" (e.g. the day following the prediction) are well approximated by an inhomogeneous Poisson point process with overall rate $\lambda$ (the expected number of events in the validation period) and a probability density function $f$, defined on the area of interest, which gives the local intensity. This means that in an area $A$, the expected number of events is $\lambda \int_A f$.
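As a concrete illustration, here is a minimal simulation of one validation period under this model. This is a sketch only: the region, grid resolution and rate are arbitrary choices for the example, and the grid form of $f$ anticipates the assumption made below.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: a 10 x 10 study region divided into unit grid cells.
lam = 50.0                            # expected number of events in the validation period
weights = rng.random((10, 10))        # relative risk in each cell
cell_prob = weights / weights.sum()   # integral of f over each (unit-area) cell

# Draw N ~ Poisson(lambda), then place each event in a cell chosen with
# probability proportional to f, uniformly at random within that cell.
n = rng.poisson(lam)
cells = rng.choice(cell_prob.size, size=n, p=cell_prob.ravel())
iy, ix = np.unravel_index(cells, cell_prob.shape)
xcoords = ix + rng.random(n)
ycoords = iy + rng.random(n)
```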

Our predictions do not provide $\lambda$ (the SEPP methods are perhaps an exception, though we discard this information). To compare like with like, we will assume that $f$ is given by a grid: that is, $f$ is constant on each grid cell.
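Concretely, a gridded prediction can be normalised into such a density. The following sketch assumes the prediction arrives as a 2-D array of non-negative weights over square cells of known area (the names are illustrative):

```python
def grid_to_density(weights, cell_area):
    """Turn a grid of non-negative prediction weights into a density f,
    constant on each cell and integrating to 1 over the whole grid."""
    weights = np.asarray(weights, dtype=float)
    return weights / (weights.sum() * cell_area)
```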

The likelihood, as used by [1], is then $$ \prod_{i=1}^N f(x_i) $$ where $(x_i)_{i=1}^N$ denote the coordinates of the events in the validation period. Mathematically, this is the likelihood conditional on $N$, the number of events. Typically we work with the log likelihood instead, $$ L_0 = \sum_{i=1}^N \log f(x_i). $$ As noted by [1], if $f(x_i)=0$ then we have observed an impossible event! Thus we impose a very small lower bound on $f$.
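A minimal implementation of this score, assuming the gridded density from `grid_to_density` above and events given as coordinate arrays (the floor value is an arbitrary choice):

```python
def log_likelihood(density, cell_size, xcoords, ycoords, floor=1e-9):
    """Compute L_0 = sum_i log f(x_i) against a gridded density.

    Coordinates are relative to the grid origin; `floor` is the small
    lower bound on f, so an event in a cell predicted as impossible
    contributes a large negative, but finite, term."""
    ix = (np.asarray(xcoords) // cell_size).astype(int)
    iy = (np.asarray(ycoords) // cell_size).astype(int)
    f_values = np.maximum(density[iy, ix], floor)
    return np.sum(np.log(f_values))
```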

One potential problem here is that we have a dependence on $N$. For the application in [1], this is unimportant, but when comparing predictions, we should normalise by computing $$ L_1 = \frac{1}{N} \sum_{i=1}^N \log f(x_i). $$ If we wish to compare scores between different studies, then we also need to normalise for the area, leading to $$ L = \log A + \frac{1}{N} \sum_{i=1}^N \log f(x_i), $$ where $A$ is the area of the study zone.
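Under the same assumptions, the normalised scores follow directly:

```python
def normalised_log_likelihood(density, cell_size, xcoords, ycoords, area, floor=1e-9):
    """Compute L = log(A) + (1/N) sum_i log f(x_i)."""
    n = len(xcoords)
    return np.log(area) + log_likelihood(density, cell_size, xcoords, ycoords, floor) / n
```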

For example, if the prediction is complete spatial randomness, then $f \equiv 1/A$, so each term $\log f(x_i) = -\log A$ and hence $L=0$ for any observation.
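Continuing the simulated example above, we can verify this numerically:

```python
# Complete spatial randomness on the 10 x 10 region: A = 100 and
# f = 1/100 in every cell, so L = 0 wherever the events fall.
csr = grid_to_density(np.ones((10, 10)), cell_area=1.0)
print(normalised_log_likelihood(csr, 1.0, xcoords, ycoords, area=100.0))
# -> 0.0 (up to floating-point rounding)
```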

Advantages

  • Easy to compute
  • Clear probability theory interpretation

Disadvantages

  • Not clear how to combine the log likelihood scores for multiple predictions. Is it appropriate to report the mean and standard error?
  • Linked to this: intuitively, if one prediction is (by chance!) validated against only a small number of actual events, we would expect to learn less about its quality (perhaps we were just unlucky) than from a prediction validated against a large number of events.
  • Mathematically, there is nothing wrong with taking point estimates (evaluating $f$ only at the exact locations $(x_i)$). However, for "grid based" predictions such as the classical ProHotSpot, we might again worry about the MAUP.
