Scoring functions

Despite our reservations about treating our predictions as "yes/no" predictions of crime, we can still consider using a scoring rule.

References

[1] Roberts, "Assessing the spatial and temporal variation in the skill of precipitation forecasts from an NWP model", DOI:10.1002/met.57
[2] Weijs, "Kullback–Leibler Divergence as a Forecast Skill Score with Classic Reliability–Resolution–Uncertainty Decomposition", DOI:10.1175/2010MWR3229.1

Discussion

The classical Brier score, for example, is appropriate when we have a sequence of events $i$, each of which may or may not occur. Let $p_i$ be our predicted probability that event $i$ will occur, and let $o_i$ be $1$ if the event occurred, and $0$ otherwise. The Brier score is $$ \frac{1}{N} \sum_{i=1}^N (p_i - o_i)^2. $$
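As a minimal sketch of the formula above (the function name `brier_score` and the use of NumPy are illustrative choices, not part of the original text), the Brier score can be computed as:

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between predicted probabilities p_i
    and 0/1 outcomes o_i."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    if p.shape != o.shape:
        raise ValueError("p and o must have the same shape")
    return float(np.mean((p - o) ** 2))

# Three events: two occurred, one did not.
score = brier_score([0.9, 0.1, 0.8], [1, 0, 1])
```

A score of $0$ corresponds to a perfect forecast, and larger values are worse.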

The paper [1] considers aggregating this over different (spatial) scales. For the moment, we shall use [1] by analogy only, in order to deal with the problem that we might have repeated events (for us, $o_i$ would be the number of events which occur in a cell, so may be $>1$). We shall follow [1], loosely, and let $u_i$ be the fraction of the total number of events which occurred in spatial region (typically, grid cell) $i$. The score is then $$ S = \frac{1}{N} \sum_{i=1}^N (p_i - u_i)^2 $$ where we sum over all spatial units $i=1,\ldots,N$.
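A small sketch of this score, assuming the predictions and the per-cell event counts are given as arrays of the same shape (the name `fraction_score` is hypothetical):

```python
import numpy as np

def fraction_score(p, counts):
    """Score S = (1/N) * sum_i (p_i - u_i)^2, where u_i is the fraction
    of all events which fell in cell i."""
    p = np.asarray(p, dtype=float).ravel()
    counts = np.asarray(counts, dtype=float).ravel()
    if p.shape != counts.shape:
        raise ValueError("p and counts must have the same number of cells")
    total = counts.sum()
    if total == 0:
        raise ValueError("no events: the fractions u_i are undefined")
    u = counts / total  # fractions sum to 1 by construction
    return float(np.mean((p - u) ** 2))

# Four cells, four events; the prediction matches the observed fractions.
s = fraction_score([0.5, 0.25, 0.25, 0.0], [2, 1, 1, 0])
```

Here cells with several events simply contribute a larger fraction $u_i$, which is how the repeated-events problem is sidestepped.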

Normalisation

Notice that this is related to the KDE method. We can think of the values $(u_i)$ as a histogram estimate of the real probability density, and then $S$ is just the mean squared error, estimating the continuous version $$ \int_{\Omega} (p(x) - f(x))^2 \ dx $$ where $\Omega$ is the study area, $p$ is the predicted density and $f$ is the real density. If we divide by the area of $\Omega$, then we obtain a measure of difference which is invariant under rescaling of $\Omega$.

The values $(p_i)$, as probabilities, sum to $1$, and the $(u_i)$ by definition sum to $1$. We hence see that an appropriately normalised version of $S$ is $$ S = \frac{1}{NA} \sum_{i=1}^N (p_i - u_i)^2 $$ where $A$ is the area of each grid cell and so $NA$ is the total area.
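The normalised form can be sketched as follows. This simply implements the formula as written, with `cell_area` standing for $A$; the function name and the check that both inputs sum to $1$ are assumptions added for illustration.

```python
import numpy as np

def normalised_score(p, u, cell_area):
    """S = (1 / (N * A)) * sum_i (p_i - u_i)^2 for N cells of area A."""
    p = np.asarray(p, dtype=float).ravel()
    u = np.asarray(u, dtype=float).ravel()
    if p.shape != u.shape:
        raise ValueError("p and u must have the same number of cells")
    # Both inputs are assumed to be normalised already.
    if not (np.isclose(p.sum(), 1.0) and np.isclose(u.sum(), 1.0)):
        raise ValueError("p and u should each sum to 1")
    total_area = p.size * cell_area
    return float(np.sum((p - u) ** 2) / total_area)

# Two cells of area 2: the prediction and the events are in different cells.
s_norm = normalised_score([1.0, 0.0], [0.0, 1.0], cell_area=2.0)
```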

Skill scores

A related skill score is $$ SS = 1 - \frac{S}{S_\text{worst}} = 1 - \frac{\sum_{i=1}^N (p_i - u_i)^2}{\sum_{i=1}^N (p_i^2 + u_i^2)} = \frac{2\sum_{i=1}^N p_iu_i}{\sum_{i=1}^N (p_i^2 + u_i^2)}. $$ Here $$ S_\text{worst} = \frac{1}{NA} \sum_{i=1}^N (p_i^2 + u_i^2) $$ is the worst possible value for $S$, attained if there is no spatial association between the $(p_i)$ and $(u_i)$. Indeed, $(p_i - u_i)^2 = p_i^2 + u_i^2 - 2p_iu_i$ and $p_iu_i \geq 0$, so $S \leq S_\text{worst}$ with equality exactly when $p_iu_i = 0$ for every $i$.
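A short sketch of the skill score, using the final form of the expression above (the name `skill_score` is illustrative). Note that the factor $\frac{1}{NA}$ cancels in the ratio, so the cell area is not needed.

```python
import numpy as np

def skill_score(p, u):
    """SS = 2 * sum(p_i * u_i) / sum(p_i^2 + u_i^2).

    Equals 1 when p and u agree exactly, and 0 when no cell has both
    p_i > 0 and u_i > 0."""
    p = np.asarray(p, dtype=float).ravel()
    u = np.asarray(u, dtype=float).ravel()
    if p.shape != u.shape:
        raise ValueError("p and u must have the same number of cells")
    denominator = np.sum(p ** 2 + u ** 2)
    if denominator == 0:
        raise ValueError("p and u are both identically zero")
    return float(2 * np.sum(p * u) / denominator)

perfect = skill_score([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
disjoint = skill_score([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5])
```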

Multi-scale issues

The paper [1] also considers a multi-scale measure by aggregating the values $(p_i)$ and $(u_i)$ over larger and larger areas.

Firstly we use $(p_i)$ and $(u_i)$ as is, on a grid of size $n\times m$ say. So $N=nm$.
Now take the "moving average" or "sliding window" by averaging over each $2\times 2$ block. This gives a grid of size $(n-1) \times (m-1)$.
And so on: averaging over each $k\times k$ block gives a grid of size $(n-k+1) \times (m-k+1)$.
For a square grid, we end with just the average of $p_i$ over the whole grid compared to the average of $u_i$ over the whole grid. These will always agree, as both sets of values sum to $1$.
If the grid is not square, then we will stop before this, once the block is as large as the shorter side of the grid. Similarly, non-rectangular regions will need to be dealt with in an ad-hoc fashion.
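The sliding-window step described above can be sketched as follows, assuming a full rectangular grid stored as a 2D NumPy array (the helper name `window_average` is hypothetical):

```python
import numpy as np

def window_average(grid, k):
    """Average over every k x k block of an n x m grid, moving the block
    one cell at a time. Returns an (n-k+1) x (m-k+1) array."""
    grid = np.asarray(grid, dtype=float)
    n, m = grid.shape
    if not 1 <= k <= min(n, m):
        raise ValueError("k must be between 1 and the shorter side of the grid")
    out = np.zeros((n - k + 1, m - k + 1))
    # Add up the k*k shifted copies of the grid, then divide by the block size.
    for di in range(k):
        for dj in range(k):
            out += grid[di:di + n - k + 1, dj:dj + m - k + 1]
    return out / (k * k)

# A 3 x 3 prediction summing to 1, and all events in a single cell.
p = np.array([[0.2, 0.1, 0.0],
              [0.1, 0.2, 0.1],
              [0.0, 0.1, 0.2]])
u = np.zeros((3, 3))
u[0, 0] = 1.0

coarsest_p = window_average(p, 3)  # single value: the mean of p
coarsest_u = window_average(u, 3)  # single value: the mean of u
```

At the coarsest scale both arrays reduce to the single value $1/9$, illustrating why the final comparison carries no information.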

Finally, we should not forget to normalise correctly: at each stage, the "averaged" values should still sum to $1$ (being probabilities) and we should continue to divide by the total area. Let us think a bit more clearly about this. Suppose we group the original cells into (in general, overlapping) regions $(\Omega_i)$, with associated values $(x_i)$ and $(x_i')$ say, each being the sum of the original values in the region. We then want to normalise these values in some way, and compute the appropriate Brier score. If each region $\Omega_i$ has the same area (e.g. we start with a rectangular grid) then there is no issue. For more general grids (which have been clipped to geometry, say) we proceed with a vague analogy by pretending that the regions $\Omega_i$ are actually disjoint, cover the whole study area, and that $x_i = \int_{\Omega_i} f$ for some non-normalised function $f$.
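For the simple case of a full rectangular grid, where every region has the same area, one possible reading of this procedure is sketched below. The choices made here are assumptions rather than something fixed by the text: at each scale the window sums are renormalised to sum to $1$, and the sum of squared differences is divided by the total area $NA$ of the original grid. The names `window_sum` and `multiscale_scores` are hypothetical.

```python
import numpy as np

def window_sum(grid, k):
    """Sum over every k x k block of the grid (stride of one cell)."""
    n, m = grid.shape
    out = np.zeros((n - k + 1, m - k + 1))
    for di in range(k):
        for dj in range(k):
            out += grid[di:di + n - k + 1, dj:dj + m - k + 1]
    return out

def multiscale_scores(p, u, cell_area=1.0):
    """Return one score per window size k = 1, ..., min(n, m).

    Assumes p and u are 2D arrays of the same shape, each with a
    positive total."""
    p = np.asarray(p, dtype=float)
    u = np.asarray(u, dtype=float)
    if p.shape != u.shape:
        raise ValueError("p and u must have the same shape")
    total_area = p.size * cell_area  # N * A, kept fixed at every scale
    scores = []
    for k in range(1, min(p.shape) + 1):
        pk = window_sum(p, k)
        uk = window_sum(u, k)
        pk = pk / pk.sum()  # renormalise so the values sum to 1
        uk = uk / uk.sum()
        scores.append(float(np.sum((pk - uk) ** 2) / total_area))
    return scores

# 2 x 2 grid: the prediction is in the top row, the events in the bottom row.
scores = multiscale_scores([[0.5, 0.5], [0.0, 0.0]],
                           [[0.0, 0.0], [0.5, 0.5]])
```

In this example the finest scale gives a score of $0.25$ and the coarsest (whole-grid) scale gives $0$. Grids clipped to geometry would need the per-region areas as well, which this sketch does not attempt.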