Despite our reservations about treating our predictions as "yes/no" predictions of crime, we can consider using a scoring rule.
The classical example, the Brier score, is appropriate when we have a sequence of events $i$, each of which may or may not occur. Let $p_i$ be our predicted probability that event $i$ will occur, and let $o_i$ be $1$ if the event occurred, and $0$ otherwise. The Brier score is $$ \frac{1}{N} \sum_{i=1}^N (p_i - o_i)^2. $$
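As a minimal sketch of the formula above (the function name `brier_score` and the sample numbers are purely illustrative, and `numpy` is assumed to be available):

```python
import numpy as np

def brier_score(p, o):
    """Mean squared difference between predicted probabilities p_i
    and binary outcomes o_i (1 if the event occurred, 0 otherwise)."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    if p.shape != o.shape:
        raise ValueError("p and o must have the same shape")
    return np.mean((p - o) ** 2)

# Three hypothetical events: predicted probabilities and what happened.
print(brier_score([0.9, 0.2, 0.5], [1, 0, 1]))
```

A perfect forecaster (every $p_i$ equal to $o_i$) scores $0$, and larger values are worse.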
The paper [1] considers aggregating this over different (spatial) scales. For the moment, we shall use [1] by analogy only, in order to deal with the problem that we might have repeated events (for us, $o_i$ is the number of events occurring in a cell, so may be $>1$). We shall follow [1], loosely, and let $u_i$ be the fraction of the total number of events which occurred in spatial region (typically, grid cell) $i$. The score is then $$ S = \frac{1}{N} \sum_{i=1}^N (p_i - u_i)^2 $$ where we sum over all spatial units $i=1,\ldots,N$.
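A short sketch of this count-based variant follows. It assumes the prediction is already normalised to sum to $1$ and that event counts per cell are given as an array of the same shape; the name `fraction_score` and the example grid are hypothetical.

```python
import numpy as np

def fraction_score(prediction, counts):
    """Score S = (1/N) * sum_i (p_i - u_i)^2, where u_i is the fraction
    of all events that fell in cell i."""
    p = np.asarray(prediction, dtype=float).ravel()
    c = np.asarray(counts, dtype=float).ravel()
    if p.shape != c.shape:
        raise ValueError("prediction and counts must have the same size")
    total = c.sum()
    if total <= 0:
        raise ValueError("need at least one event to form fractions")
    u = c / total  # fractions of events, summing to 1
    return np.mean((p - u) ** 2)

# Four cells: the prediction matches the observed fractions exactly...
print(fraction_score([0.5, 0.25, 0.25, 0.0], [2, 1, 1, 0]))
# ...and here all four events land in the cell predicted to be empty.
print(fraction_score([0.5, 0.25, 0.25, 0.0], [0, 0, 0, 4]))
```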
Notice that this is related to the KDE method. We can think of the values $(u_i)$ as a histogram estimation of the real probability density, and then $S$ is just the mean squared error, estimating the continuous version $$ \int_{\Omega} (p(x) - f(x))^2 \ dx $$ where $\Omega$ is the study area. If we divide by the area of $\Omega$, then we obtain a measure of difference which is invariant under rescaling of $\Omega$.
The values $(p_i)$, as probabilities, sum to $1$, and the $(u_i)$ by definition sum to $1$. We hence see that an appropriate normalisation factor for $S$ is $$ S = \frac{1}{NA} \sum_{i=1}^N (p_i - u_i)^2 $$ where $A$ is the area of each grid cell and so $NA$ is the total area.
A related skill score is $$ SS = 1 - \frac{S}{S_\text{worst}} = 1 - \frac{\sum_{i=1}^N (p_i - u_i)^2}{\sum_{i=1}^N (p_i^2 + u_i^2)} = \frac{2\sum_{i=1}^N p_iu_i}{\sum_{i=1}^N (p_i^2 + u_i^2)}. $$ Here $$ S_\text{worst} = \frac{1}{NA} \sum_{i=1}^N (p_i^2 + u_i^2) $$ is the worst possible value for $S$ if there is no spatial association between the $(p_i)$ and $(u_i)$.
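The skill score can be sketched directly from the final ratio above; the common factor $1/(NA)$ cancels, so no cell area is needed. The function name `skill_score` and the sample vectors are illustrative assumptions only.

```python
import numpy as np

def skill_score(p, u):
    """SS = 2 * sum(p_i * u_i) / sum(p_i^2 + u_i^2).
    Equals 1 when p and u agree, and 0 when they never overlap."""
    p = np.asarray(p, dtype=float).ravel()
    u = np.asarray(u, dtype=float).ravel()
    if p.shape != u.shape:
        raise ValueError("p and u must have the same size")
    denominator = np.sum(p ** 2 + u ** 2)
    if denominator == 0:
        raise ValueError("p and u cannot both be identically zero")
    return 2.0 * np.sum(p * u) / denominator

same = [0.5, 0.25, 0.25, 0.0]
print(skill_score(same, same))                                  # perfect agreement
print(skill_score([0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]))  # disjoint support
```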
Finally, [1] considers a multi-scale measure by aggregating the values $(p_i)$ and $(u_i)$ over larger and larger areas.
We should not forget to normalise correctly: at each stage, the "averaged" values should still sum to $1$ (being probabilities) and we should continue to divide by the total area. Let us think a bit more carefully about this. Suppose we group the original cells into (in general, overlapping) regions $(\Omega_i)$, with associated values $(x_i)$ and $(x_i')$, say, each being the sum of the original values over the region. We then want to normalise these values in some way, and compute the appropriate Brier score. If each region $\Omega_i$ has the same area (e.g. we start with a rectangular grid) then there is no issue. For more general grids (which have been clipped to geometry, say) we proceed by loose analogy, pretending that the regions $\Omega_i$ are actually disjoint, cover the whole study area, and that $x_i = \int_{\Omega_i} f$ for some non-normalised function $f$.
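For the equal-area case only, the aggregation step might be sketched as follows. This is an illustration under stated assumptions, not the procedure of [1]: regions are taken to be all `size` by `size` square windows of a rectangular grid (so they overlap), the helper names `window_sums` and `aggregated_score` are hypothetical, and the division by total area is left out. It relies on `numpy.lib.stride_tricks.sliding_window_view`, which needs NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_sums(grid, size):
    """Sum of `grid` over every (size x size) window of a 2D array.
    Windows overlap whenever size > 1."""
    grid = np.asarray(grid, dtype=float)
    return sliding_window_view(grid, (size, size)).sum(axis=(-2, -1))

def aggregated_score(prediction, counts, size):
    """Aggregate both grids over windows, renormalise each set of
    aggregated values to sum to 1, then take the mean squared difference."""
    x = window_sums(prediction, size)
    y = window_sums(counts, size)
    if x.sum() <= 0 or y.sum() <= 0:
        raise ValueError("aggregated values must have positive total")
    x = x / x.sum()
    y = y / y.sum()
    return np.mean((x - y) ** 2)

# Hypothetical 3x3 example: uniform prediction, all events in one corner.
prediction = np.full((3, 3), 1 / 9)
counts = np.array([[3, 0, 0], [0, 0, 0], [0, 0, 0]])
for size in (1, 2, 3):
    print(size, aggregated_score(prediction, counts, size))
```

With `size` equal to the full grid width there is a single region, so both normalised values are $1$ and the score is $0$.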
Following [2] now (and again with an ad hoc change to allow non-binary variables) we could use Kullback-Leibler divergence (discussed in more detail, and more rigorously, in another notebook) to form the score: $$ S_{KL} = \frac{1}{N} \sum_{i=1}^N \Big( u_i \log\big( u_i / p_i \big)