In [1]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(".."))
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import open_cp
The "Knox statistic" is a way of detecting if events are clustered in space and time. It has been used in the crime predicition literature to find natural "bandwidths" for use with, e.g. the prospective space/time scan.
References:
Papers 1--3 give details of the application to crime prediction / analysis. Paper 4 is the original work of Knox, and paper 5 gives details of the monte carlo method which we use.
We start with a set of events which occur in space and time. For each pair of events, we look at the spatial distance between them, and the time distance. We count the number of events which are "close" (for various cut-off values) in both space and time. The count of such events is significant if it is extreme compared to the expected count. In the original work, a Poisson approximation was used, but in our application, the general background rate of e.g. crime varies with space and possibly time, and so model fitting becomes a problem.
Instead we adopt a monte carlo approach. Repeatedly, we shuffle the timestamps of the events, while keeping their locations fixed (as for the Space-Time Scan Statistic) and recompute the counts. This gives a distribution of counts, and we compute the rank, and hence the p-value, of the real count by comparison.
If the count is significant, this gives evidence of "communicability of crime" at that level of space and time. If we repeat this test at different cut-off values, we can tabulate the results, compare [2, page 221]. Be (extremely) aware of the Multiple comparisons problem.
In [2]:
import open_cp.knox as knox
In [3]:
import datetime
def make_some_data():
x = np.random.random(100) * 1000
y = np.random.random(100) * 1000
times = [datetime.datetime(2017,1,1) + datetime.timedelta(days=20) * t
for t in np.random.random(100)]
times.sort()
return times, x, y
In [4]:
calculator = knox.Knox()
times, x, y = make_some_data()
calculator.data = open_cp.TimedPoints.from_coords(times, x, y)
calculator.space_bins = [(0,20), (20,40), (40,80), (80,200)]
calculator.set_time_bins([(0,1), (1,2), (2,5), (5,10)], unit="days")
cells = calculator.calculate()
In [5]:
fig, ax = plt.subplots(figsize=(8,8))
ax.scatter(x, y, marker="+")
None
In [6]:
# p-values
cells[...,1] * 100
Out[6]:
In [7]:
cells[1,3,1], calculator.space_bins[1], calculator.time_bins[3]
Out[7]:
In [11]:
calculator = knox.Knox()
calculator.space_bins = [(0,20), (20,40), (40,80), (80,200)]
calculator.set_time_bins([(0,1), (1,2), (2,5), (5,10)], unit="days")
all_p_values = []
for _ in range(1000):
calculator.data = open_cp.TimedPoints.from_coords(*make_some_data())
cells = calculator.calculate()
all_p_values.extend( cells[...,1].flatten() )
In [17]:
fig, ax = plt.subplots(figsize=(10,6))
ax.hist(all_p_values)
ax.set_title("Histogram of p-values from pure random data")
None
In [ ]: