In [1]:
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(".."))
These are a class of models which take an explicit statistical model, fit that model to the data, and then use the fitted model to produce a risk intensity, and thus a prediction.
The statistical model is a Point Process. The events occurring are split into two classes: "background" events, which occur independently, and "triggered" (or "aftershock") events, which occur in response to earlier events.
This model fits the theoretical pattern of e.g. burglaries whereby a single perhaps "random" event can then lead to repeat events at the same or similar location.
Further reading:
We work with events occurring in two-dimensional space, at coordinates $(x,y)$, and at time $t$. Sometimes we think of time as being special, and so a different dimension, and sometimes we just think of a three-dimensional point process in coordinates $(x,y,t)$.
We let $\lambda(x,y,t)$ be the "conditional intensity" of the point processes. This is defined to be the expected number of events seen in a small region about $(x,y,t)$, divided by the volume of that region, in the limit as the size of the region tends to 0. Intuitively, a larger intensity means that we are more likely to see events near $(x,y,t)$.
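To make this concrete (one standard formalisation; the expectation is conditional on the history $\mathcal{H}_t$ of events before time $t$), writing $N(A)$ for the number of events in a set $A$:
$$ \lambda(x,y,t) = \lim_{h \to 0^+} \frac{\mathbb{E}\big[ N\big( (x-h,x+h] \times (y-h,y+h] \times (t,t+h] \big) \,\big|\, \mathcal{H}_t \big]}{(2h)^2 \, h}. $$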
(An intuitive view of probability is the Frequentist viewpoint. If you roll a dice repeatedly, then the statement that the probability of getting a 3 is 1/6 means that, over a large number of repeated trials, you would expect to see the dice landing on 3 about 1/6 of the time. I find it quite hard to give an interpretation, within this framework, of a point process where the random occurrences of past events change the intensity function, because it seems difficult to say exactly what "repeating the process" would mean. Of course, in the mathematical axiomatisation of modern probability theory, there are existence theorems for all the models we will study.)
Our model gives $\lambda$ in the following form. Let $(x_i,y_i,t_i)$ be the events. Then
\begin{equation} \lambda(x,y,t) = \nu(t)\mu(x,y) + \sum_{i : t_i<t} g(t-t_i, x-x_i, y-y_i). \tag{eq:1} \end{equation}
Let us explain the two parts:
The term $\nu(t)\mu(x,y)$ is the background rate, factorised into a time component $\nu$ and a space component $\mu$.
The sum is taken over all events which occurred before the current time; each past event contributes extra intensity through the "trigger" (or "aftershock") kernel $g$.
The original Hawkes process, and self-exciting point processes as used in earthquake research, are often Parametric in nature, the underlying forms of the functions $\nu, \mu$ and $g$ coming from theory.
An alternative is to take a Non-Parametric viewpoint, and estimate the functions using kernel density estimation (KDE). This is the approach taken by (1), where a variable bandwidth KDE is used. However, the following caveat from (1) should be noted (page 104):
... we find that for predictive purposes variable bandwidth KDE is less accurate than fixed bandwidth KDE. We therefore estimate $\mu (x,y)$ in Equation (10) using fixed bandwidth Gaussian KDE.
The reasons given for this change in methodology are, I believe, an admission that the local geography in which crime takes place is important.
There are two main simulation methods, both of which treat time as a special variable, and seek to generate events ordered in time. We should say that, in the abstract, the point processes we are discussing are not bounded in space or time, and so there is no notion of a "start time". Thus a "perfect" simulation is difficult, as we do not (and cannot) know how far back in time we need to start in order to get an accurate simulation around time zero. A similar problem obviously occurs in fitting data: if we have data from a relative time of $0$, then events close to $0$ might well be triggered by events before time $0$ (i.e. events we do not know about). There is a fair amount of theory to suggest that events should not be triggered from too far in the past (a window of around 2 months is discussed regularly in the literature) and so we shall tacitly ignore such problems.
The more general method is a form of Rejection Sampling, and is known in the literature as Ogata's thinning procedure, see (2). Suppose we have simulated events at times $t_1 < t_2 < \cdots < t_n$. We seek a fixed number $\lambda_{\max}$ such that $$ \lambda_{\max} \geq \max_{x,y,t>t_n} \lambda(t,x,y). $$ Let $t_{\text{current}} = t_n$. We then sample the next point $(t,x,y)$ with $t > t_{\text{current}}$ from an ordinary, homogeneous Poisson process with intensity $\lambda_{\max}$, and accept it with probability $\lambda(t,x,y) / \lambda_{\max}$; if it is rejected, we set $t_{\text{current}} = t$ and sample again.
This method, as with all rejection techniques, can be rather slow.
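As a concrete illustration, here is a minimal sketch of the thinning loop in Python. The callable intensity(t, x, y), returning $\lambda(t,x,y)$, and the region object with xmin/xmax/ymin/ymax bounds are assumptions of the sketch, not the open_cp API.
import numpy as np

def thinning_next_event(intensity, lambda_max, t_current, region, rng=None):
    # Sample the next event after t_current by Ogata's thinning, assuming
    # lambda_max dominates the intensity for all t > t_current.
    rng = np.random.default_rng() if rng is None else rng
    area = (region.xmax - region.xmin) * (region.ymax - region.ymin)
    t = t_current
    while True:
        # Candidate from a homogeneous Poisson process of intensity lambda_max
        # (per unit space-time volume): exponential waiting time, uniform location.
        t += rng.exponential(1.0 / (lambda_max * area))
        x = rng.uniform(region.xmin, region.xmax)
        y = rng.uniform(region.ymin, region.ymax)
        # Accept with probability lambda(t,x,y) / lambda_max; otherwise reject
        # and continue from the (rejected) candidate time.
        if rng.uniform() * lambda_max <= intensity(t, x, y):
            return t, x, y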
An alternative simulation technique is to notice that the total intensity is linear, so each term contributes essentially a new, independent process. This suggests a simulation strategy reminiscent of a Branching Process: generate the background events first, and then let each event (recursively) generate its own "offspring" events according to the trigger kernel $g$, as in the sketch below.
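A minimal sketch of this strategy, under stated assumptions: background_sampler() returns the background events in $[0, t_{\text{end}}]$ as $(t,x,y)$ tuples, and trigger_sampler(parent) returns the offspring of a single event. Both are hypothetical callables, not the open_cp API (which packages this pattern as SelfExcitingPointProcess, used below).
def branching_simulation(background_sampler, trigger_sampler, t_end):
    # Background events form one Poisson process; each event then (recursively)
    # produces offspring, independently of every other event.
    events = list(background_sampler())
    to_process = list(events)
    while to_process:
        parent = to_process.pop()
        # Offspring of `parent` follow an inhomogeneous Poisson process with
        # intensity g(t - t_parent, x - x_parent, y - y_parent); discard any
        # falling outside the simulation window.
        children = [e for e in trigger_sampler(parent) if e[0] <= t_end]
        events.extend(children)
        to_process.extend(children)
    return sorted(events, key=lambda e: e[0])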
We now describe the optimisation algorithm from (1) (see also the description in (4)).
If we have an estimate for $\nu,\mu$ and $g$ then from (eq:1) we have that $$ p_{ii} = \frac{\nu(t_i)\mu(x_i,y_i)}{\lambda(t_i,x_i,y_i)} $$ is the probability that event $i$ is a background event, and $$p_{ji} = \frac{g(t_i-t_j,x_i-x_j,y_i-y_j)}{\lambda(t_i,x_i,y_i)} $$ is the probability that event $j$ triggered event $i$, for $j<i$.
(We set $p_{ji}=0$ for $j>i$. Notice then that $(p_{ij})$ is an upper-triangular matrix, and is a (left) Stochastic matrix: each column sums to 1. But we shall not use this.)
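A sketch of this computation, assuming background(t, x, y) returns $\nu(t)\mu(x,y)$ and trigger(dt, dx, dy) returns $g$; these callables are hypothetical stand-ins, not the open_cp API.
import numpy as np

def p_matrix(times, xs, ys, background, trigger):
    # Build the (upper-triangular, column-stochastic) matrix p from the current
    # kernel estimates; column i sums to lambda(t_i, x_i, y_i) before normalising.
    n = len(times)
    p = np.zeros((n, n))
    for i in range(n):
        p[i, i] = background(times[i], xs[i], ys[i])
        for j in range(i):
            p[j, i] = trigger(times[i] - times[j],
                              xs[i] - xs[j], ys[i] - ys[j])
        p[:, i] /= np.sum(p[:, i])
    return p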
We then apply the following optimisation procedure: starting from an initial guess for $p$, we alternately (a) use $p$ to split the events into background and triggered events, re-estimating $\nu$, $\mu$ and $g$ from each class by KDE, and (b) use the new kernel estimates to recompute $p$. This is repeated until the estimates converge; a sketch follows.
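A sketch of the loop, continuing the previous fragment; estimate_kernels(p) is a hypothetical stand-in for the KDE steps of (1), returning the background and trigger callables.
def fit(times, xs, ys, estimate_kernels, iterations=20):
    # Naive initial guess: event i is equally likely to be a background event
    # or to have been triggered by any earlier event.
    n = len(times)
    p = np.triu(np.ones((n, n)))
    p /= p.sum(axis=0)
    for _ in range(iterations):
        background, trigger = estimate_kernels(p)          # KDE step
        p = p_matrix(times, xs, ys, background, trigger)   # update p
    return p, background, trigger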
Note the scaling factor of $1/N$ in our KDE for $g$, despite the fact we only sum over $N_o$ entries.
When computing the matrix $p$ we only use the ratio between $g$ and $\lambda$ (and that between $\nu\mu$ and $\lambda$) and so if we rescale everything, $p$ doesn't change.
However, the original definition of $\lambda$ means that integrating $\lambda$ over a space-time volume gives the expected number of events in that volume. The sum of the $g$ kernels corresponds to "triggered" events, and so it is appropriate that if we estimate $N_o$ triggered events, then each $g$ summand should add $N_o/N$ extra intensity.
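To spell out the bookkeeping: the KDE built from the $N_o$ triggered points but scaled by $1/N$ satisfies $\int g = N_o/N$, and since each of the $N$ events contributes one copy of $g$ to (eq:1), the total triggered intensity integrates to approximately
$$ \sum_{i=1}^{N} \int g \, dt \, dx \, dy = N \cdot \frac{N_o}{N} = N_o, $$
which matches the estimated number of triggered events.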
The calculation for $\nu(t)\mu(x,y)$ is slightly different. As we have factorised this, we can normalise to assume that $\mu$ is a probability density function giving the probability of crime at a location $(x,y)$, averaged over all time. Then $\nu(t)$ encodes any time variation, and encodes the total intensity. If we estimate $N_b$ background events in a time window of length $T$, then we should normalise so that $$ \int_0^T \nu(t) \ dt = N_b. $$
I think then an appropriate normalisation for $\nu$ would be to initially perform a variable bandwidth KDE on the $N_b$ data points, normalising to get a probability density function. Then approximately $\int_0^T \nu(t) \ dt = 1$, so we then need to multiply by $N_b$.
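A sketch of this normalisation, using scipy's fixed-bandwidth Gaussian KDE as a stand-in for the variable bandwidth estimator described above:
import numpy as np
import scipy.stats

def background_time_rate(background_times):
    # gaussian_kde returns a probability density (integrating to 1 over all
    # time), so multiplying by N_b gives, approximately,
    # \int_0^T nu(t) dt = N_b, as required.
    N_b = len(background_times)
    kde = scipy.stats.gaussian_kde(background_times)
    return lambda t: N_b * kde(np.asarray(t))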
Once we have fitted the model using data in a time window $[0,T]$ we can compute the intensity $\lambda(t,x,y)$ from (eq:1). If we compute this at $t=T$ we have an exact calculation of the intensity at the instant of time at the end of the "training" dataset.
Suppose we wish to make a prediction at time $t>T$. We can simply evaluate $\lambda(t,x,y)$, but this would ignore the "self-exciting" nature of the process: there could have been events in the time window $[T,t]$ and these (random) events would increase the intensity $\lambda$. Two possible options: evaluate $\lambda(t,x,y)$ using only the events known at time $T$, accepting a likely under-estimate; or simulate the process forward on $[T,t]$ (perhaps repeatedly, averaging the resulting intensities) to account for unobserved events.
(We note that other prediction methods, e.g. prospective hot-spotting, suffer from the same issue, but as in that case we don't have an explicit statistical model, it is more obscured. To be precise, prospective hot-spotting "works" in the sense that events close in space and time predict more events. If we then try to predict (fairly far) into the future, then the logic underpinning the prediction suggests that more events would occur, especially in high risk areas, and these would have a feedback effect of increasing the predicted intensity.)
In practice, a big problem here is that the use of kernel density estimation in estimating the background kernel leads to edge effects near $T$.
It is, however, important to get the normalisation correct: that is, we use the average (relative) time intensity as our estimate of what the time background component should be at the end-point.
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import open_cp
import open_cp.sources.sepp
import datetime
In [3]:
region = open_cp.RectangularRegion(xmin=0, xmax=1000, ymin=0, ymax=500)
back_time_kernel = open_cp.sources.sepp.HomogeneousPoisson(10)
back_space_sampler = open_cp.sources.sepp.UniformRegionSampler(region)
background_sampler = open_cp.sources.sepp.InhomogeneousPoissonFactors(
    back_time_kernel, back_space_sampler)
trig_time_kernel = open_cp.sources.sepp.Exponential(exp_rate=10, total_rate=.5)
trig_space_sampler = open_cp.sources.sepp.GaussianSpaceSampler([0,0],[2,2],0)
trigger_sampler = open_cp.sources.sepp.InhomogeneousPoissonFactors(
    trig_time_kernel, trig_space_sampler)
simulation = open_cp.sources.sepp.SelfExcitingPointProcess(background_sampler, trigger_sampler)
In [4]:
sample = simulation.sample(0, 100)
# Scale time from 0...100 to 28 days, at a second resolution
time_unit = open_cp.sources.sepp.make_time_unit(datetime.timedelta(days=28)) / 100
timed_points = open_cp.sources.sepp.scale_to_real_time(sample, datetime.datetime(2016,4,10,0,0),
    time_unit)
timed_points.number_data_points, timed_points.timestamps[0], timed_points.timestamps[-1]
Out[4]:
In [5]:
import matplotlib.dates
fig, ax = plt.subplots(ncols=2, figsize=(16,7))
ax[0].scatter(timed_points.xcoords, timed_points.ycoords, alpha=0.5)
ax[0].set_title("Space location of simulated data")
ax[0].set(xlim=[0,1000], ylim=[0,500])
times = timed_points.times_datetime()
ax[1].scatter(times, timed_points.xcoords, alpha=0.5)
ax[1].set_xlim([datetime.datetime(2016,4,9), datetime.datetime(2016,5,9)])
ax[1].set_title("Date against x location")
fig.autofmt_xdate()
None
We have chosen the trigger / aftershock kernel to be very tightly clustered in space, which helps the estimator.
Unfortunately, the optimisation procedure is rather slow, a fact noticed already by the authors of (4). To speed it up, we can set "cut offs", space/time distances at which we believe triggering will not occur. These are set by default to be suitable for real data, but we change them for this simulated data.
The time units require a little care: we scaled 100 simulation time units to 28 days, so one time unit is $28 \times 24 \times 60 / 100 = 403.2$ minutes. The trigger kernel's exponential rate of 10 per time unit thus becomes $10/403.2 = 1/40.32$ per minute.
Similarly, the initial background rate of 10 per time unit becomes 10/403.2 per minute.
In [6]:
import open_cp.sepp as sepp
In [7]:
trainer = sepp.SEPPTrainer()
trainer.data = timed_points
trainer.space_cutoff = 10
trainer.time_cutoff = datetime.timedelta(hours=5)
In [8]:
predictor = trainer.train()
We can now visualise the space background rate. In our simulation this is uniform, but we know that the variable bandwidth KDE scheme employed here is likely to be poor at approximating a uniform distribution, due to edge effects.
We also show the estimate for the time rate, which we normalise. This shows the edge effect problems described above.
In [9]:
import open_cp.predictors
def plot_density(predictor, ax):
    kernel_array = open_cp.predictors.grid_prediction_from_kernel(predictor.background_kernel.space_kernel,
        region, 10)
    ax.set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
    mesh = ax.pcolormesh(*kernel_array.mesh_data(), kernel_array.intensity_matrix, cmap="Blues")
    fig.colorbar(mesh, ax=ax)
    return kernel_array
fig, ax = plt.subplots(ncols=2, figsize=(16,7))
kernel_array = plot_density(predictor, ax[0])
# Estimated time kernel is _normalised_
total_time = 28*24*60
tc = np.linspace(-total_time * 0.1, total_time * 1.1, 100)
def actual(t):
    return ((t>=0) & (t<=total_time)) / total_time
ax[1].plot(tc, actual(tc), linewidth=1, color="r")
ax[1].scatter(tc, predictor.background_kernel.time_kernel(tc))
ax[0].set_title("Estimated background risk in space")
ax[1].set_title("Estimated background risk in time")
None
The predictor has a property adjusted_background_kernel which replaces the time component by a constant derived from the average value.
The values below are a sanity check that our normalisations are correct.
In [10]:
print("Average time intensity:", predictor.adjusted_background_kernel.time_kernel(0))
print("Total intensity at (500,250):", predictor.adjusted_background_kernel([0,500,250]))
print("Expected average intensity is:", 10 / 403.2 / (1000 * 500))
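For reference, the final line computes the theoretical value: the background rate per minute divided by the area of the region, $(10/403.2) / (1000 \times 500) \approx 4.96 \times 10^{-8}$.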
As we have a copy of the estimated "p matrix", we can "stochastically decluster" the points, and get an estimate of which points the algorithm thinks are background events. This allows us to visualise the KDE procedure more clearly.
In [11]:
backgrounds, aftershocks = sepp.sample_points(trainer.as_time_space_points(), predictor.result.p)
In [12]:
fig, ax = plt.subplots(ncols=3, figsize=(16,7))
ax[0].scatter(timed_points.xcoords, timed_points.ycoords, alpha=0.5)
ax[0].set_title("Space location of simulated data")
ax[0].set(xlim=[0,1000], ylim=[0,500])
ax[1].scatter(backgrounds[1], backgrounds[2], alpha=0.5)
ax[1].set_title("Estimated background events")
ax[1].set(xlim=[0,1000], ylim=[0,500])
ax[2].set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
mesh = ax[2].pcolormesh(*kernel_array.mesh_data(), kernel_array.intensity_matrix, cmap="YlGn")
ax[2].set_title("Estimated background kernel")
ax[2].scatter(backgrounds[1], backgrounds[2], marker="+", color="black", alpha=0.7, linewidth=1)
None
It is also extremely interesting to visualise the triggering kernel, and to do this quickly we plot marginals. By definition, this is accomplished by integrating out 2 of the 3 variables. As we know the particular form of the kernel estimation used, we can cheat, and quickly compute marginals using some utility methods in the kernels package. For more details, see the "Kernel estimation" notebook, or the "SEPP testbed" notebooks.
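For reference, the trigger kernel used in the simulation factorises (with the exponential rate converted to minutes, as above) as
$$ g(t,x,y) = \theta \, \omega e^{-\omega t} \cdot \frac{1}{2\pi\sigma_1\sigma_2} \exp\Big( -\frac{x^2}{2\sigma_1^2} - \frac{y^2}{2\sigma_2^2} \Big), $$
with overall aftershock rate $\theta = 0.5$, decay rate $\omega = 1/40.32$ per minute, and variances $\sigma_1^2 = \sigma_2^2 = 2$; these give the red curves plotted below.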
In [13]:
def plot_actual_trig_marginals(rate=1/40.32, intensity=0.5, variances=[2,2]):
    def actual_t_marginal(t):
        return intensity * rate * np.exp(-t * rate)
    def actual_x_marginal(x):
        return intensity * np.exp(-x*x/(2*variances[0])) / np.sqrt(2*np.pi*variances[0])
    def actual_y_marginal(y):
        return intensity * np.exp(-y*y/(2*variances[1])) / np.sqrt(2*np.pi*variances[1])
    fig, ax = plt.subplots(ncols=3, figsize=(16,5))
    tc = np.linspace(0, 3.5 / rate, 100)
    actual = actual_t_marginal(tc)
    ax[0].plot(tc, actual, color="red", linewidth=1)
    xl = np.sqrt(variances[0])*4
    xc = np.linspace(-xl, xl, 100)
    actual = actual_x_marginal(xc)
    ax[1].plot(xc, actual, color="red", linewidth=1)
    yl = np.sqrt(variances[1])*4
    yc = np.linspace(-yl, yl, 100)
    actual = actual_y_marginal(yc)
    ax[2].plot(yc, actual, color="red", linewidth=1)
    return ax, tc, xc, yc
def plot_marginals(trigs, rescale, reflect_time=False, intensity=0.5, variances=[2,2], rate=1/40.32):
    ax, tc, xc, yc = plot_actual_trig_marginals(rate, intensity, variances)
    if reflect_time:
        kernel = open_cp.kernels.marginal_knng(trigs, 0)
        td = kernel(tc) + kernel(-tc)
    else:
        td = open_cp.kernels.marginal_knng(trigs, 0)(tc)
    ax[0].scatter(tc, td * rescale)
    ax[1].scatter(xc, open_cp.kernels.marginal_knng(trigs, 1)(xc) * rescale)
    ax[2].scatter(yc, open_cp.kernels.marginal_knng(trigs, 2)(yc) * rescale)
    for i, s in enumerate(list("txy")):
        ax[i].set_title(s+" marginal")
In [14]:
plot_marginals(aftershocks, aftershocks.shape[-1] / timed_points.number_data_points)
We see here the usual pattern for this optimisation method: slightly heavy tails, due to misclassification of background events as aftershocks.
The estimation of the time marginal is particularly poor: the trigger kernel is supported on $t \geq 0$, but the KDE does not know this, and so smooths probability mass onto negative times, underestimating the density near $0$. One fix is to reflect the estimated kernel about $t=0$, which we try next.
At each step of the optimisation procedure, we use the current kernel estimates to inform the next estimate of which events are background and which are aftershocks. Therefore, to be correct, we should re-run the optimisation process with the new trigger kernel estimator. As we shall see, this makes little difference to the estimated background events.
In [15]:
trainer.trigger_kernel_estimator = open_cp.kernels.ReflectedKernelEstimator(trainer.trigger_kernel_estimator)
predictor_reflected = trainer.train()
In [16]:
backgrounds, aftershocks = sepp.sample_points(trainer.as_time_space_points(), predictor_reflected.result.p)
plot_marginals(aftershocks, aftershocks.shape[-1] / timed_points.number_data_points, reflect_time=True)
In [17]:
fig, ax = plt.subplots(ncols=2, figsize=(16,7))
plot_density(predictor, ax[0])
plot_density(predictor_reflected, ax[1])
ax[0].set_title("Original background density")
ax[1].set_title("With reflected trigger kernel")
None
Having "trained" the predictor on the data, the "output" is the two kernels, estimating the background rate and the trigger / aftershock kernel.
The trained kernels can be applied to any data: past events only affect the current prediction through triggering. Here we shall use the same data and make a prediction for the day after the data.
In [18]:
predictor.data = timed_points
prediction = predictor.predict(datetime.datetime(2016,5,9))
In [19]:
grided = open_cp.predictors.GridPredictionArray.from_continuous_prediction_region(prediction, region, 20, 20)
In [20]:
fig, ax = plt.subplots(ncols=2, figsize=(16,8))
ax[0].set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
ax[0].pcolormesh(*grided.mesh_data(), grided.intensity_matrix, cmap="Blues")
ax[0].set_title("Prediction of risk at 2016-05-09")
grided_bground = open_cp.predictors.grid_prediction_from_kernel(predictor.background_kernel.space_kernel,
    region, 20)
ax[1].set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
ax[1].pcolormesh(*grided_bground.mesh_data(), grided_bground.intensity_matrix, cmap="Blues")
ax[1].set_title("Background risk")
None
These look the same! But that's because the trigger / aftershock kernel decays rather rapidly in time: on the order of hours for this simulated data. Making a prediction a whole day after the last event means we just look at the background rate...
In [21]:
prediction = predictor.predict(datetime.datetime(2016,5,8, 0,0))
grided = open_cp.predictors.GridPredictionArray.from_continuous_prediction_region(prediction, region, 20, 20)
In [22]:
print("Last event occurred at {} at {}".format(timed_points.timestamps[-1], timed_points.coords[:,-1]))
x, y = timed_points.coords[:,-1]
(prediction.risk(x, y), predictor.adjusted_background_kernel([0, x, y]),
    prediction.risk(x, y) / predictor.adjusted_background_kernel([0, x, y]))
Out[22]:
So in this case we do see an increase in risk against the background.
In [23]:
fig, ax = plt.subplots(ncols=2, figsize=(16,8))
ax[0].set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
ax[0].pcolormesh(*grided.mesh_data(), grided.intensity_matrix, cmap="Blues")
ax[0].set_title("Prediction of risk at 2016-05-08")
ax[1].set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
ax[1].pcolormesh(*grided_bground.mesh_data(), grided_bground.intensity_matrix, cmap="Blues")
ax[1].set_title("Background risk")
None
In [24]:
region = open_cp.RectangularRegion(xmin=0, xmax=1000, ymin=0, ymax=1000)
back_time_kernel = open_cp.sources.sepp.HomogeneousPoisson(1)
back_space_sampler = open_cp.sources.sepp.UniformRegionSampler(region)
background_sampler = open_cp.sources.sepp.InhomogeneousPoissonFactors(
    back_time_kernel, back_space_sampler)
trig_time_kernel = open_cp.sources.sepp.Exponential(exp_rate=0.5, total_rate=0.8)
trig_space_sampler = open_cp.sources.sepp.GaussianSpaceSampler([0,0],[20,20],0)
trigger_sampler = open_cp.sources.sepp.InhomogeneousPoissonFactors(
    trig_time_kernel, trig_space_sampler)
simulation = open_cp.sources.sepp.SelfExcitingPointProcess(background_sampler, trigger_sampler)
In [25]:
sample = simulation.sample(0, 365)
time_unit = open_cp.sources.sepp.make_time_unit(datetime.timedelta(days=1))
timed_points = open_cp.sources.sepp.scale_to_real_time(sample, datetime.datetime(2017,1,1,0,0),
    time_unit)
timed_points.number_data_points, timed_points.timestamps[0], timed_points.timestamps[-1]
Out[25]:
In [26]:
import matplotlib.dates
fig, ax = plt.subplots(ncols=2, figsize=(16,7))
ax[0].scatter(timed_points.xcoords, timed_points.ycoords, alpha=0.5)
ax[0].set_title("Space location of simulated data")
ax[0].set(xlim=[0,1000], ylim=[0,1000])
times = timed_points.times_datetime()
ax[1].scatter(times, timed_points.xcoords, alpha=0.5)
ax[1].set_xlim([datetime.datetime(2017,1,1), datetime.datetime(2018,1,1)])
ax[1].set_title("Date against x location")
fig.autofmt_xdate()
None
In [27]:
trainer = sepp.SEPPTrainer()
trainer.data = timed_points
trainer.space_cutoff = 100
trainer.time_cutoff = datetime.timedelta(days=10)
predictor = trainer.train()
In [28]:
backgrounds, aftershocks = sepp.sample_points(trainer.as_time_space_points(), predictor.result.p)
fig, ax = plt.subplots(ncols=3, figsize=(16,5))
ax[0].scatter(timed_points.xcoords, timed_points.ycoords, alpha=0.5)
ax[0].set_title("Space location of simulated data")
ax[0].set(xlim=[0,1000], ylim=[0,1000])
ax[1].scatter(backgrounds[1], backgrounds[2], alpha=0.5)
ax[1].set_title("Estimated background events")
ax[1].set(xlim=[0,1000], ylim=[0,1000])
ax[2].set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
kernel_array = open_cp.predictors.grid_prediction_from_kernel(predictor.background_kernel.space_kernel,
    region, 20)
mesh = ax[2].pcolormesh(*kernel_array.mesh_data(), kernel_array.intensity_matrix, cmap="YlGn")
ax[2].set_title("Estimated background kernel")
ax[2].scatter(backgrounds[1], backgrounds[2], marker="+", color="black", alpha=0.7, linewidth=1)
None
In [29]:
aftershock_rate = 0.5
mean_days = 1 / aftershock_rate
mean_minutes = 24 * 60 * mean_days
rate_in_minutes = 1 / mean_minutes
plot_marginals(aftershocks, aftershocks.shape[-1] / timed_points.number_data_points,
    rate=rate_in_minutes, intensity=0.5, variances=[10,10])
In [30]:
predictor.data = timed_points
dates = [datetime.datetime(2018,1,1) + datetime.timedelta(hours=30*i) for i in range(9)]
predictions = [predictor.predict(d) for d in dates]
grided = [open_cp.predictors.GridPredictionArray.from_continuous_prediction_region(pr, region, 40, 40)
    for pr in predictions]
In [31]:
fig, axes = plt.subplots(ncols=3, nrows=3, figsize=(16,12))
for i, ax in enumerate(axes.ravel()):
    ax.set(xlim=[region.xmin, region.xmax], ylim=[region.ymin, region.ymax])
    ax.pcolormesh(*grided[i].mesh_data(), grided[i].intensity_matrix, cmap="Blues")
    ax.set_title("Prediction of risk at {}".format(dates[i]))
We see here a somewhat different "problem": the aftershock risk completely dominates the background risk!
It would obviously be very interesting to look at real data.