Problems with Monte Carlo integration

By default, we convert "continuous" predictions into grid-based predictions by Monte Carlo integration: evaluate the kernel at random points within each grid cell and average.
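
For intuition, here is a minimal sketch of the idea, not open_cp's actual implementation (the helper name and signature are made up): for each grid cell, draw a few random points inside the cell, evaluate the kernel there, and take the average as the cell's risk.

import numpy as np

def monte_carlo_cell(kernel, x0, y0, cell_size, n_samples, rng):
    # draw uniform random points inside the cell and average the kernel
    xs = x0 + rng.random(n_samples) * cell_size
    ys = y0 + rng.random(n_samples) * cell_size
    return kernel(xs, ys).mean()

With only a handful of samples per cell, two runs of this produce noticeably different intensity matrices, which is exactly the problem demonstrated below.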

This seems to introduce a large amount of error. The following demonstrates the problem with artificial, random data; I originally noticed it on real data from Chicago.


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

import open_cp.retrohotspot
import open_cp.data
import open_cp.predictors
import open_cp.evaluation

In [2]:
n = 1000
# uniformly random event times through 2016, converted from fractional
# days to second-resolution timedeltas
times = np.random.random(n) * 365
times = times * (np.timedelta64(1, "D") / np.timedelta64(1, "s")) * np.timedelta64(1, "s")
times = np.sort(np.datetime64("2016-01-01") + times)
# uniformly random locations in a 1000 x 1000 region
points = open_cp.data.TimedPoints(times, np.random.random((2,n)) * 1000)

In [3]:
# a 100 x 100 grid of 10 x 10 cells, with no cells masked out
mask = np.full((100, 100), False)
grid = open_cp.data.MaskedGrid(10, 10, 0, 0, mask)

In [4]:
predictor = open_cp.retrohotspot.RetroHotSpot()
predictor.data = points
predictor.weight = open_cp.retrohotspot.TruncatedGaussian(100, 10)
cts_pred = predictor.predict(end_time=np.datetime64("2016-10-01"))
# use 5 random sample points per grid cell when converting to a grid
cts_pred.samples = 5
pred = open_cp.predictors.GridPredictionArray.from_continuous_prediction_grid(cts_pred, grid)

In [5]:
fig, ax = plt.subplots(ncols=2, figsize=(14,8))

# full prediction on the left; the top 10% of risk on the right
ax[0].pcolor(*pred.mesh_data(), pred.intensity_matrix, cmap="Greys")
p = open_cp.evaluation.top_slice_prediction(pred, 0.1)
ax[1].pcolor(*p.mesh_data(), p.intensity_matrix, cmap="Greys")


Out[5]:
<matplotlib.collections.PolyCollection at 0x7fc8b43e0b70>

Repeat: form the prediction again, top-slice the top 10% of risk again, and compare the selected cells (which is the same as comparing the "masks" of the top-sliced predictions). They differ by a lot!
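
For reference, top slicing amounts to roughly the following. This is a sketch, not open_cp's code, and tie-breaking at the threshold may differ.

import numpy as np

def top_slice(intensity, fraction):
    # keep cells whose risk is in the top `fraction`; mask the rest
    flat = np.sort(intensity.ravel())
    threshold = flat[-max(1, int(flat.size * fraction))]
    return np.ma.masked_less(intensity, threshold)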


In [6]:
cts_pred = predictor.predict(end_time=np.datetime64("2016-10-01"))
cts_pred.samples = 5
pred = open_cp.predictors.GridPredictionArray.from_continuous_prediction_grid(cts_pred, grid)
p1 = open_cp.evaluation.top_slice_prediction(pred, 0.1)

In [7]:
# sanity check: a prediction's mask XORed with itself is all zero
np.sum(p.intensity_matrix.mask ^ p.intensity_matrix.mask)


Out[7]:
0

In [8]:
# likewise for the second prediction
np.sum(p1.intensity_matrix.mask ^ p1.intensity_matrix.mask)


Out[8]:
0

In [9]:
# but the two runs select very different cells: the top 10% is 1,000 of
# the 10,000 cells, so 224 differing mask entries is a large discrepancy
np.sum(p.intensity_matrix.mask ^ p1.intensity_matrix.mask)


Out[9]:
224

You might argue that 5 samples per cell is too few, so let's try 25:


In [10]:
cts_pred = predictor.predict(end_time=np.datetime64("2016-10-01"))
cts_pred.samples = 25
pred = open_cp.predictors.GridPredictionArray.from_continuous_prediction_grid(cts_pred, grid)
p = open_cp.evaluation.top_slice_prediction(pred, 0.1)

cts_pred = predictor.predict(end_time=np.datetime64("2016-10-01"))
cts_pred.samples = 25
pred = open_cp.predictors.GridPredictionArray.from_continuous_prediction_grid(cts_pred, grid)
p1 = open_cp.evaluation.top_slice_prediction(pred, 0.1)

In [11]:
np.sum(p.intensity_matrix.mask ^ p1.intensity_matrix.mask)


Out[11]:
86
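
Better, but far from zero. This is expected: the standard error of a Monte Carlo average falls only as 1/sqrt(N), so going from 5 to 25 samples per cell cuts the noise by a factor of sqrt(5), about 2.2, roughly the improvement we see in the number of differing cells (224 down to 86). A quick standalone demonstration of the scaling, using a made-up kernel and not open_cp:

import numpy as np
rng = np.random.default_rng(42)
kernel = lambda x, y: np.exp(-((x - 0.5)**2 + (y - 0.5)**2) / 0.02)
for n in [5, 25, 100]:
    # spread of 1000 repeated n-sample estimates of the cell average
    estimates = [kernel(*rng.random((2, n))).mean() for _ in range(1000)]
    print(n, np.std(estimates))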

We have added a "reproducible" mode, whereby we don't sample at random points, but rather on a regular sub-grid. Setting the number of samples to a negative value selects this mode.
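
The idea is roughly the following (a sketch only, assuming a k by k sub-grid of sample points per cell; open_cp's actual sample layout may differ):

import numpy as np

def sub_grid_cell(kernel, x0, y0, cell_size, k):
    # evaluate the kernel at the centres of a regular k x k sub-grid
    offsets = (np.arange(k) + 0.5) / k * cell_size
    xs, ys = np.meshgrid(x0 + offsets, y0 + offsets)
    return kernel(xs, ys).mean()

Being deterministic, repeated conversions now agree exactly, as the cells below confirm. Note that this removes run-to-run variance, but does not by itself make the quadrature more accurate.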


In [12]:
cts_pred = predictor.predict(end_time=np.datetime64("2016-10-01"))
# a negative value selects the reproducible, regular sub-grid mode
cts_pred.samples = -5
pred = open_cp.predictors.GridPredictionArray.from_continuous_prediction_grid(cts_pred, grid)
p = open_cp.evaluation.top_slice_prediction(pred, 0.1)

cts_pred = predictor.predict(end_time=np.datetime64("2016-10-01"))
cts_pred.samples = -5
pred = open_cp.predictors.GridPredictionArray.from_continuous_prediction_grid(cts_pred, grid)
p1 = open_cp.evaluation.top_slice_prediction(pred, 0.1)

In [13]:
np.sum(p.intensity_matrix.mask ^ p1.intensity_matrix.mask)


Out[13]:
0
