Spatial Poisson Processes

Homogeneous 1-D Poisson Process

Count statistics

The number of events $N$ within an interval $X$ is Poisson distributed $$P(N) = \frac{(\lambda X)^N}{N!}e^{-\lambda X}$$ with mean $$\langle N \rangle = \lambda X$$ and variance $$\sigma^2_N = \lambda X$$ That is, the Fano factor (variance of a count divided by the mean) is $$F=\frac{\sigma^2_N}{\langle N \rangle} = 1$$

The intensity $\lambda$ of the Poisson process can be estimated from the average number of events $\langle N \rangle$ within an interval $X$: $$\lambda = \frac{\langle N \rangle}{X}$$ The unit of $\lambda$ is the inverse of the unit of $X$.


In [2]:
%matplotlib inline
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# generate poisson events via the interval distribution:
n = 100000
l = 2.0   # lambda, intensity of the poisson point process
x = np.cumsum(np.random.exponential(1.0/l, n))

# count statistics:
X = 4.0
windows = np.arange(0.0, x[-1], X)
counts, _ = np.histogram(x, windows)
mean_count =  np.mean(counts)
variance_count = np.var(counts)
print('lambda*X   : %.1f' % (l*X))
print('mean count : %.1f' % mean_count)
print('variance   : %.1f' % variance_count)
print('Fano factor: %.2f' % (variance_count/mean_count))

# count histogram:
plt.title('Poisson distribution of counts')
cr = np.arange(0, np.max(counts)+1, 1)
plt.hist(counts, cr-0.5, normed=True)
pp = st.poisson.pmf(cr, l*X)
plt.plot(cr, pp, 'r', lw=2)
plt.xlim(0, 3*l*X)
plt.xlabel('Counts in window X')
plt.ylabel('Probability')


lambda*X   : 8.0
mean count : 8.0
variance   : 8.2
Fano factor: 1.02

Interval statistics

The nearest-neighbor distances, i.e. the intervals $\Delta x$ between events, are exponentially distributed $$p(\Delta x) = \lambda e^{-\lambda\Delta x}$$ The mean $$\langle \Delta x \rangle = \frac{1}{\lambda} = \frac{X}{\langle N \rangle}$$ is the same as for regularly distributed points. The standard deviation is $$\sigma_{\Delta x} = \frac{1}{\lambda}$$ and the coefficient of variation (standard deviation divided by mean) is $$CV = \frac{\sigma_{\Delta x}}{\langle \Delta x \rangle} = 1$$


In [3]:
# intervals:
dx = np.diff(x)

# statistics:
mean_dx = np.mean(dx)
std_dx = np.std(dx)
print('1/lambda: %.2f' % (1.0/l))
print('mean dx : %.2f' % mean_dx)
print('std     : %.2f' % std_dx)
print('CV      : %.2f' % (std_dx/mean_dx))

# interval histogram:
plt.title('Exponential distribution of intervals')
intervals = np.arange(0.0, np.max(dx), 0.1/l)
plt.hist(dx, intervals, normed=True)
pe = st.expon.pdf(intervals, 0.0, 1.0/l)
plt.plot(intervals, pe, 'r', lw=2)
plt.xlim(0.0, 5.0/l)
plt.xlabel('Nearest neighbor distances $dx$')
plt.ylabel('pdf')


1/lambda: 0.50
mean dx : 0.50
std     : 0.50
CV      : 1.00

The exponential distribution of intervals is a special case of the Weibull distribution $$p_w(x)=\frac{k}{\alpha}\left(\frac{x}{\alpha}\right)^{k-1}e^{-(x/\alpha)^k}$$ with $k=1$ and $\alpha=1/\lambda$.
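
As a quick numerical check (a minimal sketch; the intensity $\lambda=2$ is an arbitrary choice), the exponential and the $k=1$ Weibull parameterization give identical densities:

import numpy as np
import scipy.stats as st

l = 2.0                                       # arbitrary intensity
x = np.linspace(0.0, 5.0/l, 200)
pe = st.expon.pdf(x, 0.0, 1.0/l)              # exponential interval distribution
pw = st.weibull_min.pdf(x, 1, 0.0, 1.0/l)     # Weibull with k=1 and alpha=1/lambda
print('max |exponential - weibull|: %g' % np.max(np.abs(pe - pw)))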

Generation of 1-D Poisson data

The probability of getting an event in a small interval $\Delta x$ is $$P = \lambda \Delta x$$ By drawing a uniformly distributed random number for each small interval $\Delta x$ and placing an event whenever that number is smaller than $P$, one can generate a Poisson process. Note, however, that this is computationally expensive, since far more random numbers than actual events need to be generated.


In [4]:
N = 20    # mean number of points
a = 2.0   # interval on which to distribute the points
l = N/a   # intensity
dx = 0.001
P = l*dx
x = np.arange(0.0, a+dx, dx)
r = np.random.rand(len(x))
points = x[r<P]

plt.eventplot(points, colors=['red'])
plt.ylim(0.4, 1.6)
plt.yticks([])



In [5]:
def statistics1d(points, ne, ae, w=1.0):
    le = ne/ae
    n = len(points)
    a = points[-1] - points[0]
    counts, _ = np.histogram(points, np.arange(0.0, points[-1]+w, w))
    intervals = np.diff(points)
    mint = np.mean(intervals)
    print('expected number of points: %d' % ne)
    print('number of points         : %d' % n)
    print('expected interval        : %.3f' % ae)
    print('total interval           : %.3f' % a)
    print('expected mean interval   : %.3f' % (ae/ne))
    print('mean interval            : %.3f' % mint)
    print('expected intensity       : %.3f' % (ne/ae))
    print('observed intensity 1     : %.3f' % (1.0/mint))
    print('observed intensity 2     : %.3f' % ((n-1)/a))
    plt.subplot(1, 2, 1)
    cr = np.arange(0, np.max(counts)+1, 1)
    plt.hist(counts, cr-0.5, normed=True)
    pp = st.poisson.pmf(cr, w*ne/ae)
    plt.plot(cr, pp, 'r', lw=2)
    plt.subplot(1, 2, 2)
    plt.hist(intervals, 30, normed=True)
    dx = np.linspace(0.0, np.max(intervals), 200)
    pe = st.expon.pdf(dx, 0.0, 1.0/le)
    plt.plot(dx, pe, 'r', lw=2)

x = np.arange(0.0, 100*a+dx, dx)
r = np.random.rand(len(x))
points = x[r<P]
statistics1d(points, 100*N, 100*a, 0.2)


expected number of points: 2000
number of points         : 1984
expected interval        : 200.000
total interval           : 199.983
expected mean interval   : 0.100
mean interval            : 0.101
expected intensity       : 10.000
observed intensity 1     : 9.916
observed intensity 2     : 9.916

Given an interval $a$ in which on average $N$ points should be randomly placed, we first need to draw the actual number of points from the Poisson distribution and then distribute these points uniformly on the interval $a$. The intensity of the Poisson process is $\lambda = N/a$. For computing the interval statistics we need to sort the events.


In [6]:
n = np.random.poisson(N) # number of points
x = np.random.rand(n)*a
points = np.sort(x)

plt.eventplot(points, colors=['red'])
plt.ylim(0.4, 1.6)
plt.yticks([])



We can even skip drawing the number of events from the Poisson distribution and simply place a fixed number of points. However, keep in mind that this causes trouble when doing a bootstrap.


In [7]:
points = np.sort(np.random.rand(N*100)*a*100)
statistics1d(points, N*100, a*100, 0.2)


expected number of points: 2000
number of points         : 2000
expected interval        : 200.000
total interval           : 199.888
expected mean interval   : 0.100
mean interval            : 0.100
expected intensity       : 10.000
observed intensity 1     : 10.001
observed intensity 2     : 10.001

Distributing points by drawing them from a uniform distribution is additive: superimposing several independently drawn sets again yields Poisson distributed points, with a correspondingly increased intensity.


In [8]:
points = np.array([])
for k in range(5):
    print('Iteration %d:' % (k+1))
    # draw some uniformly distributed points:
    newpoints = np.random.rand(N*100)*a*100
    # add them to the previous ones:
    points = np.sort(np.concatenate((points, newpoints)))
    plt.figure()
    statistics1d(points, N*100*(k+1), a*100, 0.2) 
    print('')


Iteration 1:
expected number of points: 2000
number of points         : 2000
expected interval        : 200.000
total interval           : 199.892
expected mean interval   : 0.100
mean interval            : 0.100
expected intensity       : 10.000
observed intensity 1     : 10.000
observed intensity 2     : 10.000

Iteration 2:
expected number of points: 4000
number of points         : 4000
expected interval        : 200.000
total interval           : 199.904
expected mean interval   : 0.050
mean interval            : 0.050
expected intensity       : 20.000
observed intensity 1     : 20.005
observed intensity 2     : 20.005

Iteration 3:
expected number of points: 6000
number of points         : 6000
expected interval        : 200.000
total interval           : 199.943
expected mean interval   : 0.033
mean interval            : 0.033
expected intensity       : 30.000
observed intensity 1     : 30.004
observed intensity 2     : 30.004

Iteration 4:
expected number of points: 8000
number of points         : 8000
expected interval        : 200.000
total interval           : 199.943
expected mean interval   : 0.025
mean interval            : 0.025
expected intensity       : 40.000
observed intensity 1     : 40.006
observed intensity 2     : 40.006

Iteration 5:
expected number of points: 10000
number of points         : 10000
expected interval        : 200.000
total interval           : 199.970
expected mean interval   : 0.020
mean interval            : 0.020
expected intensity       : 50.000
observed intensity 1     : 50.003
observed intensity 2     : 50.003

We can also generate a Poisson process based on the interval distribution. This method is about as efficient as the previous one, because drawing random numbers from exponential distributions is more expensive than drawing them from uniform distributions, which offsets the cost of sorting.


In [9]:
dx = np.random.exponential(1.0/l, N)
x = np.cumsum(dx)
points = x[x<a] # truncate excess points

plt.eventplot(points, colors=['red'])
plt.ylim(0.4, 1.6)
plt.yticks([])



In [10]:
x = np.cumsum(np.random.exponential(1.0/l, N*100))
points = x[x<a*100]
statistics1d(points, N*100, a*100, 0.2)


expected number of points: 2000
number of points         : 1933
expected interval        : 200.000
total interval           : 199.538
expected mean interval   : 0.100
mean interval            : 0.103
expected intensity       : 10.000
observed intensity 1     : 9.682
observed intensity 2     : 9.682

Distribution of mean intervals

In 2-D, the mean nearest-neighbor distance is used for testing whether points are randomly placed. Here we examine this approach for the 1-D case.

The average of the mean interval is unbiased when bootstrapping with a fixed intensity, i.e. a Poisson-distributed number of points, rather than a fixed number of points. The standard error of the mean can be estimated from the standard deviation of the intervals by dividing by $\sqrt{n}$.


In [11]:
npoints = 100
l = npoints/a
de = a/npoints
sem = a/npoints/np.sqrt(npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
for k in range(r):
    # generate points:
    points = np.sort(np.random.rand(np.random.poisson(npoints)))*a
    #points = np.cumsum(np.random.exponential(1.0/l, npoints))
    # compute nearest neighbor distance:
    min_dists = np.diff(points)
    # statistics:
    do = np.mean(min_dists)
    md[k] = do
    sd[k] = np.std(min_dists)
do = np.mean(md)
print('expected mean    : %.3f' % de)
print('observed mean    : %.3f' % do)
print('average std      : %.3f' % np.mean(sd))
print('sem              : %.3f' % (np.mean(sd)/np.sqrt(npoints)))
print('std observed mean: %.3f' % np.std(md))
print('bias             : %.3f' % (do-de))
plt.hist(md, 30, normed=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')
plt.ylabel('pdf')


expected mean    : 0.020
observed mean    : 0.020
average std      : 0.020
sem              : 0.002
std observed mean: 0.002
bias             : 0.000

When bootstrapping with a fixed number of points, we get the wrong distribution. The distribution is non-Gaussian and narrower than what is expected from the standard error.


In [12]:
npoints = 100
l = npoints/a
de = a/npoints
sem = a/npoints/np.sqrt(npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
for k in range(r):
    # generate points:
    points = np.sort(np.random.rand(npoints))*a
    # compute nearest neighbor distance:
    min_dists = np.diff(points)
    # statistics:
    do = np.mean(min_dists)
    md[k] = do
    sd[k] = np.std(min_dists)
do = np.mean(md)
print('expected mean    : %.3f' % de)
print('observed mean    : %.3f' % do)
print('average std      : %.3f' % np.mean(sd))
print('sem              : %.3f' % (np.mean(sd)/np.sqrt(npoints)))
print('std observed mean: %.3f' % np.std(md))
print('bias             : %.3f' % (do-de))
plt.hist(md, 30, normed=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')
plt.ylabel('pdf')


expected mean    : 0.020
observed mean    : 0.020
average std      : 0.020
sem              : 0.002
std observed mean: 0.000
bias             : -0.000

The mean nearest-neighbor distance of the 1-D Poisson process is the same as that of regularly spaced points. Therefore we cannot test for randomness by bootstrapping the mean nearest-neighbor distance.

The following code merely confirms that bootstrapping the mean nearest-neighbor distance works as expected, but it is not applicable in real-world scenarios.


In [13]:
nt = 100
p_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 2000.0
    #x = np.cumsum(np.random.exponential(a/n, n))
    x = np.sort(np.random.rand(np.random.poisson(n)))*a
    
    # nearest neighbors statistics:
    dx = np.diff(np.sort(x))
    md = np.mean(dx)
    
    # significance:
    se = a/n/np.sqrt(n)
    zann = (md-a/n)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
    if p < 0.05:
        p_counter += 1
    
    # bootstrap:
    rn = 2000
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(n)*a   # this does not work!!!
        bx = np.random.rand(np.random.poisson(n))*a
        bdx = np.diff(np.sort(bx))
        bmd[r] = np.mean(bdx)
    lp, up = np.percentile(bmd, [2.5, 97.5])
    if md < lp or md > up:
        bp_counter += 1
    if j == 1:
        plt.hist(bmd, 30, normed=True)
        plt.plot([md, md], [0, 0.4*np.sqrt(len(dx))/np.std(dx)], 'r', lw=2)
print('Percentage of significant deviations (z-score)  : %5.1f%%' % (100.0*float(p_counter)/float(nt)))
print('Percentage of significant deviations (bootstrap): %5.1f%%' % (100.0*float(bp_counter)/float(nt)))


Percentage of significant deviations (z-score)  :   6.0%
Percentage of significant deviations (bootstrap):   3.0%

Bootstrap and z-score work as expected: in about 5% of the cases we get a significant difference from a Poisson process.

In a real-world application we have to estimate the parameter of the Poisson process from the data. Consequently, the bootstrapped distribution is centered around the observed mean, and both the z-score and the bootstrap will indicate Poisson data in all cases.


In [14]:
nt = 100
p_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 2000.0
    #x = np.cumsum(np.random.exponential(a/n, n))
    x = np.sort(np.random.rand(np.random.poisson(n)))*a
    
    # nearest neighbors statistics:
    dx = np.diff(np.sort(x))
    md = np.mean(dx)
    
    # expected mean nearest-neighbor distance:
    de = (x[-1]-x[0])/len(x)
    
    # significance:
    se = de/np.sqrt(n)
    zann = (md-de)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
    if p < 0.05:
        p_counter += 1
    
    # bootstrap:
    nn = len(x)
    aa = md*nn
    rn = 1000
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn)*a   # this does not work!!!
        bx = np.random.rand(np.random.poisson(nn))*aa
        bdx = np.diff(np.sort(bx))
        bmd[r] = np.mean(bdx)
    lp, up = np.percentile(bmd, [2.5, 97.5])
    if md < lp or md > up:
        bp_counter += 1
    if j == 1:
        plt.hist(bmd, 30, normed=True)
        plt.plot([md, md], [0, 0.4*np.sqrt(len(dx))/np.std(dx)], 'r', lw=2)
print('Percentage of significant deviations (z-score)  : %5.1f%%' % (100.0*float(p_counter)/float(nt)))
print('Percentage of significant deviations (bootstrap): %5.1f%%' % (100.0*float(bp_counter)/float(nt)))


Percentage of significant deviations (z-score)  :   0.0%
Percentage of significant deviations (bootstrap):   0.0%

Now let's bootstrap some regularly spaced points:


In [15]:
nt = 100
p_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original", regularly spaced data:
    n = 100
    a = 2000.0
    x = np.linspace(0.0, a, n) + (0.02*a/n)*np.random.randn(n)
    
    # nearest neighbors statistics:
    dx = np.diff(np.sort(x))
    md = np.mean(dx)
    
    # expected mean nearest-neighbor distance:
    de = (x[-1]-x[0])/len(x)
    
    # significance:
    se = de/np.sqrt(n)
    zann = (md-de)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
    if p < 0.05:
        p_counter += 1
    
    # bootstrap:
    nn = len(x)
    aa = md*nn
    rn = 1000
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn)*a   # this does not work!!!
        bx = np.random.rand(np.random.poisson(nn))*aa
        bdx = np.diff(np.sort(bx))
        bmd[r] = np.mean(bdx)
    lp, up = np.percentile(bmd, [2.5, 97.5])
    if md < lp or md > up:
        bp_counter += 1
    if j == 1:
        plt.hist(bmd, 30, normed=True)
        plt.plot([md, md], [0, 0.04*np.sqrt(len(dx))/np.std(dx)], 'r', lw=2)
print('Percentage of significant deviations (z-score)  : %5.1f%%' % (100.0*float(p_counter)/float(nt)))
print('Percentage of significant deviations (bootstrap): %5.1f%%' % (100.0*float(bp_counter)/float(nt)))


Percentage of significant deviations (z-score)  :   0.0%
Percentage of significant deviations (bootstrap):   0.0%

None of the data sets is significantly different from Poisson data. This is not what we want, but we understand why this is the case.

Homogeneous 2-D Poisson Process

Count statistics

The intensity of the Poisson process sets the mean number of events $N$ per area $A$ $$\lambda=\frac{\langle N \rangle}{A}$$ The unit of $\lambda$ is the inverse of the unit of $A$.

Within some area $A$, the number of points is Poisson distributed, exactly as in the 1-D case: $$P(N) = \frac{(\lambda A)^N}{N!}e^{-\lambda A}$$ with mean $$\langle N \rangle = \lambda A$$ and variance $$\sigma^2_N = \lambda A$$


In [16]:
# generate poisson events in a plane:
n = 10000
a = 40.0
A = a*a
points = np.random.rand(np.random.poisson(n), 2)*a
l = n/A
print('lambda = %g' % l)


lambda = 6.25
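
As a minimal sketch (the cell size w is an arbitrary choice; points, a, and l come from the cell above), we can verify the 2-D count statistics by counting the points in square sub-cells and checking mean, variance, and Fano factor:

# count the points of the cell above in square sub-cells of side w:
w = 2.0
bins = np.arange(0.0, a + w, w)
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], [bins, bins])
counts = counts.ravel()
print('lambda*w^2 : %.2f' % (l*w*w))
print('mean count : %.2f' % np.mean(counts))
print('variance   : %.2f' % np.var(counts))
print('Fano factor: %.2f' % (np.var(counts)/np.mean(counts)))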

Statistics of nearest neighbor distances

The nearest neighbor for each point is the single point that is closest to the point. The corresponding distance $d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$ is the nearest-neighbor distance. In 2-D, nearest neighbor distances $d$ are distributed according to $$p(d)=2\pi\lambda d e^{-\pi\lambda d^2}$$ with mean nearest-neighbor distance $$\langle d \rangle = \frac{1}{2\sqrt{\lambda}} = \frac{1}{2}\sqrt{\frac{A}{\langle N \rangle}}$$ variance $$\sigma_d^2 = \frac{1}{\lambda}\left(\frac{1}{\pi} - \frac{1}{4}\right)$$ standard deviation $$\sigma_d = \frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{\pi} - \frac{1}{4}} \approx 0.26136 \cdot \frac{1}{\sqrt{\lambda}} = 0.26136 \cdot \sqrt{\frac{A}{\langle N \rangle}} = 0.52272 \cdot \langle d \rangle$$ and coefficient of variation $$CV_d = 2\sqrt{\frac{1}{\pi} - \frac{1}{4}} \approx 0.52272$$

The 2-D nearest-neighbor distribution is a special case of the Weibull distribution $p_w(x)$ with $k=2$ and $\alpha=1/\sqrt{\pi\lambda}$.

It is also a special case of the Rayleigh distribution $$p_r(x) = \frac{x}{\sigma^2}e^{-\frac{x^2}{2\sigma^2}}$$ with $\sigma^2=1/(2\pi\lambda)$.
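
A brief numerical check (with an arbitrarily chosen $\lambda$) that the explicit density, the Weibull form, and the Rayleigh form coincide:

import numpy as np
import scipy.stats as st

l = 6.25                                                  # arbitrary intensity
d = np.linspace(0.0, 2.0/np.sqrt(l), 200)
pd = 2.0*np.pi*l*d*np.exp(-np.pi*l*d**2)                  # explicit nearest-neighbor pdf
pw = st.weibull_min.pdf(d, 2, 0.0, 1.0/np.sqrt(np.pi*l))  # Weibull with k=2
pr = st.rayleigh.pdf(d, 0.0, np.sqrt(0.5/(np.pi*l)))      # Rayleigh with sigma^2 = 1/(2 pi lambda)
print('max |pdf - weibull| : %g' % np.max(np.abs(pd - pw)))
print('max |pdf - rayleigh|: %g' % np.max(np.abs(pd - pr)))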


In [17]:
import scipy.spatial as sp

def nearest_neighbor_distances(points):
    distances = sp.distance_matrix(points, points)
    np.fill_diagonal(distances, np.nan)
    min_dists = np.nanmin(distances, axis=0)
    return min_dists

In [18]:
min_dists = nearest_neighbor_distances(points)

# statistics of nearest-neighbor distances:
meand = 0.5/np.sqrt(l)
mean_d = np.mean(min_dists)
std_d = np.std(min_dists)
print('0.5/sqrt(lambda): %.3f' % meand)
print('mean distances  : %.3f' % mean_d)
print('std             : %.3f' % std_d)
print('CV              : %.3f' % (std_d/mean_d))

plt.title('Nearest-neighbor distribution')
plt.hist(min_dists, 30, normed=True)
x = np.linspace(0.0, meand*3, 200)
pw = st.weibull_min.pdf(x, 2, 0.0, 1.0/np.sqrt(np.pi*l))
plt.plot(x, pw, 'r', lw=2)
plt.xlabel('Nearest-neighbor distance')
plt.ylabel('pdf')


0.5/sqrt(lambda): 0.200
mean distances  : 0.202
std             : 0.104
CV              : 0.514

Distribution of the mean nearest-neighbor distances

The average of the mean nearest-neighbor distances, computed from all random points generated within a rectangular area, does not converge to the expected mean $(2\sqrt{\lambda})^{-1}$. The reason is that the nearest-neighbor distance of points close to the border of the region is larger than that of points from the center. This seems to overestimate the mean nearest-neighbor distance by about one standard error of the mean.

The standard deviation of this distribution, i.e. the standard error of the mean, seems, however, to be about right. This result is independent of whether we place a fixed number of points in the area or a Poisson-distributed one.


In [19]:
npoints = 100
l = npoints/A
de = 0.5*np.sqrt(A/npoints)
sem = 0.26136/np.sqrt(l)/np.sqrt(npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
for k in range(r):
    # generate points:
    #points = np.random.rand(npoints, 2)*a
    points = np.random.rand(np.random.poisson(npoints), 2)*a
    # compute nearest neighbor distance:
    distances = sp.distance_matrix(points, points)
    np.fill_diagonal(distances, np.nan)
    min_dists = np.nanmin(distances, axis=0)
    # statistics:
    do = np.mean(min_dists)
    md[k] = do
    sd[k] = np.std(min_dists)
do = np.mean(md)
print('expected mean    : %.2f' % de)
print('observed mean    : %.2f' % do)
print('average std      : %.2f' % np.mean(sd))
print('sem              : %.2f' % (np.mean(sd)/np.sqrt(npoints)))
print('expected sem     : %.2f' % sem)
print('std observed mean: %.2f' % np.std(md))
print('bias             : %.2f' % (do-de))
plt.hist(md, 30, normed=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')


expected mean    : 2.00
observed mean    : 2.10
average std      : 1.15
sem              : 0.11
expected sem     : 0.10
std observed mean: 0.16
bias             : 0.10

Now, let's take only points from the center of the area:


In [20]:
npoints = 100
lower = a*0.3
upper = a*0.7
l = npoints/A
de = 0.5*np.sqrt(A/npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
sed = np.zeros(r)
nn = np.zeros(r)
j = 0
for k in range(r):
    # generate points:
    #points = np.random.rand(npoints, 2)*a
    points = np.random.rand(np.random.poisson(npoints), 2)*a
    # compute nearest neighbor distance:
    distances = sp.distance_matrix(points, points)
    np.fill_diagonal(distances, np.nan)
    min_dists = np.nanmin(distances, axis=0)
    # select points from the center:
    min_dists = min_dists[(points[:,0]>lower)&(points[:,0]<upper)&(points[:,1]>lower)&(points[:,1]<upper)]
    if len(min_dists) == 0:
        continue
    # statistics:
    do = np.mean(min_dists)
    md[j] = do
    sd[j] = np.std(min_dists)
    nn[j] = len(min_dists)
    sed[j] = sd[j]/np.sqrt(len(min_dists))
    j += 1
md = md[:j]
sd = sd[:j]
sed = sed[:j]
nn = nn[:j]
do = np.mean(md)
sem = 0.26136/np.sqrt(l)/np.sqrt(np.mean(nn))
print('expected mean    : %.2f' % de)
print('observed mean    : %.2f' % do)
print('average std      : %.2f' % np.mean(sd))
print('average sem      : %.2f' % np.mean(sed))
print('expected sem     : %.2f' % sem)
print('std observed mean: %.2f' % np.std(md))
print('bias             : %.2f' % (do-de))
plt.hist(md, 30, normed=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')


expected mean    : 2.00
observed mean    : 2.10
average std      : 0.98
average sem      : 0.26
expected sem     : 0.26
std observed mean: 0.42
bias             : 0.10

Eliminating points near the edges indeed removes the bias. The resulting distribution of mean nearest-neighbor distances is, however, not exactly Gaussian, and both the expected standard error of the mean and the s.e.m. computed from the standard deviation of the nearest-neighbor distances (divided by the square root of the number of points) slightly underestimate the standard deviation of the distribution of mean nearest-neighbor distances.

Estimating the mean nearest-neighbor distance

As we have just convinced ourselves, the mean nearest-neighbor distance cannot simply be estimated by the average of the nearest-neighbor distances (known as the a.n.n.), $$\langle d \rangle_{est} \ne \bar d = \frac{1}{N} \sum_i d_i$$ because of the bias introduced by the points close to the edge.

An improved, unbiased estimate can be achieved by reducing the a.n.n. by one standard error of the mean: $$\langle d \rangle_{est} = \bar d - \sigma_{\bar d}$$ where the standard error is $$\sigma_{\bar d} = 0.52272 \cdot \frac{\langle d \rangle}{\sqrt{N}} $$ That is $$\langle d \rangle_{est} = \bar d \cdot\left(1 - \frac{0.52272}{\sqrt{N}} \right)$$

Why this is exactly(?) one s.e.m. I do not know... --- but see below for more on that.
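
A minimal sketch of this correction on simulated Poisson points (it relies on the nearest_neighbor_distances() helper and the imports from above; npoints and a are arbitrary choices):

# compare the raw a.n.n. with the corrected estimate on simulated Poisson points:
npoints = 100
a = 20.0
de = 0.5*a/np.sqrt(npoints)          # true mean nearest-neighbor distance
r = 200
ann = np.zeros(r)
est = np.zeros(r)
for k in range(r):
    points = np.random.rand(np.random.poisson(npoints), 2)*a
    d = nearest_neighbor_distances(points)
    ann[k] = np.mean(d)                                    # raw a.n.n., biased by the border
    est[k] = ann[k]*(1.0 - 0.52272/np.sqrt(len(points)))   # corrected estimate
print('true mean          : %.3f' % de)
print('average a.n.n.     : %.3f' % np.mean(ann))
print('corrected estimate : %.3f' % np.mean(est))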

Estimating the intensity of the Poisson process

By definition, the intensity of the Poisson process is $$\lambda = \frac{\langle N \rangle}{A}$$

For estimating the intensity from data we need estimates of the mean number of points as well as of the area: $$\lambda_{est} = \frac{\langle N \rangle_{est}}{A_{est}}$$

We estimate the mean number of points by the actually observed number of points: $$\langle N \rangle_{est} = N$$

How do we estimate the area $A$? The area within the convex hull of the points is certainly too small.

Recall that the intensity can also be estimated from the mean nearest-neighbor distance: $$\lambda_{est} = \frac{1}{4\langle d \rangle^2}$$

Combining both yields an estimate of the area from which the random points were drawn: $$A_{est} = 4\langle d \rangle^2 N$$

For this we use the corrected estimate for the mean nearest-neighbor distance: $$A_{est} = 4\langle d \rangle_{est}^2 N$$
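
Putting these pieces together, the following is a minimal sketch of estimating area and intensity from the observed points alone (it uses the nearest_neighbor_distances() helper and the imports from above; the true values are printed only for comparison):

# estimate area and intensity of a Poisson process from the observed points alone:
n = 100
a = 20.0
points = np.random.rand(np.random.poisson(n), 2)*a
N = len(points)                                  # observed number of points
dbar = np.mean(nearest_neighbor_distances(points))
dest = dbar*(1.0 - 0.52272/np.sqrt(N))           # corrected mean nearest-neighbor distance
Aest = 4.0*dest**2*N                             # estimated area
lest = N/Aest                                    # estimated intensity
print('true area          : %.1f' % (a*a))
print('estimated area     : %.1f' % Aest)
print('true intensity     : %.3f' % (n/(a*a)))
print('estimated intensity: %.3f' % lest)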

Influence of the border

In the following we generate randomly placed points on a unit square. From these points we then select those within a rectangular sub-area whose one side is shortened according to aspect_ratios. That is, the smaller the aspect ratio, the more the rectangle deviates from a square.


In [21]:
for n in [20, 100, 500]:
    a = 1.0
    A = a*a
    l = n/A
    de = 0.5/np.sqrt(l)
    nr = 500
    aspect_ratios = np.arange(1.0, 0.0, -0.1)
    sem = 0.52272*de/np.sqrt(n*aspect_ratios) 
    areas = A*aspect_ratios
    perimeters = 2.0*a*(1.0 + aspect_ratios) 
    alpha = perimeters/areas
    ratio = 0.5*(1.0+1.0/aspect_ratios)
    ls = np.zeros((nr, len(aspect_ratios)))
    lests = np.zeros((nr, len(aspect_ratios)))
    anns = np.zeros((nr, len(aspect_ratios)))
    for r in range(nr):
        #points = np.random.rand(np.random.poisson(n), 2)*a
        points = np.random.rand(n, 2)*a
        for k, ar in enumerate(aspect_ratios):
            rpoints = points[points[:, 1] < ar*a, :]
            if len(rpoints) > 0:
                ann = np.mean(nearest_neighbor_distances(rpoints))
                ls[r, k] = len(rpoints)/(A*ar)
                lests[r, k] = 0.25/ann**2   # intensity estimated from the a.n.n.
                anns[r, k] = ann
            else:
                ls[r, k] = np.nan
                lests[r, k] = np.nan
                anns[r, k] = np.nan
    plt.figure()
    plt.subplot(1, 2, 1)
    plt.title('total n = %d' % n)
    plt.errorbar(aspect_ratios, np.nanmean(ls, axis=0), yerr=np.nanstd(ls, axis=0)/np.sqrt(nr), fmt='o')
    plt.plot([0.0, 1.0], [l, l], '-b')
    plt.xlim(0.0, 1.0)
    plt.ylim(0.0, 1.5*l)
    plt.xlabel('aspect ratio')
    plt.ylabel('intensity')
    plt.subplot(1, 2, 2)
    plt.title('total n = %d' % n)
    plt.errorbar(aspect_ratios, np.nanmean(anns, axis=0), yerr=np.nanstd(anns, axis=0)/np.sqrt(nr), fmt='o')
    plt.plot([0.0, 1.0], [de, de], '-b')
    plt.plot(aspect_ratios, de+sem, '-c')
    plt.plot(aspect_ratios, de+sem*ratio**0.5, '-m')
    plt.xlim(0.0, 1.0)
    plt.ylim(0.9*de, 1.7*de)
    plt.xlabel('aspect ratio')
    plt.ylabel('a.n.n.')
    plt.tight_layout()


/usr/lib/python2.7/dist-packages/numpy/lib/nanfunctions.py:220: RuntimeWarning: All-NaN axis encountered
  warnings.warn("All-NaN axis encountered", RuntimeWarning)

First, the observed average nearest-neighbor distance is always larger than expected from the Poisson process. This is because of the border points.

The more the area deviates from a square, the more the average nearest-neighbor distance deviates from the theory.

The observed average nearest-neighbor distance is quite well explained by the theoretical value plus one s.e.m. If, however, the number of points in the area is less than about 15, the observed average nearest-neighbor distance is even higher.

Multiplying the s.e.m. by $$\sqrt{\frac{1}{2} \left( 1+\frac{1}{a_r} \right)}$$ where $a_r$ is the aspect ratio of the area, improves the estimate in particular for small $n$, but overestimates the distance for larger numbers of points.
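
A small numerical illustration of this correction (a sketch only; the number of points and the aspect ratio are arbitrary choices, and numpy is assumed to be imported as above):

# expected a.n.n. of Poisson points in a rectangle, corrected for border effects:
n = 100                              # expected number of points in the rectangle
A = 1.0                              # area of the rectangle
ar = 0.5                             # aspect ratio (short side over long side)
de = 0.5*np.sqrt(A/n)                # ideal mean nearest-neighbor distance
sem = 0.52272*de/np.sqrt(n)          # standard error of the mean nn distance
corr = np.sqrt(0.5*(1.0 + 1.0/ar))   # aspect-ratio dependent correction factor
print('ideal mean nn distance     : %.4f' % de)
print('expected a.n.n. with border: %.4f' % (de + sem*corr))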

How to test for randomness?

Let's bootstrap the distribution of mean nearest-neighbor distances and compare the results with the z-score method:


In [22]:
def bootstrap_mean_distance(points, n, a=1.0, plot=False, title=''):
    """
    For some given points in 2-D estimate the significance of being different from Poisson data
    using z-scores and bootstrap.
    
    n: average number of points
    a: length of side of the square
    """
    min_dists = nearest_neighbor_distances(points)
    md = np.mean(min_dists)
    
    # significance based on true intensity:
    se = 0.26136*a/n
    zann = (md-0.5*a/np.sqrt(n))/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
        
    # estimate Poisson parameter:
    nn = len(points)
    # corrected mean-neighbor distance:
    cmd = md*(1.0 - 0.52272/np.sqrt(nn))
    A = 4.0 * cmd**2 * nn
    aa = np.sqrt(A)
    #aa = np.sqrt(np.prod(np.max(points, axis=0) - np.min(points, axis=0)))
    #aa += md
    
    # significance based on observed intensity:
    seo = 0.26136*aa/nn
    zanno = (md-0.5*aa/np.sqrt(nn))/seo
    po = 2.0*(1.0 - st.norm.cdf(np.abs(zanno)))
    
    # bootstrap:
    rn = 500
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn, 2)*aa
        bx = np.random.rand(np.random.poisson(nn), 2)*aa
        distances = sp.distance_matrix(bx, bx)
        np.fill_diagonal(distances, np.nan)
        min_dists = np.nanmin(distances, axis=0)
        bmd[r] = np.mean(min_dists)
    pb = 0.01*st.percentileofscore(bmd, md)
    if pb > 0.5:
        pb = 1.0-pb
    pb *= 2.0
    
    if plot == 1:
        plt.subplot(1, 2, 1)
        plt.plot(points[:, 0], points[:, 1], 'o')
        plt.xlim(-0.1*aa, 1.1*aa)
        plt.ylim(-0.1*aa, 1.1*aa)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.subplot(1, 2, 2)
        plt.hist(bmd, 20, normed=True)
        plt.plot([md, md], [0, 0.4*np.sqrt(len(min_dists))/np.std(min_dists)], 'r', lw=2)
        plt.xlabel('mean nearest-neighbor distances')
        plt.ylabel('pdf')
        plt.suptitle(title)
    return p, po, pb
    
nt = 100
p_counter = 0
po_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 20.0
    points = np.random.rand(np.random.poisson(n), 2)*a
    
    p, po, pb = bootstrap_mean_distance(points, n, a, j==1, 'Poisson')
    if p < 0.05:
        p_counter += 1
    if po < 0.05:
        po_counter += 1
    if pb < 0.05:
        bp_counter += 1

print('Poisson data: percentage of significant deviations')
print('  true z-score    : %.1f%%' % (100.0*float(p_counter)/float(nt)))
print('  observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
print('  bootstrap       : %.1f%%' % (100.0*float(bp_counter)/float(nt)))


Poisson data: percentage of significant deviations
  true z-score    : 30.0%
  observed z-score: 0.0%
  bootstrap       : 0.0%

The Z-score based on the true mean number of points and area strongly overestimates significant deviations from randomness.

The z-score and the bootstrap based on enlarging the area such that the Poisson mean nearest-neighbor distance matches the observed one result, of course, in no deviation from the Poisson case - independent of whether the original points are Poisson distributed or not!


In [23]:
def bootstrap_grid(noisesd=0.0):
    nt = 100
    p_counter = 0
    po_counter = 0
    bp_counter = 0
    for j in range(nt):
        # generate "original" regularly spaced data:
        n = 100   # must be square number
        a = 20.0
        dx = a/np.sqrt(n)
        points = np.zeros((n, 2))
        i = 0
        for x in np.linspace(0.0, a, np.sqrt(n)):
            for y in np.linspace(0.0, a, np.sqrt(n)):
                points[i,:] = np.array([x, y]) + noisesd*dx*np.random.randn(2)
                i += 1
                
        p, po, pb = bootstrap_mean_distance(points, n, a, j==1, 'Grid $\sigma$=%g' % noisesd)
        if p < 0.05:
            p_counter += 1
        if po < 0.05:
            po_counter += 1
        if pb < 0.05:
            bp_counter += 1
    
    print('Regular grid with sd noise = %g: percentage of significant deviations' % noisesd)
    print('  true z-score    : %.1f%%' % (100.0*float(p_counter)/float(nt)))
    print('  observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
    print('  bootstrap       : %.1f%%' % (100.0*float(bp_counter)/float(nt)))
    print('')

for nsd in [0.0, 0.5, 1.0, 3.0]:
    plt.figure()
    bootstrap_grid(nsd)


Regular grid with sd noise = 0: percentage of significant deviations
  true z-score    : 100.0%
  observed z-score: 0.0%
  bootstrap       : 0.0%

Regular grid with sd noise = 0.5: percentage of significant deviations
  true z-score    : 100.0%
  observed z-score: 0.0%
  bootstrap       : 0.0%

Regular grid with sd noise = 1: percentage of significant deviations
  true z-score    : 100.0%
  observed z-score: 0.0%
  bootstrap       : 0.0%

Regular grid with sd noise = 3: percentage of significant deviations
  true z-score    : 100.0%
  observed z-score: 0.0%
  bootstrap       : 0.0%

Because we are using the observed mean nearest-neighbor distance to come up with a good estimate of the intensity of the Poisson process, we cannot use it for testing whether the data are Poisson!

Use CV for testing


In [24]:
def bootstrap_cv(points, n, a=1.0, plot=False, title=''):
    """
    For some given points in 2-D estimate the significance of being different from Poisson data
    using z-scores and bootstrap.
    
    n: average number of points
    a: length of side of the square
    """
    min_dists = nearest_neighbor_distances(points)
    md = np.mean(min_dists)
    sd = np.std(min_dists)
    cv = sd/md
    
    # significance based on z-score:
    se = 2.0*0.26136/np.sqrt(n)
    zcv = (cv - 0.52272)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zcv)))
        
    # estimate Poisson parameter:
    nn = len(points)
    # corrected mean-neighbor distance:
    cmd = md*(1.0 - 0.52272/np.sqrt(nn))
    A = 4.0 * cmd**2 * nn
    aa = np.sqrt(A)
    #aa = np.sqrt(np.prod(np.max(points, axis=0) - np.min(points, axis=0)))
    #aa += md
    
    # significance based on observed intensity:
    cve = 0.52272
    seo = 2.0*0.26136/np.sqrt(nn)
    zcvo = (cv - 0.52272)/seo
    po = 2.0*(1.0 - st.norm.cdf(np.abs(zcvo)))
    
    # bootstrap:
    rn = 500
    bcv = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn, 2)*aa
        bx = np.random.rand(np.random.poisson(nn), 2)*aa
        distances = sp.distance_matrix(bx, bx)
        np.fill_diagonal(distances, np.nan)
        min_dists = np.nanmin(distances, axis=0)
        bcv[r] = np.std(min_dists)/np.mean(min_dists)
    pb = 0.01*st.percentileofscore(bcv, cv)
    if pb > 0.5:
        pb = 1.0-pb
    pb *= 2.0
    
    if plot == 1:
        plt.subplot(1, 2, 1)
        plt.plot(points[:, 0], points[:, 1], 'o')
        plt.xlim(-0.1*aa, 1.1*aa)
        plt.ylim(-0.1*aa, 1.1*aa)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.subplot(1, 2, 2)
        plt.hist(bcv, 20, normed=True)
        x = np.linspace(cv-4.0*seo, cv+4.0*seo, 200)
        y = st.norm.pdf(x, loc=cve, scale=seo)
        plt.plot(x, y, 'r', lw=2)
        plt.plot([cv, cv], [0, 0.4/seo], 'r', lw=2)
        plt.xlabel('CV')
        plt.ylabel('pdf')
        plt.suptitle(title)
    return cv, p, po, pb
    
nt = 50
p_counter = 0
po_counter = 0
bp_counter = 0
cvs = np.zeros(nt)
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 20.0
    points = np.random.rand(np.random.poisson(n), 2)*a
    
    cv, p, po, pb = bootstrap_cv(points, n, a, j==1, 'Poisson')
    cvs[j] = cv
    if p < 0.05:
        p_counter += 1
    if po < 0.05:
        po_counter += 1
    if pb < 0.05:
        bp_counter += 1

print('Poisson data: percentage of significant deviations')
print('  true z-score    : %.1f%%' % (100.0*float(p_counter)/float(nt)))
print('  observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
print('  bootstrap       : %.1f%%' % (100.0*float(bp_counter)/float(nt)))
print('  observed CV     : %.3f +- %.3f' % (np.mean(cvs), np.std(cvs)))


Poisson data: percentage of significant deviations
  true z-score    : 4.0%
  observed z-score: 4.0%
  bootstrap       : 6.0%
  observed CV     : 0.536 +- 0.049

The bootstrapped distribution of CVs is biased to larger CVs, probably because of the edges.


In [25]:
def bootstrap_grid_cv(noisesd=0.0):
    nt = 50
    p_counter = 0
    po_counter = 0
    bp_counter = 0
    cvs = np.zeros(nt)
    for j in range(nt):
        # generate "original" regularly spaced data:
        n = 100   # must be square number
        a = 20.0
        dx = a/np.sqrt(n)
        points = np.zeros((n, 2))
        i = 0
        for x in np.linspace(0.0, a, np.sqrt(n)):
            for y in np.linspace(0.0, a, np.sqrt(n)):
                points[i,:] = np.array([x, y]) + noisesd*dx*np.random.randn(2)
                i += 1
                
        cv, p, po, pb = bootstrap_cv(points, n, a, j==1, 'Grid $\sigma$=%g' % noisesd)
        cvs[j] = cv
        if p < 0.05:
            p_counter += 1
        if po < 0.05:
            po_counter += 1
        if pb < 0.05:
            bp_counter += 1
    
    print('Regular grid with sd noise = %g: percentage of significant deviations' % noisesd)
    print('  true z-score    : %.1f%%' % (100.0*float(p_counter)/float(nt)))
    print('  observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
    print('  bootstrap       : %.1f%%' % (100.0*float(bp_counter)/float(nt)))
    print('  observed CV     : %.3f +- %.3f' % (np.mean(cvs), np.std(cvs)))
    print('')

for nsd in [0.0, 0.5, 0.8, 2.0]:
    plt.figure()
    bootstrap_grid_cv(nsd)


Regular grid with sd noise = 0: percentage of significant deviations
  true z-score    : 100.0%
  observed z-score: 100.0%
  bootstrap       : 100.0%
  observed CV     : 0.000 +- 0.000

Regular grid with sd noise = 0.5: percentage of significant deviations
  true z-score    : 12.0%
  observed z-score: 12.0%
  bootstrap       : 26.0%
  observed CV     : 0.476 +- 0.046

Regular grid with sd noise = 0.8: percentage of significant deviations
  true z-score    : 8.0%
  observed z-score: 8.0%
  bootstrap       : 6.0%
  observed CV     : 0.555 +- 0.049

Regular grid with sd noise = 2: percentage of significant deviations
  true z-score    : 78.0%
  observed z-score: 78.0%
  bootstrap       : 70.0%
  observed CV     : 0.710 +- 0.092

Test for randomness

  1. Count the number of points and use it as an estimate of the mean number of points $$N_{est} = N$$
  2. Compute the average nearest-neighbor distance $$\bar d = \frac{1}{N}\sum_i d_i$$
  3. Estimate the expected mean nearest-neighbor distance of a Poisson process on an infinite area by correcting $\bar d$ for the border bias: $$d_e = \bar d \cdot \left(1 - 2 \cdot \frac{0.26136}{\sqrt{N}} \right)$$
  4. Use this to estimate the area from which the Poisson points might have been drawn as $$A_{est} = 4 d_e^2 N$$
  5. The intensity of the corresponding Poisson processs is $$\lambda_{est} = \frac{N_{est}}{A_{est}}$$ and the corresponding expected value for the mean nearest-neighbor distance for the Poisson process is $$d_e = \frac{1}{2\sqrt{\lambda_{est}}}$$
  6. For bootstrapping generate $N$ Poisson points within the area $A_{est}$.
  7. From the bootstrap compute the distribution of the coefficient of variation of the distances between nearest neighbors and compare it to the measured one. The CV of the Poisson process is 0.52272, but only without edges.
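
The following is a compact sketch of this recipe (it relies on the nearest_neighbor_distances() helper and the imports from above; the number of bootstrap resamples is an arbitrary choice, and the returned value is a two-sided bootstrap p-value):

def test_randomness(points, rn=500):
    # 1. observed number of points as estimate of the mean number:
    N = len(points)
    # 2. average nearest-neighbor distance and its CV:
    dists = nearest_neighbor_distances(points)
    dbar = np.mean(dists)
    cv = np.std(dists)/dbar
    # 3. correct the a.n.n. for the border bias:
    de = dbar*(1.0 - 2.0*0.26136/np.sqrt(N))
    # 4. estimate the area and 5. the corresponding intensity:
    Aest = 4.0*de**2*N
    lest = N/Aest   # kept for completeness, not needed below
    # 6. bootstrap Poisson points within the estimated (square) area and
    # 7. compare the bootstrapped CVs with the observed one:
    aa = np.sqrt(Aest)
    bcv = np.zeros(rn)
    for r in range(rn):
        bx = np.random.rand(np.random.poisson(N), 2)*aa
        bd = nearest_neighbor_distances(bx)
        bcv[r] = np.std(bd)/np.mean(bd)
    pb = 0.01*st.percentileofscore(bcv, cv)
    if pb > 0.5:
        pb = 1.0 - pb
    return 2.0*pb

# Poisson points on a square should rarely be flagged as non-random:
points = np.random.rand(np.random.poisson(100), 2)*20.0
print('p = %.3f' % test_randomness(points))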