The number of events $N$ within an interval $X$ is Poisson distributed $$P(N) = \frac{(\lambda X)^N}{N!}e^{-\lambda X}$$ with mean $$\langle N \rangle = \lambda X$$ and variance $$\sigma^2_N = \lambda X$$ That is, the Fano factor (variance of a count divided by the mean) is $$F=\frac{\sigma^2_N}{\langle N \rangle} = 1$$
The intensity $\lambda$ of the Poisson process can be estimated from the average number of events $\langle N \rangle$ within an interval $X$: $$\lambda = \frac{\langle N \rangle}{X}$$ The unit of $\lambda$ is the inverse of the unit of $X$.
In [2]:
%matplotlib inline
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
# generate poisson events via the interval distribution:
n = 100000
l = 2.0 # lambda, intensity of the poisson point process
x = np.cumsum(np.random.exponential(1.0/l, n))
# count statistics:
X = 4.0
windows = np.arange(0.0, x[-1], X)
counts, _ = np.histogram(x, windows)
mean_count = np.mean(counts)
variance_count = np.var(counts)
print('lambda*X : %.1f' % (l*X))
print('mean count : %.1f' % mean_count)
print('variance : %.1f' % variance_count)
print('Fano factor: %.2f' % (variance_count/mean_count))
# count histogram:
plt.title('Poisson distribution of counts')
cr = np.arange(0, np.max(counts)+2, 1)
plt.hist(counts, cr-0.5, density=True)
pp = st.poisson.pmf(cr, l*X)
plt.plot(cr, pp, 'r', lw=2)
plt.xlim(0, 3*l*X)
plt.xlabel('Counts in window X')
plt.ylabel('Probability')
Out[2]:
The nearest-neighbor distances, i.e. the intervals $\Delta x$ between events, are exponentially distributed $$p(\Delta x) = \lambda e^{-\lambda\Delta x}$$ The mean $$\langle \Delta x \rangle = \frac{1}{\lambda} = \frac{X}{\langle N \rangle}$$ is the same as for regularly distributed points. The standard deviation is $$\sigma_{\Delta x} = \frac{1}{\lambda}$$ and the coefficient of variation (standard deviation divided by mean) is $$CV = \frac{\sigma_{\Delta x}}{\langle \Delta x \rangle} = 1$$
In [3]:
# intervals:
dx = np.diff(x)
# statistics:
mean_dx = np.mean(dx)
std_dx = np.std(dx)
print('1/lambda: %.2f' % (1.0/l))
print('mean dx : %.2f' % mean_dx)
print('std : %.2f' % std_dx)
print('CV : %.2f' % (std_dx/mean_dx))
# interval histogram:
plt.title('Exponential distribution of intervals')
intervals = np.arange(0.0, np.max(dx), 0.1/l)
plt.hist(dx, intervals, density=True)
pe = st.expon.pdf(intervals, 0.0, 1.0/l)
plt.plot(intervals, pe, 'r', lw=2)
plt.xlim(0.0, 5.0/l)
plt.xlabel('Nearest neighbor distances $dx$')
plt.ylabel('pdf')
Out[3]:
The exponential distribution of intervals is a special case of the Weibull distribution $$p_w(x)=\frac{k}{\alpha}\left(\frac{x}{\alpha}\right)^{k-1}e^{-(x/\alpha)^k}$$ with $k=1$ and $\alpha=1/\lambda$.
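As a quick numerical check (a minimal sketch; the intensity `l` is assumed to still be defined from the cells above), the Weibull pdf with $k=1$ and $\alpha=1/\lambda$ indeed coincides with the exponential interval distribution:

# check that the Weibull distribution with k=1 equals the exponential interval distribution:
xx = np.linspace(0.0, 5.0/l, 200)
pe = st.expon.pdf(xx, 0.0, 1.0/l)           # exponential with mean 1/lambda
pw = st.weibull_min.pdf(xx, 1, 0.0, 1.0/l)  # Weibull with k=1, alpha=1/lambda
print('max difference: %g' % np.max(np.abs(pe - pw)))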
The probability of observing an event in a small interval $\Delta x$ is $$P = \lambda \Delta x$$ By drawing a uniformly distributed random number for each small interval $\Delta x$ and placing an event whenever that number is smaller than $P$, one can generate a Poisson process. Note, however, that this is computationally expensive, since far more random numbers than actual points need to be generated.
In [4]:
N = 20 # mean number of points
a = 2.0 # interval on which to distribute the points
l = N/a # intensity
dx = 0.001
P = l*dx
x = np.arange(0.0, a+dx, dx)
r = np.random.rand(len(x))
points = x[r<P]
plt.eventplot(points, colors=['red'])
plt.ylim(0.4, 1.6)
plt.yticks([])
Out[4]:
In [5]:
def statistics1d(points, ne, ae, w=1.0):
    le = ne/ae
    n = len(points)
    a = points[-1] - points[0]
    counts, _ = np.histogram(points, np.arange(0.0, points[-1]+w, w))
    intervals = np.diff(points)
    mint = np.mean(intervals)
    print('expected number of points: %d' % ne)
    print('number of points : %d' % n)
    print('expected interval : %.3f' % ae)
    print('total interval : %.3f' % a)
    print('expected mean interval : %.3f' % (ae/ne))
    print('mean interval : %.3f' % mint)
    print('expected intensity : %.3f' % (ne/ae))
    print('observed intensity 1 : %.3f' % (1.0/mint))
    print('observed intensity 2 : %.3f' % ((n-1)/a))
    plt.subplot(1, 2, 1)
    cr = np.arange(0, np.max(counts)+2, 1)
    plt.hist(counts, cr-0.5, density=True)
    pp = st.poisson.pmf(cr, w*ne/ae)
    plt.plot(cr, pp, 'r', lw=2)
    plt.subplot(1, 2, 2)
    plt.hist(intervals, 30, density=True)
    dx = np.linspace(0.0, np.max(intervals), 200)
    pe = st.expon.pdf(dx, 0.0, 1.0/le)
    plt.plot(dx, pe, 'r', lw=2)

x = np.arange(0.0, 100*a+dx, dx)
r = np.random.rand(len(x))
points = x[r<P]
statistics1d(points, 100*N, 100*a, 0.2)
Given an interval $a$ in which on average $n$ points should be randomly placed, we first need to draw the actual number of points from the Poisson distribution and then randomly distribute these points on the interval $a$. The intensity of the Poisson process is $\lambda = N/a$. For computing the interval statistics we need to sort the events.
In [6]:
n = np.random.poisson(N) # number of points
x = np.random.rand(n)*a
points = np.sort(x)
plt.eventplot(points, colors=['red'])
plt.ylim(0.4, 1.6)
plt.yticks([])
Out[6]:
We can even skip drawing the number of events from the Poisson distribution and simply place a fixed number of points. Keep in mind, however, that this causes trouble when doing a bootstrap, as we will see below.
In [7]:
points = np.sort(np.random.rand(N*100)*a*100)
statistics1d(points, N*100, a*100, 0.2)
Distributing points by drawing them from a uniform distribution is additive: superimposing several such sets of points again yields Poisson distributed points, with a correspondingly increased intensity.
In [8]:
points = np.array([])
for k in range(5):
    print('Iteration %d:' % (k+1))
    # draw some uniformly distributed points:
    newpoints = np.random.rand(N*100)*a*100
    # add them to the previous ones:
    points = np.sort(np.concatenate((points, newpoints)))
    plt.figure()
    statistics1d(points, N*100*(k+1), a*100, 0.2)
    print('')
We can also generate a Poisson process based on the interval distribution. This method is about as efficient as the previous one, because drawing random numbers from an exponential distribution is more expensive than drawing them from a uniform distribution, which offsets the cost of sorting that we save.
In [9]:
dx = np.random.exponential(1.0/l, N)
x = np.cumsum(dx)
points = x[x<a] # truncate excess points
plt.eventplot(points, colors=['red'])
plt.ylim(0.4, 1.6)
plt.yticks([])
Out[9]:
In [10]:
x = np.cumsum(np.random.exponential(1.0/l, N*100))
points = x[x<a*100]
statistics1d(points, N*100, a*100, 0.2)
In 2-D, the mean nearest-neighbor distance is commonly used for testing whether points are randomly placed. Here we first explore this approach for the 1-D case.
The average of the mean interval is unbiased when the bootstrap draws points with a fixed intensity rather than with a fixed number of points. The standard error of the mean can be estimated from the standard deviation of the intervals by dividing by $\sqrt{n}$.
In [11]:
npoints = 100
l = npoints/a
de = a/npoints
sem = a/npoints/np.sqrt(npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
for k in range(r):
    # generate points:
    points = np.sort(np.random.rand(np.random.poisson(npoints)))*a
    #points = np.cumsum(np.random.exponential(1.0/l, npoints))
    # compute nearest neighbor distances:
    min_dists = np.diff(points)
    # statistics:
    do = np.mean(min_dists)
    md[k] = do
    sd[k] = np.std(min_dists)
do = np.mean(md)
print('expected mean : %.3f' % de)
print('observed mean : %.3f' % do)
print('average std : %.3f' % np.mean(sd))
print('sem : %.3f' % (np.mean(sd)/np.sqrt(npoints)))
print('std observed mean: %.3f' % np.std(md))
print('bias : %.3f' % (do-de))
plt.hist(md, 30, density=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')
plt.ylabel('pdf')
Out[11]:
When bootstrapping with a fixed number of points, we get the wrong distribution: it is non-Gaussian and narrower than expected from the standard error.
In [12]:
npoints = 100
l = npoints/a
de = a/npoints
sem = a/npoints/np.sqrt(npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
for k in range(r):
    # generate points:
    points = np.sort(np.random.rand(npoints))*a
    # compute nearest neighbor distances:
    min_dists = np.diff(points)
    # statistics:
    do = np.mean(min_dists)
    md[k] = do
    sd[k] = np.std(min_dists)
do = np.mean(md)
print('expected mean : %.3f' % de)
print('observed mean : %.3f' % do)
print('average std : %.3f' % np.mean(sd))
print('sem : %.3f' % (np.mean(sd)/np.sqrt(npoints)))
print('std observed mean: %.3f' % np.std(md))
print('bias : %.3f' % (do-de))
plt.hist(md, 30, density=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')
plt.ylabel('pdf')
Out[12]:
The mean nearest-neighbor distance of the 1-D Poisson process is the same as the one for regularly spaced points. Therefore we cannot test for random points by bootstrapping the mean nearest-neighbor distance.
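A minimal sketch of this point (the sample size `n_demo` and interval length `a_demo` are arbitrary demo values): the mean spacing is approximately $a/n$ for both regularly spaced and Poisson points, so the mean interval alone cannot distinguish the two.

# mean spacing of regular versus Poisson points on the same interval:
n_demo = 1000
a_demo = 2000.0
regular = np.linspace(0.0, a_demo, n_demo)
poisson = np.sort(np.random.rand(np.random.poisson(n_demo)))*a_demo
print('regular mean interval: %.3f' % np.mean(np.diff(regular)))
print('poisson mean interval: %.3f' % np.mean(np.diff(poisson)))
print('expected a/n         : %.3f' % (a_demo/n_demo))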
The following code just confirms that bootstrapping the mean nearest-neighbor distance behaves as expected when the true parameters of the Poisson process are known; it is not applicable in real-world scenarios.
In [13]:
nt = 100
p_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 2000.0
    #x = np.cumsum(np.random.exponential(a/n, n))
    x = np.sort(np.random.rand(np.random.poisson(n)))*a
    # nearest neighbors statistics:
    dx = np.diff(np.sort(x))
    md = np.mean(dx)
    # significance:
    se = a/n/np.sqrt(n)
    zann = (md-a/n)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
    if p < 0.05:
        p_counter += 1
    # bootstrap:
    rn = 2000
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(n)*a # this does not work!!!
        bx = np.random.rand(np.random.poisson(n))*a
        bdx = np.diff(np.sort(bx))
        bmd[r] = np.mean(bdx)
    lp, up = np.percentile(bmd, [2.5, 97.5])
    if md < lp or md > up:
        bp_counter += 1
    if j == 1:
        plt.hist(bmd, 30, density=True)
        plt.plot([md, md], [0, 0.4*np.sqrt(len(dx))/np.std(dx)], 'r', lw=2)
print('Percentage of significant deviations (z-score) : %5.1f%%' % (100.0*float(p_counter)/float(nt)))
print('Percentage of significant deviations (bootstrap): %5.1f%%' % (100.0*float(bp_counter)/float(nt)))
Bootstrap and z-score work as expected: in 5% of the cases we get a significant difference from a Poisson process.
In a real-world application we have to estimate the parameter of the Poisson process from the data. Consequently, the bootstrapped distribution is centered around the observed mean, and both the z-score and the bootstrap will indicate Poisson data in all cases.
In [14]:
nt = 100
p_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 2000.0
    #x = np.cumsum(np.random.exponential(a/n, n))
    x = np.sort(np.random.rand(np.random.poisson(n)))*a
    # nearest neighbors statistics:
    dx = np.diff(np.sort(x))
    md = np.mean(dx)
    # expected mean nearest-neighbor distance:
    de = (x[-1]-x[0])/len(x)
    # significance:
    se = de/np.sqrt(n)
    zann = (md-de)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
    if p < 0.05:
        p_counter += 1
    # bootstrap:
    nn = len(x)
    aa = md*nn
    rn = 1000
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn)*a # this does not work!!!
        bx = np.random.rand(np.random.poisson(nn))*aa
        bdx = np.diff(np.sort(bx))
        bmd[r] = np.mean(bdx)
    lp, up = np.percentile(bmd, [2.5, 97.5])
    if md < lp or md > up:
        bp_counter += 1
    if j == 1:
        plt.hist(bmd, 30, density=True)
        plt.plot([md, md], [0, 0.4*np.sqrt(len(dx))/np.std(dx)], 'r', lw=2)
print('Percentage of significant deviations (z-score) : %5.1f%%' % (100.0*float(p_counter)/float(nt)))
print('Percentage of significant deviations (bootstrap): %5.1f%%' % (100.0*float(bp_counter)/float(nt)))
Now let's bootstrap some regularly spaced points:
In [15]:
nt = 100
p_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original", regularly spaced data:
    n = 100
    a = 2000.0
    x = np.linspace(0.0, a, n) + (0.02*a/n)*np.random.randn(n)
    # nearest neighbors statistics:
    dx = np.diff(np.sort(x))
    md = np.mean(dx)
    # expected mean nearest-neighbor distance:
    de = (x[-1]-x[0])/len(x)
    # significance:
    se = de/np.sqrt(n)
    zann = (md-de)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
    if p < 0.05:
        p_counter += 1
    # bootstrap:
    nn = len(x)
    aa = md*nn
    rn = 1000
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn)*a # this does not work!!!
        bx = np.random.rand(np.random.poisson(nn))*aa
        bdx = np.diff(np.sort(bx))
        bmd[r] = np.mean(bdx)
    lp, up = np.percentile(bmd, [2.5, 97.5])
    if md < lp or md > up:
        bp_counter += 1
    if j == 1:
        plt.hist(bmd, 30, density=True)
        plt.plot([md, md], [0, 0.04*np.sqrt(len(dx))/np.std(dx)], 'r', lw=2)
print('Percentage of significant deviations (z-score) : %5.1f%%' % (100.0*float(p_counter)/float(nt)))
print('Percentage of significant deviations (bootstrap): %5.1f%%' % (100.0*float(bp_counter)/float(nt)))
None of the data sets is significantly different from Poisson data. This is not what we want, but we understand why this is the case.
The intensity of the Poisson process sets the mean number of events $\langle N \rangle$ per area $A$: $$\lambda=\frac{\langle N \rangle}{A}$$ The unit of $\lambda$ is the inverse of the unit of $A$.
Within some area $A$, the number of points is Poisson distributed, exactly as in the 1-D case: $$P(N) = \frac{(\lambda A)^N}{N!}e^{-\lambda A}$$ with mean $$\langle N \rangle = \lambda A$$ and variance $$\sigma^2_N = \lambda A$$
In [16]:
# generate poisson events in a plane:
n = 10000
a = 40.0
A = a*a
points = np.random.rand(np.random.poisson(n), 2)*a
l = n/A
print('lambda = %g' % l)
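A quick check of the count statistics for these points (a minimal sketch; `points`, `a`, and `l` are taken from the cell above, and the quadrat size `w` is an arbitrary choice): counting the points in square quadrats should give a Fano factor close to one.

# count the points in square quadrats and check the Fano factor:
w = 2.0   # side length of the counting quadrats (arbitrary demo value)
edges = np.arange(0.0, a+w, w)
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1], [edges, edges])
counts = counts.ravel()
print('lambda*w^2 : %.2f' % (l*w*w))
print('mean count : %.2f' % np.mean(counts))
print('variance   : %.2f' % np.var(counts))
print('Fano factor: %.2f' % (np.var(counts)/np.mean(counts)))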
The nearest neighbor for each point is the single point that is closest to the point. The corresponding distance $d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$ is the nearest-neighbor distance. In 2-D, nearest neighbor distances $d$ are distributed according to $$p(d)=2\pi\lambda d e^{-\pi\lambda d^2}$$ with mean nearest-neighbor distance $$\langle d \rangle = \frac{1}{2\sqrt{\lambda}} = \frac{1}{2}\sqrt{\frac{A}{\langle N \rangle}}$$ variance $$\sigma_d^2 = \frac{1}{\lambda}\left(\frac{1}{\pi} - \frac{1}{4}\right)$$ standard deviation $$\sigma_d = \frac{1}{\sqrt{\lambda}}\sqrt{\frac{1}{\pi} - \frac{1}{4}} \approx 0.26136 \cdot \frac{1}{\sqrt{\lambda}} = 0.26136 \cdot \sqrt{\frac{A}{\langle N \rangle}} = 0.52272 \cdot \langle d \rangle$$ and coefficient of variation $$CV_d = 2\sqrt{\frac{1}{\pi} - \frac{1}{4}} \approx 0.52272$$
The 2-D nearest-neighbor distribution is a special case of the Weibull distribution $p_w(x)$ with $k=2$ and $\alpha=1/\sqrt{\pi\lambda}$.
It is also a special case of the Rayleigh distribution $$p_r(x) = \frac{x}{\sigma^2}e^{-\frac{x^2}{2\sigma^2}}$$ with $\sigma^2=1/(2\pi\lambda)$.
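A quick numerical check (a sketch; `l` is the intensity from the cell above): both parameterizations give the same density, and the constants 0.26136 and 0.52272 used below follow directly from the variance formula.

# the Weibull (k=2) and Rayleigh parameterizations give the same nearest-neighbor pdf:
xx = np.linspace(0.0, 2.0/np.sqrt(l), 200)
pw = st.weibull_min.pdf(xx, 2, 0.0, 1.0/np.sqrt(np.pi*l))
pr = st.rayleigh.pdf(xx, 0.0, 1.0/np.sqrt(2.0*np.pi*l))
print('max difference    : %g' % np.max(np.abs(pw - pr)))
# the numerical constants used in the text:
print('sqrt(1/pi - 1/4)  : %.5f' % np.sqrt(1.0/np.pi - 0.25))
print('2*sqrt(1/pi - 1/4): %.5f' % (2.0*np.sqrt(1.0/np.pi - 0.25)))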
In [17]:
import scipy.spatial as sp
def nearest_neighbor_distances(points):
    # full pairwise distance matrix; self-distances on the diagonal are set to NaN:
    distances = sp.distance_matrix(points, points)
    np.fill_diagonal(distances, np.nan)
    min_dists = np.nanmin(distances, axis=0)
    return min_dists
In [18]:
min_dists = nearest_neighbor_distances(points)
# statistics of nearest-neighbor distances:
meand = 0.5/np.sqrt(l)
mean_d = np.mean(min_dists)
std_d = np.std(min_dists)
print('0.5/sqrt(lambda): %.3f' % meand)
print('mean distances : %.3f' % mean_d)
print('std : %.3f' % std_d)
print('CV : %.3f' % (std_d/mean_d))
plt.title('Nearest-neighbor distribution')
plt.hist(min_dists, 30, density=True)
x = np.linspace(0.0, meand*3, 200)
pw = st.weibull_min.pdf(x, 2, 0.0, 1.0/np.sqrt(np.pi*l))
plt.plot(x, pw, 'r', lw=2)
plt.xlabel('Nearest-neighbor distance')
plt.ylabel('pdf')
Out[18]:
The average of the mean nearest-neighbor distances, computed from all random points generated within a rectangular area, does not converge to the expected mean $(2\sqrt{\lambda})^{-1}$. The reason is that the nearest-neighbor distances of points close to the border of the region are larger than those of points from the center. This appears to overestimate the mean nearest-neighbor distance by about one standard error of the mean.
The standard deviation of this distribution of mean nearest-neighbor distances, i.e. the standard error of the mean, seems, however, to be correct. This result does not depend on whether we place a fixed number of points in the area or a Poisson distributed one.
In [19]:
npoints = 100
l = npoints/A
de = 0.5*np.sqrt(A/npoints)
sem = 0.26136/np.sqrt(l)/np.sqrt(npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
for k in range(r):
    # generate points:
    #points = np.random.rand(npoints, 2)*a
    points = np.random.rand(np.random.poisson(npoints), 2)*a
    # compute nearest neighbor distances:
    distances = sp.distance_matrix(points, points)
    np.fill_diagonal(distances, np.nan)
    min_dists = np.nanmin(distances, axis=0)
    # statistics:
    do = np.mean(min_dists)
    md[k] = do
    sd[k] = np.std(min_dists)
do = np.mean(md)
print('expected mean : %.2f' % de)
print('observed mean : %.2f' % do)
print('average std : %.2f' % np.mean(sd))
print('sem : %.2f' % (np.mean(sd)/np.sqrt(npoints)))
print('expected sem : %.2f' % sem)
print('std observed mean: %.2f' % np.std(md))
print('bias : %.2f' % (do-de))
plt.hist(md, 30, density=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')
Out[19]:
Now, let's take only points from the center of the area:
In [20]:
npoints = 100
lower = a*0.3
upper = a*0.7
l = npoints/A
de = 0.5*np.sqrt(A/npoints)
r = 1000
md = np.zeros(r)
sd = np.zeros(r)
sed = np.zeros(r)
nn = np.zeros(r)
j = 0
for k in range(r):
    # generate points:
    #points = np.random.rand(npoints, 2)*a
    points = np.random.rand(np.random.poisson(npoints), 2)*a
    # compute nearest neighbor distances:
    distances = sp.distance_matrix(points, points)
    np.fill_diagonal(distances, np.nan)
    min_dists = np.nanmin(distances, axis=0)
    # select points from the center:
    min_dists = min_dists[(points[:,0]>lower)&(points[:,0]<upper)&(points[:,1]>lower)&(points[:,1]<upper)]
    if len(min_dists) == 0:
        continue
    # statistics:
    do = np.mean(min_dists)
    md[j] = do
    sd[j] = np.std(min_dists)
    nn[j] = len(min_dists)
    sed[j] = sd[j]/np.sqrt(len(min_dists))
    j += 1
md = md[:j]
sd = sd[:j]
sed = sed[:j]
nn = nn[:j]
do = np.mean(md)
sem = 0.26136/np.sqrt(l)/np.sqrt(np.mean(nn))
print('expected mean : %.2f' % de)
print('observed mean : %.2f' % do)
print('average std : %.2f' % np.mean(sd))
print('average sem : %.2f' % np.mean(sed))
print('expected sem : %.2f' % sem)
print('std observed mean: %.2f' % np.std(md))
print('bias : %.2f' % (do-de))
plt.hist(md, 30, density=True)
plt.plot([de, de], [0.0, 0.4/sem], 'r', lw=2)
plt.plot([de+sem, de+sem], [0.0, 0.4/sem], 'g', lw=2)
plt.plot([de-sem, de-sem], [0.0, 0.4/sem], 'g', lw=2)
plt.xlabel('mean nearest neighbor distances')
Out[20]:
Eliminating points close to the edges indeed removes the bias. The resulting distribution of mean nearest-neighbor distances is, however, not exactly Gaussian, and both the expected standard error of the mean and the s.e.m. computed from the standard deviation of the nearest-neighbor distances (divided by the square root of the number of points) slightly underestimate the standard deviation of the distribution of mean nearest-neighbor distances.
As we have just convinced ourselves, the mean nearest-neighbor distance cannot simply be estimated by the average of the observed nearest-neighbor distances (known as a.n.n.) $$\langle d \rangle_{est} \ne \bar d = \frac{1}{N} \sum_i d_i$$ because of the bias introduced by the points close to the edge.
An improved, unbiased estimate can be achieved by reducing the a.n.n. by one standard error of the mean: $$\langle d \rangle_{est} = \bar d - \sigma_{\bar d}$$ where the standard error is $$\sigma_{\bar d} = 0.52272 \cdot \frac{\langle d \rangle}{\sqrt{N}} $$ That is $$\langle d \rangle_{est} = \bar d \cdot\left(1 - \frac{0.52272}{\sqrt{N}} \right)$$
Why this correction is (almost?) exactly one s.e.m. I do not know, but see below for more on that.
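The following sketch illustrates the correction on simulated data (it assumes `nearest_neighbor_distances` from above is defined; `n` and `a` are arbitrary demo values):

# compare the raw and the corrected a.n.n. estimate with the true expected value:
n = 200   # expected number of points (demo value)
a = 1.0   # side length of the square (demo value)
pts = np.random.rand(np.random.poisson(n), 2)*a
dists = nearest_neighbor_distances(pts)
ann = np.mean(dists)                               # biased upwards by the border points
ann_corrected = ann*(1.0 - 0.52272/np.sqrt(len(pts)))
print('expected <d>  : %.4f' % (0.5*a/np.sqrt(n)))
print('raw a.n.n.    : %.4f' % ann)
print('corrected ann : %.4f' % ann_corrected)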
By definition, the intensity of the Poisson process is $$\lambda = \frac{\langle N \rangle}{A}$$
For estimating the intensity from data we need an estimate the mean number of points as well as the area: $$\lambda_{est} = \frac{\langle N \rangle_{est}}{A_{est}}$$
We estimate the mean number of points by the actually observed number of points: $$\langle N \rangle_{est} = N$$
How do we estimate the area $A$? The area within the convex hull of the points is certainly too small.
Recall that the intensity can also be estimated from the mean nearest-neighbor distance: $$\lambda_{est} = \frac{1}{4\langle d \rangle^2}$$
Equating the two estimates yields an estimate of the area from which the random points are drawn: $$A_{est} = 4\langle d \rangle^2 N$$
For this we use the corrected estimate for the mean nearest-neighbor distance: $$A_{est} = 4\langle d \rangle_{est}^2 N$$
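Putting this together, a minimal sketch of the estimator (continuing with the demo points `pts` from the sketch above, or any other set of 2-D points):

# estimate intensity and area from the points alone:
N_obs = len(pts)
d_est = np.mean(nearest_neighbor_distances(pts))*(1.0 - 0.52272/np.sqrt(N_obs))
l_est = 0.25/d_est**2         # lambda_est = 1/(4 <d>_est^2)
A_est = 4.0*d_est**2*N_obs    # A_est = 4 <d>_est^2 N
print('true intensity     : %.1f' % (n/(a*a)))
print('estimated intensity: %.1f' % l_est)
print('true area          : %.3f' % (a*a))
print('estimated area     : %.3f' % A_est)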
In the following we generate randomly placed points on a unit square. From these points we then select those falling into a rectangle with one side shortened according to `aspect_ratios`. That is, the smaller the aspect ratio, the more the rectangle deviates from the square.
In [21]:
for n in [20, 100, 500]:
    a = 1.0
    A = a*a
    l = n/A
    de = 0.5/np.sqrt(l)
    nr = 500
    aspect_ratios = np.arange(1.0, 0.0, -0.1)
    sem = 0.52272*de/np.sqrt(n*aspect_ratios)
    areas = A*aspect_ratios
    perimeters = 2.0*a*(1.0 + aspect_ratios)
    alpha = perimeters/areas
    ratio = 0.5*(1.0+1.0/aspect_ratios)
    ls = np.zeros((nr, len(aspect_ratios)))
    lests = np.zeros((nr, len(aspect_ratios)))
    anns = np.zeros((nr, len(aspect_ratios)))
    for r in range(nr):
        #points = np.random.rand(np.random.poisson(n), 2)*a
        points = np.random.rand(n, 2)*a
        for k, ar in enumerate(aspect_ratios):
            rpoints = points[points[:, 1] < ar*a, :]
            if len(rpoints) > 0:
                ann = np.mean(nearest_neighbor_distances(rpoints))
                ls[r, k] = len(rpoints)/(A*ar)
                # intensity estimated from the average nearest-neighbor distance:
                lests[r, k] = 0.25/ann**2
                anns[r, k] = ann
            else:
                ls[r, k] = np.nan
                lests[r, k] = np.nan
                anns[r, k] = np.nan
    plt.figure()
    plt.subplot(1, 2, 1)
    plt.title('total n = %d' % n)
    plt.errorbar(aspect_ratios, np.nanmean(ls, axis=0), yerr=np.nanstd(ls, axis=0)/np.sqrt(nr), fmt='o')
    plt.plot([0.0, 1.0], [l, l], '-b')
    plt.xlim(0.0, 1.0)
    plt.ylim(0.0, 1.5*l)
    plt.xlabel('aspect ratio')
    plt.ylabel('intensity')
    plt.subplot(1, 2, 2)
    plt.title('total n = %d' % n)
    plt.errorbar(aspect_ratios, np.nanmean(anns, axis=0), yerr=np.nanstd(anns, axis=0)/np.sqrt(nr), fmt='o')
    plt.plot([0.0, 1.0], [de, de], '-b')
    plt.plot(aspect_ratios, de+sem, '-c')
    plt.plot(aspect_ratios, de+sem*ratio**0.5, '-m')
    plt.xlim(0.0, 1.0)
    plt.ylim(0.9*de, 1.7*de)
    plt.xlabel('aspect ratio')
    plt.ylabel('a.n.n.')
    plt.tight_layout()
First, the observed average nearest-neighbor distance is always larger than expected for a Poisson process. This is because of the points near the border.
The more the area deviates from a square, the more the average nearest-neighbor distance deviates from the theory.
The observed distance is quite well described by the theoretical value plus one s.e.m. If, however, the number of points in the area is less than about 15, the observed average nearest-neighbor distance is even higher.
Multiplying the s.e.m. by $$\sqrt{\frac{1}{2} \left( 1+\frac{1}{a_r} \right)}$$ where $a_r$ is the aspect ratio of the area improves the estimate, in particular for small $n$, but overestimates the distance for larger numbers of points.
Let's bootstrap the distribution of mean nearest-neighbor distances and compare the results with the z-score method:
In [22]:
def bootstrap_mean_distance(points, n, a=1.0, plot=False, title=''):
    """
    For some given points in 2-D estimate the significance of being different from Poisson data
    using z-scores and bootstrap.
    n: average number of points
    a: length of side of the square
    """
    min_dists = nearest_neighbor_distances(points)
    md = np.mean(min_dists)
    # significance based on true intensity:
    se = 0.26136*a/n
    zann = (md-0.5*a/np.sqrt(n))/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zann)))
    # estimate Poisson parameter:
    nn = len(points)
    # corrected mean nearest-neighbor distance:
    cmd = md*(1.0 - 0.52272/np.sqrt(nn))
    A = 4.0 * cmd**2 * nn
    aa = np.sqrt(A)
    #aa = np.sqrt(np.prod(np.max(points, axis=0) - np.min(points, axis=0)))
    #aa += md
    # significance based on observed intensity:
    seo = 0.26136*aa/nn
    zanno = (md-0.5*aa/np.sqrt(nn))/seo
    po = 2.0*(1.0 - st.norm.cdf(np.abs(zanno)))
    # bootstrap:
    rn = 500
    bmd = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn, 2)*aa
        bx = np.random.rand(np.random.poisson(nn), 2)*aa
        distances = sp.distance_matrix(bx, bx)
        np.fill_diagonal(distances, np.nan)
        min_dists = np.nanmin(distances, axis=0)
        bmd[r] = np.mean(min_dists)
    pb = 0.01*st.percentileofscore(bmd, md)
    if pb > 0.5:
        pb = 1.0-pb
    pb *= 2.0
    if plot:
        plt.subplot(1, 2, 1)
        plt.plot(points[:, 0], points[:, 1], 'o')
        plt.xlim(-0.1*aa, 1.1*aa)
        plt.ylim(-0.1*aa, 1.1*aa)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.subplot(1, 2, 2)
        plt.hist(bmd, 20, density=True)
        plt.plot([md, md], [0, 0.4*np.sqrt(len(min_dists))/np.std(min_dists)], 'r', lw=2)
        plt.xlabel('mean nearest-neighbor distances')
        plt.ylabel('pdf')
        plt.suptitle(title)
    return p, po, pb
nt = 100
p_counter = 0
po_counter = 0
bp_counter = 0
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 20.0
    points = np.random.rand(np.random.poisson(n), 2)*a
    p, po, pb = bootstrap_mean_distance(points, n, a, j==1, 'Poisson')
    if p < 0.05:
        p_counter += 1
    if po < 0.05:
        po_counter += 1
    if pb < 0.05:
        bp_counter += 1
print('Poisson data: percentage of significant deviations')
print(' true z-score : %.1f%%' % (100.0*float(p_counter)/float(nt)))
print(' observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
print(' bootstrap : %.1f%%' % (100.0*float(bp_counter)/float(nt)))
The z-score based on the true mean number of points and the true area strongly overestimates significant deviations from randomness.
The z-score and the bootstrap based on enlarging the area such that the Poisson mean nearest-neighbor distance equals the observed one of course find no deviation from the Poisson case, independent of whether the original points are Poisson distributed or not!
In [23]:
def bootstrap_grid(noisesd=0.0):
    nt = 100
    p_counter = 0
    po_counter = 0
    bp_counter = 0
    for j in range(nt):
        # generate "original" regularly spaced data:
        n = 100 # must be a square number
        a = 20.0
        dx = a/np.sqrt(n)
        points = np.zeros((n, 2))
        i = 0
        for x in np.linspace(0.0, a, int(np.sqrt(n))):
            for y in np.linspace(0.0, a, int(np.sqrt(n))):
                points[i,:] = np.array([x, y]) + noisesd*dx*np.random.randn(2)
                i += 1
        p, po, pb = bootstrap_mean_distance(points, n, a, j==1, r'Grid $\sigma$=%g' % noisesd)
        if p < 0.05:
            p_counter += 1
        if po < 0.05:
            po_counter += 1
        if pb < 0.05:
            bp_counter += 1
    print('Regular grid with sd noise = %g: percentage of significant deviations' % noisesd)
    print(' true z-score : %.1f%%' % (100.0*float(p_counter)/float(nt)))
    print(' observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
    print(' bootstrap : %.1f%%' % (100.0*float(bp_counter)/float(nt)))
    print('')

for nsd in [0.0, 0.5, 1.0, 3.0]:
    plt.figure()
    bootstrap_grid(nsd)
Because we use the observed mean nearest-neighbor distance to estimate the intensity of the Poisson process, we cannot also use it to test whether the data are Poisson!
In [24]:
def bootstrap_cv(points, n, a=1.0, plot=False, title=''):
    """
    For some given points in 2-D estimate the significance of being different from Poisson data
    using z-scores and bootstrap.
    n: average number of points
    a: length of side of the square
    """
    min_dists = nearest_neighbor_distances(points)
    md = np.mean(min_dists)
    sd = np.std(min_dists)
    cv = sd/md
    # significance based on z-score:
    se = 2.0*0.26136/np.sqrt(n)
    zcv = (cv - 0.52272)/se
    p = 2.0*(1.0 - st.norm.cdf(np.abs(zcv)))
    # estimate Poisson parameter:
    nn = len(points)
    # corrected mean nearest-neighbor distance:
    cmd = md*(1.0 - 0.52272/np.sqrt(nn))
    A = 4.0 * cmd**2 * nn
    aa = np.sqrt(A)
    #aa = np.sqrt(np.prod(np.max(points, axis=0) - np.min(points, axis=0)))
    #aa += md
    # significance based on observed intensity:
    cve = 0.52272
    seo = 2.0*0.26136/np.sqrt(nn)
    zcvo = (cv - 0.52272)/seo
    po = 2.0*(1.0 - st.norm.cdf(np.abs(zcvo)))
    # bootstrap:
    rn = 500
    bcv = np.zeros(rn)
    for r in range(rn):
        #bx = np.random.rand(nn, 2)*aa
        bx = np.random.rand(np.random.poisson(nn), 2)*aa
        distances = sp.distance_matrix(bx, bx)
        np.fill_diagonal(distances, np.nan)
        min_dists = np.nanmin(distances, axis=0)
        bcv[r] = np.std(min_dists)/np.mean(min_dists)
    pb = 0.01*st.percentileofscore(bcv, cv)
    if pb > 0.5:
        pb = 1.0-pb
    pb *= 2.0
    if plot:
        plt.subplot(1, 2, 1)
        plt.plot(points[:, 0], points[:, 1], 'o')
        plt.xlim(-0.1*aa, 1.1*aa)
        plt.ylim(-0.1*aa, 1.1*aa)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.subplot(1, 2, 2)
        plt.hist(bcv, 20, density=True)
        x = np.linspace(cv-4.0*seo, cv+4.0*seo, 200)
        y = st.norm.pdf(x, loc=cve, scale=seo)
        plt.plot(x, y, 'r', lw=2)
        plt.plot([cv, cv], [0, 0.4/seo], 'r', lw=2)
        plt.xlabel('CV')
        plt.ylabel('pdf')
        plt.suptitle(title)
    return cv, p, po, pb
nt = 50
p_counter = 0
po_counter = 0
bp_counter = 0
cvs = np.zeros(nt)
for j in range(nt):
    # generate "original" Poisson data:
    n = 100
    a = 20.0
    points = np.random.rand(np.random.poisson(n), 2)*a
    cv, p, po, pb = bootstrap_cv(points, n, a, j==1, 'Poisson')
    cvs[j] = cv
    if p < 0.05:
        p_counter += 1
    if po < 0.05:
        po_counter += 1
    if pb < 0.05:
        bp_counter += 1
print('Poisson data: percentage of significant deviations')
print(' true z-score : %.1f%%' % (100.0*float(p_counter)/float(nt)))
print(' observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
print(' bootstrap : %.1f%%' % (100.0*float(bp_counter)/float(nt)))
print(' observed CV : %.3f +- %.3f' % (np.mean(cvs), np.std(cvs)))
The bootstrapped distribution of CVs is biased toward larger CVs, probably because of edge effects.
In [25]:
def bootstrap_grid_cv(noisesd=0.0):
    nt = 50
    p_counter = 0
    po_counter = 0
    bp_counter = 0
    cvs = np.zeros(nt)
    for j in range(nt):
        # generate "original" regularly spaced data:
        n = 100 # must be a square number
        a = 20.0
        dx = a/np.sqrt(n)
        points = np.zeros((n, 2))
        i = 0
        for x in np.linspace(0.0, a, int(np.sqrt(n))):
            for y in np.linspace(0.0, a, int(np.sqrt(n))):
                points[i,:] = np.array([x, y]) + noisesd*dx*np.random.randn(2)
                i += 1
        cv, p, po, pb = bootstrap_cv(points, n, a, j==1, r'Grid $\sigma$=%g' % noisesd)
        cvs[j] = cv
        if p < 0.05:
            p_counter += 1
        if po < 0.05:
            po_counter += 1
        if pb < 0.05:
            bp_counter += 1
    print('Regular grid with sd noise = %g: percentage of significant deviations' % noisesd)
    print(' true z-score : %.1f%%' % (100.0*float(p_counter)/float(nt)))
    print(' observed z-score: %.1f%%' % (100.0*float(po_counter)/float(nt)))
    print(' bootstrap : %.1f%%' % (100.0*float(bp_counter)/float(nt)))
    print(' observed CV : %.3f +- %.3f' % (np.mean(cvs), np.std(cvs)))
    print('')

for nsd in [0.0, 0.5, 0.8, 2.0]:
    plt.figure()
    bootstrap_grid_cv(nsd)