Testing the look-elsewhere effect by creating 2d chi-square random fields with a Gaussian Process

by Kyle Cranmer, Dec 7, 2015

The correction for the 2d look-elsewhere effect presented in "Estimating the significance of a signal in a multi-dimensional search" by Ofer Vitells and Eilam Gross (http://arxiv.org/pdf/1105.4355v1.pdf) is based on the fact that the test statistic

\begin{equation} q(\nu_1, \nu_2) = -2 \log \frac{ \max_{\theta} L(\mu=0, \nu_1, \nu_2, \theta)}{ \max_{\mu, \theta} L(\mu, \nu_1, \nu_2, \theta)} \end{equation}

is a chi-square random field (with 1 degree of freedom). That means that, at any fixed point $(\nu_1, \nu_2)$, the quantity $q(\nu_1, \nu_2)$ would follow a chi-square distribution if you repeated the experiment many times.

That is what you expect if you have a background model $p_b(x|\theta)$ and you look for a signal on top of it with signal strength $\mu$. Creating such a scan over $(\nu_1, \nu_2)$ from toy experiments is somewhat time-consuming, so here we make realizations of a chi-square random field using a Gaussian Process instead. The main trick we will use is that a chi-square distribution with one degree of freedom is the same as the distribution of $x^2$ when $x$ is standard normally distributed. As you might have guessed, a Gaussian Process (GP) is like a chi-square random field, but it is Gaussian-distributed at each point.

Note that the distributions at different points are not independent; there is some covariance. So if $q(\nu_1, \nu_2)$ is high at one point, you can expect it to be high nearby. We control this behavior via the GP's kernel.
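For the squared-exponential kernel used below (george's ExpSquaredKernel), the covariance between the field values at two scan points falls off with their separation roughly as

\begin{equation} k(\vec{\nu}, \vec{\nu}\,') \propto \exp\left( - \frac{ |\vec{\nu} - \vec{\nu}\,'|^2 }{ 2 \ell^2 } \right), \end{equation}

so points separated by much more than the correlation length $\ell$ are nearly independent, while nearby points are strongly correlated. (Exactly how the argument passed to ExpSquaredKernel maps onto $\ell$ depends on george's parametrization of the kernel metric, so treat the value used below as a rough scale.)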

For more on the theory of Gaussian Processes, the best resource is available for free online: Rasmussen & Williams (2006). We will use george -- a nice Python package for Gaussian Processes (GPs).


In [1]:
%pylab inline --no-import-all


Populating the interactive namespace from numpy and matplotlib

Here's a quick demonstration of that trick: the histogram of $x^2$ for standard normal $x$ matches a chi-square distribution with one degree of freedom.


In [2]:
from scipy.stats import chi2, norm
chi2_array = chi2.rvs(1, size=10000)
norm_array = norm.rvs(size=10000)
_ = plt.hist(chi2_array, bins=100, alpha=.5, label='chi-square')
_ = plt.hist(norm_array**2, bins=100, alpha=.5, color='r', label='x^2')
plt.yscale('log', nonposy='clip')
plt.legend(('chi-square', 'x^2'))
#plt.semilogy()


Out[2]:
<matplotlib.legend.Legend at 0x10d98d5d0>

Ok, now to the Gaussian processes.


In [3]:
import george
from george.kernels import ExpSquaredKernel

In [4]:
# correlation length scale of the kernel, in the scan's coordinates
length_scale_of_correlation = 0.1
kernel = ExpSquaredKernel(length_scale_of_correlation, ndim=2)

In [5]:
# Create the Gaussian process
# gp = george.GP(kernel)
gp = george.GP(kernel, solver=george.HODLRSolver) #faster

In [6]:
n_scan_points=50
aspect_ratio =  10. # make excesses look like stripes
x_scan = np.arange(0,aspect_ratio,aspect_ratio/n_scan_points)
y_scan = np.arange(0,1,1./n_scan_points)
xx, yy = np.meshgrid(x_scan, y_scan)

In [7]:
# reformat the independent coordinates where we evaluate the GP
indep = np.vstack((np.hstack(xx),np.hstack(yy))).T

In [8]:
# illustration of what is being done here
np.vstack([[1,2],[3,4]]).T


Out[8]:
array([[1, 3],
       [2, 4]])

In [9]:
# slow part: pre-compute internal stuff for the GP
gp.compute(indep)

In [10]:
# evaluate one realization of the GP
z = gp.sample(indep)

In [11]:
# reformat output for plotting
zz = z.reshape((n_scan_points,n_scan_points))

In [12]:
# plot the chi-square random field
plt.imshow(zz**2, cmap='gray')
plt.colorbar()


Out[12]:
<matplotlib.colorbar.Colorbar at 0x1122b8d10>

Now let's histogram the values of the random field. Don't get confused here... if you pick a single point and histogram its value over many instances of the field, you expect a Gaussian. However, for a single instance, you don't expect the histogram of the field values to be Gaussian (because of the correlations). Thought experiments: if you make length_scale_of_correlation very small, then each point is essentially independent and you do expect to see a Gaussian; however, if length_scale_of_correlation is very large, then you expect the field to be nearly constant and the histogram below would be a delta function.


In [13]:
# plot the Gaussian-distributed z and the chi-square distributed q = z**2
plt.subplot(1,2,1)
count, edges, patches = plt.hist(np.hstack(zz), bins=100)
plt.xlabel('z')
plt.subplot(1,2,2)
count, edges, patches = plt.hist(np.hstack(zz)**2, bins=100)
plt.xlabel('q=z**2')
plt.yscale('log', nonposy='clip')
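As a quick check of the thought experiment above, one could resample the field with extreme correlation lengths. This is a hypothetical sketch (not a cell from the original notebook); it reuses the indep grid defined earlier and passes a small jitter to compute to keep the kernel matrix well conditioned:

# compare one realization of the field for a tiny vs. a huge correlation length
for ell, label in [(1e-3, 'tiny length scale'), (1e3, 'huge length scale')]:
    k = ExpSquaredKernel(ell, ndim=2)
    gp_test = george.GP(k)
    gp_test.compute(indep, 1e-3)  # small jitter for numerical stability
    z_test = gp_test.sample(indep)
    plt.hist(z_test, bins=50, alpha=0.5, label=label)
plt.xlabel('z')
plt.legend()
# expectation: roughly Gaussian for the tiny length scale,
# nearly a delta function for the huge one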


Ok, now let's repeat that several times and test lee2d


In [14]:
from lee2d import *
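lee2d provides calculate_euler_characteristic, get_coefficients, and global_pvalue. For intuition, here is a minimal, self-contained sketch of one standard way to compute the Euler characteristic of a binary excursion set; it is not lee2d's implementation, and the convention for diagonally touching pixels may differ. Each filled pixel is treated as a closed unit square, and $\chi = V - E + F$ counts vertices, edges, and faces of the union:

import numpy as np

def euler_characteristic_sketch(mask):
    # Euler characteristic of a 2-d binary array: #components - #holes
    m = np.asarray(mask, dtype=bool)
    p = np.pad(m, 1, mode='constant')  # pad with background
    F = m.sum()  # faces: one per filled pixel
    # edges: a unit edge is present if either pixel touching it is filled
    E = (np.logical_or(p[:-1, 1:-1], p[1:, 1:-1]).sum() +  # horizontal edges
         np.logical_or(p[1:-1, :-1], p[1:-1, 1:]).sum())   # vertical edges
    # vertices: a lattice point is present if any of its four neighboring pixels is filled
    V = (p[:-1, :-1] | p[:-1, 1:] | p[1:, :-1] | p[1:, 1:]).sum()
    return V - E + F

# e.g. a 3x3 ring (one component with one hole) has chi = 0
ring = np.ones((3, 3)); ring[1, 1] = 0
print(euler_characteristic_sketch(ring))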

In [15]:
from scipy.ndimage import grey_closing

def fill_holes(array):
    # fill isolated zero-valued pixels (e.g. failed fits in a real likelihood scan)
    # with values from a morphological (grey) closing of the array
    zero_array = array==0.
    temp = grey_closing(array, size=2)*zero_array
    return temp+array

Generate 100 realizations of the GP, calculate the Euler characteristic of the excursion set above each of two thresholds $u_1$ and $u_2$, and use the means of those Euler characteristics to estimate $N_1$ and $N_2$
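The relation behind this procedure comes from Vitells & Gross: for a chi-square random field with one degree of freedom in two dimensions, the expected Euler characteristic of the excursion set $A_u = \{ q > u \}$ can be written (with constant factors absorbed into $N_1$ and $N_2$) as

\begin{equation} E[\phi(A_u)] = P(\chi^2_1 > u) + e^{-u/2} \left( N_1 + \sqrt{u}\, N_2 \right). \end{equation}

Evaluating this at the two levels $u_1$ and $u_2$ and equating it to the measured mean Euler characteristics gives two linear equations, which is what get_coefficients solves for $N_1$ and $N_2$ (up to its exact convention for the constants).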


In [21]:
n_samples = 100
z_array = gp.sample(indep,n_samples)

q_max = np.zeros(n_samples)
phis = np.zeros((n_samples,2))

u1,u2 = 0.5, 1.

n_plots = 3
plt.figure(figsize=(9,n_plots*3))
for scan_no, z in enumerate(z_array):

    scan = z.reshape((n_scan_points,n_scan_points))**2
    
    q_max[scan_no] = np.max(scan)
    
    # fill holes from failures in original likelihood
    scan = fill_holes(scan)

    #get excursion sets above those two levels
    exc1 = (scan>u1) + 0. #add 0. to convert from bool to double
    exc2 = (scan>u2) + 0.
    #print '\nu1,u2 = ', u1, u2
    
    #print 'diff = ', np.sum(exc1), np.sum(exc2)
    
    if scan_no < n_plots:
        aspect = 1.
        plt.subplot(n_plots,3,3*scan_no+1)
        aspect = 1.*scan.shape[0]/scan.shape[1]
        plt.imshow(scan.T, cmap='gray', aspect=aspect)
        plt.subplot(n_plots,3,3*scan_no+2)
        plt.imshow(exc1.T, cmap='gray', aspect=aspect, interpolation='none')
        plt.subplot(n_plots,3,3*scan_no+3)
        plt.imshow(exc2.T, cmap='gray', aspect=aspect, interpolation='none')

    phi1 = calculate_euler_characteristic(exc1)
    phi2 = calculate_euler_characteristic(exc2)
    #print 'phi1, phi2 = ', phi1, phi2
    #print 'q_max = ', np.max(scan)

    phis[scan_no] = [phi1, phi2]

plt.savefig('chi-square-random-fields.png')



In [71]:
exp_phi_1, exp_phi_2 = np.mean(phis[:,0]), np.mean(phis[:,1])
exp_phi_1, exp_phi_2


Out[71]:
(14.359999999999999, 13.76)

In [72]:
n1, n2 = get_coefficients(u1=u1, u2=u2, exp_phi_1=exp_phi_1, exp_phi_2=exp_phi_2)
print n1, n2


7.34442273108 14.81882537

With estimates of $N_1$ and $N_2$, predict the global p-value vs. $u$


In [73]:
u = np.linspace(5,25,100)
global_p = global_pvalue(u,n1,n2)
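The prediction uses the approximation $P(q_{\rm max} > u) \approx E[\phi(A_u)]$, which becomes accurate for large $u$. A minimal sketch of that formula, assuming lee2d.global_pvalue implements it with the same convention for $N_1$, $N_2$ as get_coefficients above:

import numpy as np
from scipy.stats import chi2

def global_pvalue_sketch(u, n1, n2):
    # P(q_max > u) ~ E[phi(A_u)], the expected Euler characteristic above u
    return chi2.sf(u, 1) + np.exp(-u / 2.) * (n1 + np.sqrt(u) * n2)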

Generate 5000 instances of the Gaussian Process, find the maximum local significance for each, and check the prediction for the LEE-corrected global p-value


In [74]:
n_samples = 5000
z_array = gp.sample(indep,n_samples)

q_max = np.zeros(n_samples)

for scan_no, z in enumerate(z_array):
    scan = z.reshape((n_scan_points,n_scan_points))**2
    q_max[scan_no] = np.max(scan)

In [75]:
counts, edges, patches = plt.hist(q_max, bins=30)
icdf = 1.-np.cumsum(counts/n_samples)
icdf = np.hstack((1.,icdf))
icdf_error = np.sqrt(np.cumsum(counts))/n_samples
icdf_error = np.hstack((0.,icdf_error))
plt.xlabel('q_max')
plt.ylabel('counts / bin')


Out[75]:
<matplotlib.text.Text at 0x11c840590>

In [76]:
# plot the p-value 
plt.subplot(121)
plt.plot(edges,icdf, c='r')
plt.errorbar(edges,icdf,yerr=icdf_error)
plt.plot(u, global_p)
plt.xlabel('u')
plt.ylabel('P(q_max >u)')
plt.xlim(0,25)
plt.subplot(122)
plt.plot(edges,icdf, c='r', label='toys')
plt.errorbar(edges,icdf,yerr=icdf_error)
plt.plot(u, global_p, label='prediction')
plt.xlabel('u')
plt.legend(('toys', 'prediction'))
#plt.ylabel('P(q>u)')
plt.ylim(1E-3,10)
plt.xlim(0,25)
plt.semilogy()


Out[76]:
[]

Study statistical uncertainty

Outline:

  1. generate n_samples likelihood scans using the GP
  2. make excursion sets and calculate phi1, phi2 for the levels u1, u2
  3. look at the histograms of phi1, phi2 (notice that they are narrower than Poisson)
  4. look at the 2-d scatter of phi1 vs. phi2 (notice that they are positively correlated)
  5. look at the 2-d scatter of the coefficients n1, n2 (notice that they are negatively correlated)
  6. Compare three ways of propagating the uncertainty to the global p-value (see the note after this list)
    1. Poisson, no correlations: estimate the uncertainty on Exp[phi1] as sqrt(exp_phi_1)/sqrt(n_samples)
    2. Gaussian approx of the observed distribution, no correlations: estimate the uncertainty on Exp[phi1] as std(phi1)/sqrt(n_samples)
    3. Gaussian approx of the observed distribution, with correlations: estimate the covariance of (Exp[phi1], Exp[phi2]) as cov(phi1, phi2)/n_samples -- note that since it's a covariance we divide by n_samples, not sqrt(n_samples)
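Note on item 6.3: this is just the standard error of a mean. For $n$ independent scans,

\begin{equation} \mathrm{Cov}\!\left(\overline{\phi_1}, \overline{\phi_2}\right) = \frac{\mathrm{Cov}(\phi_1, \phi_2)}{n}, \qquad \sigma_{\overline{\phi_i}} = \frac{\sigma_{\phi_i}}{\sqrt{n}}, \end{equation}

so the covariance matrix of the estimated means is the per-scan covariance divided by n_samples, while each standard deviation only shrinks by $1/\sqrt{n_{\rm samples}}$.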

Conclusions:

The number of islands (as quantified by the Euler characteristic) is not Poisson distributed. The deviation from the Poisson distribution will depend on the properties of the underlying 2-d fit (equivalently, the Gaussian Process kernel). In this example, the deviation isn't that big. It is probably generic that the uncertainty in phi is smaller than Poisson because one can only fit so many islands into the scan... so it's probably more like a Binomial.

Unsurprisingly, there is also a positive correlation between the number of islands at levels u1 and u2. This turns into an anti-correlation between the coefficients n1 and n2.

Together, the two effects lead the Poisson approximation to overestimate the uncertainty on the global p-value.


In [142]:
from scipy.stats import poisson

In [143]:
n_samples = 1000
z_array = gp.sample(indep,n_samples)
phis = np.zeros((n_samples,2))

for scan_no, z in enumerate(z_array):
    scan = z.reshape((n_scan_points,n_scan_points))**2

    #get excursion sets above those two levels
    exc1 = (scan>u1) + 0. #add 0. to convert from bool to double
    exc2 = (scan>u2) + 0.

    phi1 = calculate_euler_characteristic(exc1)
    phi2 = calculate_euler_characteristic(exc2)
    phis[scan_no] = [phi1, phi2]

In [147]:
bins = np.arange(0,25)
counts, bins, patches = plt.hist(phis[:,0], bins=bins, normed=True, alpha=.3, color='b')
_ = plt.hist(phis[:,1], bins=bins, normed=True,alpha=.3, color='r')
plt.plot(bins,poisson.pmf(bins,np.mean(phis[:,0])), c='b')
plt.plot(bins,poisson.pmf(bins,np.mean(phis[:,1])), c='r')
plt.xlabel('phi_i')
plt.legend(('obs phi1', 'obs phi2', 'poisson(mean(phi1))', 'poisson(mean(phi2))'), loc='upper left')


Out[147]:
<matplotlib.legend.Legend at 0x1396e7510>

In [149]:
print 'Check Poisson phi1', np.mean(phis[:,0]), np.std(phis[:,0]), np.sqrt(np.mean(phis[:,0]))
print 'Check Poisson phi2', np.mean(phis[:,1]), np.std(phis[:,1]), np.sqrt(np.mean(phis[:,1]))

print 'correlation coefficients:'
print np.corrcoef(phis[:,0], phis[:,1])
print 'covariance:'
print np.cov(phis[:,0], phis[:,1])


Check Poisson phi1 14.628 2.87256261899 3.82465684735
Check Poisson phi2 13.88 2.38528824254 3.72558720204
correlation coefficients:
[[ 1.          0.43701228]
 [ 0.43701228  1.        ]]
covariance:
[[ 8.25987588  2.99735736]
 [ 2.99735736  5.6952953 ]]

In [150]:
x, y = np.random.multivariate_normal([np.mean(phis[:,0]), np.mean(phis[:,1])], np.cov(phis[:,0], phis[:,1]), 5000).T
_ = plt.scatter(phis[:,0], phis[:,1], alpha=0.1)
plt.plot(x, y, 'x', alpha=0.1)
plt.axis('equal')
plt.xlabel('phi_1')
plt.ylabel('phi_2')


Out[150]:
<matplotlib.text.Text at 0x139991a90>

In [151]:
toy_n1, toy_n2 = np.zeros(x.size),np.zeros(x.size)
for i, (toy_exp_phi_1, toy_exp_phi_2) in enumerate(zip(x,y)):
    n1, n2 = get_coefficients(u1=u1, u2=u2, exp_phi_1=toy_exp_phi_1, exp_phi_2=toy_exp_phi_2)
    toy_n1[i] = n1
    toy_n2[i] = n2
plt.scatter(toy_n1, toy_n2, alpha=.1)
plt.xlabel('n1')
plt.ylabel('n2')


Out[151]:
<matplotlib.text.Text at 0x13970dd10>

In [152]:
# now propagate the uncertainty on exp_phi_1 and exp_phi_2 (by dividing the covariance matrix by n_samples), including correlations
x, y = np.random.multivariate_normal([np.mean(phis[:,0]),np.mean(phis[:,1])], 
                                     np.cov(phis[:,0], phis[:,1])/n_samples,
                                     5000).T

'''
# check consistency with next cell by using diagonal covariance
dummy_cov = np.cov(phis[:,0], phis[:,1])/n_samples
dummy_cov[0,1]=0
dummy_cov[1,0]=0
print dummy_cov
x, y = np.random.multivariate_normal([np.mean(phis[:,0]),np.mean(phis[:,1])], 
                                     dummy_cov,
                                     5000).T
'''

toy_global_p = np.zeros(x.size)
for i, (toy_exp_phi_1, toy_exp_phi_2) in enumerate(zip(x,y)):
    n1, n2 = get_coefficients(u1=u1, u2=u2, exp_phi_1=toy_exp_phi_1, exp_phi_2=toy_exp_phi_2)
    u = 16
    #global_p = global_pvalue(u,n1,n2)
    toy_global_p[i] = global_pvalue(u,n1,n2)

In [153]:
# now propagate error assuming uncorrelated but observed std. on phi_1 and phi_2 / sqrt(n_samples)
x = np.random.normal(np.mean(phis[:,0]), np.std(phis[:,0])/np.sqrt(n_samples), 5000)
y = np.random.normal(np.mean(phis[:,1]), np.std(phis[:,1])/np.sqrt(n_samples), 5000)

toy_global_p_uncor = np.zeros(x.size)
for i, (toy_exp_phi_1, toy_exp_phi_2) in enumerate(zip(x,y)):
    n1, n2 = get_coefficients(u1=u1, u2=u2, exp_phi_1=toy_exp_phi_1, exp_phi_2=toy_exp_phi_2)
    u = 16
    #global_p = global_pvalue(u,n1,n2)
    toy_global_p_uncor[i] = global_pvalue(u,n1,n2)

In [154]:
# now propagate error assuming uncorrelated Poisson stats on phi_1 and phi_2
x = np.random.normal(np.mean(phis[:,0]), np.sqrt(np.mean(phis[:,0]))/np.sqrt(n_samples), 5000)
y = np.random.normal(np.mean(phis[:,1]), np.sqrt(np.mean(phis[:,1]))/np.sqrt(n_samples), 5000)

toy_global_p_uncor_pois = np.zeros(x.size)
for i, (toy_exp_phi_1, toy_exp_phi_2) in enumerate(zip(x,y)):
    n1, n2 = get_coefficients(u1=u1, u2=u2, exp_phi_1=toy_exp_phi_1, exp_phi_2=toy_exp_phi_2)
    u = 16
    #global_p = global_pvalue(u,n1,n2)
    toy_global_p_uncor_pois[i] = global_pvalue(u,n1,n2)

In [156]:
counts, bins, patches = plt.hist(toy_global_p_uncor_pois, bins=50, normed=True, color='g', alpha=.3)
counts, bins, patches = plt.hist(toy_global_p_uncor, bins=bins, normed=True, color='r', alpha=.3)
counts, bins, patches = plt.hist(toy_global_p, bins=bins, normed=True, color='b', alpha=.3)

plt.xlabel('global p-value')
#plt.ylim(0,1.4*np.max(counts))
plt.legend(('uncorrelated Poisson approx from mean',
            'uncorrelated Gaus. approx of observed dist',
            'correlated Gaus. approx of observed dist'),
           bbox_to_anchor=(1., 1.3))


Out[156]:
<matplotlib.legend.Legend at 0x13aa76650>

Conclusion: together, the two effects lead the Poisson approximation to overestimate the uncertainty on the global p-value.