In [1]:
from __future__ import print_function, division
import numpy as np
from matplotlib import pyplot as plt
from astroML.decorators import pickle_results
from astroML.datasets import fetch_sdss_specgals
from astroML.correlation import bootstrap_two_point_angular, \
two_point_angular, \
uniform_sphere
from astroML.plotting import hist
from astroML.density_estimation import knuth_bin_width
%matplotlib inline
Apply the concept of correlation functions to a density function $\rho(\mathbf{x})$ of position $\mathbf{x}$.
In this case, positions are given by a spherical coordinate system $\mathbf{n}=(\theta,\phi)$, and $\rho$ is the number density (per unit solid angle $\Omega$).
For an isotropic distribution (unclustered $\rho$, uniform $\bar\rho=\langle\rho(\mathbf{n})\rangle$), variations in $\rho$ at two fixed points are statistically independent over randomly generated distributions. $$\langle\rho(\mathbf{n}_1)\rho(\mathbf{n}_2)\rangle=\bar\rho^2\quad\text{(unclustered)}$$
Equivalently, there is no covariance (recall from eqn. 3.74)
$$\text{Cov}[\rho(\mathbf{n}_1),\rho(\mathbf{n}_2)]=\langle[\rho(\mathbf{n}_1)-\bar\rho][\rho(\mathbf{n}_2)-\bar\rho]\rangle\equiv\langle\delta\rho(\mathbf{n}_1)\delta\rho(\mathbf{n}_2)\rangle=0\quad\text{(unclustered)}$$or 2-point correlation function (note the difference from the correlation coefficient in eqn. 3.81)
$$w(\mathbf{n}_1,\mathbf{n}_2)=\left\langle\frac{\delta\rho(\mathbf{n}_1)}{\bar\rho}\frac{\delta\rho(\mathbf{n}_2)}{\bar\rho}\right\rangle=0\quad\text{(unclustered)}.$$Apply the Cosmological Principle:
- Structure was seeded by randomly generated primordial density fluctuations (of quantum origin), which are statistically uniform and isotropic.
- Structure formation from these seeds is deterministic: a deterministic operator maps the primordial seeds to the observed galaxy distribution.
- The same operator transforms the PDF of the seeds into the PDF of the galaxy distribution.
- The galaxy point PDF must therefore be statistically uniform and isotropic.
This justifies a uniform $\bar\rho$, and correlation functions can only depend on $\textbf{distances}$ between points (the only available property not affected by translations/rotations). $$w(\mathbf{n}_1,\mathbf{n}_2)=w(\psi_{12})$$
Thus, we can average the density products over all pairs of positions the same distance apart, within a $\textbf{statistically representative}$ sample.
The result is an estimator of $\textbf{the}$ correlation function of the galaxy point PDF. This is why the galaxy distribution is a powerful probe of cosmology.
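As a quick sanity check of the unclustered baseline (anticipating the randomized test near the end of this notebook), the estimated angular correlation function of a uniform distribution on the sphere should be consistent with zero at all separations. The footprint, sample size, and bins below are arbitrary illustration choices.
In [ ]:
# Sanity check: an unclustered (uniform) distribution should give w(theta) ~ 0.
# Footprint, sample size, and bins are arbitrary choices for illustration.
np.random.seed(0)
ra_u, dec_u = uniform_sphere((0, 90), (0, 45), 5000)
bins_u = 10 ** np.linspace(np.log10(1 / 60.), np.log10(6), 8)
w_u = two_point_angular(ra_u, dec_u, bins_u, method='standard')
print(w_u)  # values should scatter around 0 within the shot noise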
Measuring the galaxy correlation function requires estimating the PDF of the measurement:
- how was the survey footprint chosen (is it a statistically representative volume)?
- selection function/completeness of survey?
- angular precision?
- distance estimation (for 3d correlation, or angular correlation in a fixed spherical shell)?
- number of observed objects in volume?
We can measure a correlation function of any distribution this way, even over volumes whose extent is smaller than the correlation length of the distribution.
Example in text: Consider a field of view approximately the size of the SDSS Great Wall, centered on the Great Wall.
The interpretation is now different:
We lose the interpretation of an ensemble average over a physical PDF that generated the distribution.
The estimate of the correlation function represents an excess of clustering at a given distance scale over a random distribution of points.
The text recommends estimating the robustness of this clustering measurement (how non-zero is it?) via bootstrap: resample the galaxies with replacement and see how much the correlation function varies across the random subsamples.
A cosmological interpretation of the significance of this clustering must be made in the context of a more globally representative sample.
The information we are measuring is:
- What is the amplitude of clustering at a particular length scale?
- How significant is the evidence for clustering?
- How significant is the evidence for different clustering of different populations in the same field of view?
Neglect for now the $\textbf{systematic effects}$ listed above and consider an idealized catalog of point positions.
The density field is a sum of (2-D angular) Dirac delta functions. $$\rho(\mathbf{n})=\sum_{i=1}^N\delta(\mathbf{n}-\mathbf{n}_i)$$ has mean density $$\bar\rho=\int\frac{d\mathbf{n}}{4\pi}\rho(\mathbf{n})=\frac{N}{4\pi}.$$
Consider an angular bin $\Delta\psi$ with $\psi_1\leq\psi\leq\psi_2$.
$$w(\Delta\psi)=\int_{\psi_1}^{\psi_2}\frac{d\psi\sin\psi}{\cos\psi_1-\cos\psi_2}\int_0^{2\pi}\frac{d\phi_\psi}{2\pi}\int\frac{d\mathbf{n}}{4\pi}\left[\frac{\rho(\mathbf{n})-\bar\rho}{\bar\rho}\right]\left[\frac{\rho(\mathbf{n}+\mathbf{\Psi})-\bar\rho}{\bar\rho}\right]$$$$\equiv\int_{\psi_1}^{\psi_2}\frac{d\psi\sin\psi}{\cos\psi_1-\cos\psi_2}\int_0^{2\pi}\frac{d\phi_\psi}{2\pi}\bar{w}(\mathbf{\Psi})\quad\quad\quad\quad\quad\quad\quad\quad\ $$We take a "ring average" of $$\bar w(\mathbf{\Psi})=\int\frac{d\mathbf{n}}{4\pi}\frac{\rho(\mathbf{n})}{\bar\rho}\frac{\rho(\mathbf{n}+\mathbf{\Psi})}{\bar\rho}-1-1+1\quad\quad\quad\quad\quad\quad\ \,$$ $$=\frac{4\pi}{N^2}\sum_{i=1}^N\sum_{j=1}^N\int d\mathbf{n}\ \delta(\mathbf{n}-\mathbf{n}_i)\ \delta(\mathbf{n}-\mathbf{n}_j+\mathbf{\Psi})-1$$ $$=\frac{4\pi}{N^2}\sum_{i=1}^N\sum_{j=1}^N\delta(\mathbf{n}_i-\mathbf{n}_j+\mathbf{\Psi})-1\quad\quad\quad\quad\quad\ \ $$ $$=\frac{4\pi}{N^2}\sum_i\sum_{j\neq i}\delta(\mathbf{n}_i-\mathbf{n}_j+\mathbf{\Psi})+\frac{4\pi}{N}\delta(\mathbf{\Psi})-1.\quad\ \,$$
Let $DD(\Delta\psi)$ be the number of pairs of points with distance in our angular bin.
$$DD\equiv\sum_i\sum_{j\neq i}\int_{\psi_1}^{\psi_2}d\psi\sin\psi\int_0^{2\pi}d\phi_\psi\ \delta(\mathbf{n}_i-\mathbf{n}_j+\mathbf{\Psi})$$The ring average then gives
$$w(\Delta\psi)=\frac{DD}{RR}+\frac{2\delta_{\psi_1,0}}{1-\cos\psi_2}\frac{1}{N}-1$$where
$$RR=N^2\frac{\cos\psi_1-\cos\psi_2}{2}=N^2\sin\bar\psi\,\sin\frac{\Delta\psi}{2},\qquad\bar\psi\equiv\frac{\psi_1+\psi_2}{2},$$is the expected number of pairs in the bin for a uniform random distribution of points.
The 1-point term with $\delta_{\psi_1,0}$ is only relevant for an angular bin that includes $\psi=0$, that is, a bin that is a disk around each point.
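As a quick numerical check of the analytic $RR$ above (a sketch with arbitrary $N$ and bin choices), we can draw a uniform sample on the full sphere and count ordered pairs in one angular bin; the count should agree with $N^2(\cos\psi_1-\cos\psi_2)/2$ to within a few per cent.
In [ ]:
# Monte Carlo check of RR = N^2 (cos(psi_1) - cos(psi_2)) / 2 for a uniform
# distribution on the full sphere.  N and the bin edges are arbitrary choices.
np.random.seed(1)
N = 2000
ra_r, dec_r = uniform_sphere((0, 360), (-90, 90), N)

# unit vectors on the sphere
ra_rad, dec_rad = np.radians(ra_r), np.radians(dec_r)
xyz = np.array([np.cos(dec_rad) * np.cos(ra_rad),
                np.cos(dec_rad) * np.sin(ra_rad),
                np.sin(dec_rad)])

# pairwise angular separations (an N x N matrix; fine for small N)
psi = np.arccos(np.clip(np.dot(xyz.T, xyz), -1, 1))

psi1, psi2 = np.radians(5.), np.radians(10.)
in_bin = (psi >= psi1) & (psi < psi2)
np.fill_diagonal(in_bin, False)    # exclude i = j self-pairs (psi = 0 anyway)

RR_counted = in_bin.sum()          # ordered pairs, matching the N^2 convention
RR_analytic = N ** 2 * (np.cos(psi1) - np.cos(psi2)) / 2.
print(RR_counted, RR_analytic)     # should agree to within a few per cent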
Edge effects and selection effects can be taken into account by simulating a random distribution with those effects.
This gives the expected number of $\Delta\psi$-separated pairs $RR$. The angular correlation function is the relative excess number of pairs $DD$ in the data.
$$w(\Delta\psi)=\frac{DD}{RR}-1$$The astroML implementation does not model position uncertainties, and it assumes a complete catalog (use data with an appropriate magnitude cut) so that selection effects are not present. The field of view is always a rectangle in right ascension and declination coordinates, but other footprint geometries can easily be implemented.
Under the $\textbf{ensemble average}$ interpretation of the correlation function of a statistically uniform and isotropic PDF, the Landy-Szalay estimator has the same unbiased mean as the standard estimator, but it is less sensitive to edge effects and has a lower variance with respect to cosmic variance over the ensemble of skies.
$$w(\Delta\psi)=\frac{DD-2DR+RR}{RR}$$It is worth keeping in mind that, when applied to a fixed list of point positions (and a fixed random catalog), these estimators are not statistical quantities but different definitions of a 2-point correlation function; they are exact quantities.
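To make the estimators concrete, here is a minimal brute-force sketch of both (not the astroML implementation, which uses tree-based pair counting; see the source quoted below). Pair counts are normalized by the catalog sizes, since the random catalog is usually larger than the data; the helper names here are hypothetical.
In [ ]:
# Minimal brute-force sketch of the standard and Landy-Szalay estimators
# (illustration only -- O(N^2), and assumes every bin contains random pairs).
def radec_to_unit_vectors(ra, dec):
    ra, dec = np.radians(ra), np.radians(dec)
    return np.array([np.cos(dec) * np.cos(ra),
                     np.cos(dec) * np.sin(ra),
                     np.sin(dec)])

def pair_counts(xyz1, xyz2, psi_edges, same=False):
    """Histogram of angular separations between two sets of unit vectors."""
    psi = np.arccos(np.clip(np.dot(xyz1.T, xyz2), -1, 1))
    if same:
        np.fill_diagonal(psi, np.nan)   # drop i = j self-pairs
    counts, _ = np.histogram(psi[~np.isnan(psi)], bins=psi_edges)
    return counts.astype(float)

def w_estimators(ra_D, dec_D, ra_R, dec_R, psi_edges_deg):
    """Standard and Landy-Szalay estimates of w in each angular bin."""
    psi_edges = np.radians(psi_edges_deg)
    D = radec_to_unit_vectors(ra_D, dec_D)
    R = radec_to_unit_vectors(ra_R, dec_R)
    DD = pair_counts(D, D, psi_edges, same=True)
    RR = pair_counts(R, R, psi_edges, same=True)
    DR = pair_counts(D, R, psi_edges)
    f = float(len(ra_R)) / len(ra_D)    # normalization for unequal catalog sizes
    w_std = f ** 2 * DD / RR - 1
    w_ls = (f ** 2 * DD - 2 * f * DR + RR) / RR
    return w_std, w_ls
Applied to small subsamples of the catalogs loaded below, with a random catalog drawn from `uniform_sphere` over the same footprint, these should agree with the astroML estimators up to shot noise in the random catalog.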
In [2]:
#----------------------------------------------------------------------
# This function adjusts matplotlib settings for a uniform feel in the textbook.
# Note that with usetex=True, fonts are rendered with LaTeX. This may
# result in an error if LaTeX is not installed on your system. In that case,
# you can set usetex to False.
from astroML.plotting import setup_text_plots
setup_text_plots(fontsize=8, usetex=True)
#------------------------------------------------------------
# Get data and do some quality cuts
data = fetch_sdss_specgals()
m_max = 17.7
# redshift and magnitude cuts
data = data[data['z'] > 0.08]
data = data[data['z'] < 0.12]
data = data[data['petroMag_r'] < m_max]
# RA/DEC cuts
RAmin, RAmax = 140, 220
DECmin, DECmax = 5, 45
data = data[data['ra'] < RAmax]
data = data[data['ra'] > RAmin]
data = data[data['dec'] < DECmax]
data = data[data['dec'] > DECmin]
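A quick optional check that the cuts produced the intended rectangular footprint:
In [ ]:
# Optional check: the selected sample should fill the RA/Dec rectangle above.
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(data['ra'], data['dec'], ',', color='black')
ax.set_xlabel('RA (deg)')
ax.set_ylabel('Dec (deg)')
plt.show()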
In [3]:
ur = data['modelMag_u'] - data['modelMag_r']
flag_red = (ur > 2.22)
flag_blue = ~flag_red
data_red = data[flag_red]
data_blue = data[flag_blue]
print("data size:")
print(" red gals: ", len(data_red))
print(" blue gals:", len(data_blue))
In [4]:
#------------------------------------------------------------
# Set up correlation function computation
# This calculation takes a long time with the bootstrap resampling,
# so we'll save the results.
@pickle_results("correlation_functions.pkl")
def compute_results(Nbins=16, Nbootstraps=10, method='landy-szalay', rseed=0):
    np.random.seed(rseed)
    # logarithmic angular bins from 1 arcmin to 6 degrees (Nbins edges -> Nbins - 1 bins)
    bins = 10 ** np.linspace(np.log10(1 / 60.), np.log10(6), Nbins)
    results = [bins]
    for D in [data_red, data_blue]:
        results += bootstrap_two_point_angular(D['ra'],
                                               D['dec'],
                                               bins=bins,
                                               method=method,
                                               Nbootstraps=Nbootstraps)

    return results

(bins, r_corr, r_corr_err, r_bootstraps,
 b_corr, b_corr_err, b_bootstraps) = compute_results()
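The `@pickle_results` decorator caches the function's return value on disk, so the expensive bootstrap runs only once; roughly, it recomputes when the file is missing or the call arguments change. A minimal sketch of the idea (not the actual astroML implementation):
In [ ]:
# Minimal sketch of the caching idea behind @pickle_results (not the actual
# astroML implementation): store (args, kwargs, result) in a pickle file and
# reuse it when the function is called again with the same arguments.
# (Simple equality check; assumes the arguments are plain comparable values.)
import os
import pickle
from functools import wraps

def cache_to_pickle(filename):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if os.path.exists(filename):
                with open(filename, 'rb') as f:
                    cached = pickle.load(f)
                if cached['args'] == args and cached['kwargs'] == kwargs:
                    return cached['result']
            result = func(*args, **kwargs)
            with open(filename, 'wb') as f:
                pickle.dump({'args': args, 'kwargs': kwargs,
                             'result': result}, f)
            return result
        return wrapper
    return decorator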
In [5]:
bin_centers = 0.5 * (bins[1:] + bins[:-1])

#------------------------------------------------------------
# Plot the results
corr = [r_corr, b_corr]
corr_err = [r_corr_err, b_corr_err]
bootstraps = [r_bootstraps, b_bootstraps]
labels = ['$u-r > 2.22$\n$N=%i$' % len(data_red),
          '$u-r < 2.22$\n$N=%i$' % len(data_blue)]

fig = plt.figure(figsize=(5, 2.5))
fig.subplots_adjust(bottom=0.2, top=0.9,
                    left=0.13, right=0.95)

for i in range(2):
    ax = fig.add_subplot(121 + i, xscale='log', yscale='log')
    ax.errorbar(bin_centers, corr[i], corr_err[i],
                fmt='.k', ecolor='gray', lw=1)

    t = np.array([0.01, 10])
    ax.plot(t, 10 * (t / 0.01) ** -0.8, ':k', linewidth=1)

    ax.text(0.95, 0.95, labels[i],
            ha='right', va='top', transform=ax.transAxes)
    ax.set_xlabel(r'$\theta\ (deg)$')
    if i == 0:
        ax.set_ylabel(r'$\hat{w}(\theta)$')

plt.show()
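The dotted line in the figure is a fixed power law drawn for reference. As a rough follow-up, one can fit $w(\theta)=A\,\theta^{-\gamma}$ by least squares in log-log space (a sketch: it assumes the power-law form, treats the errors as symmetric in $\log w$, ignores bin-to-bin covariance, and drops bins with $w\leq0$; $\theta$ is in degrees):
In [ ]:
# Rough power-law fit w(theta) = A * theta^(slope), theta in degrees.
# Sketch only: assumes a power law, symmetric errors in log w, no covariance.
for name, wval, werr in [('red ', r_corr, r_corr_err),
                         ('blue', b_corr, b_corr_err)]:
    good = (wval > 0) & (werr > 0)
    slope, logA = np.polyfit(np.log10(bin_centers[good]), np.log10(wval[good]),
                             1, w=wval[good] / werr[good])  # weight ~ 1 / sigma(log10 w)
    print("%s: w(theta) ~ %.3f * theta^(%.2f)" % (name, 10 ** logA, slope))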
In [ ]:
# Source of astroML.correlation.bootstrap_two_point_angular, quoted for reference.
# Helpers such as ra_dec_to_xyz, angular_dist_to_euclidean_dist, check_random_state,
# and two_point are defined at module level in astroML.correlation.
def bootstrap_two_point_angular(ra, dec, bins, method='standard',
                                Nbootstraps=10, random_state=None):
    """Angular two-point correlation function

    A separate function is needed because angular distances are not
    Euclidean, and random sampling needs to take into account the
    spherical volume element.

    Parameters
    ----------
    ra : array_like
        input right ascension, shape = (n_samples,)
    dec : array_like
        input declination
    bins : array_like
        bins within which to compute the 2-point correlation.
        shape = Nbins + 1
    method : string
        "standard" or "landy-szalay".
    Nbootstraps : int
        number of bootstrap resamples
    random_state : integer, np.random.RandomState, or None
        specify the random state to use for generating background

    Returns
    -------
    corr : ndarray
        the estimate of the correlation function within each bin
        shape = Nbins
    dcorr : ndarray
        error estimate on corr (sample standard deviation of
        bootstrap resamples)
    bootstraps : ndarray
        The full sample of bootstraps used to compute corr and dcorr
    """
    ra = np.asarray(ra)
    dec = np.asarray(dec)

    rng = check_random_state(random_state)

    if method not in ['standard', 'landy-szalay']:
        raise ValueError("method must be 'standard' or 'landy-szalay'")

    if bins.ndim != 1:
        raise ValueError("bins must be a 1D array")

    if (ra.ndim != 1) or (dec.ndim != 1) or (ra.shape != dec.shape):
        raise ValueError('ra and dec must be 1-dimensional '
                         'arrays of the same length')

    n_features = len(ra)
    Nbins = len(bins) - 1

    data = np.asarray(ra_dec_to_xyz(ra, dec), order='F').T

    # convert spherical bins to cartesian bins
    bins_transform = angular_dist_to_euclidean_dist(bins)

    bootstraps = []
In [ ]:
    for i in range(Nbootstraps):
        # draw a random sample with N points
        ra_R, dec_R = uniform_sphere((min(ra), max(ra)),
                                     (min(dec), max(dec)),
                                     2 * len(ra))

        data_R = np.asarray(ra_dec_to_xyz(ra_R, dec_R), order='F').T

        if i > 0:
            # random sample of the data
            ind = np.random.randint(0, data.shape[0], data.shape[0])
            data_b = data[ind]
        else:
            data_b = data

        bootstraps.append(two_point(data_b, bins_transform, method=method,
                                    data_R=data_R, random_state=rng))

    bootstraps = np.asarray(bootstraps)
    corr = np.mean(bootstraps, 0)
    corr_err = np.std(bootstraps, 0, ddof=1)

    return corr, corr_err, bootstraps
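The random (background) catalog in the loop above is drawn with `uniform_sphere` over the RA/Dec bounding box of the data, oversampled by a factor of 2, which is why the implementation is restricted to rectangular footprints. A quick visual comparison of the data and such a random catalog (an optional sketch using the blue sample):
In [ ]:
# Optional check: the data footprint vs. the uniform random catalog drawn over
# its RA/Dec bounding box (factor-of-2 oversampling, as in the loop above).
ra_R, dec_R = uniform_sphere((data_blue['ra'].min(), data_blue['ra'].max()),
                             (data_blue['dec'].min(), data_blue['dec'].max()),
                             2 * len(data_blue))
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True, sharey=True)
axes[0].plot(data_blue['ra'], data_blue['dec'], ',', color='black')
axes[0].set_title('data (blue galaxies)')
axes[1].plot(ra_R, dec_R, ',', color='black')
axes[1].set_title('uniform random catalog')
for a in axes:
    a.set_xlabel('RA (deg)')
axes[0].set_ylabel('Dec (deg)')
plt.show()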
In [ ]:
def two_point(data, bins, method='standard',
              data_R=None, random_state=None):
    """Two-point correlation function

    Parameters
    ----------
    data : array_like
        input data, shape = [n_samples, n_features]
    bins : array_like
        bins within which to compute the 2-point correlation.
        shape = Nbins + 1
    method : string
        "standard" or "landy-szalay".
    data_R : array_like (optional)
        if specified, use this as the random comparison sample
    random_state : integer, np.random.RandomState, or None
        specify the random state to use for generating background

    Returns
    -------
    corr : ndarray
        the estimate of the correlation function within each bin
        shape = Nbins
    """
    data = np.asarray(data)
    bins = np.asarray(bins)
    rng = check_random_state(random_state)

    if method not in ['standard', 'landy-szalay']:
        raise ValueError("method must be 'standard' or 'landy-szalay'")

    if bins.ndim != 1:
        raise ValueError("bins must be a 1D array")

    if data.ndim == 1:
        data = data[:, np.newaxis]
    elif data.ndim != 2:
        raise ValueError("data should be 1D or 2D")

    n_samples, n_features = data.shape
    Nbins = len(bins) - 1

    # shuffle all but one axis to get background distribution
    if data_R is None:
        data_R = data.copy()
        for i in range(n_features - 1):
            rng.shuffle(data_R[:, i])
    else:
        data_R = np.asarray(data_R)
        if (data_R.ndim != 2) or (data_R.shape[-1] != n_features):
            raise ValueError('data_R must have same n_features as data')

    factor = len(data_R) * 1. / len(data)
In [ ]:
    if sklearn_has_two_point:
        # Fast two-point correlation functions added in scikit-learn v. 0.14
        KDT_D = KDTree(data)
        KDT_R = KDTree(data_R)

        counts_DD = KDT_D.two_point_correlation(data, bins)
        counts_RR = KDT_R.two_point_correlation(data_R, bins)
    else:
        warnings.warn("Version 0.3 of astroML will require scikit-learn "
                      "version 0.14 or higher for correlation function "
                      "calculations. Upgrade to sklearn 0.14+ now for much "
                      "faster correlation function calculations.")

        BT_D = BallTree(data)
        BT_R = BallTree(data_R)

        counts_DD = np.zeros(Nbins + 1)
        counts_RR = np.zeros(Nbins + 1)

        for i in range(Nbins + 1):
            counts_DD[i] = np.sum(BT_D.query_radius(data, bins[i],
                                                    count_only=True))
            counts_RR[i] = np.sum(BT_R.query_radius(data_R, bins[i],
                                                    count_only=True))

    DD = np.diff(counts_DD)
    RR = np.diff(counts_RR)

    # check for zero in the denominator
    RR_zero = (RR == 0)
    RR[RR_zero] = 1

    if method == 'standard':
        corr = factor ** 2 * DD / RR - 1
    elif method == 'landy-szalay':
        if sklearn_has_two_point:
            counts_DR = KDT_R.two_point_correlation(data, bins)
        else:
            counts_DR = np.zeros(Nbins + 1)
            for i in range(Nbins + 1):
                counts_DR[i] = np.sum(BT_R.query_radius(data, bins[i],
                                                        count_only=True))

        DR = np.diff(counts_DR)

        corr = (factor ** 2 * DD - 2 * factor * DR + RR) / RR

    corr[RR_zero] = np.nan

    return corr
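The fast path above relies on `KDTree.two_point_correlation` (scikit-learn 0.14 or later), which returns, for each radius, the cumulative count of pairs with separation up to that radius; `np.diff` then converts these to per-bin counts. A minimal illustration on synthetic Euclidean data:
In [ ]:
# Minimal illustration of the tree-based pair counting used above.
# two_point_correlation returns cumulative (ordered, self-pairs included)
# counts of pairs with separation <= r; np.diff gives per-bin counts.
from sklearn.neighbors import KDTree

rng_demo = np.random.RandomState(42)
X = rng_demo.random_sample((1000, 2))      # uniform points in the unit square
r_edges = np.linspace(0.01, 0.2, 6)

tree = KDTree(X)
cumulative = tree.two_point_correlation(X, r_edges)
per_bin = np.diff(cumulative)
print("cumulative counts:", cumulative)
print("per-bin counts:   ", per_bin)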
In [6]:
@pickle_results("correlation_functions2.pkl")
def compute_results2(Nbins=16, Nbootstraps=10, method='landy-szalay', rseed=0):
    np.random.seed(rseed)
    # same binning as before (Nbins edges -> Nbins - 1 bins)
    bins = 10 ** np.linspace(np.log10(1 / 60.), np.log10(6), Nbins)
    results = [bins]
    for D in [data_red, data_blue]:
        results += bootstrap_two_point_angular(D['ra'],
                                               D['dec'],
                                               bins=bins,
                                               method=method,
                                               Nbootstraps=Nbootstraps)

    return results

(bins, r_corr, r_corr_err, r_bootstraps,
 b_corr, b_corr_err, b_bootstraps) = compute_results2(method='standard')

#------------------------------------------------------------
# Plot the results
corr = [r_corr, b_corr]
corr_err = [r_corr_err, b_corr_err]
bootstraps = [r_bootstraps, b_bootstraps]
labels = ['$u-r > 2.22$\n$N=%i$' % len(data_red),
          '$u-r < 2.22$\n$N=%i$' % len(data_blue)]

fig = plt.figure(figsize=(5, 2.5))
fig.subplots_adjust(bottom=0.2, top=0.9,
                    left=0.13, right=0.95)

for i in range(2):
    ax = fig.add_subplot(121 + i, xscale='log', yscale='log')
    ax.errorbar(bin_centers, corr[i], corr_err[i],
                fmt='.k', ecolor='gray', lw=1)

    t = np.array([0.01, 10])
    ax.plot(t, 10 * (t / 0.01) ** -0.8, ':k', linewidth=1)

    ax.text(0.95, 0.95, labels[i],
            ha='right', va='top', transform=ax.transAxes)
    ax.set_xlabel(r'$\theta\ (deg)$')
    if i == 0:
        ax.set_ylabel(r'$\hat{w}(\theta)$')

plt.show()
Interpretation of the correlation function is straightforward: it measures the excess clustering at a given distance scale relative to a random distribution of points.
The bootstrap is useful when applied to a large galaxy survey that is statistically representative of the Universe, to estimate the uncertainty of the measured galaxy correlation function.
Assuming the Great Wall catalog is complete to the given magnitude limit, the uncertainty of the correlation function is probably dominated by the uncertainty of the photometric positions.
Given $N$ galaxies in a field of view of given geometry, observed with known selection bias, the joint probability distribution of correlation functions of random uniform point distributions at any set of distance scales can be determined precisely (either with Monte Carlo, or semi-analytically as described, for example, in the Landy and Szalay paper). Thus, the significance of an observation of clustering can be precisely quantified.
Is there any way to quantify how significant a difference in correlation function between two populations in a non-statistically-representative sample is? Not until we identify a self-consistent statistical ensemble for which the chosen field of view is a random example.
For example, consider the ensemble of all possible samples in the Universe with the same geometric volume as the Great Wall sample and the same numbers of red and blue galaxies. Such an ensemble is easy to define, but drawing random samples of such volumes from N-body simulations is difficult, though possible with constrained simulations.
In [7]:
#----------------------------------------------------
# What is the statistical significance of clustering in the highest
# angular distance bin?
highbin = bins[-2:]
npts = len(data_blue)

# Generate nsamples uniform (unclustered) distributions and compute their
# correlation functions in the highest bin.
nsamples = 100
wrand = []
for i in range(nsamples):
    ra, dec = uniform_sphere((RAmin, RAmax), (DECmin, DECmax), size=npts)
    wrand.append(two_point_angular(ra, dec, highbin, method='standard',
                                   random_state=i))
wrand = np.asarray(wrand)

# The mean and standard deviation of the sample of w for uniform dists.
wrandmean = np.mean(wrand, 0)
wrandsig = np.std(wrand, 0, ddof=1)
print("mean uniform w: ", wrandmean)
print("standard deviation:", wrandsig)

fig, ax = plt.subplots(figsize=(10, 10))
hist(wrand[:, 0], bins='knuth', ax=ax,
     histtype='stepfilled', ec='k', fc='#AAAAAA')

# Get the bin widths to properly normalize the fit PDF.
dx, bins = knuth_bin_width(wrand[:, 0], True, disp=False)

x = np.linspace(-3 * wrandsig, 3 * wrandsig, 10000)
gauss = nsamples * dx * np.exp(-0.5 * (x / wrandsig) ** 2) / \
    np.sqrt(2 * np.pi) / wrandsig
ax.plot(x, gauss, '-', c='black', lw=3)

ax.set_xlabel('$w$')
ax.set_ylabel('$N(w)$')
ax.set_xlim(-3 * wrandsig, 3 * wrandsig)
ax.set_ylim(0, 0.5 * nsamples * dx / wrandsig)
plt.show()
If $w_\text{rand}(\Delta\psi)$ is approximately a Gaussian random variable with $\bar w_\text{rand}=0$ and $\sigma_w\simeq0.004$, then the measured clustering of $w=0.013$ in this bin deviates from a uniform distribution at about $3\sigma$ significance.
For the question of how significantly non-zero the measured correlation function is, I would argue that this is a more meaningful error bar than the smaller bootstrap uncertainty of 0.003.
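To make the significance explicit (a sketch; it assumes the measured value quoted above is the blue-sample correlation in the highest bin, `b_corr[-1]`, and that $w$ for uniform catalogs is approximately Gaussian):
In [ ]:
# Explicit z-score of the measured clustering against the uniform-catalog
# distribution.  Assumes the measured value referenced above is b_corr[-1]
# and that w for uniform catalogs is approximately Gaussian.
from scipy.stats import norm

w_measured = b_corr[-1]
z = (w_measured - wrandmean[0]) / wrandsig[0]
p_one_sided = norm.sf(z)      # chance probability of a fluctuation at least this high
print("w measured = %.4f" % w_measured)
print("z = %.2f sigma, one-sided p = %.2e" % (z, p_one_sided))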
| Method | Accuracy | Interpretability | Simplicity | Speed |
|---|---|---|---|---|
| K-nearest-neighbor | H | H | H | M |
| Kernel density estimation | H | H | H | H |
| Gaussian mixture models | H | M | M | M |
| Extreme Deconvolution | H | H | M | M |
| K-means | L | M | H | M |
| Max-radius minimization | L | M | M | M |
| Mean shift | M | H | H | M |
| Hierarchical clustering | H | L | L | L |
| Correlation functions | H | M | M | M |

The algorithms are evaluated on four criteria; H = High, M = Medium, L = Low.