"Spatial Clustering" - the Galaxy Correlation Function

  • The degree to which objects positions are correlated with each other - "clustered" - is of great interest in astronomy.
  • We expect galaxies to appear in groups and clusters, as they fall together under gravity: the statistics of galaxy clustering should contain information about galaxy evolution during hierarchical structure formation.
  • Let's try and measure a clustering signal in our SDSS photometric object catalog.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import SDSS
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import copy

In [3]:
# We want to select galaxies, and then are only interested in their positions on the sky.

data = pd.read_csv("downloads/SDSSobjects.csv",usecols=['ra','dec','u','g',\
                                                'r','i','size'])

# Filter out objects with bad magnitude or size measurements:
data = data[(data['u'] > 0) & (data['g'] > 0) & (data['r'] > 0) & (data['i'] > 0) & (data['size'] > 0)]

# Make size cuts, to exclude stars and nearby galaxies, and magnitude cuts, to get good galaxy detections:
data = data[(data['size'] > 0.8) & (data['size'] < 10.0) & (data['i'] > 17) & (data['i'] < 22)]

# Drop the things we're not so interested in:
del data['u'], data['g'], data['r'], data['i'],data['size']

data.head()


Out[3]:
ra dec
0 185.039442 15.048475
3 185.039157 15.142600
4 185.039157 15.142600
5 185.039358 15.139419
9 185.001771 15.151689

In [4]:
Ngals = len(data)
ramin,ramax = np.min(data['ra']),np.max(data['ra'])
decmin,decmax = np.min(data['dec']),np.max(data['dec'])
print Ngals,"galaxy-like objects in (ra,dec) range (",ramin,":",ramax,",",decmin,":",decmax,")"


987 galaxy-like objects in (ra,dec) range ( 185.000415714 : 185.199844645 , 15.000287913 : 15.1996295363 )

The Correlation Function

  • The 2-point correlation function $\xi(\theta)$ is defined as "the probability of finding two galaxies separated by an angular distance $\theta$ with respect to that expected for a random distribution" (Peebles 1980), and is an excellent summary statistic for quantifying the clustering of galaxies.
  • The simplest possible estimator for this excess probability is just $\hat{\xi}(\theta) = \frac{DD - RR}{RR}$, where $DD(\theta) = N_{\rm pairs}(\theta) / N_D(N_D-1)/2$. Here, $N_D$ is the total number of galaxies in the dataset, and $N_{\rm pairs}(\theta)$ is the number of galaxy pairs with separation lying in a bin centered on $\theta$. $RR(\theta)$ is the same quantity computed in a "random catalog," covering the same field of view but with uniformly randomly distributed positions.
  • Correlations between mock galaxies distributed uniformly randomly over the survey "footprint" helps account for spurious effects in the correlation function that might arise from weird survey area design.
  • We'll use Mike Jarvis' TreeCorr code (Jarvis et al 2004) to compute this correlation function estimator efficiently. You can read more about better estimators starting from the TreeCorr wiki.

In [5]:
# !pip install --upgrade TreeCorr

Random Catalogs

First we'll need a random catalog. Let's make it the same size as the data one.


In [6]:
random = pd.DataFrame({'ra' : ramin + (ramax-ramin)*np.random.rand(Ngals), 'dec' : decmin + (decmax-decmin)*np.random.rand(Ngals)})

In [7]:
print len(random), type(random)


987 <class 'pandas.core.frame.DataFrame'>

Now let's plot both catalogs, and compare.


In [8]:
fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(15, 6)
plt.subplots_adjust(wspace=0.2)
    
random.plot(kind='scatter', x='ra', y='dec', ax=ax[0], title='Random')
ax[0].set_xlabel('RA / deg')
ax[0].set_ylabel('Dec. / deg')

data.plot(kind='scatter', x='ra', y='dec', ax=ax[1], title='Data')
ax[1].set_xlabel('RA / deg')
ax[1].set_ylabel('Dec. / deg')


Out[8]:
<matplotlib.text.Text at 0x102a00a10>

Estimating $\xi(\theta)$


In [9]:
import treecorr

random_cat = treecorr.Catalog(ra=random['ra'], dec=random['dec'], ra_units='deg', dec_units='deg')
data_cat = treecorr.Catalog(ra=data['ra'], dec=data['dec'], ra_units='deg', dec_units='deg')

# Set up some correlation function estimator objects:

sep_units='arcmin'
min_sep=0.5
max_sep=10.0
N = 7
bin_size = np.log10(1.0*max_sep/min_sep)/(1.0*N)

dd = treecorr.NNCorrelation(bin_size=bin_size, min_sep=min_sep, max_sep=max_sep, sep_units=sep_units, bin_slop=0.05/bin_size)
rr = treecorr.NNCorrelation(bin_size=bin_size, min_sep=min_sep, max_sep=max_sep, sep_units=sep_units, bin_slop=0.05/bin_size)

# Process the data:
dd.process(data_cat)
rr.process(random_cat)

# Combine into a correlation function and its variance:
xi, varxi = dd.calculateXi(rr)

In [10]:
plt.figure(figsize=(15,8))
plt.rc('xtick', labelsize=16) 
plt.rc('ytick', labelsize=16)
plt.errorbar(np.exp(dd.logr),xi,np.sqrt(varxi),c='blue',linewidth=2)
# plt.xscale('log')
plt.xlabel('$\\theta / {\\rm arcmin}$',fontsize=20)
plt.ylabel('$\\xi(\\theta)$',fontsize=20)
plt.ylim([-0.1,0.2])
plt.grid(True)


Q: Are galaxies uniformly randomly distributed?

Discuss the clustering signal (or lack thereof) in the above plot with your neighbor. What would you want to do better, in a second pass at this?


In [ ]: