"Spatial Clustering" - the Galaxy Correlation Function

  • The degree to which objects positions are correlated with each other - "clustered" - is of great interest in astronomy.
  • We expect galaxies to appear in groups and clusters, as they fall together under gravity: the statistics of galaxy clustering should contain information about galaxy evolution during hierarchical structure formation.
  • Let's try and measure a clustering signal in our SDSS photometric object catalog.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import numpy as np
import SDSS
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import copy

In [3]:
# We want to select galaxies, and then are only interested in their positions on the sky.

data = pd.read_csv("downloads/SDSSobjects.csv",usecols=['ra','dec','u','g',\
                                                'r','i','size'])

# Filter out objects with bad magnitude or size measurements:
data = data[(data['u'] > 0) & (data['g'] > 0) & (data['r'] > 0) & (data['i'] > 0) & (data['size'] > 0)]

# Make size cuts, to exclude stars and nearby galaxies, and magnitude cuts, to get good galaxy detections:
data = data[(data['size'] > 0.8) & (data['size'] < 10.0) & (data['i'] > 17) & (data['i'] < 22)]

# Drop the things we're not so interested in:
del data['u'], data['g'], data['r'], data['i'],data['size']

data.head()


---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-3-3ac8e020f7e8> in <module>()
      1 # We want to select galaxies, and then are only interested in their positions on the sky.
      2 
----> 3 data = pd.read_csv("downloads/SDSSobjects.csv",usecols=['ra','dec','u','g',                                                'r','i','size'])
      4 
      5 # Filter out objects with bad magnitude or size measurements:

/usr/lib/python2.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, float_precision, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format, skip_blank_lines)
    472                     skip_blank_lines=skip_blank_lines)
    473 
--> 474         return _read(filepath_or_buffer, kwds)
    475 
    476     parser_f.__name__ = name

/usr/lib/python2.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    248 
    249     # Create the parser.
--> 250     parser = TextFileReader(filepath_or_buffer, **kwds)
    251 
    252     if (nrows is not None) and (chunksize is not None):

/usr/lib/python2.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    564             self.options['has_index_names'] = kwds['has_index_names']
    565 
--> 566         self._make_engine(self.engine)
    567 
    568     def _get_options_with_defaults(self, engine):

/usr/lib/python2.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
    703     def _make_engine(self, engine='c'):
    704         if engine == 'c':
--> 705             self._engine = CParserWrapper(self.f, **self.options)
    706         else:
    707             if engine == 'python':

/usr/lib/python2.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1070         kwds['allow_leading_cols'] = self.index_col is not False
   1071 
-> 1072         self._reader = _parser.TextReader(src, **kwds)
   1073 
   1074         # XXX

pandas/parser.pyx in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3173)()

pandas/parser.pyx in pandas.parser.TextReader._setup_parser_source (pandas/parser.c:5912)()

IOError: File downloads/SDSSobjects.csv does not exist

In [4]:
Ngals = len(data)
ramin,ramax = np.min(data['ra']),np.max(data['ra'])
decmin,decmax = np.min(data['dec']),np.max(data['dec'])
print Ngals,"galaxy-like objects in (ra,dec) range (",ramin,":",ramax,",",decmin,":",decmax,")"


987 galaxy-like objects in (ra,dec) range ( 185.000415714 : 185.199844645 , 15.000287913 : 15.1996295363 )

The Correlation Function

  • The 2-point correlation function $\xi(\theta)$ is defined as "the probability of finding two galaxies separated by an angular distance $\theta$ with respect to that expected for a random distribution" (Peebles 1980), and is an excellent summary statistic for quantifying the clustering of galaxies.
  • The simplest possible estimator for this excess probability is just $\hat{\xi}(\theta) = \frac{DD - RR}{RR}$, where $DD(\theta) = N_{\rm pairs}(\theta) / N_D(N_D-1)/2$. Here, $N_D$ is the total number of galaxies in the dataset, and $N_{\rm pairs}(\theta)$ is the number of galaxy pairs with separation lying in a bin centered on $\theta$. $RR(\theta)$ is the same quantity computed in a "random catalog," covering the same field of view but with uniformly randomly distributed positions.
  • We'll use Mike Jarvis' TreeCorr code (Jarvis et al 2004) to compute this correlation function estimator efficiently. You can read more about better estimators starting from the TreeCorr wiki.

In [5]:
# !pip install --upgrade TreeCorr

Random Catalogs

First we'll need a random catalog. Let's make it the same size as the data one.


In [6]:
random = pd.DataFrame({'ra' : ramin + (ramax-ramin)*np.random.rand(Ngals), 'dec' : decmin + (decmax-decmin)*np.random.rand(Ngals)})

In [7]:
print len(random), type(random)


987 <class 'pandas.core.frame.DataFrame'>

Now let's plot both catalogs, and compare.


In [8]:
fig, ax = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches(15, 6)
plt.subplots_adjust(wspace=0.2)
    
random.plot(kind='scatter', x='ra', y='dec', ax=ax[0], title='Random')
ax[0].set_xlabel('RA / deg')
ax[0].set_ylabel('Dec. / deg')

data.plot(kind='scatter', x='ra', y='dec', ax=ax[1], title='Data')
ax[1].set_xlabel('RA / deg')
ax[1].set_ylabel('Dec. / deg')


Out[8]:
<matplotlib.text.Text at 0x102a00a10>

Estimating $\xi(\theta)$


In [9]:
import treecorr

random_cat = treecorr.Catalog(ra=random['ra'], dec=random['dec'], ra_units='deg', dec_units='deg')
data_cat = treecorr.Catalog(ra=data['ra'], dec=data['dec'], ra_units='deg', dec_units='deg')

# Set up some correlation function estimator objects:

sep_units='arcmin'
min_sep=0.5
max_sep=10.0
N = 7
bin_size = np.log10(1.0*max_sep/min_sep)/(1.0*N)

dd = treecorr.NNCorrelation(bin_size=bin_size, min_sep=min_sep, max_sep=max_sep, sep_units=sep_units, bin_slop=0.05/bin_size)
rr = treecorr.NNCorrelation(bin_size=bin_size, min_sep=min_sep, max_sep=max_sep, sep_units=sep_units, bin_slop=0.05/bin_size)

# Process the data:
dd.process(data_cat)
rr.process(random_cat)

# Combine into a correlation function and its variance:
xi, varxi = dd.calculateXi(rr)

In [10]:
plt.figure(figsize=(15,8))
plt.rc('xtick', labelsize=16) 
plt.rc('ytick', labelsize=16)
plt.errorbar(np.exp(dd.logr),xi,np.sqrt(varxi),c='blue',linewidth=2)
# plt.xscale('log')
plt.xlabel('$\\theta / {\\rm arcmin}$',fontsize=20)
plt.ylabel('$\\xi(\\theta)$',fontsize=20)
plt.ylim([-0.1,0.2])
plt.grid(True)


Q: Are galaxies uniformly randomly distributed?

Discuss the clustering signal (or lack thereof) in the above plot with your neighbor. What would you want to do better, in a second pass at this?


In [ ]: