Pipeline Tutorial

This notebook serves as a tutorial for the ugali pipeline analysis. Its purpose is to document the steps needed to perform an end-to-end likelihood analysis on a raw data set. The ugali pipeline consists of the following steps:

  • run_01.0_download_data.py - Download data from the database
  • run_02.0_preprocess.py - Preprocess the input data into the structure that ugali expects
  • run_03.0_likelihood.py - Run the likelihood grid search over the data set
  • run_04.0_peak_finder.py - Search the output likelihood grid for peaks in 3D
  • run_05.0_followup.py - Run MCMC parameter fitting on the candidates
  • run_06.0_simulate.py - Run and analyze simulations

Step 0: Setup

Our first step is to do some generic notebook imports, create a working directory, and download some test data from GitHub.


In [ ]:
%matplotlib inline
import os

import numpy as np
import healpy as hp
import fitsio

In [ ]:
# Create the working directory and chdir into it
!mkdir -p work
os.chdir('./work')

In [ ]:
# Download the test data
!wget https://github.com/DarkEnergySurvey/ugali/releases/download/v1.7.0/ugali-test-data.tar.gz
!tar -xzf ugali-test-data.tar.gz && rm -f ugali-test-data.tar.gz

For easier access, we'll create symlinks to some of the ugali components that we'll be using.


In [ ]:
# Links to package components
!rm -f ugali && ln -s ../ugali
!rm -f pipeline && ln -s ugali/pipeline
# Link to configuration and source model files
!rm -f config.yaml && ln -s ../tests/config.yaml
!rm -f srcmdl.yaml && ln -s ../tests/srcmdl.yaml

Let's take a look at what we've got.


In [ ]:
!ls -lh *

The components of our working directory are now:

  • healpix - this directory contains the catalog data divided into files by healpix pixel (and labeled as such)
  • mask - mask of the magnitude limit in "sparse" healpix representation corresponding to each catalog pixel
  • pipeline - pipeline scripts
  • config.yaml - pipeline configuration file
  • srcmdl.yaml - source model file

We can examine some of these constituents to get a better feel for the files we are working with.


In [ ]:
# Investigate the catalog
data = fitsio.read('healpix/catalog_hpx0687.fits')
print(data.dtype.names)
print(data[:3])

# Displaying the catalog as a healpix map
nside=4096
c = np.zeros(hp.nside2npix(nside))
pix,cts = np.unique(data['PIX4096'],return_counts=True)
c[pix] = cts

hp.cartview(c,lonra=[53.5,55],latra=[-55,-53.5])

In [ ]:
# Investigate the magnitude limit mask
mask = fitsio.read('mask/maglim_g_hpx0687.fits')
print(mask.dtype.names)
print(mask[:3])

# Displaying the magnitude limit mask as a healpix map
m = np.zeros(hp.nside2npix(nside))
m[mask['PIXEL']] = mask['MAGLIM']

hp.cartview(m,lonra=[53.5,55],latra=[-55,-53.5])

Configuration

The configuration file is the key to the ugali pipeline analysis. We are going to use config.yaml. This file contains the path to the catalog and mask files that we will use, as well as other configuration parameters for the various pipeline steps. In subsequent steps we will refer to specific configuration parameters in this file.


In [ ]:
!head -n 25 config.yaml
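
Since the configuration is plain YAML, it can also be loaded directly in Python if you want to inspect individual sections programmatically. A minimal sketch, assuming the file contains only standard YAML and that PyYAML is installed:


In [ ]:
# Load the configuration directly (sketch; assumes standard YAML and PyYAML)
import yaml

with open('config.yaml') as f:
    config = yaml.safe_load(f)

print(sorted(config.keys()))
print(config['catalog']['dirname'], config['catalog']['basename'])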

Step 1: Downloading Data

While there is a pipeline step for downloading some data sets (e.g., from SDSS and DES), this is mostly vestigial code that has been left as an example for the user. In almost all cases the user will provide their own data set from some other source.

The requirements of the input data are kept fairly minimal, and can be found in the config['catalog'] section:

catalog:
  dirname: ./healpix                    # directory of catalog files                 
  basename: "catalog_hpx%04i.fits"      # catalog file basename format               
  objid_field       : COADD_OBJECT_ID   # unique object identifier                   
  lon_field         : RA                # name of longitude field                    
  lat_field         : DEC               # name of latitude field                     
  # Color always defined as mag_1 - mag_2                                            
  mag_1_band        : &band1 g          # name of mag_1 filter band                  
  mag_1_field       : WAVG_MAG_PSF_G    # name of mag_1 field                        
  mag_err_1_field   : WAVG_MAGERR_PSF_G # name of magerr_1 field                     
  mag_2_band        : &band2 r          # name of mag_2 filter band                  
  mag_2_field       : WAVG_MAG_PSF_R    # name of mag_2 field                        
  mag_err_2_field   : WAVG_MAGERR_PSF_R # name of magerr_2 field                     
  # True = band 1 is detection band; False = band 2 is detection band                
  band_1_detection  : &detection True   # detection occurs in band_1?                
  mc_source_id_field: MC_SOURCE_ID      # field for simulated objects                
  # Additional selection parameters to get a sample of stars                         
  selection         : "(np.abs(self.data['WAVG_SPREAD_MODEL_I']) < 0.003)"
  • objid_field - This is some unique identifier for objects in the catalog
  • lon_field, lat_field - The longitude and latitude of each object in the catalogs. This is usually RA and DEC with coordsys = CEL; however, Galactic coordinates are also supported (though less well tested).
  • mag_[1,2]_band, mag_[1,2]_field, mag_err_[1,2]_field - These are columns corresponding to the magnitude and magnitude error of each object in the catalog. Magnitudes are assumed to be extinction corrected.
  • selection - This parameter can be used to specify any additional selection (e.g., star-galaxy classification, quality cuts) that needs to be applied to the catalog before the analysis.
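
To make these requirements concrete, the sketch below writes a tiny catalog containing just the columns named above. It is purely illustrative (the column names come from the config snippet; the values are made up), but a file with this structure is all ugali needs per catalog pixel.


In [ ]:
# Illustrative minimal catalog with the columns from config['catalog'] (made-up values)
import numpy as np
import fitsio

n = 5
cat = np.zeros(n, dtype=[('COADD_OBJECT_ID', 'i8'),
                         ('RA', 'f8'), ('DEC', 'f8'),
                         ('WAVG_MAG_PSF_G', 'f4'), ('WAVG_MAGERR_PSF_G', 'f4'),
                         ('WAVG_MAG_PSF_R', 'f4'), ('WAVG_MAGERR_PSF_R', 'f4'),
                         ('WAVG_SPREAD_MODEL_I', 'f4')])
cat['COADD_OBJECT_ID'] = np.arange(n)
cat['RA'] = 54.0 + 0.1 * np.arange(n)
cat['DEC'] = -54.0
cat['WAVG_MAG_PSF_G'] = 22.0
cat['WAVG_MAG_PSF_R'] = 21.5
cat['WAVG_MAGERR_PSF_G'] = cat['WAVG_MAGERR_PSF_R'] = 0.05

fitsio.write('example_catalog.fits', cat, clobber=True)
print(fitsio.read('example_catalog.fits').dtype.names)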

Step 2: Preprocessing

In order to run ugali, the data need to be preprocessed into the directory structure that ugali expects; this is done by the run_02.0_preprocess.py script. You can run one component at a time or several at once using the --run <XXX> (shorthand -r) option. For example:

python pipeline/run_02.0_preprocess.py config.yaml -r pixelize

will run just pixelize, while

python pipeline/run_02.0_preprocess.py config.yaml -r pixelize -r split

will run both pixelize and split in order.

This is a general feature of the ugali pipeline.

Step 2.1: Pixelizing the data

Input data can be organized in any arbitrary structure; however, ugali expects the data to be spatially sorted into healpix pixels and stored in files with appropriate names. The raw input data files are specified in config.yaml under config['data']['dirname']:

data:
  dirname: /home/s1/kadrlica/projects/bliss/dsphs/v2/raw

The code will glob for all files in dirname ending in *.fits.

The output resolution of the pixelized catalog files is defined in config.yaml:

coords:
  nside_catalog   : 32     # Size of patches for catalog binning

The output pixelized files will be placed in config['catalog']['dirname'] conforming to config['catalog']['basename'].

catalog:
  dirname: /home/s1/kadrlica/projects/bliss/dsphs/v2/healpix
  basename: "cat_hpx_%05i.fits"

In this test case the input data is already pixelized, so this is a trivial operation.
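
Conceptually, pixelization just assigns each object to an nside_catalog healpix pixel based on its coordinates and writes the objects in each pixel to a separate file named according to config['catalog']['basename']. A rough sketch of the idea, using the test catalog we read earlier (illustrative only, not the pipeline's implementation):


In [ ]:
# Sketch of how objects map to catalog pixels at nside_catalog = 32
# (illustrative only; the pipeline handles this internally)
import numpy as np
import healpy as hp
import fitsio

nside_catalog = 32
data = fitsio.read('healpix/catalog_hpx0687.fits')
pix = hp.ang2pix(nside_catalog, data['RA'], data['DEC'], lonlat=True)
for p in np.unique(pix):
    # In our test data everything should land in pixel 687, matching the file name
    print("nside=%i pixel %04i contains %i objects" % (nside_catalog, p, (pix == p).sum()))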


In [ ]:
!python pipeline/run_02.0_preprocess.py config.yaml -r pixelize

Step 2.2: Creating the magnitude limit mask

Ugali relies on some knowledge of the footprint of the data being analyzed. This is used to calculate the model-predicted stellar counts and to exclude regions that are not covered by the data set.

The mask takes as input the footprint of the data set, specified in:

data:
  footprint: ./maps/footprint_n4096_equ.fits.gz

The magnitude limit mask can be derived from the footprint in 3 different ways:

  1. A uniform maglim for every pixel in the footprint (this default maglim is set per survey in ugali/utils/constants.py). This is run with the -r simple option to run_02.0_preprocess.py.
  2. Derived from the data using the magnitude error. This is run with the -r maglim option.
  3. From a pre-derived magnitude limit file. This is just splitting an all-sky healpix map into a set of partial healpix maps.

The output of this step will be placed in:

mask:
  dirname    : ./mask
  basename_1 : "maglim_g_hpx%04i.fits"
  basename_2 : "maglim_r_hpx%04i.fits"
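
As a rough illustration of option 1 above, the sketch below builds a "sparse" (partial) healpix mask with a uniform magnitude limit over the pixels covered by our catalog file. The column names mirror the mask file we inspected earlier; the maglim value of 23.0 is an arbitrary placeholder, not the survey default.


In [ ]:
# Illustrative sparse mask with a uniform magnitude limit (placeholder value, not the pipeline code)
import numpy as np
import fitsio

data = fitsio.read('healpix/catalog_hpx0687.fits')
pixels = np.unique(data['PIX4096'])     # nside=4096 pixels covered by the catalog

sparse = np.zeros(len(pixels), dtype=[('PIXEL', 'i8'), ('MAGLIM', 'f4')])
sparse['PIXEL'] = pixels
sparse['MAGLIM'] = 23.0                 # placeholder uniform magnitude limit

fitsio.write('example_maglim_g.fits', sparse, clobber=True)
print(sparse[:3])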

In [ ]:
!python pipeline/run_02.0_preprocess.py config.yaml -r simple

Step 3: Likelihood Grid Scan

Our next step will be to perform a likelihood grid scan over our test data set. Before executing the script, let's figure out what the arguments are:


In [ ]:
!python pipeline/run_03.0_likelihood.py --help

The command that we will run is:

python pipeline/run_03.0_likelihood.py config.yaml -r scan -q local

Breaking this down:

  • python pipeline/run_03.0_likelihood.py - the script itself
  • config.yaml - the first positional argument of all the pipeline scripts is the configuration file
  • -r scan - the component of the pipeline we want to run; in this case the grid scan
  • -q local - we are going to execute this component locally

Now to run the script...


In [ ]:
!python pipeline/run_03.0_likelihood.py config.yaml -r scan -q local

It will take a while for the script to run. The result will be a set of output files in the scan directory.


In [ ]:
!ls -lh scan
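
If you want a closer look at what the scan produced, something like the following works, assuming the outputs are FITS tables (adjust the pattern to match your configuration):


In [ ]:
# Peek at the first scan output (assumes FITS outputs; adjust if your config differs)
import glob
import fitsio

files = sorted(glob.glob('scan/*.fits'))
print(files[:5])
if files:
    out = fitsio.read(files[0])
    print(out.dtype.names)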

The next step is to merge the output likelihood files.


In [ ]:
!python pipeline/run_03.0_likelihood.py config.yaml -r merge -q local

In [ ]:
import os
# Make sure we are in the working directory
os.chdir('../work')

Step 4: Find Likelihood Candidates

After the likelihood scan is run we are left with a merged healpix file containing the likelihood evaluated at each healpix coordinate and distance modulus. We can search this likelihood cube for overdensities corresponding to satellite candidates.
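
To give a feel for what "finding peaks in 3D" means, here is a purely conceptual sketch (not the pipeline's algorithm, and assuming scipy is available): given a likelihood cube with axes (distance modulus, lon, lat), local maxima above a threshold can be located with a maximum filter. The array below is synthetic.


In [ ]:
# Conceptual peak finding in a synthetic 3D likelihood cube (not the pipeline implementation)
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
cube = rng.normal(0, 1, size=(10, 50, 50))   # axes: (distance modulus, lon, lat)
cube[4, 25, 30] = 25.0                       # inject a fake "candidate"

local_max = (cube == ndimage.maximum_filter(cube, size=3))
peaks = np.argwhere(local_max & (cube > 10)) # local maxima above a threshold
print(peaks)

In practice the pipeline script handles this for us: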


In [ ]:
!python pipeline/run_04.0_peak_finder.py config.yaml

Step 5: Fit Candidate Parameters

After identifying satellite candidates, we would like to fit their observed parameters (location, distance, size, luminosity, etc.). This is implemented by calculating the posterior probability with MCMC sampling. The first step is to create a "source model" file specifying the parameters of the satellite model. An example source model is provided in srcmdl.yaml, which we linked into our working directory above.
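
We can take a look at this source model file before running the fit:


In [ ]:
!cat srcmdl.yaml

With the source model and the target list in hand, we run the MCMC follow-up: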


In [ ]:
!python pipeline/run_05.0_followup.py config.yaml -t targets.txt -q local -r mcmc