Running target selection and working with desitarget

Author: Adam D. Myers, University of Wyoming

This Notebook describes how to produce Target Selection files for a Data Release, focusing on Data Release 7 (DR7) of the Legacy Surveys. In addition, the final section contains tips and tricks for working with the code.

Setting up your environment

First, ensure that your environment matches a standard DESI environment. For example:

module unload desimodules
source /project/projectdirs/desi/software/desi_environment.sh 18.7

desitarget relies on desiutil and desimodel, so you may also need to set up a wider DESI environment, as detailed at:

https://desi.lbl.gov/trac/wiki/Pipeline/GettingStarted/Laptop/JuneMeeting

It may also be useful to set up some additional environment variables that are used in some versions of the desitarget code (you could also place these in your .bash_profile.ext file):

export DESIMODEL=$HOME/git/desimodel
export DUST_DIR=/project/projectdirs/desi/software/edison/dust/v0_1/maps
export GAIA_DIR=/project/projectdirs/desi/target/gaia_dr2

Here, I've set DESIMODEL to a reasonable location. For a more detailed description of checking out the desimodel data files from svn see:

https://desi.lbl.gov/trac/wiki/Pipeline/GettingStarted/Laptop/JuneMeeting#Datafilesfordesimodel
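Before running any of the commands below, it can be worth confirming that these variables are actually visible in your session. This is a generic Python check, not part of desitarget:

```python
import os

# Generic sanity check (not part of desitarget): report which of the
# environment variables exported above are unset or empty.
def missing_env(names, environ=None):
    """Return the subset of names that are unset or empty in environ."""
    environ = os.environ if environ is None else environ
    return [n for n in names if not environ.get(n)]

print(missing_env(["DESIMODEL", "DUST_DIR", "GAIA_DIR"]))
```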

Basic production of target files at the command line

Here is an example of running an entire DR7 release using file paths at NERSC and a tagged version of the desitarget code (0.22.0 for this example).

First set up the needed directories and a version number:

export LSDIR='/global/project/projectdirs/cosmo/data/legacysurvey/dr7'
export TARGDIR='/project/projectdirs/desi/target/catalogs/dr7.1/0.22.0'
export WEBDIR='/project/projectdirs/desi/www/users/adamyers'
export DR='7.1'
export VERSION='0.22.0'

Now swap the currently loaded version of desitarget for the tagged version:

module swap desitarget/$VERSION

Make the directory for output targets, if it doesn't exist:

mkdir $TARGDIR

From this point forward, we will be running production-level code at NERSC. You may want to familiarize yourself with how to run batch jobs at NERSC. A straightforward way to run the code is to request interactive nodes. For example, to reserve one interactive node for 2 hours and 5 minutes:

salloc -N 1 -C haswell -t 02:05:00 --qos interactive -L SCRATCH,project

Now, to produce the file of targets:

select_targets $LSDIR/sweep/$DR/ $TARGDIR/targets-dr$DR-$VERSION.fits

(this takes a little over an hour on 1 Cori node, running with the default 32 processes per node).

You will now have produced an output targets file in $TARGDIR/targets-dr$DR-$VERSION.fits.

You should remember to change permissions on the file if other people will use it, e.g.:

chgrp desi $TARGDIR/targets-dr$DR-$VERSION.fits ; chmod 660 $TARGDIR/targets-dr$DR-$VERSION.fits

In addition to the select_targets command line code, there are select_cmx_targets and select_sv_targets versions that can be used for commissioning and SV.

Parallelizing target files across nodes

desitarget already parallelizes across sweep files, using multiple processors on a single node. But it can further parallelize target selection by HEALPixel across multiple nodes. In addition, this form of parallelization can be used to produce smaller target files without having to write a single monolithic output file of all targets. The bundlefiles option can be used to produce a slurm script, and some general guidance, for parallelizing target selection across nodes using HEALPixels.

First, as in the previous section, set up the needed directories and a version number. For example:

export LSDIR='/global/project/projectdirs/cosmo/data/legacysurvey/dr7'
export DR='7.1'
export VERSION='0.22.0'

module swap desitarget/$VERSION

Now, pass select_targets the bundlefiles option and an nside corresponding to the HEALPixels across which you want to parallelize the code. The bundlefiles option specifies roughly how many input sweep files will be processed per set of HEALPixels (i.e., for bundlefiles=100, the sweep files are bundled across HEALPixels so that approximately 100 files touch each set of pixels). The optimal nside is likely to correspond to the area covered by a single sweep file (i.e., nside=8 for about 50 square degrees). Some trial and error with the value of bundlefiles may be necessary given the available resources. For instance:

select_targets $LSDIR/sweep/$DR/ $CSCRATCH/blat.fits --bundlefiles 100 --nside 2

will produce the following output:

INFO:cuts.py:2306:select_targets: Running on the main survey
#######################################################
Numbers of bricks or files in each set of HEALPixels:

['17: 42', '18: 40', '21: 18', 'Total: 100', 'Estimated time to run in hours (for 32 processors per node): 0.28h']
['19: 38', '27: 38', '30: 24', 'Total: 100', 'Estimated time to run in hours (for 32 processors per node): 0.28h']
['4: 36', '8: 37', '28: 1', '35: 26', 'Total: 100', 'Estimated time to run in hours (for 32 processors per node): 0.28h']
['12: 25', '25: 36', '26: 36', '33: 3', 'Total: 100', 'Estimated time to run in hours (for 32 processors per node): 0.28h']
['0: 23', '16: 25', '22: 19', '23: 10', '31: 23', 'Total: 100', 'Estimated time to run in hours (for 32 processors per node): 0.28h']
['2: 13', '5: 14', '9: 15', '10: 15', '14: 2', '20: 6', '24: 16', '45: 2', '47: 17', 'Total: 100', 'Estimated time to run in hours (for 32 processors per node): 0.28h']
['6: 13', '13: 13', '29: 12', '34: 5', '39: 11', '43: 12', 'Total: 66', 'Estimated time to run in hours (for 32 processors per node): 0.20h']


#######################################################
Possible salloc command if you want to run on the interactive queue:

salloc -N 7 -C haswell -t 01:00:00 --qos interactive -L SCRATCH,project

#######################################################
Example shell script for slurm:

#!/bin/bash -l
#SBATCH -q regular
#SBATCH -N 7
#SBATCH -t 01:00:00
#SBATCH -L SCRATCH,project
#SBATCH -C haswell

srun -N 1 select_targets /global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1 $CSCRATCH/targets-dr7-hp-17,18,21.fits --numproc 32 --nside 2 --healpixels 17,18,21 &
srun -N 1 select_targets /global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1 $CSCRATCH/targets-dr7-hp-19,27,30.fits --numproc 32 --nside 2 --healpixels 19,27,30 &
srun -N 1 select_targets /global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1 $CSCRATCH/targets-dr7-hp-4,8,28,35.fits --numproc 32 --nside 2 --healpixels 4,8,28,35 &
srun -N 1 select_targets /global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1 $CSCRATCH/targets-dr7-hp-12,25,26,33.fits --numproc 32 --nside 2 --healpixels 12,25,26,33 &
srun -N 1 select_targets /global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1 $CSCRATCH/targets-dr7-hp-0,16,22,23,31.fits --numproc 32 --nside 2 --healpixels 0,16,22,23,31 &
srun -N 1 select_targets /global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1 $CSCRATCH/targets-dr7-hp-2,5,9,10,14,20,24,45,47.fits --numproc 32 --nside 2 --healpixels 2,5,9,10,14,20,24,45,47 &
srun -N 1 select_targets /global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1 $CSCRATCH/targets-dr7-hp-6,13,29,34,39,43.fits --numproc 32 --nside 2 --healpixels 6,13,29,34,39,43 &
wait

#gather_targets '$CSCRATCH/targets-dr7-hp-17,18,21.fits;$CSCRATCH/targets-dr7-hp-19,27,30.fits;$CSCRATCH/targets-dr7-hp-4,8,28,35.fits;$CSCRATCH/targets-dr7-hp-12,25,26,33.fits;$CSCRATCH/targets-dr7-hp-0,16,22,23,31.fits;$CSCRATCH/targets-dr7-hp-2,5,9,10,14,20,24,45,47.fits;$CSCRATCH/targets-dr7-hp-6,13,29,34,39,43.fits' $CSCRATCH/targets-dr7.fits targets

The top half of this output tells you the approximate time it will take to process the inputs, and gives an salloc command for requesting nodes. The second half provides an example slurm script to run on the requested nodes. So, to use this information, you would place everything below #!/bin/bash -l in a file called, say, slurm.script. You would then execute salloc -N 7 -C haswell -t 01:00:00 --qos interactive -L SCRATCH,project at the command line, wait to be allocated nodes, and then run slurm.script on your allocated nodes.

A monolithic output file of all of the targets can be produced by removing the comment (#) before the gather_targets command in the last line of the slurm script. However, this should not prove necessary, as desitarget has tools for working with the HEALPixel-split parallelized output (see the next section, Useful commands for reading and producing subsets of targets).

Parallelized across nodes in this manner, it takes of order 20 minutes to run targeting (for nside=2), as compared to about 1.5 hours for a basic run of select_targets.

Note that the information in this section also applies to the select_sv_targets command line code that runs targets for SV.
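Conceptually, the bundling performed by the bundlefiles option amounts to grouping HEALPixels so that roughly the requested number of sweep files lands in each group (one group per node). The greedy sketch below illustrates the idea with per-pixel file counts like those in the example output; it is not the actual desitarget algorithm, and the pixel numbers are purely illustrative.

```python
# Simplified sketch of bundling HEALPixels so that each group of pixels
# touches roughly `bundlefiles` sweep files. The real desitarget code may
# use a different strategy; this greedy first-fit illustrates the idea.
def bundle(pixcounts, bundlefiles=100):
    """Group (pixel, nfiles) pairs so each group totals ~bundlefiles files."""
    groups, current, total = [], [], 0
    # process the busiest pixels first, as greedy packing works best that way
    for pix, n in sorted(pixcounts.items(), key=lambda kv: -kv[1]):
        if total + n > bundlefiles and current:
            groups.append(current)
            current, total = [], 0
        current.append(pix)
        total += n
    if current:
        groups.append(current)
    return groups

# counts mimicking the first row of the example output: 42 + 40 + 18 = 100
print(bundle({17: 42, 18: 40, 21: 18}))
```

Each returned group would then be passed to one srun invocation via the --healpixels flag.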

Useful commands for reading and producing subsets of targets

desitarget has tools for producing subsets of targets limited by Right Ascension (RA) and Declination (Dec), or by HEALPixel. In addition, desitarget has wrappers to read subsets of the HEALPixel-partitioned targets produced from a parallelized run of select_targets, as described in the previous section.

As a general reminder, note that desihub code always uses the NESTED HEALPixel scheme.
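One reason the NESTED scheme is convenient for this kind of work is that changing resolution is simple integer arithmetic: a pixel at a given nside splits into four consecutive subpixels at twice the nside, so degrading a pixel index is just integer division by 4. This is a standard property of NESTED (not RING) indexing:

```python
# In the NESTED scheme, pixel `pix` at some nside lies inside pixel pix // 4
# at nside // 2, and its four subpixels at 2 * nside are 4*pix .. 4*pix + 3.
# (This holds only for NESTED ordering, not for RING ordering.)
def parent_pixel(pix):
    """NESTED parent of pix, one nside level coarser."""
    return pix // 4

def child_pixels(pix):
    """The four NESTED subpixels of pix, one nside level finer."""
    return [4 * pix + i for i in range(4)]

# e.g. pixel 17 at nside=2 (as in the slurm example above) contains
# pixels 68-71 at nside=4
print(child_pixels(17))
```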

Setting up some example files

To demonstrate this functionality, let's first produce a monolithic set of targets, as detailed in the Basic production of target files at the command line section. You can use any version of the code after 0.27.1 for this example (provided you use a consistent version throughout this section), and we'll store the outputs on scratch disk. The commands you'll need are (in brief; see the previous sections for more detail):

export LSDIR='/global/project/projectdirs/cosmo/data/legacysurvey/dr7'
export DR='7.1'

salloc -N 1 -C haswell -t 02:00:00 --qos interactive -L SCRATCH,project

select_targets $LSDIR/sweep/$DR/ $CSCRATCH/master-targets.fits

Let's also produce a HEALPixel-split version of the files, as detailed in the Parallelizing target files across nodes section. To do this, run select_targets with the bundlefiles option, for example:

select_targets $LSDIR/sweep/$DR/ $CSCRATCH/blat.fits --bundlefiles 100 --nside 2

and execute the produced slurm script on a set of interactive nodes. Finally, for the purposes of the rest of this tutorial, move the output files (targets-dr7-hp-0,16,22,23,31.fits, targets-dr7-hp-17,18,21.fits, etc.) to a directory called $CSCRATCH/targsplit.

Reading a subset of target files

desitarget has several useful wrappers to read the output of the parallelized or "bundled" code. Options include reading in a subset of HEALPixels, a "box" in RA/Dec or a cap (a "circle" on the sky) specified by RA/Dec/radius in degrees. Let's enter the Python prompt and try each of these options:


In [1]:
# Set up the directory of HEALPixel-split files we created in the previous subsection.
import os
hpdirname = os.path.join(os.getenv("CSCRATCH"), "targsplit")

# The various pieces of code for reading in targets.
from desitarget.io import read_targets_in_hp, read_targets_in_box, read_targets_in_cap

# Read targets by HEALPixel.
nside, pixlist = 16, [2303, 2557, 3063]
hptargs = read_targets_in_hp(hpdirname, nside, pixlist)

# Read targets in an RAmin, RAmax, Decmin, Decmax "rectangle".
radecbox = [120,122,32,34]
boxtargs = read_targets_in_box(hpdirname, radecbox)

# Read targets in an RA, Dec, radius (degrees) "circle".
radecrad = [121,33,0.5]
captargs = read_targets_in_cap(hpdirname, radecrad)

The default option for extracting targets is to return all columns in a target file, but it's also possible to return just a subset of columns by passing them as a list:


In [2]:
columns = ["DESI_TARGET", "TARGETID"]
hptargs = read_targets_in_hp(hpdirname, nside, pixlist, columns=columns)
boxtargs = read_targets_in_box(hpdirname, radecbox, columns=columns)
captargs = read_targets_in_cap(hpdirname, radecrad, columns=columns)

It is relatively straightforward to check that these commands worked, using the master file of targets that we created in the previous subsection. It is, of course, much quicker to read in subsets of targets rather than the entire master file!


In [3]:
import fitsio
mastername = os.path.join(os.getenv("CSCRATCH"), "master-targets.fits")
mastertargs = fitsio.read(mastername, "TARGETS")

In [4]:
# Note that the geomask module has several useful functions
# for testing whether objects are in a cap, box, pixel etc.
from desitarget.geomask import is_in_cap
ii = is_in_cap(mastertargs, radecrad)

# Did we return the exact same set of TARGETIDs?
capset = set(captargs["TARGETID"])
masterset = set(mastertargs[ii]["TARGETID"])
# Are the sets identical?
capset == masterset


Out[4]:
True
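The DESI_TARGET column used in these examples is a bitmask, with named bits available in desitarget (from the desitarget.targetmask module, e.g. desi_mask.QSO). The check itself is plain bitwise arithmetic, as the sketch below shows; note that the bit assignments here are hypothetical stand-ins for illustration, not the official desitarget values:

```python
# DESI_TARGET is a bitmask; in desitarget the named bits come from the
# desitarget.targetmask module (e.g. desi_mask.QSO). The bit values below
# are hypothetical stand-ins used purely for illustration.
LRG_BIT, ELG_BIT, QSO_BIT = 2**0, 2**1, 2**2

def is_qso_target(desi_target):
    """True where the (hypothetical) QSO bit is set in DESI_TARGET."""
    return (desi_target & QSO_BIT) != 0

# an object flagged as both LRG and QSO still passes the QSO check
print(is_qso_target(LRG_BIT | QSO_BIT))
```

The same bitwise test works elementwise on the numpy arrays returned by the read wrappers above.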

Running a subset of target files

It's certainly true that reading in a subset of targets from previously processed HEALPixel-split files is quicker than reading in all of the targets. But in certain situations it may be even quicker to only process that subset of targets in the first place.

desitarget can process target files only in certain regions of the sky, to completely circumvent the need to produce very large target files. As in the previous section, options include running in a subset of HEALPixels, a "box" in RA/Dec or a cap (a "circle" on the sky) specified by RA/Dec/radius in degrees. Although many of the commands described below are probably sufficiently fast to run on a login node, let's grab an interactive node for this section of the tutorial:

salloc -N 1 -C haswell -t 00:20:00 --qos interactive -L SCRATCH,project

As before, we'll set up some environment variables that point to the location of Legacy Surveys imaging Data Release files at NERSC:

export LSDIR='/global/project/projectdirs/cosmo/data/legacysurvey/dr7'
export DR='7.1'

First, let's try the HEALPixels option. We'll produce just the targets in HEALPixel 16 for nside=2.

select_targets $LSDIR/sweep/$DR/ $CSCRATCH/targets-hp-16.fits --nside 2 --healpixels 16

Note that this only takes about 4 minutes to run (on the default 32 processors), as compared to an hour or more when running the full set of targets.

There are similar options for an (RAmin, RAmax, Decmin, Decmax) "box":

select_targets $LSDIR/sweep/$DR/ $CSCRATCH/targets-radecbox-120-122-32-34.fits --radecbox 120,122,32,34

and for a cap (a "circle" on the sky) specified by RA/Dec/radius in degrees:

select_targets $LSDIR/sweep/$DR/ $CSCRATCH/targets-radecrad-121-33-0.5.fits --radecrad 121,33,0.5

These last two commands should run even faster (in 1-2 minutes) as they are querying much smaller areas than for the HEALPixel example.

Note that all of the above options will work with the SV version of the code (select_sv_targets) in addition to the main survey version of the code (select_targets).

We can check that these commands returned the expected outcomes using the same method as we used in the previous subsection to verify reading in subsets of targets. For example, returning to the Python prompt:


In [5]:
import os, fitsio
from desitarget.geomask import is_in_box

mastername = os.path.join(os.getenv("CSCRATCH"), "master-targets.fits")
mastertargs = fitsio.read(mastername, "TARGETS")
radecbox = [120, 122, 32, 34]
ii = is_in_box(mastertargs, radecbox)
masterset = set(mastertargs[ii]["TARGETID"])

boxname = os.path.join(os.getenv("CSCRATCH"), "targets-radecbox-120-122-32-34.fits")
boxtargs = fitsio.read(boxname, "TARGETS")
boxset = set(boxtargs["TARGETID"])

# Did we return the exact same set of TARGETIDs?
boxset == masterset


Out[5]:
True
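The is_in_box and is_in_cap tests used in these checks boil down to simple coordinate comparisons. For intuition, here are minimal stand-alone versions; these are simplified sketches, not the desitarget implementations (for instance, this box test ignores RA wraparound at 0/360 degrees):

```python
import numpy as np

# Minimal stand-ins for the geomask cuts (simplified; not the desitarget code).
def in_box(ra, dec, radecbox):
    """True for points inside an [RAmin, RAmax, Decmin, Decmax] box (degrees)."""
    ramin, ramax, decmin, decmax = radecbox
    return (ra >= ramin) & (ra < ramax) & (dec >= decmin) & (dec < decmax)

def in_cap(ra, dec, radecrad):
    """True for points inside an [RA, Dec, radius] cap (all in degrees)."""
    rac, decc, radius = map(np.radians, radecrad)
    ra, dec = np.radians(ra), np.radians(dec)
    # angular separation via the spherical law of cosines
    cossep = (np.sin(dec) * np.sin(decc)
              + np.cos(dec) * np.cos(decc) * np.cos(ra - rac))
    return np.arccos(np.clip(cossep, -1, 1)) < radius
```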

Production of all desitarget files at the command line

In addition to the basic targets file, you may want some of the other desitarget products.

To produce a random catalog:

select_randoms $LSDIR $CSCRATCH/blat.fits --bundlebricks 20000

Then, try a few different numbers in place of 20000 until you produce a script that is balanced across nodes. Once you have a balanced script, follow the onscreen instructions and move the output file to $TARGDIR/randoms-dr$DR-$VERSION.fits. This takes about 1.5-2.5 hours to run on about 10 Cori nodes with 32 processes per node.

To produce a HEALPixel map of weights from a random catalog:

make_imaging_weight_map $TARGDIR/randoms-dr$DR-$VERSION.fits $TARGDIR/targets-dr$DR-$VERSION.fits $TARGDIR/pixweight-dr$DR-$VERSION.fits

This takes about 20 minutes to run on one Cori node with 1 process per node.
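The idea behind the weight map is to down-weight sky pixels where fewer random points survived the imaging. make_imaging_weight_map works with HEALPix, but the core concept can be sketched with a crude RA/Dec grid; everything named below is illustrative, not the desitarget API:

```python
import numpy as np

# Illustrative only: count random points per sky pixel so that sparsely
# covered pixels can be down-weighted. A crude RA/Dec grid stands in for
# the HEALPix pixelization that make_imaging_weight_map actually uses.
def random_density_map(ra, dec, nra=36, ndec=18):
    """Counts of random points in an (nra x ndec) RA/Dec grid (degrees)."""
    counts, _, _ = np.histogram2d(
        ra, dec,
        bins=[np.linspace(0, 360, nra + 1), np.linspace(-90, 90, ndec + 1)],
    )
    return counts
```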

To use a HEALPixel weight file, a random catalog, and a target catalog to produce a set of QA webpages (e.g., see the DR7 QA pages):

run_target_qa $TARGDIR/targets-dr$DR-$VERSION.fits $WEBDIR/desitargetQA-dr$DR-$VERSION -w $TARGDIR/pixweight-dr$DR-$VERSION.fits

This takes about 1.5 hours on one Cori node with 1 process per node.

To select sky targets for a data release:

select_skies $LSDIR $CSCRATCH/blat.fits --bundlebricks 10000

Then, try a few different numbers in place of 10000 until you produce a script that is balanced across nodes. Once you have a balanced script, follow the onscreen instructions and move the output file to $TARGDIR/skies-dr$DR-$VERSION.fits. This takes about 3-4 hours to run on about 15 Cori nodes with 32 processes per node.

Finally, to produce a set of GFA targets:

select_gfas $LSDIR/sweep/$DR $TARGDIR/gfas-dr$DR-$VERSION.fits

This takes about 15 minutes to run on one Cori node with 32 processes per node.

Again, remember to change permissions on the file if other people will use it, e.g.:

chgrp desi *fits ; chmod 660 *fits

Other tips and tricks for working with command-line desitarget code

General

  • All desitarget executables listed in the previous section have a -h flag that will describe how to use that code, e.g.:
select_targets -h

select_targets

  • It is possible to select just one target class. For example, to select (set the targeting bits for) only the quasar target class:
select_targets $LSDIR/sweep/$DR/ $TARGDIR/targets-dr$DR-$VERSION.fits --tcnames QSO

The tcnames flag should be interpreted as "target class names."

  • Multiple target classes can also be selected by passing a comma-separated list. For example, to select only the LRG and ELG target classes:
select_targets $LSDIR/sweep/$DR/ $TARGDIR/targets-dr$DR-$VERSION.fits --tcnames LRG,ELG
  • Versions of select_targets also exist for commissioning (cmx) and Survey Validation (SV). An example of running the commissioning version of the code is:
select_cmx_targets $LSDIR/sweep/$DR/ $TARGDIR/cmx-targets-dr$DR-$VERSION.fits

The commissioning code runs very fast, so it has no options beyond the input directory and output file.

An example of running the SV version of the code is:

select_sv_targets $LSDIR/sweep/$DR/ $TARGDIR/sv2-targets-dr$DR-$VERSION.fits --iteration 2

The SV code has many of the same options as select_targets but has the additional flag --iteration to denote which version of SV is being run (e.g., the first iteration of SV corresponds to --iteration 1).

run_target_qa

  • A similar trick to the use of tcnames in the previous section can be used to produce web pages for a subset of targets. For example:
run_target_qa $TARGDIR/targets-dr$DR-$VERSION.fits $WEBDIR/desitargetQA-dr$DR-$VERSION -w $TARGDIR/pixweight-dr$DR-$VERSION.fits --bitnames LRG,ELG

(note that eventually --bitnames may be renamed --tcnames for consistency).

  • It isn't necessary to use a weight map if you don't have an appropriate one made. Just drop the -w flag, e.g.:
run_target_qa $TARGDIR/targets-dr$DR-$VERSION.fits $WEBDIR/desitargetQA-dr$DR-$VERSION

Without the map, the code will produce the same output but with equal weighting for all areas of the imaging footprint.

  • Producing plots is the slowest part of running targeting QA. To produce the webpages without making any new plots:
run_target_qa $TARGDIR/targets-dr$DR-$VERSION.fits $WEBDIR/desitargetQA-dr$DR-$VERSION --noplots
  • If a weight map is passed, run_target_qa produces plots of systematics across the imaging footprint by default. It can take some time to produce these plots, but they can be turned off with the --nosystematics flag:
run_target_qa $TARGDIR/targets-dr$DR-$VERSION.fits $WEBDIR/desitargetQA-dr$DR-$VERSION -w $TARGDIR/pixweight-dr$DR-$VERSION.fits --nosystematics

Directly calling desitarget selection functions

The main module that performs the selection work for desitarget is called cuts. It has many useful functions for target selection "under the hood" that can be called directly from Python. For example, the command-line call:

export LSDIR='/global/project/projectdirs/cosmo/data/legacysurvey/dr7'
export TARGDIR='/project/projectdirs/desi/target/catalogs/dr7.1/0.22.0'
export DR='7.1'
export VERSION='0.22.0'
select_targets $LSDIR/sweep/$DR/ $TARGDIR/targets-dr$DR-$VERSION.fits --tcnames LRG,ELG

is equivalent to the following set of commands at the Python prompt:

from desitarget.cuts import select_targets
from desitarget import io

# ADM inputs
nside = io.desitarget_nside()
src = '/global/project/projectdirs/cosmo/data/legacysurvey/dr7/sweep/7.1'
dest = '/project/projectdirs/desi/target/catalogs/dr7.1/0.22.0/targets-dr7.1-0.22.0.fits'
infiles = io.list_sweepfiles(src)

# ADM note that the 'main', here, is to distinguish from 'cmx' and 'svX' (with X=1,2,3 etc.)
# ADM for commissioning and SV. The code would default to 'main', anyway, without this keyword.
targets = select_targets(infiles, tcnames=["LRG", "ELG"], survey='main')

# ADM write the targets to file.
io.write_targets(dest, targets, indir=src, survey="main", nside=nside)