The goal of this notebook is to document how we ran dr8b, including calibration files, with the updated data model and using the burst buffer + Docker image on Cori.
John Moustakas
Siena College
2019 March
Many thanks to Dustin Lang, Adam Myers, Eddie Schlafly, David Schlegel, Martin Landriau, and Stephen Bailey.
First, choose a new, empty top-level working directory on project and create it:
mkdir -p /global/project/projectdirs/cosmo/work/legacysurvey/dr8b
cd /global/project/projectdirs/cosmo/work/legacysurvey/dr8b
Hereafter I will assume that all scripts are launched from this working directory, unless otherwise specified (starting in Section C).
Next, create soft links to the CP-reduced imaging data and the WISE background models:
mkdir -p images calib/wise
ln -s /global/project/projectdirs/cosmo/staging/90prime images/90prime
ln -s /global/project/projectdirs/cosmo/staging/mosaic images/mosaic
ln -s /global/project/projectdirs/cosmo/staging/decam images/decam
ln -s /project/projectdirs/cosmo/work/wise/unwise_catalog/dr1/mod calib/wise/modelsky
And finally, grab a copy of the DR-independent survey-bricks file, which we will need below:
cp /global/project/projectdirs/cosmo/work/legacysurvey/dr7/survey-bricks.fits.gz .
Initially, Dustin Lang created a 40TB burst-buffer reservation called "DR8". You can think of this as a mounted external drive where large files can be written and read without the significant I/O overhead associated with $SCRATCH, although eventually the results will be copied onto project, as described below.
To access this file system you first have to create a configuration file (this only needs to be done once):
echo "#DW persistentdw name=DR8" > bb.conf
The files in the burst buffer can be accessed only from an interactive Cori node. Let's do this (just once) and make a dedicated subdirectory to keep our outputs tidy:
salloc -N 1 -C haswell -q interactive -t 00:10:00 --bbf=bb.conf
mkdir -p $DW_PERSISTENT_STRIPED_DR8/dr8b
Note that the $DW_PERSISTENT_STRIPED_DR8 environment variable must always be used, as every user will have a different absolute path.
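As a quick sanity check (a minimal sketch; the exact mount point depends on your reservation), you can confirm from within the interactive session that the variable is set and the file system is mounted:
echo $DW_PERSISTENT_STRIPED_DR8
df -h $DW_PERSISTENT_STRIPED_DR8
ls $DW_PERSISTENT_STRIPED_DR8/dr8b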
For the record, a new reservation can be made (if necessary) by submitting the following SLURM script to the queue:
#! /bin/bash
#SBATCH -q debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB create_persistent name=BBNAME capacity=50000GB access_mode=striped type=scratch
where BBNAME is the desired name of the reservation.
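For example, if the script above is saved as create-bb.slurm (an illustrative filename), it can be submitted and monitored with:
sbatch create-bb.slurm
squeue -u $USER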
First, make sure your environment is clean! In other words, ensure that you don't explicitly load any modules or source any code in your initialization files (e.g., $HOME/.bash_profile.ext or $HOME/.bashrc.ext).
Most of the code dependencies we need are in the Docker container (including tractor and astrometry.net), as described below, but we also depend on recent (hopefully tagged) versions of legacyzpts, legacypipe, and qdo. To keep everything tidy, check out this code into a dedicated directory, but note that some of the shell scripts below assume this directory name!
mkdir -p code ; cd code
git clone git@github.com:legacysurvey/legacyzpts.git
git clone git@github.com:legacysurvey/legacypipe.git
git clone https://bitbucket.org/berkeleylab/qdo.git
Be sure to check out the appropriate branches of these software packages as necessary.
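For example (a sketch only; the branch and tag names here are placeholders, not the actual DR8 tags):
cd legacypipe && git fetch --tags && git checkout <branch-or-tag> && cd ..
cd legacyzpts && git fetch --tags && git checkout <branch-or-tag> && cd ..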
Next, pull the latest Docker container:
shifterimg pull docker:legacysurvey/legacypipe:latest
which one can enter (if desired) with the command
shifter --image docker:legacysurvey/legacypipe:latest bash
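As a quick check (a sketch, assuming python and these modules are on the container's default path), you can verify that the container indeed provides tractor and astrometry.net:
shifter --image docker:legacysurvey/legacypipe:latest python -c "import tractor, astrometry; print('container looks OK')"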
Although we use the Docker container for production, for a lot of the preparatory work and debugging it's often convenient to load all the software dependencies and myriad environment variables on the command line, outside of the Docker container. We provide a simple shell script for this purpose in legacypipe, which we recommend copying to the local working directory for convenient editing, temporary path changes, etc.:
cp code/legacypipe/doc/dr8/dr8-env.sh ./
source dr8-env.sh
This shell script has a single environment variable, $CODE_DIR, which points to the locally checked-out code and should be updated as necessary.
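For example (illustrative only; the actual contents of dr8-env.sh may differ), the relevant line would point $CODE_DIR at the checkout created above:
# inside dr8-env.sh (illustrative)
export CODE_DIR=/global/project/projectdirs/cosmo/work/legacysurvey/dr8b/code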
Finally, you will also need to add the following lines (and get the appropriate database password) in order to complete the qdo setup (for the desiconda user these lines are in the .bashrc.ext file):
export QDO_BACKEND=postgres
export QDO_BATCH_PROFILE=cori
export QDO_DB_HOST=nerscdb03.nersc.gov
export QDO_DB_NAME=desirun
export QDO_DB_USER=desirun_admin
export QDO_DB_PASS=XXXXXXXX
For DR8, Eddie Schlafly and David Schlegel inspected all the DECam, 90prime, and mosaic imaging data on disk and created three FITS files which should be used to define the input data:
We use these tables (cutting on QKEEP==True) to create our input image lists. To keep everything tidy, create a new image-lists subdirectory, copy (ToDo: specify location) the dr8-ondisk-*.fits files there, source the dr8-env.sh file, and then run the following snippet of code in an ipython session.
import os
import fitsio

# Write one relative image path per line, keeping only rows with QKEEP set.
for camera in ('decam', 'mosaic', '90prime'):
    data = fitsio.read('dr8-ondisk-{}.fits'.format(camera), upper=True)
    with open('image-list-{}.txt'.format(camera), 'w') as flist:
        for imfile in data[data['QKEEP']]['FILENAME']:
            flist.write('{}\n'.format(os.path.join(camera, imfile.decode('utf-8').strip())))
The resulting output files are:
We use qdo to manage the myriad tasks (one task = one image), and the shifter script dr8-zpts-shifter.sh to produce the calibration files we need. First, load the tasks into the database with
qdo load dr8b-calibs-decam ./image-list-decam.txt
qdo load dr8b-calibs-90prime ./image-list-90prime.txt
qdo load dr8b-calibs-mosaic ./image-list-mosaic.txt
Next, copy the relevant scripts into the working directory:
cp $LEGACYPIPE_DIR/doc/dr8/dr8-env-shifter.sh ./
cp $LEGACYPIPE_DIR/doc/dr8/dr8-zpts-shifter.sh ./
The script dr8-env-shifter.sh does not need to be modified, but in dr8-zpts-shifter.sh be sure to check that the dr, LEGACY_SURVEY_DIR, and CODE_DIR variables are correct and up to date.
Then, use various queues to get everything done. For example, for DECam it's not crazy to request 256 MPI tasks with 8 cores per task (equivalent to 256 x 8 / 32 = 64 Haswell nodes on Cori). Using the debug and regular queues for 30 and 120 minutes, respectively (and the burst buffer), would look like
QDO_BATCH_PROFILE=cori-shifter qdo launch dr8b-calibs-decam 256 --cores_per_worker 8 \
--walltime=00:30:00 --batchqueue=debug --keep_env --script ./dr8-zpts-shifter.sh \
--batchopts "--image=docker:legacysurvey/legacypipe:latest --bbf=bb.conf"
QDO_BATCH_PROFILE=cori-shifter qdo launch dr8b-calibs-decam 256 --cores_per_worker 8 \
--walltime=02:00:00 --batchqueue=regular --keep_env --script ./dr8-zpts-shifter.sh \
--batchopts "--image=docker:legacysurvey/legacypipe:latest --bbf=bb.conf"
The game, of course, is to balance throughput and wait time, although in general the debug queues work quite well, even on the DECam images (with ~60 CCDs each).
Alternatively, one could use the shared queue with
QDO_BATCH_PROFILE=cori-shifter qdo launch dr8b-calibs-decam 1 --cores_per_worker 8 \
--walltime=04:00:00 --batchqueue=shared --keep_env --script ./dr8-zpts-shifter.sh \
--batchopts "--image=docker:legacysurvey/legacypipe:latest -a 0-99 --bbf=bb.conf"
which may also work well in production.
Note that for the 90prime and mosaic cameras (which have only 4 CCDs each) a more typical request would be
QDO_BATCH_PROFILE=cori-shifter qdo launch dr8b-calibs-mosaic 512 --cores_per_worker 4 \
--walltime=00:30:00 --batchqueue=debug --keep_env --script ./dr8-zpts-shifter.sh \
--batchopts "--image=docker:legacysurvey/legacypipe:latest --bbf=bb.conf"
For the record, files will be written out with the following directory structure (all relative to $DW_PERSISTENT_STRIPED_DR8/dr8b):
zpts
  90prime
    CP*/[image-file]-annotated.fits
    CP*/[image-file]-photom.fits
    CP*/[image-file]-survey.fits
  decam
    CP*/[image-file]-annotated.fits
    CP*/[image-file]-photom.fits
    CP*/[image-file]-survey.fits
  mosaic
    CP*/[image-file]-annotated.fits
    CP*/[image-file]-photom.fits
    CP*/[image-file]-survey.fits
calib
  90prime
    psfex
    psfex-merged/?????/90prime-????????.fits
    se
    splinesky
    splinesky-merged/?????/90prime-????????.fits
  decam
    psfex
    psfex-merged/?????/decam-????????.fits
    se
    splinesky
    splinesky-merged/?????/decam-????????.fits
  mosaic
    psfex
    psfex-merged/?????/mosaic-????????.fits
    se
    splinesky
    splinesky-merged/?????/mosaic-????????.fits
The only files we care about, however, are the files in the zpts, splinesky-merged, and psfex-merged directories; the files in the SExtractor (se), psfex, and splinesky directories are intermediate and will be deleted in the future.
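When that time comes, the cleanup could look something like the following (a sketch only; double-check the paths before deleting anything):
for camera in 90prime decam mosaic; do
  rm -rf $DW_PERSISTENT_STRIPED_DR8/dr8b/calib/$camera/se
  rm -rf $DW_PERSISTENT_STRIPED_DR8/dr8b/calib/$camera/psfex
  rm -rf $DW_PERSISTENT_STRIPED_DR8/dr8b/calib/$camera/splinesky
done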
The following qdo commands may be useful:
qdo list
qdo status dr8b-calibs-decam # check on current status
qdo retry dr8b-calibs-decam # re-load failed jobs (presumably after debugging the code)
qdo recover dr8b-calibs-decam --dead # re-load jobs that hung because the queue timed out
Next, we need to merge all the individual zeropoint files to generate the all-powerful survey-ccds and annotated-ccds files, which we can accomplish with dr8-merge-zpts.sh:
$LEGACYPIPE_DIR/doc/dr8/dr8-merge-zpts.sh
This script builds a simple ASCII file list of the individual zeropoint tables (ignoring files with the "debug" suffix) and passes them to legacyzpts/legacy_zeropoints_merge.py and, subsequently, legacypipe/create_kdtrees.py, to create the following files:
annotated-ccds-dr8b-decam-nocuts.fits
annotated-ccds-dr8b-90prime-mosaic-nocuts.fits
survey-ccds-dr8b-decam-nocuts.kd.fits
survey-ccds-dr8b-90prime-mosaic-nocuts.kd.fits
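For reference, the file-list step amounts to something like this sketch (the search patterns and output names here are illustrative; the actual logic lives in dr8-merge-zpts.sh):
cd $DW_PERSISTENT_STRIPED_DR8/dr8b
# Gather the per-image zeropoint tables, skipping any debug outputs.
find zpts/decam -name '*-survey.fits' | grep -v debug > zpts-decam.txt
find zpts/90prime zpts/mosaic -name '*-survey.fits' | grep -v debug > zpts-90prime-mosaic.txt
# These lists are then fed to legacyzpts/legacy_zeropoints_merge.py (to build the merged
# survey-ccds and annotated-ccds tables) and legacypipe/create_kdtrees.py (to build the
# kd-tree versions used by legacypipe).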
The pipeline has to be run separately for the DECam and 90Prime+Mosaic datasets, so we need to create and set up different dedicated directories. For example, for DECam, do:
mkdir -p runbrick-decam
cd runbrick-decam
ln -s ../bb.conf bb.conf
ln -s ../calib calib
ln -s ../code code
ln -s ../images images
ln -s ../survey-bricks.fits.gz survey-bricks.fits.gz
ln -s ../survey-ccds-dr8b-decam-nocuts.kd.fits survey-ccds-dr8b-decam-nocuts.kd.fits
Next, we need two shell scripts which set up runbrick specifically for this input directory and camera:
cp $LEGACYPIPE_DIR/doc/dr8/dr8-env-shifter.sh ./
cp $LEGACYPIPE_DIR/doc/dr8/dr8-runbrick-shifter-decam.sh ./
However, be sure to verify that the dr, camera, release, LEGACY_SURVEY_DIR, and CODE_DIR environment variables are all correct and up to date.
And that's it! Setting up the runbrick-90prime-mosaic directory is analogous (using the corresponding 90prime+mosaic survey-ccds files), as sketched below.
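For example, the analogous setup might look like this (a sketch; it assumes the merged CCDs file is named survey-ccds-dr8b-90prime-mosaic-nocuts.kd.fits and that the runbrick script is called dr8-runbrick-shifter-90prime-mosaic.sh):
mkdir -p runbrick-90prime-mosaic
cd runbrick-90prime-mosaic
ln -s ../bb.conf bb.conf
ln -s ../calib calib
ln -s ../code code
ln -s ../images images
ln -s ../survey-bricks.fits.gz survey-bricks.fits.gz
ln -s ../survey-ccds-dr8b-90prime-mosaic-nocuts.kd.fits survey-ccds-dr8b-90prime-mosaic-nocuts.kd.fits
cp $LEGACYPIPE_DIR/doc/dr8/dr8-env-shifter.sh ./
cp $LEGACYPIPE_DIR/doc/dr8/dr8-runbrick-shifter-90prime-mosaic.sh ./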
With our I/O directory set up and our survey-ccds table in hand, we are ready to run legacypipe. First, create a list of bricks to load into qdo using legacypipe/queue-calibs.py. In dr8a and dr8b we focused on a set of test regions. The unique set of bricks in these regions was determined separately by Adam Myers, but can be rederived with, e.g.,
cd /global/project/projectdirs/cosmo/work/legacysurvey/dr8b/runbrick-decam
source dr8-env-decam.sh
python $LEGACYPIPE_DIR/py/legacypipe/queue-calibs.py --region dr8-test-s82 > bricks-test-s82
python $LEGACYPIPE_DIR/py/legacypipe/queue-calibs.py --region dr8-test-hsc-sgc > bricks-test-hsc-sgc
python $LEGACYPIPE_DIR/py/legacypipe/queue-calibs.py --region dr8-test-hsc-ngc > bricks-test-hsc-ngc
python $LEGACYPIPE_DIR/py/legacypipe/queue-calibs.py --region dr8-test-edr > bricks-test-edr
To keep the top-level directory tidy, these files should be copied to a new directory, dr8b/brick-lists.
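For example (assuming you are still in the runbrick-decam directory):
mkdir -p ../brick-lists
cp bricks-test-* ../brick-lists/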
Next, create a qdo queue for each test region (so that completing a particular region can be prioritized) with
qdo load dr8b-test-s82-decam ./bricks-test-s82
qdo load dr8b-test-hsc-sgc-decam ./bricks-test-hsc-sgc
qdo load dr8b-test-hsc-ngc-decam ./bricks-test-hsc-ngc
qdo load dr8b-test-edr-decam ./bricks-test-edr
Then launch the tasks as before, using the dr8-runbrick-shifter-decam.sh script; for example, for the HSC/NGC test region:
QDO_BATCH_PROFILE=cori-shifter qdo launch dr8b-test-hsc-ngc-decam 256 --cores_per_worker 8 \
--walltime=00:30:00 --batchqueue=debug --keep_env --script ./dr8-runbrick-shifter-decam.sh \
--batchopts "--image=docker:legacysurvey/legacypipe:latest --bbf=bb.conf"
Finally, as bricks finish, copy the outputs from the burst buffer back to the project directory:
cd $DW_PERSISTENT_STRIPED_DR8/dr8b/runbrick-decam
rsync -auv tractor* $LEGACY_SURVEY_DIR >> rsync-tractor.log 2>&1 &
rsync -auv coadd $LEGACY_SURVEY_DIR >> rsync-coadd.log 2>&1 &
rsync -auv metrics $LEGACY_SURVEY_DIR >> rsync-metrics.log 2>&1 &
To load the results into the (development) sky viewer, update /global/project/projectdirs/cosmo/webapp/viewer-dev/load-layer.py and then run it (this takes a long time...).