Running DR8 bricks using docker/shifter and the Burst Buffer

Author: Adam D. Myers, University of Wyoming

The environment and source code to run legacypipe is now conveniently available in a docker image. This notebook describes how to process bricks using that set-up. I'll assume throughout that we're working on Cori at NERSC, and I'll use the example of running some test bricks on the debug queue using the burst buffer. Other general context and useful information is available in Martin Landriau's cookbook.

Make sure your environment is clean!

Ensure that you don't explicitly load any modules or source any code in your initialization files (e.g. in ~/.bash_profile.ext or ~/.bashrc.ext).
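
For example, a quick check of those files (a minimal sketch, assuming bash; the grep pattern is just illustrative):

grep -nE 'module load|module use|source |conda activate' ~/.bash_profile.ext ~/.bashrc.ext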

Setting up qdo

Grab a recent version of qdo and install it somewhere in your home directory. You will need to use a python installation, such as the one in desiconda, that provides psycopg2 (postgres database client) support. For example:

mkdir ~/git
cd ~/git
git clone https://bitbucket.org/berkeleylab/qdo/src/master/

Make sure that the location to which you cloned qdo is in your path:

export PYTHONPATH=${PYTHONPATH}:$HOME/git/qdo/lib/python3.6/site-packages/
export PATH=${PATH}:$HOME/git/qdo/bin/

Now, set up the postgres database on Cori:

export QDO_BACKEND=postgres
export QDO_BATCH_PROFILE=cori
export QDO_DB_HOST=nerscdb03.nersc.gov
export QDO_DB_NAME=desirun
export QDO_DB_PASS=   ***ask someone for the QDO DB password***
export QDO_DB_USER=desirun_admin

I typically put the previous 8 export commands in a file called, e.g., setup-qdo-cori.sh and source that file whenever I want to use qdo.
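
For example (assuming you save those eight exports to ~/setup-qdo-cori.sh; qdo list simply confirms that the database connection works):

source ~/setup-qdo-cori.sh
qdo list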

Collecting and running the docker image

The following command will pull the latest legacypipe docker image into the NERSC shifter registry:

shifterimg pull docker:legacysurvey/legacypipe:latest

Note that this image can be run via:

shifter --image docker:legacysurvey/legacypipe:latest bash

If you run this and look in the /src directory, you'll see that the legacypipe code and all of its dependencies, including astrometry.net and tractor, have been installed.
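
For example, from within the shifter shell (the exact contents of /src may vary between image versions):

ls /src
# should include legacypipe and dependencies such as astrometry.net and tractor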

Configuring the Burst Buffer

For DR8, Dustin Lang created a persistent burst buffer reservation called "dr8" that is 40TB in size. Nodes at NERSC will need to be passed a configuration file to find this reservation. So, change into a directory that will host this configuration file, and create it. For the rest of the notebook I'll assume that you're running from this directory so that the burst buffer configuration file can be found:

mkdir $SCRATCH/blat
cd $SCRATCH/blat
echo "#DW persistentdw name=dr8" > bb.conf

Note that the burst buffer is not visible from the head nodes, so if you want to monitor files in the buffer (you will) then you'll need to fire up an interactive node. For example:

salloc -N 1 -C haswell -q interactive -t 00:10:00 --bbf=bb.conf

The --bbf option points to the location of the configuration file, and the string after -t is how long you want the interactive node for. If you are just reviewing some files, you can set it to 10 minutes, as I have in the example. If you are running a lot of bricks and want to review or debug your work, you might want to set it to the maximum allowed allocation of 4 hours (-t 04:00:00).
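
For example, the same command requesting the maximum four-hour allocation:

salloc -N 1 -C haswell -q interactive -t 04:00:00 --bbf=bb.conf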

Within the burst buffer, our persistent allocation is mounted on the directory $DW_PERSISTENT_STRIPED_dr8. For example:

salloc -N 1 -C haswell -q interactive -t 00:10:00 --bbf=bb.conf
ls $DW_PERSISTENT_STRIPED_dr8

Note that the mount-point of the allocation depends on the interactive job ID, so the absolute path to the directory can change! In my case:

salloc -N 1 -C haswell -q interactive -t 00:10:00 --bbf=bb.conf
   salloc: Granted job allocation 19351004

echo $DW_PERSISTENT_STRIPED_dr8
   /var/opt/cray/dws/mounts/batch/dr8_19351004_striped_scratch/

But the integer in the directory name (19351004) will be different in your case. For that reason, it's best to always refer to $DW_PERSISTENT_STRIPED_dr8 in scripts rather than to the absolute path. You'll note, for instance, that any slurm logs created when you submit jobs will reference output directories that include a different integer in the directory name. This is benign; different launched jobs will refer to the same mounted directory using different absolute paths.
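
As a minimal sketch of the distinction (dr8test11 is just a placeholder output-directory name, used again in the examples near the end of this notebook):

# portable: resolved at run time on whichever node holds the allocation
outdir=$DW_PERSISTENT_STRIPED_dr8/dr8test11
ls $outdir
# fragile: hard-codes the job-specific integer, so it may not resolve in a later job
# ls /var/opt/cray/dws/mounts/batch/dr8_19351004_striped_scratch/dr8test11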

Queueing a set of bricks

The legacypipe package has some convenient utilities for determining which bricks touch a given region of the sky. Let's use those utilities to create a file of bricks.

First grab an interactive node, as we'll need the burst buffer:

salloc -N 1 -C haswell -q interactive -t 00:05:00 --bbf=bb.conf

Now let's enter the docker image:

shifter --image docker:legacysurvey/legacypipe:latest bash

and point to the necessary legacypipe code and dependencies:

export PYTHONPATH=/usr/local/lib/python:/usr/local/lib/python3.6/dist-packages:/src/legacypipe/py/

To run the legacypipe utilities, the $LEGACY_SURVEY_DIR environment variable must point to a directory that contains a survey bricks file. For example:

export LEGACY_SURVEY_DIR=$DW_PERSISTENT_STRIPED_dr8/dr8-depthcut
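
Before proceeding, you can confirm that this directory actually contains a survey bricks file (the survey-bricks* filename pattern is an assumption; adjust it to whatever your $LEGACY_SURVEY_DIR contains):

ls $LEGACY_SURVEY_DIR/survey-bricks*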

Now let's create a file of bricks in a region defined by an RA/Dec box and write it to $SCRATCH:

python -u /src/legacypipe/py/legacypipe/queue-calibs.py \
--minra 14.5 --maxra 15 --mindec -0.5 --maxdec 0 > $SCRATCH/blat/blatbricks.txt

Finally, let's exit the docker environment and load the file of bricks into the qdo queue:

exit
qdo load blatbricks $SCRATCH/blat/blatbricks.txt
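
You can confirm that the bricks were loaded with, for example (the exact table layout depends on your qdo version):

qdo list
qdo tasks blatbricks | head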

Running a set of bricks

Now that we've queued a set of bricks, we can process them. Let's first examine the code that we'll be running. To do so, we'll again have to run the docker image:

shifter --image docker:legacysurvey/legacypipe:latest bash
more /src/legacypipe/bin/runbrick-shifter-bb.sh
exit

You should see that the code in runbrick-shifter-bb.sh initializes all of the environment variables that are needed to run the code at NERSC. Note also the lines:

# Try limiting memory to avoid killing the whole MPI job...
ncores=8

which force us to run 8 cores per brick, in an attempt to balance memory and CPU use efficiently.

Finally, let's execute the runbrick script in the debug queue for the maximum allowed debug time of 30 minutes:

QDO_BATCH_PROFILE=cori-shifter qdo launch blatbricks 4 --cores_per_worker 8 \
--walltime=30:00 --batchqueue=debug --keep_env \
--batchopts "--image=docker:legacysurvey/legacypipe:latest --bbf=bb.conf" \
--script "/src/legacypipe/bin/runbrick-shifter-bb.sh"

Hopefully, many of the flags sent to this code are now familiar. We're running the runbrick-shifter-bb.sh script within the legacysurvey/legacypipe docker image, we've pointed to the bb.conf configuration file to allow I/O in our persistent burst buffer reservation, and we're going to launch the blatbricks queue that we loaded into qdo. Note also that we've set cores_per_worker to match the ncores value established in the runbrick-shifter-bb.sh script.

The final option merits some extra description. The code is set up to parallelize 4 bricks across the processors on each node, so the 4 in qdo launch blatbricks 4 corresponds to running 4 bricks on 1 node. If this number is increased, more nodes will be requested: setting it to 12 would request 3 nodes, and setting it to 36 would request 9 nodes. In other words, if you are processing 47 bricks it is probably most efficient to run qdo launch blatbricks 48 in order to parallelize at 4-bricks-per-node across 12 nodes.
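
A minimal sketch of that arithmetic, using the brick list written earlier:

# round the number of bricks up to the next multiple of 4 (4 bricks per node)
nbricks=$(wc -l < $SCRATCH/blat/blatbricks.txt)
ntasks=$(( (nbricks + 3) / 4 * 4 ))
echo "qdo launch blatbricks $ntasks   # will use $((ntasks / 4)) nodes"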

The debug queue policy limits requests to a maximum number of nodes, so be careful to stay within that limit.

A smorgasbord of useful commands

That's essentially it. The queue can be recovered and bricks reprocessed until everything is flagged as "Succeeded" in the table produced by qdo list, or until some bricks stall. Some additional useful commands include:
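
Some tasks appear stuck in the Running state after a job died and I want to reset them to Pending (I believe qdo's recover subcommand does this, but treat the subcommand name as an assumption and check qdo --help)...

qdo recover blatbricks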

Everything appears to have stalled and I want to write out all the remaining partially processed bricks...

python -u legacypipe/runbrick.py --bail-out

(i.e., add the --bail-out option to the runbrick.py call inside the runbrick-shifter-bb.sh script)

I want to take my burst buffer output and put it on scratch disk for general consumption...

rsync -rv $DW_PERSISTENT_STRIPED_dr8/dr8test11 $SCRATCH

Is slurm running my qdo batch script? How long until it times out?

sqs

Did my running bricks start from a checkpoint, or have they stalled forever?

qdo tasks blatbricks | grep Running | awk \
'{print "echo",$2,"; grep Keeping $DW_PERSISTENT_STRIPED_dr8/dr8test11/logs/*/"$2".log"}' \
| csh

What were the last few code steps logged for each of my running bricks?

qdo tasks blatbricks | grep Running | awk \
'{print "echo ; echo",$2,"; tail $DW_PERSISTENT_STRIPED_dr8//dr8test11/logs/*/"$2".log"}' \
| csh