Persistent/Distributed Generation with Crop Example

This example shows how to use the Crop object for disk-based combo running, either to keep persistent progress or to distribute the processing elsewhere.

First, let's define a very simple function, describe it with a Runner and Harvester, and set the combos for this first set of runs.


In [1]:
import xyzpy as xyz

def foo(a, b):
    return a + b, a - b

r = xyz.Runner(foo, ['sum', 'diff'])
h = xyz.Harvester(r, data_name='foo_data.h5')

combos = {'a': range(0, 10),
          'b': range(0, 10)}

We could use the harvester to generate data locally. But if we want results to be written to disk, either for persistence or to run them elsewhere, we need to create a Crop.


In [2]:
c = h.Crop(name='first_run', batchsize=5)
c


Out[2]:
<Crop(name='first_run', progress=*reaped or unsown*, batchsize=5)>

Sow the combos

A single crop is used for each set of runs/combos, with batchsize setting how many runs should be lumped together (default: 1). We first sow the combos to disk using the Crop:
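The lumping into batches can be pictured with plain Python: the product of the combo values is chunked into groups of batchsize. This is just a sketch of the bookkeeping, not xyzpy's internal implementation:

```python
import itertools

# all 100 (a, b) cases, in the same spirit as the combos above
combos = {'a': range(0, 10), 'b': range(0, 10)}
cases = list(itertools.product(*combos.values()))

# lump the cases into groups of 5, as batchsize=5 does
batchsize = 5
batches = [cases[i:i + batchsize] for i in range(0, len(cases), batchsize)]

print(len(cases))    # 100
print(len(batches))  # 20
```

Each batch is then written to disk as its own file, so any worker can pick one up independently.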


In [3]:
c.sow_combos(combos)


100%|##########| 100/100 [00:00<00:00, 8811.01it/s]

There is now a hidden directory containing everything the crop needs:


In [4]:
ls -a


 ./                              'dask distributed example.ipynb'*
 ../                             'farming example.ipynb'*
'basic output example.ipynb'*     .ipynb_checkpoints/
'complex output example.ipynb'*   .xyz-first_run/
'crop example.ipynb'*

And inside that are folders for the batches and results, the pickled function, and some other dumped settings:


In [5]:
ls .xyz-first_run/


batches/  results/  xyz-function.clpkl  xyz-settings.jbdmp

Once sown, we can check the progress of the Crop:


In [6]:
c


Out[6]:
<Crop(name='first_run', progress=0/20, batchsize=5)>

There are a hundred combinations, with a batchsize of 5, yielding 20 batches to be processed.

Grow the results

Any python process with access to the sown batches in .xyz-first_run (and the function requirements) can grow the results (you could even zip the folder up and send elsewhere). The process can be run in several ways:

  1. In the .xyz-first_run folder itself, using e.g.:
python -c "import xyzpy; xyzpy.grow(i)"  # with i = 1 ... 20
  2. In the current ('parent') folder, where one then has to use a named crop to differentiate, e.g.:
python -c "import xyzpy; crop=xyzpy.Crop(name='first_run'); xyzpy.grow(i, crop=crop)"
  3. Somewhere else entirely. Then the parent directory must be specified too, e.g.:
python -c "import xyzpy; crop=xyzpy.Crop(name='first_run', parent_dir='.../xyzpy/docs/examples'); xyzpy.grow(i, crop=crop)"
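Since the per-batch grow calls are completely independent, they can also be farmed out from a single script however you like. A minimal sketch using concurrent.futures, where fake_grow is a hypothetical stand-in for the real xyzpy.grow(i, crop=crop) call (a process pool or separate shells would be the natural choice for CPU-bound work):

```python
from concurrent.futures import ThreadPoolExecutor

def fake_grow(i):
    # stand-in for xyzpy.grow(i, crop=crop) - each batch is independent
    return i

# map the 20 batch ids over a pool of workers
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(fake_grow, range(1, 21)))

print(done)  # [1, 2, ..., 20]
```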

To fake this happening we can run grow ourselves (this cell could run standalone):


In [7]:
import xyzpy
crop = xyzpy.Crop(name='first_run')
for i in range(1, 11):
    xyzpy.grow(i, crop=crop)


Batch 1: {'a': 0, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 660.58it/s]
Batch 2: {'a': 0, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 787.25it/s]
Batch 3: {'a': 1, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 1176.85it/s]
Batch 4: {'a': 1, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 1260.23it/s]
Batch 5: {'a': 2, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 2157.34it/s]
Batch 6: {'a': 2, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 1652.08it/s]
Batch 7: {'a': 3, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 1927.35it/s]
Batch 8: {'a': 3, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 1658.09it/s]
Batch 9: {'a': 4, 'b': 4}: 100%|##########| 5/5 [00:00<00:00, 1440.45it/s]
Batch 10: {'a': 4, 'b': 9}: 100%|##########| 5/5 [00:00<00:00, 1622.68it/s]

And now we can check the progress:


In [8]:
print(c)


/home/jg3014/Sync/dev/python/xyzpy/docs/examples/.xyz-first_run
------------------------------------------------------=========
10 / 20 batches of size 5 completed
[##########          ] : 50.0%

If we were on a qsub-based batch system, we could use Crop.qsub_grow to automatically submit all missing batches as jobs. It is worth double-checking the script that will be used first, though! This can be done with Crop.gen_qsub_script:


In [9]:
print(c.gen_qsub_script(minutes=20, gigabytes=1))


#!/bin/bash -l
#$ -S /bin/bash
#$ -l h_rt=0:20:0,mem=1G
#$ -l tmpfs=1G

#$ -N first_run
mkdir -p /home/jg3014/Scratch/output
#$ -wd /home/jg3014/Scratch/output
#$ -pe smp 1
#$ -t 1-10
cd /home/jg3014/Sync/dev/python/xyzpy/docs/examples
export OMP_NUM_THREADS=1
tmpfile=$(mktemp .xyzpy-qsub.XXXXXXXX)
cat <<EOF > $tmpfile
#
from xyzpy.gen.batch import grow, Crop
crop = Crop(name='first_run')
batch_ids = (11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
grow(batch_ids[$SGE_TASK_ID - 1], crop=crop, debugging=False)
EOF
python $tmpfile
rm $tmpfile
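Note how the array job maps scheduler task ids onto the remaining batch ids: the script requests tasks 1-10 (the '#$ -t 1-10' line), and task i grows batch_ids[i - 1]. In plain Python terms:

```python
# the 10 batches still missing, as embedded in the generated script
batch_ids = (11, 12, 13, 14, 15, 16, 17, 18, 19, 20)

# each scheduler task (1-based, cf. $SGE_TASK_ID) picks out one batch
grown = [batch_ids[task_id - 1] for task_id in range(1, 11)]
print(grown)  # [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```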

The default scheduler is 'sge' (Sun Grid Engine), but you can also specify 'pbs' (Portable Batch System):


In [10]:
print(c.gen_qsub_script(minutes=20, gigabytes=1, scheduler='pbs'))


#!/bin/bash -l
#PBS -lselect=1:ncpus=1:mem=1gb
#PBS -lwalltime=00:20:00

#PBS -N first_run
#PBS -J 1-10
cd /home/jg3014/Sync/dev/python/xyzpy/docs/examples
export OMP_NUM_THREADS=1
tmpfile=$(mktemp .xyzpy-qsub.XXXXXXXX)
cat <<EOF > $tmpfile
#
from xyzpy.gen.batch import grow, Crop
crop = Crop(name='first_run')
batch_ids = (11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
grow(batch_ids[$PBS_ARRAY_INDEX - 1], crop=crop, debugging=False)
EOF
python $tmpfile
rm $tmpfile

If you are just using the Crop as a persistence mechanism, then Crop.grow or Crop.grow_missing will process the batches in the current process:


In [11]:
c.grow_missing(parallel=True)  # this accepts combo_runner kwargs


100%|##########| 10/10 [00:01<00:00,  5.43it/s]

In [12]:
print(c)


/home/jg3014/Sync/dev/python/xyzpy/docs/examples/.xyz-first_run
------------------------------------------------------=========
20 / 20 batches of size 5 completed
[####################] : 100.0%

Reap the results

The final step is to 'reap' the results from disk. Because the crop was instantiated from a Harvester, that harvester will be automatically used to collect the resulting dataset and sync it with the on-disk dataset:


In [13]:
c.reap()


100%|##########| 100/100 [00:00<00:00, 19596.80it/s]
Out[13]:
<xarray.Dataset>
Dimensions:  (a: 10, b: 10)
Coordinates:
  * a        (a) int64 0 1 2 3 4 5 6 7 8 9
  * b        (b) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
    sum      (a, b) int64 0 1 2 3 4 5 6 7 8 9 1 ... 9 10 11 12 13 14 15 16 17 18
    diff     (a, b) int64 0 -1 -2 -3 -4 -5 -6 -7 -8 -9 1 ... 9 8 7 6 5 4 3 2 1 0

The dataset foo_data.h5 should be on disk, and the crop folder cleaned up:


In [14]:
ls -a


 ./                              'dask distributed example.ipynb'*
 ../                             'farming example.ipynb'*
'basic output example.ipynb'*     foo_data.h5
'complex output example.ipynb'*   .ipynb_checkpoints/
'crop example.ipynb'*

And we can inspect the results:


In [15]:
h.full_ds.xyz.iheatmap('a', 'b', 'diff')


Loading BokehJS ...
Out[15]:
<xyzpy.plot.plotter_bokeh.IHeatMap at 0x7f95564f6a20>

Many crops can be created from the harvester at once, and when they are reaped, the results should be seamlessly combined into the on-disk dataset.


In [16]:
# for now clean up
h.delete_ds()