Module parallelization

Parallelizing your module can be a way to obtain results in an acceptable amount of time, but first take care to identify the true bottleneck of your module. Many other scientific tasks, such as numerical modelling, commonly look as follows:

+------------------+                        +---------------+
| input data       |                        |               |
| with the initial +----------------------->| final results |
| condition        |    long computation    |               |
+------------------+        tasks           +---------------+

In GIS, we commonly deal with rather simple chained computations applied to (massive) input data, which produce several heavy partial results before the final result is obtained:

+------------+
|            |       +---------+    +---------+    +---------+
| input data +------>| partial +--->| partial +--->| partial |
|            |       | results |    | results |    | results |
+------------+       +---------+    +---------+    +----+----+
                                                        |
                                                        |
+---------------+    +---------+    +---------+    +----+----+
|               |<---+ partial |<---+ partial |<---+ partial |
| final results |    | results |    | results |    | results |
|               |    +---------+    +---------+    +---------+
+---------------+

In contrast to other parallelization tasks, the main bottleneck in GIS processing is often the maximum hard disk read/write speed reached during the data-intensive computation steps. Before starting to use all the cores of your CPU, check whether you have already saturated the read/write capabilities of your system. This can be done, for example, with a system monitor tool that plots resource consumption over time.
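As a rough complement to a system monitor, the sketch below compares the CPU time a task consumed with the wall-clock time it took. It uses only the Python standard library; the helper name `cpu_fraction` is hypothetical and not part of GRASS or pygrass. A ratio well below 1 suggests that the task spent most of its time waiting (often on disk I/O), so adding more cores is unlikely to help; a ratio near or above 1 suggests that the task is CPU-bound. This is only a heuristic under those assumptions, not a precise measurement of disk saturation.

```python
import os
import time


def cpu_fraction(task):
    """Run task() once and return (CPU time used) / (wall-clock time).

    CPU time is summed over this process and its finished child
    processes (child times are reported on Unix only).
    """
    t0 = os.times()
    w0 = time.perf_counter()
    task()
    wall = time.perf_counter() - w0
    t1 = os.times()
    # fields 0..3: user, system, children_user, children_system
    cpu = sum(t1[:4]) - sum(t0[:4])
    return cpu / wall if wall > 0 else 0.0


# stand-in for an I/O-bound step: sleeping uses almost no CPU time
ratio = cpu_fraction(lambda: time.sleep(0.2))
print("CPU/wall ratio: {:.2f}".format(ratio))
```

To check a real processing step, pass a function that runs it, for example a function that calls a single GRASS module and waits for it to finish.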

ParallelModuleQueue class

A simple way to execute several modules in parallel is the ParallelModuleQueue class (developed by Sören Gebbert). The basic idea is to create a queue with all the modules that must be executed in parallel. The ParallelModuleQueue class is based on the Module class of the pygrass library. Here is a small example for viewshed calculation:


In [ ]:
# import the necessary libraries
from copy import deepcopy

from grass.pygrass.modules import Module, ParallelModuleQueue
from grass.pygrass.vector import VectorTopo

In [ ]:
# define the global variables with the inputs
TMP_VIEWSHED = 'tmp_viewshed_{:03}'
ELEV = 'elevation'
POINTS = 'view_points'

In [ ]:
# set the computational region to match the elevation map
region = Module('g.region', raster=ELEV)

In [ ]:
# initialize an empty queue and list
queue = ParallelModuleQueue(nprocs=4)
viewsheds = []

In [ ]:
# initialize a module instance with shared inputs
viewshed = Module('r.viewshed', input=ELEV, observer_elevation=3,
                  run_=False, overwrite=True)

# open the input vector map and loop over the vector points
with VectorTopo(POINTS, mode='r') as points:
    for point in points:
        print(point.id)
        # create a copy of the module and set the remaining parameters
        m = deepcopy(viewshed)(output=TMP_VIEWSHED.format(point.id),
                               coordinates=point.coords())
        viewsheds.append(m)
        # queue the module for parallel execution
        queue.put(m)
    # wait until all queued modules have finished
    queue.wait()

In [ ]:
viewsheds
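The put/wait pattern used above can also be tried outside a GRASS session. The sketch below is only a rough analogy built on the Python standard library (`concurrent.futures` and `subprocess`); it does not show how ParallelModuleQueue is implemented. The commands and the names `NPROCS` and `run_command` are hypothetical stand-ins for the queued modules.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

NPROCS = 4  # plays the role of nprocs=4 in the queue above


def run_command(args):
    """Run one external command and return its captured standard output."""
    done = subprocess.run(args, capture_output=True, text=True, check=True)
    return done.stdout.strip()


# hypothetical stand-in commands: each one just prints its own label
commands = [[sys.executable, "-c", "print('task {:03}')".format(i)]
            for i in range(6)]

with ThreadPoolExecutor(max_workers=NPROCS) as pool:
    # submit() is the rough counterpart of queue.put()
    futures = [pool.submit(run_command, cmd) for cmd in commands]
    # result() blocks until each command is done, similar to queue.wait()
    outputs = [future.result() for future in futures]

print(outputs)
```

At most `NPROCS` commands run at the same time, and the results come back in submission order, which makes it easy to match each output to its input.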
