Performance optimization and analysis

In this tutorial, we will:

  • learn how to optimize the performance of an Operator,
  • investigate the effects of optimization in two real-life seismic inversion Operators,
  • analyze and interpret the performance report displayed after a run.

We will rely on preset models and Operators from a seismic inversion problem based on an isotropic acoustic wave equation. To run one such Operator, in particular a forward modeling operator, we will exploit the benchmark.py module. This provides a number of options to configure the simulation and to try out different optimizations. The benchmark.py module is intended to let newcomers play with Devito -- and its performance optimizations! -- without having to know anything about its symbolic language or inner workings.


In [ ]:
import examples.seismic.benchmark
benchmark = examples.seismic.benchmark.__file__

In [ ]:
# For a full list of options
%run $benchmark --help

OK, now we want Devito to run this Operator.


In [ ]:
%run $benchmark run -P acoustic

That was simple. Of course, we may want to run the same simulation on a bigger grid, with different grid point spacing or space order, and so on. And yes, we'll do this later on in the tutorial. But before scaling up in problem size, we shall take a look at what sort of performance optimizations we'll be able to apply to speed it up.

In essence, there are four knobs we can play with to maximize Operator performance (or to see how the performance varies when specific transformations are added or removed):

  • parallelism,
  • the Devito Symbolic Engine (DSE) optimization level,
  • the Devito Loop Engine (DLE) optimization level,
  • loop blocking auto-tuning.

Shared-memory parallelism

We start with the most obvious -- parallelism. Devito implements shared-memory parallelism via OpenMP. To enable it, we would usually simply set the environment variable DEVITO_OPENMP=1. However, since we are running in a Jupyter notebook, we can change the relevant configuration option directly:


In [ ]:
from devito import configuration
configuration['openmp'] = True

Multiple threads will now be used when running an Operator. But how many? And how efficiently? We may be running on a multi-socket node -- should we treat it as a single "flat" system, or take its socket layout into account?

Devito aims to use distributed-memory parallelism over multi-socket nodes; that is, it allocates one MPI process per socket, and each MPI process should spawn as many OpenMP threads as the number of cores on the socket. Users don't get all this for free, however; a minimal configuration effort is required. But don't worry: as you shall see, it's much simpler than it sounds!

For this tutorial, we will forget about MPI and focus instead on enabling OpenMP on a single socket. So, first of all, we want to restrict execution to a single socket -- we want threads to stay on that socket without ever migrating to other cores of the system due to OS scheduling. Are we actually on a multi-socket node? And how many physical cores does each socket have? Let's find out with a standard tool such as lscpu (available on Linux systems).


In [ ]:
! lscpu

A line beginning with 'NUMA node...' represents one specific socket. Its value (on the right-hand side, after the ':') indicates the ID ranges of its logical cores. For example, if our node consisted of two sockets, each socket having 16 physical cores and 2 hyperthreads per core, we would see something like

...
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
...
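
If you prefer to extract this information programmatically rather than by eye, here is a minimal sketch (it assumes a Linux system with lscpu on the PATH):


In [ ]:
# A minimal sketch (assumes a Linux system with lscpu in the PATH): print only the
# "NUMA nodeX CPU(s)" lines, i.e., the logical core IDs belonging to each socket.
import subprocess

output = subprocess.check_output(['lscpu']).decode()
for line in output.splitlines():
    if line.startswith('NUMA node') and 'CPU(s)' in line:
        print(line)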

Now, say we choose to run on 16 cores of socket 0 (node0). We first have to set the following OpenMP environment variable:


In [ ]:
%env OMP_NUM_THREADS=16

Thus, 16 threads will be spawned each time an Operator is run. They will be killed as soon as the Operator has done its job.
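
To double-check that the variable is actually set in the notebook environment (a trivial sanity check, nothing more):


In [ ]:
# Sanity check: %env sets the variable in the kernel's environment, which is what the
# OpenMP runtime will read when the Operator's generated code is executed.
import os
print(os.environ.get('OMP_NUM_THREADS'))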

We also want to bind them to the physical cores of socket 0; that is, we want to prevent OS-induced migration. This is known as thread pinning. One may use a program such as numactl or, alternatively, exploit other OpenMP environment variables. If the Intel compiler is at our disposal, we can enforce pinning through the following two-step procedure:

  • We tell Devito to use the Intel compiler through the special DEVITO_ARCH environment variable;
  • We set the Intel-specific KMP_HW_SUBSET and KMP_AFFINITY environment variables.

In [ ]:
# Thread pinning
%env KMP_HW_SUBSET=16c,1t
%env KMP_AFFINITY=compact
# Tell Devito to use the Intel compiler
%env DEVITO_ARCH=intel
# Or, analogously, using the configuration dictionary
configuration['compiler'] = 'intel'

If one didn't have access to the Intel compiler, it would still be possible to enable thread pinning through analogous mechanisms provided by OpenMP 4.5, namely the OMP_PLACES and OMP_PROC_BIND variables.


In [ ]:
# Uncomment if necessary; note that the available GCC version must support OpenMP 4.5 for the following to have effect
# %env OMP_PROC_BIND=close
# %env OMP_PLACES=cores
# %env DEVITO_ARCH=gcc

Let's see how threading and pinning impact the Operator performance.

We run the isotropic acoustic forward operator again, but this time with a much larger grid, a 256x256x256 cube, and a more realistic space order, 8. We also shorten the duration by deliberately choosing a very small simulation end time (50 ms).


In [ ]:
for i in range(3):
    print ("Run %d" % i)
    %run $benchmark run -P acoustic -so 8 -d 256 256 256 --tn 50

Observation: the execution times are stable across runs. This is a sign that thread pinning is working. In practice, don't forget to verify pinning explicitly, for example by inspecting OpenMP affinity reports, using a profiler (e.g., Intel VTune), or with a user-friendly tool such as htop.
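
For instance, if the Intel OpenMP runtime is in use, the library itself can report the thread-to-core bindings; one optional way to request that is the verbose modifier of KMP_AFFINITY (shown commented out below):


In [ ]:
# Optional: with the Intel OpenMP runtime, the "verbose" modifier makes the library
# print, at thread creation time, which core each OpenMP thread has been bound to.
# %env KMP_AFFINITY=compact,verbose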

DSE - Devito Symbolic Engine

We know how to switch on parallelism. So, it's finally time to see what kind of optimizations Devito can perform. By default, Devito aggressively optimizes all Operators. When running through benchmark.py, however, optimizations are left disabled until users explicitly request them. This, hopefully, simplifies initial experimentation and investigation.

Let's now dive into the Devito Symbolic Engine (DSE) section of this tutorial. It is worth observing that the DSE is one of the features distinguishing Devito from many other stencil frameworks! Why is that? This is what the documentation says about the DSE:

[The DSE performs] Flop-count optimizations - They aim to reduce the operation count of an Operator. These include, for example, code motion, factorization, and detection of cross-stencil redundancies. The flop-count optimizations are performed by routines built on top of SymPy.

So the DSE reduces the flop count of Operators. This is particularly useful for complicated PDEs, for example those making extensive use of differential operators, and even more so with high-order methods. In such cases, it's not unusual to end up with kernels requiring hundreds of arithmetic operations per grid point. Since Devito doesn't make assumptions about the PDEs, an optimization system such as the DSE is of fundamental importance. In fact, we know that its impact has been remarkable in real-life seismic inversion operators written on top of Devito (e.g., TTI operators).
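
To get a feel for what "reducing the operation count" means in practice, here is a toy illustration using plain SymPy. This is not Devito's internal machinery; it merely shows the kind of rewriting (common sub-expression elimination) the DSE builds upon:


In [ ]:
# A toy illustration with plain SymPy (not the actual DSE code): common sub-expression
# elimination rewrites an expression so that repeated terms are computed only once,
# lowering the overall operation count.
from sympy import symbols, cse, count_ops

x, y, z = symbols('x y z')
expr = (x + y)*z + (x + y)**2 + (x + y)*z**2
print("operations before:", count_ops(expr))

# cse() returns a list of (temporary, sub-expression) pairs plus the rewritten expression
replacements, reduced = cse(expr)
total = sum(count_ops(rhs) for _, rhs in replacements) + count_ops(reduced[0])
print("operations after :", total)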

Let's see what happens when we enable the DSE in our acoustic operator.


In [ ]:
# Increase Devito logger verbosity to display useful information about the DSE
configuration['log_level'] = 'DEBUG'

%run $benchmark run -P acoustic -so 8 -d 256 256 256 --tn 50 -dse advanced

Compared to the previous runs, do you notice any change ...

  • in the Devito output reports?
  • in execution times? Why?

And why, from a performance analysis point of view, is the DSE useful even though no changes in execution times are observed?

DLE - Devito Loop Engine

We know how to switch on parallelism and how to leverage the DSE to reduce the flop count of our stencil equations. What's still missing are SIMD vectorization and optimizations for data locality. We won't be able to reach a significant fraction of the attainable machine peak without aggressive loop transformations. Clearly, Devito users don't "see" loops -- in fact, they only write maths in Python! -- but the generated code is nothing more than classic C with loop nests for grid function updates. So how do these "low-level loops" get optimized? Quoting from the documentation:

Loop optimizations - Examples include SIMD vectorization and loop blocking. These are performed by the Devito Loop Engine (DLE), a sub-module consisting of a sequence of compiler passes manipulating abstract syntax trees [...]

In other words, the Devito compiler, through the DLE module, automatically applies loop transformations. How it does that (i.e., by manipulating an intermediate representation) is irrelevant here; what we want to understand is what the DLE can do, how to use it, and what kind of tuning is required to maximize the performance of an Operator. As we shall see, using and tuning the DLE is as simple as switching on a few special environment variables!

So here's a (non-exhaustive) list of transformations that the DLE will automatically apply for you:

  • SIMD vectorization
  • Loop blocking
  • Optimization of so-called "remainder" loops
  • Creation of elemental functions

OK, let's run the same problem we ran above, but this time with the DLE at its maximum level (it will apply all of the optimizations listed above). Can you guess whether the performance will be substantially better? Why?


In [ ]:
%run $benchmark run -P acoustic -so 8 -d 256 256 256 --tn 50 -dse advanced -dle advanced

Can we make the Operator run quicker? Yes!

What we are missing so far is performance tuning. Take loop blocking, for example. This is a parametric loop transformation: its impact varies with the block shape, size, and scheduling. Over the years, dozens of different loop blocking strategies have been studied in the literature! Even if we stick to the simplest loop blocking scheme imaginable, we would still need to come up with a block size that fits in some level of cache. Obviously, this is a very low-level detail... and we don't want users to waste any time thinking about such matters.
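
To make the idea of loop blocking concrete, here is a schematic in pure Python/NumPy. This is not the code Devito generates; it only shows the structure of a blocked traversal and the role of the block size:


In [ ]:
# A schematic of loop blocking in pure Python/NumPy (not the code Devito generates).
# The iteration space is swept block by block so that the data each block touches can
# stay in cache; the block size (32 here, picked arbitrarily) is precisely the
# parameter that the auto-tuner described next searches for.
import numpy as np

n, block = 256, 32
u = np.random.rand(n, n)
v = np.zeros_like(u)
for xb in range(1, n - 1, block):                     # outer loops over blocks
    for yb in range(1, n - 1, block):
        for x in range(xb, min(xb + block, n - 1)):   # inner loops within a block
            for y in range(yb, min(yb + block, n - 1)):
                v[x, y] = 0.25 * (u[x-1, y] + u[x+1, y] + u[x, y-1] + u[x, y+1])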

Like other frameworks, Devito can automatically detect a "sufficiently good" block size through an auto-tuning engine. Let's try it out; in particular, we tell Devito to be "aggressive" in the search for block sizes. Being "aggressive" means that more time will be spent on auto-tuning, but there's a greater chance of retrieving a better candidate. We can either do this by setting an environment variable (export DEVITO_AUTOTUNING=aggressive) or directly through the configuration dictionary.


In [ ]:
configuration['autotuning'] = 'aggressive'

%run $benchmark run -P acoustic -so 8 -d 256 256 256 --tn 50 -dse advanced -dle advanced -a

Note the addition of -a to the arguments list of our benchmarking script. This enables auto-tuning; the aggressive setting drives the auto-tuner search.

Do you notice any difference in performance? Why?

Exercise: performance analysis of a TTI forward operator

There's another operator that you can run with our benchmarking script, namely a Tilted Transverse Isotropic (TTI) operator for forward modeling. This one is even more interesting than the isotropic acoustic one, as it is representative of a class of real-life wave modeling operators. The physics, and thus the equations, are more complicated, which results in computationally expensive numerical kernels. You can appreciate that yourself by skimming through the generated code, whose location is displayed after JIT compilation.

Here's how to run TTI:


In [ ]:
%run $benchmark run -P tti

This exercise asks you to repeat the same study we did for the acoustic operator, playing with different DSE and DLE levels. Oh, and don't forget: there's another DSE level that you can try besides advanced, namely aggressive; just give it a go. How does the performance impact of the DSE/DLE vary w.r.t. the isotropic acoustic operator? And why?

A sneak peek at the YASK backend

YASK -- Yet Another Stencil Kernel -- is, as described on its GitHub page

a framework to facilitate exploration of the HPC stencil-performance design space.

The fundamental point is that YASK operates at a lower level of abstraction than Devito: it is a static system written in C++ and offers no symbolic language. We've been working on integrating YASK so that, under the hood, Devito can "offload" (some of) the Operator stencil updates onto it. This is totally transparent to users; no changes to existing user code are required to exploit YASK! The goal is to create a synergy between the ease of use of Devito (i.e., its high-level abstractions) and a powerful optimization system such as YASK, specifically conceived to optimize stencil codes.

So, how do we try out YASK? Again, it's as simple as setting one more environment variable: DEVITO_BACKEND=yask. In a notebook, we simply run the following:


In [ ]:
configuration['backend'] = 'yask'
# Decrease Devito logger verbosity, as currently YASK produces lots of debug messages
configuration['log_level'] = 'INFO'

Now we can experiment a little with YASK. Watch out, however, as this is work in progress: the performance may still be quite far from the attainable peaks (e.g., Devito cannot yet leverage YASK's auto-tuning system) and some Operators are still unsupported, although our running example should work out-of-the-box.


In [ ]:
%run $benchmark run -P acoustic -so 8 -d 256 256 256 --tn 50 -dse advanced

Note that we didn't even have to specify the -a and -dle advanced options, because YASK takes responsibility for loop optimization and auto-tuning.

How much faster should we expect to go thanks to YASK? Here's what we've obtained on a Knights Landing node, with the same equation used throughout the tutorial. In the plot, core refers to a run with DLE=advanced; in both core and YASK, DSE=advanced.

This notebook is part of the tutorial "Optimised Symbolic Finite Difference Computation with Devito" presented at the Intel® HPC Developer Conference 2017.