
 Intro

  • Multiple cores, and GPUs, are becoming mainstream. Eventually thousands of cores will become typical.

Cython

  • All but two of the pure Python standard library modules compile unmodified with Cython, so it is highly compatible with Python.
    • Compiling without any annotations is often good enough to start with.
  • Can re-use existing C/C++
  • SciPy/pandas already use Cython.

 Workflow

  1. Write Cython code in pyx file.
    • Can start as original pure Python.
  2. Compile by running setup.py.
  3. This produces an extension module: *.so (Linux) or *.pyd (Windows).
  4. Use your extension.
    • Users do not require Cython to use compiled modules.

!!AI goes through Cython basics, same as tutorial on webpage.

Annotations - Cython at work

  • Generate HTML of annotated output, because C output is not intended for human consumption.
  • Python is yellow, C is white, click to see C source.
  • Try to turn yellow to white.

Pure Python with Decorators

In newer Cython versions

import cython

@cython.locals(a=cython.double, b=cython.double)
def add(a, b):
    return a + b

This is still valid Python, although running it requires the cython module to be importable. It can still be compiled with Cython.

This makes sense. Decorators are used for cross-cutting concerns, i.e. not part of core logic, and Cython is just that.

 Automate the Automation

pyximport hooks into import and tries to compile pyx files (and, if enabled, py files) on the fly:

import pyximport
pyximport.install()

import cy_101_pyximport

print(cy_101_pyximport.typed_python_func(3, 4))

The Buffer Interface

  • NumPy-inspired low-level view of C data structures.
  • Copying data is expensive, so avoid it and operate on the data in place.
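
A minimal sketch of the zero-copy idea using only the standard library. Here array.array and memoryview stand in for NumPy arrays and Cython buffers (an assumption for illustration; the talk used NumPy). A view shares memory with its source, so a write through the view shows up in the original, whereas slicing the array itself makes a copy.

```python
from array import array

# A flat block of C doubles, standing in for a NumPy array.
data = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# memoryview exposes the underlying buffer without copying it.
view = memoryview(data)

# Slicing a memoryview yields another view onto the same memory.
tail = view[3:]

# Writing through the view changes the original array in place.
tail[0] = 30.0

# Slicing the array itself, by contrast, makes an independent copy.
copied = data[3:]
copied[1] = -1.0
```

After this runs, data[3] is 30.0 (changed through the view) while data[4] is still 4.0 (the write to the copy did not reach the original).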

 Example

  • a and b are 2000 x 2000 arrays, 4 million elements each. Compute (a + b) * 2 + (a * b).
    • !!AI not matrix multiplication, element-wise multiplication.
    • Hence classic "embarrassingly parallel" problem.
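
To make the element-wise point concrete, here is a small pure Python sketch on 2 x 2 nested lists (the function name combine is hypothetical, and the talk used NumPy arrays rather than lists). Each output element depends only on the two input elements at the same position, which is why the work splits cleanly across workers.

```python
def combine(a, b):
    """Return (a + b) * 2 + (a * b), computed element by element."""
    return [
        [(x + y) * 2 + x * y for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

result = combine(a, b)
# result[0][0] uses only a[0][0] and b[0][0]: (1 + 5) * 2 + 1 * 5 = 17
```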

 With multiprocessing

  • Split matrix into 4 quadrants.
  • But 6 times slower than NumPy.
    • Due to serialisation / pickling of data to and from processes.
    • Need to keep data close to calculators, not send it around.
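
A rough sketch of where the overhead comes from, assuming nested lists in place of NumPy arrays and a smaller matrix than in the talk. It does not start any processes; it only splits the matrix into quadrants and measures how many bytes would have to be pickled to ship them to workers. The helper name split_quadrants is hypothetical.

```python
import pickle

def split_quadrants(m):
    """Split a square matrix (list of lists, even size) into 4 quadrants."""
    h = len(m) // 2
    top, bottom = m[:h], m[h:]
    return [
        [row[:h] for row in top],     # top-left
        [row[h:] for row in top],     # top-right
        [row[:h] for row in bottom],  # bottom-left
        [row[h:] for row in bottom],  # bottom-right
    ]

n = 200  # small stand-in for the 2000 x 2000 case
a = [[float(i * n + j) for j in range(n)] for i in range(n)]

quadrants = split_quadrants(a)

# Each quadrant has to be pickled to reach a worker process, and each
# result has to be pickled again on the way back.
payload_bytes = sum(len(pickle.dumps(q)) for q in quadrants)
raw_bytes = n * n * 8  # the same values as packed 8-byte C doubles
```

Comparing payload_bytes with raw_bytes gives a feel for how much data moves per operand before any arithmetic happens, and the results have to travel back the same way.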

The Buffer Interface from Cython

Key point:

@cython.boundscheck(False)
@cython.wraparound(False)
def func(object[double, ndim=2] buf1 not None,
         object[double, ndim=2] buf2 not None,
         object[double, ndim=2] output=None):
    ...  # function body was not captured in these notes

Inside the function you loop over the NumPy array element by element. That is usually a no-no in Python, but it is fine here because the loop gets compiled down to C.

Why does this speed up?

  • Tiling. Working through the data quadrant by quadrant, rather than row by row, makes better use of the caches (a subset of the data can fit into L1/L2). This is faster even without parallelisation.
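
The two traversal orders can be sketched in pure Python as below. This only illustrates the loop structure: with Python lists the cache benefit will not show up, since it relies on compiled loops over contiguous memory. The function names and the tile parameter are hypothetical, and square matrices are assumed.

```python
def combine_rowwise(a, b):
    """Visit elements row by row."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = (a[i][j] + b[i][j]) * 2 + a[i][j] * b[i][j]
    return out

def combine_tiled(a, b, tile=2):
    """Visit elements one tile x tile block at a time."""
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):          # top row of the current block
        for j0 in range(0, n, tile):      # left column of the current block
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, n)):
                    out[i][j] = (a[i][j] + b[i][j]) * 2 + a[i][j] * b[i][j]
    return out

n = 5
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j) for j in range(n)] for i in range(n)]
```

Both functions produce the same output; only the order in which memory is touched differs.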

Cython and OpenMP

  • OpenMP is very featureful, and Cython can access that.
    • e.g. optimally give work to workers who need work, rather than same-size sharing.
  • You can (and must) release the GIL at key points.
  • However, you lose the safety net of Python. You need to know about C and OpenMP.
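
The difference between same-size sharing and giving work to whoever is free can be illustrated with a small simulation. This is not OpenMP or Cython code, just a pure Python model under the assumption that each task has a known cost; it roughly mirrors the idea behind OpenMP's static and dynamic schedules. The function names and the cost list are made up for illustration.

```python
import heapq

def static_makespan(costs, workers):
    """Give each worker one contiguous, equal-sized chunk of tasks."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(
        sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)
    )

def dynamic_makespan(costs, workers):
    """Hand the next task to whichever worker becomes free first."""
    free_at = [0] * workers  # min-heap of times at which each worker is idle
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)         # earliest available worker
        heapq.heappush(free_at, start + cost)  # busy until start + cost
    return max(free_at)

# Two expensive tasks followed by six cheap ones.
costs = [8, 8, 1, 1, 1, 1, 1, 1]

static_time = static_makespan(costs, workers=4)
dynamic_time = dynamic_makespan(costs, workers=4)
```

With these costs the static split puts both expensive tasks on one worker, so the dynamic hand-out finishes sooner.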

Using threads explicitly

  • You can refer to threads explicitly if you want to, but this is not necessary in order to use OpenMP.
  • Use with nogil when in C-only code.
  • Use with gil when you need Python objects.
  • A with gil block can be nested inside a with nogil block.
  • You need with nogil to get genuine parallelism when using threads explicitly.

Using threads implicitly

  • prange
  • OpenMP with 4 threads gives only a 2x speedup (hyperthreading does not help), whereas the tiling effect gives a 4x speedup.

 Questions

  • Haven't tried OpenCL or benchmarked it.
