
What's the problem?

  • Python allows rapid prototyping
  • But after profiling and finding a slowdown, you need to speed up the bottleneck.
  • Need to keep team speed high.
    • Need to profile quickly, within 30 minutes, not set up an expensive framework.
    • Yet the bus factor of heavily optimized code must be larger than one.
      • "Bus factor": how many people can get hit by buses before the system becomes unmaintainable.
    • So performance optimizations can't be esoteric.
  • Why is this important
    • Want to keep tasks fast and yet have them fit onto one machine.
    • Else you need to manage clusters.
    • 8GB RAM, 4 cores, ~hundreds GB SSD.
  • Book: "High Performance Python", O'Reilly.

 cProfile

  • CPU profiler, traces calls.
  • Combine with RunSnakeRun to visualise.
  • But can't drill into C-implemented code, e.g. the built-in abs.
  • Don't get argument analysis; what set of arguments is causing pathological behaviour for a given function? !!AI surely you'd make more than one function to allow this type of profiling?
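A minimal sketch of the cProfile workflow described above (the function `busy` is hypothetical, invented for illustration). Note how `abs` shows up only as an opaque built-in line you cannot drill into:

```python
import cProfile
import io
import pstats

def busy():
    # a hypothetical hot function we want to profile
    return sum(abs(i - 500) for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
busy()
pr.disable()

# print the 5 most expensive entries by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```

To feed RunSnakeRun, the same stats can be written to disk with `pr.dump_stats("out.prof")` and the file opened in the GUI.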

line_profiler

  • line-by-line profiling
  • requires a decorator that'll fail your unit tests.
    • !!AI in a previous talk presenter said you can make a dummy decorator
  • line_profiler, indeed all profilers, can't interrogate compound statements
    • !!AI again, just break it down.
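A sketch of the dummy-decorator trick mentioned above, assuming the usual line_profiler convention that `kernprof -l` injects `profile` into builtins (`accumulate` is a made-up example function, with the loop broken out of any compound statement so each line gets its own timing):

```python
# Under "kernprof -l -v script.py" line_profiler injects `profile`
# into builtins; this fallback keeps plain runs and unit tests working.
try:
    profile
except NameError:
    def profile(func):
        return func  # no-op stand-in when not profiling

@profile
def accumulate(n):
    total = 0
    for i in range(n):
        total += i
    return total

print(accumulate(10))  # → 45
```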

 memory_profiler

  • same decorator, method as line_profiler.
  • uses psutil to ask OS for memory consumption.
    • we're not asking Python for memory occupancy of objects.
  • C modules don't tell Python how big they are, but since we're asking the OS this still works.
  • In IPython, %memit is magic incantation, e.g.

    %memit [0]*1000000
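Since %memit is IPython-only, here is a minimal stdlib sketch of the same idea for plain scripts: ask the OS for process memory (as psutil does) rather than asking Python about object sizes. This assumes a Unix platform where the `resource` module is available; `ru_maxrss` is in KiB on Linux:

```python
import resource

def peak_rss_kb():
    # peak resident set size as reported by the OS (KiB on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
big = [0] * 1_000_000   # the same allocation as the %memit example
after = peak_rss_kb()
print(f"peak RSS grew by ~{after - before} KiB")
```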

memory_profiler mprof

  • measure difference between two codebases.
  • Did my pull request make a meaningful difference? How does the difference vary over time?
  • scikit-learn pull request 2248.
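A sketch of the mprof workflow for comparing two codebases (the script names are hypothetical; `mprof run` and `mprof plot` ship with memory_profiler):

```shell
# run each version under mprof; each run records a sample file
mprof run python before_patch.py
mprof run python after_patch.py

# plot memory usage over time for the recorded runs, side by side
mprof plot
```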

transforming memory_profiler into a resource profiler?

  • Talking with author to also measure I/O, both on disk and over network.
  • Draw plots comparing CPU / memory / I/O over time.
  • So can do: CPU, memory, disk I/O, network I/O
  • psutil could also let us:
    • mmaps?
    • file handles?
    • network connections?
    • cache utilisation via libperf?
      • instructions per cycle. Could be too low; using numpy improves it.
      • if data set too big can't fit into L1/L2, and this could tell you.
  • Could allow quick overview of an application without having to do deep code reading.
  • Presenter has used perf stat to profile CPython externally; no reason why libperf couldn't be used too.

 Cython

  • Hands-down, easiest and fastest way to optimize Python.
  • But you need to annotate code, write C-like code, so reduces team agility.
  • If you've profiled and found one hot function, great: use Cython.
  • But once you've done it the bus factor drops, you have to educate people on Cython and compiling C.
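A minimal sketch of what "annotate code, write C-like code" means in practice (the function and file name are invented for illustration; static C types let the loop compile to plain C arithmetic):

```cython
# fib.pyx -- a hypothetical hot function moved to Cython
def fib(int n):
    cdef int i
    cdef long a = 0, b = 1
    for i in range(n):
        a, b = b, a + b
    return a
```

Compiled with e.g. `cythonize -i fib.pyx`, it imports from Python as `from fib import fib`.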

Cython + NumPy + OpenMP nogil

  • Use NumPy to escape from CPython control; just a contiguous array of bytes.
  • Then escape the GIL, use OpenMP to transparently parallelise over cores.
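A sketch of the pattern, assuming a build with OpenMP flags (e.g. `-fopenmp`); `prange` with `nogil=True` releases the GIL and splits the loop across cores, which works here because the body only touches typed memoryviews:

```cython
# parallel_square.pyx -- sketch: NumPy buffer + nogil + OpenMP prange
import numpy as np
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def square(double[:] x):
    cdef Py_ssize_t i
    cdef double[:] out = np.empty(x.shape[0])
    for i in prange(x.shape[0], nogil=True):
        out[i] = x[i] * x[i]
    return np.asarray(out)
```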

Shedskin

  • Point Shedskin at module with a main routine.
  • Shedskin does autonomous type annotation, then converts to C.
  • It's just like Cython, but you do no work.
  • However it doesn't work with NumPy, doesn't work on byte arrays.
    • Shedskin copies all Python datastructures into C world, so double memory occupancy.
  • Idea: why not take the AST of Shedskin's annotated output and create a dodgy first-guess annotated Cython file.
    • wouldn't work first-time, but a hell of a hint.
    • not implemented, an idea.
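The "you do no work" workflow above amounts to a two-command sketch (module name hypothetical):

```shell
# Point Shedskin at a module with a main routine; it infers the
# types itself, converts to C, and writes a Makefile.
shedskin myprog.py
make          # builds a standalone ./myprog binary
```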

Pythran

  • Pass in another DSL, not the same as Cython's.
    • Still, superior to Cython because you just need two lines for his example.
  • Use #pythran annotation.
  • Support of OpenMP on numpy arrays.
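A sketch of the two-line claim (function and file names are invented): the `#pythran export` comment is the only annotation, and the file stays valid plain Python, so it runs unmodified under CPython:

```python
# dprod.py -- valid plain Python; the comment below is the only
# Pythran annotation needed (compile with: pythran dprod.py)
#pythran export dprod(float64[], float64[])
import numpy as np

def dprod(a, b):
    # Pythran can parallelise this whole-array expression with OpenMP
    return (2 * a + 3 * b).sum()

print(dprod(np.ones(4), np.ones(4)))  # → 20.0
```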

PyPy

  • Fast, production, Python 2.7 compatible, ready for pure-Python code.
    • Many companies have switched to it for e.g. web servers.
  • Limited support for pre-existing C extensions.
  • numpypy has bugs, incomplete, not production ready. If you try it add extensive unit tests.

Numba

  • Simple decorator, @jit(nopython=True)
  • LLVM-based: compiles down to LLVM's intermediate representation.
    • So not just C as output, but can compile down to GPU instructions.
  • API is very unstable, in flux.
    • You need to experiment and play with it.
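A sketch of the decorator in use (`total` is a made-up example; as a hedge, the block falls back to a no-op decorator when Numba isn't installed, so it runs anywhere):

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # fallback: a no-op decorator so the sketch runs without Numba
    def jit(nopython=True):
        def deco(func):
            return func
        return deco

@jit(nopython=True)
def total(arr):
    # nopython=True fails fast instead of silently falling back
    # to slow CPython object mode
    s = 0.0
    for i in range(arr.shape[0]):
        s += arr[i]
    return s

print(total(np.arange(5.0)))  # → 10.0
```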

Tool tradeoffs

  • PyPy, no learning curve, easiest win, pure Python only.
  • ShedSkin easy, pure Python only.
  • Cython, pure Python, hours to learn, team cost low.
  • Cython + NumPy + OpenMP, days to learn, high cost.
  • Numba has extreme dependency requirements (mainly LLVM), tricky to install. Could use Anaconda, but then depend on Anaconda.
  • Pythran is simple, hours to learn. Short projects looking for quick win then try it.
  • numexpr (not covered), intelligently vectorises numpy expressions.
    • !!AI pandas transparently uses this.
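A sketch of numexpr's style of use (hedged with a plain-numpy fallback in case numexpr isn't installed): the expression is passed as a string, which numexpr compiles and evaluates in cache-friendly chunks across cores, avoiding numpy's large intermediate temporaries:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

try:
    import numexpr as ne
    result = ne.evaluate("2*a + 3*b")   # chunked, multi-core evaluation
except ImportError:
    result = 2 * a + 3 * b              # same values, extra temporaries

print(result[:3])
```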

 Wrapup

  • Need better, richer profiling tools.
  • 4-12 physical cores is becoming commonplace. Need to exploit it.
  • Hand-annotating code reduces agility
  • JIT/AST compilers are getting better, but still require manual intervention.
  • Ultimately: hardware is cheaper than people. So consider costs of this too.

Questions

  • Author's Cython workflow is to use its annotation mode, which shows yellow for code that calls into CPython. Want to avoid yellow.
    • He makes six or seven subdirectories of different code versions, generates six or seven HTML annotation outputs, then compares yellowness to CPU times.
