
What's the problem?

  • Python allows rapid prototyping
  • But after profiling and finding a slowdown, you need to speed up the bottleneck.
  • Need to keep team speed high.
    • Need to profile quickly, within 30 minutes, not set up an expensive framework.
    • Yet the bus factor of heavily optimized code must be larger than one.
      • "Bus factor": how many people can get hit by buses before the system becomes unmaintainable.
    • So performance optimizations can't be esoteric.
  • Why is this important
    • Want to keep tasks fast and yet have them fit onto one machine.
    • Else you need to manage clusters.
    • 8GB RAM, 4 cores, ~hundreds GB SSD.
  • Book: "High Performance Python", O'Reilly.

 cProfile

  • CPU profiler, traces calls.
  • Combine with RunSnakeRun to visualise.
  • But can't drill into C-implemented code, e.g. the built-in abs.
  • Don't get argument analysis; what set of arguments is causing pathological behaviour for a given function? !!AI surely you'd make more than one function to allow this type of profiling?
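A minimal sketch of the cProfile workflow described above (the function `busy` is hypothetical, invented for illustration). Note how `abs` shows up only as an opaque built-in line you cannot drill into:

```python
import cProfile
import io
import pstats

def busy():
    # a hypothetical hot function we want to profile
    return sum(abs(i - 500) for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
busy()
pr.disable()

# print the 5 most expensive entries by cumulative time
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
print(s.getvalue())
```

To feed RunSnakeRun, the same stats can be written to disk with `pr.dump_stats("out.prof")` and the file opened in the GUI.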

line_profiler

  • line-by-line profiling
  • requires a decorator that'll fail your unit tests.
    • !!AI in a previous talk presenter said you can make a dummy decorator
  • line_profiler, indeed all profilers, can't interrogate compound statements
    • !!AI again, just break it down.
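A sketch of the dummy-decorator trick mentioned above, assuming the usual line_profiler convention that `kernprof -l` injects `profile` into builtins (`accumulate` is a made-up example function, with the loop broken out of any compound statement so each line gets its own timing):

```python
# Under "kernprof -l -v script.py" line_profiler injects `profile`
# into builtins; this fallback keeps plain runs and unit tests working.
try:
    profile
except NameError:
    def profile(func):
        return func  # no-op stand-in when not profiling

@profile
def accumulate(n):
    total = 0
    for i in range(n):
        total += i
    return total

print(accumulate(10))  # → 45
```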

 memory_profiler

  • same decorator, method as line_profiler.
  • uses psutil to ask OS for memory consumption.
    • we're not asking Python for memory occupancy of objects.
  • C modules don't tell Python how big they are, but since we're asking the OS this still works.
  • In IPython, %memit is magic incantation, e.g.

    %memit [0]*1000000
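Since %memit is IPython-only, here is a minimal stdlib sketch of the same idea for plain scripts: ask the OS for process memory (as psutil does) rather than asking Python about object sizes. This assumes a Unix platform where the `resource` module is available; `ru_maxrss` is in KiB on Linux:

```python
import resource

def peak_rss_kb():
    # peak resident set size as reported by the OS (KiB on Linux)
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
big = [0] * 1_000_000   # the same allocation as the %memit example
after = peak_rss_kb()
print(f"peak RSS grew by ~{after - before} KiB")
```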

memory_profiler mprof

  • measure difference between two codebases.
  • Did my pull request make a meaningful difference? How does the difference vary over time?
  • scikit-learn pull request 2248.
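A sketch of the mprof workflow for comparing two codebases (the script names are hypothetical; `mprof run` and `mprof plot` ship with memory_profiler):

```shell
# run each version under mprof; each run records a sample file
mprof run python before_patch.py
mprof run python after_patch.py

# plot memory usage over time for the recorded runs, side by side
mprof plot
```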

transforming memory_profiler into a resource profiler?

  • Talking with author to also measure I/O, both on disk and over network.
  • Draw plots comparing CPU / memory / I/O over time.
  • So can do: CPU, memory, disk I/O, network I/O
  • psutil could also let us:
    • mmaps?
    • file handles?
    • network connections?
    • cache utilisation via libperf?
      • instructions per cycle. Could be too low; using numpy improves it.
      • if data set too big can't fit into L1/L2, and this could tell you.
  • Could allow quick overview of an application without having to do deep code reading.
  • Presenter has used perf stat to profile CPython externally; no reason why libperf couldn't be used too.

 Cython

  • Hands-down, easiest and fastest way to optimize Python.
  • But you need to annotate code, write C-like code, so reduces team agility.
  • If you've profiled and found one hot function, great: use Cython.
  • But once you've done it the bus factor drops, you have to educate people on Cython and compiling C.
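A minimal sketch of what "annotate code, write C-like code" means in practice (the function and file name are invented for illustration; static C types let the loop compile to plain C arithmetic):

```cython
# fib.pyx -- a hypothetical hot function moved to Cython
def fib(int n):
    cdef int i
    cdef long a = 0, b = 1
    for i in range(n):
        a, b = b, a + b
    return a
```

Compiled with e.g. `cythonize -i fib.pyx`, it imports from Python as `from fib import fib`.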

Cython + NumPy + OpenMP nogil

  • Use NumPy to escape from CPython control; just a contiguous array of bytes.
  • Then escape the GIL, use OpenMP to transparently parallelise over cores.
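A sketch of the pattern, assuming a build with OpenMP flags (e.g. `-fopenmp`); `prange` with `nogil=True` releases the GIL and splits the loop across cores, which works here because the body only touches typed memoryviews:

```cython
# parallel_square.pyx -- sketch: NumPy buffer + nogil + OpenMP prange
import numpy as np
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def square(double[:] x):
    cdef Py_ssize_t i
    cdef double[:] out = np.empty(x.shape[0])
    for i in prange(x.shape[0], nogil=True):
        out[i] = x[i] * x[i]
    return np.asarray(out)
```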

Shedskin

  • Point Shedskin at module with a main routine.
  • Shedskin does autonomous type annotation, then converts to C.
  • It's just like Cython, but you do no work.
  • However it doesn't work with NumPy, doesn't work on byte arrays.
    • Shedskin copies all Python datastructures into C world, so double memory occupancy.
  • Idea: why not take the AST of Shedskin's annotated output and create a dodgy first-guess annotated Cython file.
    • wouldn't work first-time, but a hell of a hint.
    • not implemented, an idea.
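The "you do no work" workflow above amounts to a two-command sketch (module name hypothetical):

```shell
# Point Shedskin at a module with a main routine; it infers the
# types itself, converts to C, and writes a Makefile.
shedskin myprog.py
make          # builds a standalone ./myprog binary
```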

Pythran

  • Pass in another DSL, not the same as Cython's.
    • Still, superior to Cython because you just need two lines for his example.
  • Use #pythran annotation.
  • Support of OpenMP on numpy arrays.
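A sketch of the two-line claim (function and file names are invented): the `#pythran export` comment is the only annotation, and the file stays valid plain Python, so it runs unmodified under CPython:

```python
# dprod.py -- valid plain Python; the comment below is the only
# Pythran annotation needed (compile with: pythran dprod.py)
#pythran export dprod(float64[], float64[])
import numpy as np

def dprod(a, b):
    # Pythran can parallelise this whole-array expression with OpenMP
    return (2 * a + 3 * b).sum()

print(dprod(np.ones(4), np.ones(4)))  # → 20.0
```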

PyPy

  • Fast, production, Python 2.7 compatible, ready for pure-Python code.
    • Many companies have switched to it for e.g. web servers.
  • Limited support for pre-existing C extensions.
  • numpypy has bugs, incomplete, not production ready. If you try it add extensive unit tests.

Numba

  • Simple decorator, @jit(nopython=True)
  • LLVM-based: compiles down to LLVM's intermediate representation.
    • So not just C as output, but can compile down to GPU instructions.
  • API is very unstable, in flux.
    • You need to experiment and play with it.
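A sketch of the decorator in use (`total` is a made-up example; as a hedge, the block falls back to a no-op decorator when Numba isn't installed, so it runs anywhere):

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # fallback: a no-op decorator so the sketch runs without Numba
    def jit(nopython=True):
        def deco(func):
            return func
        return deco

@jit(nopython=True)
def total(arr):
    # nopython=True fails fast instead of silently falling back
    # to slow CPython object mode
    s = 0.0
    for i in range(arr.shape[0]):
        s += arr[i]
    return s

print(total(np.arange(5.0)))  # → 10.0
```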

Tool tradeoffs

  • PyPy, no learning curve, easiest win, pure Python only.
  • ShedSkin easy, pure Python only.
  • Cython, pure Python, hours to learn, team cost low.
  • Cython + NumPy + OpenMP, days to learn, high cost.
  • Numba has extreme dependency requirements (mainly LLVM), tricky to install. Could use Anaconda, but then depend on Anaconda.
  • Pythran is simple, hours to learn. Short projects looking for quick win then try it.
  • numexpr (not covered), intelligently vectorises numpy expressions.
    • !!AI pandas transparently uses this.
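A sketch of numexpr's style of use (hedged with a plain-numpy fallback in case numexpr isn't installed): the expression is passed as a string, which numexpr compiles and evaluates in cache-friendly chunks across cores, avoiding numpy's large intermediate temporaries:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

try:
    import numexpr as ne
    result = ne.evaluate("2*a + 3*b")   # chunked, multi-core evaluation
except ImportError:
    result = 2 * a + 3 * b              # same values, extra temporaries

print(result[:3])
```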

 Wrapup

  • Need better, richer profiling tools.
  • 4-12 physical cores is becoming commonplace. Need to exploit it.
  • Hand-annotating code reduces agility
  • JIT/AST compilers are getting better, but still require manual intervention.
  • Ultimately: hardware is cheaper than people. So consider costs of this too.

Questions

  • Author's Cython workflow is to use its annotation mode, which shows yellow for code that calls into CPython. Want to avoid yellow.
    • He makes six or seven subdirectories of different code versions, generates six or seven HTML annotation outputs, then compares yellowness to CPU times.
