7 Best Practices, Testing and Performance

Advanced bits marked with ⚠ might be more for reference later.

PEP 8 -- Style Guide for Python Code

Indentation is significant in Python, which prevents the worst mess. The remaining stylistic questions, such as naming conventions, are answered for the standard library in Python Enhancement Proposal 8. Most Python programmers adhere to this style guide (Google has their own, which is similar and also popular), making it easy to read someone else's code.
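For instance, PEP 8 prescribes lowercase_with_underscores for functions and variables, CapWords for classes, and UPPER_CASE for constants (the names below are made up for illustration):

```python
MAX_ITERATIONS = 100              # constants: UPPER_CASE_WITH_UNDERSCORES


class CellSimulation:             # classes: CapWords
    def step_forward(self):       # functions/methods: lowercase_with_underscores
        time_step = 0.01          # variables: lowercase_with_underscores too
        return time_step
```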

⚠ Some editors and tools like pycodestyle help check for violations. Other tools, like autopep8 and yapf, go further and reformat your code, e.g.

def forces(t,
                X,      N):
    """Calculate forces from neighbouring cells"""
    for i, x in enumerate(X):
        r = X[N[i]] - x
        norm_r = np.minimum             (np.linalg.norm(r,



        axis=1), r_max)
        norm_F = 2*(r_min -       norm_r)*(norm_r - r_max) - (norm_r - r_max)**2
        F[i] = np.sum(r*(
                                norm_F/
                    norm_r)[:, None], axis=0)
    return F.ravel()

becomes:

def forces(t, X, N):
    """Calculate forces from neighbouring cells"""
    for i, x in enumerate(X):
        r = X[N[i]] - x
        norm_r = np.minimum(np.linalg.norm(r, axis=1), r_max)
        norm_F = 2 * (r_min - norm_r) * (norm_r - r_max) - (norm_r - r_max)**2
        F[i] = np.sum(r * (norm_F / norm_r)[:, None], axis=0)
    return F.ravel()

⚠ Compilation would prevent many errors that are only detected at run-time in uncompiled languages like Python. But some editors and tools like pyflakes and pylint help with static code analysis to prevent the worst errors.
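For example, a static analyser like pyflakes flags the undefined name in this hypothetical snippet without executing anything, while plain Python only fails once the function is actually called:

```python
def buggy(offset):
    # pyflakes reports the undefined name 'resutl' without running anything;
    # plain Python only raises a NameError once buggy() is called.
    return resutl + offset        # typo for a (nonexistent) 'result'
```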

PEP 20 -- The Zen of Python


In [1]:
import this


The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Structuring Code

  • Don't repeat yourself (and others) to keep your programs small
  • Structure code into functions and files. You can import functions and data from files (so-called modules):
# File module/example.py
x = 'Example data'

def example_function():
    pass

if __name__ == '__main__':
    example_function()  # only runs when executed as a script, not on import

# File in . (next to the module/ directory)
from module import example

example.x

Documentation

  • Document why and not what (the code already states the what)

  • Commit early and often to keep track of why you change things

  • False (usually out-dated) documentation is worse than none

  • Turn questions (yours or others) into documentation

if ((j != k) and (d_type[d_lattice->d_cell_id[j]] == mesenchyme)
        and (d_type[d_lattice->d_cell_id[k]] == mesenchyme)
        and (dist < r_link)
        and (fabs(r.w/(d_X[d_lattice->d_cell_id[j]].w + 
             d_X[d_lattice->d_cell_id[k]].w)) > 0.2)) {
    d_link[i].a = d_lattice->d_cell_id[j];
    d_link[i].b = d_lattice->d_cell_id[k];
}

to

auto both_mesenchyme = (d_type[d_lattice->d_cell_id[j]] == mesenchyme)
    and (d_type[d_lattice->d_cell_id[k]] == mesenchyme);
auto along_w = fabs(r.w/(d_X[d_lattice->d_cell_id[j]].w + 
                    d_X[d_lattice->d_cell_id[k]].w)) > 0.2;
if (both_mesenchyme and (dist < r_link) and along_w) {
    d_link[i].a = d_lattice->d_cell_id[j];
    d_link[i].b = d_lattice->d_cell_id[k];
}
  • Use docstrings:
def documented_function():
    """This is a docstring"""
    pass
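The docstring is not just a comment: it is stored on the function object and available at run-time, e.g. through help() or the __doc__ attribute:

```python
def documented_function():
    """This is a docstring"""
    pass

# The docstring travels with the function object:
print(documented_function.__doc__)    # This is a docstring
```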
  • ⚠ Turn documentation into code using docopt:
"""A Docopt Example.

Usage:
  docoptest.py [flag] [--parameter <x>]
  docoptest.py (-h | --help)

Options:
  -h --help        Show this screen.
  --parameter <x>  Pass parameter.
"""
if __name__ == '__main__':
    from docopt import docopt

    args = docopt(__doc__)
    print(args)

Results in

$ python docoptest.py wrong usage
Usage:
  docoptest.py [flag] [--parameter <x>]
  docoptest.py (-h | --help)

$ python docoptest.py -h
A Docopt Example.

Usage:
  docoptest.py [flag] [--parameter <x>]
  docoptest.py (-h | --help)

Options:
  -h --help        Show this screen.
  --parameter <x>  Pass parameter.

$ python docoptest.py flag
{'--help': False,
 '--parameter': None,
 'flag': True}

Testing

  • Turn debugging print-statements into assert-statements

In [5]:
mass = -1
assert mass > 0, 'Mass cannot be negative!'


---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-5-964f15c7480b> in <module>()
      1 mass = -1
----> 2 assert mass > 0, "Mass cannot be negative!"

AssertionError: Mass cannot be negative!
  • Make sure your code does what it is supposed to do by testing it with simple examples where you know what to expect

  • Save those tests (e.g. in the if __name__ == '__main__' block of each module) to keep your code correct while changing parts

  • ⚠ Automate testing
import unittest

def square(x):
    return x*x

class TestSquare(unittest.TestCase):

    def test_square(self):
        self.assertEqual(square(2), 4)


if __name__ == '__main__':
    unittest.main()
  • ⚠ Generate property-based ("fuzzy") tests using hypothesis:
from hypothesis import given
from hypothesis.strategies import text

@given(text(min_size=10), text(min_size=10), text(min_size=10))
def test_triangle_inequality(a, b, c):
    # zipstance: some distance measure defined elsewhere
    assert zipstance(a, c) <= zipstance(a, b) + zipstance(b, c)
  • ⚠ Turn documentation into tests using doctest:
def square(x):
    """Return the square of x.

    >>> square(2)
    0
    """
    return x*x

if __name__ == '__main__':
    import doctest
    doctest.testmod()

Will give

$ python doctestest.py 
**********************************************************************
File "doctestest.py", line 4, in __main__.square
Failed example:
    square(2)
Expected:
    0
Got:
    4
**********************************************************************
1 items had failures:
   1 of   1 in __main__.square
***Test Failed*** 1 failures.
  • ⚠ Use mutation tests if you need to be sure that your code is correct. They replace pieces of code, e.g. < with > to find out whether your tests cover everything.
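The idea can be sketched by hand (real tools such as mutmut automate the mutation and re-run your test suite; the functions below are made up):

```python
# Hand-rolled illustration of mutation testing: mutate one operator
# and check whether the test suite notices.

def clamp(x, upper):
    return x if x < upper else upper      # original code uses '<'

def clamp_mutant(x, upper):
    return x if x > upper else upper      # mutant: '<' replaced by '>'

def suite(f):
    """Return True if the test suite passes for implementation f."""
    return f(5, 3) == 3 and f(2, 3) == 2

assert suite(clamp)             # the original passes
assert not suite(clamp_mutant)  # good tests "kill" the mutant
```

If the mutant had passed, the tests would not be covering the comparison at all.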

Optimizing Performance

Quick Tests


In [46]:
%timeit [i**2 for i in range(100000)]


10 loops, best of 3: 26.8 ms per loop

In [47]:
%timeit np.arange(100000)**2


1000 loops, best of 3: 198 µs per loop

Real-life Example

We tried to speed-up the simulation of an N-body problem describing cells interacting in pairs via the spherical potential $U(r) = -(r - r_{min})(r - r_{max})^2$, where $r = |\vec x_j - \vec x_i|$. The resulting forces can be calculated as

def forces(t, X, N):
    """Calculate forces from neighbouring cells"""
    for i, x in enumerate(X):
        r = X[N[i]] - x
        norm_r = np.minimum(np.linalg.norm(r, axis=1), r_max)
        norm_F = 2*(r_min - norm_r)*(norm_r - r_max) - (norm_r - r_max)**2
        F[i] = np.sum(r*(norm_F/norm_r)[:, None], axis=0)
    return F.ravel()

where N is a matrix giving the $k$ nearest neighbours.

Profiling

The cProfile module helps identify bottlenecks when running 'command':

import cProfile

cProfile.run('command', 'nbodystats')

and pstats allows you then to analyse the data saved in nbodystats (in the actual code forces was wrapped into a class with __call__):


In [55]:
import pstats

p = pstats.Stats('nbodystats')
p.strip_dirs().sort_stats('cumulative').print_stats(10);


Wed Aug  9 14:52:36 2017    nbodystats

         20147135 function calls (20133338 primitive calls) in 54.027 seconds

   Ordered by: cumulative time
   List reduced from 341 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   54.027   54.027 {built-in method builtins.exec}
        1    0.000    0.000   54.027   54.027 <string>:1(<module>)
        1    0.012    0.012   54.027   54.027 solve.py:36(scipy_ode)
      300    0.002    0.000   53.165    0.177 _ode.py:396(integrate)
      300    0.047    0.000   53.163    0.177 _ode.py:1030(run)
    15811    0.060    0.000   53.116    0.003 round_up.py:93(squeezed_cubic_knn)
    15811   29.676    0.002   52.630    0.003 round_up.py:58(__call__)
  1581100    6.794    0.000   14.802    0.000 linalg.py:1976(norm)
  3166399   11.091    0.000   11.091    0.000 {method 'reduce' of 'numpy.ufunc' objects}
  1581100    2.429    0.000    8.121    0.000 fromnumeric.py:1743(sum)


Vectorize, 7x

While the initial function was already using numpy and vectors, it still involves a for-loop that can be vectorized ... well, actually tensorized:

def forces(t, X, N):
    r = X[N] - np.tile(X, (k, 1, 1)).transpose(1, 0, 2)
    norm_r = np.minimum(np.linalg.norm(r, axis=2), r_max)
    norm_F = 2*(r_min - norm_r)*(norm_r - r_max) - (norm_r - r_max)**2
    F = np.sum(r*(norm_F/norm_r)[:, None].transpose(0, 2, 1), axis=1)
    return F.ravel()
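When tensorizing like this, it pays to check the result against the original loop on random data (a sketch; r_min, r_max, k and the neighbour matrix are made up, and F is pre-allocated explicitly here):

```python
import numpy as np

# Made-up parameters and data; the real values come from the simulation.
r_min, r_max, k, n = 0.5, 1.0, 5, 50
rng = np.random.default_rng(0)
X = rng.random((n, 3))
# Neighbour matrix that never contains a cell's own index:
N = (np.arange(n)[:, None] + rng.integers(1, n, (n, k))) % n

def forces_loop(t, X, N):
    # F is pre-allocated here; in the original code it is a class attribute
    F = np.empty_like(X)
    for i, x in enumerate(X):
        r = X[N[i]] - x
        norm_r = np.minimum(np.linalg.norm(r, axis=1), r_max)
        norm_F = 2*(r_min - norm_r)*(norm_r - r_max) - (norm_r - r_max)**2
        F[i] = np.sum(r*(norm_F/norm_r)[:, None], axis=0)
    return F.ravel()

def forces_tensor(t, X, N):
    r = X[N] - X[:, None, :]
    norm_r = np.minimum(np.linalg.norm(r, axis=2), r_max)
    norm_F = 2*(r_min - norm_r)*(norm_r - r_max) - (norm_r - r_max)**2
    F = np.sum(r*(norm_F/norm_r)[:, None].transpose(0, 2, 1), axis=1)
    return F.ravel()

assert np.allclose(forces_loop(0.0, X, N), forces_tensor(0.0, X, N))
```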

Re-use resources, 1.5x

Broadcasting gives just a "view" instead of the copy returned by np.tile:

def forces(t, X, N):
    r = X[N] - X[:, None, :]
    norm_r = np.minimum(np.linalg.norm(r, axis=2), r_max)
    norm_F = 2*(r_min - norm_r)*(norm_r - r_max) - (norm_r - r_max)**2
    F = np.sum(r*(norm_F/norm_r)[:, None].transpose(0, 2, 1), axis=1)
    return F.ravel()

The standard in-place operators +=, -=, *=, and /= can speed up large numpy calculations a little. Pandas operations often have an inplace flag, e.g. df.reset_index(inplace=True). Sometimes allocating memory beforehand with np.empty or pd.DataFrame pays off. Make sure to use the fastest data type for each column of a pd.DataFrame.
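For example, a *= 2 reuses a's buffer where b = a * 2 allocates a fresh array (a sketch; the actual saving depends on array size):

```python
import numpy as np

a = np.ones(1_000_000)

b = a * 2                     # allocates a new array for the result
a *= 2                        # rewrites a's buffer in place, no allocation

# Pre-allocating the output has the same effect for ufuncs:
out = np.empty_like(a)
np.multiply(a, 2, out=out)    # writes into out instead of a fresh array
```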

⚠ Compile the calculation, 2x (3.5x on GTX 1060 for 10k cells)

Finally, code can be compiled in various ways. We chose Theano because it requires little rewriting (and can run code on the GPU as well):

import theano.tensor as T
from theano import function


X = T.matrix('X', dtype='floatX')
N = T.imatrix('N')

r = X[N] - X[:, None, :]
norm_r = T.minimum(r.norm(2, axis=2), 1)
norm_F = 2*(0.5 - norm_r)*(norm_r - 1) - (norm_r - 1)**2
F = T.sum(r*(norm_F/norm_r).dimshuffle(0, 1, 'x'), axis=1)
f = function([X, N], F.ravel(), allow_input_downcast=True)

def forces(t, X, N):
    return f(X.reshape(-1, 3), N)

Aftermath


In [58]:
p = pstats.Stats('theanostats')
p.strip_dirs().sort_stats('cumulative').print_stats(10);


Wed Aug  9 15:00:50 2017    theanostats

         1384886 function calls (1371089 primitive calls) in 3.072 seconds

   Ordered by: cumulative time
   List reduced from 342 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    3.072    3.072 {built-in method builtins.exec}
        1    0.000    0.000    3.072    3.072 <string>:1(<module>)
        1    0.009    0.009    3.072    3.072 solve.py:36(scipy_ode)
      300    0.001    0.000    2.286    0.008 _ode.py:396(integrate)
      300    0.033    0.000    2.285    0.008 _ode.py:1030(run)
    10249    0.032    0.000    2.252    0.000 round_up.py:96(squeezed_cubic_knn)
    10249    0.026    0.000    1.849    0.000 round_up.py:68(__call__)
    10249    1.610    0.000    1.817    0.000 function_module.py:725(__call__)
      300    0.002    0.000    0.494    0.002 frame.py:4346(append)
      300    0.001    0.000    0.492    0.002 merge.py:1396(concat)


The file function_module.py belongs to Theano, so there is likely not much more to be gained there. Still, most time is spent calculating the derivatives, but pre-allocating the DataFrame for the results could shave off another 17% (35% when using a GPU) in the best case. This would result in ~50x (100x) faster code than the initial version.

This prototype allowed us to simulate a few hundred interacting cells in seconds. However, we were aiming for at least half a million cells and thus built a CUDA/C++ implementation after three months.