NB: This is based on a longer workshop about Python optimization that I gave for the other students earlier this year. There I also covered other Python optimization tools (e.g. numba) and some simple approaches to parallel programming in Python. The Jupyter notebooks from this workshop are available here: http://www.ast.cam.ac.uk/~bs538/programming.html.
After testing various options for making numerical Python code fast, Cython is my clear favourite. For my current main project, all numerically expensive parts are written as Cython extensions.
Here is the procedure I would recommend for optimizing a piece of numerical code.
For the sake of this tutorial we focus exclusively on optimizing a given piece of code, but profiling and choosing a good algorithm are equally important!
NB: There are built-in functions in scipy and astropy to do exactly this computation, but that's not the point here.
In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
#helper function to generate points
def gen_points_sphere(n):
    """
    generate random points on sphere
    """
    #angles
    costheta = 2*np.random.rand(n)-1
    theta = np.arccos(costheta)
    phi = 2*np.pi*np.random.rand(n)
    #unit vectors on the sphere
    x = np.sin(theta)*np.cos(phi)
    y = np.sin(theta)*np.sin(phi)
    z = np.cos(theta)
    vec = np.array([x,y,z]).T
    return vec,theta,phi
n = 100000
vec,theta,phi = gen_points_sphere(n)
The next cell is just for visualization and can be commented out if you do not have healpy installed.
In [3]:
import healpy as hp
def skymap(theta,phi,nside):
    """
    creates healpix map of density on the sky
    """
    ipix = hp.ang2pix(nside,theta,phi)
    testmap = np.bincount(ipix,minlength=hp.nside2npix(nside))
    return testmap
testmap = skymap(theta,phi,64)
hp.mollview(testmap)
Generate the populations for the example. We begin with a small number of objects.
In [4]:
# objects (obj)
vec_obj,theta_obj,phi_obj = gen_points_sphere(5000) # ~500,000 in my real use case
# contaminants (ps)
vec_ps,theta_ps,phi_ps = gen_points_sphere(500) # ~50,000 in my real use case
print(vec_obj.shape,vec_ps.shape)
# maximum separation:
maxsep_deg = 1.
cos_maxsep = np.cos(np.deg2rad(maxsep_deg))
If we represent our points by 3D unit vectors, then $\cos(\Delta\sigma) = \mathbf{n}_1 \cdot \mathbf{n}_2$. We therefore only need to compute one dot product per pair, which is faster than computing the great-circle distance.
NB: For large N a k-d tree is more efficient (this is what the astropy version uses internally). But for the sake of having an example that is easy to implement, let's stick with the dot product.
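As an aside (not part of the original example, and not necessarily what astropy does internally), here is a rough sketch of the k-d tree idea using scipy.spatial.cKDTree; the chord length 2 sin(maxsep/2) on the unit sphere corresponds to the maximum angular separation, and the function name is purely illustrative.
import numpy as np
from scipy.spatial import cKDTree

def angdistcut_kdtree(vec_obj, vec_ps, maxsep_deg):
    # chord length on the unit sphere corresponding to the maximum separation
    r_chord = 2*np.sin(np.deg2rad(maxsep_deg)/2.)
    # build the tree on the (smaller) contaminant sample
    tree = cKDTree(vec_ps)
    # for every object, list all contaminants within the chord radius
    neighbours = tree.query_ball_point(vec_obj, r_chord)
    # flag objects with at least one contaminant nearby
    return np.flatnonzero([len(nb) > 0 for nb in neighbours])

#result_kdtree = angdistcut_kdtree(vec_obj, vec_ps, maxsep_deg)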
In [5]:
def angdistcut_python_naive(vec_obj,vec_ps,cos_maxsep):
    nobj = vec_obj.shape[0]
    nps = vec_ps.shape[0]
    dim = vec_obj.shape[1]
    #objects to be deleted
    out = []
    for i in range(nobj):
        for j in range(nps):
            cos = 0.
            #compute dot product
            for k in range(dim):
                cos += vec_obj[i,k] * vec_ps[j,k]
            #stop once we have found one contaminant
            if cos > cos_maxsep:
                out.append(i)
                break
    return np.array(out)
In [6]:
%%time
result_python_naive = angdistcut_python_naive(vec_obj,vec_ps,cos_maxsep)
This is a more sensible solution, making use of numpy built-ins.
In [7]:
def angdistcut_numpy(vec_obj,vec_ps,cos_maxsep):
    nobj = vec_obj.shape[0]
    nps = vec_ps.shape[0]
    # fast vectorized way to compute dot product
    # (loses the advantage of stopping once one contaminant is found)
    # (but the numpy speed gains easily outweigh that)
    cos_arr = np.einsum('ik,jk->ij',vec_obj,vec_ps)
    # number of contaminants per object
    temp = (cos_arr > cos_maxsep).sum(axis=1)
    #indices of objects that should be removed
    out = np.flatnonzero(temp)
    return out
In [8]:
%%time
result_numpy = angdistcut_numpy(vec_obj,vec_ps,cos_maxsep)
In [9]:
# check that results agree
assert np.array_equal(result_python_naive,result_numpy)
Using numpy has given us a speed-up of a factor of ~200. This is quite typical for computations that can be vectorized efficiently.
In [10]:
# expected runtime for full sample (100 times as many objects and contaminants)
20e-3*100*100/3600. #hours
Out[10]:
0.05555555555555555
In terms of runtime, our problem is solved. However, we have not considered memory usage so far. For its vectorized computation to be efficient, numpy has to store everything in large arrays that are contiguous in memory. Simply to store the intermediate array cos_arr we need at least:
In [11]:
# double precision (64 bit = 8 byte per element)
# Nobj*Nps*8
5e5*5e4*8/1e9 #in GB
Out[11]:
200.0
Our work desktops have 16 GB of RAM, so suddenly memory becomes the bottleneck. As soon as data needs to be swapped from memory to the hard drive, everything becomes really slow.
On top of that, some numpy functions allocate additional temporary arrays "behind the scenes", making this problem even worse.
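A common workaround (sketched here as an aside; the chunk size of 10000 is an arbitrary choice, not tuned) is to process the objects in blocks, so that only a small slice of cos_arr has to live in memory at any one time:
import numpy as np

def angdistcut_numpy_chunked(vec_obj, vec_ps, cos_maxsep, chunk=10000):
    found = np.zeros(vec_obj.shape[0], dtype=bool)
    for start in range(0, vec_obj.shape[0], chunk):
        stop = start + chunk
        # cos_arr now only has shape (chunk, nps) instead of (nobj, nps)
        cos_arr = np.einsum('ik,jk->ij', vec_obj[start:stop], vec_ps)
        found[start:stop] = (cos_arr > cos_maxsep).any(axis=1)
    return np.flatnonzero(found)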
In my experience, the two above are the most mature and useful ones, but there are certainly other options, such as:
NB: The latter three require you to write C/C++ or Fortran code. If you are proficient in one of these, you might not run into the issue of having to optimize numerical Python code so frequently. Cython might still be worth learning, though, as Python (and also Cython) is much nicer to write than C/C++ or Fortran.
Cython is a superset of Python. It allows static type definitions and translates our Python code into C code, which we can then compile and import back into Python as an external library (a .so file on Linux). In Jupyter notebooks, all of this can be done automatically behind the scenes.
Outside of Jupyter notebooks, we need to write a very short setup script (essentially a Python Makefile, usually setup.py) that handles the compilation for us, see e.g. here: http://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html
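For reference, a minimal setup.py in the spirit of the linked tutorial might look like the sketch below (the module name angdist.pyx is just a placeholder; the numpy include directory is needed because we cimport numpy later on):
# setup.py -- compile with: python setup.py build_ext --inplace
import numpy
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension("angdist", ["angdist.pyx"],
                include_dirs=[numpy.get_include()])
setup(ext_modules=cythonize([ext]))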
In [12]:
%load_ext Cython
In [13]:
%%cython -a
# -a provides annotation
# yellow annotations: calls to Python --> slow
# want to remove all yellow from our loops
#Python functions,attributes from numpy
import numpy as np
# C functions,attributes
cimport numpy as np
# Cython handles the namespace ambiguity internally in this case
def angdistcut_cython_naive(vec_obj, vec_ps, cos_maxsep):
    nobj = vec_obj.shape[0]
    nps = vec_ps.shape[0]
    dim = vec_obj.shape[1]
    #objects to be deleted
    out = []
    for i in range(nobj):
        for j in range(nps):
            cos = 0.
            #compute dot product
            for k in range(dim):
                cos += vec_obj[i,k] * vec_ps[j,k]
            #stop once we have found one contaminant
            if cos > cos_maxsep:
                out.append(i)
                break
    return np.array(out)
Out[13]:
In [14]:
%%time
result_cython_naive = angdistcut_cython_naive(vec_obj,vec_ps,cos_maxsep)
In [15]:
assert np.array_equal(result_cython_naive,result_numpy)
Essentially all Python code is valid Cython code, but without adding type definitions we only get a very small speed-up.
NB: I have also made one small change to the algorithm: instead of appending to a list, I use an array to store which objects are flagged as contaminated, and at the end return the nonzero elements of that array (similar to the numpy case above).
In [16]:
%%cython -a
import numpy as np
cimport numpy as np
def angdistcut_cython_typed(double[:,::1] vec_obj, double[:,::1] vec_ps, double cos_maxsep):
    """
    NB: double[:] --> "typed memoryview": gives fast access to numpy-like arrays in Cython
    double[:,::1] --> 2D typed memoryview, with the last index varying fastest
    for this it needs to be C-contiguous
    """
    cdef:
        int nps = vec_ps.shape[0]
        int nobj = vec_obj.shape[0]
        int dim = vec_obj.shape[1]
        #use int array instead of list
        int[:] found = np.zeros(nobj, np.int32)
        int i,j
        double cos
    for i in range(nobj):
        for j in range(nps):
            cos = (vec_obj[i,0]*vec_ps[j,0] +
                   vec_obj[i,1]*vec_ps[j,1] +
                   vec_obj[i,2]*vec_ps[j,2])
            if cos > cos_maxsep:
                found[i] = 1
                break
    #convert result from typed memoryview into an array before returning
    return np.flatnonzero(np.asarray(found))
Out[16]:
Because of the way we have created the input arrays, they are not C-contiguous (row-major ordering) in memory, but Fortran-contiguous (column-major ordering). If we simply passed them into the Cython function, we would get an error. Of course we could have anticipated this when creating them, but it is instructive to handle it here. (For pure Python/numpy code this is never an issue.)
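(Just as a quick illustration: the memory layout of a numpy array can be checked via its flags attribute.)
print(vec_obj.flags['C_CONTIGUOUS'], vec_obj.flags['F_CONTIGUOUS'])
# expect False True here, since vec_obj is the transpose of a C-contiguous array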
In [17]:
#change memory ordering
vec_obj_c = np.ascontiguousarray(vec_obj)
vec_ps_c = np.ascontiguousarray(vec_ps)
In [18]:
%%time
result_cython_typed = angdistcut_cython_typed(vec_obj_c,vec_ps_c,cos_maxsep)
In [19]:
assert np.array_equal(result_cython_typed,result_numpy)
With a little typing effort, we are already at roughly the same speed as the (internally highly optimized!) numpy version.
NB: again, this statement is somewhat platform-dependent.
A full overview of the available compiler directives: http://cython.readthedocs.io/en/latest/src/reference/compilation.html#compiler-directives
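As an aside (not used in this notebook), the same directives can also be set globally for a whole .pyx file or %%cython cell with a special comment at the very top, instead of decorating every function individually:
# cython: boundscheck=False, wraparound=False, nonecheck=False, cdivision=True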
In [20]:
%%cython -a
import numpy as np
cimport numpy as np
cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def angdistcut_cython_opt(double[:,::1] vec_obj, double[:,::1] vec_ps, double cos_maxsep):
    """
    disabled some of the run-time checks, and also enabled C division
    the latter is pointless here, but can make large differences when divisions are involved
    """
    cdef:
        int nps = vec_ps.shape[0]
        int nobj = vec_obj.shape[0]
        int dim = vec_obj.shape[1]
        #use int array instead of list
        int[:] found = np.zeros(nobj, np.int32)
        int i,j
        double cos
    for i in range(nobj):
        for j in range(nps):
            cos = (vec_obj[i,0]*vec_ps[j,0] +
                   vec_obj[i,1]*vec_ps[j,1] +
                   vec_obj[i,2]*vec_ps[j,2])
            if cos > cos_maxsep:
                found[i] = 1
                break
    return np.flatnonzero(np.asarray(found))
Out[20]:
In [21]:
%%time
result_cython_opt = angdistcut_cython_opt(vec_obj_c,vec_ps_c,cos_maxsep)
In [22]:
assert np.array_equal(result_cython_opt,result_numpy)
By adding these decorators, the Cython version now runs significantly faster than numpy. (NB: Part of this is because our Cython implementation does not calculate any further dot products once one contaminant has been found for a given object.)
Cython offers easy integration of OpenMP-parallel loops (very similar to C, C++ and Fortran). This circumvents the Global Interpreter Lock (GIL), which normally prohibits the execution of parallel code in Python (for reasons related to Python's internal memory management). More info here: https://wiki.python.org/moin/GlobalInterpreterLock
NB: The different threads share access to the same memory. Therefore we need to make sure that our implementation is thread-safe, i.e. that different threads do not write to the same memory location.
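As a side note (a minimal sketch, not part of the original example): if we did need to accumulate into a single shared variable, Cython recognizes in-place operations on it inside prange and turns them into an OpenMP reduction, which keeps the pattern thread-safe. Something along these lines (again compiled with the OpenMP flags):
from cython.parallel import prange

def parallel_sum(double[::1] x):
    cdef:
        Py_ssize_t i
        double total = 0.
    #the in-place += on total inside prange becomes an OpenMP reduction
    for i in prange(x.shape[0], nogil=True):
        total += x[i]
    return total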
In [23]:
%%cython --compile-args=-fopenmp --link-args=-fopenmp
import numpy as np
cimport numpy as np
cimport cython
from cython.parallel import prange, parallel
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
@cython.cdivision(True)
def angdistcut_cython_par(double[:,::1] vec_obj, double[:,::1] vec_ps, double cos_maxsep, int num_threads):
    """
    added extra compiler flags
    also added a new argument for the number of threads
    """
    cdef:
        int nps = vec_ps.shape[0]
        int nobj = vec_obj.shape[0]
        int dim = vec_obj.shape[1]
        #use int array instead of list
        int[:] found = np.zeros(nobj, np.int32)
        int i,j
        double cos
    #every object is independent, so can parallelize the outer loop easily
    # need to release the GIL for that (see above)
    with nogil, parallel(num_threads=num_threads):
        #extra keyword arguments of prange control OMP settings (can fine-tune this based on our problem)
        for i in prange(nobj): #,schedule='static',chunksize=1):
            for j in range(nps):
                cos = (vec_obj[i,0]*vec_ps[j,0] +
                       vec_obj[i,1]*vec_ps[j,1] +
                       vec_obj[i,2]*vec_ps[j,2])
                if cos > cos_maxsep:
                    found[i] = 1
                    break
    return np.flatnonzero(np.asarray(found))
In [24]:
%%time
result_cython_par = angdistcut_cython_par(vec_obj_c,vec_ps_c,cos_maxsep, num_threads=2)
In [25]:
assert np.array_equal(result_cython_par,result_numpy)
For such a small computation, the parallelization overhead becomes significant, so there is no further speed-up (NB: this looks slightly different on my work desktop). Let's repeat the benchmark with larger arrays.
In [26]:
# similar to real use case
vec_obj_large,theta_obj_large,phi_obj_large = gen_points_sphere(500000)
vec_ps_large,theta_ps_large,phi_ps_large = gen_points_sphere(50000)
#make them C contiguous again
vec_obj_large_c = np.ascontiguousarray(vec_obj_large)
vec_ps_large_c = np.ascontiguousarray(vec_ps_large)
In [27]:
%%time
result_cython_typed = angdistcut_cython_typed(vec_obj_large_c,vec_ps_large_c,cos_maxsep)
In [28]:
%%time
result_cython_opt = angdistcut_cython_opt(vec_obj_large_c,vec_ps_large_c,cos_maxsep)
In [31]:
%%time
result_cython_par_large = angdistcut_cython_par(vec_obj_large_c,vec_ps_large_c,cos_maxsep,num_threads=2)
In [32]:
%%time
result_cython_par_large = angdistcut_cython_par(vec_obj_large_c,vec_ps_large_c,cos_maxsep,num_threads=4)
How much speed-up OMP parallelization brings depends both on the system and the problem. In purely CPU-bound problems (with almost no memory I/O), one can get quasi-linear speed-up up to a large number of cores.
NB: Sometimes we can squeeze out some extra speed-up by fine-tuning the 'schedule' and 'chunksize' arguments of prange. See e.g. http://cython.readthedocs.io/en/latest/src/userguide/parallelism.html
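For illustration only (a sketch; the schedule and chunksize values are arbitrary guesses, not tuned), a prange loop with explicit scheduling could look like this. Dynamic scheduling can help in our case because the early break makes the per-object cost uneven; this would again need the OpenMP compile/link flags:
from cython.parallel import prange

def flag_above(double[::1] x, double threshold, int[::1] flag):
    cdef Py_ssize_t i
    #'dynamic': idle threads grab new chunks of 64 iterations on demand,
    #which helps when the iterations have very uneven cost
    for i in prange(x.shape[0], nogil=True, schedule='dynamic', chunksize=64):
        if x[i] > threshold:
            flag[i] = 1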