Faster Spherical Distance Calculations with Numpy Array Broadcasting

An IPython Notebook Documenting an Awesome Thing I Realized
by Daniel P. Gettings

In Python and other interpreted languages (e.g. IDL), one should attempt to do as little looping (at the interpreted level) as possible to maximize speed. This is what the vectorized array routines in Numpy are for -- they do most of the looping in C or C++ (which is much faster at looping). Some of the coolest and most efficient uses of Numpy arrays come from "array broadcasting". This set of notes give a short demo showing how broadcasting can dramatically increase the efficiency of the spherical distance calculations that are common in Astronomy.



In [1]:

    
from kapteyn.maputils import dist_on_sphere as kap_dist
import numpy as np
# Nicer printing for a notebook
np.set_printoptions(formatter={'float':lambda x: '{0:0.3f}'.format(x)})

"""
 Using the angular distance function from the Kapteyn 
 package (http://www.astro.rug.nl/software/kapteyn), 
 which uses the formula from Vincenty (1975), which
 is accurate over the whole sphere.

"""









    Out[1]:





'\n Using the angular distance function from the Kapteyn \n package (http://www.astro.rug.nl/software/kapteyn), \n which uses the formula from Vincenty (1975), which\n is accurate over the whole sphere.\n\n'

Quick Demo

Here's an illustration of the concept, to show that the broadcasted form of the calculation is doing the same thing as the standard one.

Task:
Given two sets of astronomical coordinates (cat1 and cat2):

Calculate the angular separation between every source in cat1 and every source in cat2
For the source in cat2 that is closest to a given source in cat1, find:
- The angular separation and
- The array index into cat2



In [10]:

    
# Generate Random (RA,Dec) Points
ra1 = np.random.rand(6) - 0.5
dec1 = np.random.rand(6) - 0.5
ra2 = np.random.rand(4) + 180-0.5
dec2 = np.random.rand(4) - 0.5

Method 1 -- Looping

Loop over every source in cat1, calculating the distance to every source in cat2. This is a pretty standard "vectorized" method.



In [11]:

    
for ind in xrange(ra1.size):
    # Calculate the distance to every cat2 source, one cat1 source at a time
    dist = kap_dist(ra1[ind].copy(), dec1[ind].copy(), ra2.copy(), dec2.copy())
    print 'Separations:',repr(dist),
    print '| Min Sep Value: {0:0.3f}'.format(dist.min()), '| Cat2 Min Index:', dist.argmin()









    



Separations: array([179.868, 179.603, 179.639, 179.685]) | Min Sep Value: 179.603 | Cat2 Min Index: 1
Separations: array([179.285, 179.077, 178.995, 179.229]) | Min Sep Value: 178.995 | Cat2 Min Index: 2
Separations: array([179.657, 179.394, 179.454, 179.507]) | Min Sep Value: 179.394 | Cat2 Min Index: 1
Separations: array([179.642, 179.533, 179.968, 179.497]) | Min Sep Value: 179.497 | Cat2 Min Index: 3
Separations: array([179.489, 179.443, 179.124, 179.584]) | Min Sep Value: 179.124 | Cat2 Min Index: 2
Separations: array([179.708, 179.734, 179.795, 179.647]) | Min Sep Value: 179.647 | Cat2 Min Index: 3

Method 2 -- Broadcasting

Broadcasting kind of works like the "Outer Product" in matrix multiplication. You will notice that the results are the same!



In [12]:

    
# Reshaping one of the sets of coordinates tells Numpy to "broadcast"
ra1_vert = ra1.reshape((-1,1))
dec1_vert = dec1.reshape((-1,1))
# The Numpy Functions in the Distance Calculation propagate the broadcasting
dist_broadcast = kap_dist(ra1_vert.copy(), dec1_vert.copy(), ra2.copy(), dec2.copy())

print 'Separations:'
print ' ',repr(dist_broadcast)
print 'Min Sep Values:'
print ' ',repr(dist_broadcast.min(axis=1))
print 'Cat2 Min Indices:'
print ' ',repr(dist_broadcast.argmin(axis=1))









    



Separations:
  array([[179.868, 179.603, 179.639, 179.685],
       [179.285, 179.077, 178.995, 179.229],
       [179.657, 179.394, 179.454, 179.507],
       [179.642, 179.533, 179.968, 179.497],
       [179.489, 179.443, 179.124, 179.584],
       [179.708, 179.734, 179.795, 179.647]])
Min Sep Values:
  array([179.603, 178.995, 179.394, 179.497, 179.124, 179.647])
Cat2 Min Indices:
  array([1, 2, 1, 3, 2, 3])

Timing Comparison

Comparing the runtime of both methods, looping and broadcsting.

Why add this layer of complexity? Here we compare three different cases, varying the relative number of sources in each catalog. This is where things get interesting!



In [5]:

    
# Small cat1, large cat2
# -----------------------
# Generate Points, Normal and Broadcasted
ra1 = np.random.rand(20) - 0.5
dec1 = np.random.rand(20) - 0.5
ra2 = np.random.rand(100) + 180-0.5
dec2 = np.random.rand(100) - 0.5
ra1_vert = ra1.reshape((-1,1))
dec1_vert = dec1.reshape((-1,1))

# Do Time Testing
print 'Test 1'
print ' cat1: {0:d} points, cat2: {1:d} points'.format(ra1.size, ra2.size)
print '--------------------------------------'
print '>> Looping Method: '
%timeit for ind in xrange(ra1.size): dist = kap_dist(ra1[ind].copy(), dec1[ind].copy(), ra2.copy(), dec2.copy())
print '>> Broadcasting Method: '
%timeit dist_broadcast = kap_dist(ra1_vert.copy(), dec1_vert.copy(), ra2.copy(), dec2.copy())









    



Test 1
 cat1: 20 points, cat2: 100 points
--------------------------------------
>> Looping Method: 
100 loops, best of 3: 2.85 ms per loop
>> Broadcasting Method: 
1000 loops, best of 3: 517 µs per loop



In [6]:

    
# Large cat1, small cat2
# -----------------------
# Generate Points, Normal and Broadcasted
ra1 = np.random.rand(100) - 0.5
dec1 = np.random.rand(100) - 0.5
ra2 = np.random.rand(20) + 180-0.5
dec2 = np.random.rand(20) - 0.5
ra1_vert = ra1.reshape((-1,1))
dec1_vert = dec1.reshape((-1,1))

# Do Time Testing
print 'Test 2'
print ' cat1: {0:d} points, cat2: {1:d} points'.format(ra1.size, ra2.size)
print '--------------------------------------'
print '>> Looping Method: '
%timeit for ind in xrange(ra1.size): dist = kap_dist(ra1[ind].copy(), dec1[ind].copy(), ra2.copy(), dec2.copy())
print '>> Broadcasting Method: '
%timeit dist_broadcast = kap_dist(ra1_vert.copy(), dec1_vert.copy(), ra2.copy(), dec2.copy())









    



Test 2
 cat1: 100 points, cat2: 20 points
--------------------------------------
>> Looping Method: 
100 loops, best of 3: 11.9 ms per loop
>> Broadcasting Method: 
1000 loops, best of 3: 543 µs per loop



In [7]:

    
# Large cat1, Large cat2
# -----------------------
# Generate Points, Normal and Broadcasted
ra1 = np.random.rand(100) - 0.5
dec1 = np.random.rand(100) - 0.5
ra2 = np.random.rand(100) + 180-0.5
dec2 = np.random.rand(100) - 0.5
ra1_vert = ra1.reshape((-1,1))
dec1_vert = dec1.reshape((-1,1))

# Do Time Testing
print 'Test 3'
print ' cat1: {0:d} points, cat2: {1:d} points'.format(ra1.size, ra2.size)
print '--------------------------------------'
print '>> Looping Method: '
%timeit for ind in xrange(ra1.size): dist = kap_dist(ra1[ind].copy(), dec1[ind].copy(), ra2.copy(), dec2.copy())
print '>> Broadcasting Method: '
%timeit dist_broadcast = kap_dist(ra1_vert.copy(), dec1_vert.copy(), ra2.copy(), dec2.copy())









    



Test 3
 cat1: 100 points, cat2: 100 points
--------------------------------------
>> Looping Method: 
100 loops, best of 3: 15.8 ms per loop
>> Broadcasting Method: 
100 loops, best of 3: 2.02 ms per loop

Worse-Case Scenarios

Here we compare the worst-case runtime scenarios for each method.

For broadcasting, the worst-case is a very small number of sources in cat1 relative to cat2. In this case, looping gains its greatest level of optimization from Numpy's vectorized array operations, putting nearly all the actual looping in C/C++. But it still performs fewer loop iterations in C/C++ than the broadcasted calculation.
The worse-case for looping is a very large number of cat1 sources relative to cat2. Only a tiny fraction of the actual looping gets done in C/C++, while the vast majority of it is performed in (slow) Python. For broadcasting, the calculation looks nearly identical to its worst-case, putting all the looping in C/C++.



In [8]:

    
# ----------------------------------
# Broadcasting Worst Case Scenario 
# --> Tiny cat1, huge cat2
# ----------------------------------
# Generate Points, Normal and Broadcasted
ra1 = np.random.rand(5) - 0.5
dec1 = np.random.rand(5) - 0.5
ra2 = np.random.rand(1e4) + 180-0.5
dec2 = np.random.rand(1e4) - 0.5
ra1_vert = ra1.reshape((-1,1))
dec1_vert = dec1.reshape((-1,1))

# Do Time Testing
print 'Broadcasting Worst Case Scenario'
print ' cat1: {0:d} points, cat2: {1:d} points'.format(ra1.size, ra2.size)
print '--------------------------------------'
print '>> Looping Method: '
%timeit for ind in xrange(ra1.size): dist = kap_dist(ra1[ind].copy(), dec1[ind].copy(), ra2.copy(), dec2.copy())
print '>> Broadcasting Method: '
%timeit dist_broadcast = kap_dist(ra1_vert.copy(), dec1_vert.copy(), ra2.copy(), dec2.copy())









    



Broadcasting Worst Case Scenario
 cat1: 5 points, cat2: 10000 points
--------------------------------------
>> Looping Method: 
100 loops, best of 3: 13.5 ms per loop
>> Broadcasting Method: 
100 loops, best of 3: 11 ms per loop



In [9]:

    
# ----------------------------------
# Looping Worst Case Scenario 
# --> Huge cat1, tiny cat2
# ----------------------------------
# Generate Points, Normal and Broadcasted
ra1 = np.random.rand(1e4) - 0.5
dec1 = np.random.rand(1e4) - 0.5
ra2 = np.random.rand(5) + 180-0.5
dec2 = np.random.rand(5) - 0.5
ra1_vert = ra1.reshape((-1,1))
dec1_vert = dec1.reshape((-1,1))

# Do Time Testing
print 'Looping Worst Case Scenario'
print ' cat1: {0:d} points, cat2: {1:d} points'.format(ra1.size, ra2.size)
print '--------------------------------------'
print '>> Looping Method: '
%timeit for ind in xrange(ra1.size): dist = kap_dist(ra1[ind].copy(), dec1[ind].copy(), ra2.copy(), dec2.copy())
print '>> Broadcasting Method: '
%timeit dist_broadcast = kap_dist(ra1_vert.copy(), dec1_vert.copy(), ra2.copy(), dec2.copy())









    



Looping Worst Case Scenario
 cat1: 10000 points, cat2: 5 points
--------------------------------------
>> Looping Method: 
1 loops, best of 3: 1.17 s per loop
>> Broadcasting Method: 
100 loops, best of 3: 12.4 ms per loop

Verdict

Even in its worst-case runtime scenario, the array broadcasting method is still ${\sim} 20\%$ faster than the simple scheme using Python loops. However, in its best-case runtime scenario the broadcasted calculation is nearly 1000 times faster than looping in Python!