1) Introduction: Computational Geometry and scipy.spatial

This year my updated talk focuses on scipy.spatial instead of the broader topic of "computational geometry in Python" covered last year. This hopefully cuts down on library dependencies a bit as well. As I've been increasingly involved in contributing to scipy.spatial, this tutorial contains a number of 'Development Notes' formatted as follows:

Development Note | Type of Change | Pull Request # | scipy Implementation Version
description of change / improvement to scipy | Enhancement | 9999 | 0.19

My objective in placing these 'Development Notes' throughout the tutorial is to provide you with some cutting-edge information about scipy.spatial, especially for those areas that I'm actively working on. It may also inspire some of you to contribute or to suggest implementing features we haven't thought of yet.

One of the key references I used in preparing the talk for last year was:

'Discrete and Computational Geometry' (2011) by Devadoss and O'Rourke.

Since some key components of that talk are still present this year, that is still a very suitable reference, especially for fundamental topics like convex hulls, Delaunay triangulations, and Voronoi diagrams.

Computational Geometry: the study of computer algorithms that perform geometric operations.

Applications:

  • computer graphics
  • computer-aided design / drafting (CAD)
  • robotic motion planning / collision avoidance
  • geographic information systems (GIS) / search and rescue, etc.
  • computer vision (3D reconstruction)
  • computational biochemistry / virology
  • epidemiology (water source contamination)
  • ecology (species habitat estimates based on sightings)
  • integrated circuit design
  • economical construction of physical networks
  • cooking ingredient logistics (success polygons)

    A "young field," with modern computational geometry starting in ~ the 1970s, although the underlying concepts permeate many fields and existed in some form or other for centuries. Consider the illustration of solar system vertices (matter circling around stars) from Rene Descartes, produced in the 17th Century, and highlighted by Aurenhammer et al. in their recent textbook "Voronoi Diagrams and Delaunay Triangulations":


In [1]:
import scipy
import numpy as np
import math
import matplotlib.pyplot as plt
import time
import scipy.optimize
import PIL
import matplotlib
import networkx as nx
from matplotlib.patches import Polygon
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from matplotlib.collections import PolyCollection
from matplotlib.collections import PatchCollection
from matplotlib import colors
import pickle
import circumcircle
%matplotlib inline
scipy.__version__


Out[1]:
'0.19.0'

2) Distances between arrays of points

2.1) scipy.spatial.distance_matrix

2.1.1) Euclidean (L2 norm)

There are many cases where it is desirable to calculate the distances between all possible pairs of points from two data sets. Consider a self-driving vehicle that is programmed to avoid animals that enter a roadway. It may have a set of coordinates that represent the vehicle proper relative to some kind of camera / detector, and another set of coordinates that represent the apparent obstruction (animal on the road).

For simplicity, let's define the two sets of coordinates randomly:


In [2]:
vehicle_coords = np.random.random((50,3))
animal_coords = np.random.random((600,3))

In order to calculate the amount of time that the vehicle has to make a course adjustment, it would be sensible to first calculate the closest distance between a point on the vehicle and a point on the animal. Let's try this by calculating all possible distances between the two objects and finding the minimum of the resulting distance matrix.


In [3]:
from scipy.spatial import distance_matrix
all_distances = distance_matrix(vehicle_coords, animal_coords)
all_distances.shape


Out[3]:
(50, 600)

Each row of the distance matrix corresponds to a point on the vehicle, and each column entry in that row is the distance between that vehicle point and one of the points on the animal (ordered by index).
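
Before taking the minimum, it can also be useful to recover which pair of points attains it; a minimal sketch using np.unravel_index on the (50, 600) matrix above:

# recover the (vehicle point, animal point) index pair that attains the
# smallest separation in the distance matrix
vehicle_idx, animal_idx = np.unravel_index(all_distances.argmin(),
                                           all_distances.shape)
vehicle_idx, animal_idx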


In [4]:
closest_distance = all_distances.min()
closest_distance


Out[4]:
0.024756761151640203

2.1.2) Rectilinear (L1 norm)

Consider another situation, where a self-driving vehicle is aware (e.g., via real-time 'traffic' reporting) of an animal obstructing a roadway, even though the vehicle does not have line of sight to the animal obstacle. The animal is a few city blocks away, but the vehicle would still benefit from a distance & time estimate so that appropriate adjustments may be made.

In this case, the straight-line Euclidean distance would not represent the shortest distance the vehicle would travel in practice--instead, we need to consider the so-called L1 norm / taxicab geometry / Manhattan distance, etc.


In [5]:
taxicab_distances = distance_matrix(vehicle_coords, animal_coords, p=1)
taxicab_distances.shape


Out[5]:
(50, 600)

In [6]:
closest_taxicab_distance = taxicab_distances.min()
closest_taxicab_distance


Out[6]:
0.038169387389545673

2.2) scipy.spatial.distance.cdist

2.2.1) Faster than distance_matrix for Euclidean norm


In [7]:
%timeit distance_matrix(vehicle_coords, animal_coords)


1000 loops, best of 3: 1.64 ms per loop

In [8]:
from scipy.spatial.distance import cdist

In [9]:
%timeit cdist(vehicle_coords, animal_coords)


1000 loops, best of 3: 358 µs per loop

In [10]:
# confirm same value for p=2 norm
distance_matrix(vehicle_coords, animal_coords, p=2).min() == cdist(vehicle_coords, animal_coords, metric='euclidean').min()


Out[10]:
True

2.2.2) Faster than distance_matrix for L1 norm


In [11]:
%timeit distance_matrix(vehicle_coords, animal_coords, p=1)


1000 loops, best of 3: 1.4 ms per loop

Slower with minkowski metric specification though!


In [12]:
%timeit cdist(vehicle_coords, animal_coords, metric='minkowski', p=1)


100 loops, best of 3: 5.25 ms per loop

Use cityblock metric instead of p=1 minkowski


In [13]:
# time equivalent calculation using alternative calling convention
%timeit cdist(vehicle_coords, animal_coords, metric='cityblock')


10000 loops, best of 3: 168 µs per loop

Make sure the two different rectilinear calling conventions produce the same value


In [14]:
cdist(vehicle_coords, animal_coords, metric='minkowski', p=1).min() == cdist(vehicle_coords, animal_coords, metric='cityblock').min()


Out[14]:
True

And likewise for cdist vs. distance_matrix for rectilinear distance


In [15]:
# confirm same value for p=1 norm
distance_matrix(vehicle_coords, animal_coords, p=1).min() == cdist(vehicle_coords, animal_coords, metric='minkowski', p=1).min()


Out[15]:
True

2.2.3) Using cdist for Mahalanobis distance -- clustering and outlier detection

Apart from the enhanced performance (at least for the data shapes tested above) of cdist over distance_matrix, cdist also has access to a wide variety of distance metrics with a plethora of possible applications.

Consider the following ellipsoidal and circular data sets:


In [16]:
# ellipsoidal point set generation: http://stackoverflow.com/a/5529199/2942522
np.random.seed(255792)
width = 2.82
height = 0.36
phi = np.random.random((180)) * 2 * math.pi 
rho = np.random.random((180))
x_ellipse = np.sqrt(rho) * np.cos(phi)
y_ellipse = np.sqrt(rho) * np.sin(phi)
x_ellipse *= width / 2.0
y_ellipse *= height / 2.0
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x_ellipse, y_ellipse, c='blue', alpha=0.7, edgecolor='none', s=90)
ax.set_title('Ellipsoidal and Circular Point Sets')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_aspect('equal')
ax.set_xlim(-2,2)
ax.set_ylim(-2,2)
circle_radius = 1.5
x_circle = np.cos(phi) * circle_radius
y_circle = np.sin(phi) * circle_radius
ax.scatter(x_circle, y_circle, c='green', alpha=0.7, edgecolor='none', s=90)

fig.set_size_inches(8,8)


The Mahalanobis distance for each circular point (relative to the elliptical point set) will account for the "shape" of the ellipse and not just the mean of the elliptical points.

That said, it is somewhat confusing that cdist will actually calculate the Mahalanobis distance between all possible pairs of points, when conceptually one would often just want to look at the distance of a point on the circle from the mean point of the elliptical data set. For clarity, we will perform the latter simplification here, calculating the inverse covariance matrix of the elliptical data set, but only focusing on the centroid of the ellipse for cdist purposes. If your brain can handle the concept of the covariance matrix and ALL of the points in the ellipse, you can probably better leverage the power of cdist with Mahalanobis than I!

Let's proceed with the simplified example though.


In [17]:
ellipse_data = np.dstack((x_ellipse, y_ellipse))[0]
circle_data = np.dstack((x_circle, y_circle))[0]
ellipse_centroid = np.average(ellipse_data, axis=0).reshape((1,2))
ellipse_centroid


Out[17]:
array([[-0.06998538,  0.01217176]])

In [18]:
ellipse_inv_covariance = np.linalg.inv(np.cov(ellipse_data.T))

In [19]:
mahalanobis_distances = cdist(circle_data,
                              ellipse_centroid,
                              metric='mahalanobis',
                              VI=ellipse_inv_covariance)
mahalanobis_distances.shape


Out[19]:
(180, 1)

In [20]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(ellipse_centroid[...,0], ellipse_centroid[...,1], c='k', alpha=0.7, edgecolor='none', s=90)
ax.set_title('Ellipsoidal and Circular Point Sets')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_aspect('equal')
ax.set_xlim(-2,2)
ax.set_ylim(-2,2)
im = ax.scatter(circle_data[...,0], circle_data[...,1], c=mahalanobis_distances,
                cmap='viridis', alpha=0.7, edgecolor='none', s=90)
fig.colorbar(im)
fig.set_size_inches(8,8)


So, the smaller Mahalanobis distances are near the long axis of the ellipse even though all Euclidean distances would be effectively identical. This is exactly what we'd expect after accounting for the covariance of the ellipse reference data.

Of course, my simplification to a single-point comparison is good for clarity but bad for emphasizing the power of cdist--it is designed to perform this calculation for all possible combinations of points in the ellipse and circle!
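
At least the single-point simplification is easy to sanity-check by hand from the definition $d = \sqrt{(x - \mu)^T \, VI \, (x - \mu)}$; a minimal sketch using the variables defined above:

# manual Mahalanobis distance for the first circle point vs. the centroid
diff = circle_data[0] - ellipse_centroid[0]
manual_dist = np.sqrt(diff @ ellipse_inv_covariance @ diff)
np.testing.assert_allclose(manual_dist, mahalanobis_distances[0, 0])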

To be fair, there is also some (ongoing) confusion about the cdist Mahalanobis distance code for the core developers as well:

Development Note | Type of Change | Issue # | scipy Implementation Version
possible issues with the Mahalanobis distance with default VI argument | unclear | 4815 | ongoing

2.3) scipy.spatial.distance.pdist


In [21]:
from scipy.spatial.distance import pdist

2.3.1) pdist basics -- very simple example


In [22]:
test_array = np.array([[0,0],
                       [0,1],
                       [0,2]])

The condensed distance matrix returned by pdist for the above test_array should have three values:

  1. Distance between [0,0] and [0,1]
  2. Distance between [0,0] and [0,2]
  3. Distance between [0,1] and [0,2]

Let's try it:


In [23]:
condensed_matrix = pdist(test_array)
condensed_matrix


Out[23]:
array([ 1.,  2.,  1.])

Although it may be relatively easy to track which distance value corresponds to which pair of points in this case, if test_array were much larger, working with a condensed distance matrix may become confusing.
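
For reference, the position of the pair (i, j) within the condensed matrix can be computed directly; a minimal sketch of the row-major pair ordering used by pdist (the helper name here is my own):

def condensed_index(i, j, n):
    # condensed position of the pair (i, j), with i < j, for n points;
    # pdist enumerates pairs in row-major order: (0,1), (0,2), ..., (1,2), ...
    assert i < j < n
    return n * i - i * (i + 1) // 2 + (j - i - 1)

condensed_matrix[condensed_index(0, 2, 3)]  # distance between points 0 and 2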

Rather than computing indices by hand, though, we can use scipy.spatial.distance.squareform:


In [24]:
from scipy.spatial.distance import squareform
redundant_matrix = squareform(condensed_matrix)
redundant_matrix


Out[24]:
array([[ 0.,  1.,  2.],
       [ 1.,  0.,  1.],
       [ 2.,  1.,  0.]])

Even though pdist does not calculate the distance between a point and itself, squareform converts the condensed distance matrix to the familiar redundant form that would also be returned by cdist for the data set compared with itself:


In [25]:
cdist(test_array, test_array)


Out[25]:
array([[ 0.,  1.,  2.],
       [ 1.,  0.,  1.],
       [ 2.,  1.,  0.]])

It is worth noting that squareform can perform the reverse conversion as well:


In [26]:
squareform(cdist(test_array, test_array)) # produce the same result as pdist by using squareform


Out[26]:
array([ 1.,  2.,  1.])

Development Note | Type of Change | Pull Request # | scipy Implementation Version
squareform no longer converts all input data to float64 | Enhancement | 6457 | 0.19

2.3.2) pdist 'real-world' example

Imagine that you have experimental scientific data for the behavior of a protein (molecule) in the human body that causes a specific medical condition. Your suspicion is that this molecule becomes "compacted" / squished in the disease state. You want to know which two atoms are closest together in the normal vs. diseased state, to assess whether there really is evidence of compaction.


In [27]:
np.random.seed(387)
normal_coordinates = np.random.random((12000,3))
disease_coordinates = np.random.random((12000,3))

We could try a brute-force approach using cdist to compare a data set against itself:


In [28]:
normal_dist_matrix_cdist = cdist(normal_coordinates, normal_coordinates)
disease_dist_matrix_cdist = cdist(disease_coordinates, disease_coordinates)
print('normal min:', normal_dist_matrix_cdist.min())
print('disease min:', disease_dist_matrix_cdist.min())


normal min: 0.0
disease min: 0.0

We actually need to filter out the self-comparisons when using cdist in this manner. To do this, we mask the diagonal of the square distance matrix before calculating the minimum:


In [29]:
for name, cdist_array in zip(['normal', 'disease'], [normal_dist_matrix_cdist, disease_dist_matrix_cdist]):
    mask = np.ones(cdist_array.shape, dtype=bool)
    np.fill_diagonal(mask, 0)
    print(name, ':', cdist_array[mask].min())


normal : 0.00191693986339
disease : 0.00157908261241

But, as you may have guessed, the redundant square matrices produced by cdist can be quite large:


In [30]:
normal_dist_matrix_cdist.size, disease_dist_matrix_cdist.size


Out[30]:
(144000000, 144000000)

It also just feels intuitively a bit strange to have to feed the same array of data to cdist twice, and this is a strong indicator that pdist may be a more efficient choice, especially for a large data set.


In [31]:
normal_dist_matrix_pdist = pdist(normal_coordinates)
disease_dist_matrix_pdist = pdist(disease_coordinates)
normal_dist_matrix_pdist.size, disease_dist_matrix_pdist.size


Out[31]:
(71994000, 71994000)

In [32]:
normal_dist_matrix_cdist.size / normal_dist_matrix_pdist.size


Out[32]:
2.000166680556713

The number of floating point numbers stored by the result of pdist is roughly half that stored by the return value of cdist in this case. This could make a really big difference when working with 'big data' sets.
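
As a quick check, the condensed result should contain exactly $n(n-1)/2$ distances for $n$ points:

n = normal_coordinates.shape[0]
assert normal_dist_matrix_pdist.size == n * (n - 1) // 2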

Now, confirm that the results from the two approaches are simply transformations of the same data:


In [33]:
np.testing.assert_equal(squareform(disease_dist_matrix_cdist), disease_dist_matrix_pdist)

User-defined distance functions / metrics may also be used with pdist and cdist, but as it stands there is currently no way to feed additional required keyword arguments through either function--only a few specific keyword arguments, such as VI for the Mahalanobis distance, are supported. There's some debate about adjusting the "interface" to Xdist so that additional custom kwargs may be used:

Development Note | Type of Change | Pull Request # | scipy Implementation Version
ENH: added args and kwargs to Xdist | Enhancement | 6974 | In Progress
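
In the meantime, a common workaround is to bake any extra parameters into the custom metric itself, for example via default arguments; a minimal sketch (the function and its weights are hypothetical):

# a user-defined metric is any callable taking two 1D vectors; extra
# parameters must currently be baked in (here via a default argument)
# because pdist / cdist cannot forward arbitrary kwargs
def scaled_euclidean(u, v, w=np.array([1.0, 2.0])):
    return np.sqrt(np.sum((w * (u - v)) ** 2))

pdist(test_array, metric=scaled_euclidean)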

3) Distances between single numeric vectors

For computing the distance between two 1D arrays, scipy.spatial.distance provides many of the same metrics that are available through the metric argument of pdist and cdist. However, it is important to note that the metric for pdist and cdist should always be specified as a string (e.g., "euclidean"), which selects the optimized C-level implementation, rather than by passing in the built-in 1D vector distance function (e.g., scipy.spatial.distance.euclidean).

Of course, in some cases one may actually wish to calculate the distance between two 1D vectors, for which the specific functions in scipy.spatial.distance are appropriate.

3.1) Calculate the Euclidean distance between two arrays


In [34]:
array_1 = np.random.random((3,))
array_2 = np.random.random((3,))
euclidean_dist = scipy.spatial.distance.euclidean(array_1, array_2)
euclidean_dist


Out[34]:
0.4756617846886483

In [35]:
# but not appropriate for > 1D arrays
array_1 = np.random.random((2, 3))
array_2 = np.random.random((2, 3))
try:
    euclidean_dist = scipy.spatial.distance.euclidean(array_1, array_2)
except ValueError:
    print('ValueError raised')


ValueError raised

3.2) Calculate the cityblock distance between two arrays


In [36]:
array_1 = np.zeros((2,))
array_2 = np.array([4, 4])

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(array_1[0], array_1[1], s=200, label='array 1')
ax.scatter(array_2[0], array_2[1], s=200, label='array 2')
ax.legend(fontsize=16)
# add 'city blocks / steps' to plot
step_val = 0.5
current_pos = np.zeros((2,))
next_pos = np.zeros((2,))
x_val = 0
x_turn = 1
while x_val < 8:
    if x_turn:
        next_pos[0] += step_val
        x_turn = 0
    else:
        next_pos[1] += step_val
        x_turn = 1
        
    ax.plot([current_pos[0], next_pos[0]], [current_pos[1], next_pos[1]], c='k')
    current_pos[:] = next_pos[:]
    x_val += step_val

ax.set_xlabel('x', fontsize=16)
ax.set_ylabel('y', fontsize=16)
ax.set_title('Simple Cityblock Distance', fontsize=16)
fig.set_size_inches(6,6)



In [37]:
cityblock_dist = scipy.spatial.distance.cityblock(array_1, array_2)
cityblock_dist


Out[37]:
8.0

Note that the cityblock distance is the sum of the x and y steps, and would be the same whether we take many small steps or one large block (a single step in each direction).
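
This is easy to verify directly from the definition (sum of absolute coordinate differences):

# cityblock distance = sum of absolute coordinate differences
assert cityblock_dist == np.abs(array_1 - array_2).sum()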

3.3) Calculate the weighted Minkowski distance between two arrays


In [38]:
array_1 = np.zeros((2,))
array_2 = np.array([1,9])

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(array_1[0], array_1[1], s=250, label='array 1')
ax.scatter(array_2[0], array_2[1], s=250, label='array 2')
ax.legend(fontsize=16)

ax.set_xlim(-1,10)
ax.set_ylim(-1,10)
ax.set_xlabel('x', fontsize=16)
ax.set_ylabel('y', fontsize=16)
ax.set_aspect('equal')
ax.set_title('Weighted Minkowski Example', fontsize=16)
fig.set_size_inches(6,6)


We could pick any p-norm for this calculation, but for conceptual simplicity let's stick with the Euclidean L2 norm. For the conventional Euclidean distance we would expect d > 9 by simple visual inspection.

However, given the small difference in x coordinate values, perhaps we can achieve d < 9 with the appropriate weighting.


In [39]:
weight_vector = [1, 0]
weighted_euclidean_dist = scipy.spatial.distance.wminkowski(array_1, array_2, p=2, w=weight_vector)
weighted_euclidean_dist


Out[39]:
1.0

In [40]:
weight_vector = [1, 0.1]
weighted_euclidean_dist = scipy.spatial.distance.wminkowski(array_1, array_2, p=2, w=weight_vector)
weighted_euclidean_dist


Out[40]:
1.3453624047073711

In [41]:
weight_vector = [1, 1]
weighted_euclidean_dist = scipy.spatial.distance.wminkowski(array_1, array_2, p=2, w=weight_vector)
print('weighted_euclidean_dist with equal unit weights:', weighted_euclidean_dist)
assert weighted_euclidean_dist == scipy.spatial.distance.euclidean(array_1, array_2)


weighted_euclidean_dist with equal unit weights: 9.055385138137417

The shifting of the weighted distance matches nicely with intuition.
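
For reference, in this scipy version wminkowski applies the weights to the coordinate differences before taking the p-norm, $d = \left(\sum_i |w_i (u_i - v_i)|^p\right)^{1/p}$; a quick manual check of the last result:

# manual check of the weighted Minkowski formula for p=2, w=[1, 1]
w = np.array([1.0, 1.0])
manual_dist = np.sum(np.abs(w * (array_1 - array_2)) ** 2) ** 0.5
np.testing.assert_allclose(manual_dist, weighted_euclidean_dist)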

4) Distances between single boolean vectors

Once again, the specific 1D boolean distance functions in scipy.spatial.distance should only be used for comparing two 1D vectors, while pdist with the metric string argument (e.g., "hamming") should be used for larger boolean comparisons.

4.1) Calculate the hamming distance between two 1D arrays

The Reed-Solomon code adds extra / redundant bits to a data storage medium (e.g., CD, DVD) to assist with error correction. For example, error correction may be possible in the event of a minor scratch on the surface of a CD or if the reading device 'skips.' While the precise implementation details are beyond the scope of this example, if we think of these extra error-correcting bits as "code words," imagine a case where the set of extra bits the encoder placed in the storage medium differs from the equivalent word recovered during decoding. We'd like a way to quantify the number of substitutions required to restore the original code word, which is an important metric in error correction.


In [42]:
original_code_word =  np.array([0, 1, 1, 0, 1, 1, 0, 0], dtype=np.bool)
recovered_code_word = np.array([1, 1, 0, 0, 1, 1, 0, 1], dtype=np.bool)

from scipy.spatial.distance import hamming
minimum_substitutions = hamming(original_code_word, recovered_code_word)
minimum_substitutions


Out[42]:
0.375

Why isn't the Hamming distance an integer number corresponding to the number of positions that differ? Because we divide by the total number of "letters" (bits) in each "word" (byte). So, the scipy hamming function is really the normalized Hamming distance.


In [43]:
np.argwhere(original_code_word != recovered_code_word).size / original_code_word.size


Out[43]:
0.375

Of course, the real-world implementation has more details and more bytes, so one might use, e.g., pdist with metric="hamming" for larger comparisons of arrays.

4.2) Calculate the Jaccard distance between two 1D arrays

Let's say we want to assess the ability of a computer program to identify a penguin in a real photograph, based on a set of trial attempts to bound the animal. The photograph was taken by Jack Green (with credit to the NSF as well).


In [44]:
from scipy.spatial.distance import jaccard
from PIL import Image
im = Image.open('emperor-penguin-alpha-construction-site-sign.jpg')
im.size # pixel dimensions


Out[44]:
(1524, 1128)

In [45]:
# define the reference / ground truth bounding box visually
x_start_ref = 350
x_end_ref = 550
y_start_ref = 600
y_end_ref = 900
box_actual = (x_start_ref, x_end_ref, y_start_ref, y_end_ref)
region = im.crop(box_actual)
region


Out[45]:

In [46]:
# define ground truth reference matrix as a 1D boolean array of pixel matches to the original image
def generate_bool_matrix(original_image, x_start, x_end, y_start, y_end):
    matrix = np.zeros(original_image.size)
    matrix[x_start:x_end, y_start:y_end] = 1
    matrix = matrix.astype(np.bool, copy=False).ravel()
    return matrix

reference_matrix = generate_bool_matrix(im, x_start_ref, x_end_ref, y_start_ref, y_end_ref)
# sanity check on reference_matrix properties
reference_matrix, reference_matrix.shape, reference_matrix.sum(), ((x_end_ref - x_start_ref) * (y_end_ref - y_start_ref) )


Out[46]:
(array([False, False, False, ..., False, False, False], dtype=bool),
 (1719072,),
 60000,
 60000)

Of course, the number of preserved pixels is the product of the pixel dimensions of the selection region.

Now, let's pretend we have some pixel bounding boxes selected by a variety of computer vision routines, and we want to identify the "best" approach.


In [47]:
fig =  plt.figure()

# the pixel boundaries would typically come from some kind of computer vision routine
x_start_p1 = 0; x_start_p2 = 390; x_start_p3 = 490
x_end_p1 = 200; x_end_p2 = 590; x_end_p3 = 690
y_start_p1 = 600; y_start_p2 = 610; y_start_p3 = 660
y_end_p1 = 900; y_end_p2 = 910; y_end_p3 = 960

box_program_1 = (x_start_p1, y_start_p1, x_end_p1, y_end_p1)
box_program_2 = (x_start_p2, y_start_p2, x_end_p2, y_end_p2)
box_program_3 = (x_start_p3, y_start_p3, x_end_p3, y_end_p3)

plot_num = 1
for program_name, bounding_box in zip(['program 1', 'program 2', 'program 3'],
                                      [box_program_1, box_program_2, box_program_3]):
    ax = fig.add_subplot(1, 3, plot_num)
    current_region = im.crop(bounding_box)
    current_region_bool_matrix = generate_bool_matrix(im, bounding_box[0], bounding_box[2],
                                                      bounding_box[1], bounding_box[3])
    jaccard_dist = jaccard(current_region_bool_matrix, reference_matrix)
    ax.set_title('{program}\nJaccard distance:{dist}'.format(program=program_name,
                                                             dist=np.around(jaccard_dist, 4)),
                                                      fontsize=20)
    ax.imshow(current_region)
    ax.axis('off')
    plot_num += 1

fig.subplots_adjust(wspace=0.01)
fig.set_size_inches(14,6)


So, larger Jaccard distances are farther from the ground truth match. We've just used a relatively simple 1D scipy.spatial.distance function to assess the output of computer vision routines. In practice, one would normally use the Jaccard index, $J_I = 1 - J_D$, such that the best match has the highest value, and values $> 0.5$ are typically considered decent matches.

4.3) Calculate the Yule distance (dissimilarity) between two 1D boolean arrays

It seems that these Boolean distance calculations have different strengths and weaknesses depending on the application domain. I haven't seen as many interesting examples with the Yule distance as with the last two, but let's see how it behaves for an example case anyway.

Let's consider a very crude way to compare a hand-written letter with computer-generated letters--looking at the positions of 'dark pixels.'


In [48]:
im_handwritten_T = Image.open('hand_written_t.jpg')
print('im_handwritten_T.size:', im_handwritten_T.size)
print('im_handwritten_T.getbands:', im_handwritten_T.getbands())
im_handwritten_T


im_handwritten_T.size: (225, 224)
im_handwritten_T.getbands: ('R', 'G', 'B')
Out[48]:

In [49]:
im_computer_T = Image.open('computer_T.jpg')
print('im_computer_T.size:', im_computer_T.size)
print('im_computer_T.getbands:', im_computer_T.getbands())
im_computer_T


im_computer_T.size: (225, 224)
im_computer_T.getbands: ('R', 'G', 'B')
Out[49]:

In [50]:
im_computer_S = Image.open('computer_S.jpg')
print('im_computer_S.size:', im_computer_S.size)
print('im_computer_S.getbands:', im_computer_S.getbands())
im_computer_S


im_computer_S.size: (225, 224)
im_computer_S.getbands: ('R', 'G', 'B')
Out[50]:

In [51]:
handwritten_T_array = np.asarray(im_handwritten_T)
computer_T_array = np.asarray(im_computer_T)
computer_S_array = np.asarray(im_computer_S)
computer_S_array


Out[51]:
array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ..., 
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ..., 
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ..., 
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       ..., 
       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ..., 
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ..., 
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ..., 
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8)

In [52]:
# generate and compare boolean arrays based on pixel darkness
from scipy.spatial.distance import yule
hand_written_boolean = handwritten_T_array.ravel() < 250
for condition, pixel_array in zip(['computer S', 'computer T'], [computer_S_array, computer_T_array]):
    computer_boolean = pixel_array.ravel() < 250
    yule_dist = yule(computer_boolean, hand_written_boolean)
    print(condition, 'yule distance:', yule_dist)


computer S yule distance: 1.1003904599547838
computer T yule distance: 0.3105709587080567

The Yule dissimilarity is bounded by 0 and 2 for boolean data sets, and we can see that the handwritten "T" is closer to the computer-generated "T", even with this crude measure. Far more sophisticated uses of the Yule distance in handwriting recognition have been described in the literature--see, for example: Zhang and Srihari (2003) Proceedings of SPIE-IS&T Electronic Imaging 5010.
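
For reference, the Yule dissimilarity can also be computed by hand from the 2x2 contingency counts; a minimal sketch checking the last ('computer T') comparison from the loop above:

# d_yule = 2 * c_TF * c_FT / (c_TT * c_FF + c_TF * c_FT), where c_XY counts
# positions that are X in the first vector and Y in the second
u, v = computer_boolean, hand_written_boolean  # values from the final loop iteration
ctt = np.sum(u & v)
cff = np.sum(~u & ~v)
ctf = np.sum(u & ~v)
cft = np.sum(~u & v)
manual_yule = 2.0 * ctf * cft / (ctt * cff + ctf * cft)
np.testing.assert_allclose(manual_yule, yule_dist)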

5) Distances in Large Data Sets: Binary Space Partitioning

5.1) General Concept of Binary Space Partitioning (BSP)

The basic idea behind BSP is to split a space into a set of cells using so-called hyperplanes. In 3D, this means using 2D planes to progressively divide the space into cells containing points, with the data structure typically organized as a "tree" that allows for, e.g., rapid nearest-neighbor searches.

Some of the primary applications of BSP include:

  • 3D computer graphics
  • collision detection
  • robotics
  • ray tracing
  • video games

5.2) Simple Example of General BSP


In [53]:
# write Python code to mirror the static example at https://en.wikipedia.org/wiki/Binary_space_partitioning#Generation

import networkx as nx

list_lines = [np.array([[-4,3],[-1,-2]]),
              np.array([[-1,2],[-1,-1]]),
              np.array([[-1,0],[2,0]]),
              np.array([[3.5,3], [2.5,-1]])]
line_labels = ['D', 'B', 'A', 'C']

fig = plt.figure()
fig.subplots_adjust(wspace=0.1, right=1.0)
fig.set_size_inches(6,14)


def recursive_BSP(list_lines, line_labels, current_node_index=2, fig=fig, curr_plot=1):
    '''A recursive algorithm that crudely generates a BSP tree from a list of lines. Would enable use of, e.g.,
    the Painter's algorithm in computer graphics, though in practice there are more details to handle.'''
    G = nx.Graph()
    G.clear()
    ax = fig.add_subplot(7,3,curr_plot)
    for line, label in zip(list_lines, line_labels):
        ax.plot(line[...,0],
                line[...,1],
                color='k')
        ax.annotate(label, xy=line[0], xytext=line[0] + 0.2)
    ax.set_aspect('equal')
    ax.set_xlim(-5,5)
    ax.set_ylim(-5,5)
    current_line = list_lines[current_node_index]
    current_line_label = line_labels[current_node_index]
    
    # I want to plot a dashed line for the current segment (node) being processed
    x_max_system = max([line_array[...,0].max() for line_array in list_lines])
    x_min_system = min([line_array[...,0].min() for line_array in list_lines])
    y_max_system = max([line_array[...,1].max() for line_array in list_lines])
    y_min_system = min([line_array[...,1].min() for line_array in list_lines])

    xdiff = current_line[0][0] - current_line[1][0]
    ydiff = current_line[0][1] - current_line[1][1]

    
    if xdiff==0:
        # vertical line
        current_line_extended = np.array([[current_line[0][0],y_min_system],[current_line[1][0],y_max_system]])
    elif ydiff==0:
        # horizontal line
        current_line_extended = np.array([[x_min_system, current_line[0][1]],[x_max_system, current_line[1][1]]])
    else:
        # other slopes
        m = ydiff / xdiff
        b = current_line[0][1] - m * current_line[0][0]
        current_line_extended = np.array([[x_min_system, m * x_min_system + b],[x_max_system, m * x_max_system + b]])

    # determine which (if any) segments are split (intersected) by the line formed by the chosen segment
    def line_intersection(line1, line2):
        # see: http://stackoverflow.com/a/20677983/2942522
        xdiff = (line1[0][0] - line1[1][0], line2[0][0] - line2[1][0])
        ydiff = (line1[0][1] - line1[1][1], line2[0][1] - line2[1][1]) 

        def det(a, b):
            return a[0] * b[1] - a[1] * b[0]

        div = det(xdiff, ydiff)
        if div == 0:
            raise Exception('lines do not intersect')

        d = (det(*line1), det(*line2))
        x = det(d, xdiff) / div
        y = det(d, ydiff) / div
        return x, y

    curr_plot += 1
    ax = fig.add_subplot(7,3,curr_plot)
    ax.set_aspect('equal')
    list_back_lines = []
    list_front_lines = []
    list_back_line_labels = []
    list_front_line_labels = []
    new_list_lines = []
    new_line_labels = []
    for index, line in enumerate(list_lines):
        if index == current_node_index:
            continue
        else:
            intersection = line_intersection(line, current_line)
            
            # the intersection criterion also needs to check for segment overlap
            # with line
            line_x_min = line[...,0].min()
            line_x_max = line[...,0].max()
            line_y_min = line[...,1].min()
            line_y_max = line[...,1].max()
            if not (line_x_max >= current_line_extended[...,0].min() and line_x_max <= current_line_extended[...,0].max()):
                continue
                
            if not line_y_min <= current_line_extended[...,1].min():
                continue

            
            #print('intersection:', intersection)
            ax.scatter(intersection[0], intersection[1], c='red')
            new_segment_1 = np.array([line[0],intersection])
            new_segment_2 = np.array([intersection,line[1]])
            label = line_labels[index]
            for label_suffix, segment, color in zip(['1','2'],
                                                    [new_segment_1, new_segment_2],
                                                    ['blue','red']):
                ax.plot(segment[...,0],
                        segment[...,1],
                        color=color)
                #print('appending segment:', segment)
                new_list_lines.append(segment)
                new_line_labels.append(label + label_suffix)
                
                ax.annotate(label + label_suffix, xy=segment[0])
    
    ax.set_xlim(-5,5)
    ax.set_ylim(-5,5)

    if len(new_list_lines) == 0:
        new_list_lines = list_lines[:]
        new_line_labels = line_labels[:]
        
    if abs(xdiff) > abs(ydiff):
        # split over y
        index_val = 1
    else:
        # split over x
        index_val = 0
        
    for line, label in zip(new_list_lines, new_line_labels):
        if label == current_line_label:
            continue
        line_first_val = line[0][index_val]
        if line_first_val > current_line_extended[0][index_val]:
            list_back_lines.append(line)
            list_back_line_labels.append(label)
            color='blue'
        else:
            list_front_lines.append(line)
            list_front_line_labels.append(label)
            color='red'
        ax.plot(line[...,0],
                line[...,1],
                color=color)
        ax.annotate(label, xy=line[0])

    ax.plot(current_line_extended[...,0],
        current_line_extended[...,1],
        ls='dashed',
        c='k')
    curr_plot += 1
    ax = fig.add_subplot(7,3,curr_plot)
    ax.set_aspect('equal')
    ax.set_xlim(-4,4)
    ax.set_ylim(-4,4)

    pos = {}
    G.add_node(current_line_label)
    pos[current_line_label] = [0,0]
    front_node_labels = '\n'.join(list_front_line_labels)
    back_node_labels = '\n'.join(list_back_line_labels)
    G.add_node(front_node_labels)
    G.add_node(back_node_labels)
    pos[front_node_labels] = [2,-2]
    pos[back_node_labels] = [-2,-2]
    nodelist=[current_line_label,
              front_node_labels,
              back_node_labels]
    if len(list_front_line_labels) == 0 and len(list_back_line_labels) == 0:
        nodelist=nodelist[:1]
    else:
        G.add_edges_from([(current_line_label, front_node_labels),
                          (current_line_label, back_node_labels)])
    colors = ['black','red', 'blue']
    nx.draw_networkx(G,
                     pos=pos,
                     ax=ax,
                     nodelist=nodelist,
                     node_size=3000,
                     node_color=colors,
                     linewidths=0,
                     alpha=0.4,
                     font_size=12)
    
    curr_plot += 1
    if list_front_line_labels:
        values = [list_front_lines, list_front_line_labels]
        recursive_BSP(list_lines=values[0],
              line_labels=values[1],
              current_node_index=0,
              curr_plot=curr_plot)
    
    if list_back_line_labels:
        curr_plot += 3
        values = [list_back_lines, list_back_line_labels]    
        recursive_BSP(list_lines=values[0],
                      line_labels=values[1],
                      current_node_index=0,
                      curr_plot=curr_plot)
    
recursive_BSP(list_lines, line_labels)


Although my algorithm is rather crude, and the criteria for placing the splitting hyperplanes and for terminating the tree-building process vary depending on the specific application, this is a Pythonic example representative of the general strategy used in many graphics rendering applications (e.g., early game engines) along with the Painter's algorithm. It provides an efficient way to store and rapidly retrieve information about polygons, though the actual construction of a BSP tree can be time consuming.

Note that the final number of lines (polygons) is greater than at the start, and that the choice of starting node can affect the structure of the tree--controlling polygon population and tree 'balance' can be important in some applications. A BSP tree may be traversed (searched) rapidly--in linear time.

The focus of the next section (KDTrees) is simply a specialized application of BSP Trees, often used for looking at large sets of points instead of large sets of polygons.

5.3) k-d trees

Informal Definition: k-dimensional trees are a specialized case of BSP trees, where every node is (usually) a k-dimensional point. The hyperplanes for splitting up the k-dimensional space are chosen perpendicular to the various axes of the space.

There are many different ways to construct k-d trees--it is common to have a build process that alternates the splitting planes as you move down the tree: first an x-aligned plane, then y, then z, then back to x, etc. We will focus on the application of the code implemented in scipy.spatial.KDTree.
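
To make that alternating-axis build concrete, here is a tiny educational sketch (my own simplification, not scipy's actual implementation):

# educational k-d tree build sketch: the split axis cycles with depth,
# and the median point along that axis becomes the node
def build_kdtree(points, depth=0):
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]  # x, then y, (then z,) then back to x ...
    points = points[points[:, axis].argsort()]
    median = len(points) // 2
    return {'point': points[median],
            'axis': axis,
            'left': build_kdtree(points[:median], depth + 1),
            'right': build_kdtree(points[median + 1:], depth + 1)}

toy_tree = build_kdtree(np.random.random((8, 2)))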

k-d trees are particularly appropriate for nearest-neighbor searches, where a large portion of the search space can often be rapidly eliminated thanks to the binary split structure of the tree. However, k-d trees suffer from the curse of dimensionality in high dimensions.

5.3.1) Example Exploring k-d trees with scipy.spatial

Consider a set of autonomous vehicles (or robots) moving through a space with a dense set of obstacles. Explore the various ways that k-d trees may be used to probe critically-important (very close) obstacles.


In [54]:
np.random.seed(20)
vehicle_coords = np.random.random((500,2))
np.random.seed(21701)
obstacle_coords = np.random.random((775000,2))

fig = plt.figure()
ax = fig.add_subplot(111)
for label, coords, size, alpha in zip(['vehicles', 'obstacles'],
                               [vehicle_coords, obstacle_coords],
                               [500,60],
                               [1.0,0.01]):
    ax.scatter(coords[...,0],
               coords[...,1],
               label=label,
               alpha=alpha,
               s=size)
    
ax.set_aspect('equal')
ax.legend()
fig.set_size_inches(12,12)



In [55]:
%%time 
# find the indices of the 20 closest points using brute force
dist_matrix = cdist(vehicle_coords, obstacle_coords)
for vehicle_matrix in dist_matrix:
    closest_obstacle_indices = np.argsort(vehicle_matrix)[:20]


CPU times: user 1min 4s, sys: 1.26 s, total: 1min 5s
Wall time: 1min 5s

In [56]:
%%time
# find the indices of the 20 closest points using KDTree (pure Python)
tree = scipy.spatial.KDTree(obstacle_coords)
closest_obstacle_indices = tree.query(x=vehicle_coords,
                                      k=20)[1]


CPU times: user 5.69 s, sys: 107 ms, total: 5.79 s
Wall time: 5.77 s

In [57]:
%%time
# find the indices of the 20 closest points using cKDTree
tree = scipy.spatial.cKDTree(obstacle_coords)
closest_obstacle_indices = tree.query(x=vehicle_coords,
                                      k=20)[1]


CPU times: user 823 ms, sys: 18.1 ms, total: 841 ms
Wall time: 853 ms

In [58]:
%%time
# find the indices of the 20 closest points using cKDTree + approximation
tree = scipy.spatial.cKDTree(obstacle_coords)
closest_obstacle_indices = tree.query(x=vehicle_coords,
                                      k=20,
                                      eps=0.5)[1] # (1 + eps) * true_distance threshold


CPU times: user 703 ms, sys: 13.3 ms, total: 716 ms
Wall time: 715 ms

Note that with a single vehicle the results would favor the brute-force approach, since building the tree has an up-front cost--it really depends on the size of the query you are trying to do. In practice, when working with huge data sets, I sometimes find that a k-d tree is the only way to find nearest neighbors without the space complexity exceeding the physical memory of the system I'm using.
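
One more query pattern worth knowing: cKDTree also supports fixed-radius searches via query_ball_point, which avoids choosing k in advance (the radius below is arbitrary):

# find the indices of all obstacles within a fixed 'danger radius' of each vehicle
tree = scipy.spatial.cKDTree(obstacle_coords)
nearby_obstacles = tree.query_ball_point(vehicle_coords, r=0.002)
len(nearby_obstacles[0])  # number of obstacles within r of the first vehicle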

6) Convex Hulls

6.1) Convexity, Convex Hulls and General Position

Convex Regions: a region is convex if any two points of the region are visible to one another within the region. The convex hull is the smallest convex region containing the point set S.

Convex Hull by Example -- Back to Kindergarten Geoboards

Calculate it yourself with the scipy.spatial library


In [59]:
#a simple example of the convex hull for a random set of points in the plane
random_2D_array = np.random.random_sample((35,2))
hull = scipy.spatial.ConvexHull(random_2D_array)
sample_hull_plot = scipy.spatial.convex_hull_plot_2d(hull)


Prerequisites: General Position

A set of points (or other geometric objects) is said to be in general position if it avoids troublesome configurations, known as degenerate situations. The precise definition of general position varies depending on the algorithm. A set of n randomly chosen points is (almost surely) in general position, but this is (of course) not the case with real-world data.


In [60]:
#produce an example of a degenerate point set:
fig_degenerate = plt.figure()
degenerate_vertices = np.array([[0,0],
                                [1,0],
                                [2,0],
                                [3,0],
                                [3,1],
                                [2,1],
                                [1,1],
                                [1.5,0.6],
                                [0.4,0.3]])
ax = fig_degenerate.add_subplot(111)
ax.scatter(degenerate_vertices[...,0], degenerate_vertices[...,1], c='k', s=70)
ax.set_title('A degenerate point set')


Out[60]:
<matplotlib.text.Text at 0x1a3aa32e8>

In [61]:
#try calculating the convex hull of the above point set
hull = scipy.spatial.ConvexHull(degenerate_vertices, qhull_options='Qc')
plot = scipy.spatial.convex_hull_plot_2d(hull)



In [62]:
hull.coplanar #the above computation is possible because Qhull is able to ignore a subset of the pathological input points (their indices and neighbour indices are only computed with the additional 'Qc' option sent to Qhull)


Out[62]:
array([[2, 0, 3],
       [1, 0, 0],
       [5, 3, 6]], dtype=int32)

In this case the algorithm actually handles the potentially pathological collinear data, but it is important to note that this kind of data can often cause issues.

6.2) Practical Algorithm Time & Space Complexities

Convex Hull algorithms are fundamental to computational geometry and it is useful to have a working knowledge of the terminology used to describe their performance.

Complexity analysis captures the speed of an algorithm as a function of data input size using Big-Oh notation.

For a specific algorithm and an input of size $n$, the running time is written as $O(f(n))$, meaning that $c f(n)$ is an upper bound on the running time for some constant $c > 0$ and sufficiently large $n$. In other words, we typically ignore small values of $n$ and focus on the asymptotic 'worst-case' behavior.

Selected Examples:

big-Oh notation | Name | Example
$O(1)$ | Constant | Adding two numbers
$O(n \log n)$ | Loglinear | Sorting a list
$O(n^2)$ | Quadratic | Incremental convex hull algorithm
$O(n^k)$ | Polynomial | Robot arm motion planning
$O(c^n)$ | Exponential | Some brute-force algorithms

We focus on time complexity, but there are also cases where memory usage (space complexity) is critical for an algorithm.


In [63]:
#we will use an empirical approach, although an expert could study the algorithm / source code and identify the 'rate-limiting' step based on the type of operations performed

def linear(n, m, c):
    return m * n + c

def loglinear(n, m, c):
    return m * (n * np.log(n)) + c

def quadratic(n, m, c):
    return m * (n ** 2) + c

points_list = [1000,20000,30000,50000,70000,100000,200000,300000,500000,700000,900000,1000000]
list_times = []

for num_points in points_list:
    random_2D_points = np.random.random_sample((num_points,2))
    start_time = time.time()
    hull = scipy.spatial.ConvexHull(random_2D_points)
    elapsed_time = time.time() - start_time
    list_times.append(elapsed_time)
    print('benchmarked', num_points, 'points in:', elapsed_time, ' seconds')


benchmarked 1000 points in: 0.0009069442749023438  seconds
benchmarked 20000 points in: 0.004400968551635742  seconds
benchmarked 30000 points in: 0.0062830448150634766  seconds
benchmarked 50000 points in: 0.010130882263183594  seconds
benchmarked 70000 points in: 0.012984037399291992  seconds
benchmarked 100000 points in: 0.019385099411010742  seconds
benchmarked 200000 points in: 0.03750205039978027  seconds
benchmarked 300000 points in: 0.05657005310058594  seconds
benchmarked 500000 points in: 0.09241890907287598  seconds
benchmarked 700000 points in: 0.1319129467010498  seconds
benchmarked 900000 points in: 0.17487001419067383  seconds
benchmarked 1000000 points in: 0.19111895561218262  seconds

In [64]:
popt, pcov = scipy.optimize.curve_fit(linear, points_list, list_times)
linear_y_data = linear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(loglinear, points_list, list_times)
loglinear_y_data = loglinear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(quadratic, points_list, list_times)
quadratic_y_data = quadratic(np.array(points_list), popt[0], popt[1])

fig_bench_hull = plt.figure()
ax = fig_bench_hull.add_subplot(111)
ax.scatter(points_list, list_times, c='k', label='original data', s = 80)
ax.plot(points_list, list_times, c='k', label='original data')

ax.plot(points_list, linear_y_data, c = 'blue', lw=7,alpha = 0.3, label = 'linear')
ax.plot(points_list, loglinear_y_data, c = 'red', lw=7,alpha = 0.3, label = 'loglinear')
ax.plot(points_list, quadratic_y_data, c = 'green', lw=7,alpha = 0.3, label = 'quadratic')
ax.legend(loc=2)

ax.set_title('Crude Time Complexity Assessment\nFor Qhull 2D Convex Hull')
ax.set_xlim(-50,1.2e+6)
ax.set_ylim(-0.02,0.20)
ax.set_xlabel('# Points',fontsize=16)
ax.set_ylabel('Time (s)', fontsize=16)
fig_bench_hull.set_size_inches(8,8)


Qhull implements the Quickhull algorithm for convex hull [Barber et al. '96]. It has output-sensitive performance that can slightly improve on loglinear in some cases.

While the above Qhull algorithm is highly specialized / optimized and works well across several dimensions, a more straightforward summary of general Convex Hull algorithm performance is shown below:

Paradigm | 2D Complexity | 3D Complexity | Notes
Incremental | $O(n^2)$ | $O(n^2)$ | often implemented because of conceptual simplicity
Gift-wrapping | $O(nh)$ | $O(nf)$ | output-sensitive: $h$ = number of hull vertices, $f$ = number of hull facets
Graham scan | $O(n \log n)$ | N/A |
divide-and-conquer (recursion) | $O(n \log n)$ | $O(n \log n)$ | 4D and up: $\Omega(n^{d/2})$

6.3.1) Can the Python community do better than scipy.spatial.ConvexHull?

NO, not from an algorithmic standpoint: loglinear performance is a lower bound, because sorting--which provably requires loglinear time in a comparison model--can be reduced to the convex hull problem. Lift the numbers onto a parabola, and computing the hull hands them back in sorted order:


In [65]:
x = np.arange(12)
y = x ** 2
fig_hull = plt.figure()
ax = fig_hull.add_subplot(111)
ax.scatter(x,y,s=80,c ='k')
closed_parabola = np.zeros((x.shape[0] + 1,2))
closed_parabola[:-1,0] = x
closed_parabola[:-1,1] = y
closed_parabola[-1] = np.array([x[0], y[0]])  # close the polygon by returning to the first point
parabola = np.array(list(zip(x,y)))
p = Polygon(parabola, alpha = 0.2)
ax.add_patch(p)


Out[65]:
<matplotlib.patches.Polygon at 0x2add20f28>

It turns out that, even if we didn't care about connectivity (i.e., only wanted the hull points without their order), the fastest possible performance is still loglinear. This was discovered relatively recently (1985) by Franco Preparata and Michael Shamos.
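
To see the reduction in code: lift a set of numbers onto the parabola $y = x^2$ and read the hull vertices back off. A minimal sketch (the rotation bookkeeping assumes Qhull's counter-clockwise vertex order in 2D):

# every lifted point lies on the hull of a strictly convex parabola
values = np.random.random(10)
lifted = np.column_stack((values, values ** 2))
hull = scipy.spatial.ConvexHull(lifted)
# hull.vertices is in CCW order in 2D; rotating the cycle so the smallest
# value leads recovers the fully sorted order
verts = list(hull.vertices)
start = verts.index(np.argmin(values))
np.testing.assert_allclose(values[verts[start:] + verts[:start]],
                           np.sort(values))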

6.4) Practical Examples with scipy.spatial.ConvexHull

6.4.1) 2D

Problem: Let's say you have a startup that built a long-distance robot that you'd like to send over land from the far West of Oregon to its Eastern border. However, enough Oregon residents have objected to this project that a straight line through the State will not be possible. Instead, you decide to treat the entire area of Oregon as hazardous for the robot, planning to travel along its borders (or in the more permissive neighbouring States) only. Assuming flat terrain, find the minimum total distance that the robot may travel from the Westernmost point of Oregon to its Easternmost point.


In [66]:
#determine the indices of the West / East limit points (which also fall on the convex hull, by definition):
Oregon_vertices = np.load('Oregon_vertices.npy', allow_pickle=False)
x_min_index = np.argmin(Oregon_vertices[...,0])
x_max_index = np.argmax(Oregon_vertices[...,0])
x_min_index, x_max_index


Out[66]:
(0, 137)

In [67]:
#confirm positions of the West/ East limits in the context of the state and its convex hull
x_min_coord = Oregon_vertices[0]
x_max_coord = Oregon_vertices[137]
fig_Oregon_limits = plt.figure()
ax = fig_Oregon_limits.add_subplot(111)
ax.scatter(x_min_coord[0], x_min_coord[1], c='red', s = 300)
ax.scatter(x_max_coord[0], x_max_coord[1], c='red', s = 300)

hull = scipy.spatial.ConvexHull(Oregon_vertices)
hull_plot = scipy.spatial.convex_hull_plot_2d(hull, ax)


/Users/treddy/miniconda2/envs/mda_python35/lib/python3.5/site-packages/scipy/spatial/_plotutils.py:20: MatplotlibDeprecationWarning: The ishold function was deprecated in version 2.0.
  was_held = ax.ishold()

In [68]:
#plot and assess travel distance for the 2 possible border paths and 2 possible Convex Hull paths:
fig_Oregon_paths = plt.figure()
fig_Oregon_paths.set_size_inches(10,10)

ax_border_1 = fig_Oregon_paths.add_subplot(221)
ax_border_1.plot(Oregon_vertices[:138, 0], Oregon_vertices[:138, 1])
dist_1 = np.diag(scipy.spatial.distance_matrix(Oregon_vertices[:137],Oregon_vertices[1:138])).sum()
ax_border_1.set_title('Border path 1\n D={dist}'.format(dist=dist_1))

ax_border_2 = fig_Oregon_paths.add_subplot(222)
cycled_array = np.concatenate((Oregon_vertices,np.array([Oregon_vertices[0,...]])))
ax_border_2.plot(cycled_array[137:, 0], cycled_array[137:, 1])
dist_2 = np.diag(scipy.spatial.distance_matrix(cycled_array[137:-1],cycled_array[138:])).sum()
ax_border_2.set_title('Border path 2\n D={dist}'.format(dist=dist_2))

#note: in 2D scipy returns hull coords in CCW order
hull_coords = hull.points[hull.vertices]
hull_min_index = np.argmin(hull_coords[...,0])
hull_max_index = np.argmax(hull_coords[...,0])
ax_border_3 = fig_Oregon_paths.add_subplot(223)
ax_border_3.plot(hull_coords[hull_max_index:hull_min_index + 1,0],hull_coords[hull_max_index:hull_min_index + 1,1])
dist_3 = np.diag(scipy.spatial.distance_matrix(hull_coords[hull_max_index:hull_min_index],hull_coords[hull_max_index + 1:hull_min_index + 1])).sum()
ax_border_3.set_title('Hull path 1\n D={dist}'.format(dist=dist_3))

ax_border_4 = fig_Oregon_paths.add_subplot(224)
cycled_hull_coords = np.concatenate((hull_coords,hull_coords))
ax_border_4.plot(cycled_hull_coords[hull_min_index:hull_coords.shape[0] + 2,0], cycled_hull_coords[hull_min_index:hull_coords.shape[0] + 2,1])
dist_4 = np.diag(scipy.spatial.distance_matrix(cycled_hull_coords[hull_min_index:hull_coords.shape[0] + 1],cycled_hull_coords[hull_min_index + 1:hull_coords.shape[0] + 2])).sum()
ax_border_4.set_title('Hull path 2\n D={dist}'.format(dist=dist_4))

for axis in [ax_border_1, ax_border_2, ax_border_3,ax_border_4]:
    axis.scatter(x_min_coord[0], x_min_coord[1], c='red', s = 300, alpha = 0.3)
    axis.scatter(x_max_coord[0], x_max_coord[1], c='red', s = 300, alpha = 0.3)
    axis.set_xlim(-128,-112)
    axis.set_ylim(40,48)


So, clockwise Hull Path 1 is the shortest.

6.4.2) 3D

Problem: Estimate the surface area of a spherical influenza A virus based on my simulation coordinates.


In [69]:
#load and plot the data:
fig_flu = plt.figure()
fig_flu.set_size_inches(7,7)
flu_coords = np.load('flu_coords.npy', allow_pickle=False)
ax = fig_flu.add_subplot(111,projection = '3d')
ax.scatter(flu_coords[...,0], flu_coords[...,1], flu_coords[...,2])
ax.set_xlabel('x ($\AA$)')
ax.set_ylabel('y ($\AA$)')
ax.set_zlabel('z ($\AA$)')
ax.set_xlim3d(400,1200)
ax.set_ylim3d(400,1200)
ax.set_zlim3d(0,800)
ax.set_title('Flu Envelope Coordinates')


Out[69]:
<matplotlib.text.Text at 0x113d082e8>

In [70]:
#calculate the 3D convex hull, plot the facets (triangles) of the hull, and sum together their areas to estimate the overall area of the viral surface
fig_flu_hull = plt.figure()
ax = fig_flu_hull.add_subplot(111, projection = '3d')
flu_hull = scipy.spatial.ConvexHull(flu_coords)
hull_triangle_coords = flu_hull.points[flu_hull.simplices]
flu_triangles = Poly3DCollection(hull_triangle_coords, alpha = 0.1)
ax.add_collection3d(flu_triangles)
ax.set_xlabel('x ($\AA$)')
ax.set_ylabel('y ($\AA$)')
ax.set_zlabel('z ($\AA$)')
ax.set_xlim3d(400,1200)
ax.set_ylim3d(400,1200)
ax.set_zlim3d(0,800)
fig_flu_hull.set_size_inches(7,7)
ax.set_title('Flu Convex Hull')

print(hull_triangle_coords.shape)
print('surface area estimate from Convex Hull:', flu_hull.area)

#compare again surface area of roughly equivalent sphere
crude_radius = (flu_coords[...,2].max() - flu_coords[...,2].min()) / 2.
sphere_SA = 4. * math.pi * (crude_radius ** 2)
print('surface area of roughly equivalent sphere:', sphere_SA)
print('% reconstitution of sphere:', flu_hull.area / sphere_SA * 100)


(1404, 3, 3)
surface area estimate from Convex Hull: 1103521.4229953205
surface area of roughly equivalent sphere: 1134557.42184
% reconstitution of sphere: 97.2644840849

Considering that the flu virion isn't a perfect sphere, that % reconstitution of the sphere's surface area is a good indication that the convex hull calculation produced an excellent estimate.
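
As a closely related check, the ConvexHull object also exposes a volume attribute, so the same comparison can be repeated for the enclosed volume:

# hull volume vs. the volume of the roughly equivalent sphere
print('volume estimate from Convex Hull:', flu_hull.volume)
print('volume of roughly equivalent sphere:', 4. / 3. * math.pi * crude_radius ** 3)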

7) Delaunay Triangulations

7.1) Definition and Basic Information

Named after Boris Delaunay (Russian Mathematician)

A Delaunay triangulation is a triangulation that has only legal edges--edges that avoid small angles in triangles. If the full sets of angles of two triangulations are sorted and compared lexicographically, the Delaunay triangulation always compares larger (any other triangulation hits a smaller angle first).

General Position for Delaunay in 2D: no four points are cocircular

Textbook example of terrain reconstruction sensitivity to triangulation:

It is actually possible to construct a 2D Delaunay triangulation starting from any arbitrary triangulation by flipping one edge at a time (progressive illegal to legal edge flips), although this algorithm is slow ($O(n^2)$) and cannot be extended to 3D cases.
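
In practice, the 'is this edge illegal?' test is usually implemented with the classic incircle determinant predicate rather than by comparing angles directly; a minimal sketch (assuming the triangle (a, b, c) is supplied in counter-clockwise order):

def point_in_circumcircle(a, b, c, d):
    # for a CCW triangle (a, b, c), this determinant is positive iff
    # d lies strictly inside the circumcircle of the triangle
    m = np.array([[a[0] - d[0], a[1] - d[1], (a[0] - d[0]) ** 2 + (a[1] - d[1]) ** 2],
                  [b[0] - d[0], b[1] - d[1], (b[0] - d[0]) ** 2 + (b[1] - d[1]) ** 2],
                  [c[0] - d[0], c[1] - d[1], (c[0] - d[0]) ** 2 + (c[1] - d[1]) ** 2]])
    return np.linalg.det(m) > 0

point_in_circumcircle((0, 0), (1, 0), (0, 1), (0.5, 0.5))  # True: inside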

7.2) The Empty Circle Property


In [71]:
#none of the original data points in S can fall within the circumcircle of a triangle for a Delaunay triangulation (alternative and very important definition)
#demonstration using scipy.spatial once again:
fig_Delaunay_circles = plt.figure()
fig_Delaunay_circles.set_size_inches(8,8)
ax = fig_Delaunay_circles.add_subplot(111, aspect='equal')
random_points = np.random.random_sample((10,2))
tri = scipy.spatial.Delaunay(random_points)
ax.triplot(random_points[...,0], random_points[...,1], tri.simplices, 'go-')

#deal with calculating and plotting the circumcircles
circumcenters, circumradii = circumcircle.calc_circumcircle(tri.points[tri.simplices])
patches = []
for circumcenter_coordinates, circumradius in zip(circumcenters,circumradii):
    patches.append(matplotlib.patches.Circle((circumcenter_coordinates[0],circumcenter_coordinates[1]),circumradius[0],fill=False, alpha=1.0, fc='none', ec='none',))
p = PatchCollection(patches, alpha=0.1,match_original = True)
ax.add_collection(p)

ax.set_title('Delaunay Empty Circle Property')


Out[71]:
<matplotlib.text.Text at 0x1a0649208>

Also note that connecting the circumcenters of the circumcircles would produce the Voronoi diagram (they are dual graphs) -- this can be a very useful property and indeed Delaunay was a student of Voronoi.

7.3) Practical Applications of scipy.spatial.Delaunay()

2D

Problem: You're building a network with prohibitively expensive wiring, but as long as there is a path from one computer to another the network will function properly. Given a set of 25 computers (nodes) with fixed positions (e.g., in an office), find the minimum amount of wiring and the appropriate connectivity for the network. The key here is to exploit the fact that the Euclidean minimum spanning tree (EMST) is a subgraph of the Delaunay triangulation of the point set.


In [72]:
computer_positions = np.random.random_sample((25,2))
tri = scipy.spatial.Delaunay(computer_positions)
#need undirected graph data in a format ready for scipy.sparse.csgraph.minimum_spanning_tree
#this will be an NxN symmetric matrix with Euclidean distances for direct connections in the triangulation, and 0 for "self" and other points that are not connected

#so we effectively want a special distance matrix, which means we need to know which points are connected, so we will likely have to use the simplices attribute of tri

triangle_indices = tri.simplices #has shape (n_triangles, 3)

#iterate through the triangle indices and populate a template distance matrix (default 0 for no connection)
undirected_graph = np.zeros((25,25))

for triangle_index_array in triangle_indices:
    equivalent_coordinate_array = tri.points[triangle_index_array]
    distance_array = scipy.spatial.distance.pdist(equivalent_coordinate_array)
    undirected_graph[triangle_index_array[0],triangle_index_array[1]] = distance_array[0]
    undirected_graph[triangle_index_array[0],triangle_index_array[2]] = distance_array[1]
    undirected_graph[triangle_index_array[1],triangle_index_array[2]] = distance_array[2]

#sanity check: no point should be connected to itself (all diagonal values zero)
print('diagonal of undirected_graph:', np.diag(undirected_graph))

sparse_result = scipy.sparse.csgraph.minimum_spanning_tree(undirected_graph)

#iterate through sparse matrix, plotting and adding up distances
fig_emst = plt.figure()
fig_emst.set_size_inches(8,8)
ax = fig_emst.add_subplot(111,aspect='equal')

cx = sparse_result.tocoo() #convert to coordinate representation of matrix
label = 0
total_distance = 0
for i,j,v in zip(cx.row, cx.col, cx.data):
    if v != 0: #there's an edge if nonzero
        p1 = computer_positions[i]
        p2 = computer_positions[j]
        total_distance += v
        if not label:
            ax.plot([p1[0],p2[0]],[p1[1],p2[1]],c='green',ls='-',marker='o', label = 'EMST', alpha = 0.2, lw =12)
            label += 1
        else: 
            ax.plot([p1[0],p2[0]],[p1[1],p2[1]],c='green',ls='-',marker='o', alpha = 0.2, lw = 12)
ax.legend()
ax.set_title('EMST (D = {dist}) as subgraph of Delaunay Triangulation'.format(dist=total_distance))

#overlay the original triangulation for comparison
ax.triplot(computer_positions[...,0], computer_positions[...,1], triangle_indices, c = 'k',marker = 'o')


diagonal of undirected_graph: [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.]
Out[72]:
[<matplotlib.lines.Line2D at 0x1a397d518>,
 <matplotlib.lines.Line2D at 0x1a397dd30>]

7.4) Measuring the time complexity of scipy.spatial.Delaunay()


In [73]:
#use a similar empirical approach to that used above for benchmarking the convex hull code

points_list = [1000,20000,30000,50000,70000,100000,200000,300000,500000,700000,900000,1000000]
list_times = []

for num_points in points_list:
    random_2D_points = np.random.random_sample((num_points,2))
    start_time = time.time()
    tri = scipy.spatial.Delaunay(random_2D_points)
    elapsed_time = time.time() - start_time
    list_times.append(elapsed_time)
    print('benchmarked', num_points, 'points in:', elapsed_time, ' seconds')


benchmarked 1000 points in: 0.0059430599212646484  seconds
benchmarked 20000 points in: 0.14272785186767578  seconds
benchmarked 30000 points in: 0.22057294845581055  seconds
benchmarked 50000 points in: 0.4191920757293701  seconds
benchmarked 70000 points in: 0.6256179809570312  seconds
benchmarked 100000 points in: 0.9523100852966309  seconds
benchmarked 200000 points in: 2.0345470905303955  seconds
benchmarked 300000 points in: 3.1885480880737305  seconds
benchmarked 500000 points in: 5.580586910247803  seconds
benchmarked 700000 points in: 8.172300100326538  seconds
benchmarked 900000 points in: 10.772706985473633  seconds
benchmarked 1000000 points in: 12.009695053100586  seconds

In [74]:
popt, pcov = scipy.optimize.curve_fit(linear, points_list, list_times)
linear_y_data = linear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(loglinear, points_list, list_times)
loglinear_y_data = loglinear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(quadratic, points_list, list_times)
quadratic_y_data = quadratic(np.array(points_list), popt[0], popt[1])

fig_bench_Delaunay = plt.figure()
ax = fig_bench_Delaunay.add_subplot(111)
ax.scatter(points_list, list_times, c='k', label='original data', s = 80)
ax.plot(points_list, list_times, c='k', label='original data')

ax.plot(points_list, linear_y_data, c = 'blue', lw=7,alpha = 0.3, label = 'linear')
ax.plot(points_list, loglinear_y_data, c = 'red', lw=7,alpha = 0.3, label = 'loglinear')
ax.plot(points_list, quadratic_y_data, c = 'green', lw=7,alpha = 0.3, label = 'quadratic')
ax.legend(loc=2)

ax.set_title('Crude Time Complexity Assessment\nFor Qhull 2D Delaunay Triangulation')
#ax.set_xlim(-50,1.2e+6)
#ax.set_ylim(-0.02,0.20)
ax.set_xlabel('# Points',fontsize=16)
ax.set_ylabel('Time (s)', fontsize=16)
fig_bench_Delaunay.set_size_inches(8,8)


So, scipy.spatial.Delaunay appears to exhibit loglinear performance, which is optimal for this problem.

8) Voronoi Diagrams

8.1) Definition and Importance

Voronoi diagrams were first seriously studied by Georgy Voronoi in 1908, but they are also known as Thiessen polygons and Dirichlet tessellations.

A Voronoi diagram defines a Voronoi region around each of the points (generators) in the original data set. The Voronoi region for a given generator is the portion of space whose points are closer to that generator than to any other.

From a more practical standpoint, if you imagine a data set with all the grocery stores in your city as input points (generators), each grocery store will have its own Voronoi region, where all homes are closer to that grocery store than to any other. Applications span geography, marketing, network coverage, biology, robot motion planning, etc.
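
As a tiny sketch of that idea (hypothetical coordinates of my own, reusing the imports from the top of the notebook): assigning each home to its nearest store is exactly the nearest-generator query that the Voronoi diagram encodes.

stores = np.random.random((5, 2))   #generators (grocery stores)
homes = np.random.random((200, 2))  #query points (homes)
#each home's nearest-store index identifies the Voronoi region containing it
_, nearest_store = scipy.spatial.cKDTree(stores).query(homes)
print(np.bincount(nearest_store))   #number of homes served by each store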

Crucially, the Voronoi diagram is the mathematical dual of the Delaunay Triangulation.

All Voronoi regions (i.e., polygons in 2D) are convex. This can be proven (although I won't do it formally here!)--each region is an intersection of half-spaces, one per competing generator, and an intersection of convex sets is convex.

General position in 2D: no four generators are cocircular

8.2) Overview of Algorithms and their Time Complexities (2D)

Authors Year Citations Paradigm Performance
Shamos and Hoey 1975 1101 divide-and-conquer $O(n\:{\log}\:n)$
Green and Sibson 1978 861 incremental $O(n^2)$
Guibas and Stolfi 1985 1506 quad-edge data structure $O(n\:{\log}\:n)$
Fortune 1987 1402 sweepline $O(n\:{\log}\:n)$

Loglinear performance is known to be optimal.

8.3) Practical Problem Solving with scipy.spatial.Voronoi()

2D

Problem: Determine which residential locations are closest to a given water pump during the London cholera outbreak of ~150 years ago. This is the classic John Snow example, with modern mapping data obtained from Robin Wilson (see below). The sizes of the red points are proportional to the number of fatalities reported at each residential location.


In [75]:
#load data modified from Robin Wilson's (University of Southampton) blog
array_pump_coords = np.load('array_pump_data.npy') 
array_cholera_data = np.load('array_cholera_data.npy') 

#plot
figure_voronoi = plt.figure()
figure_voronoi.set_size_inches(8,8)
ax = figure_voronoi.add_subplot(111, aspect='equal')
vor = scipy.spatial.Voronoi(array_pump_coords)
voronoi_figure = scipy.spatial.voronoi_plot_2d(vor, ax = ax)
ax.scatter(array_cholera_data[...,0], array_cholera_data[...,1], s = array_cholera_data[...,2] * 8, c = 'red', edgecolor='none') #scale scatter point area by number of fatalities
ax.scatter(array_pump_coords[...,0], array_pump_coords[...,1], s = 60, c = 'blue', edgecolor='none') #default pump sizes a bit too small


/Users/treddy/miniconda2/envs/mda_python35/lib/python3.5/site-packages/scipy/spatial/_plotutils.py:20: MatplotlibDeprecationWarning: The ishold function was deprecated in version 2.0.
  was_held = ax.ishold()
Out[75]:
<matplotlib.collections.PathCollection at 0x19fdd1240>

3D

Problem: You're designing an autonomous drone that will fly in an urban location and will have to avoid obstacles that are both stationary (i.e., tall buildings) and dynamic (helicopter traffic, etc.). Demonstrate an approach to determining the safest routes (i.e., farthest from obstacles) for the drone in 3D space. Assume that using a random set of 3D points is a suitable simulation of the kinds of coordinate data the drone will receive in real time.


In [76]:
air_traffic_mess = np.random.random_sample((50,3))
#the edges of a 3D Voronoi diagram will be the farthest from the obstacle coordinates
vor = scipy.spatial.Voronoi(air_traffic_mess)
        
fig_drone_3d = plt.figure()
fig_drone_3d.set_size_inches(8,8)
ax = fig_drone_3d.add_subplot(111, projection = '3d')

for ridge_indices in vor.ridge_vertices:
    voronoi_ridge_coords = vor.vertices[ridge_indices]
    ax.plot(voronoi_ridge_coords[...,0], voronoi_ridge_coords[...,1], voronoi_ridge_coords[...,2], lw=2, c = 'green', alpha = 0.05)
    
vor_vertex_coords = vor.vertices

ax.scatter(air_traffic_mess[...,0], air_traffic_mess[...,1], air_traffic_mess[...,2], c= 'k', label='obstacles', edgecolor='none')
ax.scatter(vor_vertex_coords[...,0], vor_vertex_coords[...,1], vor_vertex_coords[...,2], c= 'orange', label='Voronoi vertices',edgecolors='white', marker = 'o', alpha = 0.9)

ax.legend()
ax.set_xlim3d(air_traffic_mess[...,0].min(), air_traffic_mess[...,0].max())
ax.set_ylim3d(air_traffic_mess[...,1].min(), air_traffic_mess[...,1].max())
ax.set_zlim3d(air_traffic_mess[...,2].min(), air_traffic_mess[...,2].max())


Out[76]:
(0.0019396967597413717, 0.97956029349219897)

We would want the drone to follow the green lines, although the data still needs to be cleaned up a bit. For example, we'd like to handle the "edges at infinity" (ridges extending beyond the finite Voronoi diagram, marked with the sentinel vertex index "-1" in the data structures returned by scipy / Qhull) more gracefully.
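
As a minimal first pass (a sketch only, reusing vor from the cell above), one could simply discard any ridge that touches the vertex at infinity before plotting:

#keep only ridges whose vertices are all finite
#(-1 is the sentinel index Qhull uses for the vertex at infinity)
finite_ridge_vertices = [ridge for ridge in vor.ridge_vertices
                         if -1 not in ridge]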

At the moment it appears to be non-trivial to plot the polyhedra around each of the obstacles. This is probably because the points need to be in the correct order for each polyhedron and again because of the edges at infinity.

A nice take-home exercise would be to attempt to clean this data up and produce a really nice 3D Voronoi diagram. One user reported their frustration / progress with this process in the issue below, and it seems we will eventually try to make this easier.

Development Note Type of Change Issue # scipy Implementation Version
Improve Voronoi (edge) data structure access (would, e.g., facilitate 3D plotting) Enhancement 7103 upcoming?

8.4) Measuring the Time Complexity for 2D scipy.spatial.Voronoi()


In [77]:
#use a similar empirical approach to that used above for benchmarking the convex hull code

points_list = [1000,20000,30000,50000,70000,100000,200000,300000,500000,700000,900000,1000000]
list_times = []

for num_points in points_list:
    random_2D_points = np.random.random_sample((num_points,2))
    start_time = time.time()
    vor = scipy.spatial.Voronoi(random_2D_points)
    elapsed_time = time.time() - start_time
    list_times.append(elapsed_time)
    print('benchmarked', num_points, 'points in:', elapsed_time, ' seconds')


benchmarked 1000 points in: 0.026362180709838867  seconds
benchmarked 20000 points in: 0.3999369144439697  seconds
benchmarked 30000 points in: 0.5852129459381104  seconds
benchmarked 50000 points in: 1.0894160270690918  seconds
benchmarked 70000 points in: 1.4797968864440918  seconds
benchmarked 100000 points in: 2.0395379066467285  seconds
benchmarked 200000 points in: 4.092041015625  seconds
benchmarked 300000 points in: 6.178493022918701  seconds
benchmarked 500000 points in: 10.210033893585205  seconds
benchmarked 700000 points in: 15.0747709274292  seconds
benchmarked 900000 points in: 19.156573057174683  seconds
benchmarked 1000000 points in: 21.792407989501953  seconds

In [78]:
popt, pcov = scipy.optimize.curve_fit(linear, points_list, list_times)
linear_y_data = linear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(loglinear, points_list, list_times)
loglinear_y_data = loglinear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(quadratic, points_list, list_times)
quadratic_y_data = quadratic(np.array(points_list), popt[0], popt[1])

fig_bench_Voronoi = plt.figure()
ax = fig_bench_Voronoi.add_subplot(111)
ax.scatter(points_list, list_times, c='k', label='original data', s = 80)
ax.plot(points_list, list_times, c='k', label='original data')

ax.plot(points_list, linear_y_data, c = 'blue', lw=7,alpha = 0.3, label = 'linear')
ax.plot(points_list, loglinear_y_data, c = 'red', lw=7,alpha = 0.3, label = 'loglinear')
ax.plot(points_list, quadratic_y_data, c = 'green', lw=7,alpha = 0.3, label = 'quadratic')
ax.legend(loc=2)

ax.set_title('Crude Time Complexity Assessment\nFor Qhull 2D Voronoi Diagram')
#ax.set_xlim(-50,1.2e+6)
#ax.set_ylim(-0.02,0.20)
ax.set_xlabel('# Points',fontsize=16)
ax.set_ylabel('Time (s)', fontsize=16)
fig_bench_Voronoi.set_size_inches(8,8)


The loglinear performance is known to be optimal, so we can't really improve on this from an algorithmic standpoint.

8.5) scipy.spatial.SphericalVoronoi()

Applications include my work on spherical viruses, geography, astrophysics, MRI analysis, and many other fields that deal with roughly spherical objects. A concise example follows below. The implemented algorithm is getting closer to loglinear time complexity, although there may still be a few weaknesses to iron out.

Wasn't available in a stable release of scipy when I gave the tutorial at PyCon last year!

Development Note Type of Change PR # scipy Implementation Version
adding spherical Voronoi diagram algorithm to scipy.spatial Enhancement 5232 `0.18.0`

Nikolai Nowaczyk and I worked on that PR for quite some time. There have since been several more improvements to the SphericalVoronoi() code. For example, the space complexity of the algorithm was initially too great, with a cap of tractable data sizes of around 200,000 points in the 0.18 implementation on a typical modern computer. Now, we can handle at least 10 million points on a modern machine:

Development Note Type of Change PR # scipy Implementation Version
BUG: SphericalVoronoi Space Complexity Enhancement 6828 `0.19.0`

It still takes a few minutes (~5 on my laptop) to handle 10 million generators, and we're probably still a ways off from, e.g., CGAL-type performance. For reference, ConvexHull works well up to about 238 million points.

I also had time to improve the performance of the code that sorts the vertices of the spherical Voronoi regions (used for plotting, surface area calculations, etc.):

Development Note Type of Change PR # scipy Implementation Version
ENH: cythonized spherical Voronoi region polygon vertex sorting Enhancement 6768 `0.19.0`

In [79]:
#simple example for usage of scipy.spatial.SphericalVoronoi using random points on the surface of a sphere (but one could easily imagine these points representing e.g., airports or cities on the surface of the Earth)

num_points = 600

def marsaglia(num_points):
    '''generate random points on surface of sphere using Marsaglia's method (see: http://mathworld.wolfram.com/SpherePointPicking.html)'''
    x1 = np.random.uniform(low=-1.0, high=1.0,size=num_points)
    x2 = np.random.uniform(low=-1.0, high=1.0,size=num_points)

    #filter out points for which the sum of squares of the two variates is >= 1
    mask = np.where(x1 ** 2 + x2 ** 2 < 1)
    x1 = x1[mask]
    x2 = x2[mask]

    x_coords = 2 * x1 * np.sqrt(1 - x1 ** 2 - x2 ** 2)
    y_coords = 2 * x2 * np.sqrt(1 - x1 ** 2 - x2 ** 2)
    z_coords = 1 - 2 * (x1 ** 2 + x2 ** 2)
    return np.stack((x_coords,y_coords,z_coords), axis=-1) 

points = marsaglia(num_points=num_points)

fig_spherical_Voronoi = plt.figure()
fig_spherical_Voronoi.set_size_inches(16,8)
ax1 = fig_spherical_Voronoi.add_subplot(121, projection = '3d')
ax1.scatter(points[...,0], points[...,1], points[...,2],c='blue')
ax1.set_title('Random Points on Sphere')

ax2 = fig_spherical_Voronoi.add_subplot(122, projection = '3d')
#calculate the Voronoi diagram on the surface of the sphere and plot the Voronoi region polygons
sv = scipy.spatial.SphericalVoronoi(points, radius=1)
sv.sort_vertices_of_regions() #generally needed for plotting / SA calculation

for region in sv.regions:
    random_color = colors.rgb2hex(np.random.rand(3))
    polygon = Poly3DCollection([sv.vertices[region]], alpha=1.0)
    polygon.set_color(random_color)
    ax2.add_collection3d(polygon)
    
ax2.set_title('Voronoi Regions')

for axis in [ax1,ax2]:
    axis.set_xlim(-1.5,1.5)
    axis.set_ylim(-1.5,1.5)
    axis.set_zlim(-1.5,1.5)


Additional user feedback suggests that we may need to check input more carefully to avoid confusion. In particular, there's no special code for handling duplicate / degenerate generators, and we could probably also do a better job of checking that the user-provided generators reasonably match the radius they provide (or the default unit radius); a toy version of such checks is sketched after the table below:

Development Note Type of Change PR # scipy Implementation Version
ENH: SphericalVoronoi Input Handling Enhancement 7254 In Progress
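
As a rough illustration only (this is my own hypothetical helper--PR 7254 may take a different approach; it assumes a sphere centered at the origin and reuses the imports from the top of the notebook):

def validate_spherical_voronoi_input(points, radius=1.0, tol=1e-7):
    '''Hypothetical input checks of the kind discussed above.'''
    #duplicate / degenerate generators
    if scipy.spatial.distance.pdist(points).min() < tol:
        raise ValueError('duplicate generators detected')
    #generators should (approximately) lie on a sphere of the given radius
    if not np.allclose(np.linalg.norm(points, axis=1), radius, atol=tol):
        raise ValueError('generators do not match the specified radius')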

8.5.1) Calculating Surface Area of Spherical Polygons

Not only have I had to perform this calculation extensively in my own work on viruses, but I frequently get emails from geographers and MRI scientists asking about this calculation for polygons on the surface of a sphere. I recently wrote a proposal to incorporate calculations of this nature into scipy:

Development Note Type of Change Issue # scipy Implementation Version
Proposal: Spherical Polygon Surface Area Calculation Enhancement 6069 In Progress

The best code I have for doing this at the moment was improved by Edd Edmondson (University of Portsmouth). Let's see if we get a sensible result from the sum of the surface areas of all polygons in the above example.
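
For reference, the code decomposes each spherical polygon into triangles and applies L'Huilier's theorem to each: for a spherical triangle with arc side lengths $a$, $b$, $c$ on the unit sphere and semiperimeter $s = (a + b + c)/2$, the spherical excess $E$ satisfies

$\tan\left(\frac{E}{4}\right) = \sqrt{\tan\frac{s}{2}\,\tan\frac{s-a}{2}\,\tan\frac{s-b}{2}\,\tan\frac{s-c}{2}}$

and the triangle's surface area on a sphere of radius $r$ is $E r^2$; the polygon's area is the sum over its triangles.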


In [80]:
#draft of code to calculate spherical polygon surface area
def convert_cartesian_array_to_spherical_array(coord_array,angle_measure='radians'):
    '''Take shape (N,3) cartesian coord_array and return an array of the same shape in spherical polar form (r, theta, phi). Based on StackOverflow response: http://stackoverflow.com/a/4116899
    use radians for the angles by default, degrees if angle_measure == 'degrees' '''
    spherical_coord_array = np.zeros(coord_array.shape)
    xy = coord_array[...,0]**2 + coord_array[...,1]**2
    spherical_coord_array[...,0] = np.sqrt(xy + coord_array[...,2]**2)
    spherical_coord_array[...,1] = np.arctan2(coord_array[...,1], coord_array[...,0])
    spherical_coord_array[...,2] = np.arccos(coord_array[...,2] / spherical_coord_array[...,0])
    if angle_measure == 'degrees':
        spherical_coord_array[...,1] = np.degrees(spherical_coord_array[...,1])
        spherical_coord_array[...,2] = np.degrees(spherical_coord_array[...,2])
    return spherical_coord_array

def convert_spherical_array_to_cartesian_array(spherical_coord_array,angle_measure='radians'):
    '''Take shape (N,3) spherical_coord_array (r,theta,phi) and return an array of the same shape in cartesian coordinate form (x,y,z). Based on the equations provided at: http://en.wikipedia.org/wiki/List_of_common_coordinate_transformations#From_spherical_coordinates
    use radians for the angles by default, degrees if angle_measure == 'degrees' '''
    cartesian_coord_array = np.zeros(spherical_coord_array.shape)
    #convert to radians if degrees are used in input (prior to Cartesian conversion process)
    if angle_measure == 'degrees':
        spherical_coord_array[...,1] = np.deg2rad(spherical_coord_array[...,1])
        spherical_coord_array[...,2] = np.deg2rad(spherical_coord_array[...,2])
    #now the conversion to Cartesian coords
    cartesian_coord_array[...,0] = spherical_coord_array[...,0] * np.cos(spherical_coord_array[...,1]) * np.sin(spherical_coord_array[...,2])
    cartesian_coord_array[...,1] = spherical_coord_array[...,0] * np.sin(spherical_coord_array[...,1]) * np.sin(spherical_coord_array[...,2])
    cartesian_coord_array[...,2] = spherical_coord_array[...,0] * np.cos(spherical_coord_array[...,2])
    return cartesian_coord_array

def calculate_haversine_distance_between_spherical_points(cartesian_array_1,cartesian_array_2,sphere_radius):
    '''Calculate the haversine-based distance between two points on the surface of a sphere. Should be more accurate than the arc cosine strategy. See, for example: http://en.wikipedia.org/wiki/Haversine_formula'''
    spherical_array_1 = convert_cartesian_array_to_spherical_array(cartesian_array_1)
    spherical_array_2 = convert_cartesian_array_to_spherical_array(cartesian_array_2)
    lambda_1 = spherical_array_1[1]
    lambda_2 = spherical_array_2[1]
    phi_1 = spherical_array_1[2]
    phi_2 = spherical_array_2[2]
    #we rewrite the standard Haversine slightly as longitude / latitude conventions differ from our spherical coordinates (latitude = pi/2 - phi)
    spherical_distance = 2.0 * sphere_radius * math.asin(math.sqrt( ((1 - math.cos(phi_2-phi_1))/2.) + math.sin(phi_1) * math.sin(phi_2) * ( (1 - math.cos(lambda_2-lambda_1))/2.)  ))
    return spherical_distance

def calculate_surface_area_of_a_spherical_Voronoi_polygon(array_ordered_Voronoi_polygon_vertices,sphere_radius):
    '''Calculate the surface area of a polygon on the surface of a sphere. Based on equation provided here: http://mathworld.wolfram.com/LHuiliersTheorem.html
    Decompose into triangles, calculate excess for each'''
    #have to convert to unit sphere before applying the formula
    spherical_coordinates = convert_cartesian_array_to_spherical_array(array_ordered_Voronoi_polygon_vertices)
    spherical_coordinates[...,0] = 1.0
    array_ordered_Voronoi_polygon_vertices = convert_spherical_array_to_cartesian_array(spherical_coordinates)
    n = array_ordered_Voronoi_polygon_vertices.shape[0]
    #point we start from
    root_point = array_ordered_Voronoi_polygon_vertices[0]
    totalexcess = 0
    #loop from 1 to n-2, with point 2 to n-1 as other vertex of triangle
    # this could definitely be written more nicely
    b_point = array_ordered_Voronoi_polygon_vertices[1]
    root_b_dist = calculate_haversine_distance_between_spherical_points(root_point, b_point, 1.0)
    for i in 1 + np.arange(n - 2):
        a_point = b_point
        b_point = array_ordered_Voronoi_polygon_vertices[i+1]
        root_a_dist = root_b_dist
        root_b_dist = calculate_haversine_distance_between_spherical_points(root_point, b_point, 1.0)
        a_b_dist = calculate_haversine_distance_between_spherical_points(a_point, b_point, 1.0)
        s = (root_a_dist + root_b_dist + a_b_dist) / 2.
        totalexcess += 4 * math.atan(math.sqrt( math.tan(0.5 * s) * math.tan(0.5 * (s-root_a_dist)) * math.tan(0.5 * (s-root_b_dist)) * math.tan(0.5 * (s-a_b_dist))))
    return totalexcess * (sphere_radius ** 2)

In [81]:
#sum the polygon areas from above and compare with theoretical surface area of unit sphere
calculated_SA = 0
for region in sv.regions:
    SA = calculate_surface_area_of_a_spherical_Voronoi_polygon(sv.vertices[region], 1.0)
    calculated_SA += SA
    
print('calculated_SA:', calculated_SA)
print('theoretical SA:', 4 * math.pi)


calculated_SA: 12.56637061435891
theoretical SA: 12.566370614359172

Well, that certainly looks like a sensible result! However, as you can see in the proposal above, there are probably a few mysteries left to solve (if only to determine the range of pathological inputs). Also, what if we wanted to avoid a Python for loop? Can we vectorize this code in numpy ufunc style if each polygon can have a different number of vertices? Ufuncs usually operate on 'rectangular' arrays.

9) Shape Comparison

9.1) Procrustes analysis

In Greek mythology, Procrustes was a son of Poseidon, and a bandit who would invite visitors to stay the night. Unfortunately for the visitors, he would forcibly fit them to an iron bed, either stretching their limbs to fit the bed frame if they were too small or cutting off their legs if they were longer than the bed. The term 'procrustean' has been used in statistics and computer science in various ways to describe forcibly fitting one concept to another, often in a rather harsh / rigid way.

Here, we focus on the reference to Procrustes analysis in shape analysis.

In general, this means calculating the sum of the squared distances between the points representing two objects AFTER they have been superimposed by:

  • scaling
  • translating
  • rotating
  • (sometimes) reflecting

9.2) Examples using scipy.spatial.procrustes

9.2.1) Simple Example -- Two Circles

Consider two circles with different radii and translational positions.


In [82]:
radius_circle_1 = 1.0
radius_circle_2 = 3.9
centroid_circle_1 = np.array([0.0, 0.0])
centroid_circle_2 = np.array([11.1, 4.2])

angles = np.random.random(50) * np.pi * 2
circle_1_x = (np.cos(angles) * radius_circle_1) + centroid_circle_1[0]
circle_1_y = (np.sin(angles) * radius_circle_1) + centroid_circle_1[1]
circle_2_x = (np.cos(angles) * radius_circle_2) + centroid_circle_2[0]
circle_2_y = (np.sin(angles) * radius_circle_2) + centroid_circle_2[1]

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(circle_1_x, circle_1_y, c='blue', label='circle 1')
ax.scatter(circle_2_x, circle_2_y, c='orange', label='circle 2')
ax.legend()
ax.set_aspect('equal')
ax.set_xlabel('x', fontsize=18)
ax.set_ylabel('y', fontsize=18)
fig.set_size_inches(10,6)



In [83]:
ref_matrix, fitted_matrix, disparity = scipy.spatial.procrustes(np.array(list(zip(circle_1_x, circle_1_y))),
                                                                np.array(list(zip(circle_2_x, circle_2_y))))

In [84]:
# let's start by plotting the fit result (shape comparison)
fig = plt.figure()
ax = fig.add_subplot(111)
for matrix, label, color, size in zip([ref_matrix, fitted_matrix], ['circle 1','circle 2'], ['blue', 'orange'], [200,30]):
    ax.scatter(matrix[...,0],
               matrix[...,1],
               label=label,
               color=color,
               s=size,
               alpha=0.7)
ax.legend()
ax.set_aspect('equal')
ax.set_xlabel('x', fontsize=18)
ax.set_ylabel('y', fontsize=18)
ax.set_title('Procrustes Superimposition of Two Circles', fontsize=25)
fig.set_size_inches(10,8)


Two perfect circles will always have the same shape--the full Procrustes analysis allows for translation, rotation, and scaling (and sometimes reflection) to condense the analysis down to a comparison of shape only. We would thus expect the disparity ("distance") between these shapes to be 0:


In [85]:
disparity, round(disparity)


Out[85]:
(3.0031816495429402e-31, 0.0)

The disparity is effectively zero--scipy calculates this as the sum of the squared distances between corresponding points in the two superimposed shapes.

9.2.2) More Complex Shapes -- Comparison of Two Bone Structures


In [86]:
# let's create some random data that might conceivably represent the shapes of some exotic bones discovered during
# a scientific expedition

# use 2D data to facilitate visualization
np.random.seed(589018)
bone_1_data = np.random.random((9,2)) * np.sqrt(27)
np.random.seed(11686)
bone_2_data = np.random.random((50,2)) * 2 + 5

fig = plt.figure()
ax = fig.add_subplot(111)
list_shapes = []
for data, label, color in zip([bone_1_data, bone_2_data],
                              ['bone 1', 'bone 2'],
                              ['blue', 'purple']):
    hull = scipy.spatial.ConvexHull(data)
    shape_outline = hull.points[hull.vertices]
    list_shapes.append(shape_outline)
    polygon = Polygon(shape_outline, 
                      alpha=0.4,
                      facecolor=color)
    ax.scatter(shape_outline[...,0],
               shape_outline[...,1],
               color=color,
               label=label)
    ax.add_patch(polygon)

ax.legend(fontsize=22)
ax.set_ylim(0,8)
ax.set_xlim(0,8)
ax.set_aspect('equal')
ax.set_xlabel('x',fontsize=20)
ax.set_ylabel('y',fontsize=20)
ax.set_title('Bone Shapes from Scientific Expedition',
             fontsize=22)
fig.set_size_inches(9,9)



In [87]:
# perform the Procrustes analysis / superimposition of the two bone structures
ref_matrix, fitted_matrix, disparity = scipy.spatial.procrustes(list_shapes[0],
                                                                list_shapes[1])

In [88]:
# plot the superposed bone structures
fig = plt.figure()
ax = fig.add_subplot(111)
for matrix, label, color in zip([ref_matrix, fitted_matrix],
                                ['bone 1', 'bone 2'],
                                ['blue', 'purple']):
    polygon = Polygon(matrix,
                      alpha=0.4,
                      facecolor=color)
    ax.add_patch(polygon)
    ax.scatter(matrix[...,0],
               matrix[...,1],
               color=color,
               label=label)

ax.legend(fontsize=22)
ax.set_xlim(-0.5,0.5)
ax.set_ylim(-0.5,0.5)
fig.set_size_inches(9,9)


So this is a nice example of a case where the shapes aren't identical and we wouldn't expect the disparity to be ~0:


In [89]:
disparity


Out[89]:
0.28587939064829399

9.3) The Hausdorff Distance

Development Note Type of Change Pull Request # scipy Implementation Version
Added scipy.spatial.distance.directed_hausdorff Enhancement 6649 0.19

In relatively simple terms, the directed Hausdorff distance is the maximum, over all points in one data set, of the minimum distance to the other data set. This is far easier to understand with a simple example (see below). Common applications include computer vision, path similarity analysis, and related analyses that compare shapes. It is worth noting that the directed Hausdorff distance is asymmetric; it is more common to calculate the actual "metric"--the general (symmetric) Hausdorff distance--which is the maximum of the directed Hausdorff distances in each direction.
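
Formally, for point sets $A$ and $B$ with $\lVert \cdot \rVert$ the Euclidean norm:

$d_H(A, B) = \max_{a \in A}\,\min_{b \in B}\,\lVert a - b \rVert \qquad D_H(A, B) = \max\{d_H(A, B),\; d_H(B, A)\}$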

9.3.1) Simple Example: Concentric circles with one outlier


In [90]:
radius_circle_1 = 1.0
radius_circle_2 = 2.0

np.random.seed(776)
angles = np.random.random(70) * np.pi * 2
circle_1_x = np.cos(angles) * radius_circle_1
circle_1_y = np.sin(angles) * radius_circle_1
circle_2_x = np.cos(angles) * radius_circle_2
circle_2_y = np.sin(angles) * radius_circle_2
# add one outlier to one of the circles
circle_2_y[-1] = 3.0

circle_1_coords = np.array(list(zip(circle_1_x,
                                    circle_1_y)))

circle_2_coords = np.array(list(zip(circle_2_x,
                                    circle_2_y)))

fig = plt.figure()
ax = fig.add_subplot(111)
for circle_name, circle_coords in zip(['circle 1', 'circle 2'],
                                      [circle_1_coords, circle_2_coords]):
    ax.scatter(circle_coords[...,0],
               circle_coords[...,1],
               label=circle_name)
ax.legend(fontsize=18,
          bbox_to_anchor=[1.0,1.0])
ax.set_aspect('equal')
ax.set_xlim(-3.5,3.5)
ax.set_ylim(-3.5,3.5)
fig.set_size_inches(8,8)


The minimum distances from points on circle 1 to circle 2 should all be pretty similar (~1), so the directed Hausdorff distance from circle 1 to circle 2 should be ~1. Conversely, for circle 2 to circle 1, the single outlier point has quite a large minimum distance to any point on circle 1, so the maximum minimum distance (directed Hausdorff distance) from circle 2 to circle 1 should be greater.


In [91]:
# let's see if this is the case
from scipy.spatial.distance import directed_hausdorff
d_c1_c2, index_c1, index_c2 = directed_hausdorff(circle_1_coords, circle_2_coords)
d_c2_c1, index_c2, index_c1 = directed_hausdorff(circle_2_coords, circle_1_coords)
print('directed Hausdorff distance C1 -> C2:', d_c1_c2)
print('directed Hausdorff distance C2 -> C1:', d_c2_c1)


directed Hausdorff distance C1 -> C2: 1.003829850210507
directed Hausdorff distance C2 -> C1: 2.0116136570973846

The indices that form the "Hausdorff pair" can be quite useful to identify for some applications (e.g., simulation work performed by the Beckstein lab at ASU). We'd expect the C2 -> C1 Hausdorff pair to involve the outlier and one of the "top" points on circle 1--let's verify this:


In [92]:
fig = plt.figure()
ax = fig.add_subplot(111)
for circle_name, circle_coords, pair_index in zip(['circle 1', 'circle 2'],
                                                  [circle_1_coords, circle_2_coords],
                                                  [index_c1, index_c2]):
    ax.scatter(circle_coords[...,0],
               circle_coords[...,1])

    ax.scatter(circle_coords[pair_index][0],
               circle_coords[pair_index][1],
               s=300,
               alpha=1.0,
               marker = 'x',
               c = 'k',
               label='Hausdorff Pair {circle}'.format(circle=circle_name))

ax.legend(bbox_to_anchor=[1.0, 1.0])
ax.set_aspect('equal')
ax.set_xlim(-3.5,3.5)
ax.set_ylim(-3.5,3.5)
fig.set_size_inches(8,8)


And finally the general Hausdorff distance is the max of the two directed distances:


In [93]:
general_Hausdorff_dist = max(d_c1_c2, d_c2_c1)
print('general Hausdorff distance:', general_Hausdorff_dist)


general Hausdorff distance: 2.0116136570973846

9.4) The Frechet Distance (Proposed)

I learned about the Hausdorff distance covered above and the related Frechet distance (proposed for incorporation into scipy) by reading a previously published manuscript from the Beckstein lab at ASU:

Seyler SL, Kumar A, Thorpe MF, Beckstein O (2015) Path Similarity Analysis: A Method for Quantifying Macromolecular Pathways. PLOS Computational Biology 11(10): e1004568. https://doi.org/10.1371/journal.pcbi.1004568

The Frechet distance is perhaps a bit harder to understand than the Hausdorff distance, and I will therefore borrow a few presentation materials from the authors above (with their permission).

The Frechet distance is sensitive to the ordering of the points in the two curves / shapes being compared, but the Hausdorff distance is not--the Frechet distance can therefore be a more appropriate measure in some cases.

Common applications include:

  • handwriting recognition
  • protein structure alignment
  • path similarity analysis (see above).

9.4.1) The Classic / Intuitive Interpretation (Informal Definition)

The classic analogy: an owner walks along one curve while their dog walks along the other; each can vary their speed, but neither may backtrack. The Frechet distance is the length of the shortest leash that permits such a joint traversal from start to finish.

Efficient Frechet distance implementations may be a bit more challenging than Hausdorff--the scipy Hausdorff implementation used above (Taha and Hanbury, 2015) has an average run time of $O(m)$ and a worst case of $O(m \cdot o)$ (for inputs of $m$ and $o$ points), while quadratic / polynomial time complexities are reported for discrete / continuous Frechet distances:

I did, however, stumble upon a publication reporting a subquadratic discrete Frechet algorithm:

Pankaj K. Agarwal, Rinat Ben Avraham, Haim Kaplan, and Micha Sharir (2014) Computing the Discrete Fréchet Distance in Subquadratic Time. SIAM J. Comput., 43: 429–449. DOI: http://dx.doi.org/10.1137/130920526

I asked Dr. Agarwal, the RJR Nabisco Professor of Computer Science and Mathematics at Duke University, about the availability of code implementing this algorithm (in any language) and he replied:

I am not aware of such a code, and I am not sure whether it's worth the effort for a variety of reasons. $n^2$ algorithm is so simple that it will perform significantly better even for polygonal chains with tens of thousands of vertices.

I'd still like to incorporate some Frechet implementation(s) into scipy, as the Frechet distance is a natural complement to the Hausdorff distance, though in many cases the two are quite similar:

There are some other varieties of the Frechet distance, including for example:

  • the weak Frechet distance: the dog and its owner may backtrack along their respective paths
  • the geodesic Frechet distance: adds the additional complexity of an obstacle in Euclidean space between the curves
  • the homotopic Frechet distance: the leash would have to be able to move over obstacles (e.g., tall bushes) as the owner walks the dog

I'd like to implement all of these in scipy! But, we'll see what I can actually manage!

9.4.2) Discrete Frechet Distance Example -- Path of an Autonomous Vehicle

Consider the case of autonomous vehicle testing. You may have a defined path and you want to check the fidelity of your autonomous vehicle's true path to the desired path. In practice, this kind of data analysis might span many cars using a number of GIS data sets, but let's focus on a single vehicle with a given target path using randomly-generated data for convenience.


In [94]:
num_time_points = 100
np.random.seed(111409)
reference_path_values = np.random.random((num_time_points, 2))
np.random.seed(231)
true_path_values = np.random.random((num_time_points, 2))

In [95]:
fig = plt.figure()
ax = fig.add_subplot(111)
for label, path_vals in zip(['reference path', 'actual path'],
                            [reference_path_values, true_path_values]):
    ax.plot(path_vals[...,0],
            path_vals[...,1],
            label=label,
            marker='o')
ax.legend(bbox_to_anchor=[1.0,1.0],
          fontsize=20)
ax.set_aspect('equal')
ax.set_title('Autonomous Vehicle Path vs. Desired Reference Path',
             fontsize=20)
fig.set_size_inches(10,10)


Ok, so maybe the autonomous vehicle isn't working that well yet! Let's try to quantify the difference between the paths using the general Hausdorff distance:


In [96]:
general_H_dist = max(directed_hausdorff(reference_path_values, true_path_values)[0],
                     directed_hausdorff(true_path_values, reference_path_values)[0])
general_H_dist


Out[96]:
0.12443765757128755

Now, let's try the discrete Frechet distance, which is actually sensitive to backtracking motions. Since there's currently no Frechet implementation in scipy, I'll borrow some code written by my colleagues at ASU for another Python library.


In [97]:
def sqnorm(v, axis=None):
    # modified from original code written by Sean Seyler at Arizona State University
    # and available in the open source Python MDAnalysis library
    return np.sum(v*v, axis=axis)

def get_msd_matrix(P, Q, axis=None):
    # modified from original code written by Sean Seyler at Arizona State University
    # and available in the open source Python MDAnalysis library
    return np.asarray([sqnorm(p - Q, axis=axis) for p in P])

def get_coord_axes(path):
    # modified from original code written by Sean Seyler at Arizona State University
    # and available in the open source Python MDAnalysis library
    path_dimensions = len(path.shape)
    if path_dimensions == 3:
        N = path.shape[1]
        axis = (1,2) # 1st axis: atoms, 2nd axis: x,y,z coords
    elif path_dimensions == 2:
        # can use mod to check if total # coords divisible by 3
        N = path.shape[1] / 3
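        # NB: this convention is inherited from MDAnalysis, where a 2D path stores
        # 3N flattened atomic coordinates per frame; for the plain 2D points used
        # in this tutorial, N is a constant (2/3) and merely rescales the result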
        axis = (1,) # 1st axis: 3N structural coords (x1,y1,z1,...,xN,xN,zN)
    else:
        err_str = "Path must have 2 or 3 dimensions; the first dimensions (axis"\
                + " 0) must correspond to frames, axis 1 (and axis 2, if"       \
                + " present) must contain atomic coordinates."
        raise ValueError(err_str)
    return N, axis

def discrete_frechet(P, Q):
    # modified from original code written by Sean Seyler at Arizona State University
    # and available in the open source Python MDAnalysis library
    N, axis = get_coord_axes(P)
    Np, Nq = len(P), len(Q)
    d = get_msd_matrix(P, Q, axis=axis)
    ca = -np.ones((Np, Nq))

    def c(i, j):
        """Compute the coupling distance for two partial paths formed by *P* and
        *Q*, where both begin at frame 0 and end (inclusive) at the respective
        frame indices :math:`i-1` and :math:`j-1`. The partial path of *P* (*Q*)
        up to frame *i* (*j*) is formed by the slicing ``P[0:i]`` (``Q[0:j]``).
        :func:`c` is called recursively to compute the coupling distance
        between the two full paths *P* and *Q*  (i.e., the discrete Frechet
        distance) in terms of coupling distances between their partial paths.
        :Arguments:
          *i*
             int, partial path of *P* through final frame *i-1*
          *j*
             int, partial path of *Q* through final frame *j-1*
        :Returns:
          float, the coupling distance between partial paths ``P[0:i]`` and
          ``Q[0:j]``
        """
        if ca[i,j] != -1 : return ca[i,j]
        if i > 0:
            if j > 0: ca[i,j] = max( min(c(i-1,j),c(i,j-1),c(i-1,j-1)), d[i,j] )
            else:     ca[i,j] = max( c(i-1,0), d[i,0] )
        elif j > 0:   ca[i,j] = max( c(0,j-1), d[0,j] )
        else:         ca[i,j] = d[0,0]
        return        ca[i,j]

    return ( c(Np-1, Nq-1) / N )**0.5

In [98]:
discrete_frechet(reference_path_values,
                 true_path_values)


Out[98]:
0.94910964915946772

In simple terms, the length of the leash for the leash-minimizing traversal is quite a bit larger than the Hausdorff distance, and it is likely a better measure of the difference between these paths because backtracking motion is involved.

10) Miscellaneous Topics

10.1) Hyperrectangles

Informal definition: a generalization of rectangles to higher dimensions

In practice, I don't think I've had to use hyperrectangles directly, but I've definitely had to use them indirectly (they are used heavily by e.g., KDTree in scipy, and more broadly in fields that involve special search types, including database theory).

So, let's just try to explore a bit with what scipy has to offer with hyperrectangles.

10.1.2) Exploring scipy.spatial.Rectangle with some examples


In [99]:
# let's start with a simple unit cube, with coordinates provided as the set of Cartesian "intervals"
hyper_rect_1 = scipy.spatial.Rectangle(maxes=[1.0,1.0,1.0],
                                       mins=[0.0,0.0,0.0])
hyper_rect_1


Out[99]:
<Rectangle [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]>

In [100]:
# the volume of the unit cube should be 1 ** 3
hyper_rect_1.volume()


Out[100]:
1.0

Ok, so a method to find the volume of a cubic object isn't particularly impressive--let's see if we can find something more interesting.


In [101]:
child_rect_1, child_rect_2 = hyper_rect_1.split(d=0, split=0.5)

In [102]:
child_rect_1, child_rect_2


Out[102]:
(<Rectangle [(0.0, 0.5), (0.0, 1.0), (0.0, 1.0)]>,
 <Rectangle [(0.5, 1.0), (0.0, 1.0), (0.0, 1.0)]>)

In [103]:
# let's try plotting the resultant hyperrectangle split to see what happened (maybe I'm just bad at 3D visualization in my head!)

In [104]:
child_rect_1.mins, child_rect_1.maxes


Out[104]:
(array([ 0.,  0.,  0.]), array([ 0.5,  1. ,  1. ]))

In [105]:
child_rect_2.mins, child_rect_2.maxes


Out[105]:
(array([ 0.5,  0. ,  0. ]), array([ 1.,  1.,  1.]))

In [106]:
# use the Cartesian boundaries to build the cubic vertices manually (there's probably a faster way to do this programmatically)
child_rect_1_coords = np.array([[0, 0, 0],
                                [0.5, 0, 0],
                                [0, 0, 1],
                                [0.5, 0, 1],
                                [0, 1, 1],
                                [0.5, 1, 1],
                                [0.5, 1, 0],
                                [0, 1, 0]])
# the sibling hyperrectangle is shifted + 0.5 along the first axis
child_rect_2_coords = np.zeros((child_rect_1_coords.shape))
child_rect_2_coords[...,0] = child_rect_1_coords[...,0] + 0.5
child_rect_2_coords[...,1:] = child_rect_1_coords[...,1:]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for child_rect, label in zip([child_rect_1_coords, child_rect_2_coords],
                             ['child 1', 'child 2']):
    ax.scatter3D(child_rect[...,0],
                 child_rect[...,1],
                 child_rect[...,2],
                 s=200,
                 label=label)
ax.legend()
ax.set_aspect('equal')
ax.set_xlabel('x', fontsize=25)
ax.set_ylabel('y', fontsize=25)
ax.set_zlabel('z', fontsize=25)
fig.set_size_inches(9,9)


So, we clearly have a split of the hyperrectangle along the x-dimension (the central vertices of child 2 conceal the identical central vertices of child 1 because child 2 was plotted second). It is not hard to imagine that this kind of operation would be useful in, e.g., the construction of a binary space partitioning scheme (such as a KDTree).


In [107]:
# we'd expect the volumes of the children to be halved (0.5):
child_rect_1.volume(), child_rect_2.volume()


Out[107]:
(0.5, 0.5)

In [108]:
# we'd expect the minimum distance between the two child hyperrectangles to be 0 (they have overlapping vertices),
# and the maximum distance between vertices would intuitively be the diagonal with height 1, base sqrt(2):

child_rect_1.min_distance_rectangle(child_rect_2)


Out[108]:
0.0

In [109]:
child_rect_1.max_distance_rectangle(child_rect_2)


Out[109]:
1.7320508075688772

In [110]:
np.sqrt(1 ** 2 + (np.sqrt(2) ** 2))


Out[110]:
1.7320508075688774

My suspicion is that Rectangle was primarily implemented for its internal usage in the KDTree source code, but you never know--it may be useful to have available in some cases. There are also Rectangle methods for looking at the distances between a hyperrectangle and an input point; a quick illustration follows below.
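
For instance (a quick sketch of my own, reusing hyper_rect_1 from above):

#distances from the unit cube to a query point outside it
query_point = np.array([2.0, 0.5, 0.5])
print(hyper_rect_1.min_distance_point(query_point)) #1.0 (offset along x only)
print(hyper_rect_1.max_distance_point(query_point)) #sqrt(4.5) to the farthest corners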

11) Discussion -- What would you really like to have available in scipy.spatial?

Some ideas:

  • vectorized spherical polygon surface area calculation
  • Frechet distance (and variations thereof)
  • scipy.spatial.ConicalVoronoi and / or other Voronoi surface solutions?
