This year my updated talk focuses on scipy.spatial
instead of the broader topic of "computational geometry in Python" covered last year. This hopefully cuts down on library dependencies a bit as well. As I've been increasingly involved in contributing to scipy.spatial
, this tutorial contains a number of 'Development Notes' formatted as follows:
Development Note | Type of Change | Pull Request # | scipy Implementation Version |
---|---|---|---|
description of change / improvement to scipy | Enhancement | 9999 | 0.19 |
My objective in placing these 'Development Notes' throughout the tutorial is to provide you with some cutting edge information about scipy.spatial
, especially for those areas that I'm actively working on. It may also inspire some of you to contribute or suggest implementing features we haven't thought of yet.
One of the key references I used in preparing the talk for last year was:
'Discrete and Computational Geometry' (2011) by Devadoss and O'Rourke.
Since some key components of that talk are still present this year, that is still a very suitable reference, especially for fundamental topics like convex hulls, Delaunay triangulations, and Voronoi diagrams.
Computational Geometry: the study of computer algorithms that perform geometric operations.
Applications:
cooking ingredient logistics (success polygons)
A "young field," with modern computational geometry starting in ~ the 1970s, although the underlying concepts permeate many fields and existed in some form or other for centuries. Consider the illustration of solar system vertices (matter circling around stars) from Rene Descartes, produced in the 17th Century, and highlighted by Aurenhammer et al. in their recent textbook "Voronoi Diagrams and Delaunay Triangulations":
In [1]:
import scipy
import numpy as np
import math
import matplotlib.pyplot as plt
import time
import scipy.optimize
import PIL
import matplotlib
import networkx as nx
from matplotlib.patches import Polygon
from mpl_toolkits.mplot3d import Axes3D
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
from matplotlib.collections import PolyCollection
from matplotlib.collections import PatchCollection
from matplotlib import colors
import pickle
import circumcircle
%matplotlib inline
scipy.__version__
Out[1]:
There are many cases where it is desirable to calculate the distances between all possible pairs of points from two data sets. Consider a self-driving vehicle that is programmed to avoid animals that enter a roadway. It may have a set of coordinates that represent the vehicle proper relative to some kind of camera / detector, and another set of coordinates that represent the apparent obstruction (animal on the road).
For simplicity, let's define the two sets of coordinates randomly:
In [2]:
vehicle_coords = np.random.random((50,3))
animal_coords = np.random.random((600,3))
In order to calculate the amount of time that the vehicle has to make a course adjustment, it would be sensible to first calculate the closest distance between a point on the vehicle and a point on the animal. Let's try this by calculating all possible distances between the two objects and finding the minimum of that distance matrix.
In [3]:
from scipy.spatial import distance_matrix
all_distances = distance_matrix(vehicle_coords, animal_coords)
all_distances.shape
Out[3]:
Each row of the distance matrix represents a point on the vehicle, and each column (value in a given row) is the distance between that point on the vehicle and one of the points on the animal (ordered by index).
In [4]:
closest_distance = all_distances.min()
closest_distance
Out[4]:
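If we also want to know which specific pair of points produced that minimum, a small sketch (not part of the original cells) using np.unravel_index on the same distance matrix does the trick:
# recover the (vehicle point, animal point) index pair that produced the minimum distance
vehicle_idx, animal_idx = np.unravel_index(all_distances.argmin(), all_distances.shape)
print(vehicle_idx, animal_idx, all_distances[vehicle_idx, animal_idx])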
Consider another situation, where a self-driving vehicle is aware (i.e., via real-time 'traffic' reporting) of an animal obstructing a roadway, even though the vehicle does not have line of sight to the animal obstacle. The animal is a few city blocks away, but the vehicle would still benefit from a distance & time estimate so that appropriate adjustments may be made.
In this case, the straight-line Euclidean distance would not represent the shortest distance the vehicle would travel in practice--instead, we need to consider the so-called L1 norm / taxicab geometry / Manhattan distance, etc.
In [5]:
taxicab_distances = distance_matrix(vehicle_coords, animal_coords, p=1)
taxicab_distances.shape
Out[5]:
In [6]:
closest_taxicab_distance = taxicab_distances.min()
closest_taxicab_distance
Out[6]:
In [7]:
%timeit distance_matrix(vehicle_coords, animal_coords)
In [8]:
from scipy.spatial.distance import cdist
In [9]:
%timeit cdist(vehicle_coords, animal_coords)
In [10]:
# confirm same value for p=2 norm
distance_matrix(vehicle_coords, animal_coords, p=2).min() == cdist(vehicle_coords, animal_coords, metric='euclidean').min()
Out[10]:
In [11]:
%timeit distance_matrix(vehicle_coords, animal_coords, p=1)
Slower with minkowski metric specification though!
In [12]:
%timeit cdist(vehicle_coords, animal_coords, metric='minkowski', p=1)
Use cityblock metric instead of p=1 minkowski
In [13]:
# time equivalent calculation using alternative calling convention
%timeit cdist(vehicle_coords, animal_coords, metric='cityblock')
Make sure the two different rectilinear calling conventions produce the same value
In [14]:
cdist(vehicle_coords, animal_coords, metric='minkowski', p=1).min() == cdist(vehicle_coords, animal_coords, metric='cityblock').min()
Out[14]:
And likewise for cdist vs. distance_matrix for rectilinear distance
In [15]:
# confirm same value for p=1 norm
distance_matrix(vehicle_coords, animal_coords, p=1).min() == cdist(vehicle_coords, animal_coords, metric='minkowski', p=1).min()
Out[15]:
Apart from the enhanced performance (at least for the data shapes tested above) of cdist
over distance_matrix
, cdist
also has access to a wide variety of distance metrics with a plethora of possible applications.
Consider the following ellipsoidal and circular data sets:
In [16]:
# ellipsoidal point set generation: http://stackoverflow.com/a/5529199/2942522
np.random.seed(255792)
width = 2.82
height = 0.36
phi = np.random.random((180)) * 2 * math.pi
rho = np.random.random((180))
x_ellipse = np.sqrt(rho) * np.cos(phi)
y_ellipse = np.sqrt(rho) * np.sin(phi)
x_ellipse *= width / 2.0
y_ellipse *= height / 2.0
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(x_ellipse, y_ellipse, c='blue', alpha=0.7, edgecolor='none', s=90)
ax.set_title('Ellipsoidal and Circular Point Sets')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_aspect('equal')
ax.set_xlim(-2,2)
ax.set_ylim(-2,2)
circle_radius = 1.5
x_circle = np.cos(phi) * circle_radius
y_circle = np.sin(phi) * circle_radius
ax.scatter(x_circle, y_circle, c='green', alpha=0.7, edgecolor='none', s=90)
fig.set_size_inches(8,8)
The Mahalanobis distance for each circular point (relative to the elliptical point set) will account for the "shape" of the ellipse and not just the mean of the elliptical points.
That said, it is somewhat confusing that cdist
will actually calculate the Mahalanobis distance between all possible pairs of points, when conceptually one would often just want to look at the distance of a point on the circle from the mean point of the elliptical data set. For clarity, we will perform the latter simplification here, calculating the inverse covariance matrix of the elliptical data set, but only focusing on the centroid of the ellipse for cdist
purposes. If your brain can handle the concept of the covariance matrix and ALL of the points in the ellipse, you can probably better leverage the power of cdist
with Mahalanobis than I!
Let's proceed with the simplified example though.
In [17]:
ellipse_data = np.dstack((x_ellipse, y_ellipse))[0]
circle_data = np.dstack((x_circle, y_circle))[0]
ellipse_centroid = np.average(ellipse_data, axis=0).reshape((1,2))
ellipse_centroid
Out[17]:
In [18]:
ellipse_inv_covariance = np.linalg.inv(np.cov(ellipse_data.T))
In [19]:
mahalanobis_distances = cdist(circle_data,
ellipse_centroid,
metric='mahalanobis',
VI=ellipse_inv_covariance)
mahalanobis_distances.shape
Out[19]:
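For reference, and as a minimal sketch only (the plotting below sticks with the centroid simplification), the full pairwise form mentioned above would compare every circle point against every ellipse point in a single call:
# full pairwise Mahalanobis comparison: one distance per (circle point, ellipse point) pair
full_pairwise = cdist(circle_data, ellipse_data, metric='mahalanobis', VI=ellipse_inv_covariance)
print(full_pairwise.shape)  # (180, 180)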
In [20]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(ellipse_centroid[...,0], ellipse_centroid[...,1], c='k', alpha=0.7, edgecolor='none', s=90)
ax.set_title('Ellipsoidal and Circular Point Sets')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_aspect('equal')
ax.set_xlim(-2,2)
ax.set_ylim(-2,2)
im = ax.scatter(circle_data[...,0], circle_data[...,1], c=mahalanobis_distances,
cmap='viridis', alpha=0.7, edgecolor='none', s=90)
fig.colorbar(im)
fig.set_size_inches(8,8)
So, the smaller Mahalanobis distances are near the long axis of the ellipse even though all Euclidean distances would be effectively identical. This is exactly what we'd expect after accounting for the covariance of the ellipse reference data.
Of course, my simplification to a single-point comparison is good for clarity but bad for emphasizing the power of cdist
--it is designed to perform this calculation for all possible combinations of points in the ellipse and circle!
To be fair, there is also some (ongoing) confusion about the cdist
Mahalanobis distance code for the core developers as well:
Development Note | Type of Change | Issue # | scipy Implementation Version |
---|---|---|---|
possible issues with the mahalanobis distance with default VI argument | unclear | 4815 | ongoing |
In [21]:
from scipy.spatial.distance import pdist
In [22]:
test_array = np.array([[0,0],
[0,1],
[0,2]])
The condensed distance matrix returned by pdist
for the above test_array
should have three values: the pairwise distances between rows 0 and 1, 0 and 2, and 1 and 2 (i.e., 1.0, 2.0, and 1.0, respectively).
Let's try it:
In [23]:
condensed_matrix = pdist(test_array)
condensed_matrix
Out[23]:
Although it may be relatively easy to track which distance value corresponds to which pair of points in this case, if test_array
were much larger, working with a condensed distance matrix may become confusing.
This is where scipy.spatial.distance.squareform
becomes useful:
In [24]:
from scipy.spatial.distance import squareform
redundant_matrix = squareform(condensed_matrix)
redundant_matrix
Out[24]:
Even though pdist
does not calculate the distance between a point and itself, squareform
converts the condensed distance matrix to the familiar redundant form that would also be returned by cdist
for the data set compared with itself:
In [25]:
cdist(test_array, test_array)
Out[25]:
It is worth noting that squareform
can perform the reverse conversion as well:
In [26]:
squareform(cdist(test_array, test_array)) # produce the same result as pdist by using squareform
Out[26]:
Development Note | Type of Change | Pull Request # | scipy Implementation Version |
---|---|---|---|
squareform no longer converts all input data to float64 | Enhancement | 6457 | 0.19 |
Imagine that you have experimental scientific data for the behavior of a protein (molecule) in the human body that causes a specific medical condition. Your suspicion is that this molecule becomes "compacted" / squished in the disease state. You want to know which two atoms are closest together in the normal vs. diseased state, to assess whether there really is evidence of compaction.
In [27]:
np.random.seed(387)
normal_coordinates = np.random.random((12000,3))
disease_coordinates = np.random.random((12000,3))
We could try a brute-force approach using cdist
to compare a data set against itself:
In [28]:
normal_dist_matrix_cdist = cdist(normal_coordinates, normal_coordinates)
disease_dist_matrix_cdist = cdist(disease_coordinates, disease_coordinates)
print('normal min:', normal_dist_matrix_cdist.min())
print('disease min:', disease_dist_matrix_cdist.min())
We actually need to filter out the self-comparisons when using cdist
in this manner. To do this, we effectively filter out the diagonal of the square distance matrix before calculating the minimum:
In [29]:
for name, cdist_array in zip(['normal', 'disease'], [normal_dist_matrix_cdist, disease_dist_matrix_cdist]):
mask = np.ones(cdist_array.shape, dtype=bool)
np.fill_diagonal(mask, 0)
print(name, ':', cdist_array[mask].min())
But, as you may have guessed, the redundant square matrices produced by cdist
can be quite large:
In [30]:
normal_dist_matrix_cdist.size, disease_dist_matrix_cdist.size
Out[30]:
It also just feels intuitively a bit strange to have to feed the same array of data to cdist
twice, and this is a strong indicator that pdist
may be a more efficient choice, especially for a large data set.
In [31]:
normal_dist_matrix_pdist = pdist(normal_coordinates)
disease_dist_matrix_pdist = pdist(disease_coordinates)
normal_dist_matrix_pdist.size, disease_dist_matrix_pdist.size
Out[31]:
In [32]:
normal_dist_matrix_cdist.size / normal_dist_matrix_pdist.size
Out[32]:
The number of floating point numbers stored by the result of pdist
is roughly half that stored by the return value of cdist
in this case. This could make a really big difference when working with 'big data' sets.
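As a quick sanity check on that ratio, the condensed result should contain exactly $n(n-1)/2$ entries for $n$ input points, versus $n^2$ for the redundant square form:
# pdist stores n * (n - 1) / 2 values; cdist of a data set against itself stores n ** 2
n = normal_coordinates.shape[0]
print(normal_dist_matrix_pdist.size == n * (n - 1) // 2)
print(normal_dist_matrix_cdist.size == n ** 2)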
Now, confirm that the results from the two approaches are simply transformations of the same data:
In [33]:
np.testing.assert_equal(squareform(disease_dist_matrix_cdist), disease_dist_matrix_pdist)
User-defined distance functions / metrics may also be used with pdist
and cdist
, but as it stands there is currently no way to feed any additionally required keyword arguments through either pdist
or cdist
. There are only a few supported additional keyword arguments, e.g., VI
for the Mahalanobis distance. There's some debate about adjusting the "interface" to Xdist
such that additional custom kwargs may be used (a minimal sketch of a callable metric follows the note below):
Development Note | Type of Change | Pull Request # | scipy Implementation Version |
---|---|---|---|
ENH: added args and kwargs to Xdist | Enhancement | 6974 | In Progress |
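As a minimal sketch of what a user-defined metric looks like today, the callable below (squared Euclidean dissimilarity, purely illustrative) receives two 1D vectors and returns a scalar; note that there is currently no clean way to pass extra keyword arguments through to it:
# a user-defined metric for cdist: a callable taking two 1D vectors and returning a scalar
def squared_euclidean(u, v):
    return np.sum((u - v) ** 2)
custom_distances = cdist(vehicle_coords, animal_coords, metric=squared_euclidean)
print(custom_distances.shape)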
For computing the distances between two 1D arrays scipy.spatial.distance
has many of the same functions as are available to the metric
arguments of pdist
and cdist
. However, it is important to note that the metric
for pdist
and cdist
should always be specified as the string
type argument (e.g., "euclidean") to use the optimized C-level implementation instead of providing the built-in 1D vector distance function name (e.g., scipy.spatial.distance.euclidean).
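A quick timing sketch of that point (not from the original notebook; reusing the vehicle / animal arrays from earlier, and exact numbers will vary by machine):
# the string metric dispatches to the C implementation; the callable falls back to a Python-level loop
%timeit cdist(vehicle_coords, animal_coords, metric='euclidean')
%timeit cdist(vehicle_coords, animal_coords, metric=scipy.spatial.distance.euclidean)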
Of course, in some cases one may actually wish to calculate the distance between two 1D vectors, for which the specific functions in scipy.spatial.distance are appropriate.
In [34]:
array_1 = np.random.random((3,))
array_2 = np.random.random((3,))
euclidean_dist = scipy.spatial.distance.euclidean(array_1, array_2)
euclidean_dist
Out[34]:
In [35]:
# but not appropriate for > 1D arrays
array_1 = np.random.random((2, 3))
array_2 = np.random.random((2, 3))
try:
euclidean_dist = scipy.spatial.distance.euclidean(array_1, array_2)
except ValueError:
print('ValueError raised')
In [36]:
array_1 = np.zeros((2,))
array_2 = np.array([4, 4])
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(array_1[0], array_1[1], s=200, label='array 1')
ax.scatter(array_2[0], array_2[1], s=200, label='array 2')
ax.legend(fontsize=16)
# add 'city blocks / steps' to plot
step_val = 0.5
current_pos = np.zeros((2,))
next_pos = np.zeros((2,))
x_val = 0
x_turn = 1
while x_val < 8:
if x_turn:
next_pos[0] += step_val
x_turn = 0
else:
next_pos[1] += step_val
x_turn = 1
ax.plot([current_pos[0], next_pos[0]], [current_pos[1], next_pos[1]], c='k')
current_pos[:] = next_pos[:]
x_val += step_val
ax.set_xlabel('x', fontsize=16)
ax.set_ylabel('y', fontsize=16)
ax.set_title('Simple Cityblock Distance', fontsize=16)
fig.set_size_inches(6,6)
In [37]:
cityblock_dist = scipy.spatial.distance.cityblock(array_1, array_2)
cityblock_dist
Out[37]:
Note that the cityblock distance is equal to the sum of the x and y steps, and the value would be the same if the route were taken as one large block (a single step in each direction).
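A one-line check of that equivalence for the arrays defined above:
# the cityblock distance equals the summed absolute coordinate differences
assert cityblock_dist == np.abs(array_2 - array_1).sum()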
In [38]:
array_1 = np.zeros((2,))
array_2 = np.array([1,9])
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(array_1[0], array_1[1], s=250, label='array 1')
ax.scatter(array_2[0], array_2[1], s=250, label='array 2')
ax.legend(fontsize=16)
ax.set_xlim(-1,10)
ax.set_ylim(-1,10)
ax.set_xlabel('x', fontsize=16)
ax.set_ylabel('y', fontsize=16)
ax.set_aspect('equal')
ax.set_title('Weighted Minkowski Example', fontsize=16)
fig.set_size_inches(6,6)
We could pick any p-norm for this calculation, but for conceptual simplicity let's stick with the Euclidean L2 norm. For the conventional Euclidean distance we would expect d > 9
by simple visual inspection.
However, given the small difference in x
coordinate values, perhaps we can achieve d < 9
with the appropriate weighting.
In [39]:
weight_vector = [1, 0]
weighted_euclidean_dist = scipy.spatial.distance.wminkowski(array_1, array_2, p=2, w=weight_vector)
weighted_euclidean_dist
Out[39]:
In [40]:
weight_vector = [1, 0.1]
weighted_euclidean_dist = scipy.spatial.distance.wminkowski(array_1, array_2, p=2, w=weight_vector)
weighted_euclidean_dist
Out[40]:
In [41]:
weight_vector = [1, 1]
weighted_euclidean_dist = scipy.spatial.distance.wminkowski(array_1, array_2, p=2, w=weight_vector)
print('weighted_euclidean_dist with equal unit weights:', weighted_euclidean_dist)
assert weighted_euclidean_dist == scipy.spatial.distance.euclidean(array_1, array_2)
The shifting of the weighted distance matches nicely with intuition.
The Reed-Solomon code adds extra / redundant data bits to a data storage medium (e.g., a CD or DVD) to assist with error correction. For example, error correction may be possible in the event of a minor scratch on the surface of a CD or if the reading device 'skips.' While the precise implementation details are beyond the scope of this example, if we think of these extra error-correcting bits as "code words," imagine a case where the code word the encoder placed on the storage medium differs from the word recovered during decoding. We'd like a way to quantify the number of substitutions required to restore the original code word, which is an important metric in error correction.
In [42]:
original_code_word = np.array([0, 1, 1, 0, 1, 1, 0, 0], dtype=np.bool)
recovered_code_word = np.array([1, 1, 0, 0, 1, 1, 0, 1], dtype=np.bool)
from scipy.spatial.distance import hamming
minimum_substitutions = hamming(original_code_word, recovered_code_word)
minimum_substitutions
Out[42]:
Why isn't the Hamming distance an integer number corresponding to the number of positions that differ? Because we divide by the total number of "letters" (bits) in each "word" (byte). So, the scipy hamming
function is really the normalized Hamming distance.
In [43]:
np.argwhere(original_code_word != recovered_code_word).size / original_code_word.size
Out[43]:
Of course, the real-world implementation has more details and more bytes, so one might use, e.g., pdist with metric='hamming' for larger comparisons of arrays.
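A minimal sketch with made-up code words (the values below are purely illustrative): pdist compares every pair of rows at once, and multiplying by the word length recovers raw substitution counts from the normalized values:
# hypothetical stack of five 8-bit code words; each row is one word
code_words = np.random.randint(0, 2, size=(5, 8)).astype(bool)
pairwise_hamming = pdist(code_words, metric='hamming')            # normalized, condensed form
pairwise_substitutions = pairwise_hamming * code_words.shape[1]   # raw substitution counts
print(squareform(pairwise_substitutions))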
In [44]:
from scipy.spatial.distance import jaccard
from PIL import Image
im = Image.open('emperor-penguin-alpha-construction-site-sign.jpg')
im.size # pixel dimensions
Out[44]:
In [45]:
# define the reference / ground truth bounding box visually
x_start_ref = 350
x_end_ref = 550
y_start_ref = 600
y_end_ref = 900
box_actual = (x_start_ref, x_end_ref, y_start_ref, y_end_ref)
region = im.crop(box_actual)
region
Out[45]:
In [46]:
# define ground truth reference matrix as a 1D boolean array of pixel matches to the original image
def generate_bool_matrix(original_image, x_start, x_end, y_start, y_end):
matrix = np.zeros(original_image.size)
matrix[x_start:x_end, y_start:y_end] = 1
matrix = matrix.astype(np.bool, copy=False).ravel()
return matrix
reference_matrix = generate_bool_matrix(im, x_start_ref, x_end_ref, y_start_ref, y_end_ref)
# sanity check on reference_matrix properties
reference_matrix, reference_matrix.shape, reference_matrix.sum(), ((x_end_ref - x_start_ref) * (y_end_ref - y_start_ref) )
Out[46]:
Of course, the number of preserved pixels is the product of the pixel dimensions of the selection region.
Now, let's pretend we have some pixel bounding boxes selected by a variety of computer vision routines, and we want to identify the "best" approach.
In [47]:
fig = plt.figure()
# the pixel boundaries would typically come from some kind of computer vision routine
x_start_p1 = 0; x_start_p2 = 390; x_start_p3 = 490
x_end_p1 = 200; x_end_p2 = 590; x_end_p3 = 690
y_start_p1 = 600; y_start_p2 = 610; y_start_p3 = 660
y_end_p1 = 900; y_end_p2 = 910; y_end_p3 = 960
box_program_1 = (x_start_p1, y_start_p1, x_end_p1, y_end_p1)
box_program_2 = (x_start_p2, y_start_p2, x_end_p2, y_end_p2)
box_program_3 = (x_start_p3, y_start_p3, x_end_p3, y_end_p3)
plot_num = 1
for program_name, bounding_box in zip(['program 1', 'program 2', 'program 3'],
[box_program_1, box_program_2, box_program_3]):
ax = fig.add_subplot(1, 3, plot_num)
current_region = im.crop(bounding_box)
current_region_bool_matrix = generate_bool_matrix(im, bounding_box[0], bounding_box[2],
bounding_box[1], bounding_box[3])
jaccard_dist = jaccard(current_region_bool_matrix, reference_matrix)
ax.set_title('{program}\nJaccard distance:{dist}'.format(program=program_name,
dist=np.around(jaccard_dist, 4)),
fontsize=20)
ax.imshow(current_region)
ax.axis('off')
plot_num += 1
fig.subplots_adjust(wspace=0.01)
fig.set_size_inches(14,6)
So, larger Jaccard distances are farther away from the ground truth match. We've just used a relatively simple 1D scipy.spatial.distance
function to assess the output of computer vision routines. In practice, one would normally use the Jaccard index, $J_I = 1 - J_D$, so that the best match has the highest value, and values $> 0.5$ are typically considered decent matches.
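As a small sketch of that convention (the 0.5 cutoff is the rule of thumb just mentioned, not a scipy constant), converting the distance from the last candidate box in the loop above:
# convert the Jaccard distance into the Jaccard index and apply the 0.5 rule of thumb
jaccard_index = 1.0 - jaccard_dist   # jaccard_dist is still bound to the last box from the loop above
print('decent match' if jaccard_index > 0.5 else 'poor match', jaccard_index)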
It seems that these Boolean distance calculations have different strengths and weaknesses depending on the application domain. I haven't seen as many interesting examples with the Yule distance as with the last two, but let's see how it behaves for an example case anyway.
Let's consider a very crude way to compare a hand-written letter with computer-generated letters--looking at the positions of 'dark pixels.'
In [48]:
im_handwritten_T = Image.open('hand_written_t.jpg')
print('im_handwritten_T.size:', im_handwritten_T.size)
print('im_handwritten_T.getbands:', im_handwritten_T.getbands())
im_handwritten_T
Out[48]:
In [49]:
im_computer_T = Image.open('computer_T.jpg')
print('im_computer_T.size:', im_computer_T.size)
print('im_computer_T.getbands:', im_computer_T.getbands())
im_computer_T
Out[49]:
In [50]:
im_computer_S = Image.open('computer_S.jpg')
print('im_computer_S.size:', im_computer_S.size)
print('im_computer_S.getbands:', im_computer_S.getbands())
im_computer_S
Out[50]:
In [51]:
handwritten_T_array = np.asarray(im_handwritten_T)
computer_T_array = np.asarray(im_computer_T)
computer_S_array = np.asarray(im_computer_S)
computer_S_array
Out[51]:
In [52]:
# generate and compare boolean arrays based on pixel darkness
from scipy.spatial.distance import yule
hand_written_boolean = handwritten_T_array.ravel() < 250
for condition, pixel_array in zip(['computer S', 'computer T'], [computer_S_array, computer_T_array]):
computer_boolean = pixel_array.ravel() < 250
yule_dist = yule(computer_boolean, hand_written_boolean)
print(condition, 'yule distance:', yule_dist)
The Yule dissimilarity is bounded by 0 and 2 for boolean data sets, and we can see that the hand-written "T" is closer to the computer-generated "T", even with this crude measure. Far more sophisticated use of the Yule distance in handwriting recognition has been described in the literature--see, for example: Zhang and Srihari (2003) Proceedings of SPIE-IS&T Electronic Imaging 5010.
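A quick check of those bounds with toy boolean vectors (not part of the handwriting example):
# identical boolean vectors give a Yule dissimilarity of 0; complementary vectors give 2
u = np.array([True, True, False, False])
print(yule(u, u))    # 0.0
print(yule(u, ~u))   # 2.0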
The basic idea behind BSP is to split a space into a set of cells using so-called hyperplanes. In 3D, this means using 2D planes to progressively divide the space into cells containing points, with the data structure typically organized as a "tree" that allows for, e.g., rapid nearest-neighbor searching.
Some of the primary applications of BSP include graphics rendering / hidden-surface removal (the Painter's algorithm used by early game engines), collision detection, ray tracing, and rapid spatial queries such as nearest-neighbor searches.
In [53]:
# write Python code to mirror the static example at https://en.wikipedia.org/wiki/Binary_space_partitioning#Generation
import networkx as nx
list_lines = [np.array([[-4,3],[-1,-2]]),
np.array([[-1,2],[-1,-1]]),
np.array([[-1,0],[2,0]]),
np.array([[3.5,3], [2.5,-1]])]
line_labels = ['D', 'B', 'A', 'C']
fig = plt.figure()
fig.subplots_adjust(wspace=0.1, right=1.0)
fig.set_size_inches(6,14)
def recursive_BSP(list_lines, line_labels, current_node_index=2, fig=fig, curr_plot=1):
'''A recursive algorithm that crudely generates a BSP tree from a list of lines. Would enable use of i.e.,
the Painter's algorithm in computer graphics, though in practice there are more details to handle.'''
G = nx.Graph()
G.clear()
ax = fig.add_subplot(7,3,curr_plot)
for line, label in zip(list_lines, line_labels):
ax.plot(line[...,0],
line[...,1],
color='k')
ax.annotate(label, xy=line[0], xytext=line[0] + 0.2)
ax.set_aspect('equal')
ax.set_xlim(-5,5)
ax.set_ylim(-5,5)
current_line = list_lines[current_node_index]
current_line_label = line_labels[current_node_index]
# I want to plot a dashed line for the current segment (node) being processed
x_max_system = max([line_array[...,0].max() for line_array in list_lines])
x_min_system = min([line_array[...,0].min() for line_array in list_lines])
y_max_system = max([line_array[...,1].max() for line_array in list_lines])
y_min_system = min([line_array[...,1].min() for line_array in list_lines])
xdiff = current_line[0][0] - current_line[1][0]
ydiff = current_line[0][1] - current_line[1][1]
if xdiff==0:
# vertical line
current_line_extended = np.array([[current_line[0][0],y_min_system],[current_line[1][0],y_max_system]])
elif ydiff==0:
# horizontal line
current_line_extended = np.array([[x_min_system, current_line[0][1]],[x_max_system, current_line[1][1]]])
else:
# other slopes
m = ydiff / xdiff
b = current_line[0][1] - m * current_line[0][0]
current_line_extended = np.array([[x_min_system, m * x_min_system + b],[x_max_system, m * x_max_system + b]])
# determine which (if any) segments are split (intersected) by the line formed by the chosen segment
def line_intersection(line1, line2):
# see: http://stackoverflow.com/a/20677983/2942522
xdiff = (line1[0][0] - line1[1][0], line2[0][0] - line2[1][0])
ydiff = (line1[0][1] - line1[1][1], line2[0][1] - line2[1][1])
def det(a, b):
return a[0] * b[1] - a[1] * b[0]
div = det(xdiff, ydiff)
if div == 0:
raise Exception('lines do not intersect')
d = (det(*line1), det(*line2))
x = det(d, xdiff) / div
y = det(d, ydiff) / div
return x, y
curr_plot += 1
ax = fig.add_subplot(7,3,curr_plot)
ax.set_aspect('equal')
list_back_lines = []
list_front_lines = []
list_back_line_labels = []
list_front_line_labels = []
new_list_lines = []
new_line_labels = []
for index, line in enumerate(list_lines):
if index == current_node_index:
continue
else:
intersection = line_intersection(line, current_line)
# the intersection criterion also needs to check for segment overlap
# with line
line_x_min = line[...,0].min()
line_x_max = line[...,0].max()
line_y_min = line[...,1].min()
line_y_max = line[...,1].max()
if not (line_x_max >= current_line_extended[...,0].min() and line_x_max <= current_line_extended[...,0].max()):
continue
if not line_y_min <= current_line_extended[...,1].min():
continue
#print('intersection:', intersection)
ax.scatter(intersection[0], intersection[1], c='red')
new_segment_1 = np.array([line[0],intersection])
new_segment_2 = np.array([intersection,line[1]])
label = line_labels[index]
for label_suffix, segment, color in zip(['1','2'],
[new_segment_1, new_segment_2],
['blue','red']):
ax.plot(segment[...,0],
segment[...,1],
color=color)
#print('appending segment:', segment)
new_list_lines.append(segment)
new_line_labels.append(label + label_suffix)
ax.annotate(label + label_suffix, xy=segment[0])
ax.set_xlim(-5,5)
ax.set_ylim(-5,5)
if len(new_list_lines) == 0:
new_list_lines = list_lines[:]
new_line_labels = line_labels[:]
if abs(xdiff) > abs(ydiff):
# split over y
index_val = 1
else:
# split over x
index_val = 0
for line, label in zip(new_list_lines, new_line_labels):
if label == current_line_label:
continue
line_first_val = line[0][index_val]
if line_first_val > current_line_extended[0][index_val]:
list_back_lines.append(line)
list_back_line_labels.append(label)
color='blue'
else:
list_front_lines.append(line)
list_front_line_labels.append(label)
color='red'
ax.plot(line[...,0],
line[...,1],
color=color)
ax.annotate(label, xy=line[0])
ax.plot(current_line_extended[...,0],
current_line_extended[...,1],
ls='dashed',
c='k')
curr_plot += 1
ax = fig.add_subplot(7,3,curr_plot)
ax.set_aspect('equal')
ax.set_xlim(-4,4)
ax.set_ylim(-4,4)
pos = {}
G.add_node(current_line_label)
pos[current_line_label] = [0,0]
front_node_labels = '\n'.join(list_front_line_labels)
back_node_labels = '\n'.join(list_back_line_labels)
G.add_node(front_node_labels)
G.add_node(back_node_labels)
pos[front_node_labels] = [2,-2]
pos[back_node_labels] = [-2,-2]
nodelist=[current_line_label,
front_node_labels,
back_node_labels]
if len(list_front_line_labels) == 0 and len(list_back_line_labels) == 0:
nodelist=nodelist[:1]
else:
G.add_edges_from([(current_line_label, front_node_labels),
(current_line_label, back_node_labels)])
colors = ['black','red', 'blue']
nx.draw_networkx(G,
pos=pos,
ax=ax,
nodelist=nodelist,
node_size=3000,
node_color=colors,
linewidths=0,
alpha=0.4,
fontsize=12)
curr_plot += 1
if list_front_line_labels:
values = [list_front_lines, list_front_line_labels]
recursive_BSP(list_lines=values[0],
line_labels=values[1],
current_node_index=0,
curr_plot=curr_plot)
if list_back_line_labels:
curr_plot += 3
values = [list_back_lines, list_back_line_labels]
recursive_BSP(list_lines=values[0],
line_labels=values[1],
current_node_index=0,
curr_plot=curr_plot)
recursive_BSP(list_lines, line_labels)
Although my algorithm is rather crude, and the criteria for placing the splitting hyperplanes and for terminating the tree-building process vary depending on the specific application, this is a Pythonic example representative of the general strategy used in many graphics rendering applications (e.g., early game engines) along with the Painter's algorithm. It provides an efficient way to store and rapidly retrieve information about polygons, though the actual construction of a BSP tree can be time consuming.
Note that the final number of lines (polygons) is greater than at the start, and that the choice of starting node can affect the structure of the tree--controlling polygon population and tree 'balance' can be important in some applications. A BSP tree may be traversed (searched) rapidly--in linear time.
The focus of the next section (KDTrees) is simply a specialized application of BSP Trees, often used for looking at large sets of points instead of large sets of polygons.
Informal Definition: k-dimensional trees are a specialized case of BSP trees, where every node is (usually) a k-dimensional point. The hyperplanes for splitting up the k-dimensional space are chosen perpendicular to the various axes of the space.
There are many different ways to construct k-d trees--it is common for the build process to alternate the splitting planes as you move down the tree: first an x-aligned plane, then y, then z, then back to x, etc. We will focus on the application of the code implemented in scipy.spatial.KDTree.
k-d trees are particularly appropriate for nearest-neighbor searches, where a large portion of the search space can often be rapidly eliminated thanks to the binary split structure of the tree. However, k-d trees suffer from the curse of dimensionality.
In [54]:
np.random.seed(20)
vehicle_coords = np.random.random((500,2))
np.random.seed(21701)
obstacle_coords = np.random.random((775000,2))
fig = plt.figure()
ax = fig.add_subplot(111)
for label, coords, size, alpha in zip(['vehicles', 'obstacles'],
[vehicle_coords, obstacle_coords],
[500,60],
[1.0,0.01]):
ax.scatter(coords[...,0],
coords[...,1],
label=label,
alpha=alpha,
s=size)
ax.set_aspect('equal')
ax.legend()
fig.set_size_inches(12,12)
In [55]:
%%time
# find the indices of the 20 closest points using brute force
dist_matrix = cdist(vehicle_coords, obstacle_coords)
for vehicle_matrix in dist_matrix:
closest_obstacle_indices = np.argsort(vehicle_matrix)[:20]
In [56]:
%%time
# find the indices of the 20 closest points using KDTree (pure Python)
tree = scipy.spatial.KDTree(obstacle_coords)
closest_obstacle_indices = tree.query(x=vehicle_coords,
k=20)[1]
In [57]:
%%time
# find the indices of the 20 closest points using cKDTree
tree = scipy.spatial.cKDTree(obstacle_coords)
closest_obstacle_indices = tree.query(x=vehicle_coords,
k=20)[1]
In [58]:
%%time
# find the indices of the 20 closest points using cKDTree + approximation
tree = scipy.spatial.cKDTree(obstacle_coords)
closest_obstacle_indices = tree.query(x=vehicle_coords,
k=20,
eps=0.5)[1] # (1 + eps) * true_distance threshold
Note that for a small query (e.g., a single vehicle), the results can favor the brute-force approach--it really depends on the size of the query you are trying to do. In practice, when working with huge data sets, I sometimes find that a KDTree is the only way to find nearest neighbors without having the space complexity exceed the physical memory of the system I'm using.
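Beyond k-nearest queries, a common related pattern is a fixed-radius search; a minimal sketch using the cKDTree built above (the 0.05 radius is an arbitrary choice for illustration):
# indices of all obstacle points within a fixed radius of each vehicle point
nearby_obstacles = tree.query_ball_point(vehicle_coords, r=0.05)
print(len(nearby_obstacles), len(nearby_obstacles[0]))  # one index list per vehicle point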
Convex Regions: a region is convex if any two points of the region are visible to one another within the region. The convex hull is the smallest convex region containing the point set S.
Convex Hull by Example -- Back to Kindergarten Geoboards
Calculate it yourself with the scipy.spatial
library
In [59]:
#a simple example of the convex hull for a random set of points in the plane
random_2D_array = np.random.random_sample((35,2))
hull = scipy.spatial.ConvexHull(random_2D_array)
sample_hull_plot = scipy.spatial.convex_hull_plot_2d(hull)
Prerequisites: General Position
A set of points (or other geometric objects) is said to be in general position if it avoids troublesome configurations, known as degenerate situations. The precise definition of general position varies depending on the algorithm. A set of n points chosen at random (from a continuous distribution) is in general position with probability 1, but this is (of course) not guaranteed for real-world data.
In [60]:
#produce an example of a degenerate point set:
fig_degenerate = plt.figure()
degenerate_vertices = np.array([[0,0],
[1,0],
[2,0],
[3,0],
[3,1],
[2,1],
[1,1],
[1.5,0.6],
[0.4,0.3]])
ax = fig_degenerate.add_subplot(111)
ax.scatter(degenerate_vertices[...,0], degenerate_vertices[...,1], c='k', s=70)
ax.set_title('A degenerate point set')
Out[60]:
In [61]:
#try calculating the convex hull of the above point set
hull = scipy.spatial.ConvexHull(degenerate_vertices, qhull_options='Qc')
plot = scipy.spatial.convex_hull_plot_2d(hull)
In [62]:
hull.coplanar #the above computation is possible because Qhull is able to ignore a subset of the pathological input points (their indices and neighbour indices are only computed with the additional 'Qc' option sent to Qhull)
Out[62]:
In this case the algorithm actually handles the potentially pathological collinear data, but it is important to note that this kind of data could typically cause issues.
Complexity analysis captures the speed of an algorithm as a function of input data size using big-O notation.
For a specific algorithm and an input of size $n$, the running time is written as $O(f(n))$, meaning that for sufficiently large $n$ it is bounded above by $cf(n)$ for some constant $c>0$. In other words, we ignore small values of $n$ (and constant factors) and focus on the asymptotic 'worst-case' behavior.
Selected Examples:
big-Oh notation | name | Example |
---|---|---|
$O(1)$ | Constant | Adding two numbers |
$O(n \textrm{ log } n)$ | loglinear | Sorting a list |
$O(n^2)$ | Quadratic | Incremental convex hull algorithm |
$O(n^k)$ | Polynomial | Robot Arm Motion Planning |
$O(c^n)$ | Exponential | Some Brute Force Algorithms |
We focus on time complexity, but there are also cases where memory usage (space complexity) is critical for an algorithm.
In [63]:
#we will use an empirical approach, although an expert could study the algorithm / source code and identify the 'rate-limiting' step based on the type of operations performed
def linear(n, m, c):
return m * n + c
def loglinear(n, m, c):
return m * (n * np.log(n)) + c
def quadratic(n, m, c):
return m * (n ** 2) + c
points_list = [1000,20000,30000,50000,70000,100000,200000,300000,500000,700000,900000,1000000]
list_times = []
for num_points in points_list:
random_2D_points = np.random.random_sample((num_points,2))
start_time = time.time()
hull = scipy.spatial.ConvexHull(random_2D_points)
elapsed_time = time.time() - start_time
list_times.append(elapsed_time)
print('benchmarked', num_points, 'points in:', elapsed_time, ' seconds')
In [64]:
popt, pcov = scipy.optimize.curve_fit(linear, points_list, list_times)
linear_y_data = linear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(loglinear, points_list, list_times)
loglinear_y_data = loglinear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(quadratic, points_list, list_times)
quadratic_y_data = quadratic(np.array(points_list), popt[0], popt[1])
fig_bench_hull = plt.figure()
ax = fig_bench_hull.add_subplot(111)
ax.scatter(points_list, list_times, c='k', label='original data', s = 80)
ax.plot(points_list, list_times, c='k', label='original data')
ax.plot(points_list, linear_y_data, c = 'blue', lw=7,alpha = 0.3, label = 'linear')
ax.plot(points_list, loglinear_y_data, c = 'red', lw=7,alpha = 0.3, label = 'loglinear')
ax.plot(points_list, quadratic_y_data, c = 'green', lw=7,alpha = 0.3, label = 'quadratic')
ax.legend(loc=2)
ax.set_title('Crude Time Complexity Assessment\nFor Qhull 2D Convex Hull')
ax.set_xlim(-50,1.2e+6)
ax.set_ylim(-0.02,0.20)
ax.set_xlabel('# Points',fontsize=16)
ax.set_ylabel('Time (s)', fontsize=16)
fig_bench_hull.set_size_inches(8,8)
Qhull implements the Quickhull algorithm for convex hull [Barber et al. '96]. It has output-sensitive performance that can slightly improve on loglinear in some cases.
While the above Qhull algorithm is highly specialized / optimized and works well across several dimensions, a more straightforward summary of general Convex Hull algorithm performance is shown below:
Paradigm | 2D Complexity | 3D Complexity | Notes |
---|---|---|---|
Incremental | $O(n^2)$ | $O(n^2)$ | often implemented because of conceptual simplicity |
Gift-wrapping | $O(nh)$ | $O(nf)$ | |
Graham scan | $O(n\:{\log}\:n)$ | N/A | |
divide-and-conquer (recursion) | $O(n\:{\log}\:n)$ | $O(n\:{\log}\:n)$ | 4D and up: $\Omega(n^{d/2})$ |
NO, not from an algorithmic standpoint: loglinear performance is the lower bound for comparison-based sorting, and sorting can be reduced to the convex hull problem by lifting the values onto a parabola--so a faster hull algorithm would imply a faster sort:
In [65]:
x = np.arange(12)
y = x ** 2
fig_hull = plt.figure()
ax = fig_hull.add_subplot(111)
ax.scatter(x,y,s=80,c ='k')
closed_parabola = np.zeros((x.shape[0] + 1,2))
closed_parabola[:-1,0] = x
closed_parabola[:-1,1] = y
closed_parabola[-1] = np.array([x[0], y[0]])  # close the polygon back at the first point
parabola = np.array(list(zip(x,y)))
p = Polygon(parabola, alpha = 0.2)
ax.add_patch(p)
Out[65]:
It turns out that, even if we didn't care about connectivity (i.e., only wanted the hull points without their order), the fastest possible performance is still loglinear. This was discovered relatively recently (1985) by Franco Preparata and Michael Shamos.
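A minimal sketch of that reduction (not part of the original argument's source): sort a set of random values by lifting them onto the parabola $y = x^2$, computing the 2D hull, and reading the vertices back in counterclockwise order starting from the smallest value:
# every lifted point is a hull vertex, and the CCW hull order traced from the leftmost
# point recovers the sorted order of the values
values = np.random.random(20)
lifted = np.column_stack((values, values ** 2))
lifted_hull = scipy.spatial.ConvexHull(lifted)
ccw = list(lifted_hull.vertices)
start = ccw.index(np.argmin(values))
sorted_indices = ccw[start:] + ccw[:start]
assert np.allclose(values[sorted_indices], np.sort(values))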
Problem: Let's say you have a startup that built a long-distance robot that you'd like to send over land from the far West of Oregon to its Eastern border. However, a sufficient number of Oregon residents have objected to this project such that a straight line through the State will not be possible. Instead, you decide to treat the entire area of Oregon as hazardous for the robot, planning to travel on the borders (or in the more permissive neighbouring States) only. Assuming flat terrain, find the minimum total distance that may be traveled by the robot from the Westernmost point of Oregon to its Easternmost point.
In [66]:
#determine the indices of the West / East limit points (which also fall on the convex hull, by definition):
Oregon_vertices = np.load('Oregon_vertices.npy', allow_pickle=False)
x_min_index = np.argmin(Oregon_vertices[...,0])
x_max_index = np.argmax(Oregon_vertices[...,0])
x_min_index, x_max_index
Out[66]:
In [67]:
#confirm positions of the West/ East limits in the context of the state and its convex hull
x_min_coord = Oregon_vertices[0]
x_max_coord = Oregon_vertices[137]
fig_Oregon_limits = plt.figure()
ax = fig_Oregon_limits.add_subplot(111)
ax.scatter(x_min_coord[0], x_min_coord[1], c='red', s = 300)
ax.scatter(x_max_coord[0], x_max_coord[1], c='red', s = 300)
hull = scipy.spatial.ConvexHull(Oregon_vertices)
hull_plot = scipy.spatial.convex_hull_plot_2d(hull, ax)
In [68]:
#plot and assess travel distance for the 2 possible border paths and 2 possible Convex Hull paths:
fig_Oregon_paths = plt.figure()
fig_Oregon_paths.set_size_inches(10,10)
ax_border_1 = fig_Oregon_paths.add_subplot(221)
ax_border_1.plot(Oregon_vertices[:138, 0], Oregon_vertices[:138, 1])
dist_1 = np.diag(scipy.spatial.distance_matrix(Oregon_vertices[:137],Oregon_vertices[1:138])).sum()
ax_border_1.set_title('Border path 1\n D={dist}'.format(dist=dist_1))
ax_border_2 = fig_Oregon_paths.add_subplot(222)
cycled_array = np.concatenate((Oregon_vertices,np.array([Oregon_vertices[0,...]])))
ax_border_2.plot(cycled_array[137:, 0], cycled_array[137:, 1])
dist_2 = np.diag(scipy.spatial.distance_matrix(cycled_array[137:-1],cycled_array[138:])).sum()
ax_border_2.set_title('Border path 2\n D={dist}'.format(dist=dist_2))
#note: in 2D scipy returns hull coords in CCW order
hull_coords = hull.points[hull.vertices]
hull_min_index = np.argmin(hull_coords[...,0])
hull_max_index = np.argmax(hull_coords[...,0])
ax_border_3 = fig_Oregon_paths.add_subplot(223)
ax_border_3.plot(hull_coords[hull_max_index:hull_min_index + 1,0],hull_coords[hull_max_index:hull_min_index + 1,1])
dist_3 = np.diag(scipy.spatial.distance_matrix(hull_coords[hull_max_index:hull_min_index],hull_coords[hull_max_index + 1:hull_min_index + 1])).sum()
ax_border_3.set_title('Hull path 1\n D={dist}'.format(dist=dist_3))
ax_border_4 = fig_Oregon_paths.add_subplot(224)
cycled_hull_coords = np.concatenate((hull_coords,hull_coords))
ax_border_4.plot(cycled_hull_coords[hull_min_index:hull_coords.shape[0] + 2,0], cycled_hull_coords[hull_min_index:hull_coords.shape[0] + 2,1])
dist_4 = np.diag(scipy.spatial.distance_matrix(cycled_hull_coords[hull_min_index:hull_coords.shape[0] + 1],cycled_hull_coords[hull_min_index + 1:hull_coords.shape[0] + 2])).sum()
ax_border_4.set_title('Hull path 2\n D={dist}'.format(dist=dist_4))
for axis in [ax_border_1, ax_border_2, ax_border_3,ax_border_4]:
axis.scatter(x_min_coord[0], x_min_coord[1], c='red', s = 300, alpha = 0.3)
axis.scatter(x_max_coord[0], x_max_coord[1], c='red', s = 300, alpha = 0.3)
axis.set_xlim(-128,-112)
axis.set_ylim(40,48)
So, clockwise Hull Path 1 is the shortest.
Problem: Estimate the surface area of a spherical influenza A virus based on my simulation coordinates.
In [69]:
#load and plot the data:
fig_flu = plt.figure()
fig_flu.set_size_inches(7,7)
flu_coords = np.load('flu_coords.npy', allow_pickle=False)
ax = fig_flu.add_subplot(111,projection = '3d')
ax.scatter(flu_coords[...,0], flu_coords[...,1], flu_coords[...,2])
ax.set_xlabel('x ($\AA$)')
ax.set_ylabel('y ($\AA$)')
ax.set_zlabel('z ($\AA$)')
ax.set_xlim3d(400,1200)
ax.set_ylim3d(400,1200)
ax.set_zlim3d(0,800)
ax.set_title('Flu Envelope Coordinates')
Out[69]:
In [70]:
#calculate the 3D convex hull, plot the facets (triangles) of the hull, and sum together their areas to estimate the overall area of the viral surface
fig_flu_hull = plt.figure()
ax = fig_flu_hull.add_subplot(111, projection = '3d')
flu_hull = scipy.spatial.ConvexHull(flu_coords)
hull_triangle_coords = flu_hull.points[flu_hull.simplices]
flu_triangles = Poly3DCollection(hull_triangle_coords, alpha = 0.1)
ax.add_collection3d(flu_triangles)
ax.set_xlabel('x ($\AA$)')
ax.set_ylabel('y ($\AA$)')
ax.set_zlabel('z ($\AA$)')
ax.set_xlim3d(400,1200)
ax.set_ylim3d(400,1200)
ax.set_zlim3d(0,800)
fig_flu_hull.set_size_inches(7,7)
ax.set_title('Flu Convex Hull')
print(hull_triangle_coords.shape)
print('surface area estimate from Convex Hull:', flu_hull.area)
#compare again surface area of roughly equivalent sphere
crude_radius = (flu_coords[...,2].max() - flu_coords[...,2].min()) / 2.
sphere_SA = 4. * math.pi * (crude_radius ** 2)
print('surface area of roughly equivalent sphere:', sphere_SA)
print('% reconstitution of sphere:', flu_hull.area / sphere_SA * 100)
Considering that the flu virion isn't a perfect sphere, that percent reconstitution of the sphere surface area indicates that the convex hull calculation provides a very good estimate.
Named after Boris Delaunay (Russian Mathematician)
A Delaunay triangulation is a triangulation that has only legal edges--edges that avoid creating unnecessarily small angles in triangles. If the full set of triangle angles of each candidate triangulation is sorted and the resulting sequences are compared lexicographically, the Delaunay triangulation is at least as large as any other triangulation of the same point set (any other triangulation encounters a smaller angle first).
General Position for Delaunay in 2D: no four points are cocircular
Textbook example of terrain reconstruction sensitivity to triangulation:
It is actually possible to construct a 2D Delaunay triangulation starting from any arbitrary triangulation by flipping one edge at a time (progressive illegal to legal edge flips), although this algorithm is slow ($O(n^2)$) and cannot be extended to 3D cases.
In [71]:
#none of the original data points in S can fall within the circumcircle of a triangle for a Delaunay triangulation (alternative and very important definition)
#demonstration using scipy.spatial once again:
fig_Delaunay_circles = plt.figure()
fig_Delaunay_circles.set_size_inches(8,8)
ax = fig_Delaunay_circles.add_subplot(111, aspect='equal')
random_points = np.random.random_sample((10,2))
tri = scipy.spatial.Delaunay(random_points)
ax.triplot(random_points[...,0], random_points[...,1], tri.simplices, 'go-')
#deal with calculating and plotting the circumcircles
circumcenters, circumradii = circumcircle.calc_circumcircle(tri.points[tri.simplices])
patches = []
for circumcenter_coordinates, circumradius in zip(circumcenters,circumradii):
patches.append(matplotlib.patches.Circle((circumcenter_coordinates[0],circumcenter_coordinates[1]),circumradius[0],fill=False, alpha=1.0, fc='none', ec='none',))
p = PatchCollection(patches, alpha=0.1,match_original = True)
ax.add_collection(p)
ax.set_title('Delaunay Empty Circle Property')
Out[71]:
Also note that connecting the circumcenters of the circumcircles would produce the Voronoi diagram (they are dual graphs) -- this can be a very useful property and indeed Delaunay was a student of Voronoi.
Problem: You're building a network with prohibitively expensive wiring but as long as there is a path from one computer to another the network will function properly. Given a set of 25 computers (nodes) with fixed positions (i.e., in an office), find the minimum amount of wiring and the appropriate connectivity for the network. The key here is to exploit the fact that the Euclidean minimum spanning tree (EMST) is a subgraph of the Delaunay triangulation of the point set.
In [72]:
computer_positions = np.random.random_sample((25,2))
tri = scipy.spatial.Delaunay(computer_positions)
#need undirected graph data in a format ready for scipy.sparse.csgraph.minimum_spanning_tree
#this will be an NxN symmetric matrix with Euclidean distances for direct connections in the triangulation, and 0 for "self" and other points that are not connected
#so we effectively want a special distance matrix, which means we need to know which points are connected, so we will likely have to use the simplices attribute of tri
triangle_indices = tri.simplices #has shape (n_triangles, 3)
#iterate through the triangle indices and populate a template distance matrix (default 0 for no connection)
undirected_graph = np.zeros((25,25))
for triangle_index_array in triangle_indices:
equivalent_coordinate_array = tri.points[triangle_index_array]
distance_array = scipy.spatial.distance.pdist(equivalent_coordinate_array)
undirected_graph[triangle_index_array[0],triangle_index_array[1]] = distance_array[0]
undirected_graph[triangle_index_array[0],triangle_index_array[2]] = distance_array[1]
undirected_graph[triangle_index_array[1],triangle_index_array[2]] = distance_array[2]
#sanity check: no point should be connected to itself (all diagonal values zero)
print('diagonal of undirected_graph:', np.diag(undirected_graph))
sparse_result = scipy.sparse.csgraph.minimum_spanning_tree(undirected_graph)
#iterate through sparse matrix, plotting and adding up distances
fig_emst = plt.figure()
fig_emst.set_size_inches(8,8)
ax = fig_emst.add_subplot(111,aspect='equal')
cx = sparse_result.tocoo() #convert to coordinate representation of matrix
label = 0
total_distance = 0
for i,j,v in zip(cx.row, cx.col, cx.data):
if v != 0: #there's an edge if nonzero
p1 = computer_positions[i]
p2 = computer_positions[j]
total_distance += v
if not label:
ax.plot([p1[0],p2[0]],[p1[1],p2[1]],c='green',ls='-',marker='o', label = 'EMST', alpha = 0.2, lw =12)
label += 1
else:
ax.plot([p1[0],p2[0]],[p1[1],p2[1]],c='green',ls='-',marker='o', alpha = 0.2, lw = 12)
ax.legend()
ax.set_title('EMST (D = {dist}) as subgraph of Delaunay Triangulation'.format(dist=total_distance))
#overlay the original triangulation for comparison
ax.triplot(computer_positions[...,0], computer_positions[...,1], triangle_indices, c = 'k',marker = 'o')
Out[72]:
In [73]:
#use a similar empirical approach to that used above for benchmarking the convex hull code
points_list = [1000,20000,30000,50000,70000,100000,200000,300000,500000,700000,900000,1000000]
list_times = []
for num_points in points_list:
random_2D_points = np.random.random_sample((num_points,2))
start_time = time.time()
tri = scipy.spatial.Delaunay(random_2D_points)
elapsed_time = time.time() - start_time
list_times.append(elapsed_time)
print('benchmarked', num_points, 'points in:', elapsed_time, ' seconds')
In [74]:
popt, pcov = scipy.optimize.curve_fit(linear, points_list, list_times)
linear_y_data = linear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(loglinear, points_list, list_times)
loglinear_y_data = loglinear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(quadratic, points_list, list_times)
quadratic_y_data = quadratic(np.array(points_list), popt[0], popt[1])
fig_bench_Delaunay = plt.figure()
ax = fig_bench_Delaunay.add_subplot(111)
ax.scatter(points_list, list_times, c='k', label='original data', s = 80)
ax.plot(points_list, list_times, c='k', label='original data')
ax.plot(points_list, linear_y_data, c = 'blue', lw=7,alpha = 0.3, label = 'linear')
ax.plot(points_list, loglinear_y_data, c = 'red', lw=7,alpha = 0.3, label = 'loglinear')
ax.plot(points_list, quadratic_y_data, c = 'green', lw=7,alpha = 0.3, label = 'quadratic')
ax.legend(loc=2)
ax.set_title('Crude Time Complexity Assessment\nFor Qhull 2D Delaunay Triangulation')
#ax.set_xlim(-50,1.2e+6)
#ax.set_ylim(-0.02,0.20)
ax.set_xlabel('# Points',fontsize=16)
ax.set_ylabel('Time (s)', fontsize=16)
fig_bench_Delaunay.set_size_inches(8,8)
So, scipy.spatial.Delaunay
appears to have loglinear performance, which is the optimum.
Voronoi diagrams were first seriously studied by Georgy Voronoi in 1908, but they are also known as Thiessen polygons and Dirichlet tessellations.
A Voronoi diagram defines Voronoi regions around each of the points in the original data set. The Voronoi region for a given point represents the portion of space in which all points not in the input data are closer to that point than to any other.
From a more practical standpoint, if you imagine a data set with all the grocery stores in your city as input points (generators), each grocery store will have its own Voronoi region, where all homes are closer to that grocery store than to any other. Applications span geography, marketing, network coverage, biology, robot motion planning, etc.
Crucially, the Voronoi diagram is the mathematical dual of the Delaunay Triangulation.
All Voronoi regions (i.e., polygons in 2D) are convex. This can be proven (although I won't do it!).
General position in 2D: no four generators are cocircular
Authors | Year | Citations | Paradigm | Performance |
---|---|---|---|---|
Shamos and Hoey | 1975 | 1101 | divide-and-conquer | $O(n\:{\log}\:n)$ |
Green and Sibson | 1978 | 861 | incremental | $O(n^2)$ |
Guibas and Stolfi | 1985 | 1506 | quad-edge data structure | $O(n\:{\log}\:n)$ |
Fortune | 1987 | 1402 | sweepline | $O(n\:{\log}\:n)$ |
Loglinear performance is known to be optimal.
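Tying this back to the grocery-store description above: determining which Voronoi region contains a query point is just a nearest-generator search, so a cKDTree over the generators answers it directly. A minimal sketch with made-up store and home coordinates:
# region membership = index of the nearest generator (store) for each query point (home)
store_coords = np.random.random((10, 2))
home_coords = np.random.random((5, 2))
_, region_indices = scipy.spatial.cKDTree(store_coords).query(home_coords)
print(region_indices)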
Problem: Determine which residential locations are closest to a given water pump during the cholera outbreak in London ~150 years ago. This is the classic John Snow example with modern mapping data obtained from Robin Wilson (see below). The sizes of the red points are proportional to the number of fatalities reported at that residential location.
In [75]:
#load data modified from Robin Wilson's (University of Southampton) blog
array_pump_coords = np.load('array_pump_data.npy')
array_cholera_data = np.load('array_cholera_data.npy')
#plot
figure_voronoi = plt.figure()
figure_voronoi.set_size_inches(8,8)
ax = figure_voronoi.add_subplot(111, aspect='equal')
vor = scipy.spatial.Voronoi(array_pump_coords)
voronoi_figure = scipy.spatial.voronoi_plot_2d(vor, ax = ax)
ax.scatter(array_cholera_data[...,0], array_cholera_data[...,1], s = array_cholera_data[...,2] * 8, c = 'red', edgecolor='none') #scale scatter point area by number of fatalities
ax.scatter(array_pump_coords[...,0], array_pump_coords[...,1], s = 60, c = 'blue', edgecolor='none') #default pump sizes a bit too small
Out[75]:
Problem: You're designing an autonomous drone that will fly in an urban location and will have to avoid obstacles that are both stationary (e.g., tall buildings) and dynamic (helicopter traffic, etc.). Demonstrate an approach to determining the safest routes (i.e., farthest from obstacles) for the drone in 3D space. Assume that using a random set of 3D points is a suitable simulation of the kinds of coordinate data the drone will receive in real time.
In [76]:
air_traffic_mess = np.random.random_sample((50,3))
#the edges of a 3D Voronoi diagram will be the farthest from the obstacle coordinates
vor = scipy.spatial.Voronoi(air_traffic_mess)
fig_drone_3d = plt.figure()
fig_drone_3d.set_size_inches(8,8)
ax = fig_drone_3d.add_subplot(111, projection = '3d')
for ridge_indices in vor.ridge_vertices:
voronoi_ridge_coords = vor.vertices[ridge_indices]
ax.plot(voronoi_ridge_coords[...,0], voronoi_ridge_coords[...,1], voronoi_ridge_coords[...,2], lw=2, c = 'green', alpha = 0.05)
vor_vertex_coords = vor.vertices
ax.scatter(air_traffic_mess[...,0], air_traffic_mess[...,1], air_traffic_mess[...,2], c= 'k', label='obstacles', edgecolor='none')
ax.scatter(vor_vertex_coords[...,0], vor_vertex_coords[...,1], vor_vertex_coords[...,2], c= 'orange', label='Voronoi vertices',edgecolors='white', marker = 'o', alpha = 0.9)
ax.legend()
ax.set_xlim3d(air_traffic_mess[...,0].min(), air_traffic_mess[...,0].max())
ax.set_ylim3d(air_traffic_mess[...,1].min(), air_traffic_mess[...,1].max())
ax.set_zlim3d(air_traffic_mess[...,2].min(), air_traffic_mess[...,2].max())
Out[76]:
We would want the drone to follow the green lines, although the data still needs to be cleaned up a bit. For example, we'd like to deal with the "edges at infinity" (outside the Voronoi diagram, usually denoted as "-1" in the data structures returned by scipy/ qhull) more gracefully.
At the moment it appears to be non-trivial to plot the polyhedra around each of the obstacles. This is probably because the points need to be in the correct order for each polyhedron and again because of the edges at infinity.
A nice take-home exercise would be to attempt to clean this data up and produce a really nice 3D Voronoi diagram. One user reported their frustration / progress with this process in the issue below, and it seems we will eventually try to make this easier.
Development Note | Type of Change | Issue # | scipy Implementation Version |
---|---|---|---|
Improve Voronoi (edge) data structure access (would, e.g., facilitate 3D plotting) | Enhancement | 7103 | upcoming? |
In [77]:
#use a similar empirical approach to that used above for benchmarking the convex hull code
points_list = [1000,20000,30000,50000,70000,100000,200000,300000,500000,700000,900000,1000000]
list_times = []
for num_points in points_list:
random_2D_points = np.random.random_sample((num_points,2))
start_time = time.time()
tri = scipy.spatial.Voronoi(random_2D_points)
elapsed_time = time.time() - start_time
list_times.append(elapsed_time)
print('benchmarked', num_points, 'points in:', elapsed_time, ' seconds')
In [78]:
popt, pcov = scipy.optimize.curve_fit(linear, points_list, list_times)
linear_y_data = linear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(loglinear, points_list, list_times)
loglinear_y_data = loglinear(np.array(points_list), popt[0], popt[1])
popt, pcov = scipy.optimize.curve_fit(quadratic, points_list, list_times)
quadratic_y_data = quadratic(np.array(points_list), popt[0], popt[1])
fig_bench_Voronoi = plt.figure()
ax = fig_bench_Voronoi.add_subplot(111)
ax.scatter(points_list, list_times, c='k', label='original data', s = 80)
ax.plot(points_list, list_times, c='k', label='original data')
ax.plot(points_list, linear_y_data, c = 'blue', lw=7,alpha = 0.3, label = 'linear')
ax.plot(points_list, loglinear_y_data, c = 'red', lw=7,alpha = 0.3, label = 'loglinear')
ax.plot(points_list, quadratic_y_data, c = 'green', lw=7,alpha = 0.3, label = 'quadratic')
ax.legend(loc=2)
ax.set_title('Crude Time Complexity Assessment\nFor Qhull 2D Voronoi Diagram')
#ax.set_xlim(-50,1.2e+6)
#ax.set_ylim(-0.02,0.20)
ax.set_xlabel('# Points',fontsize=16)
ax.set_ylabel('Time (s)', fontsize=16)
fig_bench_Voronoi.set_size_inches(8,8)
The loglinear (O(n log n)) performance is known to be optimal for planar Voronoi diagram construction, so we can't really improve on this from an algorithmic standpoint.
Voronoi diagrams can also be computed on the surface of a sphere. Applications include my work on spherical viruses, geography, astrophysics, MRI analysis, and many other fields that deal with roughly spherical objects. A concise example follows below. The implemented algorithm is getting closer to loglinear time complexity, although there may still be a few weaknesses to iron out.
The spherical Voronoi code wasn't available in a stable release of scipy when I gave the tutorial at PyCon last year!
Development Note | Type of Change | PR # | scipy Implementation Version |
---|---|---|---|
adding spherical Voronoi diagram algorithm to scipy.spatial | Enhancement | 5232 | `0.18.0` |
Nikolai Nowaczyk and I worked on that PR for quite some time. There have since been several more improvements to the SphericalVoronoi() code. For example, the space complexity of the algorithm was initially too great, capping tractable input sizes at around 200,000 points for the 0.18 implementation on a typical modern computer. Now we can handle at least 10 million points on a modern machine:
Development Note | Type of Change | PR # | scipy Implementation Version |
---|---|---|---|
BUG: SphericalVoronoi Space Complexity | Enhancement | 6828 | `0.19.0` |
It still takes a few minutes (~5 on my laptop) to handle 10 million generators, and we're probably still a fair way off from, e.g., CGAL-level performance. ConvexHull works well up to about 238 million points, for reference.
I also had time to improve the performance of the code that sorts the vertices of the spherical Voronoi regions (used for plotting, surface area calculations, etc.):
Development Note | Type of Change | PR # | scipy Implementation Version |
---|---|---|---|
ENH: cythonized spherical Voronoi region polygon vertex sorting | Enhancement | 6768 | `0.19.0` |
In [79]:
#simple example for usage of scipy.spatial.SphericalVoronoi using random points on the surface of a sphere (but could easily imagine these points representing i.e., airports or cities on the surface of the earth, etc.)
num_points = 600
def marsaglia(num_points):
'''generate random points on surface of sphere using Marsaglia's method (see: http://mathworld.wolfram.com/SpherePointPicking.html)'''
x1 = np.random.uniform(low=-1.0, high=1.0,size=num_points)
x2 = np.random.uniform(low=-1.0, high=1.0,size=num_points)
    #filter out points for which the sum of squares of the two random variates is >= 1
mask = np.where(x1 ** 2 + x2 ** 2 < 1)
x1 = x1[mask]
x2 = x2[mask]
x_coords = 2 * x1 * np.sqrt(1 - x1 ** 2 - x2 ** 2)
y_coords = 2 * x2 * np.sqrt(1 - x1 ** 2 - x2 ** 2)
z_coords = 1 - 2 * (x1 ** 2 + x2 ** 2)
return np.stack((x_coords,y_coords,z_coords), axis=-1)
points = marsaglia(num_points=num_points)
fig_spherical_Voronoi = plt.figure()
fig_spherical_Voronoi.set_size_inches(16,8)
ax1 = fig_spherical_Voronoi.add_subplot(121, projection = '3d')
ax1.scatter(points[...,0], points[...,1], points[...,2],c='blue')
ax1.set_title('Random Points on Sphere')
ax2 = fig_spherical_Voronoi.add_subplot(122, projection = '3d')
#calculate the Voronoi diagram on the surface of the sphere and plot the Voronoi region polygons
sv = scipy.spatial.SphericalVoronoi(points, radius=1)
sv.sort_vertices_of_regions() #generally needed for plotting / SA calculation
for region in sv.regions:
random_color = colors.rgb2hex(np.random.rand(3))
polygon = Poly3DCollection([sv.vertices[region]], alpha=1.0)
polygon.set_color(random_color)
ax2.add_collection3d(polygon)
ax2.set_title('Voronoi Regions')
for axis in [ax1,ax2]:
axis.set_xlim(-1.5,1.5)
axis.set_ylim(-1.5,1.5)
axis.set_zlim(-1.5,1.5)
Additional user feedback suggests that we may need to check input more carefully to avoid confusion. In particular, there's no special code for handling duplicate / degenerate points, and we could probably also do a better job of checking whether the user-provided generators reasonably match the radius they provide (or the default unit radius); a rough sketch of such checks follows the table below:
Development Note | Type of Change | PR # | scipy Implementation Version |
---|---|---|---|
ENH: SphericalVoronoi Input Handling | Enhancement | 7254 | In Progress |
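To make the table above concrete, here is a rough sketch of the kind of input checks being discussed -- this is emphatically not the scipy implementation, just a plain-numpy illustration with a hypothetical helper name:
#a rough sketch (not the scipy implementation) of the input checks discussed above
def check_spherical_voronoi_input(generators, radius=1.0):
    generators = np.asarray(generators, dtype=float)
    #do the generators actually sit on a sphere of the stated radius?
    norms = np.linalg.norm(generators, axis=1)
    if not np.allclose(norms, radius, rtol=1e-5):
        raise ValueError("some generators do not lie on a sphere of the given radius")
    #are any generators duplicated / degenerate?
    if np.unique(generators.round(decimals=7), axis=0).shape[0] < generators.shape[0]:
        raise ValueError("duplicate generators detected")

check_spherical_voronoi_input(points, radius=1.0) #the Marsaglia-generated points from above pass silently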
Not only have I had to perform spherical polygon surface area calculations extensively in my own work on viruses, but I frequently get emails from geographers and MRI scientists asking about this kind of calculation for polygons on the surface of a sphere. I recently wrote a proposal to incorporate calculations of this nature into scipy:
Development Note | Type of Change | Issue # | scipy Implementation Version |
---|---|---|---|
Proposal: Spherical Polygon Surface Area Calculation | Enhancement | 6069 | In Progress |
The best code I have for doing this at the moment was improved by Edd Edmondson (University of Portsmouth). Let's see if we get a sensible result from the sum of the surface areas of all polygons in the above example.
In [80]:
#draft of code to calculate spherical polygon surface area
def convert_cartesian_array_to_spherical_array(coord_array,angle_measure='radians'):
'''Take shape (N,3) cartesian coord_array and return an array of the same shape in spherical polar form (r, theta, phi). Based on StackOverflow response: http://stackoverflow.com/a/4116899
use radians for the angles by default, degrees if angle_measure == 'degrees' '''
spherical_coord_array = np.zeros(coord_array.shape)
xy = coord_array[...,0]**2 + coord_array[...,1]**2
spherical_coord_array[...,0] = np.sqrt(xy + coord_array[...,2]**2)
spherical_coord_array[...,1] = np.arctan2(coord_array[...,1], coord_array[...,0])
spherical_coord_array[...,2] = np.arccos(coord_array[...,2] / spherical_coord_array[...,0])
if angle_measure == 'degrees':
spherical_coord_array[...,1] = np.degrees(spherical_coord_array[...,1])
spherical_coord_array[...,2] = np.degrees(spherical_coord_array[...,2])
return spherical_coord_array
def convert_spherical_array_to_cartesian_array(spherical_coord_array,angle_measure='radians'):
'''Take shape (N,3) spherical_coord_array (r,theta,phi) and return an array of the same shape in cartesian coordinate form (x,y,z). Based on the equations provided at: http://en.wikipedia.org/wiki/List_of_common_coordinate_transformations#From_spherical_coordinates
use radians for the angles by default, degrees if angle_measure == 'degrees' '''
cartesian_coord_array = np.zeros(spherical_coord_array.shape)
#convert to radians if degrees are used in input (prior to Cartesian conversion process)
if angle_measure == 'degrees':
spherical_coord_array[...,1] = np.deg2rad(spherical_coord_array[...,1])
spherical_coord_array[...,2] = np.deg2rad(spherical_coord_array[...,2])
#now the conversion to Cartesian coords
cartesian_coord_array[...,0] = spherical_coord_array[...,0] * np.cos(spherical_coord_array[...,1]) * np.sin(spherical_coord_array[...,2])
cartesian_coord_array[...,1] = spherical_coord_array[...,0] * np.sin(spherical_coord_array[...,1]) * np.sin(spherical_coord_array[...,2])
cartesian_coord_array[...,2] = spherical_coord_array[...,0] * np.cos(spherical_coord_array[...,2])
return cartesian_coord_array
def calculate_haversine_distance_between_spherical_points(cartesian_array_1,cartesian_array_2,sphere_radius):
'''Calculate the haversine-based distance between two points on the surface of a sphere. Should be more accurate than the arc cosine strategy. See, for example: http://en.wikipedia.org/wiki/Haversine_formula'''
spherical_array_1 = convert_cartesian_array_to_spherical_array(cartesian_array_1)
spherical_array_2 = convert_cartesian_array_to_spherical_array(cartesian_array_2)
lambda_1 = spherical_array_1[1]
lambda_2 = spherical_array_2[1]
phi_1 = spherical_array_1[2]
phi_2 = spherical_array_2[2]
    #we rewrite the standard haversine slightly because longitude / latitude are not the same as spherical coordinates -- the polar angle phi differs from latitude by pi/2
spherical_distance = 2.0 * sphere_radius * math.asin(math.sqrt( ((1 - math.cos(phi_2-phi_1))/2.) + math.sin(phi_1) * math.sin(phi_2) * ( (1 - math.cos(lambda_2-lambda_1))/2.) ))
return spherical_distance
def calculate_surface_area_of_a_spherical_Voronoi_polygon(array_ordered_Voronoi_polygon_vertices,sphere_radius):
'''Calculate the surface area of a polygon on the surface of a sphere. Based on equation provided here: http://mathworld.wolfram.com/LHuiliersTheorem.html
Decompose into triangles, calculate excess for each'''
#have to convert to unit sphere before applying the formula
spherical_coordinates = convert_cartesian_array_to_spherical_array(array_ordered_Voronoi_polygon_vertices)
spherical_coordinates[...,0] = 1.0
array_ordered_Voronoi_polygon_vertices = convert_spherical_array_to_cartesian_array(spherical_coordinates)
n = array_ordered_Voronoi_polygon_vertices.shape[0]
#point we start from
root_point = array_ordered_Voronoi_polygon_vertices[0]
totalexcess = 0
#loop from 1 to n-2, with point 2 to n-1 as other vertex of triangle
# this could definitely be written more nicely
b_point = array_ordered_Voronoi_polygon_vertices[1]
root_b_dist = calculate_haversine_distance_between_spherical_points(root_point, b_point, 1.0)
for i in 1 + np.arange(n - 2):
a_point = b_point
b_point = array_ordered_Voronoi_polygon_vertices[i+1]
root_a_dist = root_b_dist
root_b_dist = calculate_haversine_distance_between_spherical_points(root_point, b_point, 1.0)
a_b_dist = calculate_haversine_distance_between_spherical_points(a_point, b_point, 1.0)
s = (root_a_dist + root_b_dist + a_b_dist) / 2.
totalexcess += 4 * math.atan(math.sqrt( math.tan(0.5 * s) * math.tan(0.5 * (s-root_a_dist)) * math.tan(0.5 * (s-root_b_dist)) * math.tan(0.5 * (s-a_b_dist))))
return totalexcess * (sphere_radius ** 2)
In [81]:
#sum the polygon areas from above and compare with theoretical surface area of unit sphere
calculated_SA = 0
for region in sv.regions:
SA = calculate_surface_area_of_a_spherical_Voronoi_polygon(sv.vertices[region], 1.0)
calculated_SA += SA
print('calculated_SA:', calculated_SA)
print('theoretical SA:', 4 * math.pi)
Well, that certainly looks like a sensible result! However, as you can see in the proposal above, there are probably a few mysteries left to solve (if only to determine the range of pathological inputs). Also, what if we wanted to avoid a Python for loop? Can we vectorize this code in numpy ufunc style if each polygon can have a different number of vertices? Usually ufuncs operate on 'rectangular' arrays.
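To partially answer the vectorization question, here is a hedged sketch (helper names are my own) that removes the inner Python loop over triangles by computing all the great-circle edge lengths of a polygon's fan triangulation at once with numpy; the chord-based arcsin expression is just an equivalent form of the haversine distance for unit vectors. The outer loop over polygons -- which is exactly where the ragged shapes bite -- remains.
#a hedged sketch: vectorize the L'Huilier triangle decomposition within each polygon
def great_circle_distances(a, b):
    '''great-circle distances between corresponding rows of two (M,3) unit-vector arrays,
    computed from the chord lengths (equivalent to the haversine formula on the unit sphere)'''
    chords = np.linalg.norm(a - b, axis=-1)
    return 2.0 * np.arcsin(np.clip(chords / 2.0, 0.0, 1.0))

def spherical_polygon_area_vectorized(ordered_vertices, sphere_radius=1.0):
    #project the ordered polygon vertices onto the unit sphere
    v = ordered_vertices / np.linalg.norm(ordered_vertices, axis=1, keepdims=True)
    a, b = v[1:-1], v[2:] #the two "moving" vertices of each fan triangle rooted at v[0]
    root = np.broadcast_to(v[0], a.shape)
    root_a = great_circle_distances(root, a)
    root_b = great_circle_distances(root, b)
    a_b = great_circle_distances(a, b)
    s = (root_a + root_b + a_b) / 2.0
    #L'Huilier's theorem for the spherical excess of each triangle, evaluated for all triangles at once
    excesses = 4.0 * np.arctan(np.sqrt(np.clip(np.tan(0.5 * s) * np.tan(0.5 * (s - root_a)) *
                                               np.tan(0.5 * (s - root_b)) * np.tan(0.5 * (s - a_b)),
                                               0.0, None)))
    return excesses.sum() * (sphere_radius ** 2)

#the outer Python loop over the (ragged) polygons remains
print('vectorized SA:', sum(spherical_polygon_area_vectorized(sv.vertices[region]) for region in sv.regions))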
In Greek mythology, Procrustes was a son of Poseidon, and a bandit who would invite visitors to stay the night. Unfortunately for the visitors, he would forcibly fit them to an iron bed, either stretching their limbs to fit the bed frame if they were too small or cutting off their legs if they were longer than the bed. The term 'procrustean' has been used in statistics and computer science in various ways to describe forcibly fitting one concept to another, often in a rather harsh / rigid way.
Here, we focus on the reference to Procrustes analysis in shape analysis.
In general, this means calculating the sum of the squared distances between the points representing two objects AFTER they have been superimposed by translating them to a common centroid, scaling them to a common size, and rotating one onto the other, so that only differences in shape remain:
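To make those steps concrete, here is a minimal sketch of the textbook superimposition (scipy.spatial.procrustes performs an equivalent sequence internally; this toy version is only for illustration and uses scipy.linalg.orthogonal_procrustes for the rotation step):
#a minimal sketch of the superimposition steps described above (use scipy.spatial.procrustes in practice)
from scipy.linalg import orthogonal_procrustes

def toy_procrustes(data1, data2):
    mtx1 = data1 - data1.mean(axis=0) #translate both shapes to a common (zero) centroid
    mtx2 = data2 - data2.mean(axis=0)
    mtx1 = mtx1 / np.linalg.norm(mtx1) #scale both to unit Frobenius norm
    mtx2 = mtx2 / np.linalg.norm(mtx2)
    R, s = orthogonal_procrustes(mtx2, mtx1) #best rotation (possibly with reflection) of shape 2 onto shape 1
    mtx2 = np.dot(mtx2, R) * s
    disparity = np.sum((mtx1 - mtx2) ** 2) #sum of squared distances between corresponding points
    return mtx1, mtx2, disparity
For the circle example that follows, toy_procrustes should give essentially the same disparity as scipy.spatial.procrustes.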
In [82]:
radius_circle_1 = 1.0
radius_circle_2 = 3.9
centroid_circle_1 = np.array([0.0, 0.0])
centroid_circle_2 = np.array([11.1, 4.2])
angles = np.random.random(50) * np.pi * 2
circle_1_x = (np.cos(angles) * radius_circle_1) + centroid_circle_1[0]
circle_1_y = (np.sin(angles) * radius_circle_1) + centroid_circle_1[1]
circle_2_x = (np.cos(angles) * radius_circle_2) + centroid_circle_2[0]
circle_2_y = (np.sin(angles) * radius_circle_2) + centroid_circle_2[1]
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(circle_1_x, circle_1_y, c='blue', label='circle 1')
ax.scatter(circle_2_x, circle_2_y, c='orange', label='circle 2')
ax.legend()
ax.set_aspect('equal')
ax.set_xlabel('x', fontsize=18)
ax.set_ylabel('y', fontsize=18)
fig.set_size_inches(10,6)
In [83]:
ref_matrix, fitted_matrix, disparity = scipy.spatial.procrustes(np.array(list(zip(circle_1_x, circle_1_y))),
np.array(list(zip(circle_2_x, circle_2_y))))
In [84]:
# let's start by plotting the fit result (shape comparison)
fig = plt.figure()
ax = fig.add_subplot(111)
for matrix, label, color, size in zip([ref_matrix, fitted_matrix], ['circle 1','circle 2'], ['blue', 'orange'], [200,30]):
ax.scatter(matrix[...,0],
matrix[...,1],
label=label,
color=color,
s=size,
alpha=0.7)
ax.legend()
ax.set_aspect('equal')
ax.set_xlabel('x', fontsize=18)
ax.set_ylabel('y', fontsize=18)
ax.set_title('Procrustes Superimposition of Two Circles', fontsize=25)
fig.set_size_inches(10,8)
Two perfect circles will always have the same shape--the full Procrustes analysis allows for translation, rotation, and scaling (and sometimes reflection) to condense the analysis down to a comparison of shape only. We would thus expect the disparity ("distance") between these shapes to be 0:
In [85]:
disparity, round(disparity)
Out[85]:
The disparity is effectively zero--scipy calculates this as the sum of the squared distances between corresponding points in the two superimposed shapes.
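Since the definition is that simple, we can verify it directly from the matrices returned above:
#verify the stated definition of the disparity
manual_disparity = np.sum((ref_matrix - fitted_matrix) ** 2)
print(manual_disparity, np.allclose(manual_disparity, disparity))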
In [86]:
# let's create some random data that might conceivably represent the shapes of some exotic bones discovered during
# a scientific expedition
# use 2D data to facilitate visualization
np.random.seed(589018)
bone_1_data = np.random.random((9,2)) * np.sqrt(27)
np.random.seed(11686)
bone_2_data = np.random.random((50,2)) * 2 + 5
fig = plt.figure()
ax = fig.add_subplot(111)
list_shapes = []
for data, label, color in zip([bone_1_data, bone_2_data],
['bone 1', 'bone 2'],
['blue', 'purple']):
hull = scipy.spatial.ConvexHull(data)
shape_outline = hull.points[hull.vertices]
list_shapes.append(shape_outline)
polygon = Polygon(shape_outline,
alpha=0.4,
facecolor=color)
ax.scatter(shape_outline[...,0],
shape_outline[...,1],
color=color,
label=label)
ax.add_patch(polygon)
ax.legend(fontsize=22)
ax.set_ylim(0,8)
ax.set_xlim(0,8)
ax.set_aspect('equal')
ax.set_xlabel('x',fontsize=20)
ax.set_ylabel('y',fontsize=20)
ax.set_title('Bone Shapes from Scientific Expedition',
fontsize=22)
fig.set_size_inches(9,9)
In [87]:
# perform the Procrustes analysis / superimposition of the two bone structures
ref_matrix, fitted_matrix, disparity = scipy.spatial.procrustes(list_shapes[0],
list_shapes[1])
In [88]:
# plot the superposed bone structures
fig = plt.figure()
ax = fig.add_subplot(111)
for matrix, label, color in zip([ref_matrix, fitted_matrix],
['bone 1', 'bone 2'],
['blue', 'purple']):
polygon = Polygon(matrix,
alpha=0.4,
facecolor=color)
ax.add_patch(polygon)
ax.scatter(matrix[...,0],
matrix[...,1],
color=color,
label=label)
ax.legend(fontsize=22)
ax.set_xlim(-0.5,0.5)
ax.set_ylim(-0.5,0.5)
fig.set_size_inches(9,9)
So this is a nice example of a case where the shapes aren't identical and we wouldn't expect the disparity to be ~0:
In [89]:
disparity
Out[89]:
Development Note | Type of Change | Pull Request # | scipy Implementation Version |
---|---|---|---|
Added scipy.spatial.distance.directed_hausdorff | Enhancement | 6649 | 0.19 |
In relatively simple terms, the directed Hausdorff distance from one data set to another is the largest of the distances from each point in the first set to its nearest point in the second set. This is far easier to understand with a simple example. Common applications include computer vision, path similarity analysis, and related analyses that compare shapes. It is worth noting that the directed Hausdorff distance is asymmetric; it is more common to report the actual metric, the general (symmetric) Hausdorff distance, which is the maximum of the directed Hausdorff distances in each direction.
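As a concrete (if naive) restatement of that definition, the directed distance can be read off a full pairwise distance matrix; this is fine for small inputs, though the scipy implementation used below is far more economical:
#naive illustration of the directed Hausdorff definition (small inputs only)
from scipy.spatial.distance import cdist
def naive_directed_hausdorff(u, v):
    '''largest of the distances from each point in u to its nearest point in v'''
    return cdist(u, v).min(axis=1).max()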
In [90]:
radius_circle_1 = 1.0
radius_circle_2 = 2.0
np.random.seed(776)
angles = np.random.random(70) * np.pi * 2
circle_1_x = np.cos(angles) * radius_circle_1
circle_1_y = np.sin(angles) * radius_circle_1
circle_2_x = np.cos(angles) * radius_circle_2
circle_2_y = np.sin(angles) * radius_circle_2
# add one outlier to one of the circles
circle_2_y[-1] = 3.0
circle_1_coords = np.array(list(zip(circle_1_x,
circle_1_y)))
circle_2_coords = np.array(list(zip(circle_2_x,
circle_2_y)))
fig = plt.figure()
ax = fig.add_subplot(111)
for circle_name, circle_coords in zip(['circle 1', 'circle 2'],
[circle_1_coords, circle_2_coords]):
ax.scatter(circle_coords[...,0],
circle_coords[...,1],
label=circle_name)
ax.legend(fontsize=18,
bbox_to_anchor=[1.0,1.0])
ax.set_aspect('equal')
ax.set_xlim(-3.5,3.5)
ax.set_ylim(-3.5,3.5)
fig.set_size_inches(8,8)
The minimum distances from points on circle 1 to circle 2 should all be pretty similar (~1), so the directed Hausdorff distance from circle 1 to circle 2 should be close to 1. Conversely, the single outlier point on circle 2 has quite a large minimum distance to any point on circle 1, so the maximum minimum distance (directed Hausdorff distance) from circle 2 to circle 1 should be greater.
In [91]:
# let's see if this is the case
from scipy.spatial.distance import directed_hausdorff
d_c1_c2, index_c1, index_c2 = directed_hausdorff(circle_1_coords, circle_2_coords)
#note: the second call overwrites index_c1 / index_c2 -- we only keep the C2 -> C1 Hausdorff pair for the plot below
d_c2_c1, index_c2, index_c1 = directed_hausdorff(circle_2_coords, circle_1_coords)
print('directed Hausdorff distance C1 -> C2:', d_c1_c2)
print('directed Hausdorff distance C2 -> C1:', d_c2_c1)
The indices that form the "Hausdorff pair" can be quite useful to identify for some applications (e.g., simulation work performed by the Beckstein lab at ASU). We'd expect the C2 -> C1 Hausdorff pair to involve the outlier and one of the "top points" on circle 1--let's verify this:
In [92]:
fig = plt.figure()
ax = fig.add_subplot(111)
for circle_name, circle_coords, pair_index in zip(['circle 1', 'circle 2'],
[circle_1_coords, circle_2_coords],
[index_c1, index_c2]):
ax.scatter(circle_coords[...,0],
circle_coords[...,1])
ax.scatter(circle_coords[pair_index][0],
circle_coords[pair_index][1],
s=300,
alpha=1.0,
marker = 'x',
c = 'k',
label='Hausdorff Pair {circle}'.format(circle=circle_name))
ax.legend(bbox_to_anchor=[1.0, 1.0])
ax.set_aspect('equal')
ax.set_xlim(-3.5,3.5)
ax.set_ylim(-3.5,3.5)
fig.set_size_inches(8,8)
And finally the general Hausdorff distance is the max of the two directed distances:
In [93]:
general_Hausdorff_dist = max(d_c1_c2, d_c2_c1)
print('general Hausdorff distance:', general_Hausdorff_dist)
I learned about the Hausdorff distance covered above and the related Frechet distance (proposed for incorporation into scipy) by reading a previously published manuscript from the Beckstein lab at ASU:
Seyler SL, Kumar A, Thorpe MF, Beckstein O (2015) Path Similarity Analysis: A Method for Quantifying Macromolecular Pathways. PLOS Computational Biology 11(10): e1004568. https://doi.org/10.1371/journal.pcbi.1004568
The Frechet distance is perhaps a bit harder to understand than the Hausdorff distance, and I will therefore borrow a few presentation materials from the authors above (with their permission).
The Frechet distance is sensitive to the ordering of the points in the two curves / shapes being compared, but the Hausdorff distance is not--the Frechet distance can therefore be a more appropriate measure in some cases.
Common applications include:
Efficient Frechet distance implementations may be a bit more challenging than Hausdorff--the scipy Hausdorff implementation used above (Taha and Hanbury, 2015) has an average run time of O(m), worst case O(m * o), while quadratic / polynomial time complexities are reported for discrete / continuous Frechet distances:
I did, however, stumble upon a publication reporting a subquadratic Frechet code:
Pankaj K. Agarwal, Rinat Ben Avraham, Haim Kaplan, and Micha Sharir (2014) Computing the Discrete Fréchet Distance in Subquadratic Time. SIAM J. Comput., 43: 429–449. DOI: http://dx.doi.org/10.1137/130920526
I asked Dr. Agarwal, the RJR Nabisco Professor of Computer Science and Mathematics at Duke University, about the availability of code implementing this algorithm (in any language) and he replied:
I am not aware of such a code, and I am not sure whether it's worth the effort for a variety of reasons. $n^2$ algorithm is so simple that it will perform significantly better even for polygonal chains with tens of thousands of vertices.
I'd still like to incorporate some Frechet implementation(s) into scipy, as it is a natural complement to the Hausdorff distance, though in many cases they are quite similar:
There are some other varieties of the Frechet distance, including for example:
I'd like to implement all of these in scipy! But, we'll see what I can actually manage!
Consider the case of autonomous vehicle testing. You may have a defined path and you want to check the fidelity of your autonomous vehicle's true path to the desired path. In practice, this kind of data analysis might span many cars using a number of GIS data sets, but let's focus on a single vehicle with a given target path using randomly-generated data for convenience.
In [94]:
num_time_points = 100
np.random.seed(111409)
reference_path_values = np.random.random((num_time_points, 2))
np.random.seed(231)
true_path_values = np.random.random((num_time_points, 2))
In [95]:
fig = plt.figure()
ax = fig.add_subplot(111)
for label, path_vals in zip(['reference path', 'actual path'],
[reference_path_values, true_path_values]):
ax.plot(path_vals[...,0],
path_vals[...,1],
label=label,
marker='o')
ax.legend(bbox_to_anchor=[1.0,1.0],
fontsize=20)
ax.set_aspect('equal')
ax.set_title('Autonomous Vehicle Path vs. Desired Reference Path',
fontsize=20)
fig.set_size_inches(10,10)
Ok, so maybe the autonomous vehicle isn't working that well yet! Let's try to quantify the difference between the paths using the general Hausdorff distance:
In [96]:
general_H_dist = max(directed_hausdorff(reference_path_values, true_path_values)[0],
directed_hausdorff(true_path_values, reference_path_values)[0])
general_H_dist
Out[96]:
Now, let's try with the discrete Frechet distance, which is actually sensitive to backtracking motions. Since there's currently no Frechet implementation in scipy, I'll borrow some code written by my colleagues at ASU for another Python library (MDAnalysis).
In [97]:
def sqnorm(v, axis=None):
# modified from original code written by Sean Seyler at Arizona State University
# and available in the open source Python MDAnalysis library
return np.sum(v*v, axis=axis)
def get_msd_matrix(P, Q, axis=None):
# modified from original code written by Sean Seyler at Arizona State University
# and available in the open source Python MDAnalysis library
return np.asarray([sqnorm(p - Q, axis=axis) for p in P])
def get_coord_axes(path):
# modified from original code written by Sean Seyler at Arizona State University
# and available in the open source Python MDAnalysis library
path_dimensions = len(path.shape)
if path_dimensions == 3:
N = path.shape[1]
axis = (1,2) # 1st axis: atoms, 2nd axis: x,y,z coords
    elif path_dimensions == 2:
        # the original code assumes each frame is a flattened set of 3N Cartesian coordinates
        # (could use mod to check that the total # of coords is divisible by 3); for plain
        # (frames, 2) xy paths like ours, N is just a constant (2/3), so the result below is
        # the discrete Frechet distance rescaled by a constant factor
        N = path.shape[1] / 3
        axis = (1,) # 1st axis: 3N structural coords (x1,y1,z1,...,xN,yN,zN)
else:
err_str = "Path must have 2 or 3 dimensions; the first dimensions (axis"\
+ " 0) must correspond to frames, axis 1 (and axis 2, if" \
+ " present) must contain atomic coordinates."
raise ValueError(err_str)
return N, axis
def discrete_frechet(P, Q):
# modified from original code written by Sean Seyler at Arizona State University
# and available in the open source Python MDAnalysis library
N, axis = get_coord_axes(P)
Np, Nq = len(P), len(Q)
d = get_msd_matrix(P, Q, axis=axis)
ca = -np.ones((Np, Nq))
def c(i, j):
"""Compute the coupling distance for two partial paths formed by *P* and
*Q*, where both begin at frame 0 and end (inclusive) at the respective
frame indices :math:`i-1` and :math:`j-1`. The partial path of *P* (*Q*)
up to frame *i* (*j*) is formed by the slicing ``P[0:i]`` (``Q[0:j]``).
:func:`c` is called recursively to compute the coupling distance
between the two full paths *P* and *Q* (i.e., the discrete Frechet
distance) in terms of coupling distances between their partial paths.
:Arguments:
*i*
int, partial path of *P* through final frame *i-1*
*j*
int, partial path of *Q* through final frame *j-1*
:Returns:
float, the coupling distance between partial paths ``P[0:i]`` and
``Q[0:j]``
"""
if ca[i,j] != -1 : return ca[i,j]
if i > 0:
if j > 0: ca[i,j] = max( min(c(i-1,j),c(i,j-1),c(i-1,j-1)), d[i,j] )
else: ca[i,j] = max( c(i-1,0), d[i,0] )
elif j > 0: ca[i,j] = max( c(0,j-1), d[0,j] )
else: ca[i,j] = d[0,0]
return ca[i,j]
return ( c(Np-1, Nq-1) / N )**0.5
In [98]:
discrete_frechet(reference_path_values,
true_path_values)
Out[98]:
In simple terms, the minimum leash length needed to walk both paths from start to finish (the discrete Frechet distance) is quite a bit larger than the Hausdorff distance, and is likely a better measure of the difference between the paths in this case because backtracking motion is involved.
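We can also sanity check the ordering-sensitivity claim made earlier: reversing one of the paths leaves the (set-based) Hausdorff distance unchanged, but will generally change the discrete Frechet distance.
#reversing a path should not change the Hausdorff distance, but generally changes the Frechet distance
reversed_path = np.ascontiguousarray(true_path_values[::-1])
reversed_H_dist = max(directed_hausdorff(reference_path_values, reversed_path)[0],
                      directed_hausdorff(reversed_path, reference_path_values)[0])
print('general Hausdorff distance (reversed path):', reversed_H_dist)
print('discrete Frechet distance (reversed path):', discrete_frechet(reference_path_values, reversed_path))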
Informal definition: a generalization of rectangles to higher dimensions
In practice, I don't think I've had to use hyperrectangles directly, but I've definitely had to use them indirectly (they are used a lot, e.g., by KDTree in scipy, and more broadly in fields that involve specialized search structures, including database theory).
So, let's explore a bit of what scipy has to offer for hyperrectangles.
In [99]:
# let's start with a simple unit cube, with coordinates provided as the set of Cartesian "intervals"
hyper_rect_1 = scipy.spatial.Rectangle(maxes=[1.0,1.0,1.0],
mins=[0.0,0.0,0.0])
hyper_rect_1
Out[99]:
In [100]:
# the volume of the unit cube should be 1 ** 3
hyper_rect_1.volume()
Out[100]:
Ok, so a method to find the volume of a cubic object isn't particularly impressive--let's see if we can find something more interesting.
In [101]:
child_rect_1, child_rect_2 = hyper_rect_1.split(d=0, split=0.5)
In [102]:
child_rect_1, child_rect_2
Out[102]:
In [103]:
# let's try plotting the resultant hyperrectangle split to see what happened (maybe I'm just bad at 3D visualization in my head!)
In [104]:
child_rect_1.mins, child_rect_1.maxes
Out[104]:
In [105]:
child_rect_2.mins, child_rect_2.maxes
Out[105]:
In [106]:
# use the Cartesian boundaries to build the cubic vertices manually (there's probably a faster way to do this programmatically)
child_rect_1_coords = np.array([[0, 0, 0],
[0.5, 0, 0],
[0, 0, 1],
[0.5, 0, 1],
[0, 1, 1],
[0.5, 1, 1],
[0.5, 1, 0],
[0, 1, 0]])
# the sibling hyperrectangle is shifted + 0.5 along the first axis
child_rect_2_coords = np.zeros((child_rect_1_coords.shape))
child_rect_2_coords[...,0] = child_rect_1_coords[...,0] + 0.5
child_rect_2_coords[...,1:] = child_rect_1_coords[...,1:]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for child_rect, label in zip([child_rect_1_coords, child_rect_2_coords],
['child 1', 'child 2']):
ax.scatter3D(child_rect[...,0],
child_rect[...,1],
child_rect[...,2],
s=200,
label=label)
ax.legend()
ax.set_aspect('equal')
ax.set_xlabel('x', fontsize=25)
ax.set_ylabel('y', fontsize=25)
ax.set_zlabel('z', fontsize=25)
fig.set_size_inches(9,9)
So, we clearly have a split of the hyperrectangle along the x-dimension (the central vertices for child 2 conceal the identical central vertices from child 1 because child 2 was plotted afterwards). It is not hard to imagine that this kind of operation would be useful, e.g., in the construction of a binary space partitioning scheme (such as KDTree); a toy sketch follows below.
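Here is a toy sketch (emphatically not scipy's KDTree code) of how repeated Rectangle.split calls with a cycling split axis carve the unit cube into a simple binary space partition:
#a toy sketch of a binary space partition via recursive, axis-cycling Rectangle splits
def recursive_split(rect, depth=0, max_depth=2):
    if depth == max_depth:
        return [rect]
    axis = depth % len(rect.maxes) #cycle x -> y -> z -> x ...
    midpoint = 0.5 * (rect.mins[axis] + rect.maxes[axis]) #split each box at its midpoint
    lesser, greater = rect.split(d=axis, split=midpoint)
    return (recursive_split(lesser, depth + 1, max_depth) +
            recursive_split(greater, depth + 1, max_depth))

leaves = recursive_split(hyper_rect_1)
print(len(leaves), 'leaf hyperrectangles; total volume:', sum(leaf.volume() for leaf in leaves))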
In [107]:
# we'd expect the volumes of the children to be halved (0.5):
child_rect_1.volume(), child_rect_2.volume()
Out[107]:
In [108]:
# we'd expect the minimum distance between the two child hyperrectangles to be 0 (they have overlapping vertices),
# and the maximum distance between vertices would intuitively be the diagonal with height 1, base sqrt(2):
child_rect_1.min_distance_rectangle(child_rect_2)
Out[108]:
In [109]:
child_rect_1.max_distance_rectangle(child_rect_2)
Out[109]:
In [110]:
np.sqrt(1 ** 2 + (np.sqrt(2) ** 2))
Out[110]:
My suspicion is that Rectangle was primarily implemented for its internal usage in the KDTree source code, but you never know--it may be useful to have available in some cases. There are also Rectangle methods for computing the minimum and maximum distances between the hyperrectangle and an input point (min_distance_point and max_distance_point); a quick illustration follows below.
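For example, with an arbitrary query point of my own choosing:
#distances between a hyperrectangle and a single query point
query_point = np.array([2.0, 0.5, 0.5])
print('min distance to child 1:', child_rect_1.min_distance_point(query_point))
print('max distance to child 1:', child_rect_1.max_distance_point(query_point))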