"big O" notation, $\mathcal{O}(N)$
In [2]:
from IPython.display import Image
from IPython.display import display
search_efficiency = Image(url='http://www.astroml.org/_images/fig_search_scaling_1.png',width=500)
display(search_efficiency)
unidimensional searches: also called "range" searches; these specify a single attribute, e.g. all objects smaller than x
hash tables: an array that maps input data to array indices via a hash function, e.g. sorting a data set into a pixelized grid
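As a rough sketch of the pixelized-grid idea (the names here are illustrative, not from any library): the "hash function" is just truncation of the scaled coordinate, so points in a given pixel can be retrieved without scanning the whole data set.

```python
import numpy as np

np.random.seed(0)
points = np.random.random(1000)  # 1-d data in [0, 1)

# hash each point into one of 10 equal-width bins: the "hash function"
# is integer truncation of the scaled coordinate
n_bins = 10
bin_index = (points * n_bins).astype(int)

# all points hashed to bin 3 (i.e. in [0.3, 0.4)), found without a full scan
in_bin_3 = points[bin_index == 3]
```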
In [3]:
import numpy as np
In [4]:
np.random.seed(0)
In [5]:
x = np.random.rand(int(1e7))  # 1E7 is a float; array sizes must be ints
In [6]:
%time x.sort() # time a single run
In [7]:
print(x)
In [8]:
np.searchsorted(x, 0.5)
Out[8]:
In [9]:
%timeit np.searchsorted(x, 0.3)
In [10]:
X = np.random.random((5,3))
In [11]:
np.set_printoptions(precision=2)
In [12]:
print(X)
In [13]:
i_sort = np.argsort(X[:, 0]) # sort by the first column
In [14]:
print(X[i_sort])
In [15]:
X.sort(0) # sort every column
In [18]:
print(X)
As an example, consider a nearest neighbor search among a series of particles.
Easy! Just loop!
In [19]:
import numpy as np
In [20]:
# file: easy_nearest_neighbor.py
import numpy as np

def easy_nn(X):
    N, D = X.shape
    neighbors = np.zeros(N, dtype=int)
    for i in range(N):
        # initialize closest distance to infinity
        j_closest = i
        d_closest = np.inf
        for j in range(N):
            # skip distance between a point and itself
            if i == j:
                continue
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2))
            if d < d_closest:
                j_closest = j
                d_closest = d
        neighbors[i] = j_closest
    return neighbors
In [21]:
np.random.seed(0)
In [22]:
X = np.random.random((10, 3))
In [23]:
X.shape
Out[23]:
In [24]:
%timeit easy_nn(X)
By the way, %timeit is an IPython magic function. There are several; useful ones include %run, %paste, and %cpaste (the latter two preserve indentation, which matters in Python).
To learn more about them, just type %magic at an IPython prompt.
In [25]:
X = np.random.random((1000, 3))
In [26]:
%timeit easy_nn(X)
Oops. Going from $N = 10$ to $N = 1000$ shows the $\mathcal{O}(N^2)$ scaling: the runtime grew by roughly $15 / 1.5\times10^{-3} \approx 10^4$ while $N$ increased by a factor of 100. This is clearly not sustainable for any appreciable $N$.
NumPy can do better using vectorized operations thanks to its smart implementation of arrays. I'm going to skip this in favor of trees, though; you can see the discussion on pp. 55-57.
One programming habit I've taken from MATLAB: the programmer should be ashamed of every for loop they use. Make liberal use of the reshape and tile methods, as well as broadcasting.
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
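To give the flavor of broadcasting, the double for loop in easy_nn above can be collapsed into a single broadcast expression. A minimal sketch (note the $\mathcal{O}(N^2)$ memory cost of the intermediate array, so this only helps for moderate $N$):

```python
import numpy as np

np.random.seed(0)
X = np.random.random((100, 3))

# (N, 1, 3) - (1, N, 3) broadcasts to an (N, N, 3) array of pairwise differences
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
dist = np.sqrt((diff ** 2).sum(-1))  # (N, N) pairwise distance matrix

# each point is its own nearest neighbor at distance zero, so mask the diagonal
np.fill_diagonal(dist, np.inf)
neighbors = dist.argmin(1)
```

This replaces the $\mathcal{O}(N^2)$ Python-level loop with the same arithmetic done in compiled NumPy code.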
There are a number of options for speeding up your code. Googling "writing fast python" or "writing fast matlab" will often turn up multiple hits with very useful information. As an example:
http://scipy-lectures.github.io/advanced/optimizing/
There's also the option of writing code in an intrinsically fast language (like C), compiling it, and then calling the compiled code from Python. There are many interfaces for this out there, e.g. Cython, Numba, etc.
Finally, there's the algorithmic choice. You can achieve significant speedup by tackling projects in a new way using data structures and techniques more appropriate to the problem.
Trees are what a computer scientist would call a "data structure": an organization of data that makes certain operations efficient.
Each tree has a node and children.
Credit: http://www.magickofthought.com/2010/06/powering-new-mental-areas/
Here is a URL with an introduction to trees, often the first non-linear data structure that computer science students encounter.
http://pages.cs.wisc.edu/~hasti/cs367-1/readings/Trees/index.html
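In its simplest form a tree is just a node holding a value plus links to child nodes, and operations on it are naturally recursive. A minimal sketch (illustrative names, not from any library):

```python
class Node:
    """A minimal tree node: a value plus a list of children."""
    def __init__(self, value):
        self.value = value
        self.children = []

# build a tiny tree: a root with two children and one grandchild
root = Node('root')
left, right = Node('left'), Node('right')
root.children = [left, right]
left.children = [Node('grandchild')]

def count_nodes(node):
    """Recursively count nodes by visiting each child subtree."""
    return 1 + sum(count_nodes(c) for c in node.children)
```

The tree classes below follow exactly this pattern: a constructor that recursively builds children, and methods that recursively walk them.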
In [37]:
%matplotlib inline
from matplotlib import pyplot as plt
In [38]:
from astroML.plotting import setup_text_plots
In [39]:
setup_text_plots(fontsize=8, usetex=True)
In [40]:
# We'll create a QuadTree class which will recursively subdivide the
# space into quadrants
class QuadTree:
    """Simple Quad-tree class"""

    # class initialization function
    def __init__(self, data, mins, maxs, depth=3):
        self.data = np.asarray(data)

        # data should be two-dimensional
        assert self.data.shape[1] == 2

        if mins is None:
            mins = data.min(0)
        if maxs is None:
            maxs = data.max(0)

        self.mins = np.asarray(mins)
        self.maxs = np.asarray(maxs)
        self.sizes = self.maxs - self.mins

        self.children = []

        mids = 0.5 * (self.mins + self.maxs)
        xmin, ymin = self.mins
        xmax, ymax = self.maxs
        xmid, ymid = mids

        if depth > 0:
            # split the data into four quadrants
            data_q1 = data[(data[:, 0] < mids[0])
                           & (data[:, 1] < mids[1])]
            data_q2 = data[(data[:, 0] < mids[0])
                           & (data[:, 1] >= mids[1])]
            data_q3 = data[(data[:, 0] >= mids[0])
                           & (data[:, 1] < mids[1])]
            data_q4 = data[(data[:, 0] >= mids[0])
                           & (data[:, 1] >= mids[1])]

            # recursively build a quad tree on each quadrant which has data
            if data_q1.shape[0] > 0:
                self.children.append(QuadTree(data_q1,
                                              [xmin, ymin], [xmid, ymid],
                                              depth - 1))
            if data_q2.shape[0] > 0:
                self.children.append(QuadTree(data_q2,
                                              [xmin, ymid], [xmid, ymax],
                                              depth - 1))
            if data_q3.shape[0] > 0:
                self.children.append(QuadTree(data_q3,
                                              [xmid, ymin], [xmax, ymid],
                                              depth - 1))
            if data_q4.shape[0] > 0:
                self.children.append(QuadTree(data_q4,
                                              [xmid, ymid], [xmax, ymax],
                                              depth - 1))

    def draw_rectangle(self, ax, depth):
        """Recursively plot a visualization of the quad tree region"""
        if depth is None or depth == 0:
            rect = plt.Rectangle(self.mins, *self.sizes, zorder=2,
                                 ec='#000000', fc='none')
            ax.add_patch(rect)

        # depth=None means "draw all levels"; avoid None - 1 in the recursion
        if depth is None:
            for child in self.children:
                child.draw_rectangle(ax, None)
        elif depth > 0:
            for child in self.children:
                child.draw_rectangle(ax, depth - 1)
In [41]:
def draw_grid(ax, xlim, ylim, Nx, Ny, **kwargs):
    """draw a background grid for the quad tree"""
    for x in np.linspace(xlim[0], xlim[1], Nx):
        ax.plot([x, x], ylim, **kwargs)
    for y in np.linspace(ylim[0], ylim[1], Ny):
        ax.plot(xlim, [y, y], **kwargs)
In [42]:
#------------------------------------------------------------
# Create a set of structured random points in two dimensions
np.random.seed(0)
X = np.random.random((30, 2)) * 2 - 1
X[:, 1] *= 0.1
X[:, 1] += X[:, 0] ** 2
In [43]:
#------------------------------------------------------------
# Use our Quad Tree class to recursively divide the space
mins = (-1.1, -0.1)
maxs = (1.1, 1.1)
QT = QuadTree(X, mins, maxs, depth=3)
In [44]:
#------------------------------------------------------------
# Plot four different levels of the quad tree
fig = plt.figure(figsize=(5, 5))
fig.subplots_adjust(wspace=0.1, hspace=0.15,
                    left=0.1, right=0.9,
                    bottom=0.05, top=0.9)

for level in range(1, 5):
    ax = fig.add_subplot(2, 2, level, xticks=[], yticks=[])
    ax.scatter(X[:, 0], X[:, 1])
    QT.draw_rectangle(ax, depth=level - 1)

    Nlines = 1 + 2 ** (level - 1)
    draw_grid(ax, (mins[0], maxs[0]), (mins[1], maxs[1]),
              Nlines, Nlines, linewidth=1,
              color='#CCCCCC', zorder=0)

    ax.set_xlim(-1.2, 1.2)
    ax.set_ylim(-0.15, 1.15)
    ax.set_title('level %i' % level)

# suptitle() adds a title to the entire figure
fig.suptitle('Quad-tree Example')
#plt.show()
Out[44]:
If your dataset isn't changing, the tree needs to be built only once; from then on, every query is $\mathcal{O}(\log N)$. Building the tree is a one-time $\mathcal{O}(N \log N)$ cost.
However, if we use straightforward bisection in every dimension, the memory requirements for the tree (i.e. the number of nodes) quickly ramp out of control: the number of children per node is $2^n$, where $n$ is the dimension of the problem. For a 6-d phase space, that's $2^6 = 64$ children per node, and the node count grows geometrically with depth.
kd-trees mitigate this by abandoning the bisect-every-dimension requirement: each node splits the data along a single, cleverly chosen dimension (based on the shape of the data), limiting the number of children per node to two while still operating on the full-dimensional data.
This same consideration will lead to Monte Carlo techniques later in the book.
In [45]:
import numpy as np
from astroML.plotting import setup_text_plots
from matplotlib import pyplot as plt
setup_text_plots(fontsize=8, usetex=True)
In [46]:
%matplotlib inline
In [47]:
# We'll create a KDTree class which will recursively subdivide the
# space into rectangular regions. Note that this is just an example
# and shouldn't be used for real computation; instead use the optimized
# code in scipy.spatial.cKDTree or sklearn.neighbors.BallTree
class KDTree:
    """Simple KD tree class"""

    # class initialization function
    def __init__(self, data, mins, maxs):
        self.data = np.asarray(data)

        # data should be two-dimensional
        assert self.data.shape[1] == 2

        if mins is None:
            mins = data.min(0)
        if maxs is None:
            maxs = data.max(0)

        self.mins = np.asarray(mins)
        self.maxs = np.asarray(maxs)
        self.sizes = self.maxs - self.mins

        self.child1 = None
        self.child2 = None

        if len(data) > 1:
            # sort on the dimension with the largest spread
            largest_dim = np.argmax(self.sizes)
            i_sort = np.argsort(self.data[:, largest_dim])
            self.data[:] = self.data[i_sort, :]

            # find split point (// for integer indices under Python 3)
            N = self.data.shape[0]
            split_point = 0.5 * (self.data[N // 2, largest_dim]
                                 + self.data[N // 2 - 1, largest_dim])

            # create subnodes
            mins1 = self.mins.copy()
            mins1[largest_dim] = split_point
            maxs2 = self.maxs.copy()
            maxs2[largest_dim] = split_point

            # Recursively build a KD-tree on each sub-node
            self.child1 = KDTree(self.data[N // 2:], mins1, self.maxs)
            self.child2 = KDTree(self.data[:N // 2], self.mins, maxs2)

    def draw_rectangle(self, ax, depth=None):
        """Recursively plot a visualization of the KD tree region"""
        if depth == 0:
            rect = plt.Rectangle(self.mins, *self.sizes, ec='k', fc='none')
            ax.add_patch(rect)

        if self.child1 is not None:
            if depth is None:
                self.child1.draw_rectangle(ax)
                self.child2.draw_rectangle(ax)
            elif depth > 0:
                self.child1.draw_rectangle(ax, depth - 1)
                self.child2.draw_rectangle(ax, depth - 1)
In [48]:
#------------------------------------------------------------
# Create a set of structured random points in two dimensions
np.random.seed(0)
X = np.random.random((30, 2)) * 2 - 1
X[:, 1] *= 0.1
X[:, 1] += X[:, 0] ** 2
#------------------------------------------------------------
# Use our KD Tree class to recursively divide the space
KDT = KDTree(X, [-1.1, -0.1], [1.1, 1.1])
#------------------------------------------------------------
# Plot four different levels of the KD tree
fig = plt.figure(figsize=(5, 5))
fig.subplots_adjust(wspace=0.1, hspace=0.15,
                    left=0.1, right=0.9,
                    bottom=0.05, top=0.9)

for level in range(1, 5):
    ax = fig.add_subplot(2, 2, level, xticks=[], yticks=[])
    ax.scatter(X[:, 0], X[:, 1], s=9)
    KDT.draw_rectangle(ax, depth=level - 1)

    ax.set_xlim(-1.2, 1.2)
    ax.set_ylim(-0.15, 1.15)
    ax.set_title('level %i' % level)

# suptitle() adds a title to the entire figure
fig.suptitle('$k$d-tree Example')
Out[48]:
Fortunately, properly optimized implementations of all of these already exist in Python.
http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html
In [13]:
import numpy as np
from scipy.spatial import cKDTree
In [14]:
np.random.seed(0)
In [15]:
X = np.random.random((1000, 3))
In [16]:
kdt = cKDTree(X) # build the KDTree
In [17]:
%timeit kdt.query(X, k=2)
In [18]:
X = np.random.random((100000, 3))
In [19]:
kdt = cKDTree(X)
In [20]:
%timeit kdt.query(X, k=2)
See http://scikit-learn.org/stable/modules/neighbors.html
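scikit-learn's neighbors module provides the same functionality with a slightly different interface. A brief sketch using sklearn.neighbors.KDTree (assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.neighbors import KDTree

np.random.seed(0)
X = np.random.random((1000, 3))

tree = KDTree(X)
# k=2 because each point's nearest neighbor (k=1) is itself
dist, ind = tree.query(X, k=2)
neighbors = ind[:, 1]  # index of each point's true nearest neighbor
```

sklearn also provides BallTree, which scales better than kd-trees in higher dimensions.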