Chapter 2: Fast Computation on Massive Data Sets

Analysis of Algorithmic Efficiency

"big O" notation, $\mathcal{O}(N)$

  • considers only the order of growth, ignoring constant factors
  • more naive algorithms tend to be $\mathcal{O}(N^2)$ or worse (see the toy timing sketch below)
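
To make these bullets concrete, here's a toy timing sketch (illustrative only; absolute times depend on your machine). Doubling $N$ should roughly double the time of the $\mathcal{O}(N)$ loop and quadruple that of the $\mathcal{O}(N^2)$ loop.

In [ ]:
# toy comparison of O(N) vs O(N^2) growth (a sketch, not a benchmark)
import time
import numpy as np

for N in (500, 1000, 2000):
    x = np.random.rand(N)
    t0 = time.time()
    s = 0.0
    for i in range(N):               # O(N): one pass over the data
        s += x[i]
    t1 = time.time()
    p = 0.0
    for i in range(N):               # O(N^2): all pairs
        for j in range(N):
            p += x[i] * x[j]
    t2 = time.time()
    print(N, t1 - t0, t2 - t1)       # second column grows ~4x per doubling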

In [2]:
from IPython.display import Image
from IPython.display import display

search_efficiency = Image(url='http://www.astroml.org/_images/fig_search_scaling_1.png', width=500)

display(search_efficiency)


Seven Types of Computational Problem

  • Basic problems: statistics (means, variances, covariance matrices), 1D sorts and searches; typically $\mathcal{O}(N)$ or $\mathcal{O}(N \log N)$
  • Generalized N-body problems: problems involving pairs, such as nearest-neighbor searches, correlation functions, kernel density estimates; typically $\mathcal{O}(N^2)$ or $\mathcal{O}(N^3)$ if done naively
  • Linear algebraic problems: linear systems, eigenvalue problems, inverses; if $N \gg D$, $\mathcal{O}(N)$, but if the matrix is $N \times N$, $\mathcal{O}(N^3)$
  • Optimization problems: finding the minimum or maximum of a function; order varies depending on the problem
  • Integration problems: estimation of Bayesian models, typically involving high-dimensional functions; cost increases exponentially with the dimensionality; MCMC methods can improve efficiency
  • Graph-theoretic problems: traversals of graphs, such as in manifold learning
  • Alignment problems: matchings between two or more data sets; the worst case is exponential in $N$; tree-based strategies are helpful

Seven Strategies for Speeding Things Up

  • Trees: subdivide the space, then prove that some parts can be ignored or approximated during the computation
  • Subproblem reuse: store solutions and recall them if needed (a memoization sketch follows this list); memory can be a bottleneck
  • Locality: rearrange the computation so that work is localized in memory
  • Streaming: act on only one data point at a time, even for a calculation that theoretically requires access to the entire data set
  • Function transforms: use simpler functions to achieve the same result, e.g. Taylor series or Fourier transforms
  • Sampling: use a subset of points that effectively represents the whole data set
  • Parallelism: break up the problem into parts that can be computed simultaneously; it's important to have a fast serial version first
  • Problem transformation: recast the computation as a different, equivalent problem with a cheaper solution
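
As a minimal, generic illustration of subproblem reuse, here's a memoization sketch (not from the book); the Fibonacci recursion is just a toy stand-in for any computation with expensive, shared subproblems.

In [ ]:
# subproblem reuse: cache each solved subproblem in a dict and look it
# up instead of recomputing; the cache is the memory cost noted above
cache = {}

def fib(n):
    if n < 2:
        return n
    if n not in cache:
        cache[n] = fib(n - 1) + fib(n - 2)  # each value computed once
    return cache[n]

print(fib(100))  # instant; naive recursion would need ~2^100 calls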

Focusing on Trees: searching, sorting, and multidimensional neighbor searches

Searching

unidimensional searches: also called "range" searches, specifying a single attribute, e.g. all objects smaller than x

  • efficient searching can be done using the numpy.searchsorted function (see figure above): after a one-time $\mathcal{O}(N \log N)$ sort, each query is an $\mathcal{O}(\log N)$ binary search

hash tables: an array that maps input data to array indices using a hash function, e.g. sorting a data set into a pixelized grid (a minimal sketch follows below)

  • hashing can be done using numpy.histogram
  • lookups can be $\mathcal{O}(1)$ (constant time, independent of $N$), but the speedup may not be worth the cost of computing the hash function
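
Here's a minimal sketch of the pixelized-grid idea (an illustrative toy, assuming data uniform in [0, 1) with ten bins):

In [ ]:
# grid hashing: the hash function maps a coordinate straight to a bin
# index, so locating a point's bin costs O(1), with no search at all
import numpy as np

np.random.seed(0)
x = np.random.rand(1000)           # data in [0, 1)
nbins = 10
pix = (x * nbins).astype(int)      # the "hash": coordinate -> pixel index

# np.histogram applies the same idea to produce per-pixel counts
counts, edges = np.histogram(x, bins=nbins, range=(0, 1))
print(counts)
print(pix[:5])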

Sorting


In [3]:
import numpy as np

In [4]:
np.random.seed(0)

In [5]:
x = np.random.rand(int(1e7))  # the size must be an int, not the float 1E7

In [6]:
%time x.sort()  # time a single run


CPU times: user 961 ms, sys: 841 µs, total: 962 ms
Wall time: 962 ms

In [7]:
print(x)


[  2.51678389e-08   1.63714365e-07   1.89048978e-07 ...,   9.99999814e-01
   9.99999837e-01   9.99999863e-01]

In [8]:
np.searchsorted(x, 0.5)


Out[8]:
4998210

In [9]:
%timeit np.searchsorted(x, 0.3)


100000 loops, best of 3: 2.09 µs per loop

In [10]:
X = np.random.random((5,3))

In [11]:
np.set_printoptions(precision=2)

In [12]:
print(X)


[[ 0.96  0.92  1.  ]
 [ 0.71  0.22  0.63]
 [ 0.34  0.82  0.97]
 [ 0.33  0.98  0.44]
 [ 0.95  0.33  0.73]]

In [13]:
i_sort = np.argsort(X[:, 0]) # sort by the first column

In [14]:
print(X[i_sort])


[[ 0.33  0.98  0.44]
 [ 0.34  0.82  0.97]
 [ 0.71  0.22  0.63]
 [ 0.95  0.33  0.73]
 [ 0.96  0.92  1.  ]]

In [15]:
X.sort(axis=0)  # sort each column independently (note: this decouples the rows)

In [18]:
print(X)


[[ 0.33  0.22  0.44]
 [ 0.34  0.33  0.63]
 [ 0.71  0.82  0.73]
 [ 0.95  0.92  0.97]
 [ 0.96  0.98  1.  ]]

Multi-dimensional Sorting/Searching: Trees

The problem: doing anything more complicated than a simple search is time consuming if done naively.

As an example, consider a nearest neighbor search among a series of particles.

  • an $N \times D$ array, representing $N$ points in $D$ dimensions (spatial coordinates, momenta, parameters, etc.)
  • A point $X_i$, where $i \in [1, 2, \cdots, N]$, has coordinates $x_{i,j}$ where $j \in [1, 2, \cdots, D]$.
  • Given a query point $X$, we want to find the $X_i$ whose coordinates are closest to those of $X$.
  • Closeness is defined by the Euclidean distance $\chi(X, X_i) = \sqrt{\sum_{d=1}^{D} (x_d - x_{i,d})^2}$

Easy! Just loop!


In [19]:
import numpy as np

In [20]:
# file: easy_nearest_neighbor.py (numpy is already imported above)
def easy_nn(X):
    N, D = X.shape
    neighbors = np.zeros(N, dtype=int) 
    for i in range(N):
        # initialize closest distance to infinity
        j_closest = i 
        d_closest = np.inf 
        for j in range(N):
            # skip distance between a point and itself
            if i == j: 
                continue
            d = np.sqrt(np.sum((X[i] - X[j]) ** 2)) 
            if d < d_closest:
                j_closest = j
                d_closest = d 
        neighbors[i] = j_closest
    return neighbors

In [21]:
np.random.seed(0)

In [22]:
X = np.random.random((10, 3))

In [23]:
X.shape


Out[23]:
(10, 3)

In [24]:
%timeit easy_nn(X)


1000 loops, best of 3: 1.46 ms per loop

By the way, %timeit is an IPython magic function. There are several; useful ones include %run, %paste, and %cpaste (the latter two preserve indentation, which is important in Python).

To learn more about them, just type %magic at an ipython prompt.


In [25]:
X = np.random.random((1000, 3))

In [26]:
%timeit easy_nn(X)


1 loops, best of 3: 15.7 s per loop

Oops. Going from $N = 10$ to $N = 1000$ (a factor of 100) increased the runtime by $15.7\,\mathrm{s} / 1.46\,\mathrm{ms} \approx 10^4 = 100^2$, exactly the $\mathcal{O}(N^2)$ scaling we expected. This is clearly not sustainable once $N$ reaches any appreciable value.

The solution: don't be naive.

NumPy can do better using vectorized operations, thanks to its smart implementation of arrays. I'm going to mostly skip this in favor of trees; see the discussion on pp. 55-57 and the short broadcasting sketch below.

One programming habit I've taken from MATLAB: the programmer should be ashamed of every for loop they write. Make liberal use of reshape and tile, as well as broadcasting:

http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
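
Here's what such a vectorized version might look like: a sketch in the spirit of the book's pp. 55-57 discussion, not its exact code. The double loop becomes one array operation, at the price of $\mathcal{O}(N^2)$ memory for the distance matrix.

In [ ]:
# broadcast-based rewrite of easy_nn (a sketch)
import numpy as np

def vectorized_nn(X):
    # (N, 1, D) - (1, N, D) broadcasts to all pairwise differences
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    dist2 = (diff ** 2).sum(-1)      # squared distances are enough
    np.fill_diagonal(dist2, np.inf)  # exclude each point from itself
    return dist2.argmin(1)

# sanity check against the loop version (should print True):
# print(np.all(vectorized_nn(X) == easy_nn(X)))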

There are a number of options for speeding up your code. Googling "writing fast python" or "writing fast matlab" will often turn up multiple hits with very useful information. As an example:

http://scipy-lectures.github.io/advanced/optimizing/

There's also the option of writing code in an intrinsically fast language (like C), compiling it, and then calling the compiled code from Python. There are several interfaces for this, e.g. Cython or ctypes; Numba takes a related approach, JIT-compiling Python functions directly.

Finally, there's the algorithmic choice. You can achieve significant speedup by tackling projects in a new way using data structures and techniques more appropriate to the problem.

Let's try trees, as an example.

Trees are a classic example of what computer scientists call a "data structure".

Each tree consists of nodes: a single root at the top, with each node linking to zero or more children.

Here is a page providing an introduction to trees, often the first non-linear data structure that computer science students encounter:

http://pages.cs.wisc.edu/~hasti/cs367-1/readings/Trees/index.html
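
To make the structure concrete, here's a minimal sketch (names are illustrative, not from any library):

In [ ]:
# a tree is just nodes, each holding some data plus links to children
class Node:
    def __init__(self, data):
        self.data = data
        self.children = []   # an empty list marks a leaf

root = Node('root')
root.children.append(Node('left child'))
root.children.append(Node('right child'))
print([child.data for child in root.children])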

Specific examples:

  • Quad trees: work in 2 dimensions; each dimension is divided in half, yielding four quadrants per node.
  • Oct trees: work in 3 dimensions; each dimension is bisected, yielding eight octants per node.
  • kd-trees: slightly more advanced, but work in $k$ dimensions.

Here's a quadtree example taken from the book (the 2D analog of an octree).


In [37]:
%matplotlib inline
from matplotlib import pyplot as plt

In [38]:
from astroML.plotting import setup_text_plots

In [39]:
setup_text_plots(fontsize=8, usetex=True)

In [40]:
# We'll create a QuadTree class which will recursively subdivide the
# space into quadrants
class QuadTree:
    """Simple Quad-tree class"""

    # class initialization function
    def __init__(self, data, mins=None, maxs=None, depth=3):
        self.data = np.asarray(data)

        # each data point should be two-dimensional
        assert self.data.shape[1] == 2

        if mins is None:
            mins = data.min(0)
        if maxs is None:
            maxs = data.max(0)

        self.mins = np.asarray(mins)
        self.maxs = np.asarray(maxs)
        self.sizes = self.maxs - self.mins

        self.children = []

        mids = 0.5 * (self.mins + self.maxs)
        xmin, ymin = self.mins
        xmax, ymax = self.maxs
        xmid, ymid = mids

        if depth > 0:
            # split the data into four quadrants
            data_q1 = data[(data[:, 0] < mids[0])
                           & (data[:, 1] < mids[1])]
            data_q2 = data[(data[:, 0] < mids[0])
                           & (data[:, 1] >= mids[1])]
            data_q3 = data[(data[:, 0] >= mids[0])
                           & (data[:, 1] < mids[1])]
            data_q4 = data[(data[:, 0] >= mids[0])
                           & (data[:, 1] >= mids[1])]

            # recursively build a quad tree on each quadrant which has data
            if data_q1.shape[0] > 0:
                self.children.append(QuadTree(data_q1,
                                              [xmin, ymin], [xmid, ymid],
                                              depth - 1))
            if data_q2.shape[0] > 0:
                self.children.append(QuadTree(data_q2,
                                              [xmin, ymid], [xmid, ymax],
                                              depth - 1))
            if data_q3.shape[0] > 0:
                self.children.append(QuadTree(data_q3,
                                              [xmid, ymin], [xmax, ymid],
                                              depth - 1))
            if data_q4.shape[0] > 0:
                self.children.append(QuadTree(data_q4,
                                              [xmid, ymid], [xmax, ymax],
                                              depth - 1))

    def draw_rectangle(self, ax, depth):
        """Recursively plot a visualization of the quad tree region"""
        if depth is None or depth == 0:
            rect = plt.Rectangle(self.mins, *self.sizes, zorder=2,
                                 ec='#000000', fc='none')
            ax.add_patch(rect)
        if depth is None:
            # depth=None means draw all levels
            for child in self.children:
                child.draw_rectangle(ax, depth)
        elif depth > 0:
            for child in self.children:
                child.draw_rectangle(ax, depth - 1)

In [41]:
def draw_grid(ax, xlim, ylim, Nx, Ny, **kwargs):
    """ draw a background grid for the quad tree"""
    for x in np.linspace(xlim[0], xlim[1], Nx):
        ax.plot([x, x], ylim, **kwargs)
    for y in np.linspace(ylim[0], ylim[1], Ny):
        ax.plot(xlim, [y, y], **kwargs)

In [42]:
#------------------------------------------------------------
# Create a set of structured random points in two dimensions
np.random.seed(0)

X = np.random.random((30, 2)) * 2 - 1
X[:, 1] *= 0.1
X[:, 1] += X[:, 0] ** 2

In [43]:
#------------------------------------------------------------
# Use our Quad Tree class to recursively divide the space
mins = (-1.1, -0.1)
maxs = (1.1, 1.1)
QT = QuadTree(X, mins, maxs, depth=3)

In [44]:
#------------------------------------------------------------
# Plot four different levels of the quad tree
fig = plt.figure(figsize=(5, 5))
fig.subplots_adjust(wspace=0.1, hspace=0.15,
                    left=0.1, right=0.9,
                    bottom=0.05, top=0.9)

for level in range(1, 5):
    ax = fig.add_subplot(2, 2, level, xticks=[], yticks=[])
    ax.scatter(X[:, 0], X[:, 1])
    QT.draw_rectangle(ax, depth=level - 1)

    Nlines = 1 + 2 ** (level - 1)
    draw_grid(ax, (mins[0], maxs[0]), (mins[1], maxs[1]),
              Nlines, Nlines, linewidth=1,
              color='#CCCCCC', zorder=0)

    ax.set_xlim(-1.2, 1.2)
    ax.set_ylim(-0.15, 1.15)
    ax.set_title('level %i' % level)

# suptitle() adds a title to the entire figure
fig.suptitle('Quad-tree Example')
#plt.show()


Out[44]:
<matplotlib.text.Text at 0x10eca2f90>

Wow. Such efficiency.

If your data set isn't changing, the tree needs to be built only once; from then on, each query is $\mathcal{O}(\log N)$. The build itself is a one-time $\mathcal{O}(N \log N)$ calculation.

However, if we use straightforward bisection, the memory requirements of the tree (i.e. the number of nodes) quickly ramp out of control: the number of children per node is $2^n$, where $n$ is the dimension of the problem. For a 6-D phase space, that's $2^6 = 64$ children per node, and the node count grows geometrically with depth.

kd-trees mitigate this by abandoning the bisect-every-dimension requirement: each node is split along a single dimension, with the split point chosen based on the shape of the data, so every node has exactly two children no matter how many dimensions the data occupy.

This same consideration will lead to Monte Carlo techniques later in the book.


In [45]:
import numpy as np
from astroML.plotting import setup_text_plots
from matplotlib import pyplot as plt
setup_text_plots(fontsize=8, usetex=True)

In [46]:
%matplotlib inline

In [47]:
# We'll create a KDTree class which will recursively subdivide the
# space into rectangular regions.  Note that this is just an example
# and shouldn't be used for real computation; instead use the optimized
# code in scipy.spatial.cKDTree or sklearn.neighbors.BallTree
class KDTree:
    """Simple KD tree class"""

    # class initialization function
    def __init__(self, data, mins=None, maxs=None):
        self.data = np.asarray(data)

        # each data point should be two-dimensional
        assert self.data.shape[1] == 2

        if mins is None:
            mins = data.min(0)
        if maxs is None:
            maxs = data.max(0)

        self.mins = np.asarray(mins)
        self.maxs = np.asarray(maxs)
        self.sizes = self.maxs - self.mins

        self.child1 = None
        self.child2 = None

        if len(data) > 1:
            # sort on the dimension with the largest spread
            largest_dim = np.argmax(self.sizes)
            i_sort = np.argsort(self.data[:, largest_dim])
            self.data[:] = self.data[i_sort, :]

            # find split point
            N = self.data.shape[0]
            # N // 2: integer division, so the index stays an int in Python 3
            split_point = 0.5 * (self.data[N // 2, largest_dim]
                                 + self.data[N // 2 - 1, largest_dim])

            # create subnodes
            mins1 = self.mins.copy()
            mins1[largest_dim] = split_point
            maxs2 = self.maxs.copy()
            maxs2[largest_dim] = split_point

            # Recursively build a KD-tree on each sub-node
            self.child1 = KDTree(self.data[N // 2:], mins1, self.maxs)
            self.child2 = KDTree(self.data[:N // 2], self.mins, maxs2)

    def draw_rectangle(self, ax, depth=None):
        """Recursively plot a visualization of the KD tree region"""
        if depth == 0:
            rect = plt.Rectangle(self.mins, *self.sizes, ec='k', fc='none')
            ax.add_patch(rect)

        if self.child1 is not None:
            if depth is None:
                self.child1.draw_rectangle(ax)
                self.child2.draw_rectangle(ax)
            elif depth > 0:
                self.child1.draw_rectangle(ax, depth - 1)
                self.child2.draw_rectangle(ax, depth - 1)

In [48]:
#------------------------------------------------------------
# Create a set of structured random points in two dimensions
np.random.seed(0)

X = np.random.random((30, 2)) * 2 - 1
X[:, 1] *= 0.1
X[:, 1] += X[:, 0] ** 2

#------------------------------------------------------------
# Use our KD Tree class to recursively divide the space
KDT = KDTree(X, [-1.1, -0.1], [1.1, 1.1])

#------------------------------------------------------------
# Plot four different levels of the KD tree
fig = plt.figure(figsize=(5, 5))
fig.subplots_adjust(wspace=0.1, hspace=0.15,
                    left=0.1, right=0.9,
                    bottom=0.05, top=0.9)

for level in range(1, 5):
    ax = fig.add_subplot(2, 2, level, xticks=[], yticks=[])
    ax.scatter(X[:, 0], X[:, 1], s=9)
    KDT.draw_rectangle(ax, depth=level - 1)

    ax.set_xlim(-1.2, 1.2)
    ax.set_ylim(-0.15, 1.15)
    ax.set_title('level %i' % level)

# suptitle() adds a title to the entire figure
fig.suptitle('$k$d-tree Example')


Out[48]:
<matplotlib.text.Text at 0x10eca2e90>

Fortunately, properly optimized implementations of all these structures already exist in Python:

http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html


In [13]:
import numpy as np
from scipy.spatial import cKDTree

In [14]:
np.random.seed(0)

In [15]:
X = np.random.random((1000, 3))

In [16]:
kdt = cKDTree(X) # build the KDTree

In [17]:
%timeit kdt.query(X, k=2)


100 loops, best of 3: 2.76 ms per loop

In [18]:
X = np.random.random((100000, 3))

In [19]:
kdt = cKDTree(X)

In [20]:
%timeit kdt.query(X, k=2)


1 loops, best of 3: 494 ms per loop

Other kinds of tree strategies

See http://scikit-learn.org/stable/modules/neighbors.html

  • Ball trees: exploit the fact that data are often clustered, using hyper-spherical nodes/children; see pp. 60-62 and the usage sketch after this list.
  • Basis Change: Often dimensions are strongly correlated and it's possible to reduce the dimensionality of the problem, e.g. PCA.
  • Hierarchical Triangular Mesh (sky indexing): http://www.noao.edu/noao/staff/yao/sdss_papers/kunszt.pdf
  • Maximum margin trees: time-sensitive; do the best you can within a specific time budget.
  • Approximate Neighbor Methods
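
As with cKDTree above, an optimized ball tree ships with scikit-learn; here's a minimal usage sketch (following the documented sklearn.neighbors.BallTree interface, a sketch rather than a benchmark):

In [ ]:
# minimal BallTree usage, analogous to the cKDTree example above
import numpy as np
from sklearn.neighbors import BallTree

np.random.seed(0)
X = np.random.random((1000, 3))
tree = BallTree(X)               # one-time build
dist, ind = tree.query(X, k=2)   # each point's nearest neighbor is ind[:, 1]
print(ind[:3])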