Introduction to NumPy: Slicing through big data faster

Jennifer Klay jklay@calpoly.edu

California Polytechnic State University, San Luis Obispo


This notebook was created from a selection of materials in the SciPy Lectures series and these notebooks prepared by Jake Vanderplas: Intro to NumPy, Efficient NumPy, and Intro to Scipy.


Introduction

While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary-length integers, etc.) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library has facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulation (an essential building block of virtually all technical computing), or any data visualization facilities.

SciPy is the premier package for scientific computing in Python. It provides a large collection of tools that are central to the scientific computing tasks you'll come across.

Some of the submodules available in SciPy that one could use in the analysis of large datasets include:

  • File input/output: scipy.io
  • Special functions: scipy.special
  • Linear algebra operations: scipy.linalg
  • Fast Fourier transforms: scipy.fftpack
  • Optimization and fit: scipy.optimize
  • Statistics and random numbers: scipy.stats
  • Interpolation: scipy.interpolate
  • Numerical integration: scipy.integrate
  • Signal processing: scipy.signal
  • Image processing: scipy.ndimage
  • Sparse Matrices: scipy.sparse
  • Spatial metrics: scipy.spatial

The NumPy library forms the base layer for the entire SciPy ecosystem. It can be imported into the notebook with


In [ ]:
import numpy as np

Basics of NumPy Arrays

Python lists are very flexible containers, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices.

The main object provided by NumPy is a powerful N-dimensional array. We can explore how the NumPy array differs from a Python list by creating a simple list and an array with the same contents:


In [ ]:
lst = [10, 20, 30, 40]
arr = np.array([10, 20, 30, 40])

Elements of a one-dimensional array are accessed with the same syntax as a list:


In [ ]:
arr[0]

In [ ]:
lst[0]

In [ ]:
arr[-1]

In [ ]:
arr[2:]

Differences between arrays and lists

The first difference to note between lists and arrays is that arrays are homogeneous; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string, but the same action won't work with a NumPy array:


In [ ]:
lst[-1] = 'a string inside a list'
lst

In [ ]:
arr[-1] = 'a string inside an array'

The information about the type of an array is contained in its dtype attribute:


In [ ]:
arr.dtype

Once an array has been created, its dtype is fixed and it can only store elements of that type. In this example, where the dtype is integer, if we store a floating point number it will be silently truncated to an integer:


In [ ]:
arr[-1] = 1.234
arr
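
If you do want floating point values, you can request a float dtype when the array is created, or convert an existing array with the astype method (a quick sketch; arr_f is just an illustrative name):


In [ ]:
arr_f = np.array([10, 20, 30, 40], dtype=float)  #float dtype from the start
arr_f[-1] = 1.234                                #the value is now stored exactly
arr.astype(float)                                #or convert an existing array (returns a new array)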

Array Creation

As we did above, you can create an array from a Python list, but there are other very useful ways of creating arrays. For example, we may want an array initialized to a specific value, like 0 or 1, which can then be scaled or shifted to build an array with any starting value. To create these, use zeros or ones and specify the desired type:


In [ ]:
np.zeros(5, dtype=float)

In [ ]:
np.zeros(3, dtype=int)

In [ ]:
np.zeros(3, dtype=complex)

In [ ]:
print '5 ones:', np.ones(5)

In [ ]:
a = np.empty(4)
a.fill(5.5)
a

In [ ]:
a = np.empty(3,dtype=int)
a.fill(7)
a

NumPy also offers the arange function, which works like the built-in range but returns an array instead of a list:


In [ ]:
np.arange(5)

But notice something important about how it is used:


In [ ]:
c = np.arange(2,16,2)
c

As with the built-in range, the stop value is exclusive: the array runs from 2 up to, but not including, 16, in steps of 2.

Alternatively, the linspace and logspace functions create linearly and logarithmically spaced grids, respectively, with a fixed number of points that include both ends of the specified interval:


In [ ]:
print "A linear grid between 0 and 1 with 4 elements:"
print np.linspace(0, 1, 4)

In [ ]:
print "A logarithmic grid between 10**1 and 10**3 with 4 elements:"
print np.logspace(1, 3, 4)

It is often useful to create arrays of random numbers drawn from a specific distribution. The np.random module contains a number of functions for this purpose; for example, this produces an array of 5 samples drawn from a standard normal distribution (mean 0 and variance 1):


In [ ]:
np.random.randn(5)

whereas this also gives 5 samples, but drawn from a normal distribution with a mean of 10 and a standard deviation of 3:


In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10


In [ ]:
np.random.normal(5.0,2.,10)

Multidimensional arrays

NumPy can create arrays of arbitrary dimensions. For example, a list of lists can be used to initialize a two-dimensional array:


In [ ]:
lst2 = [[1, 2], [3, 4]]
arr2 = np.array([[1, 2], [3, 4]])
arr2

With two-dimensional arrays we start seeing the power of NumPy: while a nested list must be indexed by repeatedly applying the [ ] operator, multidimensional arrays support a much more natural indexing syntax with a single [ ] and a set of indices separated by commas:


In [ ]:
print lst2[0][1]
print arr2[0,1]

Most of the array creation functions listed above can be used with more than one dimension, for example:


In [ ]:
np.zeros((2,3))

The shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:


In [ ]:
arr = np.arange(8).reshape(2, 4)
print arr

But note that reshape (like slicing) returns a view of the same memory when possible, rather than a copy:


In [ ]:
arr = np.arange(8)
arr2 = arr.reshape(2, 4)

arr[0] = 1000
print arr
print arr2

This lack of copying allows for very efficient vectorized operations.
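
If you do want an independent array rather than a view, ask for one explicitly with the copy method (a quick sketch; the names base and arr3 are just illustrative):


In [ ]:
base = np.arange(8)
arr3 = base.reshape(2, 4).copy()  #an independent copy rather than a view
base[0] = 1000                    #modifying base no longer affects arr3
arr3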

With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions:


In [ ]:
print 'Slicing in the second row:', arr2[1, 2:4]
print 'All rows, third column   :', arr2[:, 2]

In [ ]:
print 'First row:  ', arr2[0]
print 'Second row: ', arr2[1]
print arr2[2] #arr2 has only two rows, so this raises an IndexError

In [ ]:
#Print some properties of the array arr2
print 'Data type                :', arr2.dtype
print 'Total number of elements :', arr2.size
print 'Number of dimensions     :', arr2.ndim
print 'Shape (dimensionality)   :', arr2.shape
print 'Memory used (in bytes)   :', arr2.nbytes

In [ ]:
#Print some useful information that the arrays can calculate for us
print 'Minimum and maximum             :', arr2.min(), arr2.max()
print 'Sum and product of all elements :', arr2.sum(), arr2.prod()
print 'Mean and standard deviation     :', arr2.mean(), arr2.std()

It's also possible to do the computation along a single dimension, by passing the axis parameter; for example:


In [ ]:
print 'For the following array:\n', arr2
print 'The sum of elements along the rows is    :', arr2.sum(axis=1)
print 'The sum of elements along the columns is :', arr2.sum(axis=0)

As you can see in this example, the value of the axis parameter is the dimension that is consumed (summed over) by the operation. This is why we use axis=1 to sum along the rows.

Another widely used property of arrays is the .T attribute, which allows you to access the transpose of the array:


In [ ]:
print 'Array:\n', arr2
print 'Transpose:\n', arr2.T

NumPy can also create some useful matrices:


In [ ]:
#identity matrix
c = np.eye(3)
c

In [ ]:
#diagonal matrix with elements of an array
d = np.diag(np.array([1,2,3,4]))
d

In [ ]:
e = np.diag(np.linspace(0,1,6))
e

To access a single element of a multidimensional (in this case 2D) array, give one index per dimension:


In [ ]:
e[1,1]

In [ ]:
e[2,2]

Exercise 1

Create the following arrays with correct data types:

[[1 1 1 1]
 [1 1 1 1]
 [1 1 1 2]
 [1 6 1 1]]
[[0. 0. 0. 0. 0.]
 [2. 0. 0. 0. 0.]
 [0. 3. 0. 0. 0.]
 [0. 0. 4. 0. 0.]
 [0. 0. 0. 5. 0.]
 [0. 0. 0. 0. 6.]]

You should be able to accomplish this in 3 lines of code or fewer for each one. The documentation for the np.diag function may be helpful.


In [ ]:
#your code here

In [ ]:
#See solution for the first array
%load sols/ex1a.py

In [ ]:
#See solution for the second array
%load sols/ex1b.py

Exercise 2

Read the documentation for np.tile and use this function to construct the following array:

[[4 3 4 3 4 3]
 [2 1 2 1 2 1]
 [4 3 4 3 4 3]
 [2 1 2 1 2 1]]

In [ ]:
#your code here

In [ ]:
#See solution for this array
%load sols/ex2.py

Slicing Basics

To select a range of elements in the array, you can use slices, just like for Python lists:


In [ ]:
a = np.arange(10)
a

In [ ]:
a[:4]

In [ ]:
a[2:4]

In [ ]:
a[5:]

In [ ]:
a[:]

You can also give the step size:


In [ ]:
a[2:9:3]

Arrays also allow for a more sophisticated and very powerful kind of indexing: you can index an array with another array, in particular with an array of boolean values. This is particularly useful for extracting information from an array that matches a certain condition.

Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is true or false:


In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10

In [ ]:
mask = norm10 > 9
mask

Now that we have this mask, we can use it to either read those values or to reset them to 0:


In [ ]:
print 'Values above 9:', norm10[mask]

In [ ]:
print 'Resetting all values above 9 to 0...'
norm10[mask] = 0
print norm10

Exercise 3

Let’s create a prime number sieve. We will use the sieve to find which numbers below 100 are prime.

First, construct an array of booleans called is_prime with shape (100,), filled with True in all the elements.


In [ ]:
#your code here

In [ ]:
#See solution
%load sols/ex3a.py

The index of each boolean element represents the number itself. “Cross out” 0 and 1, which are not primes, by setting them to False (or 0, which Python treats as equivalent to False for boolean types).


In [ ]:
#your code here

In [ ]:
#See solution
%load sols/ex3b.py

For each subsequent integer j starting from 2, cross out its higher multiples. This is the famous sieve algorithm developed by Eratosthenes and described on Wikipedia.


In [ ]:
for j in range(2, is_prime.size):
    is_prime[2*j::j] = False #Make sure you understand what this code is doing. What does this slicing of the array mean?
print is_prime

The np.nonzero function (try help(np.nonzero) or np.nonzero?) returns the indices of the elements that are still True; since each index represents the number itself, this gives us a tuple containing an array of the prime numbers.


In [ ]:
#your code here

In [ ]:
#See solution
%load sols/ex3c.py

How many primes are there? Try np.count_nonzero().


In [ ]:
#your code here

In [ ]:
#See solution
%load sols/ex3d.py

Combine all of this to create a new function, called eratosthenes_sieve(maximum) that takes one argument, maximum, the maximum number to test for primes, and returns an array containing the prime numbers between 2 and maximum. Print the result for the test cases where maximum = 10 and 100.


In [ ]:
def eratosthenes_sieve(maximum):
    """Use the famous sieve of Eratosthenes to find all the
    prime numbers between 2 and maximum.
        Input: maximum range for the primes
        Output: NumPy array of prime numbers between 2 and maximum"""

    #
    # your code here
    #
    return np.arange(maximum, dtype=int)

In [ ]:
#See solution
%load sols/ex3e.py

In [ ]:
#Call it with maximum 10
output10 = eratosthenes_sieve(10)
print output10
print np.count_nonzero(output10)

In [ ]:
#Call it with maximum 100
output100 = eratosthenes_sieve(100)
print output100
print np.count_nonzero(output100)

Operations with Arrays

Arrays support all regular arithmetic operators, and the NumPy library also contains a complete collection of basic mathematical functions that operate on arrays. It is important to remember that in general, all operations with arrays are applied element-wise, i.e., are applied to all the elements of the array at the same time. Consider for example:


In [ ]:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print arr1, '+', arr2, '=', arr1+arr2

Importantly, remember that even the multiplication operator is applied element-wise by default; it is not the matrix multiplication from linear algebra (as it is in Matlab, for example):


In [ ]:
print arr1, '*', arr2, '=', arr1*arr2
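
If you do want the matrix (dot) product from linear algebra, NumPy provides it separately as np.dot; a quick sketch using the same two arrays:


In [ ]:
np.dot(arr1, arr2)  #inner product of the two 1-d arrays: 0*10 + 1*11 + 2*12 + 3*13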

We may also multiply an array by a scalar:


In [ ]:
1.5 * arr1

Broadcasting

While in principle the shapes of arrays must match for an operation between them to be valid, NumPy will broadcast dimensions when possible. Here is an example of broadcasting a scalar to a 1D array:


In [ ]:
print np.arange(3)
print np.arange(3) + 5

We can also broadcast a 1D array to a 2D array, in this case adding a vector to all rows of a matrix:


In [ ]:
print np.ones((3,3))
print " + "
print np.arange(3)
print " = "
print np.ones((3, 3)) + np.arange(3)

We can also broadcast in two directions at a time:


In [ ]:
np.arange(3).reshape((3, 1)) + np.arange(3)

Rules of Broadcasting

Broadcasting follows these rules:

  • If the two arrays differ in their number of dimensions, the shape of the array with fewer dimensions is padded with dimensions of size 1 on its leading (left) side.

  • If the shapes of the two arrays do not match in some dimension, the array with size 1 in that dimension is stretched to match the other.

  • If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

Note that all of this happens without ever actually creating the stretched arrays in memory! This is what makes broadcasting so powerful in practice: when NumPy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't actually replicate the data. In the scalar example above (1.5 * arr1), the operation is carried out as if 1.5 were a 1-d array with 1.5 in all of its entries, but no such array is ever created. This can save a lot of memory when the arrays in question are large, with significant performance benefits.

For example, when we do

np.arange(3) + 5

The scalar 5 is

  • first 'promoted' to a 1-dimensional array of length 1
  • then, this array is 'stretched' to length 3 to match the first array.

After these two operations are complete, the addition can proceed as now both operands are one-dimensional arrays of length 3.

The general rule is: when operating on two arrays, NumPy compares their shapes element-wise, starting with the trailing (rightmost) dimensions and working its way left, creating dimensions of length 1 as needed. Two dimensions are considered compatible when

  • they are equal to begin with, or
  • one of them is 1; in this case numpy will do the 'stretching' to make them equal.

If these conditions are not met, a ValueError is raised (with a message such as "operands could not be broadcast together"), indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
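
As a quick check of these rules, here is a small sketch (the arrays a, row, and col are invented purely for illustration):


In [ ]:
a = np.arange(12).reshape(4, 3)   #shape (4, 3)
row = np.array([10, 20, 30])      #shape (3,)   -> padded to (1, 3), then stretched to (4, 3)
col = np.arange(4).reshape(4, 1)  #shape (4, 1) -> stretched to (4, 3)
(a + row).shape, (a + col).shape  #both results have shape (4, 3)
#a + np.arange(4) would raise a ValueError: the trailing dimensions are 3 and 4,
#which disagree and neither is 1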


Universal Functions or ufuncs

NumPy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. Furthermore, SciPy ships a rich special function library in the scipy.special module that includes Bessel, Airy, Fresnel, Laguerre and other classical special functions. For example, sampling the sine function at 100 points between 0 and $2\pi$ is as simple as:


In [ ]:
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
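
The scipy.special functions mentioned above work the same way; for example, evaluating the Bessel function J0 at the same sample points (a quick sketch):


In [ ]:
from scipy import special
special.j0(x)  #Bessel function of the first kind, order 0, evaluated element-wise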

These are examples of Universal Functions or ufuncs in NumPy. They operate element-wise on an array. We've already seen examples of these in the various arithmetic operations:


In [ ]:
x = np.random.random(4)
print x
print x + 1  # add 1 to each element of x

In [ ]:
print x * 2  # multiply each element of x by 2

In [ ]:
print x * x  # multiply each element of x by itself

In [ ]:
print x[1:] - x[:-1]

These are binary ufuncs: they take two arguments. There are also unary ufuncs:


In [ ]:
-x

In [ ]:
np.sin(x)

Ufuncs are very fast:


In [ ]:
x = np.random.random(10000)

In [ ]:
%%timeit
# compute element-wise x + 1 via a ufunc 
y = np.zeros_like(x)
y = x + 1

In [ ]:
%%timeit
# compute element-wise x + 1 via a loop
y = np.zeros_like(x)
for i in range(len(x)):
    y[i] = x[i] + 1

NumPy ufuncs are faster than Python functions involving loops, because the looping happens in compiled code. This is only possible when types are known beforehand, which is why NumPy arrays must be typed.

Other Available Ufuncs:

  • Trigonometric functions (np.sin, np.cos, etc.)
  • SciPy special functions (scipy.special.j0, scipy.special.gammaln, etc.)
  • Element-wise minimum/maximum (np.minimum, np.maximum)
  • User-defined ufuncs
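
For the last item in this list, one simple way to get ufunc-style element-wise behavior from a plain Python function is np.vectorize. Note that this is only a convenience wrapper: it still loops in Python, so it does not deliver the compiled-code speed of true ufuncs (a quick sketch; rectify is just an illustrative function):


In [ ]:
def rectify(v):
    return v if v > 0 else 0.0

vrectify = np.vectorize(rectify)      #rectify can now be applied element-wise to arrays
vrectify(np.array([-1.5, 0.2, 3.0]))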

In [ ]:
x = np.random.random(5)
print x
print np.minimum(x, 0.5)
print np.maximum(x, 0.5)

How is this different from min() and max()?


In [ ]:
print np.min(x)
print np.max(x)

NumPy Aggregates

Aggregates are functions that reduce an array to a smaller array or a single value. NumPy has several built in:


In [ ]:
# 10 x 10 array drawn from a standard normal
x = np.random.randn(10, 10)

In [ ]:
#Mean of the values
print x.mean()

In [ ]:
#Standard deviation
print x.std()

In [ ]:
#Variance
print x.var()

In [ ]:
#Sum all elements
print x.sum()

In [ ]:
#Multiply all elements
print x.prod()

In [ ]:
#Median
print np.median(x)

Beware the built-in Python aggregates!

Python has built-in min, max, and sum functions. These are much more general than the NumPy versions, and consequently much slower on arrays:


In [ ]:
x = np.random.random(10000)

%timeit np.sum(x)
%timeit sum(x)

Moral of the story: Dynamic type-checking is slow. Make sure to use NumPy's sum, min, and max.


You can also take aggregates along certain axes:


In [ ]:
x = np.random.rand(3, 5)
print x

In [ ]:
print x.sum(0)  # sum along columns

In [ ]:
print x.sum(1)  # sum along rows

In [ ]:
print np.mean(x, 1) # mean along rows

More slicing and Masking

Imagine you have an array of data where negative values indicate some kind of error:


In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])

How might you clean this array, setting all negative values to, say, zero?


In [ ]:
for i in range(len(x)):
    if x[i] < 0:
        x[i] = 0
print x

A faster way is to construct a boolean mask:


In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])

mask = (x < 0)
print mask

And the mask can be used directly to set the value you desire:


In [ ]:
x[mask] = 0
print x

Or do it directly:


In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])
x[x < 0] = 0
print x

Useful masking functions:


In [ ]:
x = np.random.random(5)
print x

In [ ]:
x[x > 0.5] = np.nan
print x

In [ ]:
x[np.isnan(x)] = np.inf
print x

But don't try to use np.nan in comparisons: by definition, NaN values are never equal to anything, not even to each other, because their values are "unknown".


In [ ]:
np.nan == np.nan

In [ ]:
x[np.isinf(x)] = 0
print x

In [ ]:
x = np.array([1, 0, -np.inf, np.inf, np.nan])
print "input   ", x
print "x < 0   ", (x < 0)
print "x > 0   ", (x > 0)
print "isinf   ", np.isinf(x)
print "isfinite", np.isfinite(x)
print "isnan   ", np.isnan(x)
print "isposinf", np.isposinf(x)
print "isneginf", np.isneginf(x)

You can use the np.nan_to_num function to map nan to 0, inf to the maximum float, and -inf to the minimum float:


In [ ]:
print np.nan_to_num(x)

You can also combine masks using the bitwise boolean operators & (and), | (or), and ~ (not); the parentheses around each comparison are required because of operator precedence:


In [ ]:
x = np.arange(16).reshape((4, 4))
print x

In [ ]:
print (x < 5)

In [ ]:
#Or the complement of that operation:
print ~(x < 5)

In [ ]:
#Elements evaluate to true if they match both conditions
print (x < 10) & (x % 2 == 0)

In [ ]:
#select elements within a range
print (x > 3) & (x < 8)

Sum over a mask to find the number of True elements:


In [ ]:
x = np.random.random(100)
print "array is length", len(x), "and has"
print (x > 0.5).sum(), "elements that are greater than 0.5"

Clip is a useful function. See if you can figure out what it does.


In [ ]:
x = np.random.random(10)
x = np.clip(x, 0.3, 0.6)

print np.sum(x < 0.3)
print np.sum(x > 0.6)

In [ ]:
print x

You can also turn a mask into index information using the where function. For 2D arrays, the result is a tuple of matched row and column index arrays giving the locations of the matching elements in the original array.


In [ ]:
x = np.random.random((3, 3))
print x

In [ ]:
print np.where(x < 0.3)

In [ ]:
print x[x < 0.3]

In [ ]:
print x[np.where(x < 0.3)]

Another useful trick (poker, anyone?):


In [ ]:
i = np.arange(6)
print "Original order:",i
np.random.shuffle(i)
print "Randomly shuffled:",i

Final Example: Image Processing

This example introduces some of the image processing capabilities available with NumPy and the SciPy ndimage package. More extensive documentation and tutorials can be found through the SciPy Lectures series.

Image I/O

Reading in and manipulating image data can be accomplished with NumPy together with a library designed for image processing, such as the Python Imaging Library (PIL).

Recall that color images are just arrays of numbers representing pixel locations and the color depth of each in Red, Green, and Blue. The different file formats (jpg, png, tiff, etc.) also contain additional information (headers, for example) that an image reading function needs to know about. We can use the SciPy ndimage library to read image data into NumPy arrays. If we want to fetch a file off the web, we also need some help from the requests and StringIO libraries:


In [ ]:
import scipy.ndimage as ndi
import requests
from StringIO import StringIO

In [ ]:
#Pick an image from the list above and fetch it with requests.get 
#The default picture here is of M45 - the Pleiades Star Cluster.
response = requests.get("http://imgsrc.hubblesite.org/hu/db/images/hs-2004-20-a-large_web.jpg")
pic = ndi.imread(StringIO(response.content))

Image Visualization

We can plot the image using matplotlib:


In [ ]:
%pylab inline
import matplotlib.pyplot as plt

In [ ]:
plt.imshow(pic);

Image Inspection

We can examine the image properties:


In [ ]:
print pic.shape

The array is indexed as pic[row, column, color]; note that this is the reverse of the (x, y) = (column, row) convention often used for image coordinates.

Colors are represented by RGB triples. Black is (0,0,0), White is (255, 255, 255) or (0xFF, 0xFF, 0xFF) in hexadecimal. Think of it as a color cube with the three axes representing the different possible colors. The furthest away from the origin (black) is white.
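
As a small illustration of this representation (the tiny array here is invented, not taken from the image above), we can build a 2x2 RGB image from scratch and display it:


In [ ]:
tiny = np.zeros((2, 2, 3), dtype=np.uint8)  #a 2x2 image, initially all black
tiny[0, 0] = (255, 0, 0)                    #pure red
tiny[0, 1] = (0, 255, 0)                    #pure green
tiny[1, 0] = (0, 0, 255)                    #pure blue
tiny[1, 1] = (255, 255, 255)                #white: the corner of the color cube opposite black
plt.imshow(tiny, interpolation='nearest');  #'nearest' keeps the four pixels as solid blocks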


In [ ]:
#Color array [R,G,B] of very first pixel
print pic[0,0]

We could write code to find the brightest pixel in the image, where "brightest" means highest value of R+G+B. For the 256 color scale, the greatest possible value is 3 * 255, or 765. One way to do this would be to write a set of nested loops over the pixel dimensions, calculating the sum R+G+B for each pixel, but that would be rather tedious and slow.

We could process the information faster if we take advantage of the speedy NumPy slicing, aggregates, and ufuncs. Remember that any time we can eliminate interpreted Python loops we save a lot of processing time.


In [ ]:
#find value of max pixel with aggregates
print pic.sum(axis=2).max() #numbering from 0, axis 2 is the color depth
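
The aggregate above gives the maximum total brightness but not where it occurs. If we also want the location of that pixel, one option is to combine argmax with np.unravel_index (a quick sketch):


In [ ]:
brightness = pic.sum(axis=2)     #2D array of R+G+B for every pixel
row, col = np.unravel_index(brightness.argmax(), brightness.shape)
row, col, brightness[row, col]   #location (row, column) and value of the brightest pixel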

Image Feature Extraction

Now that we know how to read in the image as a NumPy array, let's count the stars above some threshold brightness. Start by converting the image to black and white (B/W), so that it is unambiguous which pixels belong to stars and which belong to the background. We'll use black for stars and white for background, since it's easier to see black-on-white than the reverse.


In [ ]:
def monochrome(pic_array, threshold):
    """replace the RGB values in the loaded image with either 
    black or white depending on whether its total 
    luminance is above or below some threshold 
    passed in by the user"""
    mask = (pic_array.sum(axis=2) >= threshold) #could also be done in one step
    pic_array[mask] = 0 #BLACK - broadcasting at work here
    pic_array[~mask] = 255 #WHITE - broadcasting at work here
    return

In [ ]:
#Get another copy to convert to B/W
bwpic = ndi.imread(StringIO(response.content))

#This threshold is a scalar, not an RGB triple
#We're looking for pixels whose total color value is 600 or greater
monochrome(bwpic,200+200+200)
plt.imshow(bwpic);

The way to count the features (stars) in the image is to identify "blobs" of connected or adjacent black pixels.

  1. We need to decide what we mean by "adjacent": pixels count as adjacent if they share a side, i.e. are directly above, below, left, or right of each other, but not if they only touch diagonally at a corner.

  2. Count each one just once.

    • Scan the image left-right, top-bottom
    • Each time we find a new blob, increment the count

How do we tell whether the pixel we're looking at is part of a new blob or not?

Imagine marking counted black pixels by turning them red. If the pixel we're looking at touches one that has already been turned red, then it's part of a blob we have already counted; we turn it red to show that we have looked at it, but we don't count it as a new star.

The simplest procedure is to sweep across the field left-to-right, top-to-bottom until we find a non-zero pixel, mark it red, and look to its left and above to see whether it is part of an already counted blob; if not, we increment the counter and proceed to the right, repeating the same test at each black pixel. This "look left and above" rule can be patched by also checking the diagonals, but there are still blob shapes it miscounts: a U-shaped blob whose two arms only join further down would be counted twice. A more robust approach is a "flood-fill" algorithm: sweep across the field until a black pixel is encountered and mark it red; check its neighbors on all sides, move to each unchecked neighbor and mark it red, and keep checking neighbors of neighbors until there are no more unchecked neighbors; then resume the sweep.
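
In condensed form, such a loop-based flood fill might look like the following minimal sketch, which assumes a 2D boolean array in which True marks star pixels (count_blobs is a name invented here for illustration; the scipy.ndimage approach below does the same job far more efficiently):


In [ ]:
def count_blobs(black):
    """Count 4-connected blobs of True pixels in a 2D boolean array
    using a stack-based flood fill (pure Python, slow but simple)."""
    visited = np.zeros_like(black, dtype=bool)
    nrows, ncols = black.shape
    count = 0
    for r in range(nrows):
        for c in range(ncols):
            if black[r, c] and not visited[r, c]:
                count += 1                  #found a new, uncounted blob
                stack = [(r, c)]
                while stack:                #flood-fill the whole blob
                    i, j = stack.pop()
                    if (0 <= i < nrows and 0 <= j < ncols
                            and black[i, j] and not visited[i, j]):
                        visited[i, j] = True
                        stack.extend([(i-1, j), (i+1, j), (i, j-1), (i, j+1)])
    return count

#e.g. count_blobs(bwpic.sum(axis=2) == 0) would count the black (star) blobs in the B/W image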

A traditional implementation of this algorithm using plain Python loops is presented in the Multimedia Programming lesson from Software Carpentry. I use that when I teach introductory programming to physics students. Let's skip ahead to see how to implement such an algorithm much more efficiently using numpy and scipy.ndimage.

The scipy.ndimage.label function will use a structuring element (cross-shaped by default) to search for features. As an example, consider the simple array:


In [ ]:
a = np.array([[0,0,1,1,0,0],
              [0,0,0,1,0,0],
              [1,1,0,0,1,0],
              [0,0,0,1,0,0]])

There are four unique features here, if we only count those that have neighbors along a cross-shaped structuring element.


In [ ]:
labeled_array, num_features = ndi.label(a)

In [ ]:
print(num_features)

In [ ]:
print(labeled_array)

If we wish to consider elements connected on the diagonal, as well as the cross structure, we define a new structuring element:


In [ ]:
s = [[1,1,1],
     [1,1,1],
     [1,1,1]]

#Note that scipy.ndimage.generate_binary_structure(2,2) would do the same thing.
print s

Label the image using the new structuring element:


In [ ]:
labeled_array, num_features = ndi.label(a, structure=s)

In [ ]:
print(num_features)

Note that features 1, 3, and 4 from above are now considered a single feature:


In [ ]:
print(labeled_array)

Let's use ndi.label to count up the stars in our B/W starfield image.


In [ ]:
labeled_array, num_stars = ndi.label(~bwpic) #Count and label the complement

In [ ]:
print num_stars

In [ ]:
plt.imshow(labeled_array);

ndi.label returns an array of the same shape as the input in which each unique feature has been given a unique integer value. If you want the pixel indices of each feature, you can use a list comprehension to extract them. Something like:

label_indices = [(labeled_array[:,:,0] == i).nonzero() for i in xrange(1, num_stars+1)]

Alternatively, the ndi.find_objects function returns a list of slice tuples giving the general (bounding-box) location of each star, though not necessarily its exact shape.


In [ ]:
locations = ndi.find_objects(labeled_array)
print locations[9]

In [ ]:
label_indices = [(labeled_array[:,:,0] == i).nonzero() for i in xrange(1, num_stars+1)]
print label_indices[9]

Let's change the color of the largest star in the field to red. To find the largest star, look at the lengths of the arrays stored in label_indices.


In [ ]:
star_sizes = [(label_indices[i-1][0]).size for i in xrange(1, num_stars+1)]
print len(star_sizes)

In [ ]:
biggest_star = np.where(star_sizes == np.max(star_sizes))[0][0] #take the first index in case of ties
print biggest_star

In [ ]:
print star_sizes[biggest_star]

In [ ]:
bwpic[label_indices[biggest_star][0],label_indices[biggest_star][1],:] = (255,0,0)
plt.imshow(bwpic);

Exercises

For fun and practice, you can use these tricks to analyze the whole set of pictures listed at the top of this section. Suggested activities include:

  • Find the brightest pixel within each star identified using the scipy.ndimage.label function
  • Draw the image with different colors for identified stars of different brightnesses
  • Use the scipy.ndimage.center_of_mass function to identify the "location" of each star in pixel coordinates.
  • Use NumPy's I/O functions to write out a text file (see the np.savetxt sketch after this list) containing

    • a header which includes
      • the link to the image
      • the threshold value used in the conversion to monochrome
      • the structuring element used to count its stars
      • how many stars were found
      • a description of the data in the subsequent columns
    • a list of the star data organized in three columns and formatted for easy reading as follows:
      • location (x,y in pixel units)
      • maximum brightness value for the star
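
As a pointer for the last item above, np.savetxt accepts a header string and a format specification; the sketch below uses made-up star data and an assumed column layout, just to show the general pattern:


In [ ]:
#Hypothetical example: columns are x, y (pixel coordinates) and the star's maximum brightness
star_data = np.array([[102.0,  40.0, 712.0],
                      [387.0, 215.0, 655.0]])
hdr = ("image: <link to the image used>\n"
       "threshold: 600   structuring element: 3x3   stars found: 2\n"
       "columns: x_pixel   y_pixel   max_brightness")
np.savetxt("stars.txt", star_data, fmt="%12.1f", header=hdr)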

Summary:

NumPy offers convenient and fast numerical data manipulation with a large body of support libraries. Use it with large datasets to very quickly slice through the information and extract meaningful results.

The advantage of using NumPy is all about moving loops into compiled code:

  • Use NumPy ufuncs to your advantage (eliminate loops!)

  • Use NumPy aggregates to your advantage (eliminate loops!)

  • Use NumPy broadcasting to your advantage (eliminate loops!)

  • Use NumPy slicing and masking to your advantage (eliminate loops!)

For more details and documentation, consult the SciPy lecture series or the NumPy/SciPy documentation.

For an interesting look at how NumPy stacks up against traditional for loops and Python list comprehensions, see Jess Hamrick's nice blog post The Demise of For Loops from 2012.