Jennifer Klay jklay@calpoly.edu
California Polytechnic State University, San Luis Obispo
This notebook was created from a selection of materials in the SciPy Lectures series and these notebooks prepared by Jake Vanderplas: Intro to NumPy, Efficient NumPy, and Intro to Scipy.
While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary-length integers, etc.) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library has facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), or any data visualization facilities.
SciPy is the premier package for scientific computing in Python. It has many, many useful tools which are central to the scientific computing tasks you'll come across.
Some of the submodules available in SciPy that one could use in the analysis of large datasets include:
scipy.io: data input and output
scipy.special: special mathematical functions
scipy.linalg: linear algebra routines
scipy.fftpack: fast Fourier transforms
scipy.optimize: optimization and root finding
scipy.stats: statistics and random distributions
scipy.interpolate: interpolation
scipy.integrate: numerical integration and ODE solvers
scipy.signal: signal processing
scipy.ndimage: n-dimensional image processing
scipy.sparse: sparse matrices
scipy.spatial: spatial data structures and algorithms
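As a quick taste (assuming SciPy is installed), each submodule is imported on its own; here scipy.stats is used to evaluate a standard normal CDF:

```python
from scipy import stats

# the cumulative distribution function of the standard normal at 0 is one half
p = stats.norm.cdf(0.0)
print(p)
```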
The NumPy library forms the base layer for the entire SciPy ecosystem. It can be imported into the notebook with
In [ ]:
import numpy as np
Python lists are very flexible containers, but they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices.
The main object provided by NumPy is a powerful array. We can explore how the NumPy array differs from a Python list by creating a simple list and an array with the same contents as the list:
In [ ]:
lst = [10, 20, 30, 40]
arr = np.array([10, 20, 30, 40])
Elements of a one-dimensional array are accessed with the same syntax as a list:
In [ ]:
arr[0]
In [ ]:
lst[0]
In [ ]:
arr[-1]
In [ ]:
arr[2:]
The first difference to note between lists and arrays is that arrays are homogeneous; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string, but the same action won't work with a NumPy array:
In [ ]:
lst[-1] = 'a string inside a list'
lst
In [ ]:
arr[-1] = 'a string inside an array'
The information about the type of an array is contained in its dtype attribute:
In [ ]:
arr.dtype
Once an array has been created, its dtype is fixed, and it can only store elements of that type. In this example the dtype is integer; if we store a floating point number, it will be silently truncated to an integer:
In [ ]:
arr[-1] = 1.234
arr
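A quick sketch (in Python 3 syntax) confirming the truncation behavior:

```python
import numpy as np

arr = np.array([10, 20, 30, 40])   # dtype inferred as integer
arr[-1] = 1.234                    # a float assigned into an int array...
print(arr[-1])                     # ...is truncated, not rounded: 1
```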
As we did above, you can create an array from a Python list, but there are other methods of creating arrays that are very useful. For example, we may want to create an array initialized to a specific value, like 0 or 1, which can then be built upon with addition or multiplication to get arrays of any initial value. To create these, use zeros or ones and specify the desired type:
In [ ]:
np.zeros(5, dtype=float)
In [ ]:
np.zeros(3, dtype=int)
In [ ]:
np.zeros(3, dtype=complex)
In [ ]:
print '5 ones:', np.ones(5)
In [ ]:
a = np.empty(4)
a.fill(5.5)
a
In [ ]:
a = np.empty(3,dtype=int)
a.fill(7)
a
NumPy also offers the arange function, which works like the built-in range but returns an array instead of a list:
In [ ]:
np.arange(5)
But notice something important about how it is used:
In [ ]:
c = np.arange(2,16,2)
c
Note that, as with the built-in range, the maximum value is exclusive: arange generates values over the half-open interval [start, stop).
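A sketch of the half-open behavior (Python 3 syntax):

```python
import numpy as np

# like Python's built-in range, np.arange excludes the stop value
c = np.arange(2, 16, 2)
print(c)    # 2 up through 14; 16 itself is not included
```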
Alternatively, the linspace and logspace functions create linearly and logarithmically spaced grids, respectively, with a fixed number of points that includes both ends of the specified interval:
In [ ]:
print "A linear grid between 0 and 1 with 4 elements:"
print np.linspace(0, 1, 4)
In [ ]:
print "A logarithmic grid between 10**1 and 10**3 with 4 elements:"
print np.logspace(1, 3, 4)
It is often useful to create arrays of random numbers that follow a specific distribution. The np.random module contains a number of functions for this purpose; for example, this will produce an array of 5 random samples drawn from a standard normal distribution (mean 0 and variance 1):
In [ ]:
np.random.randn(5)
whereas this will also give 5 samples, but from a normal distribution with a mean of 10 and a standard deviation of 3 (the second argument of np.random.normal is the standard deviation, not the variance):
In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10
In [ ]:
np.random.normal(5.0,2.,10)
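As a sanity check (Python 3 syntax), with a large sample the estimated mean and standard deviation approach the loc and scale arguments:

```python
import numpy as np

# 100,000 samples from a normal distribution with mean 5.0 and std dev 2.0
samples = np.random.normal(5.0, 2.0, 100000)
print(samples.mean(), samples.std())   # close to 5.0 and 2.0
```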
NumPy can create arrays of arbitrary dimensions. For example, a list of lists can be used to initialize a two-dimensional array:
In [ ]:
lst2 = [[1, 2], [3, 4]]
arr2 = np.array([[1, 2], [3, 4]])
arr2
With two-dimensional arrays we start seeing the power of NumPy: while a nested list must be indexed by applying the [ ] operator repeatedly, multidimensional arrays support a much more natural indexing syntax, with a single [ ] and a set of indices separated by commas:
In [ ]:
print lst2[0][1]
print arr2[0,1]
Most of the array creation functions listed above can be used with more than one dimension, for example:
In [ ]:
np.zeros((2,3))
The shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:
In [ ]:
arr = np.arange(8).reshape(2, 4)
print arr
But note that reshaping (like most NumPy operations) provides a view of the same memory where possible, rather than a copy:
In [ ]:
arr = np.arange(8)
arr2 = arr.reshape(2, 4)
arr[0] = 1000
print arr
print arr2
This lack of copying allows for very efficient vectorized operations.
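When an independent copy is what you actually want, ask for one explicitly; a small sketch (Python 3 syntax):

```python
import numpy as np

a = np.arange(8)
view = a.reshape(2, 4)           # shares memory with a
copy = a.reshape(2, 4).copy()    # an explicit, independent copy
a[0] = 1000
print(view[0, 0], copy[0, 0])    # the view sees the change; the copy does not
```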
With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions:
In [ ]:
print 'Slicing in the second row:', arr2[1, 2:4]
print 'All rows, third column :', arr2[:, 2]
In [ ]:
print 'First row: ', arr2[0]
print 'Second row: ', arr2[1]
print arr2[2]
In [ ]:
#Print some properties of the array arr2
print 'Data type :', arr2.dtype
print 'Total number of elements :', arr2.size
print 'Number of dimensions :', arr2.ndim
print 'Shape (dimensionality) :', arr2.shape
print 'Memory used (in bytes) :', arr2.nbytes
In [ ]:
#Print some useful information that the arrays can calculate for us
print 'Minimum and maximum :', arr2.min(), arr2.max()
print 'Sum and product of all elements :', arr2.sum(), arr2.prod()
print 'Mean and standard deviation :', arr2.mean(), arr2.std()
It's also possible to do the computation along a single dimension, by passing the axis parameter; for example:
In [ ]:
print 'For the following array:\n', arr2
print 'The sum of elements along the rows is :', arr2.sum(axis=1)
print 'The sum of elements along the columns is :', arr2.sum(axis=0)
As you can see in this example, the value of the axis parameter is the dimension which will be consumed once the operation has been carried out. This is why to sum along the rows we use axis=1.
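A small sketch of this (Python 3 syntax): the axis you pass is the one that disappears from the result's shape:

```python
import numpy as np

m = np.arange(8).reshape(2, 4)
row_sums = m.sum(axis=1)   # axis 1 (columns) is consumed: one sum per row
col_sums = m.sum(axis=0)   # axis 0 (rows) is consumed: one sum per column
print(row_sums, col_sums)
```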
Another widely used property of arrays is the .T attribute, which gives you the transpose of the array:
In [ ]:
print 'Array:\n', arr2
print 'Transpose:\n', arr2.T
NumPy can also create some useful matrices:
In [ ]:
#identity matrix
c = np.eye(3)
c
In [ ]:
#diagonal matrix with elements of an array
d = np.diag(np.array([1,2,3,4]))
d
In [ ]:
e = np.diag(np.linspace(0,1,6))
e
To access the elements of a multidimensional (in this case 2D) array:
In [ ]:
e[1,1]
In [ ]:
e[2,2]
Create the following arrays with correct data types:
[[1 1 1 1]
[1 1 1 1]
[1 1 1 2]
[1 6 1 1]]
[[0. 0. 0. 0. 0.]
[2. 0. 0. 0. 0.]
[0. 3. 0. 0. 0.]
[0. 0. 4. 0. 0.]
[0. 0. 0. 5. 0.]
[0. 0. 0. 0. 6.]]
You should be able to accomplish this in 3 lines of code or less for each one. The documentation for the diag function may be helpful.
In [ ]:
#your code here
In [ ]:
#See solution for the first array
%load sols/ex1a.py
In [ ]:
#See solution for the second array
%load sols/ex1b.py
Read the documentation for np.tile and use this function to construct the following array:
[[4 3 4 3 4 3]
[2 1 2 1 2 1]
[4 3 4 3 4 3]
[2 1 2 1 2 1]]
In [ ]:
#your code here
In [ ]:
#See solution for this array
%load sols/ex2.py
To select a range of elements in the array, you can use slices, just like for Python lists:
In [ ]:
a = np.arange(10)
a
In [ ]:
a[:4]
In [ ]:
a[2:4]
In [ ]:
a[5:]
In [ ]:
a[:]
You can also give the step size:
In [ ]:
a[2:9:3]
Arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particularly useful for extracting information from an array that matches a certain condition.
Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is true or false:
In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10
In [ ]:
mask = norm10 > 9
mask
Now that we have this mask, we can use it to either read those values or to reset them to 0:
In [ ]:
print 'Values above 9:', norm10[mask]
In [ ]:
print 'Resetting all values above 9 to 0...'
norm10[mask] = 0
print norm10
Let’s create a prime number sieve. We will use the sieve to see what numbers between 0 and 100 are prime.
First, construct an array of booleans called is_prime with shape (100,), filled with True in all the elements.
In [ ]:
#your code here
In [ ]:
#See solution
%load sols/ex3a.py
The index of each boolean element represents the number. "Cross out" 0 and 1, which are not primes. You can either set them to False or 0 (which Python recognizes as equivalent to False for boolean types).
In [ ]:
#your code here
In [ ]:
#See solution
%load sols/ex3b.py
For each subsequent integer j starting from 2, cross out its higher multiples. This is the famous sieve algorithm developed by Eratosthenes and described on Wikipedia.
In [ ]:
for j in range(2, is_prime.size):
    is_prime[2*j::j] = False #Make sure you understand what this code is doing. What does this slicing of the array mean?
print is_prime
The np.nonzero function (try help(np.nonzero) or np.nonzero?) can be used to print a tuple of the prime numbers, which are the elements in the is_prime array that are still True.
In [ ]:
#your code here
In [ ]:
#See solution
%load sols/ex3c.py
How many primes are there? Try np.count_nonzero().
In [ ]:
#your code here
In [ ]:
#See solution
%load sols/ex3d.py
Combine all of this to create a new function, called eratosthenes_sieve(maximum), that takes one argument, maximum, the maximum number to test for primes, and returns an array containing the prime numbers between 2 and maximum. Print the result for the test cases maximum = 10 and maximum = 100.
In [ ]:
def eratosthenes_sieve(maximum):
    """Use the famous sieve of Eratosthenes to find all the
    prime numbers between 2 and maximum.
    Input: maximum range for the primes
    Output: NumPy array of prime numbers between 2 and maximum"""
    #
    # your code here
    #
    return np.arange(maximum, dtype=int)
In [ ]:
#See solution
%load sols/ex3e.py
In [ ]:
#Call it with maximum 10
output10 = eratosthenes_sieve(10)
print output10
print np.count_nonzero(output10)
In [ ]:
#Call it with maximum 100
output100 = eratosthenes_sieve(100)
print output100
print np.count_nonzero(output100)
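For comparison with the loaded solution, here is one possible vectorized implementation (Python 3 syntax; named sieve here to avoid clobbering your own eratosthenes_sieve, and stopping the crossing-out at the square root of maximum, a standard optimization):

```python
import numpy as np

def sieve(maximum):
    """Sieve of Eratosthenes: primes from 2 to maximum inclusive."""
    is_prime = np.ones(maximum + 1, dtype=bool)
    is_prime[:2] = False                         # 0 and 1 are not prime
    for j in range(2, int(maximum**0.5) + 1):
        is_prime[2*j::j] = False                 # cross out higher multiples of j
    return np.nonzero(is_prime)[0]

print(sieve(10))    # [2 3 5 7]
```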
Arrays support all regular arithmetic operators, and the NumPy library also contains a complete collection of basic mathematical functions that operate on arrays. It is important to remember that in general, all operations with arrays are applied element-wise, i.e., are applied to all the elements of the array at the same time. Consider for example:
In [ ]:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print arr1, '+', arr2, '=', arr1+arr2
Importantly, you must remember that even the multiplication operator is applied element-wise by default; it is not the matrix multiplication from linear algebra (as it is in Matlab, for example):
In [ ]:
print arr1, '*', arr2, '=', arr1*arr2
We may also multiply an array by a scalar:
In [ ]:
1.5 * arr1
While in principle arrays must always match in their dimensionality in order for an operation to be valid, NumPy will broadcast dimensions when possible. Here is an example of broadcasting a scalar to a 1D array:
In [ ]:
print np.arange(3)
print np.arange(3) + 5
We can also broadcast a 1D array to a 2D array, in this case adding a vector to all rows of a matrix:
In [ ]:
print np.ones((3,3))
print " + "
print np.arange(3)
print " = "
print np.ones((3, 3)) + np.arange(3)
We can also broadcast in two directions at a time:
In [ ]:
np.arange(3).reshape((3, 1)) + np.arange(3)
Broadcasting follows these rules:
If the two arrays differ in their number of dimensions, the shape of the array with fewer dimensions is padded with dimensions of size 1 on its leading (left) side until it matches the other.
If the shapes of the two arrays do not match in some dimension, the array with size 1 in that dimension is stretched to match the other shape.
If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
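A minimal sketch of the first two rules in action (Python 3 syntax):

```python
import numpy as np

a = np.ones((3, 3))    # shape (3, 3)
b = np.arange(3)       # shape (3,): padded on the left to (1, 3),
result = a + b         # then the length-1 dimension is stretched to 3
print(result.shape)
```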
Note that all of this happens without ever actually creating the stretched arrays in memory! This broadcasting behavior is in practice enormously powerful, especially because when NumPy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't actually replicate the data. In the scalar multiplication example above, the operation is carried out as if 1.5 were a 1-d array with 1.5 in all of its entries, but no such array is ever created. This can save lots of memory when the arrays in question are large, and so can have significant performance implications.
For example, when we do
np.arange(3) + 5
the scalar 5 is first promoted to a one-dimensional array of shape (1,), and that length-1 dimension is then stretched to length 3. After these two operations are complete, the addition can proceed, as both operands are now one-dimensional arrays of length 3.
The general rule is: when operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward, creating dimensions of length 1 as needed. Two dimensions are considered compatible when they are equal, or when one of them is 1. If these conditions are not met, a ValueError: frames are not aligned exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
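The pairwise test can be spelled out as a tiny helper; note that this broadcastable function is purely illustrative, not part of NumPy:

```python
def broadcastable(shape_a, shape_b):
    """Apply NumPy's compatibility rule pairwise, from the trailing dimensions."""
    for m, n in zip(shape_a[::-1], shape_b[::-1]):
        if m != n and m != 1 and n != 1:
            return False     # sizes disagree and neither is 1
    return True              # any leftover leading dimensions are padded with 1s

print(broadcastable((3, 3), (3,)))   # True: trailing 3 matches 3
print(broadcastable((3, 2), (3,)))   # False: trailing dims 2 and 3 disagree
```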
NumPy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. Furthermore, SciPy ships a rich special function library in the scipy.special module that includes Bessel, Airy, Fresnel, Laguerre and other classical special functions. For example, sampling the sine function at 100 points between 0 and $2\pi$ is as simple as:
In [ ]:
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
These are examples of Universal Functions, or ufuncs, in NumPy. They operate element-wise on an array. We've already seen examples of these in the various arithmetic operations:
In [ ]:
x = np.random.random(4)
print x
print x + 1 # add 1 to each element of x
In [ ]:
print x * 2 # multiply each element of x by 2
In [ ]:
print x * x # multiply each element of x by itself
In [ ]:
print x[1:] - x[:-1]
These are binary ufuncs: they take two arguments. There are also unary ufuncs:
In [ ]:
-x
In [ ]:
np.sin(x)
Ufuncs are very fast:
In [ ]:
x = np.random.random(10000)
In [ ]:
%%timeit
# compute element-wise x + 1 via a ufunc
y = np.zeros_like(x)
y = x + 1
In [ ]:
%%timeit
# compute element-wise x + 1 via a loop
y = np.zeros_like(x)
for i in range(len(x)):
y[i] = x[i] + 1
NumPy ufuncs are faster than Python functions involving loops, because the looping happens in compiled code. This is only possible when types are known beforehand, which is why NumPy arrays must be typed.
There are many more ufuncs available, including:
trigonometric functions (np.sin, np.cos, etc.)
special functions (scipy.special.j0, scipy.special.gammaln, etc.)
element-wise minimum and maximum (np.minimum, np.maximum)
In [ ]:
x = np.random.random(5)
print x
print np.minimum(x, 0.5)
print np.maximum(x, 0.5)
How is this different from min() and max()?
In [ ]:
print np.min(x)
print np.max(x)
Aggregates are functions that reduce an array to a smaller array or a single value. NumPy has several built in:
In [ ]:
# 10 x 10 array drawn from a standard normal
x = np.random.randn(10, 10)
In [ ]:
#Mean of the values
print x.mean()
In [ ]:
#Standard deviation
print x.std()
In [ ]:
#Variance
print x.var()
In [ ]:
#Sum all elements
print x.sum()
In [ ]:
#Multiply all elements
print x.prod()
In [ ]:
#Median
print np.median(x)
Python has built-in min, max, and sum aggregates. These are much more general than the NumPy versions, and therefore much slower:
In [ ]:
x = np.random.random(10000)
%timeit np.sum(x)
%timeit sum(x)
Moral of the story: dynamic type-checking is slow. Make sure to use NumPy's sum, min, and max.
You can also take aggregates along certain axes:
In [ ]:
x = np.random.rand(3, 5)
print x
In [ ]:
print x.sum(0) # sum along columns
In [ ]:
print x.sum(1) # sum along rows
In [ ]:
print np.mean(x, 1) # mean along rows
Imagine you have an array of data where negative values indicate some kind of error:
In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])
How might you clean this array, setting all negative values to, say, zero?
In [ ]:
for i in range(len(x)):
if x[i] < 0:
x[i] = 0
print x
A faster way is to construct a boolean mask:
In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])
mask = (x < 0)
print mask
And the mask can be used directly to set the value you desire:
In [ ]:
x[mask] = 0
print x
Or do it directly:
In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])
x[x < 0] = 0
print x
In [ ]:
x = np.random.random(5)
print x
In [ ]:
x[x > 0.5] = np.nan
print x
In [ ]:
x[np.isnan(x)] = np.inf
print x
But don't try to use np.nan for comparison purposes. By definition, two NaNs are never equal, because their values are "unknown".
In [ ]:
np.nan == np.nan
In [ ]:
x[np.isinf(x)] = 0
print x
In [ ]:
x = np.array([1, 0, -np.inf, np.inf, np.nan])
print "input ", x
print "x < 0 ", (x < 0)
print "x > 0 ", (x > 0)
print "isinf ", np.isinf(x)
print "isfinite", np.isfinite(x)
print "isnan ", np.isnan(x)
print "isposinf", np.isposinf(x)
print "isneginf", np.isneginf(x)
You can use the np.nan_to_num function to map nan to 0, inf to the maximum float, and -inf to the minimum float:
print np.nan_to_num(x)
You can also combine masks with the bitwise boolean operators; note the parentheses around each comparison:
In [ ]:
x = np.arange(16).reshape((4, 4))
print x
In [ ]:
print (x < 5)
In [ ]:
#Or the complement of that operation:
print ~(x < 5)
In [ ]:
#Elements evaluate to true if they match both conditions
print (x < 10) & (x % 2 == 0)
In [ ]:
#select elements within a range
print (x > 3) & (x < 8)
Sum over a mask to find the number of True elements:
In [ ]:
x = np.random.random(100)
print "array is length", len(x), "and has"
print (x > 0.5).sum(), "elements that are greater than 0.5"
Clip is a useful function. See if you can figure out what it does.
In [ ]:
x = np.random.random(10)
x = np.clip(x, 0.3, 0.6)
print np.sum(x < 0.3)
print np.sum(x > 0.6)
In [ ]:
print x
You can also turn a mask into information about the indices using the where function. For 2D arrays, the result is a tuple of arrays of matching row and column indices showing the locations in the original array.
In [ ]:
x = np.random.random((3, 3))
print x
In [ ]:
print np.where(x < 0.3)
In [ ]:
print x[x < 0.3]
In [ ]:
print x[np.where(x < 0.3)]
Another useful trick (poker, anyone?):
In [ ]:
i = np.arange(6)
print "Original order:",i
np.random.shuffle(i)
print "Randomly shuffled:",i
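For instance (Python 3 syntax, a hypothetical card deal): shuffle 52 card indices in place and take the top five:

```python
import numpy as np

deck = np.arange(52)        # one index per card
np.random.shuffle(deck)     # shuffles in place
hand = deck[:5]             # "deal" the top five cards
print(hand)
```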
This example introduces some of the image processing capabilities available with NumPy and the SciPy ndimage package. More extensive documentation and tutorials can be found through the SciPy Lectures series.
Reading in and manipulating image data can be accomplished with NumPy or with a dedicated image library such as the "Python Imaging Library" (PIL).
For example, we might want to read in a beautiful image of a star field taken by the Hubble Space Telescope. Have a look at some of these images:
Recall that color images are just arrays of numbers representing pixel locations and the color depth of each in Red, Green, and Blue. The different file formats (jpg, png, tiff, etc.) also contain additional information (headers, for example) that an image-reading function needs to know about. We can use the SciPy ndimage library to read image data into NumPy arrays. If we want to fetch a file off the web, we also need some help from the requests and StringIO libraries:
In [ ]:
import scipy.ndimage as ndi
import requests
from StringIO import StringIO
In [ ]:
#Pick an image from the list above and fetch it with requests.get
#The default picture here is of M45 - the Pleiades Star Cluster.
response = requests.get("http://imgsrc.hubblesite.org/hu/db/images/hs-2004-20-a-large_web.jpg")
pic = ndi.imread(StringIO(response.content))
We can plot the image using matplotlib:
In [ ]:
%pylab inline
import matplotlib.pyplot as plt
In [ ]:
plt.imshow(pic);
We can examine the image properties:
In [ ]:
print pic.shape
Pixel coordinates in the plot are (column, row), while the array itself is indexed as [row, column].
Colors are represented by RGB triples. Black is (0,0,0), White is (255, 255, 255) or (0xFF, 0xFF, 0xFF) in hexadecimal. Think of it as a color cube with the three axes representing the different possible colors. The furthest away from the origin (black) is white.
In [ ]:
#Color array [R,G,B] of very first pixel
print pic[0,0]
We could write code to find the brightest pixel in the image, where "brightest" means highest value of R+G+B. For the 256 color scale, the greatest possible value is 3 * 255, or 765. One way to do this would be to write a set of nested loops over the pixel dimensions, calculating the sum R+G+B for each pixel, but that would be rather tedious and slow.
We could process the information faster if we take advantage of the speedy NumPy slicing, aggregates, and ufuncs. Remember that any time we can eliminate interpreted Python loops we save a lot of processing time.
In [ ]:
#find value of max pixel with aggregates
print pic.sum(axis=2).max() #numbering from 0, axis 2 is the color depth
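To recover where the brightest pixel sits, np.unravel_index converts the flat index from argmax back into (row, column) form; a sketch on a small synthetic image (Python 3 syntax, since the real pic above is fetched from the web):

```python
import numpy as np

# a tiny synthetic "image" with one bright pixel planted at row 2, column 3
img = np.zeros((4, 5, 3), dtype=np.uint8)
img[2, 3] = (255, 255, 250)

brightness = img.sum(axis=2)                   # R+G+B for every pixel at once
row, col = np.unravel_index(brightness.argmax(), brightness.shape)
print(row, col)
```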
Now that we know how to read in the image as a NumPy array, let's count the stars above some threshold brightness. Start by converting the image to B/W, so that it is unambiguous which pixels belong to stars and which don't. We'll use black for stars and white for background, since it's easier to see black-on-white than the reverse.
In [ ]:
def monochrome(pic_array, threshold):
    """replace the RGB values in the loaded image with either
    black or white depending on whether its total
    luminance is above or below some threshold
    passed in by the user"""
    mask = (pic_array.sum(axis=2) >= threshold) #could also be done in one step
    pic_array[mask] = 0 #BLACK - broadcasting at work here
    pic_array[~mask] = 255 #WHITE - broadcasting at work here
    return
In [ ]:
#Get another copy to convert to B/W
bwpic = ndi.imread(StringIO(response.content))
#This threshold is a scalar, not an RGB triple
#We're looking for pixels whose total color value is 600 or greater
monochrome(bwpic,200+200+200)
plt.imshow(bwpic);
The way to count the features (stars) in the image is to identify "blobs" of connected or adjacent black pixels.
We need to decide what we mean by "adjacent": here we take it to mean sharing sides of square pixels but not corners, i.e. pixels directly above, below, left, or right of each other, not pixels touching only diagonally.
Each blob should be counted just once.
How do we tell whether the pixel we're looking at is part of a new blob or not?
In the following illustration, we mark counted black pixels by turning them red. If the pixel we're looking at touches one that has already been turned red, then it's part of a blob we have already counted. We'll turn it red to show that we have looked at it, but we won't count it as a star, since it belongs to a star we've already counted.
A traditional implementation of this algorithm using plain Python loops is presented in the Multimedia Programming lesson from Software Carpentry. I use that when I teach introductory programming to physics students. Let's skip ahead to see how to implement such an algorithm much more efficiently using numpy and scipy.ndimage.
The scipy.ndimage.label function will use a structuring element (cross-shaped by default) to search for features. As an example, consider the simple array:
In [ ]:
a = np.array([[0,0,1,1,0,0],
              [0,0,0,1,0,0],
              [1,1,0,0,1,0],
              [0,0,0,1,0,0]])
There are four unique features here, if we only count those that have neighbors along a cross-shaped structuring element.
In [ ]:
labeled_array, num_features = ndi.label(a)
In [ ]:
print(num_features)
In [ ]:
print(labeled_array)
If we wish to consider elements connected on the diagonal, as well as the cross structure, we define a new structuring element:
In [ ]:
s = [[1,1,1],
     [1,1,1],
     [1,1,1]]
#Note that scipy.ndimage.generate_binary_structure(2,2) would also do the same thing.
print s
Label the image using the new structuring element:
In [ ]:
labeled_array, num_features = ndi.label(a, structure=s)
In [ ]:
print(num_features)
Note that features 1, 3, and 4 from above are now considered a single feature.
In [ ]:
print(labeled_array)
Let's use ndi.label to count up the stars in our B/W starfield image.
In [ ]:
labeled_array, num_stars = ndi.label(~bwpic) #Count and label the complement
In [ ]:
print num_stars
In [ ]:
plt.imshow(labeled_array);
label returns an array of the same shape as the input, where each unique feature has a unique value, so if you want the indices of the features you can use a list comprehension to extract them. Something like:
label_indices = [(labeled_array[:,:,0] == i).nonzero() for i in xrange(1, num_stars+1)]
Or use the ndi.find_objects function, which returns a tuple of feature locations as slices; these give the general location of each star but not necessarily its exact shape.
In [ ]:
locations = ndi.find_objects(labeled_array)
print locations[9]
In [ ]:
label_indices = [(labeled_array[:,:,0] == i).nonzero() for i in xrange(1, num_stars+1)]
print label_indices[9]
Let's change the color of the largest star in the field to red. To find the largest star, look at the lengths of the arrays stored in label_indices.
In [ ]:
star_sizes = [(label_indices[i-1][0]).size for i in xrange(1, num_stars+1)]
print len(star_sizes)
In [ ]:
biggest_star = np.where(star_sizes == np.max(star_sizes))[0]
print biggest_star
In [ ]:
print star_sizes[biggest_star]
In [ ]:
bwpic[label_indices[biggest_star][0],label_indices[biggest_star][1],:] = (255,0,0)
plt.imshow(bwpic);
For fun and practice, you can use these tricks to analyze the whole set of pictures listed at the top of this section. Suggested activities include:
Count the stars in each image with the scipy.ndimage.label function.
Use the scipy.ndimage.center_of_mass function to identify the "location" of each star in pixel coordinates.
Use NumPy's I/O functions to write out a text file containing the results.
NumPy offers convenient and fast numerical data manipulation with a large body of support libraries. Use it with large datasets to very quickly slice through the information and extract meaningful results.
The advantage of using NumPy is all about moving loops into compiled code:
Use NumPy ufuncs to your advantage (eliminate loops!)
Use NumPy aggregates to your advantage (eliminate loops!)
Use NumPy broadcasting to your advantage (eliminate loops!)
Use NumPy slicing and masking to your advantage (eliminate loops!)
For more details and documentation, consult the SciPy lecture series or the NumPy/SciPy documentation.
For an interesting look at how NumPy stacks up against traditional for loops and Python list comprehensions, see Jess Hamrick's nice blog post The Demise of For Loops from 2012.