Learning how to use the NumPy library to do efficient array calculations.
Getting hands-on experience with timeit to compare performance for different formulations of the same calculation.
Getting hands-on experience with matplotlib to visualize array data.
Scientists write a lot of software to simulate physical phenomena like diffusion-limited aggregation, but a lot more to crunch numbers. From the matrices used to calculate the stress and strain on a bridge to the tables used in statistics, the bulk of scientific data lives in arrays of one kind or another.
We can always manipulate arrays using loops, as in this function that calculates the dot product of two vectors:
In [3]:
def dot(left, right):
    '''Calculate dot product of two equal-sized vectors.'''
    assert len(left) == len(right), 'Vector lengths unequal: {0} vs. {1}'.format(len(left), len(right))
    result = 0.0
    for i in range(len(left)):
        result += left[i] * right[i]
    return result
However, writing code like this is inefficient in two senses. First, the same handful of matrix operations come up so often that it's worth developing a special notation for them. Second, because those operations are common, it's worth investing time in optimizing their performance. Thousands of software developers have done exactly that over fifty years, producing libraries that are much faster, and much more reliable, than anything a single person could develop. These libraries are typically written in low-level languages like Fortran and C, and then wrapped up in MATLAB or Python to make them easier to use.
This lesson explores the capabilities of these libraries. We use Python's NumPy library for our examples, but all of the ideas can be used in other languages as well.
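As a preview of what those libraries buy us, the hand-written dot function above can be replaced by a single call to NumPy's built-in np.dot. (This is just a sketch to whet your appetite; we'll introduce np.array properly in a moment.)
In [ ]:
import numpy as np
left = np.array([1.0, 2.0, 3.0])
right = np.array([4.0, 5.0, 6.0])
print np.dot(left, right)   # 1*4 + 2*5 + 3*6 = 32.0, the same answer dot(left, right) gives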
Let's start by making a vector containing three values:
In [4]:
import numpy as np
values = [1, 2, 3]
vector = np.array(values)
print vector
By convention, most programmers import numpy under the name np, since they're going to be typing it a lot. We then ask the library to create an array containing three values, which we then print.
Let's take a closer look at that vector:
In [5]:
print 'vector type is', type(vector)
print 'vector shape is', vector.shape
print 'vector data type is', vector.dtype
vector is not a list: it's a special kind of object called an N-dimensional array (or ndarray). When we ask for its shape, we get back a tuple showing how large it is along each axis. In this case, that tuple contains only the value 3, because our vector has three elements along its first (and only) axis. The property vector.dtype tells us that the elements of the vector are all 64-bit integers. We'll see why all of an array's elements must be of the same type when we look at performance below.
We can create arrays whose elements have a different data type by providing initial values of that type. If we provide a mix of types, NumPy converts everything to the most general of those types. For example, if we provide a mix of integers and floats, we get an array of floats:
In [6]:
vector = np.array([1., 2, 3]) # one float
print 'vector mixing floats and integers', vector
print 'data type', vector.dtype
We can force everything to be a particular data type by providing an optional argument:
In [7]:
vector = np.array([1, 2, 3], dtype=np.float32)
print 'forcing integer values to be 32-bit floats gives', vector, 'of type', vector.dtype
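If we already have an array and want its values in another type, the ndarray's astype method returns a converted copy. (This is a side note for completeness; we won't need it again in this lesson.)
In [ ]:
vector = np.array([1, 2, 3])
as_float = vector.astype(np.float64)   # a converted copy; the original array is unchanged
print 'converted values', as_float, 'now have type', as_float.dtype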
Of course, we won't normally type in all our data. Instead, we will either construct arrays in stereotyped ways, or read data from files. Here are some examples of the first approach:
In [8]:
z = np.zeros((5, 3))
print z
In [9]:
print np.ones((5, 3))
In [10]:
print np.identity(5)
These examples show two features of NumPy that are important to keep straight. The first is that array dimensions are written as a single tuple containing one or more sizes: np.zeros and np.ones take (5, 3) as a single argument, rather than 5 and 3 separately. The second thing to note is the way arrays are displayed. The shape (5, 3) doesn't mean "5 elements along the X axis, and 3 along the Y". Instead, it means "5 along the first axis, and 3 along the second". This is called row-major order, and when we print the array, NumPy shows us 5 sub-arrays, each containing 3 elements.
And yes, this means that we can create multi-dimensional arrays by giving NumPy lists of lists:
In [11]:
rectangle = np.array([ [11, 22], [33, 44], [55, 66] ])
print rectangle
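As a quick check on the shape rule described above, we can ask this array how big it is along each axis:
In [ ]:
print 'rectangle.shape is', rectangle.shape   # (3, 2): 3 along the first axis, 2 along the second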
There are two more things we need to explore before we start using arrays to solve real scientific problems. The first is the rules governing what data is copied when. Here's an experiment:
In [12]:
first = np.ones((2, 2))
print 'first is', first
second = first
second[0, 0] = 9
print 'second is', second
print '...and first is now', first
As the figure below shows, the expression second = first doesn't copy data. Instead, it makes the variable second point at the same array as first, so that when we assign something to an element of second, the change shows up in first as well.
FIXME: diagram
This is called aliasing, and NumPy does it because arrays are often very large: copying an array every time a new variable referred to it would quickly eat up both time and memory, and most of the time we only want to read the data anyway.
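We can see that only one array is involved by asking Python whether the two names refer to the same object (a small check using Python's built-in is operator):
In [ ]:
print 'second is first?', second is first   # True: both names point at the same array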
If we want to copy data so that we can safely make changes, we can do that explicitly using the array object's copy method:
In [13]:
third = first.copy()
third[:, :] = 1234
print 'third is', third
print 'but first is', first
FIXME: diagram
The second thing we need to explore before using arrays is the question of performance. Let's create a list of a million ones, and an array with the same data:
In [14]:
million = 1000000
as_list = [1] * million
as_array = np.ones((million,))
We can now use IPython's timeit command to see how long it takes to sum up the values in the list:
In [15]:
%%timeit
sum(as_list)
and in the array:
In [16]:
%%timeit
np.sum(as_array)
The array version is much faster because all of an array's values are stored in one uniform, densely packed block of memory (which is why every element must have the same data type), so NumPy can sweep over them with low-level compiled code instead of interpreting a Python loop one element at a time. That combination of uniform storage and low-level code is the main benefit of using arrays for numerical work.
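One way to see that uniform storage (just an illustration) is to ask the array how much memory each element, and the whole array, occupies:
In [ ]:
print 'each element occupies', as_array.itemsize, 'bytes'
print 'the whole array occupies', as_array.nbytes, 'bytes'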
The other way to create arrays is to read data from files. NumPy has functions to handle a wide variety of common formats, including comma-separated values (CSV). Suppose we have a file called patients.csv that contains normalized white blood cell counts for 60 people during the 40 days after contact with someone carrying drug-resistant tuberculosis (DRTB). We can use shell commands to look at how large the file is, and at the first few lines in it:
In [17]:
!wc -l patients.csv
!head -5 patients.csv
Let's load that data file using NumPy's loadtxt function. As you can probably guess, the optional delimiter argument tells it that values are separated by commas:
In [18]:
patients = np.loadtxt('patients.csv', delimiter=',')
print 'patients.shape is', patients.shape
print patients
Now that our data is in an array, we can index it using all the same shortcuts we introduced earlier for image arrays. For example, we can look at the values for a single patient:
In [19]:
p0 = patients[0, :]
print 'record for patient 0 contains', len(p0), 'elements'
print p0
Or at the white cell counts at t0 for all patients:
In [20]:
t0 = patients[:, 0]
print 'data for time 0 contains', len(t0), 'elements'
print t0
We can now start using the dozens of functions in the NumPy library to do things like find the average white cell count for all patients across the entire 40 days:
In [21]:
print 'overall average white cell count is', np.mean(patients)
A more meaningful statistic is probably the mean white cell count over time across all our patients. To get this, we simply tell NumPy which axis we want it to sweep over. In this case, that is axis 0, because that's the one that distinguishes patients from each other:
In [22]:
mean_over_time = np.mean(patients, 0)
print mean_over_time
We can check that we've done the calculation along the right axis by looking at the size of our result:
In [23]:
print 'mean_over_time has', len(mean_over_time), 'entries'
Similarly, we can calculate the average white cell count for each patient over all time by asking NumPy to calculate the mean over axis 1:
In [24]:
print 'mean for each patient is\n', np.mean(patients, 1)
In many cases, we won't want to calculate statistics on all of our values, but only on those that meet certain criteria. For example, let's see which patients had a normalized white cell count of 0 on the first day of exposure:
In [25]:
print patients[:, 0] == 0
It looks like they all did; we can check using np.all, which tells us whether all the elements in an array are true:
In [26]:
print 'Are all patients initially uninfected?', np.all(patients[:, 0] == 0)
Let's try that again for day 1:
In [27]:
print patients[:, 1] > 0
print 'Are all patients uninfected on day 1?', np.all(patients[:, 1] == 0)
print 'How many patients are uninfected on day 1?', np.sum(patients[:, 1] == 0)
The last line gives us the right answer because NumPy treats True as 1 and False as 0 when doing arithmetic.
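A tiny example makes the rule clear:
In [ ]:
flags = np.array([True, False, True, True])
print 'sum of flags is', np.sum(flags)   # 3, because each True counts as 1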
But we can do more with these values than add them up; we can also use them to select the rows we want from our array:
In [28]:
sample = patients[ patients[:, 1] > 0 ]
print 'We have selected', sample.shape, 'elements from patients'
print sample
As you probably expect, NumPy takes our 60 True and False values, lines them up with the major axis of patients, and gives us just those rows where the mask is True. We can then do more arithmetic with this sub-array, such as finding the maximum white cell count for those people who were showing signs of infection on day 1:
In [29]:
maxima_1 = np.max(sample, 1)
print maxima_1
and then calculate the average maximum cell count for those people:
In [30]:
ave_1 = np.average(maxima_1)
print 'average maximum count (infected on day 1)', ave_1
and compare it with the average maximum cell count for people who weren't showing signs of infection on day 1:
In [31]:
print 'average maximum count (clear on day 1)', np.average(np.max(patients[ patients[:, 1] == 0 ], 1))
The code above highlights the simultaneous strength and weakness of using array operators. On the one hand, we can write a single expression that calculates the same result as this:
In [32]:
total = 0.0
num = 0
for p in range(60):
    if patients[p, 1] == 0:
        max_count = 0
        for t in range(40):
            if patients[p, t] > max_count:
                max_count = patients[p, t]
        total += max_count
        num += 1
print 'result', total / num
On the other hand, the expression:
np.average(np.max(patients[ patients[:, 1] == 0 ], 1))
does take a bit of practice to read, and it's very easy to fail to notice the difference between == and !=, or between axis 0 and axis 1, when they're buried inside a complex expression.
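One way to reduce that risk (a stylistic suggestion rather than anything new) is to give the intermediate results names, so that each step can be checked on its own:
In [ ]:
clear_on_day_1 = patients[ patients[:, 1] == 0 ]
max_per_patient = np.max(clear_on_day_1, 1)
print 'average maximum count (clear on day 1)', np.average(max_per_patient)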
The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of Python's matplotlib library here.
First, let's tell the IPython Notebook that we want our plots displayed inline, rather than in a separate viewing window:
In [33]:
%matplotlib inline
Next, we will import the pyplot module from matplotlib under the alias plt. (Again, most programmers alias it this way to save themselves typing.) pyplot provides a simplified interface to matplotlib's capabilities, and allows us to do things like simple heat maps with just a couple of lines of code:
In [34]:
from matplotlib import pyplot as plt
plt.imshow(patients)
plt.show()
This plot clearly shows how the white cell counts rise and fall for our patients over the 40-day period. In fact, they seem to rise and fall quite predictably. Let's take a look at the average degree of infection over time:
In [35]:
n_patients, n_days = patients.shape
dates = range(n_days)
ave_infection = np.average(patients, 0)
plt.plot(dates, ave_infection)
plt.show()
The graph above is surprisingly regular. Let's try looking at the maximum cell count per day across all our patients:
In [36]:
plt.plot(dates, np.max(patients, 0))
plt.show()
Whoops: that's more than surprising, it's downright suspicious. What does the minimum look like?
In [37]:
plt.plot(dates, np.min(patients, 0))
plt.show()
Again, that is suspiciously regular. In fact, as you may already have guessed, our patient data was actually produced by a random number generator producing uniform values between D/4 and D, where D ramps up and down uniformly from the start to the end of our 40-day period. If we were reviewing a paper that used this data, now would be the time to notify the editor of our suspicions.
If we're going to do that, though, we probably ought to tidy up our plots. First, we'll put them side by side:
In [46]:
plt.subplot(1, 3, 1)
plt.plot(dates, ave_infection)
plt.subplot(1, 3, 2)
plt.plot(dates, np.max(patients, 0))
plt.subplot(1, 3, 3)
plt.plot(dates, np.min(patients, 0))
plt.show()
As its name suggests, subplot creates a new sub-plot: all subsequent plotting commands apply to it until a new sub-plot is created. The three arguments tell the library how many rows and columns of sub-plots we want, and which sub-plot this is.
This figure has all our information, but it looks a bit squashed, and it would be helpful to have some titles. Let's try again:
In [58]:
plt.figure(figsize=(8.0, 3.0))
plt.subplot(1, 3, 1)
plt.xlabel('date')
plt.ylabel('average')
plt.plot(dates, ave_infection)
plt.subplot(1, 3, 2)
plt.xlabel('date')
plt.ylabel('maximum')
plt.plot(dates, np.max(patients, 0))
plt.subplot(1, 3, 3)
plt.xlabel('date')
plt.ylabel('minimum')
plt.plot(dates, np.min(patients, 0))
plt.tight_layout()
plt.show()
This time we start by creating a figure that is 8.0 units wide and 3.0 units high. The xlabel and ylabel commands put labels on the axes, and the tight_layout command makes sure there's enough space between the sub-plots that the vertical axis labels don't overlap their neighbors.
For another example of number crunching in Python, let's look at the Lotka-Volterra equation. First proposed in the 1920s, it models the interactions between predators and prey. When predators are scarce, prey breed rapidly; as more prey become available, the predator population increases; and as the number of predators increases, prey become scarcer, so the predator population peaks and falls. In mathematical form, this is:
$$\frac{du}{dt} = au - buv$$
$$\frac{dv}{dt} = -cv + dbuv$$
where:
u is the number of prey (rabbits),
v is the number of predators (foxes),
a is the natural growth rate of prey in the absence of predators,
b is the rate at which predators kill prey,
c is the natural death rate of predators in the absence of prey, and
d is the rate at which killed prey are converted into new predators.
Let's set up our equations. If X is the pair [u, v], i.e., the prey and predator populations, then the equation of change is:
In [93]:
a = 1.
b = 0.1
c = 1.5
d = 0.75

def dX_dt(X, t):
    return np.array([ a*X[0] - b*X[0]*X[1], -c*X[1] + d*b*X[0]*X[1] ])
We have to create the state variable X in order to use the numerical integration function we'll introduce in a moment. Similarly, the function that calculates the derivative, dX_dt, has to take the state and the time as parameters, even though it doesn't use the latter.
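Before integrating, it's worth a quick sanity check that the derivative behaves sensibly for the starting populations we're about to use (20 prey and 4 predators):
In [ ]:
print dX_dt(np.array([20, 4]), 0.0)   # prey growing at +12 per unit time, predators momentarily steady at 0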
We'll integrate our equation from t=0 to t=20 in 2000 steps, so we need a vector of time values, which we can create using NumPy's linspace function:
In [94]:
t = np.linspace(0, 20, 2000)
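A quick look confirms that this gives us evenly spaced times running from 0 to 20:
In [ ]:
print 'first time', t[0], ', last time', t[-1], ', number of times', len(t)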
Let's start with 20 rabbits and 4 foxes:
In [95]:
X_initial = np.array([20, 4])
and then integrate numerically using odeint (the ordinary differential equation integrator) from the SciPy library:
In [96]:
from scipy import integrate
X = integrate.odeint(dX_dt, X_initial, t)
X is now 2000 pairs of numbers representing the prey and predator populations at each time t.
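We can confirm that by checking the result's shape:
In [ ]:
print 'X has shape', X.shape   # (2000, 2): one row per time value, one column per population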
Let's plot both populations over time:
In [97]:
prey = X[:, 0]
predators = X[:, 1]
fig = plt.figure()
plt.plot(t, prey, 'r-', label='Prey')
plt.plot(t, predators, 'b-', label='Predators')
plt.grid()
plt.legend(loc='best')
plt.xlabel('time')
plt.ylabel('population')
plt.title('Evolution of predator and prey populations')
plt.show()
Python's plotting libraries can do a lot more than what we've shown here: the matplotlib gallery has dozens of examples, and more are being added all the time. The most important thing when using it, or any other part of any matrix library in any language, is this:
If you're writing a loop, you're probably doing it wrong.
One of the most common sources of error when using NumPy, or when dealing with numbers in any programming language for that matter, is associated with floating point arithmetic. Floating-point numbers are represented in computer hardware as base 2 (binary) fractions. For example, the decimal fraction 0.125 has value 1/10 + 2/100 + 5/1000, and in the same way the binary fraction 0.001 has value 0/2 + 0/4 + 1/8. These two fractions have identical values, the only difference being that the first is written in base 10 fractional notation, and the second in base 2.
Unfortunately, most numbers cannot be represented exactly as fractions in any base, decimal or binary. As a consequence, the decimal floating-point numbers you enter are generally only approximated by the binary floating-point numbers actually stored in the machine. It's easy to forget that the stored value is an approximation, because of the way that floats are displayed at the interpreter prompt. Python only prints a decimal approximation to the true decimal value of the binary approximation stored by the machine. If Python were to print the exact decimal value of the binary approximation stored for 0.1, it would have to display
0.1000000000000000055511151231257827021181583404541015625
That is more digits than most people find useful, so Python keeps the number of digits manageable by displaying a rounded value instead:
In [7]:
0.1
Out[7]:
0.1
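If you're curious, you can ask for more digits explicitly and see the value that is actually stored; this is just a string-formatting trick, not a NumPy feature:
In [ ]:
print '%.55f' % 0.1   # 0.1000000000000000055511151231257827021181583404541015625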
The small errors introduced by these binary approximations are usually negligible, but can occasionally cause bugs that are very difficult to track down. Consider the following function, for which a quick test has been executed:
In [2]:
def add_one_tenth(input_data):
    """Add 0.1 to each value of an array"""
    return np.array(input_data) + 0.1
In [3]:
test_data = add_one_tenth([0.1, 0.2, 0.3, 0.4])
expected_answer = np.array([0.2, 0.3, 0.4, 0.5])
print test_data == expected_answer
The unexpected False arises because the binary approximation of 0.1 + 0.2 doesn't exactly equal 0.3:
In [6]:
0.1 + 0.2
Out[6]:
0.30000000000000004
This result brings us to the first rule of floating point arithmetic: never use == (or !=) on floating point numbers. Instead, it is better to check whether two values are within some tolerance, and if they are, treat them as equal.
NumPy provides some very convenient tools for doing this, which typically allow you to define the tolerance in terms of the absolute and/or relative error. The absolute error (atol) in some approximation is simply the absolute value of the difference between the actual value and the approximation. The relative error (rtol), on the other hand, is the ratio of the absolute error to the value we're approximating. For example, if we're off by 1 in approximating 8+1 and 56+1, the absolute error is the same in both cases, but the relative error in the first case is 1/9 = 11%, while in the second case it is only 1/57 = 1.7%. When we're thinking about floating point numbers, relative error is almost always more useful than absolute error. After all, it makes little sense to say that we're off by a hundredth when the value in question is a billionth.
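To make those numbers concrete, here is the arithmetic for the first case spelled out:
In [ ]:
actual = 8 + 1
approximation = actual + 1                       # off by 1
absolute_error = abs(approximation - actual)
relative_error = absolute_error / float(actual)
print 'absolute error', absolute_error, 'relative error', relative_error   # 1 and roughly 0.11 (11%)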
Let's now go ahead and write a better test for add_one_tenth. The np.allclose function tests whether two arrays are element-wise equal within a specified tolerance. More specifically, np.allclose returns True if the following comparison is element-wise True:
absolute(a - b) <= (atol + rtol * absolute(b))
In [8]:
np.allclose(test_data, expected_answer, rtol=1e-05, atol=0.0)
Out[8]:
True
Of course, in many cases we want to do more with floating point numbers than just test equality. For instance, we might want to do some simple trigonometry:
In [13]:
def arccos_to_deg(values):
    """Calculate the arccosine and give result in degrees (not radians)"""
    answer_radians = np.arccos(values)
    return np.rad2deg(answer_radians)
In [14]:
arccos_to_deg(1.0)
Out[14]:
0.0
As expected, an input value of 1.0 gives the answer 0.0. However, what if we had derived the input value of 1.0 from an expression like the following?
In [24]:
(0.1 + 0.2) / 0.3
Out[24]:
1.0000000000000002
The relative error associated with this expression is very small, but it's significant because arccosine is only defined over the range [-1.0, 1.0]. If we pass this derived input value to our simple trigonometric function, it now returns nan. This is the standard way computers represent an undefined or unrepresentable value (and stands for "not a number").
In [17]:
a = (0.1 + 0.2) / 0.3
arccos_to_deg(a)
Out[17]:
NumPy doesn't have a tool to cover every type of floating point arithmetic error that you'll ever come across, so in this instance we'll have to write our own function to fix the problem.
In [28]:
def approx_one(values, rtol=1e-05, atol=0.0):
    """Take values close to 1.0 and return exactly 1.0"""
    threshold = atol + rtol * np.absolute(values)
    diff = np.array(values) - 1.0
    return np.where(np.absolute(diff) < threshold, 1.0, values)

arccos_to_deg(approx_one(a))
Out[28]:
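Because approx_one is built from element-wise NumPy operations, the same function works just as well on whole arrays as on single values; for example:
In [ ]:
messy = np.array([(0.1 + 0.2) / 0.3, 0.5, 1.0])
print approx_one(messy)   # the first and last values are snapped to exactly 1.0; 0.5 is left alone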
The moral of the story here is that floating point issues are often unique to the problem that you're working on. However, if you're aware of the common tools (e.g. np.allclose) and techniques (e.g. never use ==) for dealing with them, you can almost always come up with a solution.
FIXME