Basic Programming Using Python: Number Crunching

Objectives

Using the NumPy library to learn efficient

  • Vector and Matrix manipulation
  • Matrix notation

Getting hands-on experience with

  • determining and setting datatypes for matrix elements (float, integer, 32 and 64 bit)
  • constructing matrices in stereotyped ways (zeros, ones, identity)
  • indexing and changing individual matrix values
  • the difference between row-major and column-major order
  • copying vs. aliasing matrix data
  • using timeit to compare performance for different formulations of the same calculation
  • slicing and masking matrices of data to get scalar statistics via compact queries
  • visualizing more complex data with matplotlib to gain insight
  • common mistakes associated with floating point numbers and the tools/strategies to overcome them

Lesson

Scientists write a lot of software to simulate physical phenomena like diffusion-limited aggregation, but a lot more to crunch numbers. From the matrices used to calculate the stress and strain on a bridge to the tables used in statistics, the bulk of scientific data lives in arrays of one kind or another.

We can always manipulate arrays using loops, as in this function that calculates the dot product of two vectors:


In [3]:
def dot(left, right):
    '''Calculate dot product of two equal-sized vectors.'''
    assert len(left) == len(right), 'Vector lengths unequal: {0} vs. {1}'.format(len(left), len(right))
    result = 0.0
    for i in range(len(left)):
        result += left[i] * right[i]
    return result

However, writing code like this is inefficient in two senses. First, the same handful of matrix operations come up so often that it's worth developing a special notation for them. Second, because those operations are common, it's worth investing time in optimizing their performance. Thousands of software developers have done exactly that over the past fifty years, producing libraries that are much faster and much more reliable than anything a single person could develop. These libraries are typically written in low-level languages like Fortran and C, and then wrapped up in MATLAB or Python to make them easier to use.
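
To give a taste of what that looks like, the loop above collapses to a single call to NumPy's built-in np.dot function (this is just a preview; we introduce NumPy properly in the next section):

import numpy as np
left = np.array([1.0, 2.0, 3.0])
right = np.array([4.0, 5.0, 6.0])
print np.dot(left, right)    # 32.0, the same answer our hand-written dot() gives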

This lesson explores the capabilities of these libraries. We use Python's NumPy library for our examples, but all of the ideas can be used in other languages as well.

First Steps

Let's start by making a vector containing three values:


In [4]:
import numpy as np
values = [1, 2, 3]
vector = np.array(values)
print vector


[1 2 3]

By convention, most programmers import numpy under the name np, since they're going to be typing it a lot. We then ask the library to create an array containing three values and print it.

Let's take a closer look at that vector:


In [5]:
print 'vector type is', type(vector)
print 'vector shape is', vector.shape
print 'vector data type is', vector.dtype


vector type is <type 'numpy.ndarray'>
vector shape is (3,)
vector data type is int64

vector is not a list: it's a special kind of object called an N-dimensional array (or ndarray). When we ask for its shape, we get back a tuple showing how large it is along each axis. In this case, that tuple only has the value 3 because our vector has three elements along the first axis.

The property vector.dtype tells us that the elements of the vector are all 64-bit integers. We'll take a look at the end of this section at why all the array's elements must be of the same type.

We can create arrays whose elements have a different data type by providing initial values of that type. If we provide a mix of types, NumPy converts everything to the most general of those types. For example, if we provide a mix of integers and floats, we get an array of floats:


In [6]:
vector = np.array([1., 2, 3]) # one float
print 'vector mixing floats and integers', vector
print 'data type', vector.dtype


vector mixing floats and integers [ 1.  2.  3.]
data type float64

We can force everything to be a particular data type by providing an optional argument:


In [7]:
vector = np.array([1, 2, 3], dtype=np.float32)
print 'forcing integer values to be 32-bit floats gives', vector, 'of type', vector.dtype


forcing integer values to be 32-bit floats gives [ 1.  2.  3.] of type float32

Of course, we won't normally type in all our data. Instead, we will either construct arrays in stereotyped ways, or read data from files. Here are some examples of the first approach:


In [8]:
z = np.zeros((5, 3))
print z


[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]

In [9]:
print np.ones((5, 3))


[[ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]
 [ 1.  1.  1.]]

In [10]:
print np.identity(5)


[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]]

These examples show two features of NumPy that are important to keep straight. The first is that array dimensions are written as a single tuple containing 1 or more sizes: np.zeros and np.ones take (5,3) as a single argument, rather than 5 and 3 separately.

The second thing to note is the way arrays are displayed. The shape (5,3) doesn't mean "5 elements along the X axis, and 3 along the Y". Instead, it means "5 along the first axis, and 3 along the second". This is called row-major order, and when we print the array, NumPy shows us 5 sub-arrays, each containing 3 elements.

And yes, this means that we can create multi-dimensional arrays by giving NumPy lists of lists:


In [11]:
rectangle = np.array([ [11, 22], [33, 44], [55, 66] ])
print rectangle


[[11 22]
 [33 44]
 [55 66]]
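
Individual elements of a multi-dimensional array are read and written by indexing with a row and a column (both counted from zero). Here is a quick illustration using the rectangle array above:

print rectangle[0, 1]    # 22: row 0, column 1
rectangle[2, 0] = -55    # assign a new value to a single element in place
print rectangle[2, 0]    # -55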

There are two more things we need to explore before we start using arrays to solve real scientific problems. The first is the rules governing what data is copied when. Here's an experiment:


In [12]:
first = np.ones((2, 2))
print 'first is', first
second = first
second[0, 0] = 9
print 'second is', second
print '...and first is now', first


first is [[ 1.  1.]
 [ 1.  1.]]
second is [[ 9.  1.]
 [ 1.  1.]]
...and first is now [[ 9.  1.]
 [ 1.  1.]]

As the figure below shows, the expression second = first doesn't copy data. Instead, it makes the variable second point at the same array as first, so that when we assign something to an element of second, the change shows up in first as well.

FIXME: diagram

This is called aliasing, and NumPy does it because:

  • it's more efficient than copying data unnecessarily, and
  • that's what Python does in other cases (e.g., with lists).

If we want to copy data so that we can safely make changes, we can do that explicitly using the array object's copy method:


In [13]:
third = first.copy()
third[:, :] = 1234
print 'third is', third
print 'but first is', first


third is [[ 1234.  1234.]
 [ 1234.  1234.]]
but first is [[ 9.  1.]
 [ 1.  1.]]

FIXME: diagram

The second thing we need to explore before using arrays is the question of performance. Let's create a list of a million ones, and an array with the same data:


In [14]:
million = 1000000
as_list = [1] * million
as_array = np.ones((million,))

We can now use IPython's timeit command to see how long it takes to sum up the values in the list:


In [15]:
%%timeit
sum(as_list)


100 loops, best of 3: 8.13 ms per loop

and in the array:


In [16]:
%%timeit
np.sum(as_array)


1000 loops, best of 3: 1.1 ms per loop

The array version wins because NumPy stores all of its values in a single uniform block of memory, rather than as separate Python objects, and does the summation in optimized, low-level code. The benefit grows with the size of the data: the larger the array, the more we save by staying out of Python's interpreted loops.

Analyzing Patient Data

The other way to create arrays is to read data from files. NumPy has functions to handle a wide variety of common formats, including comma-separated values (CSV). Suppose we have a file called patients.csv that contains normalized white blood cell counts for 60 people during the 40 days after contact with someone carrying drug-resistant tuberculosis (DRTB). We can use shell commands to look at how large the file is, and the first few lines in it:


In [17]:
!wc -l patients.csv
!head -5 patients.csv


      60 patients.csv
0,0,1,3,1,2,4,7,8,3,3,3,10,5,7,4,7,7,12,18,6,13,11,11,7,7,4,6,8,8,4,4,5,7,3,4,2,3,0,0
0,1,2,1,2,1,3,2,2,6,10,11,5,9,4,4,7,16,8,6,18,4,12,5,12,7,11,5,11,3,3,5,4,4,5,5,1,1,0,1
0,1,1,3,3,2,6,2,5,9,5,7,4,5,4,15,5,11,9,10,19,14,12,17,7,12,11,7,4,2,10,5,4,2,2,3,2,2,1,1
0,0,2,0,4,2,2,1,6,7,10,7,9,13,8,8,15,10,10,7,17,4,4,7,6,15,6,4,9,11,3,5,6,3,3,4,2,3,2,1
0,1,1,3,3,1,3,5,2,4,4,7,6,5,3,10,8,10,6,17,9,14,9,7,13,9,12,6,7,7,9,6,3,2,2,4,2,0,1,1

Let's load that data file using NumPy's loadtxt function. As you can probably guess, the optional delimiter argument tells it that values are separated by commas:


In [18]:
patients = np.loadtxt('patients.csv', delimiter=',')
print 'patients.shape is', patients.shape
print patients


patients.shape is (60, 40)
[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ..., 
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]

Now that our data is in an array, we can index it using all the same shortcuts we introduced earlier for image arrays. For example, we can look at the values for a single patient:


In [19]:
p0 = patients[0, :]
print 'record for patient 0 contains', len(p0), 'elements'
print p0


record for patient 0 contains 40 elements
[  0.   0.   1.   3.   1.   2.   4.   7.   8.   3.   3.   3.  10.   5.   7.
   4.   7.   7.  12.  18.   6.  13.  11.  11.   7.   7.   4.   6.   8.   8.
   4.   4.   5.   7.   3.   4.   2.   3.   0.   0.]

Or at the white cell counts at t0 for all patients:


In [20]:
t0 = patients[:, 0]
print 'data for time 0 contains', len(t0), 'elements'
print t0


data for time 0 contains 60 elements
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.]

We can now start using the dozens of functions in the NumPy library to do things like find the average white cell count for all patients across the entire 40 days:


In [21]:
print 'overall average white cell count is', np.mean(patients)


overall average white cell count is 6.14875

A more meaningful statistic is probably the mean white cell count over time across all our patients. To get this, we simply tell NumPy which axis we want it to sweep over. In this case, that is axis 0, because that's the one that distinguishes patients from each other:


In [22]:
mean_over_time = np.mean(patients, 0)
print mean_over_time


[  0.           0.45         1.11666667   1.75         2.43333333   3.15
   3.8          3.88333333   5.23333333   5.51666667   5.95         5.9
   8.35         7.73333333   8.36666667   9.5          9.58333333
  10.63333333  11.56666667  12.35        13.25        11.96666667
  11.03333333  10.16666667  10.           8.66666667   9.15         7.25
   7.33333333   6.58333333   6.06666667   5.95         5.11666667   3.6
   3.3          3.56666667   2.48333333   1.5          1.13333333
   0.56666667]

We can check that we've done the calculation along the right axis by looking at the size of our result:


In [23]:
print 'mean_over_time has', len(mean_over_time), 'entries'


mean_over_time has 40 entries

Similarly, we can calculate the average white cell count for each patient over all time by asking NumPy to calculate the mean over axis 1:


In [24]:
print 'mean for each patient is\n', np.mean(patients, 1)


mean for each patient is
[ 5.45   5.425  6.1    5.9    5.55   6.225  5.975  6.65   6.625  6.525
  6.775  5.8    6.225  5.75   5.225  6.3    6.55   5.7    5.85   6.55
  5.775  5.825  6.175  6.1    5.8    6.425  6.05   6.025  6.175  6.55
  6.175  6.35   6.725  6.125  7.075  5.725  5.925  6.15   6.075  5.75
  5.975  5.725  6.3    5.9    6.75   5.925  7.225  6.15   5.95   6.275  5.7
  6.1    6.825  5.975  6.725  5.7    6.25   6.4    7.05   5.9  ]

In many cases, we won't want to calculate statistics on all of our values, but only on those that meet certain criteria. For example, let's see which patients had a normalized white cell count of 0 on the first day of exposure:


In [25]:
print patients[:, 0] == 0


[ True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True  True  True]

It looks like they all did; we can check using np.all, which tells us whether all the elements in an array are true:


In [26]:
print 'Are all patients initially uninfected?', np.all(patients[:, 0] == 0)


Are all patients initially uninfected? True

Let's try that again for day 1:


In [27]:
print patients[:, 1] > 0
print 'Are all patients uninfected on day 1?', np.all(patients[:, 1] == 0)
print 'How many patients are uninfected on day 1?', np.sum(patients[:, 1] == 0)


[False  True  True False  True False False False False  True  True  True
 False False  True  True False False False  True  True False False  True
  True False False False False False  True False  True False False False
  True  True  True  True False False  True False  True  True False  True
 False False  True  True False False  True False  True  True False False]
Are all patients uninfected on day 1? False
How many patients are uninfected on day 1? 33

The last line gives us the right answer because NumPy treats True as 1 and False as 0 when doing arithmetic. But we can do more with these values than add them up; we can also use them to select the rows we want from our array:


In [28]:
sample = patients[ patients[:, 1] > 0 ]
print 'We have selected', sample.shape, 'elements from patients'
print sample


We have selected (27, 40) elements from patients
[[ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 [ 0.  1.  1. ...,  0.  1.  1.]
 ..., 
 [ 0.  1.  2. ...,  0.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 [ 0.  1.  1. ...,  1.  1.  1.]]

As you probably expect, NumPy takes our 60 True and False values, lines them up along the major axis of patients, and gives us just those rows where the mask is True. We can then do more arithmetic with this sub-array, such as finding the maximum white cell count for those people who were showing signs of infection on day 1:


In [29]:
maxima_1 = np.max(sample, 1)
print maxima_1


[ 18.  19.  17.  18.  18.  18.  17.  18.  19.  19.  15.  17.  16.  15.  16.
  17.  19.  16.  16.  18.  16.  15.  14.  20.  17.  17.  19.]

and then calculate the average maximum cell count for those people:


In [30]:
ave_1 = np.average(maxima_1)
print 'average maximum count (infected on day 1)', ave_1


average maximum count (infected on day 1) 17.1851851852

and compare it with the average maximum cell count for people who weren't showing signs of infection on day 1:


In [31]:
print 'average maximum count (clear on day 1)', np.average(np.max(patients[ patients[:, 1] == 0 ], 1))


average maximum count (clear on day 1) 17.5757575758

The code above highlights the simultaneous strength and weakness of using array operators. On the one hand, we can write a single expression that calculates the same result as this:


In [32]:
total = 0.0
num = 0
for p in range(60):
    if patients[p, 1] == 0:
        max_count = 0
        for t in range(40):
            if patients[p, t] > max_count:
                max_count = patients[p, t]
        total += max_count
        num += 1
print 'result', total / num


result 17.5757575758

On the other hand, the expression:

np.average(np.max(patients[ patients[:, 1] == 0], 1))

does take a bit of practice to read, and it's very easy to fail to notice the difference between == and !=, or axis 0 versus axis 1, when they're buried inside a complex expression.
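
One way to make such expressions easier to read (and harder to get wrong) is to name the axis explicitly with a keyword argument:

clear_on_day_1 = patients[ patients[:, 1] == 0 ]
print np.average(np.max(clear_on_day_1, axis=1))    # same result: 17.5757575758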

Visualization

The mathematician Richard Hamming once said, "The purpose of computing is insight, not numbers," and the best way to develop insight is often to visualize data. Visualization deserves an entire lecture (or course) of its own, but we can explore a few features of Python's matplotlib here. First, let's tell the IPython Notebook that we want our plots displayed inline, rather than in a separate viewing window:


In [33]:
%matplotlib inline

Next, we will import the pyplot module from matplotlib under the alias plt. (Again, most programmers alias it this way to save themselves typing.) pyplot provides a simplified interface to matplotlib's capabilities, and allows us to do things like simple heat maps with just a couple of lines of code:


In [34]:
from matplotlib import pyplot as plt
plt.imshow(patients)
plt.show()


This plot clearly shows how the white cell counts rise and fall for our patients over the 40-day period. In fact, they seem to rise and fall quite predictably. Let's take a look at the average degree of infection over time:


In [35]:
n_patients, n_days = patients.shape
dates = range(n_days)
ave_infection = np.average(patients, 0)
plt.plot(dates, ave_infection)
plt.show()


The graph above is surprisingly regular. Let's try looking at the maximum cell count per day across all our patients:


In [36]:
plt.plot(dates, np.max(patients, 0))
plt.show()


Whoops: that's more than surprising, it's downright suspicious. What does the minimum look like?


In [37]:
plt.plot(dates, np.min(patients, 0))
plt.show()


Again, that is suspiciously regular. In fact, as you may already have guessed, our patient data was actually produced by a random number generator producing uniform values between D/4 and D, where D ramps up and down uniformly from the start to the end of our 40-day period. If we were reviewing a paper that used this data, now would be the time to notify the editor of our suspicions.
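
For the record, here is a sketch of how data with this signature could be produced. This is an assumed reconstruction for illustration only, not the actual program used to create patients.csv:

# hypothetical generator: D rises steadily, then falls, over 40 days,
# and each count is drawn uniformly between D/4 and D
n_days = 40
ramp = np.minimum(np.arange(n_days), np.arange(n_days)[::-1])
low, high = ramp / 4.0, ramp.astype(float)
fake_patient = low + np.random.rand(n_days) * (high - low)
print fake_patient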

If we're going to do that, though, we probably ought to tidy up our plots. First, we'll put them side by side:


In [46]:
plt.subplot(1, 3, 1)
plt.plot(dates, ave_infection)

plt.subplot(1, 3, 2)
plt.plot(dates, np.max(patients, 0))

plt.subplot(1, 3, 3)
plt.plot(dates, np.min(patients, 0))

plt.show()


As its name suggests, subplot creates a new sub-plot: all subsequent plotting commands apply to it until a new sub-plot is created. The three arguments tell the library how many rows and columns of sub-plots we want, and which sub-plot this is.

This figure has all our information, but it looks a bit squashed, and it would be helpful to have some titles. Let's try again:


In [58]:
plt.figure(figsize=(8.0, 3.0))

plt.subplot(1, 3, 1)
plt.xlabel('date')
plt.ylabel('average')
plt.plot(dates, ave_infection)

plt.subplot(1, 3, 2)
plt.xlabel('date')
plt.ylabel('maximum')
plt.plot(dates, np.max(patients, 0))

plt.subplot(1, 3, 3)
plt.xlabel('date')
plt.ylabel('minimum')
plt.plot(dates, np.min(patients, 0))

plt.tight_layout()
plt.show()


This time we start by creating a figure that is 8.0 inches wide and 3.0 inches high. The xlabel and ylabel commands put labels on the axes, and the tight_layout command makes sure there's enough space between the sub-plots so that the vertical labels don't overlap their neighbors.

The Lotka-Volterra Equation

For another example of number crunching in Python, let's look at the Lotka-Volterra equation. First proposed in the 1920s, it models the interactions between predators and prey. When predators are scarce, prey breed rapidly; as more prey become available, the predator population increases; and as the number of predators increases, prey become scarcer, so the predator population peaks and falls. In mathematical form, this is:

\begin{eqnarray} \frac{du}{dt} & = & au & - & buv \\ \frac{dv}{dt} & = & -cv & + & dbuv \\ \end{eqnarray}

where:

  • u and v are the number of prey and predators respectively;
  • a is the natural growth rate of prey when there are no predators;
  • b is the natural death rate of prey due to predation;
  • c is the natural death rate of predators when there are no prey; and
  • d describes how many prey have to be caught and eaten to produce a new predator.

Let's set up our equations. If X is the pair [u, v], i.e., the prey and predator populations, then the equation of change is:


In [93]:
a = 1.
b = 0.1
c = 1.5
d = 0.75
def dX_dt(X, t):
    '''Return the growth rates of the prey and predator populations.'''
    return np.array([ a*X[0] - b*X[0]*X[1] , -c*X[1] + d*b*X[0]*X[1] ])

We have to create the state variable X in order to use the numerical integration function we'll introduce in a moment. Similarly, the function that calculates the derivative, dX_dt, has to take the state and the time as parameters, even though it doesn't use the latter.

We'll integrate our equation from t=0 to t=20 in 2000 steps, so we need a vector of time values, which we can create using NumPy's linspace function:


In [94]:
t = np.linspace(0, 20, 2000)

Let's start with 20 rabbits and 4 foxes:


In [95]:
X_initial = np.array([20, 4])

and then integrate numerically using odeint (the ordinary differential equation integrator) from the SciPy library:


In [96]:
from scipy import integrate
X = integrate.odeint(dX_dt, X_initial, t)

X is now 2000 pairs of numbers representing the prey and predator populations at each of our time steps. Let's plot them using pyplot:


In [97]:
prey = X[:, 0]
predators = X[:, 1]
fig = plt.figure()
plt.plot(t, prey, 'r-', label='Prey')
plt.plot(t, predators, 'b-', label='Predators')
plt.grid()
plt.legend(loc='best')
plt.xlabel('time')
plt.ylabel('population')
plt.title('Evolution of predator and prey populations')
plt.show()


Python's plotting libraries can do a lot more than what we've shown here—the matplotlib gallery has dozens of examples, and more are being added all the time. The most important thing when using it, or any other part of any matrix library in any language, is this:

If you're writing a loop, you're probably doing it wrong.
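
As a small illustration of that rule, arithmetic on whole arrays replaces the explicit element-by-element loop we would write for a list:

values = np.arange(5)
print values * 2         # [0 2 4 6 8]: the multiplication is applied to every element at once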

Floating Point Arithmetic

One of the most common sources of error when using NumPy, or when dealing with numbers in any programming language for that matter, is associated with floating point arithmetic. Floating-point numbers are represented in computer hardware as base 2 (binary) fractions. For example, the decimal fraction

0.125

has value 1/10 + 2/100 + 5/1000, and in the same way the binary fraction

0.001

has value 0/2 + 0/4 + 1/8. These two fractions have identical values, the only difference being that the first is written in base 10 fractional notation, and the second in base 2.

Unfortunately, most numbers cannot be represented exactly as fractions in any base, decimal or binary. As a consequence, the decimal floating-point numbers you enter are generally only approximated by the binary floating-point numbers actually stored in the machine. It’s easy to forget that the stored value is an approximation, because of the way that floats are displayed at the interpreter prompt. Python only prints a decimal approximation to the true decimal value of the binary approximation stored by the machine. If Python were to print the exact decimal value of the binary approximation stored for 0.1, it would have to display

0.1000000000000000055511151231257827021181583404541015625

That is more digits than most people find useful, so Python keeps the number of digits manageable by displaying a rounded value instead:


In [7]:
0.1


Out[7]:
0.1
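
We can see the full stored value for ourselves using Python's decimal module, which can convert a float to its exact decimal equivalent (this works in Python 2.7 and later):

from decimal import Decimal
print Decimal(0.1)       # 0.1000000000000000055511151231257827021181583404541015625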

The small errors introduced by these binary approximations are usually negligible, but can occasionally cause bugs that are very difficult to track down. Consider the following function, for which a quick test has been executed:


In [2]:
def add_one_tenth(input_data):
    """Add 0.1 to each value of an array"""
    
    return np.array(input_data) + 0.1

In [3]:
test_data = add_one_tenth([0.1, 0.2, 0.3, 0.4])
expected_answer = np.array([0.2, 0.3, 0.4, 0.5])

print test_data == expected_answer


[ True False  True  True]

The unexpected False arises because the binary approximation of 0.1 + 0.2 doesn't exactly equal 0.3:


In [6]:
0.1 + 0.2


Out[6]:
0.30000000000000004

This result brings us to the first rule of floating point arithmetic: never use == (or !=) on floating point numbers. Instead, it is better to check whether two values are within some tolerance, and if they are, treat them as equal.

NumPy provides some very convenient tools for doing this, which typically allow you to define the tolerance in terms of the absolute and/or relative error. The absolute error (atol) in some approximation is simply the absolute value of the difference between the actual value and the approximation. The relative error (rtol), on the other hand, is the ratio of the absolute error to the value we're approximating. For example, if we're off by 1 in approximating 8+1 and 56+1, the absolute error is the same in both cases, but the relative error in the first case is 1/9 = 11%, while in the second case it is only 1/57 = 1.7%. When we're thinking about floating point numbers, relative error is almost always more useful than absolute error. After all, it makes little sense to say that we're off by a hundredth when the value in question is a billionth.

Let's now write a better test for add_one_tenth. The np.allclose function tests whether two arrays are element-wise equal within a specified tolerance. More specifically, np.allclose returns True if the following inequality holds for every pair of elements:

absolute(a - b) <= (atol + rtol * absolute(b)) 

In [8]:
np.allclose(test_data, expected_answer, rtol=1e-05, atol=0.0)


Out[8]:
True
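
If we want to see which individual elements pass the test, np.isclose (available in more recent versions of NumPy) performs the same comparison element-wise:

print np.isclose(test_data, expected_answer, rtol=1e-05, atol=0.0)    # [ True  True  True  True]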

Of course, in many cases we want to do more with floating point numbers than just test equality. For instance, we might want to do some simple trigonometry:


In [13]:
def arccos_to_deg(values):
    """Calculate the arccosine and give result in degrees (not radians)"""
    
    answer_radians = np.arccos(values)
    
    return np.rad2deg(answer_radians)

In [14]:
arccos_to_deg(1.0)


Out[14]:
0.0

As expected, an input value of 1.0 gives the answer 0.0. However, what if we had derived the input value of 1.0 from an expression like the following?


In [24]:
(0.1 + 0.2) / 0.3


Out[24]:
1.0000000000000002

The relative error associated with this expression is very small, but it matters because arccosine is only defined over the range [-1.0, 1.0]. If we pass this derived input value to our simple trigonometric function, it returns nan, which is the standard way computers represent an undefined or unrepresentable value (the name stands for "not a number").


In [17]:
a = (0.1 + 0.2) / 0.3
arccos_to_deg(a)


Out[17]:
nan
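
Because nan propagates silently through most subsequent calculations, it is worth checking for it explicitly; np.isnan does exactly that:

result = arccos_to_deg(a)
print np.isnan(result)    # True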

NumPy doesn't have a tool to cover every type of floating point arithmetic error that you'll ever come across, so in this instance we'll have to write our own function to fix the problem.


In [28]:
def approx_one(values, rtol=1e-05, atol=0.0):
    """Take values close to 1.0 and return exactly 1.0"""
    
    threshold = atol + rtol * np.absolute(values)
    diff = np.array(values) - 1.0
    
    return np.where(np.absolute(diff) < threshold, 1.0, values)


arccos_to_deg(approx_one(a))


Out[28]:
0.0

The moral of the story here is that floating point issues are often unique to the problem that you're working on. However, if you're aware of the common tools (e.g. np.allclose) and techniques (e.g. never use ==) for dealing with them, you can almost always come up with a solution.

Key Points

  • NumPy arrays let us replace explicit loops with compact, optimized whole-array operations.
  • All the elements of an array share a single data type, which NumPy either infers from the initial values or takes from an explicit dtype argument.
  • Assigning an array to another variable aliases it; use the copy method to duplicate data.
  • Pass an axis argument to functions like np.mean and np.max to calculate statistics along one dimension of an array.
  • Boolean masks select subsets of data in a single expression.
  • matplotlib's pyplot module makes it easy to visualize data and spot anomalies.
  • Never compare floating point numbers with == or !=; test whether they are within a tolerance, for example with np.allclose.