Introduction to Scientific and Numerical Python

See Jake Vanderplas' Intro to NumPy, Efficient NumPy and Intro to Scipy for more details. This notebook was created from a selection of materials in those more exhaustive notebooks.


Instructions: Create a new directory called NumPy with a notebook called NumPyTour. Give it a heading 1 cell title Introduction to Scientific and Numerical Python. Read this page, typing in the code in the code cells and executing them as you go.

Do not copy/paste.

Type the commands yourself to get the practice doing it. This will also slow you down so you can think about the commands and what they are doing as you type them.</font>

Save your notebook when you are done, then try the accompanying exercises.


Introduction

While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary length integers, etc) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library have facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), nor any data visualization facilities.

SciPy is the premiere package for scientific computing in Python. It has many, many useful tools, which are central to scientific computing tasks you'll come across.

Some of the submodules available in SciPy that we could use in this course include:

  • File input/output: scipy.io
  • Special functions: scipy.special
  • Linear algebra operations: scipy.linalg
  • Fast Fourier transforms: scipy.fftpack
  • Optimization and fit: scipy.optimize
  • Statistics and random numbers: scipy.stats
  • Interpolation: scipy.interpolate
  • Numerical integration: scipy.integrate
  • Signal processing: scipy.signal
  • Image processing: scipy.ndimage
  • Sparse Matrices: scipy.sparse
  • Physical Constants: scipy.constants
  • Spatial metrics: scipy.spatial

We'll learn more about some of those throughout this course, picking them up as they are needed to help us solve specific physics problems. But before we get into the SciPy libraries, we need to learn how to efficiently compute with the NumPy library. Both are installed by default with the Anaconda distribution of Python.

The NumPy library forms the base layer for the entire SciPy ecosystem. You can import it into the notebook with


In [ ]:
import numpy as np

Basics of NumPy Arrays

You learned about Python lists during your CodeCademy Bootcamp. Although they are very flexible containers, they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices.

The main object provided by numpy is a powerful array. We'll start by exploring how the numpy array differs from Python lists. We start by creating a simple list and an array with the same contents of the list:


In [ ]:
lst = [10, 20, 30, 40]
arr = np.array([10, 20, 30, 40])

Elements of a one-dimensional array are accessed with the same syntax as a list:


In [ ]:
arr[0]

In [ ]:
lst[0]

In [ ]:
arr[-1]

In [ ]:
arr[2:]

Differences between arrays and lists

The first difference to note between lists and arrays is that arrays are homogeneous; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:


In [ ]:
lst[-1] = 'a string inside a list'
lst

In [ ]:
arr[-1] = 'a string inside an array'

The information about the type of an array is contained in its dtype attribute:


In [ ]:
arr.dtype

Once an array has been created, its dtype is fixed and it can only store elements of the same type. For this example where the dtype is integer, if we store a floating point number it will be automatically converted into an integer:


In [ ]:
arr[-1] = 1.234
arr

Array Creation

As we did above, you can create an array from a python list, but there are other methods of creating arrays that are very useful. For example, we may want to create an array initialized to a specific value, like 0 or 1, which can then be built upon with addition or multiplication to get arrays of any initial value. To create these, use zeros or ones and specify the desired type:


In [ ]:
np.zeros(5, dtype=float)

In [ ]:
np.zeros(3, dtype=int)

In [ ]:
np.zeros(3, dtype=complex)

In [ ]:
print '5 ones:', np.ones(5)

In [ ]:
a = np.empty(4)
a.fill(5.5)
a

In [ ]:
a = np.empty(3,dtype=int)
a.fill(7)
a

Numpy also offers the arange function, which works like the builtin range but returns an array instead of a list:


In [ ]:
np.arange(5)

But notice something important about how it is used:


In [ ]:
c = np.arange(2,16,2)
c

The max number in the range is exclusive because python numbering starts at zero.

Alternatively, the linspace and logspace functions create linearly and logarithmically-spaced grids respectively, with a fixed number of points and including both ends of the specified interval:


In [ ]:
print "A linear grid between 0 and 1 with 4 elements:"
print np.linspace(0, 1, 4)

In [ ]:
print "A logarithmic grid between 10**1 and 10**3 with 4 elements:"
print np.logspace(1, 3, 4)

It is often useful to create arrays with random numbers that follow a specific distribution. The np.random module contains a number of functions that can be used to this effect, for example this will produce an array of 5 random samples taken from a standard normal distribution (0 mean and variance 1):


In [ ]:
np.random.randn(5)

whereas this will also give 5 samples, but from a normal distribution with a mean of 10 and a variance of 3:


In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10

Multidimensional arrays

Numpy can create arrays of aribtrary dimensions. For example, a list of lists can be used to initialize a two dimensional array:


In [ ]:
lst2 = [[1, 2], [3, 4]]
arr2 = np.array([[1, 2], [3, 4]])
arr2

With two-dimensional arrays we start seeing the power of NumPy: while a nested list can be indexed using repeatedly the [ ] operator, multidimensional arrays support a much more natural indexing syntax with a single [ ] and a set of indices separated by commas:


In [ ]:
print lst2[0][1]
print arr2[0,1]

Most of the array creation functions listed above can be used with more than one dimension, for example:


In [ ]:
np.zeros((2,3))

The shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:


In [ ]:
arr = np.arange(8).reshape(2, 4)
print arr

But note that reshaping (like most NumPy operations) provides a view of the same memory:


In [ ]:
arr = np.arange(8)
arr2 = arr.reshape(2, 4)

arr[0] = 1000
print arr
print arr2

This lack of copying allows for very efficient vectorized operations.

With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions:


In [ ]:
print 'Slicing in the second row:', arr2[1, 2:4]
print 'All rows, third column   :', arr2[:, 2]

In [ ]:
print 'First row:  ', arr2[0]
print 'Second row: ', arr2[1]

In [ ]:
#Print some properties of the array arr2
print 'Data type                :', arr2.dtype
print 'Total number of elements :', arr2.size
print 'Number of dimensions     :', arr2.ndim
print 'Shape (dimensionality)   :', arr2.shape
print 'Memory used (in bytes)   :', arr2.nbytes

In [ ]:
#Print some useful information that the arrays can calculate for us
print 'Minimum and maximum             :', arr2.min(), arr2.max()
print 'Sum and product of all elements :', arr2.sum(), arr2.prod()
print 'Mean and standard deviation     :', arr2.mean(), arr2.std()

It's also possible to do the computation along a single dimension, by passing the axis parameter; for example:


In [ ]:
print 'For the following array:\n', arr2
print 'The sum of elements along the rows is    :', arr2.sum(axis=1)
print 'The sum of elements along the columns is :', arr2.sum(axis=0)

As you can see in this example, the value of the axis parameter is the dimension which will be consumed once the operation has been carried out. This is why to sum along the rows we use axis=1.

Another widely used property of arrays is the .T attribute, which allows you to access the transpose of the array:


In [ ]:
print 'Array:\n', arr2
print 'Transpose:\n', arr2.T

NumPy can also create some useful matrices:


In [ ]:
#identity matrix
c = np.eye(3)
c

In [ ]:
#diagonal matrix wth elements of an array
d = np.diag(np.array([1,2,3,4]))
d

In [ ]:
e = np.diag(np.linspace(0,1,6))
e

To access the elements of a multidimensional (in this case 2D) array:


In [ ]:
e[1,1]

In [ ]:
e[2,2]

Slicing Basics

To select a range of elements in the array, you can use slices, just like for Python lists:


In [ ]:
a = np.arange(10)
a

In [ ]:
a[:4]

In [ ]:
a[2:4]

In [ ]:
a[5:]

In [ ]:
a[:]

You can also give the step size:


In [ ]:
a[2:9:3]

Arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particluarly useful to extract information from an array that matches a certain condition.

Consider for example that in the array norm10 we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is true or false:


In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10

In [ ]:
mask = norm10 > 9
mask

Now that we have this mask, we can use it to either read those values or to reset them to 0:


In [ ]:
print 'Values above 9:', norm10[mask]

In [ ]:
print 'Resetting all values above 9 to 0...'
norm10[mask] = 0
print norm10

Operations with Arrays

Arrays support all regular arithmetic operators, and the NumPy library also contains a complete collection of basic mathematical functions that operate on arrays. It is important to remember that in general, all operations with arrays are applied element-wise, i.e., are applied to all the elements of the array at the same time. Consider for example:


In [ ]:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print arr1, '+', arr2, '=', arr1+arr2

Importantly, you must remember that even the multiplication operator is by default applied element-wise, it is not the matrix multiplication from linear algebra (as is the case in Matlab, for example):


In [ ]:
print arr1, '*', arr2, '=', arr1*arr2

We may also multiply an array by a scalar:


In [ ]:
1.5 * arr1

Broadcasting

While in principle arrays must always match in their dimensionality in order for an operation to be valid, NumPy will broadcast dimensions when possible. Here is an example of broadcasting a scalar to a 1D array:


In [ ]:
print np.arange(3)
print np.arange(3) + 5

We can also broadcast a 1D array to a 2D array, in this case adding a vector to all rows of a matrix:


In [ ]:
print np.ones((3,3))
print " + "
print np.arange(3)
print " = "
print np.ones((3, 3)) + np.arange(3)

We can also broadcast in two directions at a time:


In [ ]:
np.arange(3).reshape((3, 1)) + np.arange(3)

Rules of Broadcasting

Broadcasting rules can do the following:

  • If the two arrays differ in their number of dimensions, the shape of the array with fewer dimensions is padded with additional elements until it matches the other on its leading (left) side.

  • If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.

  • If in any dimension the sizes disagree and neither is equal to 1, an error is raised.

Note that all of this happens without ever actually creating the stretched arrays in memory! This broadcasting behavior is in practice enormously powerful, especially because when NumPy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't actually replicate the data. In the example above the operation is carried as if the 1.5 was a 1-d array with 1.5 in all of its entries, but no actual array was ever created. This can save lots of memory in cases when the arrays in question are large and can have significant performance implications.

For example, when we do

np.arange(3) + 5

The scalar 5 is

  • first 'promoted' to a 1-dimensional array of length 1
  • then, this array is 'stretched' to length 3 to match the first array.

After these two operations are complete, the addition can proceed as now both operands are one-dimensional arrays of length 3.

The general rule is: when operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward, creating dimensions of length 1 as needed. Two dimensions are considered compatible when

  • they are equal to begin with, or
  • one of them is 1; in this case numpy will do the 'stretching' to make them equal.

If these conditions are not met, a ValueError: frames are not aligned exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.


Universal Functions or ufuncs

NumPy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. Furthermore, SciPy ships a rich special function library in the scipy.special module that includes Bessel, Airy, Fresnel, Laguerre and other classical special functions. For example, sampling the sine function at 100 points between 0 and $2\pi$ is as simple as:


In [ ]:
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)

These are examples of Universal Functions or ufuncs in NumPy. They operate element-wise on an array. We've already seen examples of these in the various arithmetic operations:


In [ ]:
x = np.random.random(4)
print x
print x + 1  # add 1 to each element of x

In [ ]:
print x * 2  # multiply each element of x by 2

In [ ]:
print x * x  # multiply each element of x by itself

In [ ]:
print x[1:] - x[:-1]

These are binary ufuncs: they take two arguments. There are also unary ufuncs:


In [ ]:
-x

In [ ]:
np.sin(x)

Ufuncs are very fast:


In [ ]:
x = np.random.random(10000)

In [ ]:
%%timeit
# compute element-wise x + 1 via a ufunc 
y = np.zeros_like(x)
y = x + 1

In [ ]:
%%timeit
# compute element-wise x + 1 via a loop
y = np.zeros_like(x)
for i in range(len(x)):
    y[i] = x[i] + 1

NumPy ufuncs are faster than Python functions involving loops, because the looping happens in compiled code. This is only possible when types are known beforehand, which is why NumPy arrays must be typed.

Other Available Ufuncs:

  • Trigonometric functions (np.sin, np.cos, etc.)
  • Scipy special functions (scipy.special.j0, scipy.special.gammaln, etc.)
  • Element-wise minimum/maximum (np.minimum, np.maximum)
  • User-defined ufuncs

In [ ]:
x = np.random.random(5)
print x
print np.minimum(x, 0.5)
print np.maximum(x, 0.5)

How is this different from min() and max()?


In [ ]:
print np.min(x)
print np.max(x)

Linear Algebra with NumPy

Numpy ships with a basic linear algebra library, and all arrays have a dot method whose behavior is that of the scalar dot product when its arguments are vectors (one-dimensional arrays) and the traditional matrix multiplication when one or both of its arguments are two-dimensional arrays:


In [ ]:
v1 = np.array([2, 3, 4])
v2 = np.array([1, 0, 1])

print v1, '.', v2, '=', np.dot(v1, v2)

Here is a regular matrix-vector multiplication, note that the array v1 should be viewed as a column vector in traditional linear algebra notation; NumPy makes no distinction between row and column vectors and simply verifies that the dimensions match the required rules of matrix multiplication, in this case we have a 2x3 matrix multiplied by a 3-vector, which produces a 2-vector:


In [ ]:
A = np.arange(6).reshape(2, 3)
print A, 'x', v1, '=', np.dot(A, v1)

For matrix-matrix multiplication, the same dimension-matching rules must be satisfied, e.g. consider the difference between $A\times A^T$:


In [ ]:
print np.dot(A, A.T)

and $A^T\times A$:


In [ ]:
print np.dot(A.T, A)

The numpy.linalg module includes additional functionality such as determinants, matrix norms, Cholesky, eigenvalue and singular value decompositions, etc. For even more linear algebra tools, scipy.linalg contains the majority of the tools that one finds in the fast numerical packages written in Fortran.


NumPy Aggregates

Aggregates are functions over arrays which return smaller arrays. NumPy has several built-in:


In [ ]:
# 10 x 10 array drawn from a standard normal
x = np.random.randn(10, 10)

In [ ]:
#Mean of the values
print x.mean()

In [ ]:
#Standard deviation
print x.std()

In [ ]:
#Variance
print x.var()

In [ ]:
#Sum all elements
print x.sum()

In [ ]:
#Multiply all elements
print x.prod()

In [ ]:
#Median
print np.median(x)

Beware the built-in Python aggregates!

Python has a min, max, and sum aggregate built-in. These are much more general than the versions in NumPy, so they are much slower:


In [ ]:
x = np.random.random(10000)

%timeit np.sum(x)
%timeit sum(x)

Moral of the story: Dynamic type-checking is slow. Make sure to use NumPy's sum, min, and max.


You can also take aggregates along certain axes:


In [ ]:
x = np.random.rand(3, 5)
print x

In [ ]:
print x.sum(0)  # sum along columns

In [ ]:
print x.sum(1)  # sum along rows

In [ ]:
print np.mean(x, 1) # mean along rows

More slicing and Masking

Imagine you have an array of data where negative values indicate some kind of error:


In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])

How might you clean this array, setting all negative values to, say, zero?


In [ ]:
for i in range(len(x)):
    if x[i] < 0:
        x[i] = 0
print x

A faster way is to construct a boolean mask:


In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])

mask = (x < 0)
print mask

And the mask can be used directly to set the value you desire:


In [ ]:
x[mask] = 0
print x

Or do it directly:


In [ ]:
x = np.array([1, 2, 3, -999, 2, 4, -999])
x[x < 0] = 0
print x

Useful masking functions:


In [ ]:
x = np.random.random(5)
print x

In [ ]:
x[x > 0.5] = np.nan
print x

In [ ]:
x[np.isnan(x)] = np.inf
print x

In [ ]:
np.nan == np.nan

In [ ]:
x[np.isinf(x)] = 0
print x

In [ ]:
x = np.array([1, 0, -np.inf, np.inf, np.nan])
print "input   ", x
print "x < 0   ", (x < 0)
print "x > 0   ", (x > 0)
print "isinf   ", np.isinf(x)
print "isnan   ", np.isnan(x)
print "isposinf", np.isposinf(x)
print "isneginf", np.isneginf(x)

You can also use bitwise boolean operations on masks with parentheses:


In [ ]:
x = np.arange(16).reshape((4, 4))
print x

In [ ]:
print (x < 5)

In [ ]:
#Or the complement of that operation:
print ~(x < 5)

In [ ]:
#Elements evaluate to true if they match both conditions
print (x < 10) & (x % 2 == 0)

In [ ]:
#select elements within a range
print (x > 3) & (x < 8)

Sum over a mask to find the number of True elements:


In [ ]:
x = np.random.random(100)
print "array is length", len(x), "and has"
print (x > 0.5).sum(), "elements that are greater than 0.5"

Clip is a useful function. See if you can figure out what it does.


In [ ]:
x = np.random.random(10)
x = np.clip(x, 0.3, 0.6)

print np.sum(x < 0.3)
print np.sum(x > 0.6)

In [ ]:
x

You can also turn a mask into information about the indices using the where function.


In [ ]:
x = np.random.random((3, 3))
print x

In [ ]:
print np.where(x < 0.3)

In [ ]:
print x[x < 0.3]

In [ ]:
print x[np.where(x < 0.3)]

Another useful trick (poker, anyone?):


In [ ]:
i = np.arange(6)
print "Original order:",i
np.random.shuffle(i)
print "Randomly shuffled:",i

Summary:

The advantage of using NumPy is all about moving loops into compiled code:

  • Use NumPy ufuncs to your advantage (eliminate loops!)

  • Use NumPy aggregates to your advantage (eliminate loops!)

  • Use NumPy broadcasting to your advantage (eliminate loops!)

  • Use NumPy slicing and masking to your advantage (eliminate loops!)


All content is under a modified MIT License, and can be freely used and adapted. See the full license text here.