The Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary-length integers, etc.) and a very comprehensive standard library. It was not, however, designed specifically for mathematical and scientific computing. Neither the language nor its standard library has facilities for the efficient representation of multidimensional datasets, or tools for linear algebra and general matrix manipulations (essential building blocks of virtually all scientific computing).
For example, Python lists are very flexible containers that can be nested arbitrarily deep and can hold any Python object in them, but they are poorly suited to represent common mathematical constructs like vectors and matrices. It is for this reason that NumPy exists.
This workshop is a hands-on coding workshop with discussion sections. It can be treated as an exercise for an individual, but the intent is to work as a group with an instructor.
Topics are introduced, followed by worked examples and exploration suggestions. These should be attempted individually or in small groups (making sure everyone is keeping their own notes). Each topic concludes with a review and discussion session for the whole group.
The Jupyter notebook provides each participant with an environment to run code and make notes. We recommend that you take your own copy of the workshop notebook and customise it through the workshop. This should provide you with a useful resource to refer back to.
In [ ]:
# numpy is generally imported as 'np'
import numpy as np
print(np)
print(np.__version__)
Here is a link to the NumPy documentation for v1.11: https://docs.scipy.org/doc/numpy-1.11.0/reference/
There are many online forums with tips and suggestions for solving problems with NumPy, such as http://stackoverflow.com/
In [ ]:
# an explicit list of numbers
anarray = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23])
# an array of zeros of shape(3, 4)
zeroarray = np.zeros((3, 4))
# a range from 0 to n-1
rangearray = np.arange(12)
# a range from 0 to n-1, reshaped to (2, 3, 5)
shapedarray = np.arange(30).reshape(2, 3, 5)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In NumPy,

c = a * b

calculates the element-wise product of a and b, at near-C speeds, but with the code simplicity we expect from something based on Python.
This demonstrates a core feature of NumPy: 'vectorization'.
Loops iterating through elements of arrays are not required, which can make code easier to read as well as faster.
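For example, here is a minimal sketch of vectorized arithmetic (the values of a and b are our own illustration):

In [ ]:
# two arrays of the same shape
a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
# the product is computed element-wise in compiled code, with no Python loop
c = a * b
print(c)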
Many scientific Python libraries use NumPy as their core array representation. From plotting libraries such as Matplotlib, to parallel processing libraries such as Dask, to data interoperability libraries such as Iris, NumPy arrays are at the core of how these libraries operate and communicate.
In [ ]:
arr = np.ones((3, 2, 4))
print("Array shape:", arr.shape)
print("Array element dtype:", arr.dtype)
In [ ]:
Make some more arrays. Some different ways to create arrays can be found at https://docs.scipy.org/doc/numpy/user/basics.creation.html. What are the properties of your arrays?
In [ ]:
What else can you find out about these arrays?
In [ ]:
In [ ]:
arr = np.array([1, 2, 3, 4, 5, 6])
print("arr --", arr)
print("arr[2] --", arr[2])
print("arr[2:5] --", arr[2:5])
print("arr[::2] --", arr[::2])
You can also index multidimensional arrays in a logical way using an enhanced indexing syntax. Remember that Python uses zero-based indexing!
In [ ]:
lst_2d = [[1, 2, 3], [4, 5, 6]]
arr_2d = np.array(lst_2d)
print("2D list:")
print(lst_2d)
print('')
print("2D array:")
print(arr_2d)
print('')
print("Single array element:")
print(arr_2d[1, 2])
print('')
print("Single row:")
print(arr_2d[1])
print('')
print("First two columns:")
print(arr_2d[:, :2])
NumPy provides syntax to index conditionally, based on the data in the array.
You can pass in an array of True and False values (a boolean array), or, more commonly, a condition that returns a boolean array.
In [ ]:
print(arr_2d[arr_2d % 2 == 0])
In [ ]:
In [ ]:
Question: why do the following examples produce different results?
In [ ]:
print(lst_2d[0:2][1])
print(arr_2d[0:2, 1])
The result we just received points to an important piece of learning, which is that in most cases NumPy arrays behave very differently to Python lists. Let's explore the differences (and some similarities) between the two.
A NumPy array has a fixed data type, called its dtype: this is the type of all the elements of the array. This is in contrast to Python lists, which can hold elements of different types.
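A quick illustration of this (the example values are our own): assigning a float into an integer array silently truncates it, because the dtype cannot change.

In [ ]:
arr = np.array([1, 2, 3])
# the dtype is fixed at creation time (a platform-dependent integer type)
print(arr.dtype)
# assigning a float into an integer array truncates it to fit the dtype
arr[0] = 9.7
print(arr)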
In [ ]:
In [ ]:
In [ ]:
Review: the key properties of an array are its shape and dtype, and useful creation routines include array([list]), ones, zeros, arange and linspace.

NumPy arrays support elementwise arithmetic. For example:
In [ ]:
arr1 = np.arange(4)
arr2 = np.arange(4)
print('{} + {} = {}'.format(arr1, arr2, arr1 + arr2))
In [ ]:
In [ ]:
In [ ]:
It makes intrinsic sense that you should be able to add a constant to all values in an array:
In [ ]:
arr = np.arange(4)
const = 5
print("Original array: {}".format(arr))
print("")
print("Array + const: {}".format(arr + const))
In [ ]:
# daily maximum temperatures recorded at 3 stations over 2 days
daily_records = np.array([[12, 14, 11], [11, 12, 15]])
print('raw data:')
print(daily_records)
Each station is known to overstate the maximum recorded temperature by a different known constant value. You wish to subtract the appropriate offset from each station's values.
You can do that like this:
In [ ]:
offset = np.array([2, 1, 4])
corrected_records = daily_records - offset
print('corrected values:')
print(corrected_records)
NumPy allows you to do this easily, using a powerful piece of functionality called broadcasting.

Broadcasting is a way of treating the arrays as if they had the same dimensions, so that all their elements correspond; it is then easy to perform the calculation element-wise.

It does this by matching dimensions between the two arrays where possible, and using repeated values where one array has no corresponding dimension.
Broadcasting applies these three rules:

1. If the two arrays differ in their number of dimensions, the shape of the array with fewer dimensions is padded with ones on its leading (left) side.
2. If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
3. If in any dimension the sizes disagree and neither is equal to 1, an error is raised.
Note that all of this happens without ever actually creating the expanded arrays in memory! This broadcasting behaviour is in practice enormously powerful, especially because when NumPy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't actually duplicate the data. In the scalar example above, the operation is carried out as if const were a 1D array with 5 in all of its entries, but no such array is ever created. This can save lots of memory when the arrays in question are large, and so can have significant performance implications.
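As a worked example of the rules (the shapes here are our own choice), a (2, 3) array plus a (3,) array pads the smaller shape to (1, 3), then stretches it to (2, 3):

In [ ]:
arr1 = np.arange(6).reshape(2, 3)  # shape (2, 3)
arr2 = np.array([10, 20, 30])      # shape (3,): padded to (1, 3), stretched to (2, 3)
print(arr1 + arr2)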
In [ ]:
arr1 = np.ones((2, 3))
arr2 = np.ones((2, 1))
# (arr1 + arr2).shape
In [ ]:
arr1 = np.ones((2, 3))
arr2 = np.ones(3)
# (arr1 + arr2).shape
In [ ]:
arr1 = np.ones((1, 3))
arr2 = np.ones((2, 1))
# (arr1 + arr2).shape
In [ ]:
arr1 = np.ones((1, 3))
arr2 = np.ones((1, 2))
# (arr1 + arr2).shape
NumPy allows you to change the shape of an array, so long as the total number of elements in the array does not change. For example, we could reshape a flat array with 12 elements to a 2D array with shape (2, 6), or (3, 2, 2), or even (3, 4, 1). We could not, however, reshape it to have shape (2, 5), because the total number of elements would not be kept constant.
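For example (a small sketch):

In [ ]:
flat = np.arange(12)
print(flat.reshape(2, 6).shape)
print(flat.reshape(3, 2, 2).shape)
# flat.reshape(2, 5) would raise a ValueError: the total size must not change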
Now suppose you want to apply a correction for each day.
For example, you might try:
In [ ]:
days_adjust = np.array([1.5, 3.7])
adjusted = daily_records - days_adjust
but that results in a ValueError.

Clearly, this doesn't work!
With reference to the above rules of broadcasting, can you explain why subtracting days_adjust from daily_records fails? One possible fix is sketched after the next cell.
In [ ]:
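One possible fix (a sketch): broadcasting pads missing dimensions on the left, so we must add the trailing dimension ourselves, for example with np.newaxis:

In [ ]:
days_adjust = np.array([1.5, 3.7])
# reshape (2,) -> (2, 1), which broadcasts against the (2, 3) daily_records
adjusted = daily_records - days_adjust[:, np.newaxis]
print(adjusted)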
Sometimes an operation will produce a result, but not the one you wanted.
For example, suppose we have:

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

and we wish to add 0, 100 and 400 to the rows respectively. That is,

B = np.array([0, 100, 400])
# and...
desired_result = np.array([[  1,   2,   3],
                           [104, 105, 106],
                           [407, 408, 409]])
In [ ]:
In [ ]:
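One possible solution (a sketch): reshape B into a column so that it broadcasts down the rows rather than across them:

In [ ]:
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
B = np.array([0, 100, 400])
# B[:, np.newaxis] has shape (3, 1), so it is stretched across the columns
print(A + B[:, np.newaxis])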
Now go and experiment with this behaviour!
You could review some of your experiments that failed from the earlier "Elementwise Arithmetic" section and see if you can now make them work.
In [ ]:
In [ ]:
NumPy arrays support many common statistical calculations. For a list of common operations, see https://docs.scipy.org/doc/numpy/reference/routines.statistics.html.
The simplest operations consist of calculating a single statistical value from an array of numbers -- such as a mean value, a variance or a minimum.
For example:
In [ ]:
a = np.arange(12).reshape((3, 4))
mean = np.mean(a)
print(a)
print(mean)
What is the mean of the daily_records array, as given by the np.mean() function? There is also a method form, <array>.mean(). Is that the same thing? What about np.median(array), or array.median()?
In [ ]:
In [ ]:
In [ ]:
Used without any further arguments, statistical functions simply reduce the whole array to a single value. In practice, however, we very often want to calculate statistics over only some of the dimensions.
The most common requirement is to calculate a statistic along a single array dimension, while leaving all the other dimensions intact. This is referred to as "collapsing" or "reducing" the chosen dimension.
This is done by adding an "axis" keyword specifying the dimension to collapse, such as np.min(data, axis=1).
For example, recall that the above "daily_records" data varies over 2 timepoints and 3 locations, mapped to array dimensions 0 and 1 respectively:
>>> print(daily_records)
[[12 14 11]
 [11 12 15]]
For this data, by adding the axis keyword, we can calculate either:

- axis=0 : a statistic over both days, for each station, or
- axis=1 : a statistic over all three stations, for each day.

For most statistical functions (but not all), the "axis" keyword will also accept multiple dimensions, e.g. axis=(0, 2).

Remember also that the default, as seen above, is to collapse over all dimensions, reducing the array to a single value. This is equivalent to coding axis=None.
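For example, with the daily_records data from above:

In [ ]:
print(np.mean(daily_records, axis=0))  # mean over both days, for each station
print(np.mean(daily_records, axis=1))  # mean over all stations, for each day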
Can you calculate the np.ptp statistic of this data in a similar form, i.e. over all stations at each timestep? Note that ptp means a peak-to-peak range, i.e. "max(data) - min(data)".
In [ ]:
In [ ]:
In [ ]:
Real-world measurement processes often result in certain datapoint values being uncertain or simply "missing". This is usually indicated by additional data-quality information, stored alongside the data values.

In these cases we often need to make calculations that count only the valid datapoints. NumPy provides a special "masked array" type for this type of calculation.
For example, we might know that in our previous daily_records data, the value for station 2 on day 1 is invalid. To represent this missing datapoint, we can make a masked version of the data:
In [ ]:
daily_records = np.array([[12, 14, 11], [11, 12, 15]])
masked_data = np.ma.masked_array(daily_records)
masked_data[0, 1] = np.ma.masked
print('masked data:')
print(masked_data)
The statistics of the masked data version are different:
In [ ]:
print('unmasked average = ', np.mean(daily_records))
print('masked average = ', np.ma.mean(masked_data))
The np.ma.masked_array() function seen above is a simple creation method for masked data. The sub-module np.ma contains all the NumPy masked-data routines.
Instead of masking selected points, as above, a mask is often specified as a "mask array": a boolean array of the same size and shape as the data, in which False means a good datapoint and True means a masked or missing point. Such a 'boolean mask' can be passed to the np.ma.masked_array creation method, and can be extracted with the np.ma.getmaskarray() function.
Note that most file storage formats represent missing data in a different way, using a distinct "missing data" value appearing in the data.
There is special support for converting between this type of representation and NumPy masked arrays: every masked array has a 'fill_value' property and a 'filled()' method to convert between the two.
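A short sketch of these ideas (the data and mask values here are our own):

In [ ]:
data = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, True, False, False])  # True marks a missing point
masked = np.ma.masked_array(data, mask=mask)
# extract the boolean mask array
print(np.ma.getmaskarray(masked))
# convert back to a 'plain' array, with -999 at the masked points
print(masked.filled(-999))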
Explore np.ma.masked_where and related routines for creating masked arrays from conditions. Use np.ma.filled() to create a 'plain' (i.e. unmasked) array from a masked one.
In [ ]:
In [ ]:
Note that most of these operations are available both as an array method, e.g. array.mean(), and as a function, e.g. np.mean(array), the choice being mostly a question of style.
In [ ]:
%%timeit
x = range(500)
Repeat that, specifying only 100 loops and the fastest time of 5 runs:
In [ ]:
%%timeit -n 100 -r 5
x = range(500)
This gives us an easy way to compare the performance of different implementations...
In [ ]:
rands = np.random.random(1000000).reshape(100, 100, 100)
In [ ]:
%%timeit -n 10 -r 5
overPointEightLoop = 0
for i in range(100):
    for j in range(100):
        for k in range(100):
            if rands[i, j, k] > 0.8:
                overPointEightLoop += 1
In [ ]:
%%timeit -n 10 -r 5
overPointEightWhere = rands[rands > 0.8].size
Clearly this is a trivial example, so let us explore a more complicated case.
In this exercise, you are tasked with implementing the simple trapezoid rule formula for numerical integration. If we want to compute the definite integral
$$ \int_{a}^{b}f(x)\,dx $$

we can partition the integration interval $[a,b]$ into smaller subintervals. We then approximate the area under the curve for each subinterval by calculating the area of the trapezoid created by linearly interpolating between the two function values at each end of the subinterval:
For a pre-computed $y$ array (where $y = f(x)$ at discrete samples) the trapezoidal rule equation is:
$$ \int_{a}^{b}f(x)\,dx \approx \frac{1}{2}\sum_{i=1}^{n}\left(x_{i}-x_{i-1}\right)\left(y_{i}+y_{i-1}\right). $$

In pure Python, this can be written as:
def trapz_slow(x, y):
    area = 0.
    for i in range(1, len(x)):
        area += (x[i] - x[i-1]) * (y[i] + y[i-1])
    return area / 2
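For comparison, here is one possible vectorized version using array slicing (a sketch; the name trapz_fast is our own):

In [ ]:
def trapz_fast(x, y):
    # interval widths times the sums of their endpoint values, summed in one go
    return 0.5 * np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]))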
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
Write a function trapzf(f, a, b, npts=100) that accepts a function f, the endpoints a and b, and the number of samples to take, npts. Sample the function uniformly at these points and return the value of the integral.
Use the trapzf function to identify the minimum number of sampling points needed to approximate the integral $\int_0^3 x^2$ with an absolute error of $<=0.0001$. (A loop is necessary here.)
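One possible sketch of trapzf, building on the trapz_fast function above:

In [ ]:
def trapzf(f, a, b, npts=100):
    # sample f uniformly at npts points across [a, b]
    x = np.linspace(a, b, npts)
    return trapz_fast(x, f(x))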
In [ ]:
In [ ]:
try:
    a = np.ones((11, 13, 17, 23, 29, 37, 47))
except MemoryError:
    print('this would have been a memory error')
NumPy avoids making copies of arrays unless it is explicitly told to. Many NumPy operations produce a reference to an existing array, known as a "view", instead of making a whole new array. For example, indexing and reshaping provide a view of the same memory wherever possible.
In [ ]:
arr = np.arange(8)
arr_view = arr.reshape(2, 4)
# Print the "view" array from reshape.
print('Before\n', arr_view)
# Update the first element of the original array.
arr[0] = 1000
# Print the "view" array from reshape again,
# noticing the first value has changed.
print('After\n', arr_view)
What this means is that if one array (arr) is modified, the other (arr_view) will also be updated: the same memory is being shared.
This is a valuable feature that keeps the system memory overhead manageable, which is particularly useful when handling lots of large arrays. The lack of copying also allows for very efficient vectorized operations.

Remember, though, that this behaviour is automatic in most of NumPy, so it requires some consideration in your code: it can lead to bugs that are hard to track down. For example, if you are changing some elements of an array that you are using elsewhere, you may want to explicitly copy that array before making changes.
If in doubt, you can always copy the data to a different block of memory with the copy() method. For example:
In [ ]:
arr = np.arange(8)
arr_view = arr.reshape(2, 4).copy()
# Print the copied array from reshape.
print('Before\n', arr_view)
# Update the first element of the original array.
arr[0] = 1000
# Print the copied array again, noticing that this time
# the first value has NOT changed: the copy is independent.
print('After\n', arr_view)
Experiment with these ideas: for example, see whether you can provoke a MemoryError, and practise using copy to make independent copies of arrays.

Here are a selection of topics we chose not to include in this workshop, or only touched upon briefly. You may come across some of them or find these links and terms useful in the future.