Introduction to NumPy

NumPy supports arrays which are very useful to numerical computations
  • Arrays are N dimensional: 1d (vector), 2d (plane),...,N dim
  • Arrays are (generally) faster than lists
  • Many packages use numpy arrays to store data
  • Arrays can be used to make calculations in one command, without For loops or list compreension

Let's import it...


In [284]:
import numpy as np

Vectorised array2 = array1*k+c

Non-vectorised, here with a list for index in range(len(list1)): list2[index] = list1[index]*k + c

Do we still need lists?

  • Lists can have different objects as elements. Arrays are homogenous. example_list = [number, string, cat, dog]
  • Both arrays and lists can be nested nested_list = [[1,2], [1,2,3], [1]]

Looking for help?

  • Interactive help: NumPy has an a built-in search engine

In [220]:
np.lookfor('weighted average')


Search results for 'weighted average'
-------------------------------------
numpy.average
    Compute the weighted average along the specified axis.
numpy.ma.average
    Return the weighted average of array over the given axis.
numpy.mean
    Compute the arithmetic mean along the specified axis.
numpy.nanmean
    Compute the arithmetic mean along the specified axis, ignoring NaNs.
numpy.ma.mean
    Returns the average of the array elements.
numpy.ma.MaskedArray.mean
    Returns the average of the array elements.
numpy.cov
    Estimate a covariance matrix, given data and weights.

Creating an array from a list


In [285]:
a1d = np.array([3, 4, 5, 6])
a1d


Out[285]:
array([3, 4, 5, 6])

In [222]:
a2d = np.array([[10.,   20, 30], [9, 8, 7]])
a2d


Out[222]:
array([[ 10.,  20.,  30.],
       [  9.,   8.,   7.]])

In [223]:
print( type( a1d[0] ) )
print( type( a2d[0,0] ) )


<class 'numpy.int32'>
<class 'numpy.float64'>

In [224]:
type(a1d)


Out[224]:
numpy.ndarray

The core class of NumPy is the ndarray (homogeneous n-dimensional array).

To find methods or attributes: a1d. ->tab

Common mistakes


In [225]:
try:
    a = np.array(1,2,3,4)   # WRONG, only 2 non-keyword arguments accepted
except ValueError as err:
    print(err)


only 2 non-keyword arguments accepted

In [226]:
a = np.array([1,2,3,4]) # RIGHT

In [227]:
np.ndarray([1,2,3,4]) #  ndarray is is a low level method. Use np.array() instead


Out[227]:
array([[[[ 0.1,  0.2,  0.3,  0.4],
         [ 0.5,  0.6,  0.7,  0.8],
         [ 0.9,  1. ,  1.1,  1.2]],

        [[ 1.3,  1.4,  1.5,  1.6],
         [ 1.7,  1.8,  1.9,  2. ],
         [ 2.1,  2.2,  2.3,  2.4]]]])

Functions for creating arrays

arange([start,] stop[, step,], dtype=None)

evenly spaced, defined by step


In [228]:
np.arange(1, 9, 2)


Out[228]:
array([1, 3, 5, 7])

In [229]:
# for integers, np.arange is same as range but returns an array insted of a list 
np.array( range(1,9,2) )


Out[229]:
array([1, 3, 5, 7])
linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

evenly spaced, defined by length


In [296]:
np.linspace(0, 1, 10)   # start, end, num-points


Out[296]:
array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])

Filled with zeros


In [231]:
np.zeros((2, 3))


Out[231]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

By default, the dtype of the created array is float64 but other dtypes can be used:


In [232]:
np.zeros((2,2),dtype=int)


Out[232]:
array([[0, 0],
       [0, 0]])

Filled with ones


In [233]:
np.ones((2, 3))


Out[233]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

Create array with random numbers


In [234]:
np.random.rand(4)       # uniform in [0, 1]


Out[234]:
array([ 0.36287114,  0.14607325,  0.58796834,  0.02682632])

In [235]:
np.random.normal(0,1,4)      # Gaussian (mean,std dev, num samples)


Out[235]:
array([ 0.57712622, -0.25874753, -0.43217297,  1.8227352 ])

In [298]:
np.random.gamma(1,1,(2,2))      # Gamma (shape, scale , num samples)


Out[298]:
array([[ 0.69766346,  0.38936882],
       [ 1.73591557,  0.50891149]])

Grid generation

  • A common task is to generate a pair of arrays that represent data coordinates.
  • Useful for interpolation of mapping contours.
  • When orthogonal 1D coordinate arrays already exist, NumPy's meshgrid function is very useful:

In [237]:
x = np.linspace(-5, 5, 3)
y = np.linspace(10, 40, 4)
x2d, y2d = np.meshgrid(x, y)
print(x2d)
print(y2d)


[[-5.  0.  5.]
 [-5.  0.  5.]
 [-5.  0.  5.]
 [-5.  0.  5.]]
[[ 10.  10.  10.]
 [ 20.  20.  20.]
 [ 30.  30.  30.]
 [ 40.  40.  40.]]

Transpose arrays

This can be very useful when dealing with grids


In [238]:
print(np.transpose(y2d)) # or equivalentely
print(y2d.transpose())   # using the method of y2d
print(y2d.T)             # using the property of y2d


[[ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]]
[[ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]]
[[ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]]

Array attributes


In [239]:
a2d


Out[239]:
array([[ 10.,  20.,  30.],
       [  9.,   8.,   7.]])

ndarray.ndim

the number of axes (dimensions) of the array. In NumPy, the number of dimensions is referred to as rank.


In [240]:
a2d.ndim


Out[240]:
2

ndarray.shape

the dimensions of the array


In [241]:
a2d.shape


Out[241]:
(2, 3)

This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the rank, or number of dimensions, ndim.

ndarray.size

the total number of elements of the array


In [242]:
a2d.size


Out[242]:
6

Note that size is not equal to len(). The latter returns the length of the first dimension.


In [243]:
len(a2d)


Out[243]:
2

Statistical methods of arrays


In [306]:
print('array a1d                       :',a1d)
print('Minimum and maximum             :', a1d.min(), a1d.max())
print('Sum and product of all elements :', a1d.sum(), a1d.prod())
print('Mean and standard deviation     :', a1d.mean(), a1d.std())


array a1d                       : [3 4 5 6]
Minimum and maximum             : 3 6
Sum and product of all elements : 18 360
Mean and standard deviation     : 4.5 1.11803398875

Operations over a given axis


In [314]:
print(a2d)
print('sum  :',a2d.sum())
print('sum  :',a2d.sum(axis=0))
print('sum  :',a2d.sum(axis=1))


[[ 10.  20.  30.]
 [  9.   8.   7.]]
sum  : 84.0
sum  : [ 19.  28.  37.]
sum  : [ 60.  24.]

Vectorisation: operations on whole arrays


In [245]:
np.exp(a/100.)/a


Out[245]:
array([ 1.01005017,  0.51010067,  0.34348484,  0.26020269])

In [246]:
# Non-vectorised
r=np.zeros(a.shape)   # create empy array for results

for i in range(len(a)):
    r[i] = np.exp(a[i]/100.)/a[i]
r


Out[246]:
array([ 1.01005017,  0.51010067,  0.34348484,  0.26020269])

Vectorization is generally faster than a for loop. But it is not always possible or the most readable: for i in range(len(a)): if isprime(i): r[i] = a[i] else r[i]= a[i]+a[i-1]

Array indexing

  • Indices begin at 0, like other Python sequences and C/C++. Note that many languages, such as Matlab, R and Fortran, start with 1
  • In 2D, the first dimension corresponds to rows, the second to columns.
  • The fastest varying dimension is the last dimension! The outer level of the hierarchy is the first dimension.

In [247]:
a = np.arange(10, 100, 10)
a


Out[247]:
array([10, 20, 30, 40, 50, 60, 70, 80, 90])

In [248]:
a[2:9:3] # [start:end:step]


Out[248]:
array([30, 60, 90])

In [249]:
a[:3] # last is not included


Out[249]:
array([10, 20, 30])

In [250]:
a[-2] # negative index counts from the end


Out[250]:
80

Using indexes: how to calculate x[i]-x[i-1] without a loop?


In [251]:
x = np.random.rand(6)
x


Out[251]:
array([ 0.80490041,  0.1864352 ,  0.45164558,  0.8112803 ,  0.41884847,
        0.22717352])

In [252]:
xs = np.sort(x)
xs


Out[252]:
array([ 0.1864352 ,  0.22717352,  0.41884847,  0.45164558,  0.80490041,
        0.8112803 ])

In [253]:
xs[1:] - xs[:-1]


Out[253]:
array([ 0.04073832,  0.19167494,  0.03279711,  0.35325483,  0.00637989])

Exercise 1

Create a 2D NumPy array from the following list and assign it to the variable "arr":


In [254]:
# [[2, 3.2, 5.5, -6.4, -2.2, 2.4],
#  [1, 22, 4, 0.1, 5.3, -9],
#  [3, 1, 2.1, 21, 1.1, -2]]

Can you guess what the following slices are equal to? Print them to check your understanding.


In [255]:
# a[:, 3]

In [256]:
# a[1:4, 0:4]

In [257]:
# a[1:, 2]

Fancy indexing

NumPy arrays can be indexed with slices, but also with boolean or integer arrays (masks)


In [258]:
a = np.random.randint(1, 100, 6) # array of 6 random integers between 1 and 100
a


Out[258]:
array([18, 99, 19,  7, 29, 36])

In [259]:
mask = ( a % 3 == 0 ) # Where divisible by 3 (% is the modulus operator).
mask


Out[259]:
array([ True,  True, False, False, False,  True], dtype=bool)

In [260]:
a[mask]


Out[260]:
array([18, 99, 36])

Exercise 2

Consider an 4x5 2D array of negative integers:


In [261]:
a = np.arange(-100, 0, 5).reshape(4, 5)
a


Out[261]:
array([[-100,  -95,  -90,  -85,  -80],
       [ -75,  -70,  -65,  -60,  -55],
       [ -50,  -45,  -40,  -35,  -30],
       [ -25,  -20,  -15,  -10,   -5]])

Suppose you want to return an array result, which has the squared value when an element in array a is greater than -90 and less than -40, and is 1 otherwise.

  • With a For loop it would look like this:

In [262]:
result = np.zeros(a.shape, dtype=a.dtype)

for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        if a[i, j] > -90 and a[i, j] < -40:
            result[i, j] = a[i, j]**2
        else:
            result[i, j] = 1
            
result


Out[262]:
array([[   1,    1,    1, 7225, 6400],
       [5625, 4900, 4225, 3600, 3025],
       [2500, 2025,    1,    1,    1],
       [   1,    1,    1,    1,    1]])

But a more less verbose and quicker approach would be:


In [263]:
condition = (a > -90) & (a < -40)
condition


Out[263]:
array([[False, False, False,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True, False, False, False],
       [False, False, False, False, False]], dtype=bool)

In [264]:
result[condition] = a[condition]**2
result[~condition] = 1
print(result)


[[   1    1    1 7225 6400]
 [5625 4900 4225 3600 3025]
 [2500 2025    1    1    1]
 [   1    1    1    1    1]]

A one-liner using np.where:


In [265]:
result = np.where(condition, a**2, 1)
print(result)


[[   1    1    1 7225 6400]
 [5625 4900 4225 3600 3025]
 [2500 2025    1    1    1]
 [   1    1    1    1    1]]

Masked arrays - how to handle (propagating) missing values

All operations related to masked arrays live in numpy.ma submodule.

The simplest example of manual creation of a masked array:


In [266]:
a = np.ma.masked_array(data=[1, 2, 3],
                       mask=[True, True, False],
                       fill_value=-999)
a


Out[266]:
masked_array(data = [-- -- 3],
             mask = [ True  True False],
       fill_value = -999)

Often, a task is to mask array depending on a criterion.


In [267]:
a = np.linspace(1, 15, 15)

In [268]:
masked_a = np.ma.masked_greater_equal(a, 11)

In [269]:
masked_a


Out[269]:
masked_array(data = [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 -- -- -- -- --],
             mask = [False False False False False False False False False False  True  True
  True  True  True],
       fill_value = 1e+20)

Exercise 3

  • Create a "data" array of 30 linearly spaced numbers in the interval (1, 15)
  • Hint: use np.linspace or np.arange functions
  • Create a condition - a True/False (boolean) array, corresponding to the masked values
  • The data array should be masked where elements are
    • greater than 1
    • less or equal than 12
    • divisible by 2.
  • Mask the array depending on the condition
  • Hint: use np.where function

In [270]:
# Your code:
# arr =

In [271]:
# Your code:
# condition =

In [272]:
# masked_arr = np.ma.masked_where(condition, arr)
# print(masked_arr)

Shape manipulation


In [273]:
a = np.array([[1, 2, 3], [4, 5, 6]])

In [274]:
print('{} <-- array'.format(a))
print('{} <-- its shape'.format(a.shape))


[[1 2 3]
 [4 5 6]] <-- array
(2, 3) <-- its shape

In [275]:
a.flatten()


Out[275]:
array([1, 2, 3, 4, 5, 6])

In [315]:
a.repeat(4)


Out[315]:
array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6,
       6])

In [278]:
a.reshape((3, 2))


Out[278]:
array([[1, 2],
       [3, 4],
       [5, 6]])

In [279]:
print('Old shape: {}'.format(a.shape))
print('New shape: {}'.format(a.reshape((3, 2)).shape))


Old shape: (2, 3)
New shape: (3, 2)

Add a dimension


In [280]:
a[..., np.newaxis].shape


Out[280]:
(2, 3, 1)

Exercise 4

Generate a 2d array with 5x5. The first value is 0 and it grows left to right and top to bottom in increments on 0.1.


In [281]:
e4 = np.arange(0.,2.5,.1)
e4


Out[281]:
array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ,
        1.1,  1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ,  2.1,
        2.2,  2.3,  2.4])

In [282]:
e4.reshape([5,5])


Out[282]:
array([[ 0. ,  0.1,  0.2,  0.3,  0.4],
       [ 0.5,  0.6,  0.7,  0.8,  0.9],
       [ 1. ,  1.1,  1.2,  1.3,  1.4],
       [ 1.5,  1.6,  1.7,  1.8,  1.9],
       [ 2. ,  2.1,  2.2,  2.3,  2.4]])

Broadcasting

The fact that NumPy operates on an element-wise basis means that in principle arrays must always match one another's shape. However, NumPy will also helpfully "broadcast" dimensions when possible.