Introduction to NumPy

NumPy supports arrays which are very useful to numerical computations

Arrays are N dimensional: 1d (vector), 2d (plane),...,N dim
Arrays are (generally) faster than lists
Many packages use numpy arrays to store data
Arrays can be used to make calculations in one command, without For loops or list compreension

Let's import it...



In [284]:

    
import numpy as np

Vectorised array2 = array1*k+c

Non-vectorised, here with a list for index in range(len(list1)): list2[index] = list1[index]*k + c

Do we still need lists?

Lists can have different objects as elements. Arrays are homogenous. example_list = [number, string, cat, dog]
Both arrays and lists can be nested nested_list = [[1,2], [1,2,3], [1]]

Looking for help?

Documentation: http://docs.scipy.org/doc/numpy/reference/
Use help function (tab will show options available) help(np.mean)

Interactive help: NumPy has an a built-in search engine



In [220]:

    
np.lookfor('weighted average')









    



Search results for 'weighted average'
-------------------------------------
numpy.average
    Compute the weighted average along the specified axis.
numpy.ma.average
    Return the weighted average of array over the given axis.
numpy.mean
    Compute the arithmetic mean along the specified axis.
numpy.nanmean
    Compute the arithmetic mean along the specified axis, ignoring NaNs.
numpy.ma.mean
    Returns the average of the array elements.
numpy.ma.MaskedArray.mean
    Returns the average of the array elements.
numpy.cov
    Estimate a covariance matrix, given data and weights.

Creating an array from a list



In [285]:

    
a1d = np.array([3, 4, 5, 6])
a1d









    Out[285]:





array([3, 4, 5, 6])



In [222]:

    
a2d = np.array([[10.,   20, 30], [9, 8, 7]])
a2d









    Out[222]:





array([[ 10.,  20.,  30.],
       [  9.,   8.,   7.]])



In [223]:

    
print( type( a1d[0] ) )
print( type( a2d[0,0] ) )









    



<class 'numpy.int32'>
<class 'numpy.float64'>



In [224]:

    
type(a1d)









    Out[224]:





numpy.ndarray

The core class of NumPy is the ndarray (homogeneous n-dimensional array).

To find methods or attributes: a1d. ->tab

Common mistakes



In [225]:

    
try:
    a = np.array(1,2,3,4)   # WRONG, only 2 non-keyword arguments accepted
except ValueError as err:
    print(err)









    



only 2 non-keyword arguments accepted



In [226]:

    
a = np.array([1,2,3,4]) # RIGHT



In [227]:

    
np.ndarray([1,2,3,4]) #  ndarray is is a low level method. Use np.array() instead









    Out[227]:





array([[[[ 0.1,  0.2,  0.3,  0.4],
         [ 0.5,  0.6,  0.7,  0.8],
         [ 0.9,  1. ,  1.1,  1.2]],

        [[ 1.3,  1.4,  1.5,  1.6],
         [ 1.7,  1.8,  1.9,  2. ],
         [ 2.1,  2.2,  2.3,  2.4]]]])

Functions for creating arrays

`arange([start,] stop[, step,], dtype=None)`

evenly spaced, defined by step



In [228]:

    
np.arange(1, 9, 2)









    Out[228]:





array([1, 3, 5, 7])



In [229]:

    
# for integers, np.arange is same as range but returns an array insted of a list 
np.array( range(1,9,2) )









    Out[229]:





array([1, 3, 5, 7])

`linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)`

evenly spaced, defined by length



In [296]:

    
np.linspace(0, 1, 10)   # start, end, num-points









    Out[296]:





array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])

Filled with zeros



In [231]:

    
np.zeros((2, 3))









    Out[231]:





array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

By default, the dtype of the created array is float64 but other dtypes can be used:



In [232]:

    
np.zeros((2,2),dtype=int)









    Out[232]:





array([[0, 0],
       [0, 0]])

Filled with ones



In [233]:

    
np.ones((2, 3))









    Out[233]:





array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

Create array with random numbers



In [234]:

    
np.random.rand(4)       # uniform in [0, 1]









    Out[234]:





array([ 0.36287114,  0.14607325,  0.58796834,  0.02682632])



In [235]:

    
np.random.normal(0,1,4)      # Gaussian (mean,std dev, num samples)









    Out[235]:





array([ 0.57712622, -0.25874753, -0.43217297,  1.8227352 ])



In [298]:

    
np.random.gamma(1,1,(2,2))      # Gamma (shape, scale , num samples)









    Out[298]:





array([[ 0.69766346,  0.38936882],
       [ 1.73591557,  0.50891149]])

Grid generation

A common task is to generate a pair of arrays that represent data coordinates.
Useful for interpolation of mapping contours.
When orthogonal 1D coordinate arrays already exist, NumPy's meshgrid function is very useful:



In [237]:

    
x = np.linspace(-5, 5, 3)
y = np.linspace(10, 40, 4)
x2d, y2d = np.meshgrid(x, y)
print(x2d)
print(y2d)









    



[[-5.  0.  5.]
 [-5.  0.  5.]
 [-5.  0.  5.]
 [-5.  0.  5.]]
[[ 10.  10.  10.]
 [ 20.  20.  20.]
 [ 30.  30.  30.]
 [ 40.  40.  40.]]

Transpose arrays

This can be very useful when dealing with grids



In [238]:

    
print(np.transpose(y2d)) # or equivalentely
print(y2d.transpose())   # using the method of y2d
print(y2d.T)             # using the property of y2d









    



[[ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]]
[[ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]]
[[ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]
 [ 10.  20.  30.  40.]]

Array attributes



In [239]:

    
a2d









    Out[239]:





array([[ 10.,  20.,  30.],
       [  9.,   8.,   7.]])

ndarray.ndim

the number of axes (dimensions) of the array. In NumPy, the number of dimensions is referred to as rank.



In [240]:

    
a2d.ndim









    Out[240]:





2

ndarray.shape

the dimensions of the array



In [241]:

    
a2d.shape









    Out[241]:





(2, 3)

This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the rank, or number of dimensions, ndim.

ndarray.size

the total number of elements of the array



In [242]:

    
a2d.size









    Out[242]:





6

Note that size is not equal to len(). The latter returns the length of the first dimension.



In [243]:

    
len(a2d)









    Out[243]:





2

Statistical methods of arrays



In [306]:

    
print('array a1d                       :',a1d)
print('Minimum and maximum             :', a1d.min(), a1d.max())
print('Sum and product of all elements :', a1d.sum(), a1d.prod())
print('Mean and standard deviation     :', a1d.mean(), a1d.std())









    



array a1d                       : [3 4 5 6]
Minimum and maximum             : 3 6
Sum and product of all elements : 18 360
Mean and standard deviation     : 4.5 1.11803398875

Operations over a given axis



In [314]:

    
print(a2d)
print('sum  :',a2d.sum())
print('sum  :',a2d.sum(axis=0))
print('sum  :',a2d.sum(axis=1))









    



[[ 10.  20.  30.]
 [  9.   8.   7.]]
sum  : 84.0
sum  : [ 19.  28.  37.]
sum  : [ 60.  24.]

Vectorisation: operations on whole arrays



In [245]:

    
np.exp(a/100.)/a









    Out[245]:





array([ 1.01005017,  0.51010067,  0.34348484,  0.26020269])



In [246]:

    
# Non-vectorised
r=np.zeros(a.shape)   # create empy array for results

for i in range(len(a)):
    r[i] = np.exp(a[i]/100.)/a[i]
r









    Out[246]:





array([ 1.01005017,  0.51010067,  0.34348484,  0.26020269])

Vectorization is generally faster than a for loop. But it is not always possible or the most readable: for i in range(len(a)): if isprime(i): r[i] = a[i] else r[i]= a[i]+a[i-1]

Array indexing

Indices begin at 0, like other Python sequences and C/C++. Note that many languages, such as Matlab, R and Fortran, start with 1
In 2D, the first dimension corresponds to rows, the second to columns.
The fastest varying dimension is the last dimension! The outer level of the hierarchy is the first dimension.



In [247]:

    
a = np.arange(10, 100, 10)
a









    Out[247]:





array([10, 20, 30, 40, 50, 60, 70, 80, 90])



In [248]:

    
a[2:9:3] # [start:end:step]









    Out[248]:





array([30, 60, 90])



In [249]:

    
a[:3] # last is not included









    Out[249]:





array([10, 20, 30])



In [250]:

    
a[-2] # negative index counts from the end









    Out[250]:





80

Using indexes: how to calculate x[i]-x[i-1] without a loop?



In [251]:

    
x = np.random.rand(6)
x









    Out[251]:





array([ 0.80490041,  0.1864352 ,  0.45164558,  0.8112803 ,  0.41884847,
        0.22717352])



In [252]:

    
xs = np.sort(x)
xs









    Out[252]:





array([ 0.1864352 ,  0.22717352,  0.41884847,  0.45164558,  0.80490041,
        0.8112803 ])



In [253]:

    
xs[1:] - xs[:-1]









    Out[253]:





array([ 0.04073832,  0.19167494,  0.03279711,  0.35325483,  0.00637989])

Exercise 1

Create a 2D NumPy array from the following list and assign it to the variable "arr":



In [254]:

    
# [[2, 3.2, 5.5, -6.4, -2.2, 2.4],
#  [1, 22, 4, 0.1, 5.3, -9],
#  [3, 1, 2.1, 21, 1.1, -2]]

Can you guess what the following slices are equal to? Print them to check your understanding.



In [255]:

    
# a[:, 3]



In [256]:

    
# a[1:4, 0:4]



In [257]:

    
# a[1:, 2]

Fancy indexing

NumPy arrays can be indexed with slices, but also with boolean or integer arrays (masks)



In [258]:

    
a = np.random.randint(1, 100, 6) # array of 6 random integers between 1 and 100
a









    Out[258]:





array([18, 99, 19,  7, 29, 36])



In [259]:

    
mask = ( a % 3 == 0 ) # Where divisible by 3 (% is the modulus operator).
mask









    Out[259]:





array([ True,  True, False, False, False,  True], dtype=bool)



In [260]:

    
a[mask]









    Out[260]:





array([18, 99, 36])

Exercise 2

Consider an 4x5 2D array of negative integers:



In [261]:

    
a = np.arange(-100, 0, 5).reshape(4, 5)
a









    Out[261]:





array([[-100,  -95,  -90,  -85,  -80],
       [ -75,  -70,  -65,  -60,  -55],
       [ -50,  -45,  -40,  -35,  -30],
       [ -25,  -20,  -15,  -10,   -5]])

Suppose you want to return an array result, which has the squared value when an element in array a is greater than -90 and less than -40, and is 1 otherwise.

With a For loop it would look like this:



In [262]:

    
result = np.zeros(a.shape, dtype=a.dtype)

for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        if a[i, j] > -90 and a[i, j] < -40:
            result[i, j] = a[i, j]**2
        else:
            result[i, j] = 1
            
result









    Out[262]:





array([[   1,    1,    1, 7225, 6400],
       [5625, 4900, 4225, 3600, 3025],
       [2500, 2025,    1,    1,    1],
       [   1,    1,    1,    1,    1]])

But a more less verbose and quicker approach would be:



In [263]:

    
condition = (a > -90) & (a < -40)
condition









    Out[263]:





array([[False, False, False,  True,  True],
       [ True,  True,  True,  True,  True],
       [ True,  True, False, False, False],
       [False, False, False, False, False]], dtype=bool)



In [264]:

    
result[condition] = a[condition]**2
result[~condition] = 1
print(result)









    



[[   1    1    1 7225 6400]
 [5625 4900 4225 3600 3025]
 [2500 2025    1    1    1]
 [   1    1    1    1    1]]

A one-liner using `np.where`:



In [265]:

    
result = np.where(condition, a**2, 1)
print(result)









    



[[   1    1    1 7225 6400]
 [5625 4900 4225 3600 3025]
 [2500 2025    1    1    1]
 [   1    1    1    1    1]]

Masked arrays - how to handle (propagating) missing values

All operations related to masked arrays live in numpy.ma submodule.

The simplest example of manual creation of a masked array:



In [266]:

    
a = np.ma.masked_array(data=[1, 2, 3],
                       mask=[True, True, False],
                       fill_value=-999)
a









    Out[266]:





masked_array(data = [-- -- 3],
             mask = [ True  True False],
       fill_value = -999)

Often, a task is to mask array depending on a criterion.



In [267]:

    
a = np.linspace(1, 15, 15)



In [268]:

    
masked_a = np.ma.masked_greater_equal(a, 11)



In [269]:

    
masked_a









    Out[269]:





masked_array(data = [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 -- -- -- -- --],
             mask = [False False False False False False False False False False  True  True
  True  True  True],
       fill_value = 1e+20)

Exercise 3

Create a "data" array of 30 linearly spaced numbers in the interval (1, 15)
Hint: use np.linspace or np.arange functions
Create a condition - a True/False (boolean) array, corresponding to the masked values
The data array should be masked where elements are
- greater than 1
- less or equal than 12
- divisible by 2.
Mask the array depending on the condition
Hint: use np.where function



In [270]:

    
# Your code:
# arr =



In [271]:

    
# Your code:
# condition =



In [272]:

    
# masked_arr = np.ma.masked_where(condition, arr)
# print(masked_arr)

Shape manipulation



In [273]:

    
a = np.array([[1, 2, 3], [4, 5, 6]])



In [274]:

    
print('{} <-- array'.format(a))
print('{} <-- its shape'.format(a.shape))









    



[[1 2 3]
 [4 5 6]] <-- array
(2, 3) <-- its shape



In [275]:

    
a.flatten()









    Out[275]:





array([1, 2, 3, 4, 5, 6])



In [315]:

    
a.repeat(4)









    Out[315]:





array([1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6,
       6])



In [278]:

    
a.reshape((3, 2))









    Out[278]:





array([[1, 2],
       [3, 4],
       [5, 6]])



In [279]:

    
print('Old shape: {}'.format(a.shape))
print('New shape: {}'.format(a.reshape((3, 2)).shape))









    



Old shape: (2, 3)
New shape: (3, 2)

Add a dimension



In [280]:

    
a[..., np.newaxis].shape









    Out[280]:





(2, 3, 1)

Exercise 4

Generate a 2d array with 5x5. The first value is 0 and it grows left to right and top to bottom in increments on 0.1.



In [281]:

    
e4 = np.arange(0.,2.5,.1)
e4









    Out[281]:





array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9,  1. ,
        1.1,  1.2,  1.3,  1.4,  1.5,  1.6,  1.7,  1.8,  1.9,  2. ,  2.1,
        2.2,  2.3,  2.4])



In [282]:

    
e4.reshape([5,5])









    Out[282]:





array([[ 0. ,  0.1,  0.2,  0.3,  0.4],
       [ 0.5,  0.6,  0.7,  0.8,  0.9],
       [ 1. ,  1.1,  1.2,  1.3,  1.4],
       [ 1.5,  1.6,  1.7,  1.8,  1.9],
       [ 2. ,  2.1,  2.2,  2.3,  2.4]])

Broadcasting

The fact that NumPy operates on an element-wise basis means that in principle arrays must always match one another's shape. However, NumPy will also helpfully "broadcast" dimensions when possible.

Image source