Python the basics: numpy

DS Data manipulation, analysis and visualisation in Python
December, 2019

© 2016, Joris Van den Bossche and Stijn Van Hoey (mailto:jorisvandenbossche@gmail.com, mailto:stijnvanhoey@gmail.com). Licensed under CC BY 4.0 Creative Commons


This notebook is largely based on material of the Python Scientific Lecture Notes (https://scipy-lectures.github.io/), adapted with some exercises.


In [1]:
%matplotlib inline

Numpy - multidimensional data arrays

Introduction

NumPy is the fundamental package for scientific computing with Python. It contains among other things:

  • a powerful N-dimensional array/vector/matrix object
  • sophisticated (broadcasting) functions
  • function implementation in C/Fortran assuring good performance if vectorized
  • tools for integrating C/C++ and Fortran code
  • useful linear algebra, Fourier transform, and random number capabilities

This style of programming is also known as array-oriented computing. The recommended convention to import numpy is:


In [2]:
import numpy as np

In the numpy package the terminology used for vectors, matrices and higher-dimensional data sets is array. Let's already load some other modules too.


In [3]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')

Showcases

Roll the dice

You like to play board games, but you want a better idea of your chances of rolling a certain combination with two dice:


In [4]:
def mydices(throws):
    """
    Function to create the distribution of the sum of two dice.
    
    Parameters
    ----------
    throws : int
        Number of throws with the dice
    """
    # note: uniform draws continuous values in [1, 6), not integer faces
    stone1 = np.random.uniform(1, 6, throws) 
    stone2 = np.random.uniform(1, 6, throws) 
    total = stone1 + stone2
    return plt.hist(total, bins=20) # We use matplotlib to show a histogram
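Note that np.random.uniform returns continuous floats, so the function above approximates a smooth triangular distribution rather than the eleven discrete sums (2 to 12) of real dice. A minimal sketch with integer faces could look as follows, assuming the Generator API (np.random.default_rng) is available; the name dice_sums and the seed value are illustrative choices, not part of the original notebook:

```python
import numpy as np

def dice_sums(throws, seed=None):
    """Return the sum of two fair six-sided dice for each throw."""
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so 7 yields faces 1..6
    die1 = rng.integers(1, 7, size=throws)
    die2 = rng.integers(1, 7, size=throws)
    return die1 + die2

totals = dice_sums(10_000, seed=0)
# count how often each sum occurs and print the relative frequencies
values, counts = np.unique(totals, return_counts=True)
for value, count in zip(values, counts):
    print(value, count / totals.size)
```

The relative frequencies should peak around a sum of 7, which has the most combinations (six out of 36).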

In [5]:
mydices(100) # test this out with multiple options


Out[5]:
(array([ 1.,  2.,  0.,  5.,  4., 11.,  8.,  7.,  4., 12.,  9., 10.,  3.,
         7.,  5.,  4.,  4.,  2.,  0.,  2.]),
 array([ 2.66822215,  3.12954778,  3.59087341,  4.05219905,  4.51352468,
         4.97485031,  5.43617595,  5.89750158,  6.35882721,  6.82015285,
         7.28147848,  7.74280411,  8.20412975,  8.66545538,  9.12678101,
         9.58810665, 10.04943228, 10.51075791, 10.97208355, 11.43340918,
        11.89473481]),
 <a list of 20 Patch objects>)

Cartesian2Polar

Consider a random 10x2 matrix representing Cartesian coordinates. How do we convert them to polar coordinates?


In [6]:
# random numbers (X, Y in 2 columns)
Z = np.random.random((10,2))
X, Y = Z[:,0], Z[:,1]

# distance
R = np.sqrt(X**2 + Y**2)
# angle
T = np.arctan2(Y, X) # Array of angles in radians
Tdegree = T*180/(np.pi) # If you like degrees more

# NEXT PART (now for illustration)
#plot the cartesian coordinates
plt.figure(figsize=(14, 6))
ax1 = plt.subplot(121)
ax1.plot(Z[:,0], Z[:,1], 'o')
ax1.set_title("Cartesian")
#plot the polar coordinates
ax2 = plt.subplot(122, polar=True)
ax2.plot(T, R, 'o')
ax2.set_title("Polar")


Out[6]:
Text(0.5, 1.05, 'Polar')

Speed

Memory-efficient container that provides fast numerical operations:


In [7]:
L = range(1000)
%timeit [i**2 for i in L]


231 µs ± 14.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]:
a = np.arange(1000)
%timeit a**2


1.42 µs ± 55.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [9]:
#More information about array?
np.array?

Creating numpy arrays

There are a number of ways to initialize new numpy arrays, for example from

  • a Python list or tuples
  • using functions that are dedicated to generating numpy arrays, such as arange, linspace, etc.
  • reading data from files

From lists

For example, to create new vector and matrix arrays from Python lists we can use the numpy.array function.


In [10]:
# a vector: the argument to the array function is a Python list
V = np.array([1, 2, 3, 4])
V


Out[10]:
array([1, 2, 3, 4])

In [11]:
# a matrix: the argument to the array function is a nested Python list
M = np.array([[1, 2], [3, 4]])
M


Out[11]:
array([[1, 2],
       [3, 4]])

The V and M objects are both of the type ndarray that the numpy module provides.


In [12]:
type(V), type(M)


Out[12]:
(numpy.ndarray, numpy.ndarray)

The difference between the V and M arrays is only their shapes. We can get information about the shape of an array by using the ndarray.shape property.


In [13]:
V.shape


Out[13]:
(4,)

In [14]:
M.shape


Out[14]:
(2, 2)

The number of elements in the array is available through the ndarray.size property:


In [15]:
M.size


Out[15]:
4

Equivalently, we could use the functions numpy.shape and numpy.size:


In [16]:
np.shape(M)


Out[16]:
(2, 2)

In [17]:
np.size(M)


Out[17]:
4

Using the dtype (data type) property of an ndarray, we can see what type the data of an array has (it is always fixed for each array, cf. Matlab):


In [18]:
M.dtype


Out[18]:
dtype('int64')

We get an error if we try to assign a value of the wrong type to an element in a numpy array:


In [19]:
#M[0,0] = "hello"  #uncomment this cell
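As a small sketch of what the commented-out line does, the example below catches the error instead of stopping the notebook. It also shows a case that is easy to miss: a float assigned to an integer array raises no error but is truncated. The array is re-created here so the block is self-contained.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])   # integer array

try:
    M[0, 0] = "hello"            # a string cannot be converted to an integer
except ValueError as err:
    error_message = str(err)
    print("ValueError:", error_message)

# a float, on the other hand, is silently truncated to fit the integer dtype
M[0, 0] = 9.7
print(M[0, 0], M.dtype)
```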

In [20]:
f = np.array(['Bonjour', 'Hello', 'Hallo',])
f


Out[20]:
array(['Bonjour', 'Hello', 'Hallo'], dtype='<U7')

If we want, we can explicitly define the type of the array data when we create it, using the dtype keyword argument:


In [21]:
M = np.array([[1, 2], [3, 4]], dtype=complex)  # or float, np.float64, np.int64, ...

print(M, '\n', M.dtype)


[[1.+0.j 2.+0.j]
 [3.+0.j 4.+0.j]] 
 complex128

Since Numpy arrays are statically typed, the type of an array does not change once created. But we can explicitly cast an array of some type to another using the astype method (see also the similar asarray function). This always creates a new array of the new type:


In [22]:
M = np.array([[1, 2], [3, 4]], dtype=float)
M2 = M.astype(int)
M2


Out[22]:
array([[1, 2],
       [3, 4]])
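The following sketch illustrates two details of casting, using made-up values: astype(int) drops the fractional part, and asarray returns the input itself when no conversion is needed, whereas astype produces a new array.

```python
import numpy as np

M = np.array([1.7, -1.7, 2.0])
M_int = M.astype(int)        # new array; the fractional part is dropped
print(M_int)                 # values are truncated toward zero
print(M.dtype, M_int.dtype)  # the original keeps its float dtype

same = np.asarray(M)         # no conversion needed, so no copy is made
print(same is M)
```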

Common types that can be used with dtype are: int, float, complex, bool, object, etc.

We can also explicitly define the bit size of the data types, for example: int64, int16, float64, float128, complex128 (the extended-precision float128 is not available on every platform).

Higher-dimensional arrays are also possible:


In [23]:
C = np.array([[[1], [2]], [[3], [4]]])
print(C.shape)
C


(2, 2, 1)
Out[23]:
array([[[1],
        [2]],

       [[3],
        [4]]])

In [24]:
C.ndim # number of dimensions


Out[24]:
3

Using array-generating functions

For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead we can use one of the many functions in numpy that generate arrays of different forms. Some of the more common are:

arange


In [25]:
# create a range
x = np.arange(0, 10, 1) # arguments: start, stop, step
x


Out[25]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [26]:
x = np.arange(-1, 1, 0.1)
x


Out[26]:
array([-1.00000000e+00, -9.00000000e-01, -8.00000000e-01, -7.00000000e-01,
       -6.00000000e-01, -5.00000000e-01, -4.00000000e-01, -3.00000000e-01,
       -2.00000000e-01, -1.00000000e-01, -2.22044605e-16,  1.00000000e-01,
        2.00000000e-01,  3.00000000e-01,  4.00000000e-01,  5.00000000e-01,
        6.00000000e-01,  7.00000000e-01,  8.00000000e-01,  9.00000000e-01])
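The output above contains -2.22044605e-16 where one would expect 0: with a floating-point step, arange accumulates small rounding errors. When the number of points matters more than the exact step, linspace is often the safer choice. A minimal sketch, where endpoint=False mimics arange by excluding the stop value:

```python
import numpy as np

# 20 points from -1 up to (but not including) 1, i.e. a step of 0.1
x = np.linspace(-1, 1, 20, endpoint=False)
print(x.size, x[0], x[-1])
print(np.allclose(np.diff(x), 0.1))   # all steps are (approximately) 0.1
```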

linspace and logspace


In [27]:
# using linspace, both end points ARE included
np.linspace(0, 10, 25)


Out[27]:
array([ 0.        ,  0.41666667,  0.83333333,  1.25      ,  1.66666667,
        2.08333333,  2.5       ,  2.91666667,  3.33333333,  3.75      ,
        4.16666667,  4.58333333,  5.        ,  5.41666667,  5.83333333,
        6.25      ,  6.66666667,  7.08333333,  7.5       ,  7.91666667,
        8.33333333,  8.75      ,  9.16666667,  9.58333333, 10.        ])

In [28]:
np.logspace(0, 10, 10, base=np.e)


Out[28]:
array([1.00000000e+00, 3.03773178e+00, 9.22781435e+00, 2.80316249e+01,
       8.51525577e+01, 2.58670631e+02, 7.85771994e+02, 2.38696456e+03,
       7.25095809e+03, 2.20264658e+04])

In [29]:
plt.plot(np.logspace(0, 10, 10, base=np.e), np.random.random(10), 'o')
plt.xscale('log')
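To make explicit what logspace computes: it raises the base to exponents that are evenly spaced between the start and stop values. The sketch below rebuilds the same array by hand with linspace and exp:

```python
import numpy as np

a = np.logspace(0, 10, 10, base=np.e)
b = np.exp(np.linspace(0, 10, 10))   # same values, built by hand
print(np.allclose(a, b))
```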


random data


In [30]:
# uniform random numbers in [0, 1)
np.random.rand(5,5)


Out[30]:
array([[0.64458006, 0.4346515 , 0.66922782, 0.78279927, 0.04740032],
       [0.86639841, 0.32640866, 0.35476706, 0.71548929, 0.51360575],
       [0.39717636, 0.27168644, 0.1870705 , 0.64932543, 0.97227284],
       [0.28626098, 0.27471716, 0.30494562, 0.69197046, 0.86236139],
       [0.47811521, 0.61643596, 0.2185436 , 0.52716391, 0.37536285]])

In [31]:
# standard normal distributed random numbers
np.random.randn(5,5)


Out[31]:
array([[-0.78564868, -0.2108072 , -0.47394221, -0.33717412, -0.97584212],
       [-0.75266178,  1.90747325, -1.31596781, -2.0986146 , -0.14433166],
       [ 0.9314924 , -0.8233683 , -0.04168856,  1.16942105, -0.45402032],
       [-0.73200415, -1.37982703,  0.05351722, -0.24567728, -1.51127797],
       [ 0.41053237,  1.02658263,  0.80739941, -1.81270136,  1.61348214]])
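Random arrays differ on every run, which makes results hard to reproduce. One possible approach, sketched here under the assumption that the Generator API (np.random.default_rng) is available, is to create a seeded generator; the seed value 42 is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(seed=42)      # the seed value is arbitrary
u = rng.random((5, 5))                    # uniform on [0, 1)
n = rng.standard_normal((5, 5))           # mean 0, standard deviation 1

# a generator created with the same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(u, rng_again.random((5, 5))))
```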

zeros and ones


In [32]:
np.zeros((3,3))


Out[32]:
array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [33]:
np.ones((3,3))


Out[33]:
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
EXERCISE: Create a vector with values ranging from 10 to 49 with steps of 1

In [34]:
np.arange(10, 50, 1)


Out[34]:
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45, 46, 47, 48, 49])
EXERCISE: Create a 3x3 identity matrix (look into docs!)

In [35]:
np.identity(3)


Out[35]:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [36]:
np.eye(3)


Out[36]:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
EXERCISE: Create a 3x3x3 array with random values

In [37]:
np.random.random((3, 3, 3))


Out[37]:
array([[[0.21258258, 0.35914899, 0.37324672],
        [0.34780917, 0.28225533, 0.35082236],
        [0.48911534, 0.48029672, 0.68990601]],

       [[0.94963152, 0.28852249, 0.18588925],
        [0.70317602, 0.27571899, 0.52066804],
        [0.09559004, 0.76897268, 0.98323306]],

       [[0.95384981, 0.53651399, 0.75836151],
        [0.6859915 , 0.13839845, 0.69834657],
        [0.38832489, 0.75572136, 0.96867562]]])

File I/O

Numpy is capable of reading and writing text and binary formats. However, since most data sources provide information in a format with headings, different dtypes, etc., we will rely on Pandas for reading and writing text files.

Comma-separated values (CSV)

Writing to a CSV file with numpy is done with the savetxt function:


In [38]:
a = np.random.random(40).reshape((20, 2))
np.savetxt("random-matrix.csv", a, delimiter=",")

To read data from such a file into Numpy arrays we can use the numpy.genfromtxt function. For example,


In [39]:
a2 = np.genfromtxt("random-matrix.csv", delimiter=',')
a2


Out[39]:
array([[0.61268435, 0.62295097],
       [0.08599135, 0.83475067],
       [0.85771508, 0.02800516],
       [0.99292391, 0.81868069],
       [0.34043113, 0.92060917],
       [0.69390076, 0.18589285],
       [0.17166723, 0.4124914 ],
       [0.02749709, 0.08037973],
       [0.06226372, 0.13779169],
       [0.7258197 , 0.58831096],
       [0.96142289, 0.16428051],
       [0.15823336, 0.06869432],
       [0.55920299, 0.33070503],
       [0.85903107, 0.05515984],
       [0.93159072, 0.61821204],
       [0.84106524, 0.64970104],
       [0.37907052, 0.33894076],
       [0.8569883 , 0.25512972],
       [0.13540005, 0.64918829],
       [0.45793494, 0.80791084]])

Numpy's native file format

Numpy's binary format is useful for storing array data and reading it back, since it preserves the dtype and shape of the array. Use the functions numpy.save and numpy.load:


In [40]:
np.save("random-matrix.npy", a)

!file random-matrix.npy


random-matrix.npy: data

In [41]:
np.load("random-matrix.npy")


Out[41]:
array([[0.61268435, 0.62295097],
       [0.08599135, 0.83475067],
       [0.85771508, 0.02800516],
       [0.99292391, 0.81868069],
       [0.34043113, 0.92060917],
       [0.69390076, 0.18589285],
       [0.17166723, 0.4124914 ],
       [0.02749709, 0.08037973],
       [0.06226372, 0.13779169],
       [0.7258197 , 0.58831096],
       [0.96142289, 0.16428051],
       [0.15823336, 0.06869432],
       [0.55920299, 0.33070503],
       [0.85903107, 0.05515984],
       [0.93159072, 0.61821204],
       [0.84106524, 0.64970104],
       [0.37907052, 0.33894076],
       [0.8569883 , 0.25512972],
       [0.13540005, 0.64918829],
       [0.45793494, 0.80791084]])
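The two round trips can be compared side by side. The sketch below writes to a temporary directory so that it leaves no files behind; the file names example.csv and example.npy are placeholders chosen for this illustration.

```python
import os
import tempfile

import numpy as np

data = np.random.random((20, 2))

with tempfile.TemporaryDirectory() as tmpdir:
    csv_path = os.path.join(tmpdir, "example.csv")   # hypothetical file names
    npy_path = os.path.join(tmpdir, "example.npy")

    np.savetxt(csv_path, data, delimiter=",")
    np.save(npy_path, data)

    from_csv = np.genfromtxt(csv_path, delimiter=",")
    from_npy = np.load(npy_path)

print(np.allclose(data, from_csv))     # text: values parsed back from decimal strings
print(np.array_equal(data, from_npy))  # binary: values stored as-is
print(from_npy.dtype, from_npy.shape)
```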

Manipulating arrays

Indexing

MATLAB-USERS:
PYTHON STARTS AT 0!

We can index elements in an array using the square bracket and indices:


In [42]:
V


Out[42]:
array([1, 2, 3, 4])

In [43]:
# V is a vector, and has only one dimension, taking one index
V[0]


Out[43]:
1

In [44]:
V[-1:]  #-2, -2:,...


Out[44]:
array([4])

In [45]:
# a is a matrix, or a 2 dimensional array, taking two indices 
# the first dimension corresponds to rows, the second to columns.
a[1, 1]


Out[45]:
0.8347506700516859

If we omit an index of a multidimensional array, the whole row is returned (or, in general, an N-1 dimensional array):


In [46]:
a[1]


Out[46]:
array([0.08599135, 0.83475067])

The same thing can be achieved by using : instead of an index:


In [47]:
a[1, :] # row 1


Out[47]:
array([0.08599135, 0.83475067])

In [48]:
a[:, 1] # column 1


Out[48]:
array([0.62295097, 0.83475067, 0.02800516, 0.81868069, 0.92060917,
       0.18589285, 0.4124914 , 0.08037973, 0.13779169, 0.58831096,
       0.16428051, 0.06869432, 0.33070503, 0.05515984, 0.61821204,
       0.64970104, 0.33894076, 0.25512972, 0.64918829, 0.80791084])

We can assign new values to elements in an array using indexing:


In [49]:
a[0, 0] = 1
a[:, 1] = -1
a


Out[49]:
array([[ 1.        , -1.        ],
       [ 0.08599135, -1.        ],
       [ 0.85771508, -1.        ],
       [ 0.99292391, -1.        ],
       [ 0.34043113, -1.        ],
       [ 0.69390076, -1.        ],
       [ 0.17166723, -1.        ],
       [ 0.02749709, -1.        ],
       [ 0.06226372, -1.        ],
       [ 0.7258197 , -1.        ],
       [ 0.96142289, -1.        ],
       [ 0.15823336, -1.        ],
       [ 0.55920299, -1.        ],
       [ 0.85903107, -1.        ],
       [ 0.93159072, -1.        ],
       [ 0.84106524, -1.        ],
       [ 0.37907052, -1.        ],
       [ 0.8569883 , -1.        ],
       [ 0.13540005, -1.        ],
       [ 0.45793494, -1.        ]])

Index slicing

Index slicing is the technical name for the syntax M[lower:upper:step] to extract part of an array:


In [50]:
A = np.array([1, 2, 3, 4, 5])
A


Out[50]:
array([1, 2, 3, 4, 5])

In [51]:
A[1:3]


Out[51]:
array([2, 3])

Array slices are mutable: if they are assigned a new value the original array from which the slice was extracted is modified:


In [52]:
A[1:3] = [-2,-3]

A


Out[52]:
array([ 1, -2, -3,  4,  5])

We can omit any of the three parameters in M[lower:upper:step]:


In [53]:
A[::] # lower, upper, step all take the default values


Out[53]:
array([ 1, -2, -3,  4,  5])

In [54]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array


Out[54]:
array([ 1, -3,  5])

In [55]:
A[:3] # first three elements


Out[55]:
array([ 1, -2, -3])

In [56]:
A[3:] # elements from index 3


Out[56]:
array([4, 5])

In [57]:
A[-3:] # the last three elements


Out[57]:
array([-3,  4,  5])
EXERCISE: Create a null vector of size 10 and adapt it in order to make the fifth element a value 1

In [58]:
vec = np.zeros(10)
vec[4] = 1.
vec


Out[58]:
array([0., 0., 0., 0., 1., 0., 0., 0., 0., 0.])

Fancy indexing

Fancy indexing is the name for when an array or list is used in place of an index:


In [59]:
a = np.arange(0, 100, 10)
a[[2, 3, 2, 4, 2]]


Out[59]:
array([20, 30, 20, 40, 20])

In more dimensions:


In [60]:
A = np.arange(25).reshape(5,5)
A


Out[60]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [61]:
row_indices = [1, 2, 3]
A[row_indices]


Out[61]:
array([[ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [62]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]


Out[62]:
array([ 6, 12, 19])
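Note that passing two index lists selects element pairs, (1, 1), (2, 2) and (3, -1) above, not a sub-matrix. If the full block of rows and columns is wanted, one option is np.ix_, sketched here with the same index lists:

```python
import numpy as np

A = np.arange(25).reshape(5, 5)
row_indices = [1, 2, 3]
col_indices = [1, 2, -1]

pairs = A[row_indices, col_indices]          # elements (1,1), (2,2), (3,-1)
block = A[np.ix_(row_indices, col_indices)]  # every selected row x every selected column

print(pairs)
print(block)
```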

We can also use index masks: if the index mask is a Numpy array with data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position of each element:


In [63]:
B = np.array([n for n in range(5)])  #range is pure python => Exercise: Make this shorter with pure numpy
B


Out[63]:
array([0, 1, 2, 3, 4])

In [64]:
row_mask = np.array([True, False, True, False, False])
B[row_mask]


Out[64]:
array([0, 2])

In [65]:
# same thing
row_mask = np.array([1,0,1,0,0], dtype=bool)
B[row_mask]


Out[65]:
array([0, 2])

This feature is very useful to conditionally select elements from an array, using for example comparison operators:


In [66]:
AR = np.random.randint(0, 20, 15)
AR


Out[66]:
array([14,  7,  9,  3, 13,  0, 19,  4, 16,  2,  8,  4,  0, 15,  8])

In [67]:
AR%3 == 0


Out[67]:
array([False, False,  True,  True, False,  True, False, False, False,
       False, False, False,  True,  True, False])

In [68]:
extract_from_AR = AR[AR%3 == 0]
extract_from_AR


Out[68]:
array([ 9,  3,  0,  0, 15])

In [69]:
x = np.arange(0, 10, 0.5)
x


Out[69]:
array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. , 5.5, 6. ,
       6.5, 7. , 7.5, 8. , 8.5, 9. , 9.5])

In [70]:
mask = (5 < x) * (x < 7.5)  # We actually multiply two masks here (boolean 0 and 1 values)
mask


Out[70]:
array([False, False, False, False, False, False, False, False, False,
       False, False,  True,  True,  True,  True, False, False, False,
       False, False])

In [71]:
x[mask]


Out[71]:
array([5.5, 6. , 6.5, 7. ])
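Multiplying boolean masks works, but the element-wise logical operators & (and), | (or) and ~ (not) state the intent more directly. A short sketch on the same x:

```python
import numpy as np

x = np.arange(0, 10, 0.5)

# parentheses are required: & and | bind more tightly than the comparisons
inside = (5 < x) & (x < 7.5)     # logical AND
outside = (x <= 5) | (x >= 7.5)  # logical OR
print(x[inside])
print(np.array_equal(outside, ~inside))  # ~ negates a boolean mask
```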
EXERCISE: Swap the first two rows of the 2-D array `A`

In [72]:
A = np.arange(25).reshape(5,5)
A


Out[72]:
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

In [73]:
#SWAP
A[[0, 1]] = A[[1, 0]]
A


Out[73]:
array([[ 5,  6,  7,  8,  9],
       [ 0,  1,  2,  3,  4],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
EXERCISE: Change all even numbers of `AR` into zero-values.

In [74]:
AR = np.random.randint(0, 20, 15)
AR


Out[74]:
array([ 3,  1, 16,  7,  3,  9,  5,  2, 19, 19,  2,  6, 15, 12,  9])

In [75]:
AR[AR%2==0] = 0.
AR


Out[75]:
array([ 3,  1,  0,  7,  3,  9,  5,  0, 19, 19,  0,  0, 15,  0,  9])
EXERCISE: Change all even positions (the 2nd, 4th, ... element) of the vector `AR` into zero-values

In [76]:
AR = np.random.randint(1, 20, 15)
AR


Out[76]:
array([ 8, 17,  3, 13, 13,  6, 12,  4, 12,  2,  3, 11,  4,  1,  5])

In [77]:
AR[1::2] = 0
AR


Out[77]:
array([ 8,  0,  3,  0, 13,  0, 12,  0, 12,  0,  3,  0,  4,  0,  5])

Some more extraction functions

The where function returns the indices of the elements that satisfy a condition:


In [78]:
x = np.arange(0, 10, 0.5)
np.where(x>5.)


Out[78]:
(array([11, 12, 13, 14, 15, 16, 17, 18, 19]),)
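With a single argument, where returns a tuple with one index array per dimension, which is why the output above is wrapped in parentheses. With three arguments it acts as an element-wise if/else. A brief sketch of both forms:

```python
import numpy as np

x = np.arange(0, 10, 0.5)

indices = np.where(x > 5.)[0]        # unpack the 1-tuple to get the index array
clipped = np.where(x > 5., x, 0.)    # keep x where the condition holds, else 0

print(indices)
print(clipped)
```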

With the diag function we can also extract the diagonal and subdiagonals of an array:


In [79]:
np.diag(A)


Out[79]:
array([ 5,  1, 12, 18, 24])

The take function is similar to fancy indexing described above:


In [80]:
x.take([1, 5])


Out[80]:
array([0.5, 2.5])

Linear algebra

Vectorizing code is the key to writing efficient numerical calculations with Python/Numpy. That means that as much as possible of a program should be formulated in terms of matrix and vector operations.

Scalar-array operations

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.


In [81]:
v1 = np.arange(0, 5)

In [82]:
v1 * 2


Out[82]:
array([0, 2, 4, 6, 8])

In [83]:
v1 + 2


Out[83]:
array([2, 3, 4, 5, 6])

In [84]:
A = np.arange(25).reshape(5,5)
A * 2


Out[84]:
array([[ 0,  2,  4,  6,  8],
       [10, 12, 14, 16, 18],
       [20, 22, 24, 26, 28],
       [30, 32, 34, 36, 38],
       [40, 42, 44, 46, 48]])

In [85]:
np.sin(A) #np.log(A), np.arctan,...


Out[85]:
array([[ 0.        ,  0.84147098,  0.90929743,  0.14112001, -0.7568025 ],
       [-0.95892427, -0.2794155 ,  0.6569866 ,  0.98935825,  0.41211849],
       [-0.54402111, -0.99999021, -0.53657292,  0.42016704,  0.99060736],
       [ 0.65028784, -0.28790332, -0.96139749, -0.75098725,  0.14987721],
       [ 0.91294525,  0.83665564, -0.00885131, -0.8462204 , -0.90557836]])

Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is element-wise operations:


In [86]:
A * A # element-wise multiplication


Out[86]:
array([[  0,   1,   4,   9,  16],
       [ 25,  36,  49,  64,  81],
       [100, 121, 144, 169, 196],
       [225, 256, 289, 324, 361],
       [400, 441, 484, 529, 576]])

In [87]:
v1 * v1


Out[87]:
array([ 0,  1,  4,  9, 16])

If we multiply arrays with compatible shapes, we get an element-wise multiplication of each row:


In [88]:
A.shape, v1.shape


Out[88]:
((5, 5), (5,))

In [89]:
A * v1


Out[89]:
array([[ 0,  1,  4,  9, 16],
       [ 0,  6, 14, 24, 36],
       [ 0, 11, 24, 39, 56],
       [ 0, 16, 34, 54, 76],
       [ 0, 21, 44, 69, 96]])

Consider the speed difference with pure python:


In [90]:
a = np.arange(10000)
%timeit a + 1  

l = range(10000)
%timeit [i+1 for i in l]


4.55 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
571 µs ± 15.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [91]:
#logical operators:
a1 = np.arange(0, 5, 1)
a2 = np.arange(5, 0, -1)
a1>a2  # >, <=,...


Out[91]:
array([False, False, False,  True,  True])

In [92]:
# cfr. 
np.all(a1>a2) # any


Out[92]:
False

Basic operations on numpy arrays (addition, etc.) are elementwise. Nevertheless, it's also possible to do operations on arrays of different sizes if Numpy can transform these arrays so that they all have the same size: this conversion is called broadcasting.


In [93]:
A, v1


Out[93]:
(array([[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9],
        [10, 11, 12, 13, 14],
        [15, 16, 17, 18, 19],
        [20, 21, 22, 23, 24]]), array([0, 1, 2, 3, 4]))

In [94]:
A*v1


Out[94]:
array([[ 0,  1,  4,  9, 16],
       [ 0,  6, 14, 24, 36],
       [ 0, 11, 24, 39, 56],
       [ 0, 16, 34, 54, 76],
       [ 0, 21, 44, 69, 96]])

In [95]:
x, y = np.arange(5), np.arange(5).reshape((5, 1)) # a row and a column array

In [96]:
distance = np.sqrt(x ** 2 + y ** 2)
distance


Out[96]:
array([[0.        , 1.        , 2.        , 3.        , 4.        ],
       [1.        , 1.41421356, 2.23606798, 3.16227766, 4.12310563],
       [2.        , 2.23606798, 2.82842712, 3.60555128, 4.47213595],
       [3.        , 3.16227766, 3.60555128, 4.24264069, 5.        ],
       [4.        , 4.12310563, 4.47213595, 5.        , 5.65685425]])
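The rule behind this result: shapes are aligned from the right, and two dimensions are compatible when they are equal or when one of them is 1. The sketch below makes the shapes explicit and shows what happens when they do not match:

```python
import numpy as np

x = np.arange(5)                  # shape (5,)
y = np.arange(5)[:, np.newaxis]   # shape (5, 1), same as reshape((5, 1))

print((x + y).shape)              # (5,) and (5, 1) broadcast to (5, 5)
print(np.broadcast(x, y).shape)   # the broadcast shape, without computing anything

try:
    np.arange(5) + np.arange(4)   # trailing dimensions 5 and 4 do not match
except ValueError as err:
    print("ValueError:", err)
```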

In [97]:
#let's put this in a figure:
plt.pcolor(distance)    
plt.colorbar()


Out[97]:
<matplotlib.colorbar.Colorbar at 0x7f639c719f10>

Matrix algebra

What about matrix multiplication? There are two ways. We can either use the dot function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:


In [98]:
np.dot(A, A)


Out[98]:
array([[ 150,  160,  170,  180,  190],
       [ 400,  435,  470,  505,  540],
       [ 650,  710,  770,  830,  890],
       [ 900,  985, 1070, 1155, 1240],
       [1150, 1260, 1370, 1480, 1590]])

In [99]:
np.dot(A, v1) #check the difference with A*v1 !!


Out[99]:
array([ 30,  80, 130, 180, 230])

In [100]:
np.dot(v1, v1)


Out[100]:
30

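Alternatively, we can cast the array objects to the type matrix. This changes the behavior of the standard arithmetic operators +, -, * to use matrix algebra. Note, however, that the NumPy documentation no longer recommends the matrix class; on regular arrays, the @ operator (Python 3.5+) performs matrix multiplication. Inverses, determinants, etc. are available in numpy.linalg.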

We won't go deeper here on pure matrix calculation, but for more information, check the related functions: inner, outer, cross, kron, tensordot. Try for example help(kron).
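As a brief illustration that stays with regular arrays, the sketch below uses the @ operator together with np.linalg.inv and np.linalg.det. The 2x2 matrix is an arbitrary invertible example.

```python
import numpy as np

A = np.arange(25).reshape(5, 5)
v1 = np.arange(5)

print(np.array_equal(A @ A, np.dot(A, A)))    # @ is matrix multiplication
print(np.array_equal(A @ v1, np.dot(A, v1)))

M = np.array([[1., 2.], [3., 4.]])
M_inv = np.linalg.inv(M)                      # inverse
print(np.linalg.det(M))                       # determinant
print(np.allclose(M @ M_inv, np.eye(2)))      # M times its inverse is the identity
```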

Calculations

Often it is useful to store datasets in Numpy arrays. Numpy provides a number of functions to calculate statistics of datasets in arrays.


In [101]:
a = np.random.random(40)

Different frequently used operations can be done:


In [102]:
print ('Mean value is', np.mean(a))
print ('Median value is',  np.median(a))
print ('Std is', np.std(a))
print ('Variance is', np.var(a))
print ('Min is', a.min())
print ('Element of minimum value is', a.argmin())
print ('Max is', a.max())
print ('Sum is', np.sum(a))
print ('Prod', np.prod(a))
print ('Cumsum is', np.cumsum(a)[-1])
print ('CumProd of 5 first elements is', np.cumprod(a)[4])
print ('Unique values in this array are:', np.unique(np.random.randint(1,6,10)))
print ('85% Percentile value is: ', np.percentile(a, 85))


Mean value is 0.5251721246906176
Median value is 0.6308854426362718
Std is 0.2933333698135453
Variance is 0.08604446584617012
Min is 0.02227123375361928
Element of minimum value is 32
Max is 0.9849681868420118
Sum is 21.0068849876247
Prod 9.015149841925478e-17
Cumsum is 21.006884987624705
CumProd of 5 first elements is 0.08504604649831109
Unique values in this array are: [1 2 3 4 5]
85% Percentile value is:  0.8178104285481629

In [103]:
a = np.random.random(40)
print(a.argsort())
a.sort() #sorts in place!
print(a.argsort())


[23 36 37  0 35 26 17 28 24 32  6 13 10 29  7 19  9 11 15  1  5 31  8 39
 20 16  4 18  3 38 33 22 25 12  2 34 27 30 21 14]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]

Calculations with higher-dimensional data

When functions such as min, max, etc. are applied to multidimensional arrays, it is sometimes useful to apply the calculation to the entire array, and sometimes only on a row or column basis. Using the axis argument we can specify how these functions should behave:


In [104]:
m = np.random.rand(3,3)
m


Out[104]:
array([[0.14470059, 0.28015706, 0.47451703],
       [0.6194547 , 0.70604152, 0.19400837],
       [0.8033055 , 0.5510823 , 0.37995246]])

In [105]:
# global max
m.max()


Out[105]:
0.8033054990697354

In [106]:
# max in each column
m.max(axis=0)


Out[106]:
array([0.8033055 , 0.70604152, 0.47451703])

In [107]:
# max in each row
m.max(axis=1)


Out[107]:
array([0.47451703, 0.70604152, 0.8033055 ])
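A reduction along an axis removes that axis from the result. When the result has to be combined with the original array again, the keepdims argument keeps the reduced axis with length 1 so that broadcasting lines up. A sketch that normalizes each row by its own sum:

```python
import numpy as np

m = np.random.rand(3, 3)

row_sums = m.sum(axis=1)                       # shape (3,)
row_sums_2d = m.sum(axis=1, keepdims=True)     # shape (3, 1)

# dividing by the (3, 1) array scales each row by its own sum
normalized = m / row_sums_2d
print(row_sums.shape, row_sums_2d.shape)
print(normalized.sum(axis=1))                  # each row now sums to (about) 1
```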

Many other functions and methods in the array and matrix classes accept the same (optional) axis keyword argument.

EXERCISE: Rescale the 5x5 matrix `Z` to values between 0 and 1:

In [108]:
Z = np.random.uniform(5.0, 15.0, (5,5))
Z


Out[108]:
array([[14.45702995, 13.10361903, 10.99536834, 12.24469675, 13.66207423],
       [10.56599341, 10.78165881,  5.47217107, 14.41381216,  7.12207055],
       [13.87742206, 10.1709238 , 13.02488306,  6.51297311, 12.1635431 ],
       [10.44349031, 11.6701722 , 12.14480228, 14.28764336,  9.83893798],
       [ 5.81683176, 12.25562637,  8.80536541, 12.97521668,  9.22878351]])

In [109]:
# RESCALE:
(Z - Z.min())/(Z.max() - Z.min())


Out[109]:
array([[1.        , 0.84936759, 0.61472276, 0.75377096, 0.91152274],
       [0.56693404, 0.59093724, 0.        , 0.99518993, 0.1836311 ],
       [0.9354906 , 0.52296344, 0.84060441, 0.11583955, 0.74473869],
       [0.55329965, 0.68982732, 0.74265287, 0.98114755, 0.48601397],
       [0.03836017, 0.75498741, 0.37097904, 0.83507662, 0.41810478]])

Reshaping, resizing and stacking arrays

The shape of a Numpy array can be modified without copying the underlying data, which makes it a fast operation even for large arrays.


In [110]:
A = np.arange(25).reshape(5,5)
n, m = A.shape
B = A.reshape((1,n*m))
B


Out[110]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
        16, 17, 18, 19, 20, 21, 22, 23, 24]])

We can also use the function flatten to make a higher-dimensional array into a vector. But this function creates a copy of the data (see next):


In [111]:
B = A.flatten()
B


Out[111]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

Stacking and repeating arrays

Using the functions repeat, tile, vstack, hstack, and concatenate we can create larger vectors and matrices from smaller ones:

tile and repeat


In [112]:
a = np.array([[1, 2], [3, 4]])

In [113]:
# repeat each element 3 times
np.repeat(a, 3)


Out[113]:
array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

In [114]:
# tile the matrix 3 times 
np.tile(a, 3)


Out[114]:
array([[1, 2, 1, 2, 1, 2],
       [3, 4, 3, 4, 3, 4]])

concatenate


In [115]:
b = np.array([[5, 6]])

In [116]:
np.concatenate((a, b), axis=0)


Out[116]:
array([[1, 2],
       [3, 4],
       [5, 6]])

In [117]:
np.concatenate((a, b.T), axis=1)


Out[117]:
array([[1, 2, 5],
       [3, 4, 6]])

hstack and vstack


In [118]:
np.vstack((a,b))


Out[118]:
array([[1, 2],
       [3, 4],
       [5, 6]])

In [119]:
np.hstack((a,b.T))


Out[119]:
array([[1, 2, 5],
       [3, 4, 6]])

IMPORTANT!: View and Copy

To achieve high performance, assignments in Python usually do not copy the underlying objects. This is important, for example, when objects are passed between functions, to avoid an excessive amount of memory copying when it is not necessary (technical term: pass by reference).


In [120]:
A = np.array([[1, 2], [3, 4]])

A


Out[120]:
array([[1, 2],
       [3, 4]])

In [121]:
# now B is referring to the same array data as A 
B = A

In [122]:
# changing B affects A
B[0,0] = 10

B


Out[122]:
array([[10,  2],
       [ 3,  4]])

In [123]:
A


Out[123]:
array([[10,  2],
       [ 3,  4]])

If we want to avoid this behavior and get a new, completely independent object B copied from A, we need to make a so-called "deep copy" using the function copy:


In [124]:
B = np.copy(A)

In [125]:
# now, if we modify B, A is not affected
B[0,0] = -5

B


Out[125]:
array([[-5,  2],
       [ 3,  4]])

In [126]:
A


Out[126]:
array([[10,  2],
       [ 3,  4]])

The reshape function also returns a view whenever possible:


In [127]:
arr = np.arange(8)
arr_view = arr.reshape(2, 4)

In [128]:
print('Before\n', arr_view)
arr[0] = 1000
print('After\n', arr_view)


Before
 [[0 1 2 3]
 [4 5 6 7]]
After
 [[1000    1    2    3]
 [   4    5    6    7]]

In [129]:
arr.flatten()[2] = 10  #Flatten creates a copy!

In [130]:
arr


Out[130]:
array([1000,    1,    2,    3,    4,    5,    6,    7])
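When it is unclear whether an operation returned a view or a copy, it can be checked rather than guessed. One way, sketched below, is np.shares_memory together with the base attribute, which refers to the array that owns the data (None if the array owns its own data):

```python
import numpy as np

arr = np.arange(8)

view = arr.reshape(2, 4)
copy = arr.flatten()

print(np.shares_memory(arr, view))   # the reshaped array reuses arr's data
print(np.shares_memory(arr, copy))   # flatten allocated new memory
print(view.base is arr, copy.base is None)
```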

Using arrays in conditions

When using arrays in conditions, for example in if statements and other boolean expressions, one needs to use either any or all, which require that any or all elements in the array evaluate to True:


In [131]:
M


Out[131]:
array([[1., 2.],
       [3., 4.]])

In [132]:
if (M > 5).any():
    print("at least one element in M is larger than 5")
else:
    print("no element in M is larger than 5")


no element in M is larger than 5

In [133]:
if (M > 5).all():
    print("all elements in M are larger than 5")
else:
    print("not all elements in M are larger than 5")


not all elements in M are larger than 5
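The reason any or all is needed: a comparison on an array yields an array of booleans, and Python cannot reduce several truth values to a single one on its own. A sketch of what happens when the array is used directly in an if statement (the array is re-created here to keep the block self-contained):

```python
import numpy as np

M = np.array([[1., 2.], [3., 4.]])

try:
    if M > 5:                      # ambiguous: which element should decide?
        print("unreachable")
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

print((M > 5).any(), (M > 2).any(), (M > 0).all())
```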

Some extra applications:

Polynomial fit


In [134]:
b_data = np.genfromtxt("./data/bogota_part_dataset.csv", skip_header=3, delimiter=',')
plt.scatter(b_data[:,2], b_data[:,3])


Out[134]:
<matplotlib.collections.PathCollection at 0x7f639be40c10>

In [135]:
x, y = b_data[:,1], b_data[:,3] 
t = np.polyfit(x, y, 2) # fit a 2nd degree polynomial; t holds the coefficients [a, b, c] of a*x**2 + b*x + c
t


Out[135]:
array([-6.87646029e-04,  6.15707668e-01,  5.97719338e+01])
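polyfit returns the coefficients ordered from the highest power to the lowest, and np.polyval evaluates such a coefficient array at given points. Since the dataset file is not part of this text, the sketch below uses synthetic, noise-free data (an assumption made purely for illustration), so the fit should recover the coefficients that generated it:

```python
import numpy as np

# synthetic, noise-free data (an assumption for illustration)
x = np.linspace(0, 10, 50)
y = 2.0 * x**2 - 3.0 * x + 1.0

coeffs = np.polyfit(x, y, 2)         # [a, b, c] for a*x**2 + b*x + c
fitted = np.polyval(coeffs, x)       # evaluates the polynomial at x

print(coeffs)
print(np.allclose(fitted, coeffs[0]*x**2 + coeffs[1]*x + coeffs[2]))
```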

In [136]:
order = np.argsort(x)  # sort x and y together so the (x, y) pairs stay matched
x, y = x[order], y[order]
plt.plot(x, y, 'o')
plt.plot(x, t[0]*x**2 + t[1]*x + t[2], '-')


Out[136]:
[<matplotlib.lines.Line2D at 0x7f639be059d0>]

EXERCISE: Make a fourth order fit between the fourth and fifth column of `b_data`

In [137]:
x, y = b_data[:,3], b_data[:,4] 
t = np.polyfit(x, y, 4) # fit a 4th degree polynomial; t holds the coefficients, highest power first
order = np.argsort(x)  # sort x and y together so the (x, y) pairs stay matched
x, y = x[order], y[order]
plt.plot(x, y, 'o')
plt.plot(x, t[0]*x**4 + t[1]*x**3 + t[2]*x**2 + t[3]*x + t[4], '-')


Out[137]:
[<matplotlib.lines.Line2D at 0x7f639be92790>]

However, when doing some kind of regression, we would like to have more information about the fit characteristics automatically. Statsmodels is a library that provides this functionality; we will come back to this type of regression problem later.

Moving average function


In [138]:
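def moving_average(a, n=3):
    """Moving average with window n, computed from the cumulative sum."""
    ret = np.cumsum(a, dtype=float)   # note: a multidimensional input is flattened
    ret[n:] = ret[n:] - ret[:-n]      # sum of each window of n values
    return ret[n - 1:] / n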

In [139]:
print(moving_average(b_data , n=3))


[ 79.66666667 106.66666667  87.          40.66666667  54.33333333
 108.33333333 138.         108.66666667  43.          56.66666667
  84.66666667 115.         111.          72.66666667  86.
 134.66666667 170.         160.33333333  73.33333333  82.33333333
 161.33333333 196.66666667 176.33333333  64.33333333  75.33333333
 110.66666667 146.         120.33333333  62.          75.33333333
 114.66666667 149.66666667 124.66666667  65.66666667  80.33333333
  96.33333333 131.66666667  90.66666667  52.          67.33333333
 144.66666667 182.33333333 139.          54.          68.
  88.         126.33333333  80.33333333  54.          69.66666667
  93.         132.          90.          62.          78.
  84.66666667 124.          76.33333333  58.66666667  78.
 101.66666667 141.         106.          76.          96.33333333
 102.         141.33333333  91.          62.66666667  82.66666667
 110.         149.33333333 104.          68.33333333  88.33333333
 115.66666667 159.33333333 121.          80.66666667  97.
 106.33333333 150.33333333  97.          67.66666667  84.
 116.         160.33333333 108.66666667  71.          92.33333333
 111.         160.66666667 101.33333333  75.          92.66666667
 140.         193.66666667 142.33333333  89.66666667 111.66666667
 134.         187.33333333 136.33333333  99.         123.66666667
 143.33333333 203.         137.33333333  94.         113.33333333
 146.33333333 206.         139.66666667  95.33333333 119.33333333
 165.66666667 226.         165.66666667 107.66666667 134.
 131.         196.         123.         103.66666667 152.66666667
 171.         239.66666667 133.         102.         152.
 165.66666667 247.33333333 133.66666667 113.66666667 187.66666667
 221.66666667 303.66666667 154.        ]
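The cumulative-sum trick is one way to compute a moving average. An alternative sketch, not the notebook's implementation, convolves the data with a uniform window; the function name moving_average_conv and the sample series are made up for this example:

```python
import numpy as np

def moving_average_conv(a, n=3):
    """Moving average of a 1-D array using a uniform window of length n."""
    window = np.ones(n) / n
    # mode="valid" keeps only positions where the window fully overlaps a
    return np.convolve(a, window, mode="valid")

series = np.array([1., 2., 3., 4., 5., 6.])   # made-up example data
print(moving_average_conv(series, n=3))
```

For a 1-D input of length N, the result has N - n + 1 values, the same length as the cumulative-sum version returns.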

However, we would expect a good data-analysis library to provide such a function already.

The perfect timing for Python Pandas!

REMEMBER!

  • Know how to create arrays : array, arange, ones, zeros,....

  • Know the shape of the array with array.shape, then use slicing to obtain different views of the array: array[::2], etc. Adjust the shape of the array using reshape or flatten it.

  • Obtain a subset of the elements of an array and/or modify their values with masks

  • Know miscellaneous operations on arrays, such as finding the mean or max (array.max(), array.mean()). No need to retain everything, but have the reflex to search in the documentation (online docs, help(), lookfor())!!

  • For advanced use: master the indexing with arrays of integers, as well as broadcasting. Know more Numpy functions to handle various array operations.

Further reading

Acknowledgments and Material