NumPy Introduction

NumPy (Numerical Python) is a library for scientific computing. The library contains many mathematical functions for performing operations of linear algebra, Fourier transforms, pseudo-random numbers generators, etc. NumPy can use BLAS and LAPACK (if you have them installed on your system). BLAS and LAPACK are libraries written in Fortran that provides linear algebra routines. NumPy was born with the idea of bringing the functionality of environments like Matlab and Mathematica to Python.

Overall, the code written in NumPy can be shorter and (arguably) clearer than pure Python. The use of loops is reduced because many operations are applied directly on arrays. In addition, there are many mathematical/scientific functions already defined, so it is posible to write less code and more important, those functions are implemted using very efficient and reliable algorithms.


In [1]:
import numpy as np# the standard way to import numpy

Arrays

NumPy uses a data structure called array. NumPy arrays are similar to Python lists but are more efficient for numerical tasks. This efficiency come from the following facts:

  • Python lists are very general and can contain any type of object. Besides, the types of objects are dynamically allocated. This makes lists inefficient for mathematical operations.
  • NumPy arrays are static and homogeneous. The type of elements is determined when the array is created (or can the user can provide it).
  • NumPy arrays make efficient use of memory.

Furthermore arrays are similar to mathematical vectors and matrices.

Creating arrays

There are several routines to create NumPy arrays:

  • From Python list and tuples
  • Filled with a numerical range
  • Filled with with random numbers
  • Filled with zeros and ones
  • From files

From lists and tuples

To create arrays from Python list (or tuples) we ca use the array function.


In [2]:
v = np.array([1, 2, 3, 4 , 5, 6])
v


Out[2]:
array([1, 2, 3, 4, 5, 6])

In [3]:
M = np.array([[1, 2], [3, 4], [5, 6]])
M


Out[3]:
array([[1, 2],
       [3, 4],
       [5, 6]])

Objects v and M are both of the same type; ndarray as you can easily check.


In [4]:
type(v), type(M)


Out[4]:
(numpy.ndarray, numpy.ndarray)

The difference between array v and M us that they have different shapes. We can get the shape of an array using the method shape.


In [6]:
v.shape  # equivalent to np.shape(v)


Out[6]:
(6,)

In [ ]:
np.shape

In [6]:
M.shape  # equivalent to np.shape(M)


Out[6]:
(3, 2)

size tell us the number of elements of an array.


In [7]:
v.size, M.size  # or np.size(M)


Out[7]:
(6, 6)

In the same way the number of dimensions of and array are obtained using ndim


In [7]:
v.ndim, M.ndim


Out[7]:
(1, 2)

Using the method dtype, we can ask for the type of the elements stored in an array.


In [8]:
M.dtype


Out[8]:
dtype('int64')

Using Python list we can mix different types of data in a same list. Instead, with arrays we get an error.


In [9]:
v[0] = "hello world"


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-2ffc3c5d7654> in <module>()
----> 1 v[0] = "hello world"

ValueError: invalid literal for long() with base 10: 'hello world'

Or we could get something we didn't expect


In [12]:
v[0] = 3.14
v


Out[12]:
array([3, 2, 3, 4, 5, 6])

It is possible to explicitelly define the type of the data when creating an array.


In [13]:
np.array([[1, 2], [3, 4]], dtype=complex)


Out[13]:
array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]])

Some common types are: int, float, complex, bool

It is also possible to define the number of bits for each type: int64, int16, float64, complex64.

Filled with a numerical range

It is possible to create arrays from scratch. For example we can directly create an array containing evenly spaced values within the half-open interval [from, to), using arange


In [16]:
np.arange(0, 10, 1) # from, to, step (from and step are optional)


Out[16]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Unlike the python range function arange support floats


In [15]:
np.arange(-1, 1, 0.25)


Out[15]:
array([-1.  , -0.75, -0.5 , -0.25,  0.  ,  0.25,  0.5 ,  0.75])

linspace returns evenly spaced numbers, calculated over the interval [from, to]. Works similar to arange but instead of specifying the step we specify the total number of elements we want.


In [16]:
np.linspace(1, 10, 25) # from, to, elements (elements is optional)


Out[16]:
array([  1.   ,   1.375,   1.75 ,   2.125,   2.5  ,   2.875,   3.25 ,
         3.625,   4.   ,   4.375,   4.75 ,   5.125,   5.5  ,   5.875,
         6.25 ,   6.625,   7.   ,   7.375,   7.75 ,   8.125,   8.5  ,
         8.875,   9.25 ,   9.625,  10.   ])

logspace return numbers spaced evenly on a logaritmic scale.


In [17]:
np.logspace(0, 1.5, 5, base=np.e)


Out[17]:
array([ 1.        ,  1.45499141,  2.11700002,  3.08021685,  4.48168907])

Filled with random numbers

Random numbers are widely used in scientific applications. In practice computers generate pseudo random numbers, i.e. numbers that, for most intents and purposes, looks like random numbers.

All the routines to generate random number live inside the random module. Python uses the Mersenne Twister pseudo-random generator.

The most basic function is rand. This function creates an array from a uniform distribution over the range [0, 1).


In [19]:
np.random.rand(2, 5)  # array with shape (2, 5)


Out[19]:
array([[ 0.04403624,  0.89896711,  0.77322558,  0.86002876,  0.03670666],
       [ 0.06825515,  0.31205032,  0.19644764,  0.54890012,  0.39263595]])

randn returns samples from the standard normal distribution.


In [23]:
np.random.randn(3)


Out[23]:
array([ 0.92148512, -0.55496176, -1.7497044 ])

While in a lot of applications we need random numbers, randomness could be problematic, for example during debugging. We can solve this by seeding the pseudo-random generator. In this way each time we ask for a random number we will get the same result.


In [20]:
np.random.seed(31415)
np.random.randn(3)


Out[20]:
array([ 1.36242188,  1.13410818,  2.36307449])

Filled with zeros and ones

Sometimes it is handy to create arrays filled with ones or zeros (or other numbers).


In [24]:
np.zeros((2,5))


Out[24]:
array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [25]:
np.ones((2,5))


Out[25]:
array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

In [26]:
np.empty((2,5))


Out[26]:
array([[  6.93627369e-310,   6.93627369e-310,   6.93627369e-310,
          6.93627369e-310,   1.49502604e-316],
       [  6.93626759e-310,   6.93626759e-310,   6.93626759e-310,
          6.93626759e-310,   1.63041663e-322]])

In [29]:
np.full((2,5), 42)


Out[29]:
array([[ 42.,  42.,  42.,  42.,  42.],
       [ 42.,  42.,  42.,  42.,  42.]])

From files

A common format for data files is the comma separated values, or related formats, such as TSV (tab separated values). To read data from such files into an array using NumPy we can use the genfromtxt function. For example,


In [30]:
!head sample.dat  # Shows the first lines of a file


   X       Y       Z     occ  b-f
45.885  29.085  -0.349  1.00 27.21
45.402  28.249  -1.361  1.00 25.24
44.792  26.870  -1.153  1.00 23.54
45.647  25.988  -1.415  1.00 20.94
46.415  28.104  -2.539  1.00 26.89
47.432  29.136  -2.674  1.00 27.10
43.440  26.778  -0.893  1.00 20.65
43.049  25.297  -0.723  1.00 18.63
42.868  24.450  -1.968  1.00 16.36

In [31]:
data = np.genfromtxt('sample.dat', skip_header=True)
data


Out[31]:
array([[ 45.885,  29.085,  -0.349,   1.   ,  27.21 ],
       [ 45.402,  28.249,  -1.361,   1.   ,  25.24 ],
       [ 44.792,  26.87 ,  -1.153,   1.   ,  23.54 ],
       [ 45.647,  25.988,  -1.415,   1.   ,  20.94 ],
       [ 46.415,  28.104,  -2.539,   1.   ,  26.89 ],
       [ 47.432,  29.136,  -2.674,   1.   ,  27.1  ],
       [ 43.44 ,  26.778,  -0.893,   1.   ,  20.65 ],
       [ 43.049,  25.297,  -0.723,   1.   ,  18.63 ],
       [ 42.868,  24.45 ,  -1.968,   1.   ,  16.36 ],
       [ 41.96 ,  24.535,  -2.773,   1.   ,  15.9  ],
       [ 41.931,  25.472,   0.263,   1.   ,  17.29 ],
       [ 41.735,  24.893,   1.591,   1.   ,  17.13 ],
       [ 40.229,  24.874,   1.889,   1.   ,  15.26 ]])

In [32]:
data.shape


Out[32]:
(13, 5)

Using numpy.savetxt we can save a NumPy array to a csv file.


In [33]:
M = np.random.rand(3,3)
M


Out[33]:
array([[ 0.21227458,  0.82558364,  0.9472673 ],
       [ 0.22541016,  0.90859751,  0.19236669],
       [ 0.20390934,  0.32720035,  0.66789639]])

In [36]:
np.savetxt("random_matrix.csv", M, fmt='%.2f')

In [37]:
!head random_matrix.csv


0.21 0.83 0.95
0.23 0.91 0.19
0.20 0.33 0.67

Alternatively, is also possible to save a NumPy array using the npy format. It is useful for storing arrays of data that will then be read with NumPy.


In [38]:
np.save("random_matrix.npy", M)
!ls *.npy


random_matrix.npy

To load the saved file we use the load function, as you may have guessed.


In [39]:
np.load("random_matrix.npy")


Out[39]:
array([[ 0.21227458,  0.82558364,  0.9472673 ],
       [ 0.22541016,  0.90859751,  0.19236669],
       [ 0.20390934,  0.32720035,  0.66789639]])

Indexing

Indexing of NumPy arrays works in a similar fashion to indexing Python lists.


In [40]:
v = np.array([1, 2, 3, 4, 5, 6])
M = np.array([[1, 2], [3, 4], [5, 6]])
print v
print M


[1 2 3 4 5 6]
[[1 2]
 [3 4]
 [5 6]]

v is a 1D array (like a vector) and hence each element is indexed using just one number.


In [41]:
v[0]


Out[41]:
1

M is a 2D array (like a matrix) and hence each element is indexed using two numbers.


In [43]:
M[1,1]  # this is equivalent to M[1][1]


Out[43]:
4

If we omit an index when indexing a multidimensional array, NumPy delivers the entire row (or, in general, the array of dimension N-1)


In [44]:
M[1]


Out[44]:
array([3, 4])

The same results is obtained using :


In [45]:
M[1,:]  # row 1


Out[45]:
array([3, 4])

In [46]:
M[:,1]  # column 1


Out[46]:
array([2, 4, 6])

We can assign element to an array specifing the desired index


In [47]:
M[0,0] = -1
M


Out[47]:
array([[-1,  2],
       [ 3,  4],
       [ 5,  6]])

The same idea works for entire rows and columns


In [48]:
M[1,:] = 0
M


Out[48]:
array([[-1,  2],
       [ 0,  0],
       [ 5,  6]])

In [49]:
M[:,1] = -2
M


Out[49]:
array([[-1, -2],
       [ 0, -2],
       [ 5, -2]])

Slicing

Slicing is also possible when working with arrays.


In [50]:
print v
v[1:3]


[1 2 3 4 5 6]
Out[50]:
array([2, 3])

Slices can also be used in assigning new values.


In [51]:
v[1:3] = -2,-3
v


Out[51]:
array([ 1, -2, -3,  4,  5,  6])

Is possible to slice an array using steps


In [52]:
v[::2]  # from and to takes defaults values (the entire array)


Out[52]:
array([ 1, -3,  5])

or take the first N elements


In [53]:
v[:3]


Out[53]:
array([ 1, -2, -3])

or start from the $i^{th}$ element


In [54]:
v[3:]


Out[54]:
array([4, 5, 6])

or just take the last element


In [55]:
v[-1]


Out[55]:
6

or the last N elements


In [56]:
v[-3:]  # the las three elements


Out[56]:
array([4, 5, 6])

As we have already seen arrays are not restricted to 1 or 2 dimensions. Arrays can have many dimensions. As size increases the complexity of the slices also increases.


In [57]:
c = np.array([[[ 0,  1,  2,  3], [ 4,  5,  6,  7], [ 8,  9, 10, 11]],
       [[12, 13, 14, 15], [16, 17, 18, 19], [20, 21, 22, 23]]])
c


Out[57]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

In [58]:
np.shape(c)


Out[58]:
(2, 3, 4)

In [59]:
c[0]


Out[59]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [60]:
c[0, 2]


Out[60]:
array([ 8,  9, 10, 11])

Let extract the zeroth elements from the second dimension (axis=1)


In [53]:
c[:,0]  # equivalent to c[:,0,:]


Out[53]:
array([[ 0,  1,  2,  3],
       [12, 13, 14, 15]])

Let extract the first elements from the third dimension (axis=2)


In [54]:
c[:,:,1]


Out[54]:
array([[ 1,  5,  9],
       [13, 17, 21]])

In [55]:
c[0,:,1]


Out[55]:
array([1, 5, 9])

Any subset of element can be obtained from a NumPy array using, for example.

  • Take the zeroth element of the first dimension
  • Then, the first element of the second dimension
  • Finally all the elements in the third dimension by taking steps of 2 elements

In [56]:
c[0,1,::2]


Out[56]:
array([4, 6])

Reverse the order


In [57]:
c[::-1]


Out[57]:
array([[[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]],

       [[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]]])

Another example of a complex slicing.

  • Take the zeroth element from the first dimension
  • Then the element from the second dimension in reverse order
  • Finally the last element from the third dimension

In [58]:
c[0,::-1, -1]


Out[58]:
array([11,  7,  3])

Array manipulation

Sometimes it is convenient to change the shape of the arrays or combine two or more arrays together, for that there are several functions.


In [61]:
c = np.arange(24).reshape(2,3,4)
c


Out[61]:
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])

May be we need to delete part of our array


In [60]:
np.delete(c, (0,2), axis=2)


Out[60]:
array([[[ 1,  3],
        [ 5,  7],
        [ 9, 11]],

       [[13, 15],
        [17, 19],
        [21, 23]]])

If we need to reduce our n-dimensional array to a 1D array we could use flatten or ravel.


In [61]:
c.flatten()  # flatten is method not a function


Out[61]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

In [62]:
c.ravel() # equivalent to np.ravel(c)


Out[62]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

Stacking

Two arrays can be combined by stacking one onto the other.


In [62]:
d = np.arange(16).reshape(4, 4)
e = d * 2
print d
print e


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[ 0  2  4  6]
 [ 8 10 12 14]
 [16 18 20 22]
 [24 26 28 30]]

In [63]:
np.concatenate((d, e), axis=0)  # equivalent to np.vstack((d, e))


Out[63]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [ 0,  2,  4,  6],
       [ 8, 10, 12, 14],
       [16, 18, 20, 22],
       [24, 26, 28, 30]])

In [65]:
np.concatenate((d, e), axis=1)  # equivalent to np.hstack((d, e))


Out[65]:
array([[ 0,  1,  2,  3,  0,  2,  4,  6],
       [ 4,  5,  6,  7,  8, 10, 12, 14],
       [ 8,  9, 10, 11, 16, 18, 20, 22],
       [12, 13, 14, 15, 24, 26, 28, 30]])

Splitting

The inverse operation of stacking is splitting


In [64]:
np.split(d, 2, axis=0)


Out[64]:
[array([[0, 1, 2, 3],
        [4, 5, 6, 7]]), array([[ 8,  9, 10, 11],
        [12, 13, 14, 15]])]

In [65]:
np.split(d, 2, axis=1)


Out[65]:
[array([[ 0,  1],
        [ 4,  5],
        [ 8,  9],
        [12, 13]]), array([[ 2,  3],
        [ 6,  7],
        [10, 11],
        [14, 15]])]

Mathematical functions

NumPy provides several mathematical functions. This could seems redundand since the Python's Standard Library already provides this type of functions. But the interesting thing is that NumPy's matemathical functions (as well as other functions) work on the whole array in an element-wise fashion.

For example if we would like to compute the square root of all element in a Python list we should build a for-loop and compute the square root for each element, one at a time. With NumPy we can just doing in one line.

First lets define a new array


In [66]:
f = np.arange(4).reshape(2, 2)
f


Out[66]:
array([[0, 1],
       [2, 3]])

and now compute the square root


In [67]:
np.sqrt(f)


Out[67]:
array([[ 0.        ,  1.        ],
       [ 1.41421356,  1.73205081]])

Functions like sqrt that operates on arrays in an element-by-element fashion are call universal functions (or ufunc for short).

One of the advantages of using ufuncs is that allows for shorter code. Another advantage is that computations are faster than when using a Python loop. Under the hood NumPy is still performing a loop, but the loop is done in a compiled language (C or Fortran) and even further the compiled code has been generally optimized to take advantage of vectorization routines.

Another example of a ufunc is sum


In [68]:
np.sum(f)


Out[68]:
6

Sometimes we don't want to sum over the flattened array but over one of it's axis.


In [69]:
np.sum(f, axis=0)


Out[69]:
array([2, 4])

In [70]:
np.sum(f, axis=1)


Out[70]:
array([1, 5])

The speed-gain of using ufuncs depends on several factors, typical gains could be around 1 to 3 order of magnitudes (unless you apply ufuncs to scalar instead of arrays!).


In [71]:
from math import sqrt
g_arr = np.arange(0, 1000)
g_list = range(0, 1000)

In [72]:
%%timeit
for i in g_list:
    sqrt(i)


10000 loops, best of 3: 79.8 µs per loop

In [73]:
%%timeit
np.sqrt(g_arr)


100000 loops, best of 3: 5.78 µs per loop

In [74]:
%%timeit
sqrt(1)


10000000 loops, best of 3: 79 ns per loop

In [75]:
%%timeit
np.sqrt(1)


100000 loops, best of 3: 2.68 µs per loop

Fancy indexing

In addition to indexing by integers and slices, as we saw before, arrays can be indexed by arrays of integers and arrays of booleans.


In [76]:
h = np.arange(10)**2 
h


Out[76]:
array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [77]:
idx = np.array([0,2,3,3])
h[idx]


Out[77]:
array([0, 4, 9, 9])

In [78]:
I = np.arange(9).reshape(3, 3)
I


Out[78]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [79]:
idx_row = np.array([0,2])
idx_col = np.array([1])

I[idx_row, idx_col]


Out[79]:
array([1, 7])

We can also use and array of boolean for indexing, for example if we want to get all the values that are even, we can do:


In [80]:
I[I % 2 == 0]


Out[80]:
array([0, 2, 4, 6, 8])

And as you can check, the expression inside de brackets is a array of booleans


In [83]:
I % 2 == 0


Out[83]:
array([[ True, False,  True],
       [False,  True, False],
       [ True, False,  True]], dtype=bool)

Constants

Some constants like $\pi$ and $e$ can be accessed from NumPy. If you need other constants you could probably find them at scipy.constants


In [84]:
np.pi


Out[84]:
3.141592653589793

In [85]:
np.e


Out[85]:
2.718281828459045

Beyond NumPy

NumPy is a great library for numerical computing and we have just barely scratch the surface of what is possible to do with it. For scientific computing sometimes you will need access to numerical routines for interpolation, integration, optimization, statistics, audio and image processing, etc. Chances are big that this routines are already implemented in SciPy, a scientific computing library built on top of NumPy. As with NumPy in SciPy fast and reliable routines are implemented and is always a good idea to search if the routine you need is available in SciPy (do not waste time reinventing the wheel, unless you need a very special wheel!). SciPy is also the name of a group of related conferences for users and developers of scientific computing in python. In the following chapters we will use a very few of the whole stack of SciPy functions.

If you problem is less of the numerical-crunching-physics-style of computing and more related to statistics and data analysis probably you will be better using Pandas:. Pandas is another library build on top of NumPy that provides easy-to-use data structures and data analysis tools. If the main part of your work involves manipulating and reshaping data, dealing with missing values, and then performing statistical analysis and data visualization Pandas is the library for you.