In [1]:
%matplotlib inline
%pylab inline


Populating the interactive namespace from numpy and matplotlib

NumPy - multidimensional data arrays

NumPy in a nutshell is a multidimensional array library.

Why does one need numpy?

  • No useful builtin homogenous data container built into Python, lists can contain arbitrary type objects which needs a lot of memory and CPU power to process and store.
  • Only very limited number functions available to work with containers containing homogenous data.

In [2]:
l = list(range(10000))
a = np.arange(10000)

In [3]:
l[:10], a[:10]


Out[3]:
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))

In [4]:
%timeit [x + 1 for x in l]


1000 loops, best of 3: 748 µs per loop

In [5]:
%timeit a + 1


10000 loops, best of 3: 15.4 µs per loop

Creating numpy arrays

There are a number of ways to initialize new numpy arrays, for example from

  • a Python list or tuples
  • using functions that are dedicated to generating numpy arrays, such as arange, linspace, etc.
  • reading data from files
  • wrapping existing memory blocks (buffer interface, memoryview)

From lists

For example, to create new vector and matrix arrays from Python lists we can use the numpy.array function.


In [6]:
# one dimensional array, the argument to the array function is a Python list
# will cast to 'best' type as required: int -> double -> complex double
v = np.array([1,2,3,4])

v


Out[6]:
array([1, 2, 3, 4])

In [7]:
# a 2d image: the argument to the array function is a nested Python list
img = np.array([[1, 2], [3, 4]])

img


Out[7]:
array([[1, 2],
       [3, 4]])

An array has two basic properties:

  • dtype: object that fully describes the type of data, for example 32 bit unsigned integer or 128 bit complex floating point number

In [8]:
v.dtype


Out[8]:
dtype('int64')
  • shape: a tuple containing the size of each dimension, an array can have up to 32 dimensions and each dimension can be as large as the address space allows.

In [9]:
v.shape


Out[9]:
(4,)

In [10]:
img.shape


Out[10]:
(2, 2)

Many functions take arguments named dtype or shape (sometimes size) that define how the output looks like:


In [11]:
img = np.array([[1, 2], [3, 4]], dtype=np.complex64)

img


Out[11]:
array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]], dtype=complex64)

Using array-generating functions

For larger arrays it is inpractical to initialize the data manually, using explicit pythons lists. Instead we can use one of the many functions in numpy that generates arrays of different forms. Some of the more common are:

emtpy, zeros and ones


In [12]:
np.empty((3,3))


Out[12]:
array([[  0.00000000e+000,   0.00000000e+000,   3.16601930e-316],
       [  3.21323221e-316,   3.17501841e-316,   3.21372865e-316],
       [  1.66064396e-316,   0.00000000e+000,   6.90268638e-310]])

In [13]:
np.zeros((3,3))


Out[13]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [14]:
np.ones((3,3), dtype=np.int16)


Out[14]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [15]:
# get same type of array
np.empty_like(v)
np.ones_like(v)
np.zeros_like(v)
np.full_like(v, 3) #numpy 1.8


Out[15]:
array([3, 3, 3, 3])

arange


In [16]:
# create a range

x = np.arange(0, 10, 1) # arguments: start, stop, step, [dtype]

x


Out[16]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [17]:
x = np.arange(-1, 1, 0.1)

x


Out[17]:
array([ -1.00000000e+00,  -9.00000000e-01,  -8.00000000e-01,
        -7.00000000e-01,  -6.00000000e-01,  -5.00000000e-01,
        -4.00000000e-01,  -3.00000000e-01,  -2.00000000e-01,
        -1.00000000e-01,  -2.22044605e-16,   1.00000000e-01,
         2.00000000e-01,   3.00000000e-01,   4.00000000e-01,
         5.00000000e-01,   6.00000000e-01,   7.00000000e-01,
         8.00000000e-01,   9.00000000e-01])

linspace and logspace


In [18]:
# using linspace, end points are included by default and you define the number of points instead of the step
np.linspace(0, 10, num=30, endpoint=True)


Out[18]:
array([  0.        ,   0.34482759,   0.68965517,   1.03448276,
         1.37931034,   1.72413793,   2.06896552,   2.4137931 ,
         2.75862069,   3.10344828,   3.44827586,   3.79310345,
         4.13793103,   4.48275862,   4.82758621,   5.17241379,
         5.51724138,   5.86206897,   6.20689655,   6.55172414,
         6.89655172,   7.24137931,   7.5862069 ,   7.93103448,
         8.27586207,   8.62068966,   8.96551724,   9.31034483,
         9.65517241,  10.        ])

In [19]:
np.logspace(0, 10, num=10, base=np.e)


Out[19]:
array([  1.00000000e+00,   3.03773178e+00,   9.22781435e+00,
         2.80316249e+01,   8.51525577e+01,   2.58670631e+02,
         7.85771994e+02,   2.38696456e+03,   7.25095809e+03,
         2.20264658e+04])

random data


In [20]:
from numpy import random

In [21]:
# standard normal distributed random numbers
random.normal(loc=-5., scale=5, size=(5,5))


Out[21]:
array([[-10.07496   ,  -8.90712815,  -2.59793592,  -8.89634474,
        -10.60483511],
       [ -8.93133842,  -0.4726435 , -10.20600326, -10.44268975,  -0.9836462 ],
       [-11.01432594,  -9.38263595,  -0.3652716 ,  -8.30859508,
          1.38892859],
       [  0.99216126, -12.00298969,  -9.85226298,  -2.32475633,
         -8.16799242],
       [ -7.30244278,  -7.24388802,  -6.26780013,  -7.97688188,
         -4.67301287]])

File I/O

Comma-separated values (CSV)

A very common file format for data files are the comma-separated values (CSV), or related format such as TSV (tab-separated values). To read data from such file into Numpy arrays we can use the numpy.genfromtxt or numpy.loadtxt function. For example:


In [22]:
!head stockholm_td_adj.dat


#year month day min   temperature   max dontknow
1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1
1800  1  3   -15.0   -15.0   -15.0 1
1800  1  4   -19.3   -19.3   -19.3 1
1800  1  5   -16.8   -16.8   -16.8 1
1800  1  6   -11.4   -11.4   -11.4 1
1800  1  7    -7.6    -7.6    -7.6 1
1800  1  8    -7.1    -7.1    -7.1 1
1800  1  9   -10.1   -10.1   -10.1 1

In [23]:
data = np.genfromtxt('stockholm_td_adj.dat')
data.shape, data.dtype


Out[23]:
((77431, 7), dtype('float64'))

In [24]:
!wc -l stockholm_td_adj.dat


77432 stockholm_td_adj.dat

In [25]:
data = np.genfromtxt('stockholm_td_adj.dat', names=True)
# structured array
data.shape, data.dtype


Out[25]:
((77431,),
 dtype([('year', '<f8'), ('month', '<f8'), ('day', '<f8'), ('min', '<f8'), ('temperature', '<f8'), ('max', '<f8'), ('dontknow', '<f8')]))

In [26]:
data['year'][:10]


Out[26]:
array([ 1800.,  1800.,  1800.,  1800.,  1800.,  1800.,  1800.,  1800.,
        1800.,  1800.])

In [27]:
fig, ax = subplots(figsize=(14,4))

x = data['year'] + data['month'] / 12.0 + data['day'] / 365.
y = data['temperature']

ax.plot(x, y)
ax.axis('tight')
ax.set_title('temperatures in Stockholm')
ax.set_xlabel('year')
ax.set_ylabel('tempature (C)');


Manipulating arrays

Changing the shape

The shape (dimensionality) of an array can be changed as long as the new shape has the same number of elements


In [28]:
vec = np.arange(9)
# create a view of vec with a different shape
imgview = vec.reshape(3, 3)

imgview


Out[28]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [29]:
# the ordering of the elements can be defined (Fortran or C)
vec.reshape(3, 3, order='F')


Out[29]:
array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])

In [30]:
# the value of a negative entry will be infered from the remaining free entries
vec.reshape(3, -1).shape


Out[30]:
(3, 3)

Reshaping will create a view/reference of the original data array. Changing the reshaped array will also change the original array.


In [31]:
imgview[0] = 5
vec


Out[31]:
array([5, 5, 5, 3, 4, 5, 6, 7, 8])

Indexing

We can index elements in an array using the square bracket and indices:


In [32]:
# regular python indexing [start:stop:step]
vec[3], vec[3::-1]


Out[32]:
(3, array([3, 5, 5, 5]))

The dimensions are separated by commas in the brackets:


In [33]:
imgview[1,0]


Out[33]:
3

In [34]:
#row
imgview[0], imgview[0,:]


Out[34]:
(array([5, 5, 5]), array([5, 5, 5]))

In [35]:
#column
imgview[:, 0]
# Ellipsis means all free axis, this selects the zero element of the last dimension, regardless of the dimension of imgview
imgview[..., 0]


Out[35]:
array([5, 3, 6])

In [36]:
# assignment on a indexed array
imgview[-1, :] = 1
imgview


Out[36]:
array([[5, 5, 5],
       [3, 4, 5],
       [1, 1, 1]])

Compared to Python lists, a slice of an array is no copy, it is a view on the original data copies must be explicit, e.g. with

imgview[0, :].copy()

Basic scalar math on arrays

The scalar math operation on arrays work element-wise:


In [37]:
a = np.ones((5,))
b = np.zeros_like(a)
a, b


Out[37]:
(array([ 1.,  1.,  1.,  1.,  1.]), array([ 0.,  0.,  0.,  0.,  0.]))

In [38]:
b = a + a
b


Out[38]:
array([ 2.,  2.,  2.,  2.,  2.])

In [39]:
# in place
a /= b
a


Out[39]:
array([ 0.5,  0.5,  0.5,  0.5,  0.5])

In [40]:
a + 1


Out[40]:
array([ 1.5,  1.5,  1.5,  1.5,  1.5])

In [41]:
a + np.ones((4,))


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-2d6215575c4f> in <module>()
----> 1 a + np.ones((4,))

ValueError: operands could not be broadcast together with shapes (5) (4) 

Broadcasting

Broadcasting applies if the shapes of a set of arrays do not match but the shapes can be made to match by repeating certain dimensions.

The basic rules when broadcasting can be applied are:

  • the size of an dimension is 1
  • the trailing shape of the arrays match

Broadcasting documentation

The simplest case is adding a zero dimensional array (scalar) to a higher dimensional array:


In [43]:
from IPython.display import Image
Image(filename='images/broadcast_11.png')


Out[43]:

In [44]:
# subtract vector from an image (e.g. remove overscan offset):
img = np.arange(4*3).reshape(4,3)
o = np.array([1, 2, 3])
#dim(4, 3) - dim(3)
img - o


Out[44]:
array([[-1, -1, -1],
       [ 2,  2,  2],
       [ 5,  5,  5],
       [ 8,  8,  8]])

In [45]:
img = img.reshape(3,4)
o = np.array([1, 2, 3])
#dim(3, 4) - dim(3), trailing dimension does not match
img - o


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-5fbf216187b7> in <module>()
      2 o = np.array([1, 2, 3])
      3 #dim(3, 4) - dim(3), trailing dimension does not match
----> 4 img - o

ValueError: operands could not be broadcast together with shapes (3,4) (3) 

In [46]:
# solution: add a new empty dimension (size 1)
# dim(3, 4) - dim(3, 1)
img - o[:,np.newaxis]


Out[46]:
array([[-1,  0,  1,  2],
       [ 2,  3,  4,  5],
       [ 5,  6,  7,  8]])

Fancy indexing

Arrays or lists of integers can also be used as indices to other arrays:


In [47]:
v = np.arange(5, 10)
row_indices = [0, 2, 4]
print(v)
print(row_indices, '->', v[row_indices])


[5 6 7 8 9]
([0, 2, 4], '->', array([5, 7, 9]))

In [48]:
img = np.arange(25).reshape(5, 5)
col_indices = [1, 2, -1] # index -1 means the last element
img[row_indices, col_indices]


Out[48]:
array([ 1, 12, 24])

In [49]:
img[:, col_indices]


Out[49]:
array([[ 1,  2,  4],
       [ 6,  7,  9],
       [11, 12, 14],
       [16, 17, 19],
       [21, 22, 24]])

Functions related with positions of data like sorting, partitioning, min/max often have a arg variant which instead of returning the result, return an index array that when applied to the original array return the result:


In [50]:
v = random.randint(0, 5, size=10)
sortindices = v.argsort()
sortindices, v[sortindices]


Out[50]:
(array([3, 4, 0, 5, 6, 8, 9, 1, 2, 7]), array([0, 0, 1, 1, 1, 1, 3, 4, 4, 4]))

We can also index with boolean masks: If the index array is an Numpy array of with data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position each element:


In [51]:
x = arange(0, 10, 0.5)
x


Out[51]:
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [52]:
# & is a bitwise and, which, for booleans, is equivalent to np.logical_and(a, b)
mask = (x >= 5) & (x < 7.5)

mask


Out[52]:
array([False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True,  True,  True, False, False, False,
       False, False], dtype=bool)

In [53]:
x[mask]


Out[53]:
array([ 5. ,  5.5,  6. ,  6.5,  7. ])

NumPy also contains a special array class which carries a mask and ignores the masked values for all arithmetic operations:


In [54]:
mx = np.ma.MaskedArray(x, mask=~((5 < x) & (x < 7.5)))
mx


Out[54]:
masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- 5.5 6.0 6.5 7.0 -- -- -- -- --],
             mask = [ True  True  True  True  True  True  True  True  True  True  True False
 False False False  True  True  True  True  True],
       fill_value = 1e+20)

In [55]:
x.sum(), mx.sum()


Out[55]:
(95.0, 25.0)

where

The index mask can be converted to position index using the where function


In [56]:
indices = np.where((x > 5) & (x < 7.5))

indices


Out[56]:
(array([11, 12, 13, 14]),)

It can also be used to conditionally select from two options


In [57]:
a = np.ones_like(x)
np.where((x > 5) & (x < 7.5), a, 5) # note the broadcasting of the second option


Out[57]:
array([ 5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  1.,  1.,
        1.,  1.,  5.,  5.,  5.,  5.,  5.])

In [57]: