Numerical Data in Python: the NumPy Array Object

This notebook is based on existing and more elaborate NumPy lecture material:

For Matlab users, following reference material might com in handy:

Official documentation:

Import Conventions

Name spaces: keep them separate, it will make your live easier on the long term.


In [1]:
import numpy as np

Note that some like to input everything from numpy as

from numpy import *

in order never to have to type np., but that might create problems (aka name space clashes) down the road. For instance, consider the following case:


In [2]:
from numpy import *
print type(cos(10))
cos(array([1,2,3]))


<type 'numpy.float64'>
Out[2]:
array([ 0.54030231, -0.41614684, -0.9899925 ])

In [3]:
from math import *
print type(cos(10))


<type 'float'>

In [4]:
cos(array([1,2,3]))


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-d051637b9365> in <module>()
----> 1 cos(array([1,2,3]))

TypeError: only length-1 arrays can be converted to Python scalars

The function numpy.cos (which accepts numbers and arrays) clashed with math.cos (which only accepts numbers), and the last one that has been imported is the one that prevails.

Arrays

Python objects:

  • high-level number objects: integers, floating point
  • containers: lists (costless insertion and append), dictionaries (fast lookup)

NumPy provides:

  • extension package to Python for multi-dimensional arrays
  • closer to hardware (efficiency)
  • designed for scientific computation (convenience)
  • Also known as array oriented computing

Fooling around with arrays


In [6]:
np.array([[11,12,13], [21,22,23], [31,32,33]])


Out[6]:
array([[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]])

Can't we just use Python lists instead?


In [5]:
L = range(5) # a Python list
a = np.arange(5) # a NumPy array
b = 2

Lists behave according to very simple yet clear rules:


In [ ]:
L1 = L*2
L2 = L+L
L1 == L2

In [ ]:
L1

We could apply element wise operations on list elements. But they will always be slower compared to NumPy:


In [ ]:
L = range(5000)
a = np.arange(5000)
%timeit [i**2 for i in L]
%timeit a**2
325/5

Convinience functions for creating special arrays:


In [6]:
np.ones((5,4))


Out[6]:
array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

In [7]:
np.diagflat(np.array([1,2,3,4]))


Out[7]:
array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

Selecting last element, slicing, views, copies, in place operations, linspace


In [11]:
a = np.arange(11)
a
a[-1]
a[2:9]


Out[11]:
array([2, 3, 4, 5, 6, 7, 8])

Printing arrays

When you print an array, NumPy displays it in a similar way to nested lists, but with the following layout:

  • the last axis is printed from left to right,
  • the second-to-last is printed from top to bottom,
  • the rest are also printed from top to bottom, with each slice separated from the next by an empty line.

Creating and printing arrays might work a little different than expected:


In [12]:
a1 = np.array([[111,121],[211,221]])
print a1
a2 = np.array([[112, 122],[212,222]])
print a2


[[111 121]
 [211 221]]
[[112 122]
 [212 222]]

What if we would create a 3D array, where a2 would be stacked along a third dimension:


In [15]:
a3 = np.dstack((a1, a2))
a3


Out[15]:
array([[[111, 112],
        [121, 122]],

       [[211, 212],
        [221, 222]]])

In [17]:
a4 = np.array([ [[111,112],[121,122]], [[211,212],[221,222]] ])
a5 = np.array([ [np.array([111,112]),[121,122]], [[211,212],[221,222]] ])

In [18]:
print a3 == a4
print a3[1,0,0]


[[[ True  True]
  [ True  True]]

 [[ True  True]
  [ True  True]]]
211

In [7]:
plot(range10)


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-7dc7c602c80d> in <module>()
----> 1 plot(range10)

NameError: name 'plot' is not defined

Indexing Multi-dimensional Arrays

Create a 3D array with random data


In [19]:
d1, d2, d3 = 1000, 2000, 50
i0, i1, istep = 10, 700, 10
j0, j1, jstep = 8, 1000, 7
aa = np.random.rand(d1,d2,d3)
i = np.arange(i0,i1,istep)
j = np.arange(j0,j1,jstep)
k = [1,2, 10]
aa.shape


Out[19]:
(1000, 2000, 50)

Slicing that array using the intervals defined above in i,j,k


In [20]:
aaijk1 = aa[i,j,k]


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-abfc4537a455> in <module>()
----> 1 aaijk1 = aa[i,j,k]

ValueError: shape mismatch: objects cannot be broadcast to a single shape

So Matlab style slicing doesn't work. In NumPy there are 3 possible methods


In [21]:
# method 1: fancy indexing for each dimension separately
aai = aa[i,:,:]
aaij = aai[:,j,:]
aaijk = aaij[:,:,k]
# which is similar to
aaijk1 = aa[i,:,:][:,j,:][:,:,k]

# method 2: slicing on dimensions 1,2, and fancy indexing on the last dimension
# slicing: inclusive lower boundary, exclusive upper boundary
aaijk2 = aa[i0:i1:istep, j0:j1:jstep, k]

# method 3: fancy indexing
aaijk3 = aa[np.ix_(i,j,k)]

Verify the result, comparing two arrays with floats


In [22]:
print(np.allclose(aaijk1, aaijk))
print(np.allclose(aaijk2, aaijk))
print(np.allclose(aaijk3, aaijk))


True
True
True

Simple benchmarking with IPython magic %timeit, which method is faster?


In [23]:
# which method is faster?
print("method 1: fancy indexing for each dimension seperately")
%timeit aa[i,:,:][:,j,:][:,:,k]
print("method 2: slicing")
%timeit aa[i0:i1:istep, j0:j1:jstep, k]
print("method 3: fancy indexing")
%timeit aa[np.ix_(i,j,k)]


method 1: fancy indexing for each dimension seperately
10 loops, best of 3: 74.3 ms per loop
method 2: slicing
1000 loops, best of 3: 312 µs per loop
method 3: fancy indexing
1000 loops, best of 3: 1.58 ms per loop

Slicing is the clear winner, because we create views and not copies of the arrays.

I/O with NumPy arrays


In [24]:
a = np.random.rand(10, 5)
b = np.ones( (10,2) )
c = np.zeros( (5,5) )

np.savetxt('savetxt', a)
np.savez('savez', a, b, c) # NumPy binary
np.save('save', a) # NumPy binary

And to load the saved arrays: use loadtxt, loads, load


In [ ]:
np.loadtxt('savetxt')

For more complex csv or text files, use numpy.genfromtxt(). For instance, we may be dealing with a fixed-width file, where columns are defined as a given number of characters. In that case, we need to set delimiter to a single integer (if all the columns have the same size) or to a sequence of integers (if columns can have different sizes).


In [26]:
fname = 'fixedwidth'
data1 = np.genfromtxt(fname, dtype=None, delimiter=',', comments='#', skiprows=0, 
                      skip_header=0, skip_footer=0)
print data1.shape
data1[0]


(20,)
Out[26]:
'0.0  1.00000e-02          0.0          0.0  7.50000e-07  7.50000e-07          0.0          0.0  1.00000e+12  2.00000e+12  6.75000e-08  6.75000e-08  1.35000e-07   100.000000   100.000000  9.00000e-04          0.0          0.0          0.0'

In [27]:
data2 = np.genfromtxt(fname, dtype=np.float32, delimiter=13, comments='#', 
                      skiprows=0, skip_header=0, skip_footer=0)
print data2.shape
data2[:,0]


(20, 19)
Out[27]:
array([ 0.        ,  0.1619    ,  0.162     ,  0.194647  ,  0.227294  ,
        0.25994101,  0.292588  ,  0.32523501,  0.35788199,  0.39052901,
        0.42317599,  0.45582399,  0.488471  ,  0.52111799,  0.553765  ,
        0.58641201,  0.61905903,  0.65170598,  0.68435299,  0.71700001], dtype=float32)

In [ ]:
data = np.fromstring(string, dtype=float, count=-1, sep='')
data = np.fromfile(file, dtype=float, count=-1, sep='') # also for binary data
data.tofile(fname)

More file IO with scipy.io

See also official SciPy tutorial on IO

Reading and writing Matlab .mat files


In [28]:
import scipy.io as sio
constitutive = sio.loadmat('constitutive.mat')
utils = sio.loadmat('utils.mat')

In [29]:
utils.keys()


Out[29]:
['utils', '__version__', '__header__', '__globals__']

In [30]:
utils['utils'].dtype


Out[30]:
dtype([('foldername', 'O'), ('nl_2d', 'O'), ('el_2d', 'O'), ('emat', 'O'), ('matprops', 'O'), ('ne_2d', 'O'), ('nn_2d', 'O'), ('etype_string', 'O'), ('etype', 'O'), ('nnpe_2d', 'O'), ('mdim_2d', 'O'), ('gpoints', 'O'), ('ndiv', 'O'), ('mapnn', 'O'), ('mapel_2d', 'O'), ('mapel4mat', 'O'), ('nmat', 'O'), ('pr_2d', 'O'), ('ElCent', 'O'), ('ElArea', 'O'), ('GQ', 'O'), ('iQ', 'O'), ('jQ', 'O'), ('vQ', 'O'), ('iQm', 'O'), ('jQm', 'O'), ('vQm', 'O'), ('density', 'O')])

In [31]:
utils['utils']['foldername'][0][0]


Out[31]:
array([u'BECAS_examples/WTAirfoil'], 
      dtype='<U24')

In [32]:
utils['utils']['matprops'][0][0]


Out[32]:
array([[  4.00000000e+10,   1.00000000e+10,   1.00000000e+10,
          4.00000000e+09,   4.00000000e+09,   3.57100000e+09,
          2.80000000e-01,   2.80000000e-01,   4.00000000e-01,
          1.90000000e+03],
       [  1.20000000e+10,   1.00000000e+10,   1.20000000e+10,
          1.00000000e+10,   3.80000000e+09,   3.80000000e+09,
          5.00000000e-01,   2.80000000e-01,   2.80000000e-01,
          1.89000000e+03],
       [  2.00000000e+10,   1.00000000e+10,   1.00000000e+10,
          7.50000000e+09,   4.00000000e+09,   4.00000000e+09,
          5.00000000e-01,   2.80000000e-01,   2.80000000e-01,
          1.86000000e+03],
       [  5.00000000e+07,   5.00000000e+07,   5.00000000e+07,
          1.78570000e+07,   1.78570000e+07,   1.78570000e+07,
          4.00000000e-01,   4.00000000e-01,   4.00000000e-01,
          8.00000000e+01]])

Or if you think that's too much of hasle, you could try to use the Octave to Python bridge


In [ ]:
from oct2py import octave
# short hand notation
oc = octave.run
oc("load('utils.mat')")
oc("load('constitutive.mat')")
utils        = octave.get('utils')
constitutive = octave.get('constitutive')

In [ ]:
utils['matprops']

In [ ]: