Numpy

(see http://www.numpy.org/)

Numpy allows for dealing with multi-dimensional numerical data in an extremely efficient way.

Motivation

Assume you want to apply a function to each value in a long list (10 million elements):


In [1]:
import time
import math

a = range(10000000)

def func(a):
    return 1e-6*(4*a**2.) #7+1./(a+3)**3.4-(23*a)**4.2)-20

# measure how long it takes in seconds
start_time = time.time()

new_a = []
for val in a:
    new_a.append(func(val))
    
print time.time()-start_time, "seconds"


2.1906940937 seconds

This works, but requires a lot of code. A shorter version would be this here:


In [2]:
start_time = time.time()
new_a = map(func, a)
print time.time()-start_time, "seconds"


1.7215218544 seconds

... but this takes about the same time.

Using numpy, the code is even shorter, easier to read and a lot faster:


In [3]:
import numpy

start_time = time.time()
a = numpy.array(a)
new_a = func(a)
print time.time()-start_time, "seconds"


0.818804979324 seconds

What is different? a is converted from a list into a numpy array and then func is applied to array a as a whole and not on each individual element. This approach is faster because 'under the hood' numpy uses C routines for its arrays that are a lot faster dealing than native Python routines.

Arrays

(see http://docs.scipy.org/doc/numpy/user/basics.creation.html)

Numpy arrays look very similar to list but are a lot more powerful.


In [4]:
l = range(10)
print l
print type(l) # this is a list

a = numpy.array(l)  # arrays are created from lists...
print a
print type(a) # this is an array

print numpy.arange(0, 10, 1)  # ... or using numpy functions (this is the equivalent of range())
print numpy.linspace(0, 10, 10) # arange requires the step size, linspace the number of steps
print numpy.ones(10)
print numpy.zeros(10)
b = a.reshape((2,5))  # arrays can be reshaped into multi-dimensional arrays
print b

print a.shape, b.shape


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
<type 'list'>
[0 1 2 3 4 5 6 7 8 9]
<type 'numpy.ndarray'>
[0 1 2 3 4 5 6 7 8 9]
[  0.           1.11111111   2.22222222   3.33333333   4.44444444
   5.55555556   6.66666667   7.77777778   8.88888889  10.        ]
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
[[0 1 2 3 4]
 [5 6 7 8 9]]
(10,) (2, 5)

One advantage of arrays is that functions can be applied to entire arrays:


In [5]:
print a
print numpy.sqrt(a)
print a + 5
print a / 2.
print func(a)


[0 1 2 3 4 5 6 7 8 9]
[ 0.          1.          1.41421356  1.73205081  2.          2.23606798
  2.44948974  2.64575131  2.82842712  3.        ]
[ 5  6  7  8  9 10 11 12 13 14]
[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5]
[  0.00000000e+00   4.00000000e-06   1.60000000e-05   3.60000000e-05
   6.40000000e-05   1.00000000e-04   1.44000000e-04   1.96000000e-04
   2.56000000e-04   3.24000000e-04]

An Example: how would you implement the vector module from assignment 3 using arrays?


In [6]:
a = numpy.array([0,1,0])
b = numpy.array([1,0,1])

print 'add', a+b       # arrays are already defined to do element-wise operations
print 'subtract', a-b
print 'multiply', a*b
print 'absolute', numpy.sqrt(numpy.sum(a*a))
print 'dotproduct', numpy.sum(a*b)
print 'angle', numpy.rad2deg(numpy.arccos(numpy.sum(a*b)/numpy.sqrt(numpy.sum(a*a))/numpy.sqrt(numpy.sum(b*b))))


add [1 1 1]
subtract [-1  1 -1]
multiply [0 0 0]
absolute 1.0
dotproduct 0
angle 90.0

Indexing

(see http://docs.scipy.org/doc/numpy/user/basics.indexing.html)

Arrays can be indexed the same way as lists:


In [7]:
a = numpy.arange(10)
print a[3]
print a[1:5]
print a[7:2:-1]


3
[1 2 3 4]
[7 6 5 4 3]

Arrays can also be indexed using mask indexing:


In [8]:
print a[a < 5]
print a[a % 2 == 0]
print a[((a > 1) & (a < 7))]*2


[0 1 2 3 4]
[0 2 4 6 8]
[ 4  6  8 10 12]

Mask indexing is easy to read and extremely fast even on large arrays.

Structured Arrays

(see http://docs.scipy.org/doc/numpy/user/basics.rec.html)

Assume you have a table of positional data (x, y, z coordinates) that you write into a list of triples (each row of the table corresponds to one element in your list, the different columns for each row are the elements of the tuples) and you want to use these data as flexibly as possible. You can turn that list into a structured array:


In [9]:
l = [(1, 0, 0), (2, 3, 1), (4, 2, 0), (0, 1, 0), (2, 6, 2)]
s = numpy.array(l, dtype=[('x', int), ('y', int), ('z', int)])

There are two major differences between simple arrays and structured arrays:

  • the list from which the structured array is built must consist of tuples (that makes structured arrays immutable)
  • when defining a structured array, you have to define its datatype (dtype); for each "column" of your table provide a "column name" (string) and a datatype (see below)

Defining the individual datatypes means some extra work, but it is worth the effort:


In [10]:
print s[0]   # provides you with the first row of your table; just like an ordinary array
print s['x'] # provides you with all x elements
print s
print s[((s['y'] < 3) & (s['x'] > 1))] # of course you can use masked indexing, too


(1, 0, 0)
[1 2 4 0 2]
[(1, 0, 0) (2, 3, 1) (4, 2, 0) (0, 1, 0) (2, 6, 2)]
[(4, 2, 0)]

Numpy Routines and Other Useful Things

Numpy provides a wide range of routines that are based on arrays and simplify life a lot.

File I/O

(see http://docs.scipy.org/doc/numpy/reference/routines.io.html)

Read a file into an array:


In [11]:
# data = numpy.genfromtxt('data.txt')  # minimal example; creates a simple array
# data = numpy.genfromtxt('data.txt', usecols=(0, 1, 3))  # allows to specify which columns to read into array
# data = numpy.genfromtxt('data.txt', dtype=[float, float, float], names=['x', 'y', 'z']) # creates a structured array

genfromtxt is a lot faster in reading in files than looping through the individual lines of the file. There is one disadvantage: the data file must have no gaps - missing data have to be indicated (e.g., using NaN).

Arrays can be efficiently written into files using:


In [12]:
# numpy.savetxt('data.txt', data, header=('  x      y      z'), fmt='%4.1f   %4.1f   %4.1f')

Datatypes

(see http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)

Numpy provides a number of more sophisticated data types than basic Python:


In [13]:
print numpy.int8(100)  # Unsigned integer (0 to 255)
print numpy.int8(300)  # watch out! 
b = numpy.float32(1.3) # create a 32 bit float
print b.dtype          # find out data type
print b.astype(numpy.float16).dtype  # convert to 16 bit float
print (b+numpy.float16(0.7)).dtype # use the least complex datatype


100
44
float32
float16
float32

Math Constants and Functions

(see http://docs.scipy.org/doc/numpy/reference/routines.math.html)

Mathematical constants and functions provided by numpy are basically the same as those in math, but only the numpy ones work on arrays!


In [14]:
print numpy.pi     
print numpy.e      # Euler e
print numpy.infty  # infinity
print numpy.nan    # 'not a number'

print numpy.isnan(numpy.nan)     # check for nan
print numpy.isinf(numpy.infty)   # check for infinity

print numpy.sinh(3./2.*numpy.pi) # hyperbolic sine function
print numpy.rad2deg(numpy.pi)    # convert radians to degrees
print numpy.round(1.23456789, 2) # round a float to a certain number of digits after the floating point
print numpy.gradient(numpy.array([1,2,3,5,3,2,1])) # central diff. gradient


3.14159265359
2.71828182846
inf
nan
True
True
55.6543975994
180.0
1.23
[ 1.   1.   1.5  0.  -1.5 -1.  -1. ]

Sorting, Searching, and Counting

(see http://docs.scipy.org/doc/numpy/reference/routines.sort.html)


In [15]:
a = numpy.array([4, 3, 1, 7, 2, 9, 5, 8, 6])

print numpy.sort(a)       # sort a 
print numpy.where(a < 5)  # indices where a < 5
print numpy.argmin(a)     # index of the smallest element in a
print numpy.argmax(a)     # index of the largest element in a
print numpy.argsort(a)    # order of indices that would provide a sorted version of a


[1 2 3 4 5 6 7 8 9]
(array([0, 1, 2, 4]),)
2
5
[2 4 1 0 6 8 3 7 5]

In [16]:
import numpy.random as rnd

print rnd.random()                        # single random number: 0 < random float < 1
print rnd.random(5)*10                    # 5 random numers between 0 and 10
print rnd.random((2,3))                   # 2x3 matrix with random numbers
g = rnd.normal(loc=0, scale=1, size=100)  # gaussian random numbers
print rnd.poisson(2, 5)                   # draw 5 numbers from a poisson distribution with an average frequency of 2
print rnd.lognormal(5, 2, 10)             # draw 10 numbers from a lognormal distribution with parameters 5 and 2
# ... and many more other distributions to choose from

print a, rnd.permutation(a)               # create a random permutation of a list/array


0.241814732837
[ 2.40840546  5.49625288  3.63275304  5.04612508  2.59975186]
[[ 0.14783537  0.16630727  0.74859043]
 [ 0.91880053  0.37305646  0.67450147]]
[1 3 1 1 0]
[   54.80011666  3134.92570927     3.33162721   244.1723481    248.71475639
    34.13682601    91.25377301  1248.26044838  1751.39320053   208.17632167]
[4 3 1 7 2 9 5 8 6] [8 4 1 3 6 7 5 2 9]

In [17]:
g = rnd.normal(loc=0, scale=1, size=1000000)

print numpy.mean(g), numpy.median(g), numpy.std(g)
print numpy.average(g, weights=numpy.ones(1e6)) # weighted average

# specifying axis of operation gives different results:
a = [[1,1,1], [2,2,2], [3,3,3]]
print numpy.mean(a)         # mean of all numbers in the array
print numpy.mean(a, axis=0) # mean along axis 0 (first axis = outermost axis = along columns)
print numpy.mean(a, axis=1) # mean along axis 1 (second axis = along rows)

# operations that ignore nans
b = [1, 2, 3, numpy.nan, 4, 5]
print numpy.mean(b)    # returns nan
print numpy.nanmean(b) # ignores nan; see nanmedian, nanstd, ...

# determine percentiles
print numpy.percentile(g, 50) # the same as median
print numpy.percentile(g, 68.27)-numpy.percentile(g, 31.73)

# create a histogram
hist, bins = numpy.histogram(g, bins=numpy.arange(-5, 6, 1))
print bins
print hist


0.00109325386178 0.00131821793409 1.00155990033
0.00109325386178
2.0
[ 2.  2.  2.]
[ 1.  2.  3.]
nan
3.0
0.00131821793409
0.952642913973
[-5 -4 -3 -2 -1  0  1  2  3  4  5]
[    29   1291  21581 135630 340916 341208 136498  21467   1349     31]

In [18]:
import numpy.linalg as la

a = [1,2,3] 
b = [4,5,6]

print numpy.dot(a,b) # dot product
print numpy.inner(a,b) 
print numpy.outer(a,b) 

i = numpy.diag([1,2,3])
print la.eig(i) # return eigenvalues and eigenvectors
print la.det(i) # return determinant

# solve linear equations
a = [[3,2,-1], [2,-2,4], [-1,0.5,-1]]
b = [1,-2,0]
print la.solve(a,b)


32
32
[[ 4  5  6]
 [ 8 10 12]
 [12 15 18]]
(array([ 1.,  2.,  3.]), array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]]))
6.0
[ 1. -2. -2.]