(see http://www.numpy.org/)
Numpy allows for dealing with multi-dimensional numerical data in an extremely efficient way.
Assume you want to apply a function to each value in a long list (10 million elements):
In [1]:
import time
import math
a = range(10000000)
def func(a):
return 1e-6*(4*a**2.) #7+1./(a+3)**3.4-(23*a)**4.2)-20
# measure how long it takes in seconds
start_time = time.time()
new_a = []
for val in a:
new_a.append(func(val))
print time.time()-start_time, "seconds"
This works, but requires a lot of code. A shorter version would be this here:
In [2]:
start_time = time.time()
new_a = map(func, a)
print time.time()-start_time, "seconds"
... but this takes about the same time.
Using numpy, the code is even shorter, easier to read and a lot faster:
In [3]:
import numpy
start_time = time.time()
a = numpy.array(a)
new_a = func(a)
print time.time()-start_time, "seconds"
What is different? a
is converted from a list into a numpy array and then func
is applied to array a
as a whole and not on each individual element. This approach is faster because 'under the hood' numpy uses C routines for its arrays that are a lot faster dealing than native Python routines.
(see http://docs.scipy.org/doc/numpy/user/basics.creation.html)
Numpy arrays look very similar to list but are a lot more powerful.
In [4]:
l = range(10)
print l
print type(l) # this is a list
a = numpy.array(l) # arrays are created from lists...
print a
print type(a) # this is an array
print numpy.arange(0, 10, 1) # ... or using numpy functions (this is the equivalent of range())
print numpy.linspace(0, 10, 10) # arange requires the step size, linspace the number of steps
print numpy.ones(10)
print numpy.zeros(10)
b = a.reshape((2,5)) # arrays can be reshaped into multi-dimensional arrays
print b
print a.shape, b.shape
One advantage of arrays is that functions can be applied to entire arrays:
In [5]:
print a
print numpy.sqrt(a)
print a + 5
print a / 2.
print func(a)
An Example: how would you implement the vector module from assignment 3 using arrays?
In [6]:
a = numpy.array([0,1,0])
b = numpy.array([1,0,1])
print 'add', a+b # arrays are already defined to do element-wise operations
print 'subtract', a-b
print 'multiply', a*b
print 'absolute', numpy.sqrt(numpy.sum(a*a))
print 'dotproduct', numpy.sum(a*b)
print 'angle', numpy.rad2deg(numpy.arccos(numpy.sum(a*b)/numpy.sqrt(numpy.sum(a*a))/numpy.sqrt(numpy.sum(b*b))))
(see http://docs.scipy.org/doc/numpy/user/basics.indexing.html)
Arrays can be indexed the same way as lists:
In [7]:
a = numpy.arange(10)
print a[3]
print a[1:5]
print a[7:2:-1]
Arrays can also be indexed using mask indexing:
In [8]:
print a[a < 5]
print a[a % 2 == 0]
print a[((a > 1) & (a < 7))]*2
Mask indexing is easy to read and extremely fast even on large arrays.
(see http://docs.scipy.org/doc/numpy/user/basics.rec.html)
Assume you have a table of positional data (x, y, z coordinates) that you write into a list of triples (each row of the table corresponds to one element in your list, the different columns for each row are the elements of the tuples) and you want to use these data as flexibly as possible. You can turn that list into a structured array:
In [9]:
l = [(1, 0, 0), (2, 3, 1), (4, 2, 0), (0, 1, 0), (2, 6, 2)]
s = numpy.array(l, dtype=[('x', int), ('y', int), ('z', int)])
There are two major differences between simple arrays and structured arrays:
dtype
); for each "column" of your table provide a "column name" (string) and a datatype (see below)Defining the individual datatypes means some extra work, but it is worth the effort:
In [10]:
print s[0] # provides you with the first row of your table; just like an ordinary array
print s['x'] # provides you with all x elements
print s
print s[((s['y'] < 3) & (s['x'] > 1))] # of course you can use masked indexing, too
Numpy provides a wide range of routines that are based on arrays and simplify life a lot.
(see http://docs.scipy.org/doc/numpy/reference/routines.io.html)
Read a file into an array:
In [11]:
# data = numpy.genfromtxt('data.txt') # minimal example; creates a simple array
# data = numpy.genfromtxt('data.txt', usecols=(0, 1, 3)) # allows to specify which columns to read into array
# data = numpy.genfromtxt('data.txt', dtype=[float, float, float], names=['x', 'y', 'z']) # creates a structured array
genfromtxt
is a lot faster in reading in files than looping through the individual lines of the file. There is one disadvantage: the data file must have no gaps - missing data have to be indicated (e.g., using NaN
).
Arrays can be efficiently written into files using:
In [12]:
# numpy.savetxt('data.txt', data, header=(' x y z'), fmt='%4.1f %4.1f %4.1f')
(see http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html)
Numpy provides a number of more sophisticated data types than basic Python:
In [13]:
print numpy.int8(100) # Unsigned integer (0 to 255)
print numpy.int8(300) # watch out!
b = numpy.float32(1.3) # create a 32 bit float
print b.dtype # find out data type
print b.astype(numpy.float16).dtype # convert to 16 bit float
print (b+numpy.float16(0.7)).dtype # use the least complex datatype
(see http://docs.scipy.org/doc/numpy/reference/routines.math.html)
Mathematical constants and functions provided by numpy are basically the same as those in math, but only the numpy ones work on arrays!
In [14]:
print numpy.pi
print numpy.e # Euler e
print numpy.infty # infinity
print numpy.nan # 'not a number'
print numpy.isnan(numpy.nan) # check for nan
print numpy.isinf(numpy.infty) # check for infinity
print numpy.sinh(3./2.*numpy.pi) # hyperbolic sine function
print numpy.rad2deg(numpy.pi) # convert radians to degrees
print numpy.round(1.23456789, 2) # round a float to a certain number of digits after the floating point
print numpy.gradient(numpy.array([1,2,3,5,3,2,1])) # central diff. gradient
(see http://docs.scipy.org/doc/numpy/reference/routines.sort.html)
In [15]:
a = numpy.array([4, 3, 1, 7, 2, 9, 5, 8, 6])
print numpy.sort(a) # sort a
print numpy.where(a < 5) # indices where a < 5
print numpy.argmin(a) # index of the smallest element in a
print numpy.argmax(a) # index of the largest element in a
print numpy.argsort(a) # order of indices that would provide a sorted version of a
(see http://docs.scipy.org/doc/numpy/reference/routines.random.html)
In [16]:
import numpy.random as rnd
print rnd.random() # single random number: 0 < random float < 1
print rnd.random(5)*10 # 5 random numers between 0 and 10
print rnd.random((2,3)) # 2x3 matrix with random numbers
g = rnd.normal(loc=0, scale=1, size=100) # gaussian random numbers
print rnd.poisson(2, 5) # draw 5 numbers from a poisson distribution with an average frequency of 2
print rnd.lognormal(5, 2, 10) # draw 10 numbers from a lognormal distribution with parameters 5 and 2
# ... and many more other distributions to choose from
print a, rnd.permutation(a) # create a random permutation of a list/array
In [17]:
g = rnd.normal(loc=0, scale=1, size=1000000)
print numpy.mean(g), numpy.median(g), numpy.std(g)
print numpy.average(g, weights=numpy.ones(1e6)) # weighted average
# specifying axis of operation gives different results:
a = [[1,1,1], [2,2,2], [3,3,3]]
print numpy.mean(a) # mean of all numbers in the array
print numpy.mean(a, axis=0) # mean along axis 0 (first axis = outermost axis = along columns)
print numpy.mean(a, axis=1) # mean along axis 1 (second axis = along rows)
# operations that ignore nans
b = [1, 2, 3, numpy.nan, 4, 5]
print numpy.mean(b) # returns nan
print numpy.nanmean(b) # ignores nan; see nanmedian, nanstd, ...
# determine percentiles
print numpy.percentile(g, 50) # the same as median
print numpy.percentile(g, 68.27)-numpy.percentile(g, 31.73)
# create a histogram
hist, bins = numpy.histogram(g, bins=numpy.arange(-5, 6, 1))
print bins
print hist
(see http://docs.scipy.org/doc/numpy/reference/routines.linalg.html)
In [18]:
import numpy.linalg as la
a = [1,2,3]
b = [4,5,6]
print numpy.dot(a,b) # dot product
print numpy.inner(a,b)
print numpy.outer(a,b)
i = numpy.diag([1,2,3])
print la.eig(i) # return eigenvalues and eigenvectors
print la.det(i) # return determinant
# solve linear equations
a = [[3,2,-1], [2,-2,4], [-1,0.5,-1]]
b = [1,-2,0]
print la.solve(a,b)