While the Python language is an excellent tool for general-purpose programming, it was not designed specifically for mathematical and scientific computing. Numpy fills this gap: it provides fast, memory-efficient multidimensional arrays and routines that operate on them as a whole.
The critical thing to know is that Python for loops are very slow! One should use array operations instead whenever possible.
First, let's import the numpy module. There are multiple ways we can do this. In the following examples, zeros is a numpy routine which we will see later, and depending on how we import numpy we call it in different ways:
'import numpy' imports the entire numpy module --> 'numpy.zeros()'
'import numpy as np' imports the entire numpy module and renames it --> 'np.zeros()'
'from numpy import *' imports the entire numpy namespace (more or less) --> 'zeros()'
'from numpy import zeros' imports only the zeros() routine --> 'zeros()'
After all that preamble, let's get started:
In [ ]:
import numpy as np
The primary building block of the numpy module is the class "ndarray". An ndarray object represents a multidimensional, homogeneous array of fixed-size items. An associated data-type object describes the format of each element in the array. An ndarray object is (almost) never instantiated directly; instead, one uses a routine that returns an instance of the class. Here is a pictorial representation of an "ndarray":
In [ ]:
lst = [10, 20, 30, 40]
arr = np.array([10, 20, 30, 40])
Elements of a one-dimensional array are accessed with the same syntax as a list:
In [ ]:
lst[0]
In [ ]:
lst[2:]
In [ ]:
arr[0]
In [ ]:
arr[2:]
In [ ]:
The first difference to note between lists and arrays is that arrays are homogeneous; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:
In [ ]:
lst[-1] = 'a string inside a list'
lst
but the same cannot be done with an array, as we get an error message:
In [ ]:
arr[-1] = 'a string inside an array'
The information about the type of an array is contained in its dtype attribute:
In [ ]:
x = np.array([[1, 2], [3, 4]], dtype=np.uint8)
print(x)
In [ ]:
[b for b in bytes(x.data)]
In [ ]:
arr = np.array([10, 20, 123123])
arr.dtype
Once an array has been created, its dtype is fixed and it can only store elements of the same type. For this example where the dtype is integer, if we store a floating point number it will be automatically converted into an integer:
In [ ]:
arr[-1] = 1.234
arr
Why is a homogeneous data type required for arrays? Speed
In [ ]:
x = range(50000)
y = np.arange(50000)
%timeit [e**2 for e in x]
%timeit y**2
Above we created an array from an existing list; now let us see other ways in which we can create arrays. A common need is to have an array initialized with a constant value, and very often this value is 0 or 1 (suitable as starting values for additive and multiplicative loops, respectively). zeros creates arrays of all zeros, with any desired dtype:
In [ ]:
np.zeros((5, 5), dtype=np.float64)
In [ ]:
np.zeros((2, 3), dtype=np.int64)
In [ ]:
np.zeros(3, dtype=complex)
and similarly for ones:
In [ ]:
np.ones(5)
Then there are the linspace
and logspace
functions to create linearly and logarithmically-spaced grids, respectively, with a fixed number of points and including both ends of the specified interval:
In [ ]:
np.linspace(0, 1, 5) # start, stop, num
In [ ]:
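For instance, logspace takes the exponents of the endpoints rather than the values themselves; here is a small sketch (the endpoints and number of points are just example values):
In [ ]:
# 4 logarithmically spaced points between 10**0 and 10**3, i.e. 1, 10, 100, 1000
np.logspace(0, 3, 4)  # start exponent, stop exponent, num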
Finally, it is often useful to create arrays with random numbers that follow a specific distribution. The np.random
module provides several random number generators. For example, here we produce an array of 5 random samples taken from a standard normal distribution (0 mean and variance 1):
In [ ]:
rng = np.random.RandomState(0) # <-- seed value, not needed but useful for reproducibility
In [ ]:
rng.normal(loc=0, scale=1, size=5)
Or the same, but from a uniform distribution:
In [ ]:
# 5 random numbers, picked from a uniform distribution between -10 and 10
uni = rng.uniform(-10, 10, size=5)
print(uni)
Some other examples of quick random number generation using numpy:
In [ ]:
print(np.random.rand(2, 2))
print(np.random.randint(2, 100, 5))
print(np.random.randn(5))  # samples 5 times from the standard normal distribution
Above we saw how to index arrays with single numbers and slices, just like Python lists. But arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particularly useful to extract information from an array that matches a certain condition.
Consider for example that in the array uni we want to reset all values above 0 to 0. We can do so by first finding the mask that indicates where this condition is true or false:
In [ ]:
mask = uni > 0
mask
Now that we have this mask, we can use it to either read those values or to reset them to 0:
In [ ]:
print('Array:', uni)
print('Masked array:', uni[mask])
In [ ]:
uni[mask] = 0
print(uni)
Most of the array creation methods can be used to construct >1D arrays:
In [ ]:
np.zeros((3, 4))
We can also reshape arrays to fit the desired shape:
In [ ]:
np.arange(12)
In [ ]:
arr = np.arange(12).reshape((3, 4))
arr
With two-dimensional arrays we start seeing the power of numpy: while a nested list must be indexed by repeatedly applying the [ ] operator, multidimensional arrays support a more direct indexing syntax, with a single [ ] and a set of indices separated by commas:
In [ ]:
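As a brief illustration (a small sketch using the 3x4 array arr defined above), compare nested-list indexing with numpy's comma-separated indexing:
In [ ]:
nested = arr.tolist()   # the equivalent nested list
print(nested[1][2])     # nested-list indexing: repeated [ ]
print(arr[1, 2])        # numpy indexing: single [ ] with comma-separated indices
print(arr[1, 2:])       # slices work per dimension too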
Slices Refer to the Array Data
In [ ]:
a = np.linspace(0, 29, 30)
a.shape = (5, 6)
print(a)
b = a[1, :]      # extract the 2nd row of a (a view, not a copy)
print(a[1, 1])
b[1] = 2
print(a[1, 1])   # a has changed too: b refers to a's data
In [ ]:
# Take a copy to avoid referencing via slices:
b = a[1, :].copy()
print(a[1, 1])
b[1] = 7777      # b and a are two different arrays now
print(a[1, 1])   # a is unchanged
Now that we have seen how to create arrays with more than one dimension, let's take a look at some other properties.
In [ ]:
print('Data type :', arr.dtype)
print('Total number of elements :', arr.size)
print('Number of dimensions :', arr.ndim)
print('Shape (dimensionality) :', arr.shape)
print('Memory used (in bytes) :', arr.nbytes)
There are also many useful functions in numpy that operate on arrays, e.g.:
In [ ]:
print('Minimum and maximum :', np.min(arr), np.max(arr))
print('Sum and product of all elements :', np.sum(arr), np.prod(arr))
print('Mean and standard deviation :', np.mean(arr), np.std(arr))
For these methods, the above operations are all computed over all the elements of the array. But for a multidimensional array, it's possible to do the computation along a single dimension by passing the axis parameter; for example:
In [ ]:
print('For the following array:\n', arr)
print('The sum of elements along the rows is :', np.sum(arr, axis=1))
print('The sum of elements along the columns is :', np.sum(arr, axis=0))
As you can see in this example, the value of the axis parameter is the dimension which is consumed once the operation has been carried out. This is why summing along each row uses axis=1 (the column dimension is consumed), while summing along each column uses axis=0.
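To make the axis convention concrete, here is a small sketch showing which dimension of arr disappears for each value of axis:
In [ ]:
print(arr.shape)                   # (3, 4)
print(np.sum(arr, axis=0).shape)   # (4,) -> axis 0 consumed, one sum per column
print(np.sum(arr, axis=1).shape)   # (3,) -> axis 1 consumed, one sum per row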
Another widely used property of arrays is the .T
attribute, which allows you to access the transpose of the array (NumPy does this without making a copy of the array, by manipulating its strides):
In [ ]:
print('Array:\n', arr)
print('Transpose:\n', arr.T)
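As a quick check that the transpose is a view rather than a copy, we can inspect the strides and use np.shares_memory; the exact stride values in the comments assume a 64-bit integer dtype:
In [ ]:
print(arr.strides)                    # e.g. (32, 8) for a (3, 4) int64 array
print(arr.T.strides)                  # strides swapped: (8, 32), same underlying data
print(np.shares_memory(arr, arr.T))   # True: the transpose is a view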
We don't have time here to look at all the numpy functions that operate on arrays, but here's a complete list. Simply try exploring some of these in IPython to learn more, or read their descriptions in their docstrings or the Numpy documentation:
np.ALLOW_THREADS np.compress np.irr np.pv
np.BUFSIZE np.concatenate np.is_busday np.r_
np.CLIP np.conj np.isclose np.rad2deg
np.ComplexWarning np.conjugate np.iscomplex np.radians
np.DataSource np.convolve np.iscomplexobj np.random
np.ERR_CALL np.copy np.isfinite np.rank
np.ERR_DEFAULT np.copysign np.isfortran np.rate
np.ERR_IGNORE np.copyto np.isinf np.ravel
np.ERR_LOG np.core np.isnan np.ravel_multi_index
np.ERR_PRINT np.corrcoef np.isneginf np.real
np.ERR_RAISE np.correlate np.isposinf np.real_if_close
np.ERR_WARN np.cos np.isreal np.rec
np.FLOATING_POINT_SUPPORT np.cosh np.isrealobj np.recarray
np.FPE_DIVIDEBYZERO np.count_nonzero np.isscalar np.recfromcsv
np.FPE_INVALID np.cov np.issctype np.recfromtxt
np.FPE_OVERFLOW np.cross np.issubclass_ np.reciprocal
np.FPE_UNDERFLOW np.csingle np.issubdtype np.record
np.False_ np.ctypeslib np.issubsctype np.remainder
np.Inf np.cumprod np.iterable np.repeat
np.Infinity np.cumproduct np.ix_ np.require
np.MAXDIMS np.cumsum np.kaiser np.reshape
np.MachAr np.datetime64 np.kron np.resize
np.ModuleDeprecationWarning np.datetime_as_string np.ldexp np.restoredot
np.NAN np.datetime_data np.left_shift np.result_type
np.NINF np.deg2rad np.less np.right_shift
np.NZERO np.degrees np.less_equal np.rint
np.NaN np.delete np.lexsort np.roll
np.PINF np.deprecate np.lib np.rollaxis
np.PZERO np.deprecate_with_doc np.linalg np.roots
np.PackageLoader np.diag np.linspace np.rot90
np.RAISE np.diag_indices np.little_endian np.round
np.RankWarning np.diag_indices_from np.load np.round_
np.SHIFT_DIVIDEBYZERO np.diagflat np.loads np.row_stack
np.SHIFT_INVALID np.diagonal np.loadtxt np.s_
np.SHIFT_OVERFLOW np.diff np.log np.safe_eval
np.SHIFT_UNDERFLOW np.digitize np.log10 np.save
np.ScalarType np.disp np.log1p np.savetxt
np.Tester np.divide np.log2 np.savez
np.True_ np.division np.logaddexp np.savez_compressed
np.UFUNC_BUFSIZE_DEFAULT np.dot np.logaddexp2 np.sctype2char
np.UFUNC_PYVALS_NAME np.double np.logical_and np.sctypeDict
np.VisibleDeprecationWarning np.dsplit np.logical_not np.sctypeNA
np.WRAP np.dstack np.logical_or np.sctypes
np.abs np.dtype np.logical_xor np.searchsorted
np.absolute np.e np.logspace np.select
np.absolute_import np.ediff1d np.long np.set_numeric_ops
np.add np.einsum np.longcomplex np.set_printoptions
np.add_docstring np.emath np.longdouble np.set_string_function
np.add_newdoc np.empty np.longfloat np.setbufsize
np.add_newdoc_ufunc np.empty_like np.longlong np.setdiff1d
np.add_newdocs np.equal np.lookfor np.seterr
np.alen np.errstate np.ma np.seterrcall
np.all np.euler_gamma np.mafromtxt np.seterrobj
np.allclose np.exp np.mask_indices np.setxor1d
np.alltrue np.exp2 np.mat np.shape
np.alterdot np.expand_dims np.math np.short
np.amax np.expm1 np.matrix np.show_config
np.amin np.extract np.matrixlib np.sign
np.angle np.eye np.max np.signbit
np.any np.fabs np.maximum np.signedinteger
np.append np.fastCopyAndTranspose np.maximum_sctype np.sin
np.apply_along_axis np.fft np.may_share_memory np.sinc
np.apply_over_axes np.fill_diagonal np.mean np.single
np.arange np.find_common_type np.median np.singlecomplex
np.arccos np.finfo np.memmap np.sinh
np.arccosh np.fix np.meshgrid np.size
np.arcsin np.flatiter np.mgrid np.sometrue
np.arcsinh np.flatnonzero np.min np.sort
np.arctan np.flexible np.min_scalar_type np.sort_complex
np.arctan2 np.fliplr np.minimum np.source
np.arctanh np.flipud np.mintypecode np.spacing
np.argmax np.float np.mirr np.split
np.argmin np.float128 np.mod np.sqrt
np.argpartition np.float16 np.modf np.square
np.argsort np.float32 np.msort np.squeeze
np.argwhere np.float64 np.multiply np.stack
np.around np.float_ np.nan np.std
np.array np.floating np.nan_to_num np.str
np.array2string np.floor np.nanargmax np.str0
np.array_equal np.floor_divide np.nanargmin np.str_
np.array_equiv np.fmax np.nanmax np.string_
np.array_repr np.fmin np.nanmean np.subtract
np.array_split np.fmod np.nanmedian np.sum
np.array_str np.format_parser np.nanmin np.swapaxes
np.asanyarray np.frexp np.nanpercentile np.sys
np.asarray np.frombuffer np.nanprod np.take
np.asarray_chkfinite np.fromfile np.nanstd np.tan
np.ascontiguousarray np.fromfunction np.nansum np.tanh
np.asfarray np.fromiter np.nanvar np.tensordot
np.asfortranarray np.frompyfunc np.nbytes np.test
np.asmatrix np.fromregex np.ndarray np.testing
np.asscalar np.fromstring np.ndenumerate np.tile
np.atleast_1d np.full np.ndfromtxt np.timedelta64
np.atleast_2d np.full_like np.ndim np.trace
np.atleast_3d np.fv np.ndindex np.transpose
np.average np.generic np.nditer np.trapz
np.bartlett np.genfromtxt np.negative np.tri
np.base_repr np.get_array_wrap np.nested_iters np.tril
np.bench np.get_include np.newaxis np.tril_indices
np.binary_repr np.get_printoptions np.nextafter np.tril_indices_from
np.bincount np.getbufsize np.nonzero np.trim_zeros
np.bitwise_and np.geterr np.not_equal np.triu
np.bitwise_not np.geterrcall np.nper np.triu_indices
np.bitwise_or np.geterrobj np.npv np.triu_indices_from
np.bitwise_xor np.gradient np.numarray np.true_divide
np.blackman np.greater np.number np.trunc
np.bmat np.greater_equal np.obj2sctype np.typeDict
np.bool np.half np.object np.typeNA
np.bool8 np.hamming np.object0 np.typecodes
np.bool_ np.hanning np.object_ np.typename
np.broadcast np.histogram np.ogrid np.ubyte
np.broadcast_arrays np.histogram2d np.oldnumeric np.ufunc
np.broadcast_to np.histogramdd np.ones np.uint
np.busday_count np.hsplit np.ones_like np.uint0
np.busday_offset np.hstack np.outer np.uint16
np.busdaycalendar np.hypot np.packbits np.uint32
np.byte np.i0 np.pad np.uint64
np.byte_bounds np.identity np.partition np.uint8
np.bytes0 np.iinfo np.percentile np.uintc
np.bytes_ np.imag np.pi np.uintp
np.c_ np.in1d np.piecewise np.ulonglong
np.can_cast np.index_exp np.pkgload np.unicode
np.cast np.indices np.place np.unicode_
np.cbrt np.inexact np.pmt np.union1d
np.cdouble np.inf np.poly np.unique
np.ceil np.info np.poly1d np.unpackbits
np.cfloat np.infty np.polyadd np.unravel_index
np.char np.inner np.polyder np.unsignedinteger
np.character np.insert np.polydiv np.unwrap
np.chararray np.int np.polyfit np.ushort
np.choose np.int0 np.polyint np.vander
np.clip np.int16 np.polymul np.var
np.clongdouble np.int32 np.polynomial np.vdot
np.clongfloat np.int64 np.polysub np.vectorize
np.column_stack np.int8 np.polyval np.version
np.common_type np.int_ np.power np.void
np.compare_chararrays np.int_asbuffer np.ppmt np.void0
np.compat np.intc np.print_function np.vsplit
np.complex np.integer np.prod np.vstack
np.complex128 np.interp np.product np.warnings
np.complex256 np.intersect1d np.promote_types np.where
np.complex64 np.intp np.ptp np.who
np.complex_ np.invert np.put np.zeros
np.complexfloating np.ipmt np.putmask np.zeros_like
In [ ]:
help(np.correlate)
Standard mathematical operations Just Work (TM):
In [ ]:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print(arr1, '+', arr2, '=', arr1 + arr2)
Note that, unlike in MATLAB, operations are performed element-wise:
In [ ]:
print(arr1, '*', arr2, '=', arr1 * arr2)
While this means that, in principle, arrays must always match in their dimensionality for an operation to be valid, numpy will broadcast dimensions when possible. For example, suppose you want to add the number 3 to arr1; this works:
In [ ]:
arr1 + 3
This broadcasting behavior is powerful, especially because when numpy broadcasts to create new dimensions or to 'stretch' existing ones, it doesn't replicate the data. In the example above the operation is carried out as if 3 were a 1-d array with 3 in all of its entries, but no actual array is ever created. This can save memory when the arrays in question are large, with significant performance implications.
The general rule is: when operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions and works its way forward, creating dimensions of length 1 as needed. Two dimensions are considered compatible when they are equal, or when one of them is 1. If these conditions are not met, a ValueError: frames are not aligned exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
Examples below:
(9, 5) (9, 5) (9, 5) (9, 1)
( ) (9, 1) ( 5) ( 5)
------ ------ ------ ------
(9, 5) (9, 5) (9, 5) (9, 5)
Sketch from Nicolas Rougier's NumPy tutorial
In [ ]:
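As a concrete sketch of the last case in the diagram above (the shapes are just example values), a (9, 1) array added to a (5,) array broadcasts to (9, 5); np.broadcast_to shows that the 'stretched' operand needs no extra memory, since its stride along the repeated axis is 0:
In [ ]:
col = np.arange(9).reshape(9, 1)              # shape (9, 1)
row = np.arange(5)                            # shape (5,)
print((col + row).shape)                      # (9, 5)
print(np.broadcast_to(row, (9, 5)).strides)   # (0, 8) for int64: no data replication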
Sometimes, it is necessary to modify arrays before they can be used together. Numpy allows you to add new axes to an array by indexing with np.newaxis:
In [ ]:
c = np.arange(5)
d = np.arange(6)
print(c.shape)
print(d.shape)
c + d
In [ ]:
c = np.arange(5)
d = np.arange(6)
print "Before adding a new axis:"
print c
c = c[:, np.newaxis]
print "After adding a new axis:"
print c
In [ ]:
print(c.shape)
print(' ', d.shape)
print('-------')
print((c + d).shape)
# d d d d d d
#
# c c c c c c c d d d d d d
# c c c c c c c d d d d d d
# c + = c c c c c c + d d d d d d
# c c c c c c c d d d d d d
# c c c c c c c d d d d d d
c + d
In [ ]:
b = 3*a - 1 # a is array, b becomes array
The above operation generates a temporary array (for 3*a) before subtracting 1; the same result can be obtained with in-place operations that avoid the temporary:
In [ ]:
b = a      # note: b refers to the same data as a, so a is modified too
b *= 3     # or np.multiply(b, 3, out=b)
b -= 1     # or np.subtract(b, 1, out=b)
In-place operations:
a *= 3.0   # multiply a's elements by 3
a -= 1.0   # subtract 1 from each element
a /= 3.0   # divide each element by 3
a += 1.0   # add 1 to each element
a **= 2.0  # square all elements
In [ ]:
# let b be an array
c = np.sin(b)
c = np.arcsin(c)
c = np.sinh(b)
# same functions for the cos and tan families
c = b**2.5 # power function
c = np.log(b)
c = np.exp(b)
c = np.sqrt(b)
# a is an array
a.clip(min=3, max=12) # clip elements
a.mean(); np.mean(a) # mean value
a.var(); np.var(a) # variance
a.std(); np.std(a) # standard deviation
np.median(a)
np.trapz(a) # Trapezoidal integration
np.diff(a) # finite differences (da/dx)
Hint: You can use whatever simple method you like: from the left, right or the trapezoid rule. Just recall the approximation for a definite integral from calculus: $$\int^b_a f(x) \ dx \approx \sum^n_{i=1} f(x_i^*) \ \Delta x_i$$
In [ ]:
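For instance, here is a small sketch of such an approximation; the integrand f(x) = x**2 on [0, 1] is just an assumed example, not part of the exercise:
In [ ]:
x = np.linspace(0, 1, 1001)           # grid points (assumed example)
fx = x**2                             # f evaluated on the grid
dx = x[1] - x[0]
left_riemann = np.sum(fx[:-1]) * dx   # left-endpoint Riemann sum
trapezoid = np.trapz(fx, x)           # trapezoid rule for comparison
print(left_riemann, trapezoid)        # both close to the exact value 1/3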
Universal functions run much faster than for loops; explicit loops should be avoided whenever possible.
In [ ]:
def mult1(a, b):
    return a * b

def mult2(a, b):
    c = np.empty(a.shape)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            c[i, j] = a[i, j] * b[i, j]
    return c
In [ ]:
a = np.random.random((400,400))
b = np.random.random((400,400))
In [ ]:
%timeit mult1(a,b)
In [ ]:
%timeit mult2(a,b)
Numpy ships with a basic linear algebra library, and all arrays have a dot
method whose behavior is that of the scalar dot product when its arguments are vectors (one-dimensional arrays) and the traditional matrix multiplication when one or both of its arguments are two-dimensional arrays:
In [ ]:
v1 = np.array([2, 3, 4])
v2 = np.array([1, 0, 1])
print(v1, '·', v2, '=', v1.dot(v2))
In [ ]:
A = np.arange(6).reshape(2, 3)
print('A:\n', A)
print('v1:\n', v1)
In [ ]:
A.dot(v1)
For matrix-matrix multiplication, the same dimension-matching rules must be satisfied, e.g. consider the difference between $A \times A^T$:
In [ ]:
print(A.dot(A.T))
and $A^T \times A$:
In [ ]:
print(A.T.dot(A))
Furthermore, the numpy.linalg
module includes additional functionality such as determinants, matrix norms, Cholesky, eigenvalue and singular value decompositions, etc. For even more linear algebra tools, scipy.linalg
contains the majority of the tools in the classic LAPACK libraries as well as functions to operate on sparse matrices. We refer the reader to the Numpy and Scipy documentations for additional details on these.
In [ ]:
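For example, here is a small sketch using a few of the numpy.linalg routines mentioned above (the matrix M is just an example value):
In [ ]:
M = np.array([[2., 1.], [1., 3.]])
print(np.linalg.det(M))                          # determinant
print(np.linalg.norm(M))                         # Frobenius norm
vals, vecs = np.linalg.eig(M)                    # eigenvalues and eigenvectors
print(vals)
print(np.linalg.solve(M, np.array([1., 0.])))    # solve M x = b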
Numpy lets you read and write arrays into files in a number of ways. In order to use these tools well, it is critical to understand the difference between a text and a binary file containing numerical data. In a text file, the number $\pi$ could be written as "3.141592653589793", for example: a string of digits that a human can read, with in this case 15 decimal digits. In contrast, that same number written to a binary file would be encoded as 8 characters (bytes) that are not readable by a human but which contain the exact same data that the variable pi
had in the computer's memory.
The tradeoffs between the two modes are thus:
Text mode: occupies more space, precision can be lost (if not all digits are written to disk), but is readable and editable by hand with a text editor. Can only be used for one- and two-dimensional arrays.
Binary mode: compact and exact representation of the data in memory, can't be read or edited by hand. Arrays of any size and dimensionality can be saved and read without loss of information.
First, let's see how to read and write arrays in text mode. The np.savetxt
function saves an array to a text file, with options to control the precision, separators and even adding a header:
In [ ]:
arr = np.arange(10).reshape(2, 5)
print(arr)
np.savetxt('test.out', arr)
In [ ]:
!cat test.out
And this same type of file can then be read with the matching np.loadtxt
function:
In [ ]:
arr2 = np.loadtxt('test.out')
print(arr2)
For binary data, Numpy provides the np.save
and np.savez
routines. The first saves a single array to a file with .npy
extension, while the latter can be used to save a group of arrays into a single file with .npz
extension. The files created with these routines can then be read with the np.load
function.
Let us first see how to use the simpler np.save
function to save a single array:
In [ ]:
np.save('test.npy', arr)
# Now we read this back
arr_loaded = np.load('test.npy')
print(arr)
print(arr_loaded)
print(arr_loaded.dtype)
# Let's see if any element is non-zero in the difference.
# A value of True would be a problem.
print('Any differences?', np.any(arr - arr_loaded))
Now let us see how the np.savez_compressed
function works.
In [ ]:
np.savez_compressed('test.npz', first=arr, second=arr2)
arrays = np.load('test.npz')
arrays.files
The object returned by np.load
from an .npz
file works like a dictionary:
In [ ]:
arrays['first']
This .npz
format is a very convenient way to package compactly and without loss of information, into a single file, a group of related arrays that pertain to a specific problem. At some point, however, the complexity of your dataset may be such that the optimal approach is to use one of the standard formats in scientific data processing that have been designed to handle complex datasets, such as NetCDF or HDF5.
Fortunately, there are tools for manipulating these formats in Python, and for storing data in other ways such as databases. A complete discussion of the possibilities is beyond the scope of this tutorial, but of particular interest for scientific users we at least mention the following:
The scipy.io
module contains routines to read and write Matlab files in .mat
format and files in the NetCDF format that is widely used in certain scientific disciplines.
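For instance, a minimal sketch of the round trip through a MATLAB-style .mat file with scipy.io (assuming SciPy is installed; the file name test.mat is arbitrary):
In [ ]:
from scipy import io as sio

sio.savemat('test.mat', {'arr': arr})   # write arr into a MATLAB-style file
contents = sio.loadmat('test.mat')      # returns a dict of variables
print(contents['arr'])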
For manipulating files in the HDF5 format, there are two excellent options in Python: the PyTables project offers a high-level, object oriented approach to manipulating HDF5 datasets, while the h5py project offers a more direct mapping to the standard HDF5 library interface. Both are excellent tools; if you need to work with HDF5 datasets you should read some of their documentation and examples and decide which approach is a better match for your needs.
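For instance, here is a minimal h5py sketch (assuming the h5py package is installed; the file name test.h5 and the dataset name my_array are arbitrary):
In [ ]:
import h5py

with h5py.File('test.h5', 'w') as f:         # create/overwrite an HDF5 file
    f.create_dataset('my_array', data=arr)   # store the numpy array

with h5py.File('test.h5', 'r') as f:         # read it back
    print(f['my_array'][:])                  # slicing returns a numpy array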
In [ ]: