In [6]:
import numpy as np
The core of the numpy
package is the array
class. Let's examine that first. We can make an array out of a sequence, like a list.
In [7]:
d = [1, 2, 3, 4, 5]
np.array(d)
Out[7]:
In [8]:
d = [1, 2, 3.1415, 4, 5]
np.array(d)
Out[8]:
You can query the datatype by examining the dtype
attribute of the array.
In [9]:
d = [1, 2, 3.1415, 4, 5]
arr = np.array(d)
arr.dtype
Out[9]:
Array types may be defined explicity in the call
In [10]:
arr = np.array([1, 2, 3, 4, 5], dtype='float32')
arr
Out[10]:
Complex numbers are noted with a lowercase j
or uppercase J
, like this
In [11]:
cmplx = np.array([1.0+2.0j, 3.0+4.0J])
print(cmplx)
cmplx.dtype
Out[11]:
As we have seen before, arrays are like multidimensional sequences. We can create a 2D array by supplying a list of lists as the argument.
In [12]:
arr = np.array([[1., 2., 3.,], [4., 5., 6.]])
arr
Out[12]:
In [13]:
arr.size # The number of elements in the array
Out[13]:
In [14]:
arr.shape # The shape of the array (i.e., the size of each dimension)
Out[14]:
In [15]:
arr.ndim # The number of dimensions of the array
Out[15]:
In [16]:
arr.shape = (3, 2)
arr
Out[16]:
In [17]:
arr.shape = (6,)
arr
Out[17]:
Singleton dimensions add to the dimensionality of an array. The last example was a 1D array (also called a vector), the next are 2D arrays.
In [18]:
arr.shape = (1, 6)
arr # Note that there are *two* square brackets in the output sequence. This is a row vector.
Out[18]:
In [19]:
arr.shape = (6, 1)
arr # this is also a 2D array, like a column vector
Out[19]:
Arrays are indexed in a similar way to sequences, with start:stop:stride
notation, except that this is used for each dimension in the array. Colons denote all the values in a particular dimension, slices indicate some particular subset of the data in that particular dimension.
A common use case is to get a single row or column from a 2D array (a table of data).
In [21]:
arr = np.arange(60).reshape(6, 10)
arr
Out[21]:
In [22]:
arr[:, 4] # the 5th column
Out[22]:
In [23]:
arr[2, :] # the 3rd row
Out[23]:
In [24]:
arr[2] # Trailing colons do not need to be explicitly typed. This is equivalent to the last example.
Out[24]:
In [25]:
arr[4, 7] # an individual element in the table
Out[25]:
In [ ]:
Generally, you will want to think of arrays as representing dimensions in space and time. The conventional way to think of this is that the dimensions are $(t, z, y, x)$; missing dimensions are omitted. This will help make plotting and analysis easier. Some examples might be:
temp[:, :, :, :] # A 4D array (time, height, latitude, longitude)
press[:, :] # A 2D array (time, height)
humid[:, :] # A 2D array (latitude, longitude)
In [27]:
arr = np.array([[1., 2., 3.,], [4., 5., 6.]]) # reset the array to our 2x3 array.
arr.mean() # The mean of all of the elements in the array
Out[27]:
Mean
takes the optional argument axis
that can be used to take the mean along a single axis of the array. Just like with indexing, the axes are reference in a zero-based system; axis=0
means the first dimension.
In [32]:
arr.mean(axis=0) # The mean
Out[32]:
In this case, there are two rows in the first dimension, and arr.mean(axis=0)
takes the average in the 'row' direction, resulting in a 1D array that is the average across the rows.
In [ ]:
You can also use the reshape
method to change the shape of an array.
In [33]:
arr
Out[33]:
In [34]:
arr.reshape(3, 2)
Out[34]:
You can find the mininum and maximum of an array with the min
and max
methods. Sometimes it is useful to find the indices of these minima and maxima. For this use argmin
and argmax
, like
In [35]:
x = np.random.rand(10)
imax = x.argmax()
print(imax, x[imax], x.max())
In [36]:
o = np.ones((3, 4, 5)) # The argument is a shape, so is a tuple with the length of each dimension as an argument
b = np.ones((2, 3), dtype=np.bool)
z = np.zeros((2, 3), dtype=np.float32)
b
Out[36]:
You can also create these arrays with the same shape and datatype of the input array using np.ones_like
and np.zeros_like
.
In [37]:
zb = np.zeros_like(b)
zb
Out[37]:
You can also create a diagonal array with a given vector along the diagonal. These can be offset with an optional argument k
(default=0). This example creates a tri-diagonal array similar to that used for finite difference calculations
In [38]:
np.diag(-2*np.ones(6)) + np.diag(np.ones(5), k=-1) + np.diag(np.ones(5), k=1)
Out[38]:
There are also a number of ways to generate sequences of numbers.
np.arange([start,] stop [[, stride]])
Create a sequence of numbers, similar to range
np.linspace(min, max, length)
Create a uniform series of specified length
between min
and max
, inclusive.np.logspace(minpow, maxpow, length)
Create a uniform series in logspace of specified length
between 10**minpow
and 10**maxpow
, inclusive.
In [53]:
np.arange(10.)
Out[53]:
In [54]:
np.arange(2, 10, 2)
Out[54]:
In [55]:
np.linspace(2, 4, 17)
Out[55]:
You can create arrays of random numbers easily with methods in np.random
.
np.random.rand(d0, d1, ..., d2)
: Create an array of the given shape d0, ..., dn
and populate it with random samples from a uniform distribution over [0,1).np.random.randint(low, high=None, size=None)
: Return random integers from low
(inclusive) to high
(exclusive). If high
is None then return integers from [0, low
). size
is an int or tuple of ints to give the output shape.np.randon.randn(d0, d1, ..., d2)
: Create an array of the given shape d0, ..., dn
and populate it with random samples from the "standard normal" distribution.np.random.random(size=None)
: Return random floats of size
(int or tuple of ints) in the interval [0, 1).
In [47]:
np.random.rand(2, 4)
Out[47]:
In [49]:
np.random.randint(1, 50, (2, 4))
Out[49]:
In [ ]:
In [54]:
x = np.random.rand(4, 5, 6)
y = np.random.rand(4, 5, 6)
print(np.concatenate((x, y)).shape)
print()
print(np.concatenate((x, y), axis=0).shape)
print(np.concatenate((x, y), axis=1).shape)
print(np.concatenate((x, y), axis=2).shape)
There are a number of convenience functions that act like concatenate for specific axes:
np.vstack
– vertical stack (stack along axis=0)np.hstack
– horizontal stack (stack along axis=1)np.dstack
– depth stack (stack along axis=2)
In [55]:
print(np.vstack((x, y)).shape)
print(np.hstack((x, y)).shape)
print(np.dstack((x, y)).shape)
Likewise, arrays can be split with np.split
or np.array_split
. There are also convenience functions to split horizontally, vertically, and with depth.
In [42]:
x = np.random.rand(12, 2, 5)
[a.shape for a in np.split(x, 4, axis=0)]
Out[42]:
Create an array, A, of shape (40, 50, 60). The array slices for first ten entries in the axis=1 direction of A should be filled with 1's, for the next ten filled with 2's, and on up to 5's.
Split it along axis=1 into five sections.
Concatenate two of these back together along axis 1.
What is the resulting shape of each array? [Advanced: can you calculate this on one line?]
In [ ]:
In [57]:
x = np.random.rand(5, 5)
print(x)
ind = x > 0.5
print(ind)
The boolean array can be used as an index to other arrays. Note this will return a 1D array, no matter what dimension the origial arrays are, because there is no way to know what structure the True
values have.
In [60]:
x = np.random.rand(5, 5)
y = np.sin(x)
y[x > 0.5]
# or, equivalently, as two lines
idx = x > 0.5
y[idx]
Out[60]:
To get the indices of the places where the conditional is true (i.e., the locations of the True
values in the boolean array), use the np.where
command.
In [61]:
x = np.random.rand(5, 5)
idx = np.where(x > 0.5)
idx
Out[61]:
Note that np.where
always returns a tuple of indices for each dimension. This is a little strange for 1D arrays, but is done for consistency across all input values. Often, you will want to explicitly pull out the (single) array of indices from the tuple, like
In [ ]:
x = np.random.rand(10)
idx = np.where(x>0.5)[0]
print(idx)
What happens with the [0] is missing behind the call to where
?
In [ ]:
The data for an array may be stored in memory using C
or FORTRAN
ordered memory. Typically, there is no need to think about this, some details can be found here.
However, it is important to remember that subsets of an array can produce a different 'view' of the array that addresses the same memory as the original array. This can lead to some unexpected behaviors. One way to think of this is that assignment in Python is more like a C-pointer (i.e., a reference to a memory location) than an actual value.
In [29]:
a = np.arange(10.0)
b = a[::2]
print(a)
print(b)
In [30]:
a[4] = -999 # this will modify b as well...
print(a)
print(b)
In [31]:
b[-1] = -888 # this will modify a as well...
print(a)
print(b)
Normally, this will not be a problem, but if you need to make sure that a subset of an array has it's own memory, make sure you make a copy
of the array, like
In [32]:
a = np.arange(10.0)
b = a.copy()[::2] # or np.copy(a)
a[4] = -999 # this will NOT modify b now
print(a)
print(b)
(Largely taken from SciPy docs)
Generally arrays should be the same shape for them to be multiplied together.
In [61]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])
a * b
Out[61]:
The term broadcasting describes how numpy
treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.
For example, the simplest broadcasting example occurs when an array and a scalar value are combined in an operation:
In [62]:
a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b
Out[62]:
The result is equivalent to the previous example where b was an array. We can think of the scalar b being stretched during the arithmetic operation into an array with the same shape as a. The new elements in b are simply copies of the original scalar. The stretching analogy is only conceptual. NumPy is smart enough to use the original scalar value without actually making copies, so that broadcasting operations are as memory and computationally efficient as possible.
The code in the second example is more efficient than that in the first because broadcasting moves less memory around during the multiplication (b is a scalar rather than an array).
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
If these conditions are not met, a ValueError: frames are not aligned exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.
Arrays do not need to have the same number of dimensions. For example, if you have a 256x256x3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:
Image (3d array): 256 x 256 x 3
Scale (1d array): 3
Result (3d array): 256 x 256 x 3
When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.
In the following example, both the A and B arrays have axes with length one that are expanded to a larger size during the broadcast operation:
A (4d array): 8 x 1 x 6 x 1
B (3d array): 7 x 1 x 5
Result (4d array): 8 x 7 x 6 x 5
Here are some more examples:
A (2d array): 5 x 4
B (1d array): 1
Result (2d array): 5 x 4
A (2d array): 5 x 4
B (1d array): 4
Result (2d array): 5 x 4
A (3d array): 15 x 3 x 5
B (3d array): 15 x 1 x 5
Result (3d array): 15 x 3 x 5
A (3d array): 15 x 3 x 5
B (2d array): 3 x 5
Result (3d array): 15 x 3 x 5
A (3d array): 15 x 3 x 5
B (2d array): 3 x 1
Result (3d array): 15 x 3 x 5
Let's create an example with arrays of random numbers.
In [65]:
A = np.random.rand(15, 3, 5)
B = np.random.rand(3, 1)
print(A.shape, B.shape)
Result = A * B
print(Result.shape)
Here are examples of shapes that do not broadcast:
A (1d array): 3
B (1d array): 4 # trailing dimensions do not match
A (2d array): 2 x 1
B (3d array): 8 x 4 x 3 # second from last dimensions mismatched
In [ ]:
Notice that the rules for broadcasting are based on the location of singleton dimensions. Singleton dimensions are implied forward (to the left), but not backward (to the right). So, the first example here works but not the second:
A (2d array): 5 x 4
B (1d array): 4
Result (2d array): 5 x 4
A (2d array): 5 x 4
B (1d array): 5
Compare with large set of examples above. How can the bottom example here be fixed?
This problem can be fixed by creating new singleton dimensions in arrays. This can be done by putting np.newaxis
in the appropriate space when indexing the array. For example:
In [77]:
A = np.random.rand(5, 4)
B = np.random.rand(5)
A*B
In [78]:
print(B.shape)
print(B[:,np.newaxis].shape)
In [82]:
(A*B[:,np.newaxis]).shape
Out[82]:
In [ ]:
In [ ]:
In [99]:
a = np.arange(12).reshape(3, 4)
print(a)
b = a.flatten()
print(b)
the flat
attribute on the other hand gives a view of the array in 1D. It looks like an iterator object (like range
and zip
). This allows
In [100]:
a.flat[6] = -999
print(a)
In contrast, this does not work as expected. WHY?
In [101]:
a.flatten()[5] = -888
print(a)
Other operations can be done to the array first. For example, we can take a transpose of the array before we flatten it.
In [102]:
a.T.flat[6] = -998
print(a)
Here, the T
attribute (equivalent to the a.transpose()
method) gives a view of the array transposed (similar to MATLAB's tick notation).
In [103]:
print(a.T)
Masked arrays are ways to create arrays with missing values. MATLAB™ uses NaNs (NaN stands for 'Not a Number'), and the NaNs are the values of the arrays at those points. This approach also works in Python. Masked arrays are preferred since they retain the masked array values, and also some plotting routines require masked arrays when plotting arrays with missing values. Masked arrays are usually created through some condition, like
In [105]:
arr = np.random.randn(7, 8)
cond = arr > 0.1 # `cond` is True for the random values greater than 0.5
marr = np.ma.masked_where(cond, arr)
print(marr)
In [108]:
marr.mean(axis=0)
Out[108]:
The mask can also be supplied explicity when creating the masked array,
In [ ]:
marr = np.ma.masked_array([1, 2, 3, 4, 5], mask=[True, True, False, False, True])
marr
One of the basic commands in numpy
for loading in data is the loadtxt
command. There are other ways to do this, such as the genfromtxt
command, but loadtxt
is sufficient for most purposes, and is easy to use.
In [18]:
data = np.loadtxt('../data/CTD.txt', comments='*')
data[:,1] # a column of data representing temperature
Out[18]:
Read in the oceanographic data file '../data/CTD.txt' into an array. You can look at the data file itself to see what variables are stored in each column.
Using this data, write a function to calculate the linear equation of state. This is an approximation of the density of water, as it depends on salinity, temperature, and some empirical constants. We will use the following form for the linear equation of state:
$\rho = 1027[1+7.6\times 10^{-4}(S-35) -1.7\times 10^{-4}(T-25)]$
where $\rho$ is the density, $S$ is the salinity, and $T$ is the temperature.
This is more free form than the homework, so you should set up all of the associated code to call the function, and write out the function yourself. Don't forget docstrings! For a check, the first value of your density array in order should equal 1021.7519981630001 and the last should equal 1028.0471353619998.
In [ ]:
In [9]:
import numpy as np
x = np.random.rand(100)
y = 5 + 3*x + 0.1*np.random.randn(100) # A straight line with some noise
p = np.polyfit(x, y, 3) # fit a straight line (order is 1)
print(p) # The coefficients of the polynomial, with highest order first. (i.e,. [slope, intercept])
Let's plot it to make sure this makes sense:
In [8]:
import matplotlib.pyplot as plt
%matplotlib inline
# plot data
plt.plot(x, y, '.')
# plot fitted line
plt.plot(x, p[0]*x + p[1])
# plt.legend(('Data', 'Fitted line'))
Out[8]:
Once you have the fit, you can use it to find other useful things, like the value of the fitted line at $x=1$:
In [138]:
np.polyval(p, 1)
Out[138]:
You can also use the np.polynomial.Polynomial
class to work with polynomials. Note, these define polynomials the opposite way, with the lowest order first. The Polynomial class gives an excellent example of operator overloading, and the flexibility of classes.
In [139]:
p1 = np.polynomial.Polynomial([5, 3]) # y = 5 + 3 x
p2 = np.polynomial.Polynomial([3, 6, 8, 2]) # y = 3 + 6 x + 8 x**2 + 2 x**3
You can use the Polynomial object to evaluate the value of the polynomial at various input values:
In [148]:
print('Evaluation')
print('p1(0.0) = ', p1(0))
print('p2(5.0) = ', p2(5))
We can use this to make a plot to see the function:
In [151]:
x = np.linspace(0,10)
plt.plot(x, p1(x), x, p2(x))
plt.legend(['p1', 'p2'])
Out[151]:
Other things we can do:
In [152]:
print('Roots')
print('Roots of p2 = ', p2.roots())
print()
print('Operations')
print('p1 + p2 = ', p1 + p2)
print('p1 * p2 = ', p1 * p2)
print()
print('Calculus')
print('Derivative of p1', p1.deriv(1))
print('Integral of p2', p2.integ(4, k=[4, 3, 2, 1]))
Vectorization and array broadcasting are two big reasons that numpy
can be efficient and fast. With these tools, you can avoid writing for loops (which are slow).
The best way to do mathematical operations using numpy
arrays is to do vector
operations. That is, mathematical operations are defined to be element by element, and this is done much faster than looping. As a rule of thumb, you should be very concerned if your code has more than one significant for
loop in the numerical analysis section.
Here is a way to do multiply 2 big arrays using for loops, which is not how you should do it. The sum at the end is included for comparison with the subsequent approach.
In [21]:
a = np.arange(102400.0).reshape(4, 8, 1600, 2) # a 4D array using sequential numbers
b = np.random.rand(4, 8, 1600, 2) # a 4D array using random numbers
li, lj, lk, lm = b.shape # size of b in each dimension
sol = np.zeros(b.shape)
for i in range(li):
for j in range(lj):
for k in range(lk):
for m in range(lm):
sol[i,j,k,m] = a[i,j,k,m]*b[i,j,k,m]
print(sol.sum())
The better way is to directly multiply the arrays together, taking advantage of C code that Python has in the background.
In [22]:
sol = a * b # element-by-element multiplication. This operation is about as fast as it can be on your computer.
print(sol.sum())
In [19]:
b = np.random.randn(5000, 2000)
%time u, s, v = np.linalg.svd(b)
%time
runs the line once and gives the time required. However, calculation times vary depending on many things including the numbers involved and the state of your computer at the moment. In this case, the %timeit
function can be used to perform the test a number of times to get an average calculation time.
In [20]:
%timeit b = np.random.randn(50, 20); u, s, v = np.linalg.svd(b)
For statements that are longer than a single line, the time.time
function can be used.
In [100]:
import time
t_start = time.time()
time.sleep(0.25) # Do nothing for 0.25 seconds
t_stop = time.time()
print('{:6.4f} seconds have passed.'.format(t_stop-t_start))
Earlier, we discussed using array operations instead of looping because it is faster. Let's compare.
Calculate the time it takes to calculate the $a$ and $b$ arrays with dimensions [4, 8, 1600, 2] by both methods demonstrated: using a series of 4
for
loops, one for each dimension of the arrays and using array operations. Compare the times by calculating a ratio.
In [ ]:
In [78]:
a = np.random.randn(100, 100)
e, v = np.linalg.eig(a)
b = np.random.randn(500, 200)
u, s, v = np.linalg.svd(b)
Matrix multiplication is done using the np.dot
function. In this case, matrices do not need to be the same shape, but must follow the rules of matrix multiplication. E.g., the operation dot(<4x5 array>, <5x12 array>) results in a 4x12 array; i.e., the inner dimensions must match (technically last and second-to-last, for arrays with more than two dimensions).
In [81]:
x = np.random.rand(4, 5)
y = np.random.rand(5, 12)
res = np.dot(x, y)
print(res.shape)
# np.dot(y, x) # This gives an error -- order is important.
Output from a numerical model of the northwestern Gulf of Mexico are saved in a file
../data/model.npz
. Read in this file usingnp.load
. Among other things, it containsh
, the depths within the numerical domain, andssh
, the sea surface heights at two time steps. The sea surface height gives the deviation above and below sea level from a reference water level (which changes in time as the water moves), and the depths of the seabed are also given with respect to that reference water level.Find the full water column depth, between the seabed and the sea surface, for the two given times.
You can use as a comparison that at the first time step the [0,0] value of this array should be 3007.6088347392124, and at the second time step the [0,-1] value should be 605.25282427018749. Note that there is a differences between the two time steps though it is generally quite small since it is the difference between time steps in the numerical circulation model.
Don't forget array conventions, repeated here for convenience:
Generally, you will want to think of arrays as representing dimensions in space and time. The conventional way to think of this is that the dimensions are $(t, z, y, x)$; missing dimensions are omitted. This will help make plotting and analysis easier. Some examples might be:
temp[:, :, :, :] # A 4D array (time, height, latitude, longitude)
press[:, :] # A 2D array (time, height)
humid[:, :] # A 2D array (latitude, longitude)
In [ ]: