Numpy


In [6]:
import numpy as np

The core of the numpy package is the array class. Let's examine that first. We can make an array out of a sequence, like a list.


In [7]:
d = [1, 2, 3, 4, 5]
np.array(d)


Out[7]:
array([1, 2, 3, 4, 5])

data types

Unlike lists, arrays must be homogeneous, in that the data types of each element must be the same. The data type of the array is upcast to be able to represent all of the data. So, if only one element is a float, all elements will be converted to floats.


In [8]:
d = [1, 2, 3.1415, 4, 5]
np.array(d)


Out[8]:
array([ 1.    ,  2.    ,  3.1415,  4.    ,  5.    ])

You can query the datatype by examining the dtype attribute of the array.


In [9]:
d = [1, 2, 3.1415, 4, 5]
arr = np.array(d)
arr.dtype


Out[9]:
dtype('float64')

Array types may be defined explicity in the call


In [10]:
arr = np.array([1, 2, 3, 4, 5], dtype='float32')
arr


Out[10]:
array([ 1.,  2.,  3.,  4.,  5.], dtype=float32)

Complex numbers are noted with a lowercase j or uppercase J, like this


In [11]:
cmplx = np.array([1.0+2.0j, 3.0+4.0J])
print(cmplx)
cmplx.dtype


[ 1.+2.j  3.+4.j]
Out[11]:
dtype('complex128')

As we have seen before, arrays are like multidimensional sequences. We can create a 2D array by supplying a list of lists as the argument.


In [12]:
arr = np.array([[1., 2., 3.,], [4., 5., 6.]])
arr


Out[12]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

Array attributes

Arrays have a few other important attributes. Note attributes never have parentheses after them. Methods always do.


In [13]:
arr.size          # The number of elements in the array


Out[13]:
6

In [14]:
arr.shape         # The shape of the array (i.e., the size of each dimension)


Out[14]:
(2, 3)

In [15]:
arr.ndim          # The number of dimensions of the array


Out[15]:
2

Setting array shape

You can set the array.shape attribute to change the shape of the array. This attribute does not change the elements of the array, or how it is stored in memory, just how it is seen.


In [16]:
arr.shape = (3, 2)
arr


Out[16]:
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.]])

In [17]:
arr.shape = (6,)
arr


Out[17]:
array([ 1.,  2.,  3.,  4.,  5.,  6.])

Singleton dimensions add to the dimensionality of an array. The last example was a 1D array (also called a vector), the next are 2D arrays.


In [18]:
arr.shape = (1, 6)
arr   # Note that there are *two* square brackets in the output sequence. This is a row vector.


Out[18]:
array([[ 1.,  2.,  3.,  4.,  5.,  6.]])

In [19]:
arr.shape = (6, 1)
arr   # this is also a 2D array, like a column vector


Out[19]:
array([[ 1.],
       [ 2.],
       [ 3.],
       [ 4.],
       [ 5.],
       [ 6.]])

Array indexing

Arrays are indexed in a similar way to sequences, with start:stop:stride notation, except that this is used for each dimension in the array. Colons denote all the values in a particular dimension, slices indicate some particular subset of the data in that particular dimension.

A common use case is to get a single row or column from a 2D array (a table of data).


In [21]:
arr = np.arange(60).reshape(6, 10)
arr


Out[21]:
array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59]])

In [22]:
arr[:, 4]   # the 5th column


Out[22]:
array([ 4, 14, 24, 34, 44, 54])

In [23]:
arr[2, :]   # the 3rd row


Out[23]:
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [24]:
arr[2]     # Trailing colons do not need to be explicitly typed. This is equivalent to the last example.


Out[24]:
array([20, 21, 22, 23, 24, 25, 26, 27, 28, 29])

In [25]:
arr[4, 7]   # an individual element in the table


Out[25]:
47

Exercise

Slices can be combined in any way. Define a new array or use array arr and grab out every other row and the 4th column and beyond.



In [ ]:

Conventions concerning arrays containing spatio-temporal information

Generally, you will want to think of arrays as representing dimensions in space and time. The conventional way to think of this is that the dimensions are $(t, z, y, x)$; missing dimensions are omitted. This will help make plotting and analysis easier. Some examples might be:

temp[:, :, :, :]     # A 4D array (time, height, latitude, longitude)
press[:, :]          # A 2D array (time, height)
humid[:, :]          # A 2D array (latitude, longitude)

Array methods

Arrays have a number of methods. Let's take a look at the mean method as an example.


In [27]:
arr = np.array([[1., 2., 3.,], [4., 5., 6.]])  # reset the array to our 2x3 array.

arr.mean()        # The mean of all of the elements in the array


Out[27]:
3.5

Mean takes the optional argument axis that can be used to take the mean along a single axis of the array. Just like with indexing, the axes are reference in a zero-based system; axis=0 means the first dimension.


In [32]:
arr.mean(axis=0)  # The mean


Out[32]:
array([ 2.5,  3.5,  4.5])

In this case, there are two rows in the first dimension, and arr.mean(axis=0) takes the average in the 'row' direction, resulting in a 1D array that is the average across the rows.


Exercise

Find the mean of the array in the 'column' direction, along axis=1.

Use the sum method of the array class to get the sum of the numbers in each column. The result should be a 1D array with three elements.



In [ ]:

You can also use the reshape method to change the shape of an array.


In [33]:
arr


Out[33]:
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

In [34]:
arr.reshape(3, 2)


Out[34]:
array([[ 1.,  2.],
       [ 3.,  4.],
       [ 5.,  6.]])

You can find the mininum and maximum of an array with the min and max methods. Sometimes it is useful to find the indices of these minima and maxima. For this use argmin and argmax, like


In [35]:
x = np.random.rand(10)
imax = x.argmax()
print(imax, x[imax], x.max())


7 0.947778556457 0.947778556457

Creating standard arrays

There are a few standard arrays, for example, arrays filled with zeros or ones (or empty). Here are some examples of creating arrays.


In [36]:
o = np.ones((3, 4, 5))    # The argument is a shape, so is a tuple with the length of each dimension as an argument
b = np.ones((2, 3), dtype=np.bool)
z = np.zeros((2, 3), dtype=np.float32)

b


Out[36]:
array([[ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

You can also create these arrays with the same shape and datatype of the input array using np.ones_like and np.zeros_like.


In [37]:
zb = np.zeros_like(b)
zb


Out[37]:
array([[False, False, False],
       [False, False, False]], dtype=bool)

You can also create a diagonal array with a given vector along the diagonal. These can be offset with an optional argument k (default=0). This example creates a tri-diagonal array similar to that used for finite difference calculations


In [38]:
np.diag(-2*np.ones(6)) + np.diag(np.ones(5), k=-1) + np.diag(np.ones(5), k=1)


Out[38]:
array([[-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.]])

There are also a number of ways to generate sequences of numbers.

  • np.arange([start,] stop [[, stride]]) Create a sequence of numbers, similar to range
  • np.linspace(min, max, length) Create a uniform series of specified length between min and max, inclusive.
  • np.logspace(minpow, maxpow, length) Create a uniform series in logspace of specified length between 10**minpow and 10**maxpow, inclusive.

In [53]:
np.arange(10.)


Out[53]:
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])

In [54]:
np.arange(2, 10, 2)


Out[54]:
array([2, 4, 6, 8])

In [55]:
np.linspace(2, 4, 17)


Out[55]:
array([ 2.   ,  2.125,  2.25 ,  2.375,  2.5  ,  2.625,  2.75 ,  2.875,
        3.   ,  3.125,  3.25 ,  3.375,  3.5  ,  3.625,  3.75 ,  3.875,  4.   ])

You can create arrays of random numbers easily with methods in np.random.

  • np.random.rand(d0, d1, ..., d2): Create an array of the given shape d0, ..., dn and populate it with random samples from a uniform distribution over [0,1).
  • np.random.randint(low, high=None, size=None): Return random integers from low (inclusive) to high (exclusive). If high is None then return integers from [0, low). size is an int or tuple of ints to give the output shape.
  • np.randon.randn(d0, d1, ..., d2): Create an array of the given shape d0, ..., dn and populate it with random samples from the "standard normal" distribution.
  • np.random.random(size=None): Return random floats of size (int or tuple of ints) in the interval [0, 1).

In [47]:
np.random.rand(2, 4)


Out[47]:
array([[ 0.65562795,  0.84163171,  0.55489888,  0.769313  ],
       [ 0.01631938,  0.42805918,  0.24618697,  0.89156945]])

In [49]:
np.random.randint(1, 50, (2, 4))


Out[49]:
array([[25, 44, 48, 35],
       [25, 35, 18, 19]])

Exercise

Create an array of random floats between 0 and 1 that has dimension 5 x 3. Calculate the standard deviation of the columns of the array. Then add to this a linspace array of the appropriate size that contains numbers between 10 and 15.



In [ ]:

Combining and splitting arrays

Generally, arrays can be combined with the np.concatenate function. The arguments are a sequence of arrays to join, and the axis along which to join them (default=0).


In [54]:
x = np.random.rand(4, 5, 6)
y = np.random.rand(4, 5, 6)

print(np.concatenate((x, y)).shape)
print()
print(np.concatenate((x, y), axis=0).shape)
print(np.concatenate((x, y), axis=1).shape)
print(np.concatenate((x, y), axis=2).shape)


(8, 5, 6)

(8, 5, 6)
(4, 10, 6)
(4, 5, 12)

There are a number of convenience functions that act like concatenate for specific axes:

  • np.vstack – vertical stack (stack along axis=0)
  • np.hstack – horizontal stack (stack along axis=1)
  • np.dstack – depth stack (stack along axis=2)

In [55]:
print(np.vstack((x, y)).shape)
print(np.hstack((x, y)).shape)
print(np.dstack((x, y)).shape)


(8, 5, 6)
(4, 10, 6)
(4, 5, 12)

Likewise, arrays can be split with np.split or np.array_split. There are also convenience functions to split horizontally, vertically, and with depth.


In [42]:
x = np.random.rand(12, 2, 5)
[a.shape for a in np.split(x, 4, axis=0)]


Out[42]:
[(3, 2, 5), (3, 2, 5), (3, 2, 5), (3, 2, 5)]

Exercise

Create an array, A, of shape (40, 50, 60). The array slices for first ten entries in the axis=1 direction of A should be filled with 1's, for the next ten filled with 2's, and on up to 5's.

Split it along axis=1 into five sections.

Concatenate two of these back together along axis 1.

What is the resulting shape of each array? [Advanced: can you calculate this on one line?]



In [ ]:

Finding values

There are a number of ways to find values in an array. The simplest is always to create a boolean array, like


In [57]:
x = np.random.rand(5, 5)
print(x)
ind = x > 0.5
print(ind)


[[ 0.11873289  0.40188555  0.96426409  0.60360203  0.54526375]
 [ 0.20289134  0.98457289  0.3847002   0.00733166  0.63739985]
 [ 0.50234835  0.25930096  0.36977497  0.05787961  0.95391643]
 [ 0.02016065  0.69361985  0.5945571   0.381457    0.33079702]
 [ 0.62571904  0.18626591  0.35990447  0.65804042  0.12489787]]
[[False False  True  True  True]
 [False  True False False  True]
 [ True False False False  True]
 [False  True  True False False]
 [ True False False  True False]]

The boolean array can be used as an index to other arrays. Note this will return a 1D array, no matter what dimension the origial arrays are, because there is no way to know what structure the True values have.


In [60]:
x = np.random.rand(5, 5)
y = np.sin(x)

y[x > 0.5]
# or, equivalently, as two lines
idx = x > 0.5
y[idx]


Out[60]:
array([ 0.63127609,  0.775866  ,  0.71526395,  0.77716728,  0.66380859,
        0.57566832,  0.77979431,  0.79410003,  0.81556793,  0.82887742])

To get the indices of the places where the conditional is true (i.e., the locations of the True values in the boolean array), use the np.where command.


In [61]:
x = np.random.rand(5, 5)
idx = np.where(x > 0.5)
idx


Out[61]:
(array([0, 0, 0, 1, 1, 1, 3, 3, 3, 4]), array([1, 2, 3, 0, 2, 3, 1, 3, 4, 3]))

Note that np.where always returns a tuple of indices for each dimension. This is a little strange for 1D arrays, but is done for consistency across all input values. Often, you will want to explicitly pull out the (single) array of indices from the tuple, like


In [ ]:
x = np.random.rand(10)
idx = np.where(x>0.5)[0]
print(idx)

What happens with the [0] is missing behind the call to where?


Exercise

You can also use these calculated indices, or boolean matrices on the left hand side for assignment.

Create a 10x10 random array, with values between 0 and 1. Replace all of the numbers smaller than 0.5 with zero.

Do this first not using where and then do it using where.



In [ ]:

Array views

The data for an array may be stored in memory using C or FORTRAN ordered memory. Typically, there is no need to think about this, some details can be found here.

However, it is important to remember that subsets of an array can produce a different 'view' of the array that addresses the same memory as the original array. This can lead to some unexpected behaviors. One way to think of this is that assignment in Python is more like a C-pointer (i.e., a reference to a memory location) than an actual value.


In [29]:
a = np.arange(10.0)
b = a[::2]
print(a)
print(b)


[ 0.  1.  2.  3.  4.  5.  6.  7.  8.  9.]
[ 0.  2.  4.  6.  8.]

In [30]:
a[4] = -999   # this will modify b as well...
print(a)
print(b)


[   0.    1.    2.    3. -999.    5.    6.    7.    8.    9.]
[   0.    2. -999.    6.    8.]

In [31]:
b[-1] = -888  # this will modify a as well...
print(a)
print(b)


[   0.    1.    2.    3. -999.    5.    6.    7. -888.    9.]
[   0.    2. -999.    6. -888.]

Normally, this will not be a problem, but if you need to make sure that a subset of an array has it's own memory, make sure you make a copy of the array, like


In [32]:
a = np.arange(10.0)
b = a.copy()[::2]     # or np.copy(a)
a[4] = -999   # this will NOT modify b now
print(a)
print(b)


[   0.    1.    2.    3. -999.    5.    6.    7.    8.    9.]
[ 0.  2.  4.  6.  8.]

Array broadcasting

(Largely taken from SciPy docs)

Generally arrays should be the same shape for them to be multiplied together.


In [61]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])
a * b


Out[61]:
array([ 2.,  4.,  6.])

The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations.

For example, the simplest broadcasting example occurs when an array and a scalar value are combined in an operation:


In [62]:
a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b


Out[62]:
array([ 2.,  4.,  6.])

The result is equivalent to the previous example where b was an array. We can think of the scalar b being stretched during the arithmetic operation into an array with the same shape as a. The new elements in b are simply copies of the original scalar. The stretching analogy is only conceptual. NumPy is smart enough to use the original scalar value without actually making copies, so that broadcasting operations are as memory and computationally efficient as possible.

The code in the second example is more efficient than that in the first because broadcasting moves less memory around during the multiplication (b is a scalar rather than an array).

General Broadcasting Rules

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

  1. they are equal, or
  2. one of them is 1

If these conditions are not met, a ValueError: frames are not aligned exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the maximum size along each dimension of the input arrays.

Arrays do not need to have the same number of dimensions. For example, if you have a 256x256x3 array of RGB values, and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules, shows that they are compatible:

Image  (3d array): 256 x 256 x 3
Scale  (1d array):             3
Result (3d array): 256 x 256 x 3

When either of the dimensions compared is one, the other is used. In other words, dimensions with size 1 are stretched or “copied” to match the other.

In the following example, both the A and B arrays have axes with length one that are expanded to a larger size during the broadcast operation:

A      (4d array):  8 x 1 x 6 x 1
B      (3d array):      7 x 1 x 5
Result (4d array):  8 x 7 x 6 x 5

Here are some more examples:

A      (2d array):  5 x 4
B      (1d array):      1
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (3d array):  15 x 3 x 5
B      (3d array):  15 x 1 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 1
Result (3d array):  15 x 3 x 5

Let's create an example with arrays of random numbers.


In [65]:
A = np.random.rand(15, 3, 5)
B = np.random.rand(3, 1)
print(A.shape, B.shape)

Result = A * B
print(Result.shape)


(15, 3, 5) (3, 1)
(15, 3, 5)

Here are examples of shapes that do not broadcast:

A      (1d array):  3
B      (1d array):  4 # trailing dimensions do not match

A      (2d array):      2 x 1
B      (3d array):  8 x 4 x 3 # second from last dimensions mismatched

Exercise

a = np.random.rand(5, 7, 1, 8)
b = np.random.rand(8)
c = np.random.rand(7, 3, 8)
d = np.random.rand(5, 1, 3, 1)

Experiment with multiplying combinations of the arrays above together. Try to predict the resulting shape beforehand.



In [ ]:

Notice that the rules for broadcasting are based on the location of singleton dimensions. Singleton dimensions are implied forward (to the left), but not backward (to the right). So, the first example here works but not the second:

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):  5


Compare with large set of examples above. How can the bottom example here be fixed?

This problem can be fixed by creating new singleton dimensions in arrays. This can be done by putting np.newaxis in the appropriate space when indexing the array. For example:


In [77]:
A = np.random.rand(5, 4)
B = np.random.rand(5)
A*B


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-77-71f7563bb66b> in <module>()
      1 A = np.random.rand(5, 4)
      2 B = np.random.rand(5)
----> 3 A*B

ValueError: operands could not be broadcast together with shapes (5,4) (5,) 

In [78]:
print(B.shape)
print(B[:,np.newaxis].shape)


(5,)
(5, 1)

In [82]:
(A*B[:,np.newaxis]).shape


Out[82]:
(5, 4)

Exercise

Multiply b = np.random.rand(8) and c = np.random.rand(8, 3, 7). What is another way you could accomplish this calculation besides using newaxis?



In [ ]:


Exercise

b = np.random.rand(2)
c = np.random.rand(2, 3)

Concatenate arrays b and c. Along which axis would it make sense to concatenate, given the arrays dimensions? Do you need to make any changes to the arrays to get this to work?



In [ ]:

Flattening arrays with a.flat and a.flatten()

There are two basic ways to turn any array into a 1D array. They are slightly different.

a.flatten() returns a copy of an array, in one dimension.


In [99]:
a = np.arange(12).reshape(3, 4)
print(a)
b = a.flatten()
print(b)


[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
[ 0  1  2  3  4  5  6  7  8  9 10 11]

the flat attribute on the other hand gives a view of the array in 1D. It looks like an iterator object (like range and zip). This allows


In [100]:
a.flat[6] = -999
print(a)


[[   0    1    2    3]
 [   4    5 -999    7]
 [   8    9   10   11]]

In contrast, this does not work as expected. WHY?


In [101]:
a.flatten()[5] = -888
print(a)


[[   0    1    2    3]
 [   4    5 -999    7]
 [   8    9   10   11]]

Other operations can be done to the array first. For example, we can take a transpose of the array before we flatten it.


In [102]:
a.T.flat[6] = -998
print(a)


[[   0    1 -998    3]
 [   4    5 -999    7]
 [   8    9   10   11]]

Here, the T attribute (equivalent to the a.transpose() method) gives a view of the array transposed (similar to MATLAB's tick notation).


In [103]:
print(a.T)


[[   0    4    8]
 [   1    5    9]
 [-998 -999   10]
 [   3    7   11]]

Masked arrays

Masked arrays are ways to create arrays with missing values. MATLAB™ uses NaNs (NaN stands for 'Not a Number'), and the NaNs are the values of the arrays at those points. This approach also works in Python. Masked arrays are preferred since they retain the masked array values, and also some plotting routines require masked arrays when plotting arrays with missing values. Masked arrays are usually created through some condition, like


In [105]:
arr = np.random.randn(7, 8)
cond = arr > 0.1   # `cond` is True for the random values greater than 0.5

marr = np.ma.masked_where(cond, arr)

print(marr)


[[0.02941043010940724 -0.5466136203983295 -- 0.04566353711871264 --
  -0.5739572572606573 -- -0.8220133514850928]
 [-0.7675872990412954 -- -0.934486870430469 -2.311278697446725 --
  -0.023208795258473497 -- -0.29419922711162017]
 [-1.0232570680249242 -- -- -0.49795955508459233 -- -1.4441319062795115
  -0.6790882401823027 --]
 [-1.4920489367985965 -- -0.4269997610429884 -- -- -- -- --]
 [-- -- -0.2575580888713064 -- -1.3048474660410572 -- 0.0896451390382945
  -0.5334695721742757]
 [-- -0.30986677252626726 -0.45133263577692634 0.026890976874324297
  -1.4992378360437504 -2.3250446109192966 -0.5827616511361144
  -0.2910417799410729]
 [-1.1132356659249456 -0.06832504980060151 -2.1181568344765713
  -0.29709346856671165 -0.39718603135250913 -- -- -0.11786219870099636]]

In [108]:
marr.mean(axis=0)


Out[108]:
masked_array(data = [-0.8733437079360709 -0.3082684809083994 -0.8377068381196523
 -0.6067554414209984 -1.0670904444791056 -1.0915856424294847
 -0.3907349174267076 -0.4117172258826116],
             mask = [False False False False False False False False],
       fill_value = 1e+20)

The mask can also be supplied explicity when creating the masked array,


In [ ]:
marr = np.ma.masked_array([1, 2, 3, 4, 5], mask=[True, True, False, False, True])
marr

Importing data

One of the basic commands in numpy for loading in data is the loadtxt command. There are other ways to do this, such as the genfromtxt command, but loadtxt is sufficient for most purposes, and is easy to use.


In [18]:
data = np.loadtxt('../data/CTD.txt', comments='*')
data[:,1]    # a column of data representing temperature


Out[18]:
array([ 2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,  5.5,  6. ,  6.5,  7. ,
        7.5,  8. ,  8.5,  9. ,  9.5, 10. , 10.5, 11. , 11.5, 12. , 12.5,
       13. , 13.5, 14. , 14.5, 15. , 15.5, 16. , 16.5, 17. , 17.5, 18. ,
       18.5, 19. , 19.5, 20. ])

Exercise

Read in the oceanographic data file '../data/CTD.txt' into an array. You can look at the data file itself to see what variables are stored in each column.

Using this data, write a function to calculate the linear equation of state. This is an approximation of the density of water, as it depends on salinity, temperature, and some empirical constants. We will use the following form for the linear equation of state:

$\rho = 1027[1+7.6\times 10^{-4}(S-35) -1.7\times 10^{-4}(T-25)]$

where $\rho$ is the density, $S$ is the salinity, and $T$ is the temperature.

This is more free form than the homework, so you should set up all of the associated code to call the function, and write out the function yourself. Don't forget docstrings! For a check, the first value of your density array in order should equal 1021.7519981630001 and the last should equal 1028.0471353619998.



In [ ]:

Polynomial fitting

The basic function for fitting a polynomial (e.g., a straight line) is np.polyfit(x, y, deg). There are a number of other functions that let you add (np.polyadd), multiply (np.polymul), find zeros (np.roots), and do other operations to polynomials.


In [9]:
import numpy as np

x = np.random.rand(100)
y = 5 + 3*x + 0.1*np.random.randn(100)   # A straight line with some noise

p = np.polyfit(x, y, 3)  # fit a straight line (order is 1)
print(p)  # The coefficients of the polynomial, with highest order first. (i.e,. [slope, intercept])


[-0.99813687  1.37935569  2.52617167  5.02224807]

Let's plot it to make sure this makes sense:


In [8]:
import matplotlib.pyplot as plt
%matplotlib inline

# plot data
plt.plot(x, y, '.')

# plot fitted line
plt.plot(x, p[0]*x + p[1])

# plt.legend(('Data', 'Fitted line'))


Out[8]:
[<matplotlib.lines.Line2D at 0x7f5b8df78eb8>]

Once you have the fit, you can use it to find other useful things, like the value of the fitted line at $x=1$:


In [138]:
np.polyval(p, 1)


Out[138]:
8.0372816397507396

You can also use the np.polynomial.Polynomial class to work with polynomials. Note, these define polynomials the opposite way, with the lowest order first. The Polynomial class gives an excellent example of operator overloading, and the flexibility of classes.


In [139]:
p1 = np.polynomial.Polynomial([5, 3])         # y = 5 + 3 x
p2 = np.polynomial.Polynomial([3, 6, 8, 2])   # y = 3 + 6 x + 8 x**2 + 2 x**3

You can use the Polynomial object to evaluate the value of the polynomial at various input values:


In [148]:
print('Evaluation')
print('p1(0.0) = ', p1(0))
print('p2(5.0) = ', p2(5))


Evaluation
p1(0.0) =  5.0
p2(5.0) =  483.0

We can use this to make a plot to see the function:


In [151]:
x = np.linspace(0,10)
plt.plot(x, p1(x), x, p2(x))
plt.legend(['p1', 'p2'])


Out[151]:
<matplotlib.legend.Legend at 0x7fbc8ec37358>

Other things we can do:


In [152]:
print('Roots')
print('Roots of p2 = ', p2.roots())
print()
print('Operations')
print('p1 + p2 = ', p1 + p2)
print('p1 * p2 = ', p1 * p2)
print()
print('Calculus')
print('Derivative of p1', p1.deriv(1))
print('Integral of p2', p2.integ(4, k=[4, 3, 2, 1]))


Roots
Roots of p2 =  [-3.21124270+0.j         -0.39437865-0.55818847j -0.39437865+0.55818847j]

Operations
p1 + p2 =  poly([ 8.  9.  8.  2.])
p1 * p2 =  poly([ 15.  39.  58.  34.   6.])

Calculus
Derivative of p1 poly([ 3.])
Integral of p2 poly([ 1.          2.          1.5         0.66666667  0.125       0.05
  0.02222222  0.00238095])

Vectorization

Vectorization and array broadcasting are two big reasons that numpy can be efficient and fast. With these tools, you can avoid writing for loops (which are slow).

The best way to do mathematical operations using numpy arrays is to do vector operations. That is, mathematical operations are defined to be element by element, and this is done much faster than looping. As a rule of thumb, you should be very concerned if your code has more than one significant for loop in the numerical analysis section.

Here is a way to do multiply 2 big arrays using for loops, which is not how you should do it. The sum at the end is included for comparison with the subsequent approach.


In [21]:
a = np.arange(102400.0).reshape(4, 8, 1600, 2)   # a 4D array using sequential numbers
b = np.random.rand(4, 8, 1600, 2)                # a 4D array using random numbers

li, lj, lk, lm = b.shape  # size of b in each dimension
sol = np.zeros(b.shape)
for i in range(li):
    for j in range(lj):
        for k in range(lk):
            for m in range(lm):
                sol[i,j,k,m] = a[i,j,k,m]*b[i,j,k,m]
print(sol.sum())


2620343610.3570576

The better way is to directly multiply the arrays together, taking advantage of C code that Python has in the background.


In [22]:
sol = a * b       # element-by-element multiplication. This operation is about as fast as it can be on your computer.
print(sol.sum())


2620343610.3570576

Basic performance evaluation

We can do some very basic perfomance testing using the %time special function in jupyter notebooks. Lets use this to examine the time it takes to do a singular value decomposition for different sized matrices.


In [19]:
b = np.random.randn(5000, 2000)
%time u, s, v = np.linalg.svd(b)


CPU times: user 1min, sys: 0 ns, total: 1min
Wall time: 5.17 s

%time runs the line once and gives the time required. However, calculation times vary depending on many things including the numbers involved and the state of your computer at the moment. In this case, the %timeit function can be used to perform the test a number of times to get an average calculation time.


In [20]:
%timeit b = np.random.randn(50, 20); u, s, v = np.linalg.svd(b)


640 µs ± 140 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

For statements that are longer than a single line, the time.time function can be used.


In [100]:
import time

t_start = time.time()
time.sleep(0.25)   # Do nothing for 0.25 seconds
t_stop = time.time()

print('{:6.4f} seconds have passed.'.format(t_stop-t_start))


0.2506 seconds have passed.

Exercise

Earlier, we discussed using array operations instead of looping because it is faster. Let's compare.

Calculate the time it takes to calculate the $a$ and $b$ arrays with dimensions [4, 8, 1600, 2] by both methods demonstrated: using a series of 4 for loops, one for each dimension of the arrays and using array operations. Compare the times by calculating a ratio.



In [ ]:

Linear algebra

One of the key elements of the numpy package is the numpy.linalg subpackage that contains a number of linear algebra functions that work efficiently on arrays.


In [78]:
a = np.random.randn(100, 100)
e, v = np.linalg.eig(a)

b = np.random.randn(500, 200)
u, s, v = np.linalg.svd(b)

Matrix multiplication is done using the np.dot function. In this case, matrices do not need to be the same shape, but must follow the rules of matrix multiplication. E.g., the operation dot(<4x5 array>, <5x12 array>) results in a 4x12 array; i.e., the inner dimensions must match (technically last and second-to-last, for arrays with more than two dimensions).


In [81]:
x = np.random.rand(4, 5)
y = np.random.rand(5, 12)

res = np.dot(x, y)
print(res.shape)

# np.dot(y, x)  # This gives an error -- order is important.


(4, 12)


Exercise

Output from a numerical model of the northwestern Gulf of Mexico are saved in a file ../data/model.npz. Read in this file using np.load. Among other things, it contains h, the depths within the numerical domain, and ssh, the sea surface heights at two time steps. The sea surface height gives the deviation above and below sea level from a reference water level (which changes in time as the water moves), and the depths of the seabed are also given with respect to that reference water level.

Find the full water column depth, between the seabed and the sea surface, for the two given times.

You can use as a comparison that at the first time step the [0,0] value of this array should be 3007.6088347392124, and at the second time step the [0,-1] value should be 605.25282427018749. Note that there is a differences between the two time steps though it is generally quite small since it is the difference between time steps in the numerical circulation model.

Don't forget array conventions, repeated here for convenience:

Generally, you will want to think of arrays as representing dimensions in space and time. The conventional way to think of this is that the dimensions are $(t, z, y, x)$; missing dimensions are omitted. This will help make plotting and analysis easier. Some examples might be:

temp[:, :, :, :]     # A 4D array (time, height, latitude, longitude)
press[:, :]          # A 2D array (time, height)
humid[:, :]          # A 2D array (latitude, longitude)


In [ ]: