Before we get to some of the more advanced sequence analysis techniques, we need to cover some critical concepts using advanced Python data structures. In this lecture, we'll see how we can use an external library to make these computations much easier and much faster. By the end of this lecture, you should be able to:
NumPy, or Numerical Python, is an incredible library of basic functions and data structures that provide a robust foundation for computational scientists.
Put another way: if you're using Python and doing any kind of math, you'll probably use NumPy.
At this point, NumPy is so deeply embedded in so many other 3rd-party modules related to scientific computing that even if you're not making explicit use of it, at least one of the other modules you're using probably is.
In [1]:
matrix = [ [1, 2, 3],
           [4, 5, 6],
           [7, 8, 9] ]
print(matrix)
Indexing would still work as you would expect, but looping through a matrix--say, to do matrix multiplication--would be laborious and highly inefficient.
We'll demonstrate this experimentally later, but suffice it to say that Python lists embody the drawbacks of using an interpreted language such as Python: they're easy to use, but oh so slow.
By contrast, in NumPy we have the ndarray structure (short for "n-dimensional array"), a highly optimized version of Python lists, perfect for fast and efficient computations. To make use of NumPy arrays, import NumPy (it's installed by default in Anaconda, and on JupyterHub):
In [2]:
import numpy
Now just call the array method using our list from before!
In [3]:
arr = numpy.array(matrix)
print(arr)
To reference an element in the array, just use the same notation we did for lists:
In [4]:
arr[0]
Out[4]:
array([1, 2, 3])
In [5]:
arr[2][2]
Out[5]:
9
You can also separate dimensions by commas:
In [6]:
arr[2, 2]
Out[6]:
9
Remember, with indexing matrices: the first index is the row, the second index is the column.
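To make that row-then-column convention concrete, here's a minimal sketch using the same 3-by-3 matrix:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print(arr[0, 1])  # Row 0, column 1: the value 2.
print(arr[1, 0])  # Row 1, column 0: the value 4.
```

Swapping the two indices gives you a different element, so keep the row-first ordering in mind.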
NumPy has an impressive array of utility modules that come along with it, optimized to use its ndarray data structure. I highly encourage you to use them, even if you're not using NumPy arrays.
1: Basic mathematical routines
All the core functions you could want; for example, all the built-in Python math routines (trig, logs, exponents, etc.) have NumPy versions (numpy.sin, numpy.cos, numpy.log, numpy.exp, numpy.max, numpy.min).
In [7]:
a = numpy.array([45, 2, 59, -2, 70, 3, 6, 790])
print("Minimum: {}".format(numpy.min(a)))
print("Cosine of 1st element: {:.2f}".format(numpy.cos(a[0])))
2: Fourier transforms
If you do any signal processing using Fourier transforms (which we might, later!), NumPy has an entire sub-module full of tools for this type of analysis in numpy.fft.
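As a quick sketch of what numpy.fft can do (the 4 Hz signal and 64 Hz sampling rate here are made up for illustration):

```python
import numpy as np

# A toy signal: a 4 Hz sine wave sampled at 64 Hz for one second.
t = np.arange(64) / 64.0
signal = np.sin(2 * np.pi * 4.0 * t)

spectrum = np.fft.fft(signal)                  # Complex frequency-domain coefficients.
freqs = np.fft.fftfreq(len(signal), d=1/64.0)  # The frequency each coefficient corresponds to.

# The dominant non-negative frequency should be our 4 Hz component.
peak = freqs[np.argmax(np.abs(spectrum[:32]))]
print(peak)  # 4.0
```

The transform correctly recovers the frequency we put into the signal.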
3: Linear algebra
This is most of your vector and matrix linear algebra operations, from vector norms (numpy.linalg.norm) to singular value decomposition (numpy.linalg.svd) to matrix determinants (numpy.linalg.det).
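A brief sketch of those three routines in action (the small vector and matrix here are arbitrary):

```python
import numpy as np
import numpy.linalg as nla

v = np.array([3.0, 4.0])
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

print(nla.norm(v))   # Vector 2-norm: sqrt(3^2 + 4^2) = 5.0
print(nla.det(A))    # Determinant: 3*5 - 0*4 = 15 (up to floating-point noise)
U, S, Vt = nla.svd(A)
print(S)             # The two singular values, largest first.
```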
4: Random numbers
NumPy has a phenomenal random number library in numpy.random. In addition to generating uniform random numbers in a certain range, you can also sample from any known parametric distribution.
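For instance, here's a sketch of sampling from a parametric (normal) distribution; the sample size is arbitrary:

```python
import numpy as np

samples = np.random.normal(loc=0.0, scale=1.0, size=100000)  # 100,000 standard-normal draws.
print("{:.2f}".format(samples.mean()))  # Should be close to 0.
print("{:.2f}".format(samples.std()))   # Should be close to 1.
```

With this many samples, the empirical mean and standard deviation land very close to the distribution's true parameters.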
In [8]:
print(numpy.random.randint(10)) # Random integer between 0 (inclusive) and 10 (exclusive)
print(numpy.random.randint(10)) # Another one!
print(numpy.random.randint(10)) # Yet another one!
"Vectorized arithmetic" refers to how NumPy allows you to efficiently perform arithmetic operations on entire NumPy arrays at once, as you would with "regular" Python variables.
For example: let's say you have a vector and you want to normalize it to be unit length; that involves dividing every element in the vector by a constant (the magnitude of the vector). With lists, you'd have to loop through them manually.
In [9]:
vector = [4.0, 15.0, 6.0, 2.0]

# To normalize this to unit length, we need to divide each element by the vector's magnitude.
# To learn its magnitude, we need to loop through the whole vector.
# So: we need two loops!
magnitude = 0.0
for element in vector:
    magnitude += element ** 2
magnitude = magnitude ** 0.5  # square root
print("Original magnitude: {:.2f}".format(magnitude))
In [10]:
new_magnitude = 0.0
for i in range(len(vector)):
    element = vector[i]
    normalized = element / magnitude
    vector[i] = normalized
    new_magnitude += normalized ** 2
new_magnitude = new_magnitude ** 0.5
print("Normalized magnitude: {:.2f}".format(new_magnitude))
Now, let's see the same operation, this time with NumPy arrays.
In [11]:
import numpy as np # This tends to be the "standard" convention when importing NumPy.
import numpy.linalg as nla
vector = [4.0, 15.0, 6.0, 2.0]
np_vector = np.array(vector) # Convert to NumPy array.
magnitude = nla.norm(np_vector) # Computing the magnitude: one-liner.
print("Original magnitude: {:.2f}".format(magnitude))
np_vector /= magnitude # Vectorized division!!! No loop needed!
new_magnitude = nla.norm(np_vector)
print("Normalized magnitude: {:.2f}".format(new_magnitude))
No loops needed, far fewer lines of code, and a simple intuitive operation.
Operations involving arrays on both sides of the operator will also work (though the two arrays generally need to be the same shape).
For example, adding two vectors together:
In [12]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = x + y
print(z)
Works exactly as you'd expect, but no [explicit] loop needed.
This becomes particularly compelling with matrix multiplication. Say you have two matrices, $A$ and $B$:
In [13]:
A = np.array([ [1, 2], [3, 4] ])
B = np.array([ [5, 6], [7, 8] ])
If you recall from linear algebra, matrix multiplication $A \times B$ involves multiplying each row of $A$ by each column of $B$. But rather than write that code yourself, Python (as of version 3.5) gives us a dedicated matrix multiplication operator: the @ symbol!
In [14]:
A @ B
Out[14]:
array([[19, 22],
       [43, 50]])
In almost every case, vectorized operations are far more efficient than loops written in Python to do the same thing.
In [15]:
def multiply_loops(A, B):
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A[i, k] * B[k, j]
    return C
In [16]:
def multiply_vector(A, B):
    return A @ B
In [17]:
X = np.random.random((100, 100))
Y = np.random.random((100, 100))
%timeit multiply_loops(X, Y)
%timeit multiply_vector(X, Y)
If you're implementing loops in conjunction with arrays, see if there's any way to use vectorized operations instead.
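For example, a loop-based sum of squares collapses into a single vectorized expression (a sketch with made-up data):

```python
import numpy as np

x = np.arange(1, 1001)  # The integers 1 through 1000.

# Loop version: accumulate the sum of squares one element at a time.
total = 0
for value in x:
    total += value ** 2

# Vectorized version: square the whole array at once, then sum it.
vectorized_total = np.sum(x ** 2)

print(total == vectorized_total)  # Same answer, no explicit loop.
```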
If you have a regular Python list, you can use numpy.array() to convert it. Adjusting the length of the NumPy array after it's constructed is more difficult than with a standard list. Some functionality from the standard library just needs import statements. There is even more functionality from 3rd-party vendors, but it needs to be installed before it can be imported. NumPy falls in this lattermost category.

Hopefully, you recall basic indexing and slicing from Lecture 4.
In [18]:
li = [1, 2, 3, 4, 5]
print(li)
print(li[1:3]) # Print element 1 (inclusive) to 3 (exclusive)
print(li[2:]) # Print element 2 and everything after that
print(li[:-1]) # Print everything BEFORE element -1 (the last one)
With NumPy arrays, all the same functionality you know and love from lists is still there.
In [19]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
print(x)
print(x[1:3]) # Print element 1 (inclusive) to 3 (exclusive)
print(x[2:]) # Print element 2 and everything after that
print(x[:-1]) # Print everything BEFORE element -1 (the last one)
These operations all work whether you're using Python lists or NumPy arrays.
The first place in which Python lists and NumPy arrays differ is when we get to multidimensional arrays. We'll start with matrices.
To build matrices using Python lists, you basically need "nested" lists: a list containing other lists.
In [20]:
python_matrix = [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ]
print(python_matrix)
To build the NumPy equivalent, you can basically just feed the Python list-matrix into the NumPy array() method:
In [21]:
numpy_matrix = np.array(python_matrix)
print(numpy_matrix)
The real difference, though, comes with actually indexing these elements. With Python lists, you can index individual elements only in this way:
In [22]:
print(python_matrix) # The full list-of-lists
print(python_matrix[0]) # The inner-list at the 0th position of the outer-list
print(python_matrix[0][0]) # The 0th element of the 0th inner-list
With NumPy arrays, you can use that same notation...or you can use comma-separated indices (this may be more familiar to Matlab and R users):
In [23]:
print(numpy_matrix)
print(numpy_matrix[0])
print(numpy_matrix[0, 0]) # Note the comma-separated format!
It's not earth-shattering, but enough to warrant a heads-up.
When you index NumPy arrays, the nomenclature used is that of an axis: you are indexing specific axes of a NumPy array object. In particular, when you access the .shape attribute on a NumPy array, that tells you two things:
1: How many axes there are. This number is len(ndarray.shape), or the number of elements in the tuple returned by .shape. In our previous example, numpy_matrix.shape would return (3, 3), so it would have 2 axes.
2: How many elements are in each axis. In our above example, where numpy_matrix.shape returns (3, 3), there are 2 axes (since the length of that tuple is 2), and both axes have 3 elements (hence the two 3s).
Here's the breakdown of axis notation and indices used in a 2D NumPy array:
As with lists, if you want an entire axis, just use the colon operator all by itself:
In [24]:
x = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
print(x)
print()
print(x[:, 1]) # Take ALL of axis 0, and one index of axis 1.
Here's a great visual aid of slicing NumPy arrays, assuming you're starting from an array with shape (3, 3):
Depending on your field, it's entirely possible that you'll go beyond 2D matrices. If so, it's important to be able to recognize what these structures "look" like.
For example, a video can be thought of as a 3D cube. Put another way, it's a NumPy array with 3 axes: the first axis is height, the second axis is width, and the third axis is number of frames.
In [25]:
video = np.empty(shape = (1920, 1080, 5000))
print("Axis 0 (frame height) : {}".format(video.shape[0])) # How many rows?
print("Axis 1 (frame width): {}".format(video.shape[1])) # How many columns?
print("Axis 2 (number of frames): {}".format(video.shape[2])) # How many frames?
We know video is 3D because we can also access its ndim attribute.
In [26]:
print(video.ndim)
In [27]:
del video
Another example: 3D video microscope data of multiple tagged fluorescent markers. This would result in a five-axis NumPy object:
In [28]:
tensor = np.empty(shape = (3, 640, 480, 360, 100))
print(tensor.shape)
We can also ask how many elements there are total, using the size attribute:
In [29]:
print(tensor.size)
Can you explain this number?
In [30]:
del tensor
These are admittedly extreme examples, but they're to illustrate how flexible NumPy arrays are.
If in doubt: once you index the first axis, the NumPy array you get back has the shape of all the remaining axes.
Put another way: when you index an axis directly, that axis essentially "drops out", and you're left with an array that has all the remaining axes you didn't index.
In [31]:
example = np.empty(shape = (3, 5, 9))
print(example.shape)
sliced = example[0] # Indexed the first axis.
What is sliced.shape?
In [32]:
print(sliced.shape)
In [33]:
sliced_again = example[0, 0] # Indexed the first and second axes.
What is sliced_again.shape?
In [34]:
print(sliced_again.shape)
In [35]:
sliced_finally = example[0, 0, 0]
What is sliced_finally.shape?
Trick question! I've indexed all three axes, so the value I get back is no longer a NumPy array, but rather the value type I've filled the array with.
In [36]:
type(sliced_finally)
Out[36]:
numpy.float64
In [37]:
print(sliced_finally)
"Broadcasting" is a fancy term for how NumPy handles vectorized operations when arrays of differing shapes are involved. (this is, in some sense, "how the sausage is made")
When you write code like this:
In [38]:
x = np.array([1, 2, 3, 4, 5])
x += 10
print(x)
how does Python know that you want to add the scalar value 10 to each element of the vector x? Because broadcasting!
Broadcasting is the operation through which a low(er)-dimensional array is in some way "replicated" to be the same shape as a high(er)-dimensional array.
Put another way: Python will internally recognize that two NumPy arrays are of different shapes, but will nonetheless attempt to invisibly and temporarily "reshape" them so that the operation the programmer (you) wrote can still happen.
We saw this in our previous example: the low-dimensional scalar "10" was replicated, or broadcast, to each element of the array x
so that the addition operation could be performed individually on each element of the array.
This concept can be generalized to higher-dimensional NumPy arrays.
In [39]:
zeros = np.zeros(shape = (3, 4)) # A 3-by-4 matrix of zeros.
ones = 1
zeros += ones
print(zeros)
In this example, the scalar value 1 is broadcast to all the elements of the NumPy array zeros
, converting the operation to element-wise addition.
This all happens under the NumPy/Python hood--we don't see it! It "just works"...most of the time.
There are some rules that broadcasting abides by. Essentially, dimensions of arrays need to be "compatible" in order for broadcasting to work. "Compatible" is defined as follows: comparing the two shapes from the trailing (rightmost) dimension, two dimensions are compatible when they are equal, or when one of them is 1.
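A small sketch of these compatibility rules in action, using a column vector and a row vector:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # Shape (3, 1)
row = np.arange(4).reshape(1, 4)   # Shape (1, 4)

# Comparing shapes from the trailing dimension: 1 vs 4 (one is 1: compatible),
# then 3 vs 1 (one is 1: compatible). The result has shape (3, 4).
table = col + row
print(table.shape)
print(table)  # table[i, j] == i + j
```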
If these rules aren't met, you get all kinds of strange errors:
In [40]:
x = np.zeros(shape = (3, 3)) # A 3-by-3 matrix of zeros.
y = np.ones(4) # A 1D array of four elements (all of them 1s).
x + y
On some intuitive level, this hopefully makes sense: there's no reasonable arithmetic operation that can be performed when you have one $3 \times 3$ matrix and a vector of length 4.
To be rigorous: it's the trailing dimensions / axes that you want to make sure line up.
Recall how matrix-matrix multiplication works: the inner dimensions have to match for the multiplication to work at all.
In [41]:
x = np.zeros(shape = (3, 4))
y = np.array([1, 2, 3, 4])
z = x + y
print(z)
In this example, the shape of x is (3, 4). The shape of y is just (4,). Their trailing axes are both 4, therefore the "smaller" array will be broadcast to fit the size of the larger array, and the operation (addition, in this case) is performed element-wise.
Hopefully, the rules of indexing and broadcasting have made sense on some level so far.
Unfortunately, it gets still more complicated. These complications, however, are ultimately there to make life easier.
We've already seen that you can index by slicing. Using the colon operator, you can even specify ranges, slicing out entire swaths of rows and columns.
But suppose we want something very specific; data in our array which satisfies certain criteria, as opposed to data which is found at certain indices.
Put another way: can we pull data out of an array that meets certain conditions?
Let's say you have some toy data:
In [42]:
x = np.random.standard_normal(size = (7, 4))
print(x)
This is randomly generated data, yes, but it could easily be 7 data points in 4 dimensions. That is, we have 7 observations of variables with 4 descriptors.
It could be, for instance, gene expression levels: 7 genes, each measured under 4 conditions.
Whatever our data, a common first step before any analysis involves some kind of preprocessing. In this case, if the example we're looking at is the gene expression level from the previous slide, then perhaps we know that any negative values are recording errors.
So our first course of action might be to set all negative numbers in the data to 0. We could potentially set up a pair of loops to go through each element of the matrix, but it's much easier (and faster) to use boolean indexing.
First, we create a mask. This is what it sounds like: it "masks" certain portions of the data, picking out only those numbers that meet the condition of the mask.
In [43]:
mask = x < 0 # For every element of x, ask: is it < 0?
print(mask)
Now, we can use our mask to access only the indices we want to set to 0.
In [44]:
x[mask] = 0
print(x)
Voilà! Every negative number has been set to 0, and all the other values were left unchanged. Now we can continue with whatever analysis we may have had in mind.
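As a compact recap of the masking pattern (with made-up data), note that you can also count how many elements matched the condition:

```python
import numpy as np

data = np.array([-2.0, 3.5, -0.5, 7.0, 1.2])
mask = data < 0
print(np.sum(mask))  # True counts as 1, so this counts the negatives: 2
data[mask] = 0
print(data)          # The two negative entries are now 0; everything else is unchanged.
```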
One small caveat with boolean indexing: the Python keywords and and or DO NOT WORK. You have to use the bitwise versions of the operators: & (for and) and | (for or).
In [45]:
mask = (x < 1) & (x > 0.5) # True for any value less than 1 but greater than 0.5
x[mask] = 99
print(x)
"Fancy" indexing is a term coined by the NumPy community to refer to this little indexing trick. To explain is simple enough: fancy indexing allows you to index arrays with other [integer] arrays.
Now, to demonstrate. Let's build a 2D array that, for the sake of simplicity, has across each row the index of that row.
In [46]:
import numpy as np
matrix = np.empty(shape = (8, 4))
for i in range(8):
    matrix[i] = i # Broadcasting is happening here!
print(matrix)
We have 8 rows and 4 columns, where each row is a vector of the same value repeated across the columns, and that value is the index of the row.
In addition to using regular integer indices, and masks to perform boolean indexing, we can also use other NumPy arrays to very selectively pick and choose what elements we want, and even the order in which we want them.
First, let's say I want the first three rows. Well, we already know how to do that with slicing:
In [47]:
print(matrix[0:3])
Now, I want the first three even-numbered rows. You could do this with a loop, but it might be easier with fancy indexing:
In [48]:
indices = np.array([0, 2, 4])
print(matrix[indices])
See how easy that is?
Now, let's say I want rows 7, 0, 5, and 2.
In that order!
In [49]:
indices = np.array([7, 0, 5, 2])
print(matrix[indices])
Yep, the order in which I list the integers in the indices array is the ordering in which I get them back. Very convenient for retrieving specific data in a specific order!
But wait, there's more! Rather than just specifying one dimension, you can provide tuples of NumPy arrays that very explicitly pick out certain elements (in a certain order) from another NumPy array.
(bear with me, I promise this is as bad as it gets)
In [50]:
matrix = np.arange(32).reshape((8, 4))
print(matrix) # This 8x4 matrix has integer elements that increment by 1 column-wise, then row-wise.
In [51]:
indices = ( np.array([1, 7, 4]), np.array([3, 0, 1]) ) # This is a tuple of 2 NumPy arrays!
print(matrix[indices])
Let's step through this slowly.
When you pass in tuples as indices, they act as $(x, y)$ coordinate pairs: the first NumPy array of the tuple is the list of $x$ coordinates, while the second NumPy array is the list of corresponding $y$ coordinates.
In this way, the corresponding elements of the two NumPy arrays in the tuple give you the row and column indices to be selected from the original NumPy array.
In our previous example, this was our tuple of indices:
In [52]:
( np.array([1, 7, 4]), np.array([3, 0, 1]) )
Out[52]:
(array([1, 7, 4]), array([3, 0, 1]))
The $x$ coordinates are in array([1, 7, 4]), and the $y$ coordinates are in array([3, 0, 1]). More concretely:

(1, 3)--this is the 7 that was printed!
(7, 0)--this is the 28 that followed.
(4, 1)--this corresponds to the 17!

So if you're having trouble with the material, please contact me! You can also post about it in Slack; that's specifically what I created the #questions channel for--everyone to ask and answer questions and discuss the material, and I'll jump in when I can and when I'm needed.