Before we get to some of the more advanced sequence analysis techniques, we need to cover some critical concepts using advanced Python data structures. In this lecture, we'll see how we can use an external library to make these computations much easier and much faster. By the end of this lecture, you should be able to:
NumPy, or Numerical Python, is an incredible library of basic functions and data structures that provide a robust foundation for computational scientists.
Put another way: if you're using Python and doing any kind of math, you'll probably use NumPy.
At this point, NumPy is so deeply embedded in so many other 3rd-party modules related to scientific computing that even if you're not making explicit use of it, at least one of the other modules you're using probably is.
In [1]:
matrix = [ [1, 2, 3],
           [4, 5, 6],
           [7, 8, 9] ]
print(matrix)
Indexing would still work as you would expect, but looping through a matrix--say, to do matrix multiplication--would be laborious and highly inefficient.
We'll demonstrate this experimentally later, but suffice it to say that Python lists embody the drawbacks of using an interpreted language such as Python: they're easy to use, but oh so slow.
By contrast, in NumPy we have the ndarray structure (short for "n-dimensional array"), a highly optimized version of Python lists, perfect for fast and efficient computations. To make use of NumPy arrays, import NumPy (it's installed by default in Anaconda, and on JupyterHub):
In [2]:
import numpy
Now just call the array method using our list from before!
In [3]:
arr = numpy.array(matrix)
print(arr)
To reference an element in the array, just use the same notation we did for lists:
In [4]:
arr[0]
Out[4]:
array([1, 2, 3])
In [5]:
arr[2][2]
Out[5]:
9
You can also separate dimensions by commas:
In [6]:
arr[2, 2]
Out[6]:
9
Remember, with indexing matrices: the first index is the row, the second index is the column.
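To make that row-then-column convention concrete, here's a minimal sketch using the same 3-by-3 matrix:

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print(arr[0, 1])  # Row 0, column 1: the value 2.
print(arr[1, 0])  # Row 1, column 0: the value 4.
```

Swapping the two indices gives you a different element, so keep the row-first ordering in mind.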
NumPy has an impressive array of utility modules that come along with it, optimized to use its ndarray data structure. I highly encourage you to use them, even if you're not using NumPy arrays.
1: Basic mathematical routines
All the core functions you could want; for example, all the built-in Python math routines (trig, logs, exponents, etc.) have NumPy versions (numpy.sin, numpy.cos, numpy.log, numpy.exp, numpy.max, numpy.min).
In [7]:
a = numpy.array([45, 2, 59, -2, 70, 3, 6, 790])
print("Minimum: {}".format(numpy.min(a)))
print("Cosine of 1st element: {:.2f}".format(numpy.cos(a[0])))
2: Fourier transforms
If you do any signal processing using Fourier transforms (which we might, later!), NumPy has an entire sub-module full of tools for this type of analysis in numpy.fft.
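As a quick sketch of what numpy.fft can do (the 4 Hz signal and 64 Hz sampling rate here are made up for illustration):

```python
import numpy as np

# A toy signal: a 4 Hz sine wave sampled at 64 Hz for one second.
t = np.arange(64) / 64.0
signal = np.sin(2 * np.pi * 4.0 * t)

spectrum = np.fft.fft(signal)                  # Complex frequency-domain coefficients.
freqs = np.fft.fftfreq(len(signal), d=1/64.0)  # The frequency each coefficient corresponds to.

# The dominant non-negative frequency should be our 4 Hz component.
peak = freqs[np.argmax(np.abs(spectrum[:32]))]
print(peak)  # 4.0
```

The transform correctly recovers the frequency we put into the signal.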
3: Linear algebra
This is most of your vector and matrix linear algebra operations, from vector norms (numpy.linalg.norm) to singular value decomposition (numpy.linalg.svd) to matrix determinants (numpy.linalg.det).
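A brief sketch of those three routines in action (the small vector and matrix here are arbitrary):

```python
import numpy as np
import numpy.linalg as nla

v = np.array([3.0, 4.0])
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

print(nla.norm(v))   # Vector 2-norm: sqrt(3^2 + 4^2) = 5.0
print(nla.det(A))    # Determinant: 3*5 - 0*4 = 15 (up to floating-point noise)
U, S, Vt = nla.svd(A)
print(S)             # The two singular values, largest first.
```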
4: Random numbers
NumPy has a phenomenal random number library in numpy.random. In addition to generating uniform random numbers in a certain range, you can also sample from any known parametric distribution.
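For instance, here's a sketch of sampling from a parametric (normal) distribution; the sample size is arbitrary:

```python
import numpy as np

samples = np.random.normal(loc=0.0, scale=1.0, size=100000)  # 100,000 standard-normal draws.
print("{:.2f}".format(samples.mean()))  # Should be close to 0.
print("{:.2f}".format(samples.std()))   # Should be close to 1.
```

With this many samples, the empirical mean and standard deviation land very close to the distribution's true parameters.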
In [8]:
print(numpy.random.randint(10)) # Random integer between 0 (inclusive) and 10 (exclusive)
print(numpy.random.randint(10)) # Another one!
print(numpy.random.randint(10)) # Yet another one!
"Vectorized arithmetic" refers to how NumPy allows you to efficiently perform arithmetic operations on entire NumPy arrays at once, as you would with "regular" Python variables.
For example: let's say you have a vector and you want to normalize it to be unit length; that involves dividing every element in the vector by a constant (the magnitude of the vector). With lists, you'd have to loop through them manually.
In [9]:
vector = [4.0, 15.0, 6.0, 2.0]

# To normalize this to unit length, we need to divide each element by the vector's magnitude.
# To learn its magnitude, we need to loop through the whole vector.
# So: we need two loops!
magnitude = 0.0
for element in vector:
    magnitude += element ** 2
magnitude = magnitude ** 0.5  # square root
print("Original magnitude: {:.2f}".format(magnitude))
In [10]:
new_magnitude = 0.0
for i in range(len(vector)):
    element = vector[i]
    normalized = element / magnitude
    vector[i] = normalized
    new_magnitude += normalized ** 2
new_magnitude = new_magnitude ** 0.5
print("Normalized magnitude: {:.2f}".format(new_magnitude))
Now, let's see the same operation, this time with NumPy arrays.
In [11]:
import numpy as np # This tends to be the "standard" convention when importing NumPy.
import numpy.linalg as nla
vector = [4.0, 15.0, 6.0, 2.0]
np_vector = np.array(vector) # Convert to NumPy array.
magnitude = nla.norm(np_vector) # Computing the magnitude: one-liner.
print("Original magnitude: {:.2f}".format(magnitude))
np_vector /= magnitude # Vectorized division!!! No loop needed!
new_magnitude = nla.norm(np_vector)
print("Normalized magnitude: {:.2f}".format(new_magnitude))
No loops needed, far fewer lines of code, and a simple intuitive operation.
Operations involving arrays on both sides of the operator will also work (though the two arrays generally need to be the same shape).
For example, adding two vectors together:
In [12]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = x + y
print(z)
Works exactly as you'd expect, but no [explicit] loop needed.
This becomes particularly compelling with matrix multiplication. Say you have two matrices, $A$ and $B$:
In [13]:
A = np.array([ [1, 2], [3, 4] ])
B = np.array([ [5, 6], [7, 8] ])
If you recall from linear algebra, matrix multiplication $A \times B$ involves multiplying each row of $A$ by each column of $B$. But rather than write that code yourself, Python (as of version 3.5) gives us a dedicated matrix multiplication operator: the @ symbol!
In [14]:
A @ B
Out[14]:
array([[19, 22],
       [43, 50]])
In almost every case, vectorized operations are far more efficient than loops written in Python to do the same thing.
In [15]:
def multiply_loops(A, B):
    C = np.zeros((A.shape[0], B.shape[1]))
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            for k in range(A.shape[1]):
                C[i, j] += A[i, k] * B[k, j]
    return C
In [16]:
def multiply_vector(A, B):
    return A @ B
In [17]:
X = np.random.random((100, 100))
Y = np.random.random((100, 100))
%timeit multiply_loops(X, Y)
%timeit multiply_vector(X, Y)
If you're implementing loops in conjunction with arrays, see if there's any way to use vectorized operations instead.
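For example, a loop-based sum of squares collapses into a single vectorized expression (a sketch with made-up data):

```python
import numpy as np

x = np.arange(1, 1001)  # The integers 1 through 1000.

# Loop version: accumulate the sum of squares one element at a time.
total = 0
for value in x:
    total += value ** 2

# Vectorized version: square the whole array at once, then sum it.
vectorized_total = np.sum(x ** 2)

print(total == vectorized_total)  # Same answer, no explicit loop.
```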
If you have a regular Python list, you can use numpy.array() to convert it. Adjusting the length of the NumPy array after it's constructed is more difficult than with a standard list. Some functionality from the standard library just needs import statements. There is even more functionality from 3rd-party vendors, but it needs to be installed before it can be imported. NumPy falls in this lattermost category.

Hopefully, you recall basic indexing and slicing from Lecture 4.
In [18]:
li = [1, 2, 3, 4, 5]
print(li)
print(li[1:3]) # Print element 1 (inclusive) to 3 (exclusive)
print(li[2:]) # Print element 2 and everything after that
print(li[:-1]) # Print everything BEFORE element -1 (the last one)
With NumPy arrays, all the same functionality you know and love from lists is still there.
In [19]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
print(x)
print(x[1:3]) # Print element 1 (inclusive) to 3 (exclusive)
print(x[2:]) # Print element 2 and everything after that
print(x[:-1]) # Print everything BEFORE element -1 (the last one)
These operations all work whether you're using Python lists or NumPy arrays.
The first place in which Python lists and NumPy arrays differ is when we get to multidimensional arrays. We'll start with matrices.
To build matrices using Python lists, you basically need "nested" lists: a list containing other lists.
In [20]:
python_matrix = [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ]
print(python_matrix)
To build the NumPy equivalent, you can basically just feed the Python list-matrix into the NumPy array() method:
In [21]:
numpy_matrix = np.array(python_matrix)
print(numpy_matrix)
The real difference, though, comes with actually indexing these elements. With Python lists, you can index individual elements only in this way:
In [22]:
print(python_matrix) # The full list-of-lists
print(python_matrix[0]) # The inner-list at the 0th position of the outer-list
print(python_matrix[0][0]) # The 0th element of the 0th inner-list
With NumPy arrays, you can use that same notation...or you can use comma-separated indices (this may be more familiar to Matlab and R users):
In [23]:
print(numpy_matrix)
print(numpy_matrix[0])
print(numpy_matrix[0, 0]) # Note the comma-separated format!
It's not earth-shattering, but enough to warrant a heads-up.
When you index NumPy arrays, the nomenclature used is that of an axis: you are indexing specific axes of a NumPy array object. In particular, when you access the .shape attribute on a NumPy array, that tells you two things:
1: How many axes there are. This number is len(ndarray.shape), or the number of elements in the tuple returned by .shape. In our previous example, numpy_matrix.shape would return (3, 3), so it would have 2 axes.
2: How many elements are in each axis. In our above example, where numpy_matrix.shape returns (3, 3), there are 2 axes (since the length of that tuple is 2), and both axes have 3 elements (hence the two 3s).
Here's the breakdown of axis notation and indices used in a 2D NumPy array:
As with lists, if you want an entire axis, just use the colon operator all by itself:
In [24]:
x = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
print(x)
print()
print(x[:, 1]) # Take ALL of axis 0, and one index of axis 1.
Here's a great visual aid of slicing NumPy arrays, assuming you're starting from an array with shape (3, 3):
Depending on your field, it's entirely possible that you'll go beyond 2D matrices. If so, it's important to be able to recognize what these structures "look" like.
For example, a video can be thought of as a 3D cube. Put another way, it's a NumPy array with 3 axes: the first axis is height, the second axis is width, and the third axis is number of frames.
In [25]:
video = np.empty(shape = (1920, 1080, 5000))
print("Axis 0 (frame height) : {}".format(video.shape[0])) # How many rows?
print("Axis 1 (frame width): {}".format(video.shape[1])) # How many columns?
print("Axis 2 (number of frames): {}".format(video.shape[2])) # How many frames?
We know video is 3D because we can also access its ndim attribute.
In [26]:
print(video.ndim)
In [27]:
del video
Another example: 3D video microscope data of multiple tagged fluorescent markers. This would result in a five-axis NumPy object:
In [28]:
tensor = np.empty(shape = (3, 640, 480, 360, 100))
print(tensor.shape)
We can also ask how many elements there are total, using the size attribute:
In [29]:
print(tensor.size)
Can you explain this number?
In [30]:
del tensor
These are admittedly extreme examples, but they're to illustrate how flexible NumPy arrays are.
If in doubt: once you index the first axis, the NumPy array you get back has the shape of all the remaining axes.
Put another way: when you index an axis directly, that axis essentially "drops out", and you're left with an array that has all the remaining axes you didn't index.
In [31]:
example = np.empty(shape = (3, 5, 9))
print(example.shape)
sliced = example[0] # Indexed the first axis.
What is sliced.shape?
In [32]:
print(sliced.shape)
In [33]:
sliced_again = example[0, 0] # Indexed the first and second axes.
What is sliced_again.shape?
In [34]:
print(sliced_again.shape)
In [35]:
sliced_finally = example[0, 0, 0]
What is sliced_finally.shape?
Trick question! I've indexed all three axes, so the value I get back is no longer a NumPy array, but rather the value type I've filled the array with.
In [36]:
type(sliced_finally)
Out[36]:
numpy.float64
In [37]:
print(sliced_finally)
"Broadcasting" is a fancy term for how NumPy handles vectorized operations when arrays of differing shapes are involved. (this is, in some sense, "how the sausage is made")
When you write code like this:
In [38]:
x = np.array([1, 2, 3, 4, 5])
x += 10
print(x)
how does Python know that you want to add the scalar value 10 to each element of the vector x? Because broadcasting!
Broadcasting is the operation through which a low(er)-dimensional array is in some way "replicated" to be the same shape as a high(er)-dimensional array.
Put another way: Python will internally recognize that two NumPy arrays are of different shapes, but will nonetheless attempt to invisibly and temporarily "reshape" them so that the operation the programmer (you) wrote can still happen.
We saw this in our previous example: the low-dimensional scalar "10" was replicated, or broadcast, to each element of the array x
so that the addition operation could be performed individually on each element of the array.
This concept can be generalized to higher-dimensional NumPy arrays.
In [39]:
zeros = np.zeros(shape = (3, 4)) # A 3-by-4 matrix of zeros.
ones = 1
zeros += ones
print(zeros)
In this example, the scalar value 1 is broadcast to all the elements of the NumPy array zeros
, converting the operation to element-wise addition.
This all happens under the NumPy/Python hood--we don't see it! It "just works"...most of the time.
There are some rules that broadcasting abides by. Essentially, dimensions of arrays need to be "compatible" in order for broadcasting to work. "Compatible" is defined as follows: comparing the two shapes from the trailing (rightmost) dimension, two dimensions are compatible when they are equal, or when one of them is 1.
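A small sketch of these compatibility rules in action, using a column vector and a row vector:

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # Shape (3, 1)
row = np.arange(4).reshape(1, 4)   # Shape (1, 4)

# Comparing shapes from the trailing dimension: 1 vs 4 (one is 1: compatible),
# then 3 vs 1 (one is 1: compatible). The result has shape (3, 4).
table = col + row
print(table.shape)
print(table)  # table[i, j] == i + j
```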
If these rules aren't met, you get all kinds of strange errors:
In [40]:
x = np.zeros(shape = (3, 3)) # A 3-by-3 matrix of zeros.
y = np.ones(4) # A 1D array of four elements (all of them 1s).
x + y
On some intuitive level, this hopefully makes sense: there's no reasonable arithmetic operation that can be performed when you have one $3 \times 3$ matrix and a vector of length 4.
To be rigorous: it's the trailing dimensions / axes that you want to make sure line up.
Recall how matrix-matrix multiplication works: the inner dimensions have to match for the multiplication to work at all.
In [41]:
x = np.zeros(shape = (3, 4))
y = np.array([1, 2, 3, 4])
z = x + y
print(z)
In this example, the shape of x is (3, 4). The shape of y is just (4,). Their trailing axes are both 4, therefore the "smaller" array will be broadcast to fit the size of the larger array, and the operation (addition, in this case) is performed element-wise.
Hopefully, the rules of indexing and broadcasting have made sense on some level so far.
Unfortunately, it gets still more complicated. These complications, however, are ultimately there to make life easier.
We've already seen that you can index by slicing. Using the colon operator, you can even specify ranges, slicing out entire swaths of rows and columns.
But suppose we want something very specific; data in our array which satisfies certain criteria, as opposed to data which is found at certain indices.
Put another way: can we pull data out of an array that meets certain conditions?
Let's say you have some toy data:
In [42]:
x = np.random.standard_normal(size = (7, 4))
print(x)
This is randomly generated data, yes, but it could easily be 7 data points in 4 dimensions. That is, we have 7 observations of variables with 4 descriptors.
It could be, for instance, gene expression levels: 7 genes, each measured under 4 conditions.
Whatever our data, a common first step before any analysis involves some kind of preprocessing. In this case, if the example we're looking at is the gene expression level from the previous slide, then perhaps we know that any negative values are recording errors.
So our first course of action might be to set all negative numbers in the data to 0. We could potentially set up a pair of loops to go through each element of the matrix, but it's much easier (and faster) to use boolean indexing.
First, we create a mask. This is what it sounds like: it "masks" certain portions of the data, picking out only those numbers that meet the condition of the mask.
In [43]:
mask = x < 0 # For every element of x, ask: is it < 0?
print(mask)
Now, we can use our mask to access only the indices we want to set to 0.
In [44]:
x[mask] = 0
print(x)
Voilà! Every negative number has been set to 0, and all the other values were left unchanged. Now we can continue with whatever analysis we may have had in mind.
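As a compact recap of the masking pattern (with made-up data), note that you can also count how many elements matched the condition:

```python
import numpy as np

data = np.array([-2.0, 3.5, -0.5, 7.0, 1.2])
mask = data < 0
print(np.sum(mask))  # True counts as 1, so this counts the negatives: 2
data[mask] = 0
print(data)          # The two negative entries are now 0; everything else is unchanged.
```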
One small caveat with boolean indexing: the Python keywords and and or DO NOT WORK. You have to use the bitwise versions of the operators: & (for and) and | (for or).
In [45]:
mask = (x < 1) & (x > 0.5) # True for any value less than 1 but greater than 0.5
x[mask] = 99
print(x)
"Fancy" indexing is a term coined by the NumPy community to refer to this little indexing trick. To explain is simple enough: fancy indexing allows you to index arrays with other [integer] arrays.
Now, to demonstrate. Let's build a 2D array that, for the sake of simplicity, has across each row the index of that row.
In [46]:
import numpy as np
matrix = np.empty(shape = (8, 4))
for i in range(8):
    matrix[i] = i # Broadcasting is happening here!
print(matrix)
We have 8 rows and 4 columns, where each row is a vector of the same value repeated across the columns, and that value is the index of the row.
In addition to using regular integer indices, and masks to perform boolean indexing, we can also use other NumPy arrays to very selectively pick and choose what elements we want, and even the order in which we want them.
First, let's say I want the first three rows. Well, we already know how to do that with slicing:
In [47]:
print(matrix[0:3])
Now, I want the first three even-numbered rows. You could do this with a loop, but it might be easier with fancy indexing:
In [48]:
indices = np.array([0, 2, 4])
print(matrix[indices])
See how easy that is?
Now, let's say I want rows 7, 0, 5, and 2.
In that order!
In [49]:
indices = np.array([7, 0, 5, 2])
print(matrix[indices])
Yep, the order in which I list the integers in the indices array is the ordering in which I get them back. Very convenient for retrieving specific data in a specific order!
But wait, there's more! Rather than just specifying one dimension, you can provide tuples of NumPy arrays that very explicitly pick out certain elements (in a certain order) from another NumPy array.
(bear with me, I promise this is as bad as it gets)
In [50]:
matrix = np.arange(32).reshape((8, 4))
print(matrix) # This 8x4 matrix has integer elements that increment by 1 column-wise, then row-wise.
In [51]:
indices = ( np.array([1, 7, 4]), np.array([3, 0, 1]) ) # This is a tuple of 2 NumPy arrays!
print(matrix[indices])
Let's step through this slowly.
When you pass in tuples as indices, they act as $(x, y)$ coordinate pairs: the first NumPy array of the tuple is the list of $x$ coordinates, while the second NumPy array is the list of corresponding $y$ coordinates.
In this way, the corresponding elements of the two NumPy arrays in the tuple give you the row and column indices to be selected from the original NumPy array.
In our previous example, this was our tuple of indices:
In [52]:
( np.array([1, 7, 4]), np.array([3, 0, 1]) )
Out[52]:
(array([1, 7, 4]), array([3, 0, 1]))
The $x$ coordinates are in array([1, 7, 4]), and the $y$ coordinates are in array([3, 0, 1]). More concretely:

(1, 3)--this is the 7 that was printed!
(7, 0)--this is the 28 that followed.
(4, 1)--this corresponds to the 17!

So if you're having trouble with the material, please contact me! You can also post about it in Slack; that's specifically what I created the #questions channel for--everyone to ask and answer questions and discuss the material, and I'll jump in when I can and when I'm needed.