Most of this lecture will be a review of basic indexing and slicing operations, albeit within the context of NumPy arrays. That said, NumPy adds some functionality that is critical to understand. By the end of this lecture, you should be able to
In [1]:
li = ["this", "is", "a", "list"]
print(li)
print(li[1:3]) # Print element 1 (inclusive) to 3 (exclusive)
print(li[2:]) # Print element 2 and everything after that
print(li[:-1]) # Print everything BEFORE element -1 (the last one)
With NumPy arrays, all the same functionality you know and love from lists is still there.
In [2]:
import numpy as np
x = np.array([1, 2, 3, 4, 5])
print(x)
print(x[1:3])
print(x[2:])
print(x[:-1])
These operations all work whether you're using Python lists or NumPy arrays.
The first place in which Python lists and NumPy arrays differ is when we get to multidimensional arrays. We'll start with matrices.
To build matrices using Python lists, you basically needed "nested" lists, or a list containing lists:
In [3]:
python_matrix = [ [1, 2, 3], [4, 5, 6], [7, 8, 9] ]
print(python_matrix)
To build the NumPy equivalent, you can just pass the Python list-matrix to the NumPy array function:
In [4]:
numpy_matrix = np.array(python_matrix)
print(numpy_matrix)
The real difference, though, comes with actually indexing these elements. With Python lists, you can index individual elements only in this way:
In [5]:
print(python_matrix) # The full list-of-lists
print(python_matrix[0]) # The inner-list at the 0th position of the outer-list
print(python_matrix[0][0]) # The 0th element of the 0th inner-list
With NumPy arrays, you can use that same notation...or you can use comma-separated indices:
In [6]:
print(numpy_matrix)
print(numpy_matrix[0])
print(numpy_matrix[0, 0]) # Note the comma-separated format!
It's not earth-shattering, but enough to warrant a heads-up.
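As a small sketch of the difference (the variable names chained, comma, and list_error are just for illustration), the comma-separated form picks out the same element as chained brackets on a NumPy array, while a plain Python list rejects it:

```python
import numpy as np

python_matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
numpy_matrix = np.array(python_matrix)

# Chained and comma-separated indexing pick out the same element of the array.
chained = numpy_matrix[1][2]
comma = numpy_matrix[1, 2]
print(chained, comma)

# A plain list does not understand a tuple of indices, so this raises TypeError.
try:
    python_matrix[1, 2]
    list_error = None
except TypeError as err:
    list_error = err
print("List indexing failed:", list_error)
```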
When you index NumPy arrays, the nomenclature used is that of an axis: you are indexing specific axes of a NumPy array object. In particular, the .shape attribute of a NumPy array tells you two things:
1: How many axes there are. This number is len(ndarray.shape), or the number of elements in the tuple stored in .shape. In our above example, numpy_matrix.shape would return (3, 3), so it would have 2 axes.
2: How many elements are in each axis. In our above example, where numpy_matrix.shape returns (3, 3), there are 2 axes (since the length of that tuple is 2), and both axes have 3 elements (hence the numbers 3).
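To make the two points concrete, here is a minimal sketch using the same 3x3 matrix; the names shape and num_axes are only illustrative. NumPy also exposes the axis count directly as .ndim.

```python
import numpy as np

numpy_matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

shape = numpy_matrix.shape   # a tuple with one entry per axis
num_axes = len(shape)        # the number of axes
print(shape, num_axes)
print(numpy_matrix.ndim)     # the same axis count, stored on the array itself
```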
Here's the breakdown of axis notation and indices used in a 2D NumPy array:
As with lists, if you want an entire axis, just use the colon operator all by itself:
In [7]:
x = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9] ])
print(x)
print()
print(x[:, 1]) # Take ALL of axis 0, and one index of axis 1.
Here's a great visual summary of slicing NumPy arrays, assuming you're starting from an array with shape (3, 3):
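A few common slices of a (3, 3) array, written out as a short sketch (the variable names are chosen here for illustration):

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

first_row = x[0, :]       # one index on axis 0, all of axis 1
last_column = x[:, -1]    # all of axis 0, the last index on axis 1
top_left = x[:2, :2]      # the first two rows and the first two columns
print(first_row)
print(last_column)
print(top_left)
```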
Depending on your field, it's entirely possible that you'll go beyond 2D matrices. If so, it's important to be able to recognize what these structures "look" like.
For example, a video can be thought of as a 3D cube. Put another way, it's a NumPy array with 3 axes: the first axis is height, the second axis is width, and the third axis is number of frames.
In [8]:
video = np.empty(shape = (1920, 1080, 5000))
print("Axis 0 length: {}".format(video.shape[0])) # How many rows?
print("Axis 1 length: {}".format(video.shape[1])) # How many columns?
print("Axis 2 length: {}".format(video.shape[2])) # How many frames?
del video
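With that axis layout, pulling out individual frames is a matter of slicing the third axis. The sketch below uses a deliberately tiny, hypothetical "video" (4 rows, 6 columns, 10 frames) so that it runs instantly:

```python
import numpy as np

# Hypothetical, tiny "video": 4 rows, 6 columns, 10 frames.
small_video = np.zeros(shape=(4, 6, 10))

first_frame = small_video[:, :, 0]     # all rows, all columns, frame 0
first_three = small_video[:, :, :3]    # the same, but frames 0, 1, and 2
print(first_frame.shape)
print(first_three.shape)
```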
Another example--to go straight to cutting-edge academic research--is 3D video microscope data of multiple tagged fluorescent markers. This would result in a five-axis NumPy object:
In [9]:
tensor = np.empty(shape = (2, 640, 480, 360, 100))
print(tensor.shape)
del tensor
# Axis 0: color channel--used to differentiate between fluorescent markers
# Axis 1: height--same as before
# Axis 2: width--same as before
# Axis 3: depth--capturing 3D depth at each time interval, like a 3D movie
# Axis 4: frame--same as before
These are extreme examples, but they illustrate how flexible NumPy arrays are.
If in doubt: once you index the first axis, the NumPy array you get back has the shape of all the remaining axes.
In [10]:
example = np.empty(shape = (3, 5, 9))
print(example.shape)
sliced = example[0] # Indexed the first axis.
print(sliced.shape)
sliced_again = example[0, 0] # Indexed the first and second axes.
print(sliced_again.shape)
When you write code like this:
In [11]:
x = np.array([1, 2, 3, 4, 5])
x += 10
print(x)
how does Python know that you want to add the scalar value 10 to each element of the vector x? Because (in a word) broadcasting.
Broadcasting is the operation through which a low(er)-dimensional array is in some way "replicated" to be the same shape as a high(er)-dimensional array.
We saw this in our previous example: the low-dimensional scalar was replicated, or broadcast, to each element of the array x so that the addition operation could be performed element-wise.
This concept can be generalized to higher-dimensional NumPy arrays.
In [12]:
zeros = np.zeros(shape = (3, 4))
ones = 1
zeros += ones
print(zeros)
In this example, the scalar value 1 is broadcast to all the elements of zeros, converting the operation to element-wise addition.
This all happens under the NumPy hood--we don't see it! It "just works"...most of the time.
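If you want to peek under the hood, NumPy's np.broadcast_to function produces the "replicated" version explicitly. This is only a sketch to visualize the idea; in normal code you would let the addition do the broadcasting for you.

```python
import numpy as np

zeros = np.zeros(shape=(3, 4))

# np.broadcast_to shows the scalar stretched out to the target shape.
replicated = np.broadcast_to(1, zeros.shape)
print(replicated)

# Adding the scalar and adding the replicated array give the same result.
same = np.array_equal(zeros + 1, zeros + replicated)
print(same)
```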
There are some rules that broadcasting abides by. Essentially, dimensions of arrays need to be "compatible" in order for broadcasting to work. Two dimensions are "compatible" when they are equal, or when one of them is 1.
If these rules aren't met, you get all kinds of strange errors:
In [24]:
x = np.zeros(shape = (3, 3))
y = np.ones(4)
x + y
But on some intuitive level, this hopefully makes sense: there's no reasonable arithmetic operation that can be performed when you have one $3 \times 3$ matrix and a vector of length 4.
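If you would rather inspect the error than have it stop your program, you can catch it. A minimal sketch, where the name message is just for illustration:

```python
import numpy as np

x = np.zeros(shape=(3, 3))
y = np.ones(4)

try:
    x + y
    message = None
except ValueError as err:
    # NumPy reports the two shapes it could not reconcile.
    message = str(err)
print(message)
```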
To be rigorous, though: it's the trailing dimensions / axes that you want to make sure line up.
In [14]:
x = np.zeros(shape = (3, 4))
y = np.array([1, 2, 3, 4])
z = x + y
print(z)
In this example, the shape of x is (3, 4). The shape of y is (4,). Their trailing axes both have length 4, therefore the "smaller" array will be broadcast to fit the size of the larger array, and the operation (addition, in this case) is performed element-wise.
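The trailing-axes check can be sketched in a few lines. NumPy compares shapes starting from the last axis, and two axis lengths line up when they are equal or when one of them is 1. The helper shapes_compatible below is hypothetical, written only for illustration; it is not part of NumPy.

```python
import numpy as np

def shapes_compatible(shape_a, shape_b):
    """Hypothetical helper: compare axis lengths starting from the trailing end."""
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

ok_trailing = shapes_compatible((3, 4), (4,))     # trailing axes are both 4
bad_trailing = shapes_compatible((3, 3), (4,))    # 3 versus 4 does not line up
ok_column = shapes_compatible((3, 4), (3, 1))     # a length-1 axis can stretch
print(ok_trailing, bad_trailing, ok_column)

# A (3, 1) column is broadcast across the 4 columns of a (3, 4) array.
column = np.array([[10], [20], [30]])
result = np.zeros(shape=(3, 4)) + column
print(result)
```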
First: indexing by boolean masks.
We've already seen that you can index by integers. Using the colon operator, you can even specify ranges, slicing out entire swaths of rows and columns.
But suppose we want something very specific: data in our array which satisfies certain criteria, as opposed to data which is found at certain indices?
Put another way: can we pull data out of an array that meets certain conditions?
Let's say you have some data.
In [15]:
x = np.random.standard_normal(size = (7, 4))
print(x)
This is randomly generated data, yes, but it could easily be 7 data points in 4 dimensions. That is, we have 7 observations, each with 4 descriptors. Perhaps it's 7 people who are described by their height, weight, age, and 40-yard dash time. Or it's a matrix of data on 7 video games, each described by their PC Gamer rating, Steam downloads count, average number of active players, and total cheating complaints.
Whatever our data, a common first step before any analysis involves some kind of preprocessing. In this case, if the example we're looking at is the video game scenario from the previous slide, then we know that any negative numbers are junk. After all, how can you have a negative rating? Or a negative number of active players?
So our first course of action might be to set all negative numbers in the data to 0. We could potentially set up a pair of loops, but it's much easier (and faster) to use boolean indexing.
First, we create a mask. This is what it sounds like: it marks the entries we want to change with True, and "masks" off the portions of the data we don't want to change with False (in this case, all the numbers greater than or equal to 0).
In [16]:
mask = x < 0
print(mask)
Now, we can use our mask to access only the indices we want to set to 0.
In [17]:
x[mask] = 0
print(x)
Voilà! Every negative number has been set to 0, and all the other values were left unchanged. Now we can continue with whatever analysis we may have had in mind.
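For comparison, here is a sketch of the pair-of-loops approach next to the boolean mask, using a small hand-written array (data, looped, and masked are illustrative names) so the result is easy to check by eye:

```python
import numpy as np

data = np.array([[1.5, -0.3], [-2.0, 0.7]])

# The pair-of-loops version: visit every element and test it.
looped = data.copy()
for i in range(looped.shape[0]):
    for j in range(looped.shape[1]):
        if looped[i, j] < 0:
            looped[i, j] = 0

# The boolean-mask version: one line, no explicit loops.
masked = data.copy()
masked[masked < 0] = 0

print(looped)
print(np.array_equal(looped, masked))
```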
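One small caveat with boolean indexing: the Python keywords and and or DO NOT WORK on boolean arrays. You have to use the element-wise (bitwise) versions of the operators: & (for and) and | (for or). Wrap each comparison in parentheses, since & and | bind more tightly than < and >.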
In [18]:
mask = (x < 1) & (x > 0.5) # True for any value less than 1 but greater than 0.5
x[mask] = 99
print(x)
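To see the caveat in action, the sketch below tries the and keyword on a small illustrative array, catches the resulting error, and then builds the same kind of mask with & and |:

```python
import numpy as np

values = np.array([0.2, 0.6, 0.9, 1.4])

# The keyword form asks for a single True/False from a whole array, which fails.
try:
    mask = (values < 1) and (values > 0.5)
    keyword_error = None
except ValueError as err:
    keyword_error = err
print("Using 'and' failed:", keyword_error)

# The element-wise operators combine the comparisons one element at a time.
both = (values < 1) & (values > 0.5)
either = (values < 0.5) | (values > 1)
print(both)
print(either)
```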
Now, to demonstrate:
Let's build a 2D array that, for the sake of simplicity, has across each row the index of that row.
In [19]:
matrix = np.empty(shape = (8, 4))
for i in range(8):
    matrix[i] = i # Broadcasting is happening here!
print(matrix)
We have 8 rows and 4 columns, where each row is a vector of the same value repeated across the columns, and that value is the index of the row.
In addition to slicing and boolean indexing, we can also use other NumPy arrays to very selectively pick and choose what elements we want, and even the order in which we want them.
Let's say I want rows 7, 0, 5, and 2. In that order.
In [20]:
indices = np.array([7, 0, 5, 2])
print(matrix[indices])
Ta-daaa! Pretty spiffy!
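A couple of variations, as a sketch: a plain Python list works as the index too, and indices may repeat or be negative. Looking at only the first column makes the row order easy to read.

```python
import numpy as np

matrix = np.empty(shape=(8, 4))
for i in range(8):
    matrix[i] = i  # each row holds its own index

rows = matrix[np.array([7, 0, 5, 2])]
print(rows[:, 0])        # the first column shows the order the rows came back in

repeated = matrix[[2, 2, -1]]  # a list index; row 2 twice, then the last row
print(repeated[:, 0])
```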
But wait, there's more! Rather than just specifying one dimension, you can provide tuples of NumPy arrays that very explicitly pick out certain elements (in a certain order) from another NumPy array.
In [21]:
matrix = np.arange(32).reshape((8, 4))
print(matrix) # This 8x4 matrix has integer elements that increment by 1 column-wise, then row-wise.
In [22]:
indices = ( np.array([1, 7, 4]), np.array([3, 0, 1]) ) # This is a tuple of 2 NumPy arrays!
print(matrix[indices])
Ok, this will take a little explaining, bear with me:
When you pass in tuples as indices, they act as $(x, y)$ coordinate pairs: the first NumPy array of the tuple is the list of $x$ coordinates (the row indices), while the second NumPy array is the list of corresponding $y$ coordinates (the column indices).
In this way, the corresponding elements of the two NumPy arrays in the tuple give you the row and column indices to be selected from the original NumPy array.
In our previous example, this was our tuple of indices:
In [23]:
( np.array([1, 7, 4]), np.array([3, 0, 1]) )
Out[23]:
(array([1, 7, 4]), array([3, 0, 1]))
The $x$ coordinates are in array([1, 7, 4]), and the $y$ coordinates are in array([3, 0, 1]). More concretely:
(1, 3)--this is the 7 that was printed!
(7, 0)--this is the 28 that followed.
(4, 1)--this corresponds to the 17!
Fancy indexing can be tricky at first, but it can be very useful when you want to pull very specific elements out of a NumPy array and in a very specific order.
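One way to convince yourself of the pairing is to do it by hand. This sketch (with illustrative names rows, cols, fancy, and by_hand) zips the two index arrays together and looks up each (row, column) pair individually:

```python
import numpy as np

matrix = np.arange(32).reshape((8, 4))
rows = np.array([1, 7, 4])
cols = np.array([3, 0, 1])

# Fancy indexing pairs up rows[k] with cols[k] for each position k.
fancy = matrix[rows, cols]

# The same lookups, one (row, column) pair at a time.
by_hand = [int(matrix[r, c]) for r, c in zip(rows, cols)]

print(fancy)
print(by_hand)
```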
Some questions to discuss and consider:
1: Given some arbitrary NumPy array and only access to its .shape attribute, describe (in words or in Python pseudocode) how you would compute exactly how many individual elements exist in the array.
2: Broadcasting hints that there is more happening under the hood than meets the eye with NumPy. With this in mind, do you think it would be more or less efficient to write a loop yourself in Python to add a scalar to each element in a Python list, rather than use NumPy broadcasting? Why or why not?
3: Let's say I have a 2D matrix, where the rows represent individual gamers, and the columns represent games. There's a "1" in the column if the gamer won that game, and a "0" if they lost. Describe how you might use boolean indexing to select only the rows corresponding to gamers whose average score was above a certain threshold.
csci1360e-discussions channel TODAY at 10am EDT! Please come with any questions you may have!