Lecture 7: Vectorized Programming

CSCI 1360E: Foundations for Informatics and Analytics

Overview and Objectives

We've covered loops and lists, and how to use them to perform some basic arithmetic calculations. In this lecture, we'll see how we can use an external library to make these computations much easier and much faster.

  • Understand how to use import to add functionality beyond base Python
  • Compare and contrast NumPy arrays to built-in Python lists
  • Define "broadcasting" in the context of vectorized programming
  • Use NumPy arrays in place of explicit loops for basic arithmetic operations

Part 1: Importing modules

With all the data structures we've discussed so far--lists, sets, tuples, dictionaries, comprehensions, generators--it's hard to believe there's anything else. But oh man, is there a big huge world of Python extensions out there.

These extensions are known as modules. You've seen at least one in play in your assignments so far:


In [ ]:
import random

Anytime you see a statement that starts with import, you'll recognize that the programmer is pulling in some sort of external functionality not previously available to Python by default. In this case, the random package provides some basic functionality for computing random numbers.

That's just one of countless examples...an infinite number that continues to nonetheless increase daily.

Python has a bunch of functionality that comes by default--no import required. Remember writing functions to compute the maximum and minimum of a list? Turns out, those already exist by default (sorry everyone):


In [1]:
x = [3, 7, 2, 9, 4]
print("Maximum: {}".format(max(x)))
print("Minimum: {}".format(min(x)))


Maximum: 9
Minimum: 2

Quite a bit of other functionality--still built-in to the default Python environment!--requires explicit import statements to unlock. Here are just a couple of examples:


In [ ]:
import random   # For generating random numbers, as we've seen.
import os       # For interacting with the filesystem of your computer.
import re       # For regular expressions. Unrelated: https://xkcd.com/1171/
import datetime # Helps immensely with determining the date and formatting it.
import math     # Gives some basic math functions: trig, factorial, exponential, logarithms, etc.
import xml      # Abandon all hope, ye who enter.

If you are so inclined, you can see the full Python default module index here: https://docs.python.org/3/py-modindex.html.

It's quite a bit! Made all the more mind-blowing when you consider that the default Python module index is but a tiny, minuscule drop in the bucket compared to the vast 3rd-party module ecosystem.

These packages provide methods and functions wrapped inside, which you can access via the "dot-notation":


In [4]:
import random
random.randint(0, 1)


Out[4]:
0

Dot-notation works by

  1. specifying package_name (in this case, random)
  2. dot: .
  3. followed by function_name (in this case, randint, which returns a random integer between two numbers, with both endpoints included)
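The same three-step pattern works for any imported module. As a small sketch, here it is with the math module from the list above (the particular functions and inputs are just picked for illustration):

```python
import math  # standard-library module, used here only as an illustration

# package_name, then a dot, then function_name(arguments)
root = math.sqrt(16)       # square root of 16
fact = math.factorial(5)   # 5 * 4 * 3 * 2 * 1

print(root)  # 4.0
print(fact)  # 120
```

Unlike random.randint, these functions return the same answer every time, which makes them handy for checking that you have the notation right.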

As a small tidbit--you can treat imported packages almost like variables, in that you can name them whatever you like, using the as keyword in the import statement.

Instead of


In [6]:
import random
random.randint(0, 1)


Out[6]:
1

We can tweak it


In [7]:
import random as r
r.randint(0, 1)


Out[7]:
0

You can put whatever you want after the as, and anytime you call methods from that module, you'll use the name you gave it.

Don't worry about trying to memorize all the available modules in core Python; in looking through them just now, I was amazed how many I'd never even heard of. Suffice to say, you can get by.

Especially since, once you get beyond the core modules, there's an ever-expanding universe of 3rd-party modules you can install and use. Anaconda comes prepackaged with quite a few (see the "In Installer" column of Anaconda's package list) and gives you the option to manually install quite a few more.

Again, don't worry about trying to learn all these. There are simply too many. You'll come across packages as you need them. For now, we're going to focus on one specific package that is central to most modern data science:

NumPy, short for Numerical Python.

Part 2: Introduction to NumPy

NumPy, or Numerical Python, is an incredible library of basic functions and data structures that provide a robust foundation for computational scientists.

Put another way: if you're using Python and doing any kind of math, you'll probably use NumPy.

At this point, NumPy is so deeply embedded in so many other 3rd-party modules related to scientific computing that even if you're not making explicit use of it, at least one of the other modules you're using probably is.

NumPy's core: the ndarray

In core Python, if we wanted to represent a matrix, we would more or less have to build a "list of lists", a monstrosity along these lines:


In [1]:
matrix = [[ 1, 2, 3], 
          [ 4, 5, 6],
          [ 7, 8, 9] ]
print(matrix)


[[1, 2, 3], [4, 5, 6], [7, 8, 9]]

Indexing would still work as you would expect, but looping through a matrix--say, to do matrix multiplication--would be laborious and highly inefficient.
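To get a feel for "laborious", here is a rough sketch of matrix multiplication using nothing but lists of lists. The two small matrices are made up for illustration, and this is only one straightforward way to write it:

```python
# Two small matrices stored as lists of lists (values chosen for illustration).
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

n_rows, n_inner, n_cols = len(A), len(B), len(B[0])

# Start with a matrix of zeros, then fill it in one entry at a time.
C = [[0] * n_cols for _ in range(n_rows)]
for i in range(n_rows):            # each row of A
    for j in range(n_cols):        # each column of B
        for k in range(n_inner):   # walk along the row and down the column
            C[i][j] += A[i][k] * B[k][j]

print(C)  # [[19, 22], [43, 50]]
```

Three nested loops, all running in the Python interpreter, for a single multiplication.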

We'll demonstrate this experimentally later, but suffice to say Python lists embody the drawbacks of using an interpreted language such as Python: they're easy to use, but oh so slow.

By contrast, in NumPy, we have the ndarray structure (short for "n-dimensional array") that is a highly optimized version of Python lists, perfect for fast and efficient computations. To make use of NumPy arrays, import NumPy (it's installed by default in Anaconda, and on JupyterHub):


In [2]:
import numpy

Now just call the array function using our list from before!


In [5]:
arr = numpy.array(matrix)
print(arr)


[[1 2 3]
 [4 5 6]
 [7 8 9]]

To reference an element in the array, just use the same notation we did for lists:


In [27]:
arr[0]


Out[27]:
array([1, 2, 3])

In [30]:
arr[2][2]


Out[30]:
9

You can also separate dimensions by commas:


In [31]:
arr[2, 2]


Out[31]:
9

Remember, with indexing matrices: the first index is the row, the second index is the column.
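A quick way to convince yourself of the row-then-column order is to pick two positions that are mirror images of each other. This sketch rebuilds the same 3x3 array so it can run on its own:

```python
import numpy

arr = numpy.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

top_right = arr[0, 2]     # row 0, column 2
bottom_left = arr[2, 0]   # row 2, column 0

print(top_right)    # 3
print(bottom_left)  # 7
```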

NumPy's submodules

NumPy has an impressive array of utility modules that come along with it, optimized to use its ndarray data structure. I highly encourage you to use them, even if you're not using NumPy arrays.

1: Basic mathematical routines

All the core functions you could want; for example, the built-in Python math routines (trig, logs, exponents, etc.) all have NumPy versions (numpy.sin, numpy.cos, numpy.log, numpy.exp, numpy.max, numpy.min).
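What makes the NumPy versions different is that they accept a whole array and apply the function to every element. A minimal sketch, with input values chosen only for illustration:

```python
import numpy

angles = numpy.array([0.0, numpy.pi / 2, numpy.pi])
sines = numpy.sin(angles)        # sine of every element at once

values = numpy.array([1.0, 2.0, 3.0])
exps = numpy.exp(values)         # e**x for every element
back = numpy.log(exps)           # natural log undoes exp, up to rounding

print(sines)
print(back)
print(numpy.max(values), numpy.min(values))  # 3.0 1.0
```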

2: Fourier transforms

If you do any signal processing using Fourier transforms (which we might, later!), NumPy has an entire submodule full of tools for this type of analysis in numpy.fft.
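As a small taste, and assuming a made-up signal (a sine wave that completes 5 cycles over 64 samples), the sketch below asks numpy.fft which frequency dominates:

```python
import numpy

n_samples = 64                                 # illustrative choice
t = numpy.arange(n_samples) / n_samples        # sample times over one window
signal = numpy.sin(2 * numpy.pi * 5 * t)       # 5 full cycles in the window

# rfft returns the transform for the non-negative frequencies of a real signal.
spectrum = numpy.abs(numpy.fft.rfft(signal))
peak_bin = int(numpy.argmax(spectrum))         # index of the strongest frequency

print(peak_bin)  # 5
```

The strongest bin lines up with the 5 cycles we put in.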

3: Linear algebra

We'll definitely be using this submodule later in the course. This is most of your vector and matrix linear algebra operations, from vector norms (numpy.linalg.norm) to singular value decomposition (numpy.linalg.svd) to matrix determinants (numpy.linalg.det).
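Here is a brief sketch of the three functions just mentioned, on a tiny vector and matrix invented for the purpose. The @ symbol in the last step is matrix multiplication, which is covered in Part 3:

```python
import numpy

v = numpy.array([3.0, 4.0])
M = numpy.array([[1.0, 2.0],
                 [3.0, 4.0]])

length = numpy.linalg.norm(v)    # sqrt(3**2 + 4**2) = 5
det = numpy.linalg.det(M)        # 1*4 - 2*3 = -2, up to floating-point rounding

# SVD factors M into U, the singular values s, and V transposed.
U, s, Vt = numpy.linalg.svd(M)
rebuilt = U @ numpy.diag(s) @ Vt  # multiplying the factors recovers M

print(length)
print(det)
print(rebuilt)
```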

4: Random numbers

NumPy has a phenomenal random number library in numpy.random. In addition to generating uniform random numbers in a certain range, you can also sample from many common parametric distributions (normal, binomial, Poisson, and so on).
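A minimal sketch of both kinds of sampling, assuming a reasonably recent NumPy that provides numpy.random.default_rng. The seed, ranges, and sample sizes are arbitrary choices for illustration:

```python
import numpy

rng = numpy.random.default_rng(seed=42)   # seed is arbitrary; it makes runs repeatable

# Five uniform random numbers between 0 and 10.
uniform = rng.uniform(low=0.0, high=10.0, size=5)

# One thousand samples from a normal distribution with mean 0, std. dev. 1.
normal = rng.normal(loc=0.0, scale=1.0, size=1000)

print(uniform)
print(normal.mean(), normal.std())
```

The printed mean and standard deviation should land near 0 and 1, though not exactly, since these are random samples.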

Part 3: Vectorized Arithmetic

"Vectorized arithmetic" refers to how NumPy allows you to efficiently perform arithmetic operations on entire NumPy arrays at once, as you would with "regular" Python variables.

For example: let's say you have a vector and you want to normalize it to be unit length; that involves dividing every element in the vector by a constant (the magnitude of the vector). With lists, you'd have to loop through them manually.


In [22]:
vector = [4.0, 15.0, 6.0, 2.0]
# To normalize this to unit length, we need to divide each element by the vector's magnitude.
# To learn its magnitude, we need to loop through the whole vector.
# So. We need two loops!
magnitude = 0.0
for element in vector:
    magnitude += element ** 2
magnitude = (magnitude ** 0.5)  # square root
print("Original magnitude: {:.2f}".format(magnitude))
new_magnitude = 0.0
for index, element in enumerate(vector):
    vector[index] = element / magnitude
    new_magnitude += vector[index] ** 2
new_magnitude = (new_magnitude ** 0.5)
print("Normalized magnitude: {:.2f}".format(new_magnitude))


Original magnitude: 16.76
Normalized magnitude: 1.00

Now, let's see the same operation, this time with NumPy arrays.


In [23]:
import numpy as np  # This tends to be the "standard" convention when importing NumPy.
import numpy.linalg as nla

vector = [4.0, 15.0, 6.0, 2.0]
np_vector = np.array(vector)  # Convert to NumPy array.
magnitude = nla.norm(np_vector)  # Computing the magnitude: one-liner.
print("Original magnitude: {:.2f}".format(magnitude))

np_vector /= magnitude  # Vectorized division!!! No loop needed!
new_magnitude = nla.norm(np_vector)
print("Normalized magnitude: {:.2f}".format(new_magnitude))


Original magnitude: 16.76
Normalized magnitude: 1.00

No loops needed, far fewer lines of code, and a simple intuitive operation.

Operations involving arrays on both sides of the sign will also work (though the two arrays generally need to have the same shape, unless NumPy can "broadcast" the smaller one to match the larger).

For example, adding two vectors together:


In [24]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = x + y
print(z)


[5 7 9]

Works exactly as you'd expect, but no [explicit] loop needed.
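The overview lists "broadcasting" as one of this lecture's objectives. Roughly speaking, it's what NumPy does when the two operands don't have the same shape: where it can, it stretches the smaller one to fit. The sketch below shows two simple cases with made-up values; the full rules are in the NumPy documentation:

```python
import numpy as np

x = np.array([1, 2, 3])

# Case 1: a single number is applied to every element of the array.
doubled = x * 2

# Case 2: a length-3 vector is added to each row of a 2x3 matrix.
M = np.array([[1, 2, 3],
              [4, 5, 6]])
shifted = M + x

print(doubled)  # [2 4 6]
print(shifted)  # [[2 4 6]
                #  [5 7 9]]
```

The division np_vector /= magnitude in the normalization example is the first case at work.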

This becomes particularly compelling with matrix multiplication. Say you have two matrices, $A$ and $B$:


In [25]:
A = np.array([ [1, 2], [3, 4] ])
B = np.array([ [5, 6], [7, 8] ])

If you recall from linear algebra, matrix multiplication $A \times B$ involves multiplying each row of $A$ by each column of $B$. But rather than write that code yourself, Python (as of version 3.5) gives us a dedicated matrix multiplication operator: the @ symbol!


In [26]:
A @ B


Out[26]:
array([[19, 22],
       [43, 50]])
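One easy mistake is reaching for * instead of @. With NumPy arrays, * is vectorized element-by-element multiplication, which is a different operation. A short sketch using the same two matrices shows the contrast:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B      # multiplies matching positions: 1*5, 2*6, 3*7, 4*8
matrix_product = A @ B   # rows of A times columns of B

print(elementwise)     # [[ 5 12]
                       #  [21 32]]
print(matrix_product)  # [[19 22]
                       #  [43 50]]
```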

In summary

  • NumPy arrays have all the abilities of lists (indexing, mutability, slicing) plus a whole lot of additional benefits, such as vectorized computations.
  • About the only limitation of NumPy arrays relative to Python lists is constructing them: with lists, you can build them through generators or the append() method, but you can't do this with NumPy arrays. Therefore, if you're building an array from scratch, the best option would be to build the list and then pass that to numpy.array() to convert it. Adjusting the length of the NumPy array after it's constructed is more difficult.
  • The Python ecosystem is huge. There is some functionality that comes with Python by default, and some of this default functionality is available immediately; the other default functionality is accessible using import statements. There is even more functionality from 3rd-party vendors, but it needs to be installed before it can be imported. NumPy falls in this lattermost category.
  • Vectorized operations are always, always preferred to loops. They're easier to write, easier to understand, and in almost all cases, much more efficient.
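To tie the second and last points together, here is a rough sketch that builds a list with append(), converts it to an array once, and then computes a sum of squares both ways. The data size is an arbitrary choice, and the timings will vary from machine to machine, so treat them as a loose indication only:

```python
import time
import numpy as np

# Build the data with append(), then convert to an array once at the end.
values = []
for i in range(100_000):
    values.append(i * 0.5)
arr = np.array(values)

# Sum of squares with an explicit loop over the list.
start = time.perf_counter()
loop_total = 0.0
for v in values:
    loop_total += v ** 2
loop_seconds = time.perf_counter() - start

# The same computation, vectorized.
start = time.perf_counter()
vec_total = np.sum(arr ** 2)
vec_seconds = time.perf_counter() - start

print("Loop:       {:.6f} seconds".format(loop_seconds))
print("Vectorized: {:.6f} seconds".format(vec_seconds))
```

Both approaches give the same total, up to floating-point rounding.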

Review Questions

Some questions to discuss and consider:

1: NumPy arrays have an attribute called .shape that will return the dimensions of the array in the form of a tuple. If the array is just a vector, the tuple will only have 1 element: the length of the array. If the array is a matrix, the tuple will have 2 elements: the number of rows and the number of columns. What will the shape tuple be for the following array: tensor = np.array([ [ [1, 2], [3, 4] ], [ [5, 6], [7, 8] ], [ [9, 10], [11, 12] ] ])

2: Vectorized computations may seem almost like magic, and indeed they are, but at the end of the day there has to be a loop somewhere that performs the operations. Given what we've discussed about interpreted languages, compiled languages, and in particular how the delineations between the two are blurring, what would your best educated guess be (ideally without Google's help) as to where the loops that implement the vectorized computations actually happen?

3: Using your knowledge of slicing from a few lectures ago, and your knowledge from this lecture that NumPy arrays also support slicing, let's take an example of selecting a sub-range of rows from a two-dimensional matrix. Write the notation you would use for slicing out / selecting all the rows except for the first one, while retaining all the columns (hint: by just using : as your slicing operator, with no numbers, this means "everything").
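If you'd like a starting point for question 1, this small sketch prints .shape for the two cases the question describes, a vector and a matrix, using made-up arrays. The three-dimensional case is left for you to work out:

```python
import numpy as np

vector = np.array([1, 2, 3, 4])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(vector.shape)  # (4,)   -> one element: the length
print(matrix.shape)  # (2, 3) -> rows, then columns
```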

Course Administrivia

If you're having trouble with the assignments, please contact me! You can also post about it in Slack; that's specifically what I created the #csci1360e-discussions channel for: to collaborate and discuss any concepts you want with your classmates. I'll jump in wherever and whenever I can, but if you're having an issue chances are someone else is, too. All I want to avoid is straight copy-paste of code, but collaboration is perfectly fine and strongly encouraged!

Additional Resources

  1. Grus, Joel. Data Science from Scratch. 2015. ISBN-13: 978-1491901427
  2. McKinney, Wes. Python for Data Analysis. 2012. ISBN-13: 978-1449319793
  3. NumPy Quickstart Tutorial: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html