We've covered loops and lists, and how to use them to perform some basic arithmetic calculations. In this lecture, we'll see how we can use an external library to make these computations much easier and much faster.
Python uses import statements to add functionality beyond base Python. These extensions are known as modules. You've seen at least one in play in your assignments so far:
In [ ]:
import random
Anytime you see a statement that starts with import, you'll recognize that the programmer is pulling in some sort of external functionality not previously available to Python by default. In this case, the random package provides some basic functionality for computing random numbers.
That's just one of countless examples...an infinite number that continues to nonetheless increase daily.
Python has a bunch of functionality that comes by default--no import required. Remember writing functions to compute the maximum and minimum of a list? Turns out, those already exist by default (sorry everyone):
In [1]:
x = [3, 7, 2, 9, 4]
print("Maximum: {}".format(max(x)))
print("Minimum: {}".format(min(x)))
Quite a bit of other functionality--still built into the default Python environment!--requires explicit import statements to unlock. Here are just a few examples:
In [ ]:
import random # For generating random numbers, as we've seen.
import os # For interacting with the filesystem of your computer.
import re # For regular expressions. Unrelated: https://xkcd.com/1171/
import datetime # Helps immensely with determining the date and formatting it.
import math # Gives some basic math functions: trig, factorial, exponential, logarithms, etc.
import xml # Abandon all hope, ye who enter.
If you are so inclined, you can see the full Python default module index here: https://docs.python.org/3/py-modindex.html.
It's quite a bit! It's all the more mind-blowing when you consider that the default Python module index is but a tiny, minuscule drop in the bucket compared to the vast 3rd-party module ecosystem.
These packages provide methods and functions wrapped inside, which you can access via "dot-notation":
In [4]:
import random
random.randint(0, 1)
Out[4]:
Dot-notation works by writing package_name (in this case, random), then a dot, then function_name (in this case, randint, which returns a random integer between two numbers, both endpoints included).
As a small tidbit--you can treat imported packages almost like variables, in that you can name them whatever you like, using the as keyword in the import statement.
Instead of
In [6]:
import random
random.randint(0, 1)
Out[6]:
We can tweak it
In [7]:
import random as r
r.randint(0, 1)
Out[7]:
You can put whatever you want after the as, and anytime you call methods from that module, you'll use the name you gave it.
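The same dot-notation and aliasing work for any module, not just random. Here is a small sketch using two of the standard-library modules listed earlier; the alias dt, the variable names, and the date are arbitrary choices made for this example.

```python
import math
import datetime as dt  # "dt" is an arbitrary alias chosen for this example

# package_name.function_name, exactly as with random.randint
root = math.sqrt(16)        # square root
fact = math.factorial(5)    # 5 * 4 * 3 * 2 * 1

# The alias stands in for the module name everywhere after the import.
day = dt.date(2024, 1, 15)  # a fixed, made-up date so the output is repeatable
label = day.strftime("%Y-%m-%d")

print(root, fact, label)    # 4.0 120 2024-01-15
```

Once a module is imported under an alias, the original name (here, datetime) is not defined unless you import it separately.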
Don't worry about trying to memorize all the available modules in core Python; in looking through them just now, I was amazed how many I'd never even heard of. Suffice to say, you can get by.
Especially since, once you get beyond the core modules, there's an ever-expanding universe of 3rd-party modules you can install and use. Anaconda comes prepackaged with quite a few (see the column "In Installer") and the option to manually install quite a few more.
Again, don't worry about trying to learn all these. There are simply too many. You'll come across packages as you need them. For now, we're going to focus on one specific package that is central to most modern data science:
NumPy, short for Numerical Python.
Put another way: if you're using Python and doing any kind of math, you'll probably use NumPy.
At this point, NumPy is so deeply embedded in so many other 3rd-party modules related to scientific computing that even if you're not making explicit use of it, at least one of the other modules you're using probably is.
In core Python, the natural way to represent a matrix is as a list of lists:
In [1]:
matrix = [[ 1, 2, 3],
          [ 4, 5, 6],
          [ 7, 8, 9] ]
print(matrix)
Indexing would still work as you would expect, but looping through a matrix--say, to do matrix multiplication--would be laborious and highly inefficient.
We'll demonstrate this experimentally later, but suffice to say Python lists embody the drawbacks of using an interpreted language such as Python: they're easy to use, but oh so slow.
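To give a feel for what "laborious" means here, the following is a minimal sketch of matrix multiplication written with plain lists. The helper name matmul_lists and the identity matrix are made up for this illustration; they are not part of any library.

```python
def matmul_lists(a, b):
    """Multiply two matrices stored as lists of lists (hypothetical helper)."""
    n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
    result = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):            # each row of a
        for j in range(n_cols):        # each column of b
            for k in range(n_inner):   # running dot product of row i and column j
                result[i][j] += a[i][k] * b[k][j]
    return result

matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
identity = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]

product = matmul_lists(matrix, identity)
print(product)  # multiplying by the identity gives the original matrix back
```

Three nested loops, all executed by the Python interpreter one step at a time: that is the cost we would like to avoid.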
By contrast, in NumPy, we have the ndarray structure (short for "n-dimensional array") that is a highly optimized version of Python lists, perfect for fast and efficient computations. To make use of NumPy arrays, import NumPy (it's installed by default in Anaconda, and on JupyterHub):
In [2]:
import numpy
Now just call the array function on our list from before!
In [5]:
arr = numpy.array(matrix)
print(arr)
To reference an element in the array, just use the same notation we did for lists:
In [27]:
arr[0]
Out[27]:
In [30]:
arr[2][2]
Out[30]:
You can also separate dimensions by commas:
In [31]:
arr[2, 2]
Out[31]:
Remember, with indexing matrices: the first index is the row, the second index is the column.
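A quick way to convince yourself of the row-then-column order is to pick two elements that would swap if the order were reversed. This is just a small sketch; the variable names are chosen for illustration.

```python
import numpy

arr = numpy.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

top_right = arr[0, 2]     # row 0, column 2
bottom_left = arr[2, 0]   # row 2, column 0

print(top_right)    # 3
print(bottom_left)  # 7
```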
1: Basic mathematical routines
All the core functions you could want; for example, the built-in Python math routines (trig, logs, exponents, etc.) all have NumPy versions (numpy.sin, numpy.cos, numpy.log, numpy.exp, numpy.max, numpy.min).
2: Fourier transforms
If you do any signal processing using Fourier transforms (which we might, later!), NumPy has an entire sub-module full of tools for this type of analysis in numpy.fft.
3: Linear algebra
We'll definitely be using this submodule later in the course. It covers most of your vector and matrix linear algebra operations, from vector norms (numpy.linalg.norm) to singular value decomposition (numpy.linalg.svd) to matrix determinants (numpy.linalg.det).
4: Random numbers
NumPy has a phenomenal random number library in numpy.random. In addition to generating uniform random numbers in a certain range, you can also sample from many common parametric distributions (normal, binomial, Poisson, and so on).
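Here is a brief tour of the four areas above, as a sketch rather than a complete reference. The example signal, the matrix values, and the seed 42 are all made up for illustration. The random-number part uses numpy.random.default_rng, which assumes NumPy 1.17 or newer.

```python
import numpy as np

# 1: Basic math routines operate on every element at once.
angles = np.array([0.0, np.pi / 2, np.pi])
sines = np.sin(angles)                       # approximately [0, 1, 0]
largest = np.max(np.array([3, 7, 2, 9, 4]))

# 2: Fourier transform of a sine wave with 4 full cycles over 64 samples.
t = np.arange(64)
signal = np.sin(2 * np.pi * 4 * t / 64)
spectrum = np.abs(np.fft.rfft(signal))
peak_bin = int(np.argmax(spectrum))          # index of the strongest frequency

# 3: Linear algebra.
m = np.array([[1.0, 2.0], [3.0, 4.0]])
determinant = np.linalg.det(m)                 # approximately 1*4 - 2*3 = -2
length = np.linalg.norm(np.array([3.0, 4.0]))  # sqrt(9 + 16) = 5

# 4: Random numbers (the seed is arbitrary, chosen only for repeatability).
rng = np.random.default_rng(42)
uniform_draws = rng.uniform(0.0, 1.0, size=5)
normal_draws = rng.normal(loc=0.0, scale=1.0, size=5)

print(sines, largest, peak_bin, determinant, length)
print(uniform_draws, normal_draws)
```

Notice that np.sin was applied to a whole array in a single call, with no loop in sight.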
For example: let's say you have a vector and you want to normalize it to be unit length; that involves dividing every element in the vector by a constant (the magnitude of the vector). With lists, you'd have to loop through them manually.
In [22]:
vector = [4.0, 15.0, 6.0, 2.0]
# To normalize this to unit length, we need to divide each element by the vector's magnitude.
# To learn its magnitude, we need to loop through the whole vector.
# So. We need two loops!
magnitude = 0.0
for element in vector:
    magnitude += element ** 2
magnitude = (magnitude ** 0.5) # square root
print("Original magnitude: {:.2f}".format(magnitude))
new_magnitude = 0.0
for index, element in enumerate(vector):
    vector[index] = element / magnitude
    new_magnitude += vector[index] ** 2
new_magnitude = (new_magnitude ** 0.5)
print("Normalized magnitude: {:.2f}".format(new_magnitude))
Now, let's see the same operation, this time with NumPy arrays.
In [23]:
import numpy as np # This tends to be the "standard" convention when importing NumPy.
import numpy.linalg as nla
vector = [4.0, 15.0, 6.0, 2.0]
np_vector = np.array(vector) # Convert to NumPy array.
magnitude = nla.norm(np_vector) # Computing the magnitude: one-liner.
print("Original magnitude: {:.2f}".format(magnitude))
np_vector /= magnitude # Vectorized division!!! No loop needed!
new_magnitude = nla.norm(np_vector)
print("Normalized magnitude: {:.2f}".format(new_magnitude))
No loops needed, far fewer lines of code, and a simple intuitive operation.
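If you'd like a rough sense of the speed difference as well, the sketch below times both approaches with the standard timeit module. The helper norm_with_loop and the vector size of 100,000 are assumptions made for this example, and the actual timings will vary from machine to machine.

```python
import timeit
import numpy as np

values = [float(i) for i in range(100_000)]
np_values = np.array(values)

def norm_with_loop(vector):
    """Magnitude computed with an explicit Python loop (hypothetical helper)."""
    total = 0.0
    for element in vector:
        total += element ** 2
    return total ** 0.5

# Both approaches should agree on the answer.
loop_result = norm_with_loop(values)
numpy_result = np.linalg.norm(np_values)

# Time 10 repetitions of each approach; exact numbers depend on the machine.
loop_seconds = timeit.timeit(lambda: norm_with_loop(values), number=10)
numpy_seconds = timeit.timeit(lambda: np.linalg.norm(np_values), number=10)

print("Loop:  {:.4f} s".format(loop_seconds))
print("NumPy: {:.4f} s".format(numpy_seconds))
```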
Operations involving arrays on both sides of the operator will also work (though the two arrays need to be the same length).
For example, adding two vectors together:
In [24]:
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
z = x + y
print(z)
Works exactly as you'd expect, but no [explicit] loop needed.
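The other arithmetic operators behave the same way, and it's worth seeing what happens when the lengths don't match. A small sketch follows; the array too_short and the flag mismatch_failed are made up for this example.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

product = x * y     # element-wise: [1*4, 2*5, 3*6]
doubled = x * 2     # a single number is applied to every element

too_short = np.array([10, 20])
try:
    x + too_short   # lengths 3 and 2 cannot be paired up
    mismatch_failed = False
except ValueError as err:
    mismatch_failed = True
    print("Could not add:", err)

print(product)  # [ 4 10 18]
print(doubled)  # [2 4 6]
```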
This becomes particularly compelling with matrix multiplication. Say you have two matrices, $A$ and $B$:
In [25]:
A = np.array([ [1, 2], [3, 4] ])
B = np.array([ [5, 6], [7, 8] ])
If you recall from linear algebra, matrix multiplication $A \times B$ involves multiplying each row of $A$ by each column of $B$. But rather than write that code yourself, Python (as of version 3.5) gives us a dedicated matrix multiplication operator: the @ symbol!
In [26]:
A @ B
Out[26]:
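To make the row-by-column rule concrete, here is a small sketch that computes one entry by hand and compares @ against the ordinary * operator, which multiplies element by element instead. The variable names are chosen for illustration.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

matrix_product = A @ B    # rows of A times columns of B
elementwise = A * B       # NOT matrix multiplication: pairs up matching entries

# Top-left entry by hand: first row of A dotted with first column of B.
top_left = A[0, 0] * B[0, 0] + A[0, 1] * B[1, 0]   # 1*5 + 2*7 = 19

print(matrix_product)
print(elementwise)
print(top_left)
```

Mixing up * and @ is an easy mistake to make, since both run without error on matrices of the same shape.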
Python lists can grow through the append() method, but you can't do this with NumPy arrays. Therefore, if you're building an array from scratch, the best option would be to build the list and then pass that to numpy.array() to convert it. Adjusting the length of the NumPy array after it's constructed is more difficult.
Some built-in functionality requires explicit import statements. There is even more functionality from 3rd-party vendors, but it needs to be installed before it can be imported. NumPy falls in this latter category.
Some questions to discuss and consider:
1: NumPy arrays have an attribute called .shape that will return the dimensions of the array in the form of a tuple. If the array is just a vector, the tuple will only have 1 element: the length of the array. If the array is a matrix, the tuple will have 2 elements: the number of rows and the number of columns. What will the shape tuple be for the following array: tensor = np.array([ [ [1, 2], [3, 4] ], [ [5, 6], [7, 8] ], [ [9, 10], [11, 12] ] ])?
2: Vectorized computations may seem almost like magic, and indeed they are, but at the end of the day there has to be a loop somewhere that performs the operations. Given what we've discussed about interpreted languages, compiled languages, and in particular how the delineations between the two are blurring, what would your best educated guess be (ideally without Google's help) as to where the loops that implement the vectorized computations actually happen?
3: Using your knowledge of slicing from a few lectures ago, and your knowledge from this lecture that NumPy arrays also support slicing, let's take an example of selecting a sub-range of rows from a two-dimensional matrix. Write the notation you would use for slicing out / selecting all the rows except for the first one, while retaining all the columns (hint: by just using : as your slicing operator, with no numbers, this means "everything").
If you're having trouble with the assignments, please contact me! You can also post about it in Slack; that's specifically what I created the #csci1360e-discussions channel for: to collaborate and discuss any concepts you want with your classmates. I'll jump in wherever and whenever I can, but if you're having an issue chances are someone else is, too. All I want to avoid is straight copy-paste of code, but collaboration is perfectly fine and strongly encouraged!