See Jake Vanderplas' Intro to NumPy, Efficient NumPy and Intro to Scipy for more details. This notebook was created from a selection of materials in those more exhaustive notebooks.
While the Python language is an excellent tool for general-purpose programming, with a highly readable syntax, rich and powerful data types (strings, lists, sets, dictionaries, arbitrary length integers, etc) and a very comprehensive standard library, it was not designed specifically for mathematical and scientific computing. Neither the language nor its standard library have facilities for the efficient representation of multidimensional datasets, tools for linear algebra and general matrix manipulations (an essential building block of virtually all technical computing), nor any data visualization facilities.
SciPy is the premiere package for scientific computing in Python. It has many, many useful tools, which are central to scientific computing tasks you'll come across.
Some of the submodules available in SciPy include:
scipy.io
scipy.special
scipy.linalg
scipy.fftpack
scipy.optimize
scipy.stats
scipy.interpolate
scipy.integrate
scipy.signal
scipy.ndimage
scipy.sparse
scipy.constants
scipy.spatial
These are often best learned on an "as-needed basis", when a project calls for processing in one of these categories. Before we get into the SciPy libraries, we need to learn how to efficiently compute with the NumPy library. Both are installed by default with the Anaconda distribution of Python.
The NumPy library forms the base layer for the entire SciPy ecosystem. You can import it into the notebook with
In [ ]:
import numpy as np
We do this to save time with all future functions. Instead of needing to type (for example)
numpy.sin(x)
we can simply type
np.sin(x)
As you work through this notebook, I encourage you to not only run the code in each cell, but spend some time to manipulate and play with the code you see until you feel comfortable with what each cell does.
You don't need to memorize everything in this notebook.
It's great to have basic commands down by heart, but there is so much to coding that no one knows it all. Professionals are constantly googling and referencing other code. Coding is more about figuring out what you want in the end, and then using your knowledge AND resources to help the computer get you there.
You learned about Python lists during your CodeCademy Bootcamp. Although they are very flexible containers, they are poorly suited to represent efficiently common mathematical constructs like vectors and matrices.
The main object provided by numpy is a powerful array. We'll start by exploring how the numpy array differs from Python lists. We start by creating a simple list and an array with the same contents of the list:
In [ ]:
lst = [10, 20, 30, 40]
arr = np.array([10, 20, 30, 40])
Elements of a one-dimensional array are accessed with the same syntax as a list:
In [ ]:
arr[0]
In [ ]:
lst[0]
In [ ]:
arr[-1]
In [ ]:
arr[2:]
The first difference to note between lists and arrays is that arrays are homogeneous; i.e. all elements of an array must be of the same type. In contrast, lists can contain elements of arbitrary type. For example, we can change the last element in our list above to be a string:
In [ ]:
lst[-1] = 'a string inside a list'
lst
In [ ]:
arr[-1] = 'a string inside an array'
but this NumPy array can only contain integers, so you get an error when you try to add a string to a NumPy array.
The information about the type of an array is contained in its dtype
attribute:
In [ ]:
arr.dtype
Once an array has been created, its dtype
is fixed and it can only store elements of the same type. For this example where the dtype
is integer, if we store a floating point number it will be automatically converted into an integer:
In [ ]:
arr[-1] = 1.234
arr
As we did above, you can create an array from a python list, but there are other methods of creating arrays that are very useful. For example, we may want to create an array initialized to a specific value, like 0 or 1, which can then be built upon with addition or multiplication to get arrays of any initial value. To create these, use zeros
or ones
and specify the desired type:
In [ ]:
np.zeros(5, dtype=float)
In [ ]:
np.zeros(3, dtype=int)
In [ ]:
np.zeros(3, dtype=complex)
In [ ]:
print '5 ones:', np.ones(5)
In [ ]:
a = np.empty(4)
a.fill(5.5)
a
In [ ]:
a = np.empty(3,dtype=int)
a.fill(7)
a
Numpy also offers the arange
function, which works like the builtin range
but returns an array instead of a list:
In [ ]:
np.arange(5)
But notice something important about how it is used:
In [ ]:
c = np.arange(2,16,2)
c
The max number in the range is exclusive because python numbering starts at zero.
Alternatively, the linspace
and logspace
functions create linearly and logarithmically-spaced sequences respectively, with a fixed number of points and including both ends of the specified interval:
In [ ]:
print "A linear sequence between 0 and 1 with 4 elements:"
print np.linspace(0, 1, 4)
In [ ]:
print "A logarithmic sequence between 10**1 and 10**3 with 4 elements:"
print np.logspace(1, 3, 4)
It is often useful to create arrays with random numbers that follow a specific distribution. The np.random
module contains a number of functions that can be used to this effect, for example this will produce an array of 5 random samples taken from a standard normal distribution, or "bell curve" with a mean (average) value of 0 and variance (width) of 1:
In [ ]:
np.random.randn(5)
whereas this will also give 5 samples, but from a normal distribution (bell curve) with a mean of 10 and a variance of 3:
In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10
How about making a few arrays for yourself?
In [ ]:
#Your code here
Use np.linspace
to place 5 floating point numbers in the range between 7. and 12.
Look at the documentation for np.arange
and use it to create a set of numbers between 7 and 12.
How are np.linspace
and np.arange
different?
In [ ]:
#Your code here
Numpy can create arrays of aribtrary dimensions. For example, a list of lists can be used to initialize a two dimensional array:
In [ ]:
lst2 = [[1, 2], [3, 4]]
arr2 = np.array([[1, 2], [3, 4]])
How are these different? Print them to see.
arr2
In [ ]:
lst2
With two-dimensional arrays we start seeing the power of NumPy. If we wanted to access the second element in the first list of lst2
, we would have to type lst2[0],[1]
. We do this because first we need to access the "zeroth" list (remember Python starts numbering elements at 0), and then the "first" element of that list. (Again, since the number 1 corresponds to the second item in a list.) Whereas, if we wanted to access the second element in the first row of arr2
, we simply tell it the row number and column number. Remembering that since NumPy starts numbering at 0, we would access the "zeroth" row and "first" element as [0,1]
.
While nested lists can be indexed using the [ ]
operator repeatedly, NumPy arrays support a much more natural indexing syntax with a single [ ]
and a set of indices separated by commas:
In [ ]:
print lst2[0][1]
print arr2[0,1]
Most of the array creation functions listed above can be used with more than one dimension, for example:
In [ ]:
np.zeros((2,3))
The shape of an array can be changed at any time, as long as the total number of elements is unchanged. For example, if we want a 2x4 array with numbers increasing from 0, the easiest way to create it is:
In [ ]:
arr = np.arange(8).reshape(2, 4)
print arr
But note that reshaping (like most NumPy operations) provides a view of the same memory:
In [ ]:
arr = np.arange(8)
arr2 = arr.reshape(2, 4)
arr[0] = 1000
print arr
print arr2
This lack of copying allows for very efficient vectorized operations.
With multidimensional arrays, you can also use slices, and you can mix and match slices and single indices in the different dimensions:
In [ ]:
print 'Slicing in the second row:', arr2[1, 2:4]
print 'All rows, third column :', arr2[:, 2]
In [ ]:
print 'First row: ', arr2[0]
print 'Second row: ', arr2[1]
In [ ]:
#Print some properties of the array arr2
print 'Data type :', arr2.dtype
print 'Total number of elements :', arr2.size
print 'Number of dimensions :', arr2.ndim
print 'Shape (dimensionality) :', arr2.shape
print 'Memory used (in bytes) :', arr2.nbytes
In [ ]:
#Print some useful information that the arrays can calculate for us
print 'Minimum and maximum :', arr2.min(), arr2.max()
print 'Sum and product of all elements :', arr2.sum(), arr2.prod()
print 'Mean and standard deviation :', arr2.mean(), arr2.std()
It's also possible to do the computation along a single dimension, by passing the axis parameter; for example:
In [ ]:
print 'For the following array:\n', arr2
print 'The sum of elements along the rows is :', arr2.sum(axis=1)
print 'The sum of elements along the columns is :', arr2.sum(axis=0)
As you can see in this example, the value of the axis
parameter is the dimension which will be consumed once the operation has been carried out. This is why to sum along the rows we use axis=1
.
Another widely used property of arrays is the .T
attribute, which allows you to access the transpose of the array:
In [ ]:
print 'Array:\n', arr2
print 'Transpose:\n', arr2.T
NumPy can also create some useful matrices:
In [ ]:
#identity matrix
c = np.eye(3)
c
In [ ]:
#diagonal matrix wth elements of an array
d = np.diag(np.array([1,2,3,4]))
d
In [ ]:
e = np.diag(np.linspace(0,1,6))
e
To access the elements of a multidimensional (in this case 2D) array:
In [ ]:
e[1,1]
In [ ]:
e[2,2]
Create the following arrays with correct data types:
[[1 1 1 1]
[1 1 1 1]
[1 1 1 2]
[1 6 1 1]]
[[0. 0. 0. 0. 0.]
[2. 0. 0. 0. 0.]
[0. 3. 0. 0. 0.]
[0. 0. 4. 0. 0.]
[0. 0. 0. 5. 0.]
[0. 0. 0. 0. 6.]]
You should be able to accomplish this in 3 lines of code for each one. The documentation for the diag
method may be helpful.
In [ ]:
#Your code here
Read the documentation for np.tile
and use this function to construct the following array:
[[4 3 4 3 4 3]
[2 1 2 1 2 1]
[4 3 4 3 4 3]
[2 1 2 1 2 1]]
In [ ]:
#Your code here
Print the shape, size, dimensions, minimum, maximum, and mean values of the array from exercise 4.
In [ ]:
#Your code here
To select a range of elements in the array, you can use slices, just like for Python lists:
In [ ]:
a = np.arange(10)
a
In [ ]:
a[:4]
In [ ]:
a[2:4]
In [ ]:
a[5:]
In [ ]:
a[:]
You can also give the step size:
In [ ]:
a[2:9:3]
Create an array of 25 values between 17. and 42. using np.linspace
. Print the result and then use a slice to print only every third value in the array.
In [40]:
#Your code here
Arrays allow for a more sophisticated kind of indexing which is very powerful: you can index an array with another array, and in particular with an array of boolean values. This is particularly useful to extract information from an array that matches a certain condition.
Consider for example that in the array norm10
we want to replace all values above 9 with the value 0. We can do so by first finding the mask that indicates where this condition is true or false:
In [ ]:
norm10 = np.random.normal(10, 3, 5)
norm10
In [ ]:
mask = norm10 > 9
mask
Now that we have this mask, we can use it to either read those values or to reset them to 0:
In [ ]:
print 'Values above 9:', norm10[mask]
In [ ]:
print 'Resetting all values above 9 to 0...'
norm10[mask] = 0
print norm10
Create a normal distribution of 8 numbers with mean 0 and variance 2. Then create a mask that selects only the negative values in the array. Print the following:
Now use the mask to set all the negative values to equal 8 and print the array again.
In [ ]:
#Your code here
Arrays support all regular arithmetic operators, and the NumPy library also contains a complete collection of basic mathematical functions that operate on arrays. It is important to remember that in general, all operations with arrays are applied element-wise, i.e., are applied to all the elements of the array at the same time. Consider for example:
In [ ]:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print arr1, '+', arr2, '=', arr1+arr2
Importantly, you must remember that even the multiplication operator is by default applied element-wise, it is not the matrix multiplication from linear algebra (as is the case in Matlab, for example):
In [ ]:
print arr1, '*', arr2, '=', arr1*arr2
We may also multiply an array by a scalar:
In [ ]:
1.5 * arr1
NumPy ships with a full complement of mathematical functions that work on entire arrays, including logarithms, exponentials, trigonometric and hyperbolic trigonometric functions, etc. The most important reason to learn these functions now is that they allow you to very quickly manipulate arrays for mathematical calculations.
For example, sampling the sine function at 100 points between 0 and $2\pi$ is as simple as:
In [ ]:
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
These are examples of Universal Functions or ufuncs
in NumPy. They operate element-wise on an array. We've already seen examples of these in the various arithmetic operations:
In [ ]:
x = np.random.random(4)
print x
print x + 1 # add 1 to each element of x
In [ ]:
print x * 2 # multiply each element of x by 2
In [ ]:
print x * x # multiply each element of x by itself
In [ ]:
print x[1:] - x[:-1]
These are binary ufuncs
: they take two arguments. There are also unary ufuncs
:
In [ ]:
-x
In [ ]:
np.sin(x)
Ufuncs are very fast. If you wanted to create a very large array and then add 1 to every element in the array, you could write a for loop
to do it element by element, but it would take a long time to complete compared to what you can do with arrays. Here's an example, using the %%timeit
magic function to show how long the two methods take.
In [ ]:
x = np.random.random(10000)
In [ ]:
%%timeit
# compute element-wise x + 1 via a ufunc
y = np.zeros_like(x)
y = x + 1
In [ ]:
%%timeit
# compute element-wise x + 1 via a loop
y = np.zeros_like(x)
for i in range(len(x)):
y[i] = x[i] + 1
NumPy ufuncs
are faster than Python functions involving loops, because the looping happens in compiled code. This is only possible when types are known beforehand, which is why NumPy arrays must be typed.
np.sin
, np.cos
, etc.)scipy.special.j0
, scipy.special.gammaln
, etc.)np.minimum
, np.maximum
)ufuncs
In [ ]:
x = np.random.random(5)
print x
print np.minimum(x, 0.5)
print np.maximum(x, 0.5)
How is this different from min()
and max()
?
In [ ]:
print np.min(x)
print np.max(x)
We've only scratched the surface of what NumPy can do. It would be overwhelming to try to show you all of NumPy's capabilities in an introductory notebook, so we will end here with a final exercise. As you follow along in the subsequent notebooks, notice where and how NumPy arrays are used. You may have ideas about something you'd like to do with an array to solve a problem or complete an exercise. Check the NumPy documentation or Google for help with more advanced commands and syntax.
Let's practice some of the ideas we have learned with a simple exercise to create a prime number sieve. We will use the sieve to see what numbers between 0 and 100 are prime.
(a) First, we will construct an array of booleans called is prime with shape (100,)
, filled with True
in all the elements. Type the next line of code to create the array, then print it.
Don't copy/paste. Type it yourself.
is_prime = np.ones((100,), dtype=bool)
In [ ]:
#Your code here
(b) The index of each boolean element represents the number. “Cross out” 0
and 1
, which are not primes. You can either set them to False
or 0
(which python recognizes as equivalent to False
for boolean types) in the array is_prime
. Then print the array again to see what changed.
In [ ]:
#Your code here
(c) For each subsequent integer j
starting from 2
, cross out its higher multiples.
Again, type this code yourself. Don't copy/paste.
N_max = int(np.sqrt(len(is_prime)))
for j in range(2, N_max):
is_prime[2*j::j] = False
What the heck is this code doing? What does that slicing of the array mean? Try printing the values of j
, 2*j
, and is_prime
inside the loop to see what is happening at each step. Can you figure it out from the values at each iteration?
In [ ]:
#Your code here
(d) Look up the documentation for np.nonzero
(try help(np.nonzero)
or np.nonzero?
) and use it to print the prime numbers that are left in the array is_prime
.
In [ ]:
#Your code here
(e) Finally, combine the above code into a new function, called eratosthenes_sieve()
that takes one argument, maximum
, the maximum number to test for primes, and returns an array containing the prime numbers between 2
and maximum
.
The algorithm we just implemented was developed by Eratosthenes and is described on wikipedia.
Print the result for the test cases where maximum
= 10
and 100
.
Below is the scaffold for the function and the test cases you should implement.
def eratosthenes_sieve(maximum):
#your code here
return prime_array
print eratosthenes_sieve(10)
print eratosthenes_sieve(100)
In [ ]:
#Your code here