This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.
If you haven't done so already please read through the Introduction to this course, which covers:
This lesson covers:
In [ ]:
____ = 0
import os
import numpy as np
from codecs import decode
from numpy.testing import assert_almost_equal, assert_array_equal
from IPython.display import Image
Before we move on to look more at what numpy can do for us, let's first load the weather station data previously encountered in the Lesson 02 Notebook. This is necessary as the variables created in memory in one Notebook are not accessible in other Notebooks.
In [ ]:
data_dir = 'Weather_data'
csv_path = os.path.join(data_dir, 'Devonshire_Green_meteorological_data-preproc.csv')
csv_path
In [ ]:
dev_green = np.genfromtxt(csv_path,
delimiter=',',
skip_header=1)
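To see what this kind of call does without needing the weather station file, here is a minimal sketch that reads a tiny CSV held in memory. The column names and values are made up for illustration and are not taken from the real dataset.

```python
import io
import numpy as np

# A tiny, made-up CSV held in memory (hypothetical columns and values,
# not taken from the weather station file).
csv_text = "month,ozone,nitrogen_dioxide\n1,40.5,12.0\n2,,15.5\n"

# io.StringIO lets us treat the string as if it were an open file
toy_data = np.genfromtxt(io.StringIO(csv_text),
                         delimiter=',',
                         skip_header=1)

print(toy_data.shape)   # two data rows, three columns
print(toy_data)         # the empty ozone field on the second row is read as nan
```

The header row is skipped, and the missing value becomes NaN rather than causing an error.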
Something not explicitly mentioned in Lesson 02 is how Python finds the CSV file that we want to load.
When we want Python to read from or write to a file we can provide it with
Here we construct a relative path from the name of the file we want to read and the directory the file resides in. This path is relative to Python's current working directory, which is currently the directory we started the Notebook session from. This directory is:
In [ ]:
os.getcwd()
The function os.path.join joins together the directory name(s) and/or file name supplied as arguments to construct a path (of the 'string' type) by interspersing those arguments with the appropriate path separator character for the operating system. The path separator character differs between Windows (\) and OS X/Linux (/), so using os.path.join (rather than manually inserting \ or / characters into a long string) means that the same code can be run on any operating system.
The os package includes many other useful functions for manipulating files/directories, and the os.path module (a component of os) contains functions for checking and manipulating paths.
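As a rough illustration of the kind of functions os.path offers, the sketch below builds a path and then inspects it. The directory and file names are placeholders invented for this example; nothing needs to exist at that location for the code to run.

```python
import os

# 'some_dir', 'sub_dir' and 'some_file.csv' are placeholder names for illustration only
example_path = os.path.join('some_dir', 'sub_dir', 'some_file.csv')
print(example_path)

print(os.path.basename(example_path))   # final component of the path
print(os.path.dirname(example_path))    # everything before the final component
print(os.path.splitext(example_path))   # path split into (root, file extension)
print(os.path.isabs(example_path))      # is this an absolute path?
print(os.path.exists(example_path))     # does anything exist at this path?
print(os.path.abspath(example_path))    # absolute version, built from the current working directory
```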
Image displays an image file in a Notebook. It accepts a path to an image as an argument. Use Image to display the file saved in the assets directory that contains an image relevant to the weather station dataset. To determine the name of that image you will first need to list the contents of the assets directory. An appropriate function is provided by the os module. Write your code so that it will run on any operating system.
Before we move on here's a quick reminder of the column headings in the .csv file (don't worry about how this does what it does). The order (specifically the indexes) of the column names will be useful for exercises later on.
In [ ]:
with open(csv_path, 'r') as my_open_file:
    for column_idx, column_name in enumerate(my_open_file.readline().split(',')):
        print(column_idx, column_name)
We have seen how we can extract portions of an ndarray using indexing expressions such as my_2d_array[5, 3], my_2d_array[10:500:5, 3] or my_2d_array[275:300, -5:].
There are two other key ways of indexing ndarrays that we need to know about.
We will often want to extract values from an ndarray (and create a new ndarray) where a condition is met e.g.:
Nitrogen dioxide column greater than 10.0
Modelled Temperature column
Ozone values where Month is May or September (i.e. select from one column based on a condition associated with another)
This can be achieved by creating a boolean ndarray (an array of True and False values) and then indexing with that array (instead of the start:end:step notation used in Lesson 02 for basic indexing).
For example, we know from Lesson 02 that we can determine if each element of an array is not NaN using the isnan function and the numpy negation operator ~ e.g. for the Ozone concentration column:
In [ ]:
~np.isnan(dev_green[:, 4])
We can then extract only the non-null Ozone concentrations using this boolean array:
In [ ]:
dev_green[~np.isnan(dev_green[:, 4]), 4].shape
which, as we might expect, contains fewer values than the original Ozone column:
In [ ]:
dev_green[:, 4].shape
Don't worry if you do not immediately find the indexing expression particularly readable/comprehensible. We could make it clearer by assigning both the (integer) column index and the created boolean arrays to variables with meaningful names:
In [ ]:
ozone_col_idx = 4
is_ozone_not_null = ~np.isnan(dev_green[:, ozone_col_idx])
dev_green[is_ozone_not_null, ozone_col_idx]
Here we have created a boolean ndarray, then indexed using that boolean array to extract certain rows from just the column of interest in dev_green.
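The same two steps can be tried on a small one-dimensional array where every value is visible. This is a minimal sketch using made-up readings rather than the weather station data.

```python
import numpy as np

# Made-up readings with two missing values
readings = np.array([12.5, np.nan, 7.0, np.nan, 30.25])

# Step 1: create a boolean array that is True wherever a reading is present
is_not_null = ~np.isnan(readings)
print(is_not_null)

# Step 2: index with the boolean array to keep only the True positions
valid_readings = readings[is_not_null]
print(valid_readings)

# Summing a boolean array counts its True values
print(is_not_null.sum())
```

The result has one element per True value in the boolean array, so its length matches the count printed on the last line.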
Let's look at another example. Our objective in this case is to extract just the Nitrogen dioxide values for December. Using numpy's unique function we can see that values in the Month column range from 1 to 12 (not 0 to 11), so December corresponds to 12:
In [ ]:
month_col_idx = 1
np.unique(dev_green[:, month_col_idx])
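We can now create a boolean array that is True for the December rows and use it to extract the corresponding Nitrogen dioxide values: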
In [ ]:
is_in_december = dev_green[:, month_col_idx] == 12
nitr_diox_col_idx = 6
dev_green[is_in_december, nitr_diox_col_idx].shape
Again, note how assigning temporary values to variables makes our code more readable.
When creating boolean arrays for indexing (or other) purposes we can use any of Python's standard comparison operators for testing for (in)equality and relative magnitude (some of which you will have already seen):
== -> equals
!= -> not equals
< -> less than
<= -> less than or equal to
> -> greater than
>= -> greater than or equal to
These can be used to compare an array with a single value e.g. my_array == 12 or to compare two arrays e.g. some_array > another_array. In the second case each pair of elements is compared. The two arrays must be of the same size. For example:
In [ ]:
some_array = np.array([1, 3, 5, 7])
another_array = np.array([2, 9, 5, 1])
some_array >= another_array
What is the highest Ozone value recorded in October, November or December (accurate to two decimal places)?
Hints:
Zbagu
gung lbh ner vagrerfgrq va sebz gubfr gung lbh ner abg?
In [ ]:
assert_almost_equal(____, 79.56, decimal=2)
One of the possible uses for boolean indexing given above was to select Ozone values where Month is May or September. You should hopefully now see how we could create a boolean array for Month being 5 and another boolean array for Month being 9, but how do we combine them?
There are several operators we can use for combining boolean ndarrays.
| -> or (| is the pipe character)
& -> and
~ -> not (i.e. negation; already encountered)
With the first two operators, every corresponding pair of values from the two boolean input arrays is compared to generate a value in the boolean output array.
In [ ]:
a = np.array([True, False, True, False])
b = np.array([True, True, False, False])
Now, evaluate each of the following expressions and look at how each pair of input elements (or single input element) relates to the corresponding output element.
a & b
b & a
a | b
b | a
~a
~b
Now let's assign two new boolean arrays to variables a and b:
In [ ]:
a = np.array([True, True, False, False, False])
b = np.array([False, True, False, False, True])
Given the results of evaluating the six expressions above can you predict what they will evaluate to now? Check to see if your predictions are correct.
Let's return to the problem of selecting Ozone values where the Month is May or September. We can solve this using the | operator by asking: select ozone samples where (corresponding month is May) or (corresponding month is September) (parentheses added for emphasis):
In [ ]:
is_sample_from_may = dev_green[:, month_col_idx] == 5
is_sample_from_sept = dev_green[:, month_col_idx] == 9
is_relevant_sample = is_sample_from_may | is_sample_from_sept
dev_green[is_relevant_sample, ozone_col_idx].shape
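We can write that in a more concise form using fewer variables, but note that we now need to wrap each argument to the | operator in parentheses for the expression to be valid. This is because Python evaluates | before ==, so without the parentheses the comparisons would not be carried out first: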
In [ ]:
is_relevant_sample = (dev_green[:, month_col_idx] == 9) | (dev_green[:, month_col_idx] == 5)
dev_green[is_relevant_sample, ozone_col_idx].shape
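To illustrate why the parentheses matter, here is a small sketch using a made-up array of integer month numbers (an assumption for this example; it does not use dev_green). It compares the parenthesised expression with the unparenthesised one and catches the error that the latter raises.

```python
import numpy as np

months = np.array([5, 9, 1, 12, 9])   # made-up integer month numbers

# With parentheses each comparison is evaluated first, then the results are combined with |
with_parens = (months == 9) | (months == 5)
print(with_parens)

# Without parentheses Python evaluates 9 | months first and is then left with
# the chained comparison months == (9 | months) == 5, which needs a single
# True/False answer from a whole array and so fails.
try:
    months == 9 | months == 5
    failed = False
except (ValueError, TypeError) as error:
    failed = True
    print('Without parentheses:', error)
```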
What is the mean Volatile PM2.5 during the period May to August (inclusive) when Ozone is above average (above its mean value)? Give the answer correct to two decimal places.
Hints:
In [ ]:
volatile_pm2_5_col_idx = -4
assert_almost_equal(____, 2.00, decimal=2)
Indexing using boolean operations and joining together boolean arrays using operators such as & are examples of what are called vectorised operations or vectorized code. This is where we use functions that operate on all elements of an array at once (e.g. my_array[my_array > 4]) rather than manually going through all elements of an array one by one, performing an operation on each in turn (although we will look at how to do this later on).
Vectorized operations are used a lot in Python, R and Matlab by data scientists. You will find that reading and writing expressions using vectorised operations becomes much easier over time.
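For comparison, the sketch below performs the same selection twice on a small made-up array: once as a vectorised expression and once by going through the elements one by one. The loop uses a list and a for statement, which are only a preview here.

```python
import numpy as np

my_array = np.array([1.0, 6.5, 3.2, 8.1, 4.0])   # made-up values

# Vectorised: the comparison and the selection each act on the whole array at once
vectorised_result = my_array[my_array > 4]

# Element by element: test each value in turn and collect those that pass
loop_result = []
for value in my_array:
    if value > 4:
        loop_result.append(value)

print(vectorised_result)
print(loop_result)
```

Both approaches select the same values; the vectorised form is shorter and, for large arrays, typically much faster.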
In [ ]:
heights_of_people = np.array([183.0, 167.2, 171.9, 180.1, 159.9])
We can extract the first three elements with the following familiar notation:
In [ ]:
heights_of_people[:3]
What if we want to select a non-contiguous set of values such as the first two and the last? This is where we need to index with an integer sequence:
In [ ]:
heights_of_people[[0, 1, 4]]
Here the expression [0, 1, 4] is a type of sequence called a list.
Lists are the most common way in Python to store an ordered sequence of values (of any type). They are built in to Python itself (i.e. are not provided by an additional package like numpy). Lists are very powerful and flexible but are not often used by data scientists for storing large datasets (with thousands or millions of elements) as they do not support vectorized operations (see above), only less computationally-efficient methods for querying/manipulating big datasets.
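A short sketch of what 'do not support vectorized operations' means in practice, using a few made-up heights stored both as a list and as an ndarray:

```python
import numpy as np

heights_list = [183.0, 167.2, 171.9]     # a plain Python list
heights_array = np.array(heights_list)   # the same values as an ndarray

# Multiplying a list repeats it; multiplying an array scales every element
print(heights_list * 2)
print(heights_array * 2)

# Comparing a list with a number is not supported...
try:
    heights_list > 170
    list_comparison_worked = True
except TypeError as error:
    list_comparison_worked = False
    print('List comparison failed:', error)

# ...whereas comparing an array with a number gives a boolean array
print(heights_array > 170)
```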
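Returning to indexing: we can mix and match these indexing approaches e.g. we could select elements from three specific rows in the first two columns of an array using:
my_matrix[[1, 10, -1], :2]
or could select values from two specific columns, using a boolean array to select certain rows. Here we index in two steps (rows first, then columns) because, when a boolean array and an integer sequence appear in the same pair of square brackets, numpy pairs them up element by element rather than taking every combination of selected row and column:
is_useful_row = another_matrix[:, 0] >= 33.3
another_matrix[is_useful_row][:, [13, 47]]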
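One caveat is worth checking when a boolean array and an integer sequence are used together inside a single pair of square brackets: numpy matches the two up element by element, so the expression fails unless the number of selected rows happens to be compatible with the number of listed columns. The sketch below uses a small made-up matrix (the values and column indexes are assumptions for illustration) and shows two ways of selecting every combination of chosen row and column: indexing in two steps, or using numpy's ix_ function.

```python
import numpy as np

another_matrix = np.arange(12).reshape(4, 3)   # made-up 4 x 3 matrix holding 0..11
is_useful_row = another_matrix[:, 0] >= 3      # True for the last three rows
column_idxs = [0, 2]

# Three selected rows cannot be paired up with two column indexes
try:
    another_matrix[is_useful_row, column_idxs]
    single_step_worked = True
except IndexError as error:
    single_step_worked = False
    print('Single-step indexing failed:', error)

# Option 1: select the rows first, then the columns
two_step = another_matrix[is_useful_row][:, column_idxs]

# Option 2: np.ix_ builds index arrays covering every row/column combination
using_ix = another_matrix[np.ix_(is_useful_row, column_idxs)]

print(two_step)
print(using_ix)
```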
In [ ]:
np.nanpercentile(dev_green[:, 4], 50)
We don't always only want to query datasets; sometimes we want to make changes to them. One may wish to:
We saw in Lesson 01 that after creating variables we can then assign new values to them. An example of this could be:
In [ ]:
weights = np.array([1.2, 1.2, 1.2, 4.0])
weights = np.array([2.0, 2.0, 1.9])
Here we assign an ndarray to a variable then assign another ndarray to that variable. Python is then clever enough to recognise that the array created by np.array([1.2, 1.2, 1.2, 4.0]) is no longer associated with any variables so frees up the memory corresponding to this four-element array.
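However, if we want to change only part of an array we can assign to an indexing expression, using either a single value or an array of the same size/shape as the part being replaced.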
For example, if we want to change only the first two elements of a one-dimensional array:
In [ ]:
youtube_video_likes = np.array([1024, 999, 712, 34])
youtube_video_likes[:2] = 1002
youtube_video_likes
Or we could assign using an array. Below we are multiplying all elements of an array by a single value then assigning the result to part of another array:
In [ ]:
youtube_video_likes[2:] = np.array([736, 39]) * 2
youtube_video_likes
As mentioned, if we are assigning using an array then we need to make sure that the expression on the right-hand side of the equals sign has the same size/shape as the expression on the left-hand side:
In [ ]:
youtube_video_likes[2:] = np.array([740, 43, 29])
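The cell above is expected to fail. As a minimal sketch of what happens, the code below repeats the mismatched assignment on a separate made-up array (named likes here so that it does not disturb youtube_video_likes), catches the resulting error, and then performs an assignment where the sizes do match.

```python
import numpy as np

likes = np.array([1024, 999, 712, 34])   # made-up values

# Two elements are selected on the left but three are supplied on the right
try:
    likes[2:] = np.array([740, 43, 29])
    assignment_worked = True
except ValueError as error:
    assignment_worked = False
    print('Assignment failed:', error)

# With matching sizes the assignment succeeds
likes[2:] = np.array([740, 43])
print(likes)
```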
In [ ]:
assert_almost_equal(____, 45.262, decimal=3)
We've seen how we can set parts of arrays to specific values. How can we add to or subtract from the values in part of or all of an array?
A cumbersome way would be to feature the same expression on the left and right sides of the equals sign e.g.:
In [ ]:
youtube_video_likes[2:] = youtube_video_likes[2:] + 4
youtube_video_likes
There is a more concise way of doing this:
In [ ]:
youtube_video_likes[2:] += 4
youtube_video_likes
Here the += operator means: 'take the thing on the left, determine its value (which might be an array), add four to that value, then assign the result back to the thing on the left'. There are also similar operators for subtraction (-=), multiplication (*=) and division (/=).
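Here is a brief sketch of these operators applied to made-up arrays. It also shows one thing to watch out for: dividing an integer array in place fails, because the floating-point results cannot be stored back into an array of integers.

```python
import numpy as np

scores = np.array([10.0, 20.0, 30.0, 40.0])   # made-up floating-point values

scores[:2] -= 1.0   # subtract 1 from the first two elements
scores[2:] *= 2     # double the last two elements
scores /= 4         # divide every element by 4
print(scores)

counts = np.array([10, 20, 30])   # made-up integer values
try:
    counts /= 4
    int_division_worked = True
except TypeError as error:
    int_division_worked = False
    print('In-place division of an integer array failed:', error)
```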
Warning: take care if you first store a selected part of an array in a new variable and then update that variable in place. The update is passed back to the original array if the part was selected using basic indexing, but not if it was selected using boolean array indexing or integer sequence indexing. We'll now see why.
In [ ]:
tree_ages = np.array([34, 56, 60, 72.5, 86, 92])
two_oldest_trees = tree_ages[-2:]
two_oldest_trees
In [ ]:
two_oldest_trees += 5
two_oldest_trees
So the operation has increased the values in two_oldest_trees but...
In [ ]:
tree_ages
...it has also affected tree_ages! This is because indexing using a single integer or range notation creates a view of part of an array. Any changes to the view's elements are reflected in the array that the view was created from.
Let's look at a contrasting example.
In [ ]:
tree_ages = np.array([34, 56, 60, 72.5, 86, 92])
trees_older_than_59 = tree_ages[tree_ages > 59]
trees_older_than_59 -= 4.2
trees_older_than_59
In [ ]:
tree_ages
So here trees_older_than_59 was updated but tree_ages was not. This is because numpy's 'advanced indexing' (using boolean arrays and/or integer sequences) creates a copy of part of an array. Any changes to the copy's elements are not reflected in the array the copy was created from.
In [ ]:
tree_ages = np.array([34, 56, 60, 72.5, 86, 92])
two_oldest_trees = tree_ages[-2:]
two_oldest_trees.base is tree_ages
If the third line evaluates to True then two_oldest_trees is a view of part of tree_ages. If it evaluates to False instead then two_oldest_trees is not a view of part of tree_ages.
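As a complementary check, numpy also provides a shares_memory function that reports whether two arrays use any of the same underlying memory. The sketch below applies it to a basic-indexing selection, a boolean-indexing selection and an explicit copy; the variable names other than tree_ages are chosen here for illustration.

```python
import numpy as np

tree_ages = np.array([34, 56, 60, 72.5, 86, 92])

basic_selection = tree_ages[-2:]                # basic indexing
boolean_selection = tree_ages[tree_ages > 59]   # boolean array indexing
explicit_copy = tree_ages[-2:].copy()           # basic indexing followed by an explicit copy

print(basic_selection.base is tree_ages)                 # is this a view of tree_ages?
print(np.shares_memory(basic_selection, tree_ages))      # the view shares memory
print(np.shares_memory(boolean_selection, tree_ages))    # the copy does not
print(np.shares_memory(explicit_copy, tree_ages))        # neither does the explicit copy
```

Calling copy on a view is a simple way to get an independent array when changes should not be passed back to the original.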
In [ ]:
tree_ages_backup = np.copy(tree_ages)
Predict which of the following expressions return a view of part of dev_green and which return a copy, then use expression.base is dev_green (see above for why) to check your predictions:
dev_green[:, 4:]
dev_green[:, [1, 4, 5]]
dev_green[5:-5:2, 5::-1]
dev_green[dev_green[:, 6] > np.nanmedian(dev_green[:, 7])]