Introduction to Python for Data science: 03 - SOME TITLE

This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.

If you haven't done so already please read through the Introduction to this course, which covers:

  1. What Python is and why it is of interest;
  2. Learning outcomes for the course;
  3. The course structure and support facilities;
  4. An introduction to Jupyter Notebooks;
  5. Information on course exercises.

This lesson covers:

TODO: FINISH TOC

Lesson setup code

Run the following Notebook cell every time you load this lesson (but do not edit it). Don't be concerned with what this code does at this stage.


In [ ]:
____ = 0
import os
import numpy as np
from codecs import decode
from numpy.testing import assert_almost_equal, assert_array_equal
from IPython.display import Image

File paths and the current working directory

Before we move on to look more at what numpy can do for us, let's first load the weather station data previously encountered in the Lesson 02 Notebook. This is necessary as the variables created in memory in one Notebook are not accessible in other Notebooks.


In [ ]:
data_dir = 'Weather_data'
csv_path = os.path.join(data_dir, 'Devonshire_Green_meteorological_data-preproc.csv')
csv_path

In [ ]:
dev_green = np.genfromtxt(csv_path, 
                          delimiter=',', 
                          skip_header=1)

Something not explicitly mentioned in Lesson 02 is how Python finds the CSV file that we want to load.

When we want Python to read from or write to a file we can provide it with

  • an absolute path (fully specifies where the file is in the filesystem) or
  • a relative path (specifies the file's location relative to a particular directory).

Here we construct a relative path from the name of the file we want to read and the directory the file resides in. This path is relative to Python's current working directory, which here is the directory we started the Notebook session from. This directory is:


In [ ]:
os.getcwd()
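
As an aside: if you ever want to see the absolute form of a relative path you can use os.path.abspath, which resolves it against the current working directory (the exact output will depend on where you started your Notebook session from):


In [ ]:
os.path.abspath(csv_path)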

The function os.path.join joins together directory name(s) and/or file name supplied as arguments to construct a path (of the 'string' type) by interspersing those arguments with the appropriate path separator character for the operating system. The path separator character differs between Windows (\) and OS X/Linux (/), so using os.path.join (rather than manually inserting \ or / characters into a long string) means that the same code can be run on any operating system.
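
For example, here is a small, self-contained illustration (the directory and file names are made up):


In [ ]:
os.path.join('results', '2019', 'summary.csv')  # 'results/2019/summary.csv' on OS X/Linux; backslashes are used on Windows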

The os package includes many other useful functions for manipulating files/directories and the os.path module (a component of os) contains functions for checking and manipulating paths.


Exercise

Image displays an image file in a Notebook. It accepts a path to an image as an argument. Use Image to display the file saved in the assets directory that contains an image relevant to the weather station dataset. To determine the name of that image you will first need to list the contents of the assets directory. An appropriate function is provided by the os module. Write your code so that it will run on any operating system.


Before we move on here's a quick reminder of the column headings in the .csv file (don't worry about how this does what it does). The order (specifically the indexes) of the column names will be useful for exercises later on.


In [ ]:
with open(csv_path, 'r') as my_open_file:
    for column_idx, column_name in enumerate(my_open_file.readline().split(',')):
        print(column_idx, column_name)

Advanced indexing

We have seen how we can extract portions of an ndarray when using:

  • single index values, e.g. my_2d_array[5, 3];
  • a range of index values, possibly with an increment, e.g. my_2d_array[10:500:5, 3] or my_2d_array[275:300, -5:].

There are two other key ways of indexing ndarrays that we need to know about.

Indexing with a boolean sequence

We will often want to extract values from an ndarray (and create a new ndarray) where a condition is met, e.g.:

  • Select all values in the Nitrogen dioxide column greater than 10.0
  • Select all even values in an integer column that are greater than 0 and less than 100
  • Select all non-null values in the Modelled Temperature column
  • Select Ozone values where Month is May or September (i.e. select from one column based on a condition associated with another)

This can be achieved by:

  1. Creating a boolean ndarray (an array of True and False values)
  2. Using this as an index to extract values from an array and create a new array (as opposed to indexing using the start:end:increment notation used in Lesson 02 for basic indexing)

For example, we know from Lesson 02 that we can determine if each element of an array is not NaN using the isnan function and the numpy negation operator ~ e.g. for the Ozone concentration column:


In [ ]:
~np.isnan(dev_green[:, 4])

We can then extract only the non-null Ozone concentrations using this boolean array:


In [ ]:
dev_green[~np.isnan(dev_green[:, 4]), 4].shape

which, as we might expect, contains fewer values than the original Ozone column:


In [ ]:
dev_green[:, 4].shape

Don't worry if you do not immediately find the indexing expression particularly readable/comprehensible. We could make it clearer by assigning both the (integer) column index and the created boolean array to variables with meaningful names:


In [ ]:
ozone_col_idx = 4

is_ozone_not_null = ~np.isnan(dev_green[:, ozone_col_idx])

dev_green[is_ozone_not_null, ozone_col_idx]

Here we have:

  • Identified a column of interest
  • Determined which values in just that column are not null
  • Assigned the resulting boolean array to a variable
  • Created a new ndarray by using that boolean array to extract certain rows from just the column of interest in dev_green

Let's look at another example. Our objective in this case is to extract just the Nitrogen dioxide values for December. Using numpy's unique function we can see that values in the Month column range from 1 to 12 (not 0 to 11), so December corresponds to 12:


In [ ]:
month_col_idx = 1

np.unique(dev_green[:, month_col_idx])

We can now

  • create a boolean index that specifies whether each air quality sample was taken in December
  • use that to index the Nitrogen dioxide column

In [ ]:
is_in_december = dev_green[:, month_col_idx] == 12

nitr_diox_col_idx = 6

dev_green[is_in_december, nitr_diox_col_idx].shape

Again, note how assigning temporary values to variables makes our code more readable.

When creating boolean arrays for indexing (or other) purposes we can use any of Python's standard comparison operators for testing for (in)equality and relative magnitude (some of which you will have already seen):

  • == -> equals
  • != -> not equals
  • < -> less than
  • <= -> less than or equal to
  • > -> greater than
  • >= -> greater than or equal to

These can be used to compare

  • an array and a single value e.g. my_array == 12 or
  • two arrays e.g. some_array > another_array.

In the second case each corresponding pair of elements is compared. The two arrays must be of the same size. For example:


In [ ]:
some_array = np.array([1, 3, 5, 7])
another_array = np.array([2, 9, 5, 1])

some_array >= another_array
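
For the first kind of comparison (an array against a single value), here is a minimal illustration reusing some_array from the cell above:


In [ ]:
some_array > 4  # each element is compared with 4: False, False, True, True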

Exercise

What is the highest Ozone value recorded in October, November or December (accurate to two decimal places)?

Hints:

  • Juvpu bs gur fvk bcrengbef yvfgrq nobir nyybj lbh gb qvfgvathvfu gur inyhrf va gur Zbagu gung lbh ner vagrerfgrq va sebz gubfr gung lbh ner abg?
  • Jurer zvtug lbh ybbx sbe n shapgvba gung ergheaf gur uvturfg inyhr va na neenl?

In [ ]:
assert_almost_equal(____, 79.56, decimal=2)

Combining boolean arrays

One of the possible uses for boolean indexing given above was to select Ozone values where Month is May or September. You should hopefully now see how we could create a boolean array for Month being 5 and another boolean array for Month being 9 but how do we combine them?

There are several operators we can use for combining boolean ndarrays.

  • | -> or (| is the pipe character)
  • & -> and
  • ~ -> not (i.e. negation; already encountered)

With the first two operators, each corresponding pair of values from the two boolean input arrays is combined to generate a value in the boolean output array.


Exercise

How do the | and & boolean operators combine values? First, let's generate two simple boolean arrays:


In [ ]:
a = np.array([True, False, True, False])
b = np.array([True, True, False, False])

Now, evaluate each of the following expressions and look at how each pair of input elements (or single input element) relates to the corresponding output element.

  • a & b
  • b & a
  • a | b
  • b | a
  • ~a
  • ~b

Now let's assign two new boolean arrays to variables a and b:


In [ ]:
a = np.array([True, True, False, False, False])
b = np.array([False, True, False, False, True])

Given the results of evaluating the six expressions above can you predict what they will evaluate to now? Check to see if your predictions are correct.


Let's return to the problem of selecting Ozone values where the Month is May or September. We can solve this using the | operator by asking for ozone samples where (the corresponding month is May) or (the corresponding month is September) (parentheses added for emphasis):


In [ ]:
is_sample_from_may = dev_green[:, month_col_idx] == 5
is_sample_from_sept = dev_green[:, month_col_idx] == 9
is_relevant_sample = is_sample_from_may | is_sample_from_sept
dev_green[is_relevant_sample, ozone_col_idx].shape

We can write that in a more concise form using fewer variables, but note that we now need to wrap the arguments to the | operator in parentheses for the expression to be valid:


In [ ]:
is_relevant_sample = (dev_green[:, month_col_idx] == 9) | (dev_green[:, month_col_idx] == 5)
dev_green[is_relevant_sample, ozone_col_idx].shape

Exercise

What is the mean Volatile PM2.5 during the period May to August (inclusive) when Ozone is above average (above its mean value)? Give the answer correct to two decimal places.

Hints:

  • Lbh pna fryrpg fnzcyrf sebz gur enatr bs zbaguf lbh ner vagrerfgrq va ol pbzovavat gjb obbyrna neenlf. Juvpu pbzcneevfba bcrengbef naq juvpu obbyrna neenl-pbzovavat bcrengbe qb lbh arrq gb qb guvf?
  • Gb vqragvsl fnzcyrf jurer bmbar pbapragengvba vf nobir gur zrna lbh svefg arrq gb pnyphyngr gung zrna.
  • Lbh hygvzngryl arrq gb pbzovar n obbyrna neenl gung fgngrf jurgure n fnzcyr jnf gnxra qhevat n eryrinag zbagu naq nabgure obbyrna neenl gung fgngrf jurgure bmbar rkprrqf n fcrpvsvrq inyhr. Juvpu obbyrna neenl pbzovangvba bcrengbe qb lbh arrq sbe guvf?

In [ ]:
volatile_pm2_5_col_idx = -4



assert_almost_equal(____, 2.00, decimal=2)

Tip: vectorised code/operations

Indexing using boolean arrays and combining boolean arrays using operators such as & are examples of what are called vectorised operations or vectorised code. This is where we use functions and operators that act on all elements of an array at once (e.g. my_array[my_array > 4]) rather than manually going through all elements of an array one by one, performing an operation on each in turn (although we will look at how to do this later on).

Vectorised operations are used a lot in Python, R and Matlab by data scientists. You will find that reading and writing expressions using vectorised operations becomes much easier over time.
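
As a small, concrete version of the my_array expression mentioned above (with made-up values):


In [ ]:
my_array = np.array([3, 7, 1, 9])
my_array[my_array > 4]  # selects every element greater than 4 in one vectorised step: array([7, 9])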

Indexing with an integer sequence

Another form of 'advanced indexing' in numpy is selecting non-contiguous values from an ndarray by indexing with a sequence of integers (as opposed to a range (start:end:increment) or boolean array). Let's look at a quick example. Say we have the following array:


In [ ]:
heights_of_people = np.array([183.0, 167.2, 171.9, 180.1, 159.9])

We can extract the first three elements with the following familiar notation:


In [ ]:
heights_of_people[:3]

What if we want to select a non-contiguous set of values such as the first two and the last? This is where we need to index with an integer sequence:


In [ ]:
heights_of_people[[0, 1, 4]]

Here the expression [0, 1, 4] is a type of sequence called a list.


Tip: Lists

Lists are the most common way in Python to store an ordered sequence of values (of any type). They are built in to Python itself (i.e. they are not provided by an additional package like numpy). Lists are very powerful and flexible but are not often used by data scientists for storing large datasets (with thousands or millions of elements) as they do not support vectorised operations (see above), only less computationally-efficient methods for querying/manipulating big datasets.
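
As a brief, illustrative sketch (the values are invented): a list is written with square brackets and, unlike an ndarray, can hold values of different types:


In [ ]:
station_notes = ['Devonshire Green', 2019, 3.7, True]  # a string, an integer, a float and a boolean
station_notes[0]  # indexing a single element works much as it does for ndarrays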

TODO: ADD NOTE RE REVISITING LISTS LATER


Returning to indexing: we can mix and match

  • indexing with a single index value or a range
  • indexing with a boolean array
  • indexing with an integer sequence

e.g. we could select elements from three specific rows in the first two columns of an array using:

my_matrix[[1, 10, -1], :2]

or we could select values from two specific columns, using a boolean array to select certain rows (note that the two indexing operations are applied one after the other here):

is_useful_row = another_matrix[:, 0] >= 33.3
another_matrix[is_useful_row][:, [13, 47]]

The following cell demonstrates numpy's nanpercentile function, which computes a percentile of an array while ignoring NaN values (the 50th percentile is the median). It may prove useful in the exercise below.


In [ ]:
np.nanpercentile(dev_green[:, 4], 50)

Exercise

Create a new ndarray containing just the Ozone and Modelled Temperature values for samples where the Modelled Temperature is greater than its median value. You'll most likely need to use two types of indexing to achieve this.


Mutating numpy arrays

We don't always just want to query datasets; sometimes we want to make changes to them. We may wish to:

  • Replace missing (NaN) values with a default value
  • Replace all values greater than a threshold with that threshold
  • Update specific values by applying an arithmetic operation, e.g. to convert some data from imperial to metric units

We saw in Lesson 01 that after creating variables we can then assign new values to them. An example of this could be:


In [ ]:
weights = np.array([1.2, 1.2, 1.2, 4.0])
weights = np.array([2.0, 2.0, 1.9])

Here we assign an ndarray to a variable, then assign another ndarray to that variable. Python is then clever enough to recognise that the array created by np.array([1.2, 1.2, 1.2, 4.0]) is no longer associated with any variable, so it frees up the memory corresponding to this four-element array.

However, if we want to change only part of an array we can:

  1. Use one of the indexing methods we learned about previously to select a subset of elements
  2. Assign values to those elements: either
    • assign a single value to all selected elements or
    • assign an array of the same size/shape as the selected subset

For example, if we want to change only the first two elements of a one-dimensional array:


In [ ]:
youtube_video_likes = np.array([1024, 999, 712, 34])
youtube_video_likes[:2] = 1002
youtube_video_likes

Or we could assign using an array. Below we are multiplying all elements of an array by a single value then assigning the result to part of another array:


In [ ]:
youtube_video_likes[2:] = np.array([736, 39]) * 2
youtube_video_likes

As mentioned, if we are assigning using an array then we need to make sure that the expression on the right-hand side of the equals sign has the same size/shape as the expression on the left-hand side, otherwise we get an error (as the following cell demonstrates):


In [ ]:
youtube_video_likes[2:] = np.array([740, 43, 29])

Exercise

Increase the last 400 Ozone values in dev_green by 2.6 to compensate for perceived sensor error, then find the new mean of all Ozone values (correct to 3 decimal places):


In [ ]:
assert_almost_equal(____, 45.262, decimal=3)

Adding to or subtracting from the values of an array

We've seen how we can set parts of arrays to specific values. How can we add to or subtract from the values in part of or all of an array?

A cumbersome way would be to repeat the same expression on the left and right sides of the equals sign, e.g.:


In [ ]:
youtube_video_likes[2:] = youtube_video_likes[2:] + 4
youtube_video_likes

There is a more concise way of doing this:


In [ ]:
youtube_video_likes[2:] += 4
youtube_video_likes

Here the += operator means: 'take the thing on the left, determine its value (which might be an array), add four to that value, then assign the result back to the thing on the left'. There are also similar operators for

  • subtraction: -=
  • multiplication: *=
  • division: /=
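
For instance, a minimal sketch using a throwaway array:


In [ ]:
example_scores = np.array([10.0, 20.0, 30.0])
example_scores[1:] *= 2  # double the last two values in place
example_scores           # array([10., 40., 60.])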

Warning: be careful if you first assign part of an array to a new variable and then use one of these operators on that variable: whether the original array is also modified depends on how that part was selected. We'll now see why.

Creating views or copies of arrays

There's something that we need to be mindful of when assigning to arrays.

Say we assign part of an array, selected using the range notation (start:end:increment), to a new variable:


In [ ]:
tree_ages = np.array([34, 56, 60, 72.5, 86, 92])
two_oldest_trees = tree_ages[-2:]
two_oldest_trees

In [ ]:
two_oldest_trees += 5
two_oldest_trees

So the operation has increased the values in two_oldest_trees but...


In [ ]:
tree_ages

...it has also affected tree_ages! This is because indexing using a single integer or range notation creates a view of part of an array. Any changes to the view's elements are reflected in the array that the view was created from.

Let's look at a contrasting example.


In [ ]:
tree_ages = np.array([34, 56, 60, 72.5, 86, 92])
trees_older_than_59 = tree_ages[tree_ages > 59]
trees_older_than_59 -= 4.2
trees_older_than_59

In [ ]:
tree_ages

So here trees_older_than_59 was updated but tree_ages was not. This is because numpy's 'advanced indexing' (using boolean arrays and/or integer sequences) creates a copy of part of an array. Any changes to the copy's elements are not reflected in the array the copy was created from.


Tip: Checking to see if an operation has returned a view or copy

Forgetting or not knowing when an operation will return a view or copy of an array is a common source of bugs when using numpy!

You can check to see whether you have created a view like this:


In [ ]:
tree_ages = np.array([34, 56, 60, 72.5, 86, 92])
two_oldest_trees = tree_ages[-2:]

two_oldest_trees.base is tree_ages

If the final line evaluates to False instead then two_oldest_trees is not a view of part of tree_ages.
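
As a contrasting sketch, here is the equivalent check after indexing with an integer sequence (which, as described above, creates a copy):


In [ ]:
two_oldest_trees_copy = tree_ages[[-2, -1]]

two_oldest_trees_copy.base is tree_ages  # False: not a view of tree_ages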


Tip: forcing the creation of a copy

If you want to definitely create a copy of an array (for example to experiment with making changes to a dataset whilst retaining a backup) then you can use numpy's copy function:


In [ ]:
tree_ages_backup = np.copy(tree_ages)
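
We can quickly confirm that the backup is independent of the original array (a small sketch continuing the example above):


In [ ]:
tree_ages_backup[0] = 0.0

tree_ages  # unchanged: modifying the backup does not affect tree_ages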

Exercise

Predict which of the following expressions return a view of part of dev_green and which return a copy, then use expression.base is dev_green (see above for why) to check your predictions:

  • dev_green[:, 4:]
  • dev_green[:, [1, 4, 5]]
  • dev_green[5:-5:2, 5::-1]
  • dev_green[dev_green[:, 6] > np.nanmedian(dev_green[:, 7])]