Introduction to Python for Data Science: 02 - Loading and manipulating data using numpy

This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.

If you haven't done so already please read through the Introduction to this course, which covers:

  1. What Python is and why it is of interest;
  2. Learning outcomes for the course;
  3. The course structure and support facilities;
  4. An introduction to Jupyter Notebooks;
  5. Information on course exercises.

This lesson covers:

  1. Loading data using numpy;
  2. Querying tabular data stored in numpy n-dimensional arrays (ndarrays);
  3. Extracting single values, ranges, rows and columns from ndarrays;
  4. Missing data and statistical summaries;
  5. Plotting data and comments.

Lesson setup code

Run the following Notebook cell every time you load this lesson (but do not edit it). Don't be concerned with what this code does at this stage.


In [ ]:
____ = 0
import os
import numpy as np
from codecs import decode
from numpy.testing import assert_almost_equal, assert_array_equal

Loading data: numpy and pandas

Let's now jump into looking at using Python for doing something useful: exploring some air quality data.

Other languages such as R come with example datasets to allow us to quickly begin experimenting with data analysis techniques and plotting tools. This is not the case with Python. However, this is not an issue as the numpy and pandas packages provide functions for loading datasets from many different sources.

pandas allows us to load (typically tabular) data from sources including CSV files, Excel spreadsheets, JSON files and SQL databases.

numpy allows us to load numerical data from plain text files (such as CSV files) and from numpy's own binary file formats.
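As a quick, self-contained taste of loading text data with numpy, here is a minimal sketch using an invented in-memory CSV string rather than a real file (the column names and values are made up purely for illustration):

```python
import io
import numpy as np

# An invented CSV with a header row, mimicking the format we meet later
csv_text = "year,month,temp\n2020,1,4.5\n2020,2,5.1\n"

# genfromtxt accepts any file-like object, so we can wrap the string
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

print(data.shape)   # (2, 3): two rows of data, three columns
print(data[0, 2])   # 4.5: first row, third column
```

We will meet genfromtxt again shortly, applied to a real file on disk.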


The functionality offered by pandas for loading data is far more powerful and flexible than that offered by numpy but we are going to start by exploring numpy as it is simpler and learning about pandas later on should then be easier.

The data we are going to load and explore to help us learn the basics of numpy is from a weather station that is situated very near the University on Devonshire Green in Sheffield. See this page on the UK Department for Environment, Food and Rural Affairs' site for a map and photos of the site. Many environmental parameters such as air quality, temperature and wind speed are recorded at this site on a regular basis.

We're wanting to load a table of historic data captured at this weather station. The data is stored in

Weather_data/Devonshire_Green_meteorological_data-preproc.csv

(The associated metadata is in Devonshire_Green_meteorological_metadata-preproc.json; only provided for interest).

Let's crudely print out the first four lines of the file. You do not have to understand the following; it is just to show you the format of the file. However, be aware that this does not load the data from the file in a meaningful way; it simply and dumbly prints out lines of characters.


In [ ]:
data_dir = 'Weather_data'
csv_path = os.path.join(data_dir, 'Devonshire_Green_meteorological_data-preproc.csv')

with open(csv_path, 'r') as csv_file:
    for (line_num, line) in enumerate(csv_file):
        if line_num >= 4: 
            break
        print(line)

You will see that

  • The first row of the file is column headings;
  • Columns are separated by commas;
  • Rows 2-4 contain a set of measurements taken at a specified time;
  • All lines have the same number of columns.

Let's load that file in Python in a way that recognises the format of the data (headings and data arranged in columns):


In [ ]:
dev_green = np.genfromtxt(csv_path, 
                          delimiter=',', 
                          skip_header=1)

Try explaining these two lines of code to yourself in terms of functions, arguments and assignments. Also, what do you think the purpose of delimiter and skip_header are? Don't worry if you cannot answer the second of those questions just yet and don't worry about exactly what genfromtxt does: that will become clear shortly.

Note that here we have split a single long statement over two lines to make the code easier to read. You can split a statement over multiple lines like this if you introduce line breaks between a pair of parentheses. Alternatively, if you want to split a line outside of any parentheses you must include a backslash (\) at the end of all but the last line to tell Python that the line is to be continued e.g.


In [ ]:
z = 45 + \
    36 + \
    27

Querying tabular data stored in Numpy n-dimensional arrays (ndarrays)

So, what has been assigned to the dev_green variable? What have we created? Let's ask IPython to generate a preview of it:


In [ ]:
dev_green

It's not a single number but a matrix of numbers (written in scientific notation) with many rows and columns.

More formally, this is a numpy ndarray, this being an $n$-dimensional array of values. For us $n$=2 i.e. dev_green is a 2D matrix. The '...' indicate that the matrix contains more rows and columns than have been displayed.


Exercise

How big is our dataset? You can determine the number of rows and columns of dev_green by passing that ndarray variable as a single argument to the np.shape function. Fill in the blank below to make the assert statement valid.


In [ ]:
assert(____ == (8760, 17))

The shape function has returned not one but two numbers:

  1. the number of rows (8760): here the number of (hopefully distinct) times at which sets of measurements were taken;
  2. the number of columns (17): here the number of columns required to represent the measurement time (4 columns) plus the number of different measurements that could be taken at each moment in time (13 columns).

ndarrays must always be rectangular. Another requirement is that every element in an ndarray must have the same data type e.g. every value must be an integer or every value must be a 'float' (decimal).
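A quick illustration of the single-data-type rule, using a tiny invented array: mixing one float in with integers silently promotes every element to float.

```python
import numpy as np

# One float in the input forces the whole ndarray to be floats
mixed = np.array([[1, 2], [3.5, 4]])

print(mixed.dtype)   # float64
print(mixed[0, 0])   # 1.0, not the integer 1
```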

We can also quickly determine the number of elements in an ndarray using


In [ ]:
np.size(dev_green)

Extracting single values from numpy ndarrays

We can view both single elements and also 1D or 2D slices of our ndarray using expressions that look like this:

some_array[row_selector, column_selector]

How this works in practice should become clear after seeing some examples. To view just a single element:


In [ ]:
dev_green[0, 0]

Here we are extracting the value in the first row and first column of the ndarray i.e. the ndarray element with a row index of 0 and a column index of 0.

Important: Python, like many other programming languages, counts collections of values starting from an index of 0, not 1. Therefore, if we have a one-dimensional array with 100 elements then the first and last elements have indexes of 0 and 99 respectively. Note that Python's most direct competitors in the data science world, R and MATLAB, both count collections starting from an index of 1, not 0!
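Zero-based indexing can be seen on a tiny invented 1D array:

```python
import numpy as np

a = np.array([10, 20, 30, 40])

print(a[0])  # 10: the first element has index 0
print(a[3])  # 40: the fourth (last) element has index 3, not 4
```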

You can therefore access the third element in the fifth column of the dev_green ndarray using:


In [ ]:
dev_green[2, 4]

Exercise

What is the Modelled Wind Speed for the 418th measurement in the dataset?

Hint: Svefg qrgrezvar gur ebj vaqrk hfvat vasbezngvba va gur dhrfgvba. Arkg, qrgrezvar gur pbyhza vaqrk hfvat gur pbyhza urnqvatf fubja nf bhgchg sebz n cerivbhf pryy


In [ ]:
assert_almost_equal(____, 4.2, decimal=6)

How can we access the last value in the first column? The simplest way is like this:


In [ ]:
dev_green[-1, 0]

Here, an index of -1 gives us the last element in a particular row or column. An index of -2 would give us the second-to-last element etc.

Extracting ranges inc. entire rows and columns from numpy ndarrays

If we want to extract multiple values from an ndarray we can specify an index range e.g. the first four values in the fifth column (ozone level) can be extracted with:


In [ ]:
dev_green[0:4, 4]

  • The number before ':' is the index of the first element we are interested in.
    • If this is omitted then the range starts from the beginning of the row or column.
  • The number after ':' is the first index after the range of elements we want.
    • If this is omitted then the range continues to the end of the row or column.
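These rules can be checked on a small invented 1D array:

```python
import numpy as np

a = np.array([0, 10, 20, 30, 40])

print(a[1:3])  # [10 20] -> indices 1 and 2; the element at index 3 is excluded
print(a[:2])   # [ 0 10] -> start omitted, so the range begins at index 0
print(a[2:])   # [20 30 40] -> stop omitted, so the range runs to the end
```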

Exercise

Here is a much simpler 1D ndarray. Each element is a string of characters rather than a float (decimal number):


In [ ]:
interviewees = np.array(["Albertha", "Aurora", "Collin", "Cris", "Genoveva", "Joanne", 
                         "Kamilah", "Katerine", "Lida", "Malcom", "Max", "Saundra"])

Use an index range to make the following assertion statements valid. Note that 1D ndarrays are indexed using some_array[selector].


In [ ]:
assert_array_equal(____, np.array(["Genoveva"]))
assert_array_equal(____, np.array(["Albertha", "Aurora", "Collin"]))
assert_array_equal(____, np.array(["Malcom", "Max"]))
assert_array_equal(____, np.array(["Max", "Saundra"]))

To view an entire row, select all elements in a specific dimension using ':'. Here is the earliest set of measurements in our dataset:


In [ ]:
dev_green[0, :]

The value returned by this indexing operation appears to be an ndarray.

Notice that numpy has displayed every value using scientific notation. We can disable this for values that can easily be displayed without scientific notation using numpy's set_printoptions function:


In [ ]:
np.set_printoptions(suppress=True)
dev_green[0, :]

Returning to our indexing and slicing, we can extract the entire first column (the year each measurement was taken) using:


In [ ]:
dev_green[:, 0]

(Ignore the fact that this column has been output as a row for now.)

We can confirm that this is a single column of the expected length using the np.shape function seen before:


In [ ]:
np.shape(dev_green[:, 0])

We can also extract 2D portions of our ndarray using index ranges.

For example, we can combine what we have learned so far to view the last five rows and first four columns of dev_green:


In [ ]:
dev_green[-5:, :4]

Note that we did not use dev_green[-5:-1, :4] as then we would only have selected up to but not including the last row.

One final tip before we move on from slicing numpy arrays for now. The a:b range notation selects ranges of adjacent rows or columns. If we want to select rows or columns at other regular intervals then we use a range notation of the form a:b:c where c is our index increment between chosen elements.

For example, to select the year, month, day and hour for only even-indexed rows in our dataset:


In [ ]:
dev_green[::2, :4]

whereas to select only odd-indexed rows within a certain index range:


In [ ]:
dev_green[1:12:2, :4]

Missing data and statistical summaries

Now that we can extract different portions of our dataset we'd like to be able to calculate some summary statistics. For example, the mean Ozone level over the first five measurement times is:


In [ ]:
np.mean(dev_green[:5, 4])

However, we encounter a problem when we try to find the mean of all the values in the Ozone column:


In [ ]:
np.mean(dev_green[:, 4])

What is this nan value that the mean function returned? It is an abbreviation for 'not a number' and serves two purposes.

  1. It is the value returned when an operation evaluates to something that is mathematically undefined (test this with np.log(-1));
  2. It can also indicate missing values in a dataset (e.g. values corresponding to times when a sensor was not functioning correctly).

The second case is the one we've encountered here: if we look at our CSV file we can see it contains some lines where there are no values between our column delimiters (commas). Here's an example of a line with missing values (don't worry if you don't understand the code; just look at the output); these missing values are interpreted as nan when we read in the file using the genfromtxt function:


In [ ]:
with open(csv_path, 'r') as csv_file:
    for line in csv_file:
        if ',,' in line:
            print(line)
            break
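This behaviour can be reproduced with a small invented in-memory CSV (not the course data file): empty fields between the comma delimiters become nan.

```python
import io
import numpy as np

# Two invented data rows, each with one field left empty
csv_text = "a,b,c\n1,,3\n4,5,\n"
data = np.genfromtxt(io.StringIO(csv_text), delimiter=',', skip_header=1)

# The empty fields have been read in as nan
print(np.isnan(data))
# [[False  True False]
#  [False False  True]]
```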

Let's explore how to identify and deal with missing values in ndarrays using a very simple dataset:


In [ ]:
lap_times = np.array([10.973, 25.152, np.nan, 19.826,
                      19.312, 25.979, 17.147, 31.737,
                      np.nan, np.nan, 14.449, 11.135])

To count the number of missing values (a trivial task here but automation is necessary for larger datasets) we can use the following:


In [ ]:
np.sum(np.isnan(lap_times))

What is happening here? Let's break it down into steps, starting from the middle:

1) Determine whether each element is NaN or not:


In [ ]:
np.isnan(lap_times)

This function returns an ndarray of the same shape as its input where each element is True if the corresponding element in lap_times is NaN and False otherwise.

2) Pass this boolean (True/False only) ndarray to np.sum. This function adds all elements in this new ndarray, treating True values as 1 and False as 0. This is the standard behaviour when Python encounters boolean values in a numerical context such as an arithmetic expression:


In [ ]:
False + 4

In [ ]:
5 * True

What if we wanted to count the number of non-NaN values in the array? We can use the '~' operator to negate the result of np.isnan(lap_times):


In [ ]:
~np.isnan(lap_times)

In [ ]:
np.sum(~np.isnan(lap_times))

So, how many missing values do we have in our air quality dataset?


In [ ]:
np.sum(np.isnan(dev_green))

Exercise

What is this as a proportion of the dataset (accurate to at least 3 decimal places)?

Hint: Gur ac.fvmr shapgvba, zragvbarq cerivbhfyl, pbhyq or hfrshy urer


In [ ]:
assert_almost_equal(____, 0.231, decimal=3)

What if we want to count the missing values per row or per column?

Above, np.sum counts True values in all rows and columns; however, we can request per-column counts using


In [ ]:
np.sum(np.isnan(dev_green), axis=0)

So, only the first four columns don't contain missing values. Note that you can replace 0 with 1 for per-row counts. Many other functions also have an axis parameter that works in the same way.
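The effect of the axis argument can be seen on a tiny invented 2x3 array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m, axis=0))  # [5 7 9]  -> one sum per column
print(np.sum(m, axis=1))  # [ 6 15]  -> one sum per row
```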

Now that we've learned more about the presence of missing data in our dataset let's return to the task of trying to calculate some statistical summaries.

As seen above, np.mean will return np.nan if any element of the ndarray passed as an argument is np.nan i.e. np.mean propagates np.nan values.

However, numpy provides a set of functions that complement np.sum, np.mean and np.std (standard deviation) by ignoring missing values. Their names all start with nan (e.g. np.nansum, np.nanmean, np.nanstd).
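For example, with the lap_times array from earlier (repeated here so the cell is self-contained), np.mean propagates the NaNs whereas np.nanmean ignores them:

```python
import numpy as np

lap_times = np.array([10.973, 25.152, np.nan, 19.826,
                      19.312, 25.979, 17.147, 31.737,
                      np.nan, np.nan, 14.449, 11.135])

print(np.mean(lap_times))     # nan: the NaNs propagate
print(np.nanmean(lap_times))  # the mean of the nine non-NaN values only
```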


Exercise

Use np.nanmean to find the mean of each of the columns from the fifth column (index 4, the first measurement column) up to and including the last (whilst ignoring missing values).

Hint: see the help for this function.


In [ ]:
np.nanmean(

Exercise

Use tab completion (see Getting help) to see the other functions for which NaNs are removed rather than propagated.


Exercise

Calculate the median (correct to 3 decimal places) of the last 100 Ozone values in the dataset (Ozone is the 5th column)


In [ ]:
assert_almost_equal(____, 54.083, decimal=3)


Indexing with a boolean sequence
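Indexing an ndarray with a boolean ndarray of the same shape keeps only the elements where the boolean is True. Here is a minimal sketch using a shortened, invented version of the earlier lap_times array:

```python
import numpy as np

lap_times = np.array([10.973, 25.152, np.nan, 19.826])

# Build a boolean array that is True where lap_times is not NaN...
mask = ~np.isnan(lap_times)
print(mask)             # [ True  True False  True]

# ...then use it as an index to keep only those elements
print(lap_times[mask])  # [10.973 25.152 19.826]
```

This is the standard way of dropping NaN values from an ndarray before passing it to a function that cannot handle them.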





Indexing with an integer sequence
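Similarly, indexing with a sequence of integers selects the elements at those indices, in the order given. A minimal sketch using a shortened version of the earlier interviewees array:

```python
import numpy as np

interviewees = np.array(["Albertha", "Aurora", "Collin", "Cris"])

# A list (or ndarray) of integer indices selects those elements, in that order
print(interviewees[[2, 0]])  # ['Collin' 'Albertha']
```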

Plotting data and comments


Knowing how to extract portions of our dataset is essential for creating visual plots of our data. Some plots we may want to produce include:

  • a histogram showing the distribution of a particular variable (column);
  • a scatter plot showing the distribution of two variables (i.e. how they covary);
  • a line plot showing how a variable changes over time.

In Python, plots are typically created using the versatile and powerful Matplotlib (http://matplotlib.org/) library. Other libraries also offer plotting functionality but these often use Matplotlib behind the scenes.

Here's how we can create a histogram of Ozone concentration:

In [ ]:
import matplotlib.pyplot as plt
%matplotlib notebook

fig, axes = plt.subplots(nrows=1, ncols=1)
axes.hist(dev_green[~np.isnan(dev_green[:, 4]), 4])

The four lines of code above do the following:

  1. Import the plotting part of the matplotlib package;
  2. Instruct matplotlib and IPython to display plots in the Notebook between cells rather than in an external window;
  3. Create a new Figure (plot window) called fig containing a single Axes (subplot) object called axes (note the multiple assignment: plt.subplots returns two values, which we assign to two variables in one statement; this approach is also convenient for creating grids of subplots if nrows and/or ncols are increased);
  4. Plot a histogram of the Ozone column using the hist method of axes; the only mandatory argument to hist is an ndarray that must not contain NaNs, so we first filter the column down to just its non-NaN values using boolean indexing.

We can make the plot more useful and attractive with a few refinements: a plot style that looks better on screen, more histogram bins, a title and axis labels.

In [ ]:
# Use a plot style that looks better on screen
plt.style.use('ggplot')

fig, axes = plt.subplots(nrows=1, ncols=1)

axes.hist(dev_green[~np.isnan(dev_green[:, 4]), 4], bins=30)
# Set the title
axes.set_title('Distribution of Ozone at Devonshire Green weather station')
# Set the x and y axis labels
axes.set_xlabel('Concentration (micrograms per cubic m)')
axes.set_ylabel('Number of samples')


The above code cell contains several comments. In Python, any characters to the right of a hash (#) sign (unless the # is in quotes) are comments ignored by the Python interpreter (the mechanism that interprets and executes the code). Comments are a valuable means of reminding you and/or others how the software works and why it was implemented in a particular way. You should get in the habit of using comments throughout your code (either inline after # characters or, if using Jupyter Notebooks, in text cells).

Tip: write comments that would help you regain an understanding of your code were you to revisit it after a period of three months.
