Introduction to Python for Data science: 04 - Plotting data

This is designed to be a self-directed study session where you work through the material at your own pace. If you are at a Code Cafe event, instructors will be on hand to help you.

If you haven't done so already, please read through the Introduction to this course, which covers:

  1. What Python is and why it is of interest;
  2. Learning outcomes for the course;
  3. The course structure and support facilities;
  4. An introduction to Jupyter Notebooks;
  5. Information on course exercises.

This lesson covers:

  1. Plotting data using Matplotlib;
  2. Histograms;
  3. Line plots;
  4. Scatter plots.

Lesson setup code

Run the following Notebook cell every time you load this lesson (but do not edit it). Don't be concerned with what this code does at this stage.


In [ ]:
____ = 0
import os
import numpy as np
from codecs import decode
from numpy.testing import assert_almost_equal, assert_array_equal

data_dir = 'Weather_data'
csv_path = os.path.join(data_dir, 'Devonshire_Green_meteorological_data-preproc.csv')

dev_green = np.genfromtxt(csv_path, delimiter=',', skip_header=1)

Plotting data using Matplotlib

Knowing how to extract portions of our dataset is essential for visualising our data by creating plots. Some plots we may want to produce include:

  • a histogram showing the distribution of a particular variable (column)
  • a scatter plot showing the distribution of two variables (i.e. how they covary)
  • a line plot showing how a variable changes over time

In Python, plots are typically created using the versatile and powerful Matplotlib library. Other libraries also offer plotting functionality, but these often use Matplotlib behind the scenes.

Histograms

Here's how we can create a histogram of Ozone concentration:


In [ ]:
import matplotlib.pyplot as plt
%matplotlib notebook

(fig, axes) = plt.subplots(nrows=1, ncols=1)
axes.hist(dev_green[~np.isnan(dev_green[:, 4]), 4])

The 4 lines of code above do the following:

  1. Import the necessary part of the matplotlib package.
  2. Instruct matplotlib and IPython to display plots in the Notebook between cells rather than in an external window.
  3. Create a new Figure (plot window) called fig containing a single Axes (subplot) object called axes.
  4. Plot the histogram. axes.hist can only plot histograms of data if the data does not contain NaN values. We therefore need to:
    1. Extract the 5th column of the dev_green ndarray
    2. Create a boolean array showing where values in this column are NaN
    3. Negate this array
    4. Use this to select only non-NaN Ozone values from the dataset
    5. Pass the resulting ndarray as a single argument to the hist function associated with axes
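The sub-steps above can be tried one at a time on a much smaller array. The following is only a sketch: the array sample and its values are made up for illustration and are not taken from dev_green.

```python
import numpy as np

# A small, made-up stand-in for dev_green: 4 rows, 5 columns (hypothetical values)
sample = np.array([
    [2015, 1, 1, 0, 41.0],
    [2015, 1, 1, 1, np.nan],
    [2015, 1, 1, 2, 38.5],
    [2015, 1, 1, 3, np.nan],
])

ozone = sample[:, 4]               # 1. extract the 5th column (index 4)
is_nan = np.isnan(ozone)           # 2. boolean array: True where the value is NaN
is_valid = ~is_nan                 # 3. negate it: True where the value is not NaN
valid_ozone = sample[is_valid, 4]  # 4. keep only the non-NaN values from column 4

print(is_nan)
print(valid_ozone)
```

Printing the intermediate arrays in this way is a useful habit when a one-line expression such as the one passed to axes.hist is hard to read.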

Exercise

Experiment with the buttons immediately below the displayed histogram to find out what they do.


Some further comments:

  1. The third line of code demonstrates multiple assignment: rather than returning, say, a single number, plt.subplots instead returns a tuple (a particular type of sequence of things) of length 2 (a Figure object and an Axes object). These two values are then assigned to two variables at the same time, which are enclosed in parentheses. Being able to assign to multiple variables at once in Python is very handy, e.g.:

    (x_coord, y_coord) = (123456, 654321)
    

    Just remember: multiple assignment requires that the thing(s) to the left and right of the equals sign must be sequence-like things with the same length.

  2. Note that there are ways other than plt.subplots of creating new plots; however plt.subplots is preferred by the author as it can be used to create not just single subplots but grids (rows and/or columns of subplots) too.

  3. You may have noticed that hist is associated with axes, which is a variable we created rather than a package. Here hist is a special type of function called a method. You can think of it as being a function that operates on axes but also uses the arguments supplied in parentheses. What's really happening here will become clear at a later date when you learn about a style of programming called object-oriented programming.
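As a brief illustration of the multiple assignment described in the first comment, here is a minimal sketch in plain Python. The variable names a, b and mismatch_message are invented for this example.

```python
# Unpack a tuple of two values into two variables in one statement
(x_coord, y_coord) = (123456, 654321)

# The parentheses are optional; this form swaps the two values
x_coord, y_coord = y_coord, x_coord

# If the two sides differ in length, Python raises a ValueError
try:
    (a, b) = (1, 2, 3)
except ValueError as err:
    mismatch_message = str(err)

print(x_coord, y_coord)
print(mismatch_message)
```

The error raised in the last step is what you would see if, for example, you tried to assign the result of plt.subplots to three variables instead of two.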

We can make our plot more useful and attractive with a few refinements.


In [ ]:
# Enable a Matplotlib visual style that looks better on-screen
plt.style.use('ggplot')

# Create a new Figure and Axes
(fig, axes) = plt.subplots(nrows=1, ncols=1)

# Isolate the data we wish to plot the distribution of.
ozone_col_idx = 4
is_ozone_not_nan = ~np.isnan(dev_green[:, ozone_col_idx])
valid_ozone_data = dev_green[is_ozone_not_nan, ozone_col_idx]

# Plot a histogram of this data
# - using a larger number of bins (bands) than the default of 10
# - for only a certain range of the data; here the range parameter 
#   must be a 'tuple' (a sequence of values in parentheses) comprised of 
#   a lower and upper bound.
axes.hist(valid_ozone_data, bins=30, range=(0, 150))

# Set the plot title
axes.set_title('Distribution of Ozone at Devonshire Green weather station')
# Set the x and y axis labels
axes.set_xlabel('Concentration (micrograms per cubic m)')
# End the last line in the cell with a semi-colon; this stops the Notebook
# from displaying the value returned by the final expression below the cell
axes.set_ylabel('Number of samples');
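To see what the bins and range parameters do without drawing anything, the counts can be computed directly with np.histogram, which Matplotlib's hist uses to bin the data. This is only a sketch: the concentrations array holds made-up values, not data from dev_green.

```python
import numpy as np

# Made-up concentrations (hypothetical values, not taken from dev_green)
concentrations = np.array([5.0, 20.0, 60.0, 70.0, 120.0, 200.0])

# Three equal-width bins spanning 0 to 150;
# the value 200 lies outside the range and is therefore not counted
(counts, bin_edges) = np.histogram(concentrations, bins=3, range=(0, 150))

print(counts)     # number of samples in each bin
print(bin_edges)  # the four edges that delimit the three bins
```

Note that the number of bin edges is always one more than the number of bins.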

Tip: commenting code

The above code cell contains several comments. In Python, any characters to the right of a hash (#) sign (unless # is in quotes) are comments ignored by the Python interpreter (the mechanism that interprets and executes the code); they are solely for the reader's benefit.

Comments are a valuable means for reminding you and/or others how the software works and why it was implemented in a particular way. You should get in the habit of using comments throughout your code (either inline after # characters in any Python code and/or, if using Jupyter Notebooks, in text cells).

Write comments that would help you regain an understanding of your code were you to revisit it after a period of three months!
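A short sketch of the rules described in this tip; the variable names threshold and label are invented for the example.

```python
# A whole-line comment: Python ignores everything after the hash
threshold = 150  # an inline comment following some code

# A hash inside quotes is part of the string, not the start of a comment
label = 'Sample # 1'

print(threshold)
print(label)
```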


We can also show multiple things on a single subplot e.g. overlaid histograms of temperature data for the months of July and December:


In [ ]:
# No need to enable the 'ggplot' Matplotlib visual style as already active

# Create a new Figure and Axes
(fig, axes) = plt.subplots(nrows=1, ncols=1)

# Isolate the data we wish to plot the distribution of.
month_col_idx = 1
temperature_col_idx = -1

is_sample_from_july = dev_green[:, month_col_idx] == 7
is_sample_from_december = dev_green[:, month_col_idx] == 12
is_temperature_valid = ~np.isnan(dev_green[:, temperature_col_idx])

valid_temperatures_july = dev_green[is_sample_from_july & is_temperature_valid, temperature_col_idx]
valid_temperatures_december = dev_green[is_sample_from_december & is_temperature_valid, temperature_col_idx]

# Plot a histogram of this data
axes.hist(valid_temperatures_july, bins=30, label='July', alpha=0.5)
axes.hist(valid_temperatures_december, bins=30, label='December', alpha=0.5)

# Set the plot title
axes.set_title('Distribution of modelled temperature (Devonshire Green weather station)')
# Set the x and y axis labels
axes.set_xlabel('Degrees centigrade')
axes.set_ylabel('Number of samples')
# Add a legend
# End the last line in the cell with a semi-colon; this stops the Notebook
# from displaying the value returned by that final expression
axes.legend();

Exercise

Look at the above plot. Identify the code that

  • Selects valid temperature data for July and December from dev_green;
    • Aside: What type of indexing is used here? Are we creating copies or views of parts of dev_green?
  • Associates a label with each plotted entity;
  • Makes each histogram semi-transparent to ensure neither is obscured;
  • Displays a key showing the mapping between the color and name of each plotted entity.

Hint:

Va pbzchgre tencuvpf gur grez 'nycun' bsgra ersref gb gur qrterr bs genafcnerapl bs n cvkry/vzntr.


Exercise

Create a plot showing the distribution of Ozone in January, April and August.


Line plots

We might also want to look at how one or more variables in our dataset change over time. We do this by:

  • Again creating an object of type Axes using the plt.subplots function (we've again called our Axes object axes)
  • Calling the plot method of our axes object, passing it a one-dimensional ndarray as an argument. This array is the sequence of y-axis values that we wish to display.

Remember, as briefly explained earlier, a method is a type of function that operates on the thing it follows (and is separated from by a '.') plus the arguments passed to the method in parentheses.


In [ ]:
(fig, axes) = plt.subplots(nrows=1, ncols=1)

ozone_col_idx = 4
axes.plot(dev_green[:, ozone_col_idx])

It appears that Ozone wasn't recorded for some time during the studied monitoring period.

You might be wondering what the x-axis coordinates (and the numbers along the x-axis) are. The x-axis coordinates are simply the indexes of the array passed as an argument to the plot method, i.e. a range from 0 up to (but not including):


In [ ]:
len(dev_green[:, ozone_col_idx])
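The implicit x-axis coordinates can be inspected directly. The sketch below uses a made-up array (y_values) rather than dev_green, and it selects the non-interactive 'Agg' backend on the assumption that it is run as a standalone script; omit that line when working in the Notebook.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: standalone script, so no window is needed
import matplotlib.pyplot as plt

# Made-up y-axis values, including one missing (NaN) sample
y_values = np.array([3.0, 4.5, np.nan, 5.0, 2.5])

(fig, axes) = plt.subplots(nrows=1, ncols=1)
lines = axes.plot(y_values)  # plot returns a list of line objects

# The x coordinates Matplotlib chose for us: the array indexes
x_used = lines[0].get_xdata()
print(x_used)
```

Matplotlib does not draw line segments to or from NaN values, which is why missing samples show up as gaps in a line plot.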

We can plot something more meaningful on the x axis by passing an array of x-axis coordinates as the first argument to plot and an array of y-axis coordinates as the second argument. However, turning the first four columns (year, month, day and hour of the day) into a one-dimensional array that we can use for this purpose is beyond the scope of this lesson.
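Passing explicit x-axis coordinates might look like the following sketch. The hours and temperatures arrays are made-up values, and the 'Agg' backend line assumes a standalone script (omit it in the Notebook).

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # assumption: standalone script, so no window is needed
import matplotlib.pyplot as plt

# Made-up data: four temperature readings taken six hours apart
hours = np.arange(0, 24, 6)
temperatures = np.array([4.0, 6.5, 11.0, 8.0])

(fig, axes) = plt.subplots(nrows=1, ncols=1)
# First argument: x-axis coordinates; second argument: y-axis coordinates
lines = axes.plot(hours, temperatures)
axes.set_xlabel('Hour of the day')
axes.set_ylabel('Degrees centigrade');
```

The two arrays must have the same length, as each x value is paired with the y value at the same index.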

Note that the pandas library has far better support for using dates and times as coordinates.

NOTES BELOW


In [ ]:
dev_green[:, :4]

In [ ]:
ci = {
    'Year': 0,
    'Month': 1,
    'Day': 2,
    'Hour': 3,
    'Ozone': 4,
    'Nitric oxide': 5,
    'Nitrogen dioxide': 6,
    'Nitrogen oxides as nitrogen dioxide': 7,
    'PM10 particulate matter': 8,
    'Non-volatile PM10': 9,
    'Volatile PM10': 10,
    'PM2.5 particulate matter': 11,
    'Non-volatile PM2.5': 12,
    'Volatile PM2.5': 13,
    'Modelled Wind Direction': 14,
    'Modelled Wind Speed': 15,
    'Modelled Temperature': 16
}

ci['Nitric oxide']

In [ ]:
def dateparts2datetime64(year, month, day, hour):
    return np.datetime64("{:0>4}-{:0>2}-{:0>2}T{:0>2}:00:00".format(year, month, day, hour))
dateparts2datetime64_vectorized = np.vectorize(dateparts2datetime64)

# np.int has been removed from recent NumPy releases; the built-in int is equivalent here
timestamps = dateparts2datetime64_vectorized(*dev_green[:, :4].T.astype(int))
timestamps[:5]
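To check what the helper function produces, it can be called on a single set of date parts. This sketch repeats the function definition so that it runs on its own; the date used is an arbitrary example, not a value read from the dataset.

```python
import numpy as np

def dateparts2datetime64(year, month, day, hour):
    # Zero-pad each part so the string matches the ISO 8601 layout NumPy expects
    return np.datetime64("{:0>4}-{:0>2}-{:0>2}T{:0>2}:00:00".format(year, month, day, hour))

# An arbitrary example: 03:00 on 2 January 2015
example_timestamp = dateparts2datetime64(2015, 1, 2, 3)
print(example_timestamp)
```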

In [ ]:
# Create a new Figure and Axes
(fig, axes) = plt.subplots(nrows=1, ncols=1)



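# Plot Nitric oxide against the sample index
axes.plot(dev_green[:, ci['Nitric oxide']], alpha=0.5)
# Alternatives that use the timestamps computed above as x-axis coordinates:
#axes.plot(timestamps, dev_green[:, 6], alpha=0.5)
#axes.plot(timestamps, dev_green[:, 7], alpha=0.5)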

Scatter plots

  • Scatter plot
    • Need to remove NaNs again?
    • Which method is most elegant but also in keeping with what has been presented so far?
  • Line plot
    • Do NaNs not matter here?
    • Which series is most interesting? Temperature?
  • Looping (using line plots)