Visualizing Climate Variables

In these short notes we look at time series data from measurements of two climate variables (temperature and snow cover) and use different visualization techniques to gain insights from the data.

Data-set Description

The data set is extracted from SEPA and is specific to Edinburgh-based measurements. The three files we will be looking at contain the following measurements:

  • Minimum temperature per year;
  • Average temperature per year;
  • Snow cover per year.

In [1]:
# Our home brewed data analysis library for dds
from dds_lab.climdat import ClimPlots
# Importing locations of the data
from dds_lab.datasets import climate
# Plotting facilities
from bokeh.plotting import show, output_notebook, figure
output_notebook()


BokehJS successfully loaded.

Scatter Plots

Let's say we have the following data set:

Width   Height
----------------
1         2  
3         5
4         6    
9         7 
7         8      
1         1   
6         3   
2         4

It consists of the width and height of a set of apples picked randomly from the same tree. We could stare at this table and try to picture what the numbers look like, without much luck. Instead, we can use a scatter plot and see for ourselves:


In [2]:
# Our data:
x = [1, 3, 4, 9, 7, 1, 6, 2]
y = [2, 5, 6, 7, 8, 1, 3, 4]

# Figure graphing object from bokeh library
fig = figure(title="apples width vs height",
             x_axis_label="width",
             y_axis_label="height",
             width=350, height=350)

# Create scatter plot on figure using our data
fig.scatter(x=x, y=y)

# Display figure
show(fig)


So the trend seems to be that width and height increase together.

This is quite similar to the plots used in high school when studying functions, where we would plot $y$ against $x$ to visualize a line (for example $y = x$). The difference is that here $x$ and $y$ come from data and most likely have a complicated, inexact relationship, if they have a relationship at all.
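
As a rough numerical check on this trend, the sketch below computes the Pearson correlation coefficient for the apple data in plain Python. The helper name `pearson_r` is ours, chosen for illustration; it is not part of `dds_lab` or Bokeh.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


# Same apple data as in the scatter plot
x = [1, 3, 4, 9, 7, 1, 6, 2]
y = [2, 5, 6, 7, 8, 1, 3, 4]

r = pearson_r(x, y)
print(f"r = {r:.2f}")
```

A value near +1 would mean the points lie almost exactly on a rising straight line, and a value near 0 would mean no linear trend. Here the result is roughly 0.76, which matches what the plot suggests: a clear upward trend with noticeable scatter.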

Scatter Plot Matrix

We have three different climate measurements and we may wish to know how they relate to each other.

One of the first things to do when trying to learn from data is to visualize it, before jumping into statistics and extracting more numbers from the data.

We will use a plot called a scatter plot matrix, a not so well known yet extremely powerful visualization tool.

This plot groups all the measurements (snow cover, minimum temperature and average temperature) carried out in the same year and displays them in pairs, one against the other, as regular scatter plots.
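
To make the pairing idea concrete, here is a minimal sketch that does not use `ClimPlots`. It lists every pair of variables and, if Bokeh is installed, draws one scatter panel per pair with `gridplot`. The function names, the dictionary keys and the numbers are all our own assumptions for illustration; the values are made up and are not the Edinburgh measurements. We do not know how `plot_pairs` is implemented, so treat this as one possible approach only.

```python
from itertools import combinations

# Made-up values for illustration only (NOT the Edinburgh measurements).
# Each list holds one value per year, in the same year order.
data = {
    "snow_cover": [30.0, 26.0, 19.0, 17.0],
    "tmp_min": [-9.0, -7.5, -6.0, -5.0],
    "tmp_avg": [7.0, 8.0, 9.0, 10.0],
}


def variable_pairs(names):
    """Return every unordered pair of variable names (one panel each)."""
    return list(combinations(names, 2))


def scatter_matrix(data, size=200):
    """Build a Bokeh grid with one scatter plot per pair of variables."""
    # Imported here so the pairing logic above works without Bokeh installed.
    from bokeh.layouts import gridplot
    from bokeh.plotting import figure

    panels = []
    for x_name, y_name in variable_pairs(list(data)):
        fig = figure(x_axis_label=x_name, y_axis_label=y_name,
                     width=size, height=size)
        fig.scatter(x=data[x_name], y=data[y_name])
        panels.append(fig)
    # A single row of panels; a full matrix would use an n-by-n grid instead
    return gridplot([panels])


pairs = variable_pairs(list(data))
print(pairs)
```

With three variables there are three pairs, so `show(scatter_matrix(data))` would display three panels. A full scatter plot matrix usually shows every variable against every other in a square grid; the single row here keeps the sketch short.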


In [3]:
# ClimPlots is our own customized data structure that allows us
# to create plots for the climate data with simple commands.
c = ClimPlots(['edinburgh_snow_cover.txt','edinburgh_tmp_min.txt','edinburgh_tmp.txt'],
              path=climate)

# Plot scatter plot matrix:
g = c.plot_pairs()

In [4]:
# Use Bokeh to display the scatter plot matrix
show(g)
# Behold the scatter plot matrix! (Can you see anything interesting?)


First, note that the dots are colored and sized by period:

  • Red: 2010 to the present
  • Green: 1980 to 2010
  • Blue: 1980 and before

Sometimes grouping data in certain ways helps spot patterns. In this case it lets us see that the red dots (the most recent ones) all sit at significantly higher temperatures than the other colors. This could be an indicator of global warming.
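
The grouping by period can be sketched as a small function that maps a year to a color. The function name and the handling of the boundary years are our assumptions: the list above does not say which group 1980 and 2010 themselves belong to, and we do not know how `ClimPlots` decides this internally.

```python
def period_color(year):
    """Map a year to the color of its period.

    Assumption: 1980 counts as "1980 and before" (blue) and 2010 counts
    as part of the 1980 to 2010 group (green).
    """
    if year > 2010:
        return "red"
    if year > 1980:
        return "green"
    return "blue"


# A few example years and the color each one would receive
for year in (1965, 1995, 2011):
    print(year, period_color(year))
```

A list of such colors, one per data point, could then be passed to a scatter plot through its `color` argument.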

Linear relationships

We can see that minimum and average temperature fit a straight line relatively well. This is because both are measurements of the same climate variable, only calculated or measured differently.

The interesting trend appears when comparing snow cover and temperature: as expected, they are negatively related (back to high school maths, $y = -x$), so when temperature increases snow cover decreases. The data seem to follow a linear pattern; nonetheless, there is quite a lot of scatter around any straight line one could draw through the points, which may be due to noise (contamination) in the measurements.
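
To illustrate what "a straight line plus scatter" means, the sketch below fits a least-squares line to a handful of made-up points and prints the residuals, that is, how far each point sits from the line. The numbers are invented for illustration and are not the Edinburgh measurements; `fit_line` is our own helper name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values for illustration only (NOT the Edinburgh measurements)
temperature = [7.0, 8.0, 9.0, 10.0]
snow_cover = [30.0, 26.0, 19.0, 17.0]

slope, intercept = fit_line(temperature, snow_cover)

# Residuals: vertical distance of each point from the fitted line
residuals = [s - (intercept + slope * t)
             for t, s in zip(temperature, snow_cover)]

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
print("residuals:", [round(e, 2) for e in residuals])
```

A negative slope corresponds to the downward trend described above, and the size of the residuals is a measure of the scatter around the line.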

Extreme Values vs Errors

There is one massive outlier that lies far from the trend between snow cover and temperature. There are two possible reasons why this data point is so far from normal:

  • The measurement equipment was faulty that year, so this point is an error;
  • The point is a genuine extreme value.

Is it possible to have an extreme value that is not an error? The answer is yes. Sometimes the underlying physical processes that we are obtaining data from behave in unexpected ways and produce unlikely results.

So what do you think? Error or correct extreme value?

If you hover over the point you will see that the year was 2011, a year in which the UK was both extremely hot during summer and incredibly snowy during winter. Our outlier is therefore in fact not an error, and the relationship is less linear than it would have appeared had we discarded the point as one.
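
The effect of a single extreme point on the apparent linearity can be sketched by computing the correlation coefficient with and without it. The numbers below are made up for illustration (they are not the Edinburgh measurements, and the extra point is not the real 2011 value); `pearson_r` is our own helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    syy = sum((b - mean_y) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)


# Made-up values for illustration only: a clean downward trend...
temperature = [7.0, 8.0, 9.0, 10.0, 11.0]
snow_cover = [30.0, 26.0, 19.0, 17.0, 12.0]

# ...plus one hypothetical year that was both warm and very snowy
outlier_temperature = 11.5
outlier_snow_cover = 35.0

r_without = pearson_r(temperature, snow_cover)
r_with = pearson_r(temperature + [outlier_temperature],
                   snow_cover + [outlier_snow_cover])

print(f"without the extreme year: r = {r_without:.2f}")
print(f"with the extreme year:    r = {r_with:.2f}")
```

In this toy example the correlation is strongly negative without the extreme point and much weaker with it. Discarding a genuine extreme value would therefore overstate how linear the relationship is.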

Thought Exercise

Which plot seems less scattered: snow cover vs average temperature or snow cover vs minimum temperature? Why do you think this is the case? Can you spot any other interesting patterns using the scatter plot matrix?