Averages

Most students will have met this concept in statistics; if not, they will certainly know the word from outside of mathematics.

The average of a list of numbers is just the sum of those numbers divided by how many numbers are in the list:


In [1]:
example_list = [1, 2, 3]
average = (example_list[0] + example_list[1] + example_list[2]) / 3
# Shortcut using Python's built-in functions:
average2 = sum(example_list) / len(example_list)
print(average)
print(average2)


2.0
2.0

If you are a fan of mathematical formulae, here is the same definition using Greek symbols:

\begin{equation} \frac{1}{n}\sum_{i=0}^{n-1}x_i = \frac{x_{0}+x_{1}+x_{2}+...+x_{n-1}}{n} \end{equation}

where we have a list of $n$ numbers $x_i\in\mathbb{R}$, indexed by $i \in \{0, 1, \dots, n-1\}$.
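
The formula maps directly onto a loop. The sketch below is only an illustration (the function name `mean` is our own choice): the loop index plays the role of $i$ and `total` accumulates the sum.

```python
def mean(values):
    """Average of a non-empty list, following the formula term by term."""
    n = len(values)
    total = 0
    for i in range(n):  # i = 0, 1, ..., n-1
        total += values[i]
    return total / n


print(mean([1, 2, 3]))  # 2.0
```

This gives the same result as `sum(example_list) / len(example_list)` above; the built-in functions simply hide the loop.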

The Moving Average (Advanced Application)

Now that we have defined what an average is, let's use it to smooth out our climate data timeseries in order to improve its 'readability'. Before this, let's talk a bit about timeseries so we have a clear idea of what they are.

Timeseries

Let's say that every day of the week I use a pedometer to count my steps, as an experiment to find out how much I walk on an average day:

Day        Number of Steps
--------------------------
Monday        132
Tuesday       250
Wednesday     101   
Thursday      230
Friday        396
Saturday      444
Sunday         60

This table holds information that changes over time: measurements taken one day after another. More formally, a timeseries is a list of measurements or data points that were taken sequentially at a fixed time interval. In our example the time interval is one day. Is this data 100% reliable? Is it possible that the pedometer was low on battery on Sunday, or does the low count just reflect that Sundays are my sleepy day?
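
In Python a timeseries like this can be kept as a list of (time, value) pairs. This is just one convenient layout, and the variable names below are our own choices for illustration:

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
steps = [132, 250, 101, 230, 396, 444, 60]

# Pair each day with its measurement, keeping the time order
timeseries = list(zip(days, steps))

for day, count in timeseries:
    print(day, count)

# The plain average over the whole week, as defined earlier
weekly_average = sum(steps) / len(steps)
print(round(weekly_average, 1))  # 230.4
```

Note how far Sunday's 60 steps sit below the weekly average: a single number like this cannot tell us whether that is noise or a real pattern.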

Smoothing the Climate Timeseries

Just like in the example above, we have equivalent timeseries that measure things such as annual temperature, snow cover, average sunshine and other climate-related quantities. In this case the measurements are yearly (our time interval is a year).

Why would we want to smooth our data? Because of noise in our measurements. It is possible that in a particular year the average temperature was difficult to measure, which produces an outlier that makes our timeseries harder to interpret. We smooth in order to make patterns easier to find, because sometimes too much detail just makes things confusing.

Let's go step by step through how smoothing can be carried out, using the pedometer timeseries as an example:

  • (1) We need to decide on a window size. In our pedometer example this means how many days, counting the current day, we take into account when averaging. Let's say the $window$ $size$ is $2$:

    window size = 2
  • (2) Start with the most recent data point in time (in our case that is Sunday) and take $window$ $size$ $-1$ (our window size is 2 thus this is 1) steps backwards in time. Collect all the measurements within this range including the one on Sunday:

Day        Number of Steps
--------------------------
Saturday       444
Sunday         60
  • (3) Take the average of the measurements collected in this window:
(444 + 60) / 2 =  252
  • (4) This now becomes the measurement for Sunday, so we update our table:
Day        Number of Steps
--------------------------
Monday        132
Tuesday       250
Wednesday     101   
Thursday      230
Friday        396
Saturday      444
Sunday        252
  • (5) Now we take one step backwards to Saturday and repeat steps 2, 3 and 4 until we can no longer carry out step 2 (building the list to average by taking $window$ $size$ $-1$ steps backwards). This happens when we reach Monday, which has no earlier day, so we won't be able to compute a moving average value for that day.

On our pedometer data this is what the result of the moving average looks like:

Day        Number of Steps
--------------------------
Tuesday       191
Wednesday     175.5   
Thursday      165.5
Friday        313
Saturday      420
Sunday        252

As we can see, the numbers now change a lot less drastically from one day to the next.
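
For readers who like formulae, the steps above can also be written compactly. The notation here is our own shorthand and not part of the walkthrough: $w$ stands for the window size and $m_i$ for the smoothed value at position $i$, with $x_i$ and $n$ as in the definition of the average.

\begin{equation} m_i = \frac{1}{w}\sum_{j=0}^{w-1}x_{i-j} = \frac{x_{i-w+1}+\dots+x_{i}}{w}, \qquad w-1 \le i \le n-1 \end{equation}

For the pedometer data with $w = 2$ and Sunday at $i = 6$, this gives $m_6 = (x_5 + x_6)/2 = (444 + 60)/2 = 252$, as in step (3).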

Now you try!

  • How many elements would be left in the new table if our $window$ $size$ were 3?
  • Using the template below, try to replicate in Python the exercise we just did by hand (Difficult).

In [2]:
# Our table as a list where the index is the day
# 0 being Monday and 6 Sunday
number_of_steps = [132, 250, 101, 230, 396, 444, 60]
window_size = 2
averaged_number_of_steps = list()

# Change this loop a bit to get our moving average working:
for i, step_count in enumerate(number_of_steps):
    # Skip days without enough earlier days to fill the window
    # (with window_size = 2 that is only Monday)
    if i - window_size + 1 >= 0:
        # this would be steps (2) and (3)
        average = 0
        # Step (4): store the new value. The loop runs forwards in time,
        # so the values are appended oldest first (Tuesday to Sunday)
        averaged_number_of_steps.append(average)

# reversed() prints them most recent first, as in the walkthrough above
print(list(reversed(averaged_number_of_steps)))


[0, 0, 0, 0, 0, 0]
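
If you get stuck, here is one possible way to fill in the template. It is a sketch rather than the only correct answer: the helper name `moving_average` is our own choice for illustration, and it slices each window out of the list instead of stepping backwards by hand.

```python
number_of_steps = [132, 250, 101, 230, 396, 444, 60]
window_size = 2


def moving_average(values, window_size):
    """Average each value with the window_size - 1 values before it."""
    averaged = []
    # The first window_size - 1 positions have too few earlier values
    for i in range(window_size - 1, len(values)):
        # Steps (2) and (3): collect the window ending at i and average it
        window = values[i - window_size + 1 : i + 1]
        averaged.append(sum(window) / window_size)
    return averaged


smoothed = moving_average(number_of_steps, window_size)
# Oldest first: Tuesday, Wednesday, ..., Sunday
print(smoothed)  # [191.0, 175.5, 165.5, 313.0, 420.0, 252.0]
```

Because the results go into a new list, the original measurements are never overwritten, so the order in which the days are processed does not matter here.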

Now let's have a look at our climate data and see what a moving average can do for it:


In [3]:
# Our home brewed data analysis library for dds
from dds_lab.climdat import ClimPlots
# Importing locations of the data
from dds_lab.datasets import climate
# Plotting facilities
from bokeh.plotting import show, output_notebook
output_notebook()


BokehJS successfully loaded.

In [4]:
# ClimPlots is an object: our own customized data structure
# that allows us to create plots for the climate data with simple commands.
c = ClimPlots(['edinburgh_snow_cover.txt','edinburgh_tmp_min.txt','edinburgh_tmp.txt'],
              path=climate)

# Generate time series
ts = c.plot_time_series(moving_avg=True)
# Show time series
show(ts)


Heat Map

Let's try a more interesting visualization than the timeseries lines above to get a feel for what the moving average is doing in a more colorful way. Instead of representing each year by a snow cover value on an axis, we use a color scale in which larger snow cover values are darker. This is called a heat map.


In [5]:
tsh = c.plot_time_series_heatMap()
show(tsh)


Now let's compare this to our moving average result:


In [6]:
tsha = c.plot_time_series_heatMap(moving_avg=True)
show(tsha)


It looks like someone just smudged (or blurred) our original heat map, and this is what a moving average really does: it smooths things out by smudging (averaging) each element with the elements before it.

Overall averages just give us the summary of a set of measurements by combining them together.