In [1]:
example_list = [1, 2, 3]
average = (example_list[0] + example_list[1] + example_list[2]) / 3
# Shortcut using built-in Python functions:
average2 = sum(example_list) / len(example_list)
print(average)
print(average2)
If you are a fan of mathematical formulae, here is an alternative written with Greek symbols:
\begin{equation}
\frac{1}{n}\sum_{i=0}^{n-1}x_i = \frac{x_{0}+x_{1}+x_{2}+\dots+x_{n-1}}{n}
\end{equation}
where we have a list of $n$ numbers $x_i\in\mathbb{R}$ s.t. $i \in [0,n)$.
Now that we have defined what an average is, let's use it to smooth out our climate data timeseries in order to improve its 'readability'. Before that, let's talk a bit about timeseries so we have a clear idea of what they are.
Let's say that every day of the week I use a pedometer to track the number of steps I take, as an experiment to monitor how much I walk on an average day:
Day        Number of Steps
--------------------------
Monday     132
Tuesday    250
Wednesday  101
Thursday   230
Friday     396
Saturday   444
Sunday     60
This table is information that changes over time: measurements taken one day after the other. More formally, a timeseries is a list of measurements or data points that were taken sequentially at a given time interval. In our specific example the time interval was a day. Is this data 100% reliable? Is it possible that the pedometer was on low battery on Sunday, or does it just reflect that Sundays are my sleepy day?
Just like the example above, we have equivalent timeseries which measure things such as annual temperature, snow cover, average sunshine and other climate-related measurements.
In this case the measurements are yearly (this is our time interval, a year).
Why would we want to smooth our data? Because of noise in our measurements. It is possible that in a particular year the average temperature was difficult to measure, resulting in an outlier which may make our timeseries a bit harder to interpret. We smooth in order to make patterns easier to find, because sometimes too much information just makes things confusing.
Let's have a step-by-step run through how smoothing can be carried out, using the pedometer timeseries as an example:
(1) We need to decide on a window size. In our pedometer example this means how many days back in time we are going to take into account when averaging. Let's say the $window$ $size$ is $2$:
window size = 2
(2) Start with the most recent data point in time (in our case that is Sunday) and take $window$ $size - 1$ (our window size is 2, so this is 1) steps backwards in time. Collect all the measurements within this range, including the one on Sunday, and average them:
Day        Number of Steps
--------------------------
Saturday   444
Sunday     60
(444 + 60) / 2 = 252

(3) This average becomes the new, smoothed value for Sunday. Now move one day back in time and repeat step (2) for Saturday, then Friday, and so on, until a full window no longer fits (Monday has no previous day, so it is left out). After the first step our table looks like this:
Day        Number of Steps
--------------------------
Monday     132
Tuesday    250
Wednesday  101
Thursday   230
Friday     396
Saturday   444
Sunday     252
Applied to our whole pedometer dataset, this is what the result of the moving average looks like:
Day        Number of Steps
--------------------------
Tuesday    191
Wednesday  175.5
Thursday   165.5
Friday     313
Saturday   420
Sunday     252
As we can see, the numbers now change a lot less drastically.
In [2]:
# Our table as a list where the index is the day
# 0 being Monday and 6 Sunday
number_of_steps = [132, 250, 101, 230, 396, 444, 60]
window_size = 2
averaged_number_of_steps = list()
# Change this loop a bit to get our moving average working.
# We walk backwards in time, starting with Sunday, just like in the
# step-by-step description above.
for i, step_count in reversed(list(enumerate(number_of_steps))):
    # Only calculate an average while a full window fits,
    # i.e. stop after Tuesday (Monday has no previous day)
    if i - window_size + 1 >= 0:
        # This is where steps (2) and (3) go:
        # replace 0 with the average of the window ending at day i
        average = 0
        # The averages get appended most-recent-first,
        # i.e. in reverse chronological order
        averaged_number_of_steps.append(average)

# reversed() puts the results back into chronological order (Tuesday first)
print(list(reversed(averaged_number_of_steps)))
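If you get stuck, here is one possible completion of the loop above. It is only a sketch, and averaging a slice of the list is just one way to fill in the blank:

# One possible completion of the exercise (a sketch, not the only way)
number_of_steps = [132, 250, 101, 230, 396, 444, 60]
window_size = 2
averaged_number_of_steps = list()

for i, step_count in reversed(list(enumerate(number_of_steps))):
    if i - window_size + 1 >= 0:
        # Steps (2) and (3): average the window of window_size days
        # ending at day i (the slice includes the start index and
        # excludes i + 1)
        window = number_of_steps[i - window_size + 1 : i + 1]
        average = sum(window) / window_size
        averaged_number_of_steps.append(average)

print(list(reversed(averaged_number_of_steps)))
# [191.0, 175.5, 165.5, 313.0, 420.0, 252.0]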
Now let's have a look at our climate data and see what a moving average can do for it:
In [3]:
# Our home-brewed data analysis library for dds
from dds_lab.climdat import ClimPlots
# Importing locations of the data
from dds_lab.datasets import climate
# Plotting facilities
from bokeh.plotting import show, output_notebook
output_notebook()
In [4]:
# ClimPlots is an object: our own customized data structure
# that allows us to create plots for the climate data with simple commands.
c = ClimPlots(['edinburgh_snow_cover.txt', 'edinburgh_tmp_min.txt', 'edinburgh_tmp.txt'],
              path=climate)
# Generate time series
ts = c.plot_time_series(moving_avg=True)
# Show time series
show(ts)
Let's attempt a more interesting visualization than the timeseries lines above, in order to get a feel for what the moving average is doing in a more colorful way. Instead of representing each year by a snow cover value, we use a color scale in which larger snow cover values are darker. This is called a heat map.
In [5]:
tsh = c.plot_time_series_heatMap()
show(tsh)
Now let's compare this to our moving average result:
In [6]:
tsha = c.plot_time_series_heatMap(moving_avg=True)
show(tsha)
It looks like someone just smudged (or somehow blurred) our original heat map, and this is what a moving average really does: it smooths things out by smudging (averaging) each element with the elements before it.
Overall, averages simply give us a summary of a set of measurements by combining them together.
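As an aside, the same kind of smoothing is available in standard data analysis libraries. The sketch below assumes pandas is installed (it is not used anywhere else in this lab) and reproduces the pedometer moving average with a rolling mean:

import pandas as pd

steps = pd.Series([132, 250, 101, 230, 396, 444, 60],
                  index=["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"])
# A rolling window of 2 days, averaged; Monday has no full window,
# so its value is NaN and we drop it
smoothed = steps.rolling(window=2).mean().dropna()
print(smoothed)
# Tue    191.0
# Wed    175.5
# Thu    165.5
# Fri    313.0
# Sat    420.0
# Sun    252.0
# dtype: float64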