Here we'll get a bit of practice constructing and fitting models to data.
In [ ]:
from __future__ import print_function, division
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# use seaborn plotting defaults
import seaborn as sns; sns.set()
Since we're in Oslo, let's take a look at daily temperatures measured in Oslo. I found this data at the following website (uncomment this code to download the data):
In [ ]:
# !curl -O http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NOOSLO.txt
# !mv NOOSLO.txt data
We'll read it this way:
In [ ]:
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_csv('data/NOOSLO.txt', sep=r'\s+',
                   names=['month', 'day', 'year', 'degF'])
data.describe()
Notice that there's some craziness going on here. First of all, some of the years seem to have been mis-typed as 200. We'll want to filter those out. Also, some of the temperatures are reported as -99. This is a common value used to indicate missing data. Let's remove both of those and re-check the data description:
In [ ]:
# Filter bad years
data = data[data.year > 200]
# Filter missing data
data = data[data.degF > -99]
data.describe()
Looks much better! The next thing we'll want to do is to combine the month, day, and year columns into a single date index. We'll do this using Pandas to_datetime functionality:
In [ ]:
# Create a date index
YMD = 10000 * data.year + 100 * data.month + data.day
data.index = pd.to_datetime(YMD, format='%Y%m%d').astype('datetime64[ns]')
data.head()
Use the plot() method to plot the temperature with time. The day of the year for each measurement is available through the data.index.dayofyear attribute.
Here we'll practice fitting a simple model, even though it will not fit our data well: find the line of best fit for the data, using the following model:
$$ y(x) = \theta_0 + \theta_1 x $$
You can use a cost function of your choice (squared deviation is the best motivated) and use either scipy minimization or the matrix algebra formalism.
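For reference, with the squared-deviation cost the matrix approach has a closed form. Writing the model as $y = X\theta$, where each row of $X$ is $(1, x_i)$ (this matrix notation is introduced here for illustration and does not appear in the text above), the cost and its minimizer are:
$$ C(\theta) = \sum_i \left(y_i - \theta_0 - \theta_1 x_i\right)^2 = \lVert y - X\theta \rVert^2, \qquad \hat{\theta} = (X^T X)^{-1} X^T y $$
The closed form assumes $X^T X$ is invertible, which holds as long as the $x_i$ are not all identical.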
Once you've computed the best model, plot this fit over the original data. Note that the best possible fit with this model is not very good – the model is highly biased!
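One possible way to approach this exercise, as a minimal sketch rather than the definitive solution: build a design matrix with a column of ones and a column of x, then solve the least-squares problem with np.linalg.lstsq. The synthetic x and y below are stand-ins (an assumption, not the Oslo measurements); with the real data you would presumably use data.index.dayofyear and data.degF instead.

```python
import numpy as np

# Synthetic stand-in for the temperature data (an assumption for illustration):
# x plays the role of day-of-year, y is a seasonal temperature plus noise.
rng = np.random.RandomState(42)
x = rng.randint(1, 366, size=2000).astype(float)
y = 40 - 20 * np.cos(2 * np.pi * x / 365) + rng.normal(0, 5, size=x.size)

# Design matrix for y(x) = theta_0 + theta_1 * x
X = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes the summed squared deviation
theta_line, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_line = np.dot(X, theta_line)

# RMS residual: a rough measure of how far the line is from the data
rms_line = np.sqrt(np.mean((y - y_line) ** 2))
print("theta =", theta_line, " RMS residual =", rms_line)
```

The RMS residual stays far above the noise level of the synthetic data, which is the signature of a biased model: no choice of the two parameters lets a straight line follow a seasonal cycle. To overlay the fit on a plot, sort by x and draw y_line against x.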
Let's try a more complex model fit to our data. The data looks like it's sinusoidal, so we can fit a model that looks like this:
$$ y(x) = \theta_0 + \theta_1 \sin \left(\frac{2\pi x}{365}\right) + \theta_2 \cos\left(\frac{2\pi x}{365}\right) $$
Note that this is still a linear model (i.e. linear in the model parameters $\theta$), so it can also be solved either via iterative minimization or via matrix methods.
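Because the model is linear in $\theta$, the matrix approach carries over directly: only the columns of the design matrix change. The sketch below solves the normal equations on synthetic data generated from known parameters (theta_true and the noise level are assumptions chosen for illustration, not values from the Oslo data).

```python
import numpy as np

# Synthetic stand-in data (an assumption): known parameters plus Gaussian noise
rng = np.random.RandomState(0)
x = rng.randint(1, 366, size=2000).astype(float)
theta_true = np.array([42.0, -5.0, -18.0])
phase = 2 * np.pi * x / 365
y = (theta_true[0] + theta_true[1] * np.sin(phase)
     + theta_true[2] * np.cos(phase) + rng.normal(0, 4, size=x.size))

# Design matrix: columns are 1, sin(2 pi x / 365), cos(2 pi x / 365)
X = np.column_stack([np.ones_like(x), np.sin(phase), np.cos(phase)])

# Normal equations: (X^T X) theta = X^T y
theta_sin = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))

# RMS residual of the fitted model
rms_sin = np.sqrt(np.mean((y - np.dot(X, theta_sin)) ** 2))
print("theta =", theta_sin, " RMS residual =", rms_sin)
```

Here the recovered parameters land close to theta_true and the RMS residual is close to the injected noise level, in contrast to the straight-line model.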
What we've done above is fit a simple model to the annual temperature trend. Here we'll ask how the temperature varied from year to year.
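One possible way to look at year-to-year variation, offered as a sketch under assumptions rather than as the intended solution: subtract the fitted annual model and average the residuals within each calendar year. The data below are synthetic, with a made-up constant offset per year (offsets_true is hypothetical); with the real data, the year labels would presumably come from data.index.year.

```python
import numpy as np

# Synthetic stand-in (an assumption): 5 years of daily data with a seasonal
# cycle plus a different constant offset in each year.
rng = np.random.RandomState(1)
years = np.repeat(np.arange(2000, 2005), 365)
doy = np.tile(np.arange(1, 366), 5).astype(float)
offsets_true = np.array([-2.0, 1.0, 0.0, 3.0, -2.0])  # hypothetical, sums to zero
y = (42 - 18 * np.cos(2 * np.pi * doy / 365)
     + offsets_true[years - 2000] + rng.normal(0, 4, size=doy.size))

# Fit the annual (sinusoidal) model by least squares
phase = 2 * np.pi * doy / 365
X = np.column_stack([np.ones_like(doy), np.sin(phase), np.cos(phase)])
theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
resid = y - np.dot(X, theta)

# Mean residual per calendar year: sum of residuals / number of days in that year
unique_years, inverse = np.unique(years, return_inverse=True)
yearly_mean = np.bincount(inverse, weights=resid) / np.bincount(inverse)

for yr, m in zip(unique_years, yearly_mean):
    print(yr, round(m, 2))
```

A positive yearly mean indicates a year that ran warmer than the fitted seasonal cycle, a negative one a colder year.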