Here we'll get a bit of practice constructing and fitting models to data.
In [1]:
from __future__ import print_function, division
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# use seaborn plotting defaults
import seaborn as sns; sns.set()
Since we're in Oslo, let's take a look at daily temperatures measured in Oslo. I found this data at the following website (uncomment this code to download the data):
In [2]:
# !curl -O http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NOOSLO.txt
# !mv NOOSLO.txt data
We'll read it this way:
In [3]:
data = pd.read_csv('../data/NOOSLO.txt', delim_whitespace=True,
names=['month', 'day', 'year', 'degF'])
data.describe()
Out[3]:
Notice that there's some craziness going on here. First of all, some of the years seem to have been mis-typed as 200. We'll want to filter those out. Also, some of the temperatures are reported as -99. This is a common value used to indicate missing data. Let's remove both of those and re-check the data description:
In [4]:
# Filter bad years
data = data[data.year > 200]
# Filter missing data
data = data[data.degF > -99]
data.describe()
Out[4]:
Looks much better! The next thing we'll want to do is to combine the month, day, and year columns into a single date index. We'll do this using Pandas to_datetime
functionality:
In [5]:
# Create a date index
YMD = 10000 * data.year + 100 * data.month + data.day
data.index = pd.to_datetime(YMD, format='%Y%m%d').astype('datetime64[ns]')
data.head()
Out[5]:
plot()
method to plot the temperature with timedata.index.dayofyear
attribute.
In [6]:
# Convert Fahrenheit to Celsius
data['degC'] = (data.degF - 32) / 1.8
data.head()
Out[6]:
In [7]:
data.degC.plot();
In [8]:
data['dayofyear'] = data.index.dayofyear
x = data['dayofyear'].values
y = data['degC'].values
plt.scatter(x, y, alpha=0.1);
Here we'll practice doing a simple model, even though it will not fit our data well: fit a line of best-fit to the data with the following model:
$$ y(x) = \theta_0 + \theta_1 x $$You can use a cost function of your choice (squared deviation is the best motivated) and use either the scipy minimization or the matrix algebra formalism.
Once you've computed the best model, plot this fit over the original data. Note that the best possible fit with this model is not very good – the model is highly biased!
In [9]:
from scipy import optimize
def linear_model(theta, x):
return theta[0] + theta[1] * x
def linear_cost(theta, x, y):
return np.sum((y - linear_model(theta, x)) ** 2)
theta_guess = [0, 0]
theta_best = optimize.fmin(linear_cost, theta_guess, args=(x, y))
print(theta_best)
In [10]:
xfit = np.linspace(0, 365)
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, linear_model(theta_best, xfit), '-k');
In [11]:
X = np.vstack([np.ones_like(x),
x]).T
theta_best = np.linalg.solve(np.dot(X.T, X),
np.dot(X.T, y))
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, linear_model(theta_best, xfit), '-k');
Let's try a more complex model fit to our data. The data looks like it's sinusoidal, so we can fit a model that looks like this:
$$ y(x) = \theta_0 + \theta_1 \sin \left(\frac{2\pi x}{365}\right) + \theta_2 \cos\left(\frac{2\pi x}{365}\right) $$Note that this is still a linear model (i.e. Linear in the model parameters $\theta$), so it can also be solved via either iterative minimization, or via matrix methods.
In [12]:
def seasonal_model(theta, x):
phi = 2 * np.pi * x / 365
return theta[0] + theta[1] * np.sin(phi) + theta[2] * np.cos(phi)
def seasonal_cost(theta, x, y):
return np.sum((y - seasonal_model(theta, x)) ** 2)
theta_guess = [0, 0, 0]
theta_best = optimize.fmin(seasonal_cost, theta_guess, args=(x, y))
print(theta_best)
In [13]:
xfit = np.linspace(0, 365)
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, seasonal_model(theta_best, xfit), '-k');
In [14]:
X = np.vstack([np.ones_like(x),
np.sin(2 * np.pi * x / 365),
np.cos(2 * np.pi * x / 365)]).T
theta_best = np.linalg.solve(np.dot(X.T, X),
np.dot(X.T, y))
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, seasonal_model(theta_best, xfit), '-k');
What we've done above is a simple model to fit the annual temperature trends. Here we'll ask how the temperature varied from year to year.
In [15]:
data['detrended_temp'] = y - seasonal_model(theta_best, x)
data.head()
Out[15]:
In [16]:
data.detrended_temp.plot();
In [17]:
days = data.index.astype(int)
y_detrended = data.detrended_temp
days = (days - days[0]) / (days[1] - days[0])
plt.scatter(days, y_detrended, alpha=0.1);
In [18]:
X = np.vstack([np.ones_like(days),
days]).T
theta_best = np.linalg.solve(np.dot(X.T, X),
np.dot(X.T, y_detrended))
print("slope = {0:.2g} degrees per year".format(365 * theta_best[1]))
plt.scatter(days, y_detrended, alpha=0.1)
plt.plot(days, linear_model(theta_best, days), '-k');
So it looks like over the period in question, the average detrended temperature in oslo has been increasing ever so slightly.
This is certainly not a smoking gun for climate change (it's simply too short a baseline to say anything definitive) but this is the type of analysis that you can do to discover this sort of trend.
Our assumptions for least squares linear regression were:
In this case, the "error distribution" of (1) refers to the deviation of the points from the model. We can actually compute and plot this rather easily:
In [19]:
plt.hist(y_detrended, 100);
That does not look very Gaussian! For a more careful analysis, we'd have to have some sort of better model for this.
Assumption (2) is that all the deviations are independent. This is also not true! We can see that there is a strong correlation between the detrended y values. In words, if it was cold yesterday, it's likely to be cold today as well.
In [20]:
plt.scatter(y_detrended[:-1], y_detrended[1:]);
Assumption (3) is an assumption of frequentism; a Bayesian would disagree with this assumption in most cases. That discussion is beyond the scope of this problem set.
So in summary, our model very obviously breaks two of our three assumptions! We'll have to be more careful if we want to publish this.
In [21]:
# Define a combined model: offset + seasonal variation + linear trend
# (note: leap years might throw this off)
def combined_model(theta, days):
return (theta[0] +
theta[1] * days +
theta[2] * np.sin(2 * np.pi * days / 365.) +
theta[3] * np.cos(2 * np.pi * days / 365.))
# Solve this model using linear algebra approach
X = np.vstack([np.ones_like(days),
days,
np.sin(2 * np.pi * days / 365.),
np.cos(2 * np.pi * days / 365.)]).T
theta_best = np.linalg.solve(np.dot(X.T, X),
np.dot(X.T, y))
# Plot the best model and find the corrected trend:
print("slope = {0:.2g} degrees per year".format(365 * theta_best[1]))
plt.scatter(days, y, alpha=0.1)
plt.plot(np.sort(days), combined_model(theta_best, np.sort(days)), '-k');