Breakout: Linear Models and Generalized Linear Models

Here we'll get a bit of practice constructing and fitting models to data.


In [1]:
from __future__ import print_function, division

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# use seaborn plotting defaults
import seaborn as sns; sns.set()

1. Download and Clean the Data

Since we're in Oslo, let's take a look at daily temperatures measured in Oslo. I found this data at the following website (uncomment this code to download the data):


In [2]:
# !curl -O http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NOOSLO.txt
# !mv NOOSLO.txt data

We'll read it this way:


In [3]:
data = pd.read_csv('../data/NOOSLO.txt', delim_whitespace=True,
                   names=['month', 'day', 'year', 'degF'])
data.describe()


Out[3]:
month day year degF
count 7618.000000 7618.000000 7618.000000 7618.000000
mean 6.489367 15.708847 1997.234970 40.977330
std 3.437654 8.799580 115.026933 18.944501
min 1.000000 1.000000 200.000000 -99.000000
25% 4.000000 8.000000 2000.000000 30.900000
50% 7.000000 16.000000 2005.000000 41.700000
75% 9.000000 23.000000 2009.000000 55.000000
max 12.000000 31.000000 2014.000000 75.700000

Cleaning the Data

Notice that there's some craziness going on here. First of all, some of the years seem to have been mis-typed as 200. We'll want to filter those out. Also, some of the temperatures are reported as -99. This is a common value used to indicate missing data. Let's remove both of those and re-check the data description:


In [4]:
# Filter bad years
data = data[data.year > 200]

# Filter missing data
data = data[data.degF > -99]

data.describe()


Out[4]:
month day year degF
count 7573.000000 7573.000000 7573.000000 7573.000000
mean 6.464149 15.702496 2004.581011 41.809098
std 3.425024 8.799017 5.639991 15.617010
min 1.000000 1.000000 1995.000000 -8.700000
25% 3.000000 8.000000 2000.000000 31.000000
50% 6.000000 16.000000 2005.000000 41.800000
75% 9.000000 23.000000 2009.000000 55.000000
max 12.000000 31.000000 2014.000000 75.700000

Looks much better! The next thing we'll want to do is to combine the month, day, and year columns into a single date index. We'll do this using Pandas to_datetime functionality:


In [5]:
# Create a date index
YMD = 10000 * data.year + 100 * data.month + data.day
data.index = pd.to_datetime(YMD, format='%Y%m%d').astype('datetime64[ns]')
data.head()


Out[5]:
month day year degF
1995-01-01 1 1 1995 24.0
1995-01-02 1 2 1995 22.4
1995-01-03 1 3 1995 9.3
1995-01-04 1 4 1995 6.1
1995-01-05 1 5 1995 26.2

2. Inspect and visualize the data

  • convert the fahrenheit measurement to centigrade, and add a new column with this value
  • use the dataframe's plot() method to plot the temperature with time
  • add a column to the dataframe which contains the day of the year (from 0 to 365). You can use the data.index.dayofyear attribute.
  • scatter-plot the day of year vs the temperature.

In [6]:
# Convert Fahrenheit to Celsius
data['degC'] = (data.degF - 32) / 1.8
data.head()


Out[6]:
month day year degF degC
1995-01-01 1 1 1995 24.0 -4.444444
1995-01-02 1 2 1995 22.4 -5.333333
1995-01-03 1 3 1995 9.3 -12.611111
1995-01-04 1 4 1995 6.1 -14.388889
1995-01-05 1 5 1995 26.2 -3.222222

In [7]:
data.degC.plot();



In [8]:
data['dayofyear'] = data.index.dayofyear

x = data['dayofyear'].values
y = data['degC'].values

plt.scatter(x, y, alpha=0.1);


3. Simple model: Line of Best-fit

Here we'll practice doing a simple model, even though it will not fit our data well: fit a line of best-fit to the data with the following model:

$$ y(x) = \theta_0 + \theta_1 x $$

You can use a cost function of your choice (squared deviation is the best motivated) and use either the scipy minimization or the matrix algebra formalism.

Once you've computed the best model, plot this fit over the original data. Note that the best possible fit with this model is not very good – the model is highly biased!

Solution via Iterative Solver:


In [9]:
from scipy import optimize

def linear_model(theta, x):
    return theta[0] + theta[1] * x

def linear_cost(theta, x, y):
    return np.sum((y - linear_model(theta, x)) ** 2)

theta_guess = [0, 0]
theta_best = optimize.fmin(linear_cost, theta_guess, args=(x, y))
print(theta_best)


Optimization terminated successfully.
         Current function value: 546750.895997
         Iterations: 84
         Function evaluations: 161
[ 2.41527357  0.01673266]

In [10]:
xfit = np.linspace(0, 365)

plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, linear_model(theta_best, xfit), '-k');


Solution via matrix methods


In [11]:
X = np.vstack([np.ones_like(x),
               x]).T

theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y))
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, linear_model(theta_best, xfit), '-k');


4. More Complex model: sum of sinusoids

Let's try a more complex model fit to our data. The data looks like it's sinusoidal, so we can fit a model that looks like this:

$$ y(x) = \theta_0 + \theta_1 \sin \left(\frac{2\pi x}{365}\right) + \theta_2 \cos\left(\frac{2\pi x}{365}\right) $$

Note that this is still a linear model (i.e. Linear in the model parameters $\theta$), so it can also be solved via either iterative minimization, or via matrix methods.

Solution via iterative solver


In [12]:
def seasonal_model(theta, x):
    phi = 2 * np.pi * x / 365
    return theta[0] + theta[1] * np.sin(phi) + theta[2] * np.cos(phi)

def seasonal_cost(theta, x, y):
    return np.sum((y - seasonal_model(theta, x)) ** 2)

theta_guess = [0, 0, 0]
theta_best = optimize.fmin(seasonal_cost, theta_guess, args=(x, y))
print(theta_best)


Optimization terminated successfully.
         Current function value: 118328.777864
         Iterations: 157
         Function evaluations: 285
[  5.37023125  -3.27780921 -10.42314371]

In [13]:
xfit = np.linspace(0, 365)

plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, seasonal_model(theta_best, xfit), '-k');


Solution via matrix methods


In [14]:
X = np.vstack([np.ones_like(x),
               np.sin(2 * np.pi * x / 365),
               np.cos(2 * np.pi * x / 365)]).T

theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y))
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, seasonal_model(theta_best, xfit), '-k');


What we've done above is a simple model to fit the annual temperature trends. Here we'll ask how the temperature varied from year to year.

  • add a column to the data with the de-trended temperature (that is, the true temperature minus the seasonal model from above)
  • from the index, compute the total number of days since the beginning of the dataset at each point (hint: try converting the dates to integer type: this will give you the time in nanoseconds)
  • fit a linear model to the detrended data over the entire period. What slope do you see?
  • How many degrees per year change does this slope correspond to?

In [15]:
data['detrended_temp'] = y - seasonal_model(theta_best, x)
data.head()


Out[15]:
month day year degF degC dayofyear detrended_temp
1995-01-01 1 1 1995 24.0 -4.444444 1 0.663377
1995-01-02 1 2 1995 22.4 -5.333333 2 -0.173739
1995-01-03 1 3 1995 9.3 -12.611111 3 -7.402863
1995-01-04 1 4 1995 6.1 -14.388889 4 -9.135122
1995-01-05 1 5 1995 26.2 -3.222222 5 2.073915

In [16]:
data.detrended_temp.plot();



In [17]:
days = data.index.astype(int)
y_detrended = data.detrended_temp
days = (days - days[0]) / (days[1] - days[0])
plt.scatter(days, y_detrended, alpha=0.1);



In [18]:
X = np.vstack([np.ones_like(days),
               days]).T

theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y_detrended))

print("slope = {0:.2g} degrees per year".format(365 * theta_best[1]))

plt.scatter(days, y_detrended, alpha=0.1)
plt.plot(days, linear_model(theta_best, days), '-k');


slope = 0.033 degrees per year

So it looks like over the period in question, the average detrended temperature in oslo has been increasing ever so slightly.

This is certainly not a smoking gun for climate change (it's simply too short a baseline to say anything definitive) but this is the type of analysis that you can do to discover this sort of trend.

6. Thinking about the model & the fit

It's important to think about the assumptions that we put into our models. Out of the three assumptions we mentioned in the lecture, which ones hold for this dataset? Which ones are probably suspect?

Our assumptions for least squares linear regression were:

  1. The error distribution of the observed points is Gaussian.
  2. The errors in observed points are independently drawn.
  3. The model which is most likely to have led to our data is the optimal model.

In this case, the "error distribution" of (1) refers to the deviation of the points from the model. We can actually compute and plot this rather easily:


In [19]:
plt.hist(y_detrended, 100);


That does not look very Gaussian! For a more careful analysis, we'd have to have some sort of better model for this.

Assumption (2) is that all the deviations are independent. This is also not true! We can see that there is a strong correlation between the detrended y values. In words, if it was cold yesterday, it's likely to be cold today as well.


In [20]:
plt.scatter(y_detrended[:-1], y_detrended[1:]);


Assumption (3) is an assumption of frequentism; a Bayesian would disagree with this assumption in most cases. That discussion is beyond the scope of this problem set.

So in summary, our model very obviously breaks two of our three assumptions! We'll have to be more careful if we want to publish this.

7. Bonus: Fitting the model all at once

Parts 2 and 3 were approached as two separate steps, but it's possible to build a single linear model which encompasses both a seasonal variation and an annual trend. See if you can find the above temperature slope in a single step!


In [21]:
# Define a combined model: offset + seasonal variation + linear trend
# (note: leap years might throw this off)
def combined_model(theta, days):
    return (theta[0] +
            theta[1] * days +
            theta[2] * np.sin(2 * np.pi * days / 365.) +
            theta[3] * np.cos(2 * np.pi * days / 365.))

# Solve this model using linear algebra approach
X = np.vstack([np.ones_like(days),
               days,
               np.sin(2 * np.pi * days / 365.),
               np.cos(2 * np.pi * days / 365.)]).T
theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y))

# Plot the best model and find the corrected trend:
print("slope = {0:.2g} degrees per year".format(365 * theta_best[1]))

plt.scatter(days, y, alpha=0.1)
plt.plot(np.sort(days), combined_model(theta_best, np.sort(days)), '-k');


slope = 0.032 degrees per year