Breakout: Linear Models and Generalized Linear Models

Here we'll get a bit of practice constructing and fitting models to data.


In [ ]:
from __future__ import print_function, division

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# use seaborn plotting defaults
import seaborn as sns; sns.set()

1. Download and Clean the Data

Since we're in Oslo, let's take a look at daily temperatures measured in Oslo. I found this data at the following website (uncomment this code to download the data):


In [ ]:
# !curl -O http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NOOSLO.txt
# !mv NOOSLO.txt data

We'll read it this way:


In [ ]:
# sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
data = pd.read_csv('data/NOOSLO.txt', sep=r'\s+',
                   names=['month', 'day', 'year', 'degF'])
data.describe()

Cleaning the Data

Notice that there's some craziness going on here. First of all, some of the years seem to have been mis-typed as 200. We'll want to filter those out. Also, some of the temperatures are reported as -99. This is a common value used to indicate missing data. Let's remove both of those and re-check the data description:


In [ ]:
# Filter bad years
data = data[data.year > 200]

# Filter missing data
data = data[data.degF > -99]

data.describe()

Looks much better! The next thing we'll want to do is to combine the month, day, and year columns into a single date index. We'll do this using Pandas to_datetime functionality:


In [ ]:
# Create a date index
YMD = 10000 * data.year + 100 * data.month + data.day
data.index = pd.to_datetime(YMD, format='%Y%m%d').astype('datetime64[ns]')
data.head()

2. Inspect and visualize the data

  • convert the Fahrenheit measurement to Celsius, and add a new column with this value
  • use the dataframe's plot() method to plot the temperature as a function of time
  • add a column to the dataframe which contains the day of the year (from 1 to 366). You can use the data.index.dayofyear attribute.
  • scatter-plot the day of year vs the temperature.
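One possible sketch of the column manipulations in the steps above. So that it runs without the downloaded file, it uses a small synthetic frame called demo (a hypothetical stand-in for the cleaned data); the column names degC and dayofyear are our own choice, not something fixed by the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned Oslo data (made-up values, two years of days)
dates = pd.date_range('1995-01-01', '1996-12-31', freq='D')
doy = np.asarray(dates.dayofyear, dtype=float)
demo = pd.DataFrame({'degF': 40 - 25 * np.cos(2 * np.pi * doy / 365)},
                    index=dates)

# Fahrenheit to Celsius: subtract 32, then scale by 5/9
demo['degC'] = (demo['degF'] - 32) * 5 / 9

# Day of the year, counted from 1 (366 occurs only in leap years)
demo['dayofyear'] = demo.index.dayofyear
```

With matplotlib available, demo['degC'].plot() draws the time series and demo.plot.scatter('dayofyear', 'degC') draws the day-of-year scatter plot.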

3. Simple Model: Line of Best Fit

Here we'll practice fitting a simple model, even though it will not describe our data well: fit a straight line to the data using the following model:

$$ y(x) = \theta_0 + \theta_1 x $$

You can use a cost function of your choice (squared deviation is the best motivated) and use either the scipy minimization or the matrix algebra formalism.

Once you've computed the best model, plot this fit over the original data. Note that the best possible fit with this model is not very good – the model is highly biased!
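As a rough illustration of the matrix-algebra route, here is a minimal sketch on synthetic data (the x and y arrays below are made up, standing in for day of year and temperature). It assumes the squared-deviation cost; squared_cost is a hypothetical helper of our own that could equally be handed to an iterative minimizer.

```python
import numpy as np

# Synthetic stand-in: a noisy seasonal curve sampled once per day
rng = np.random.RandomState(0)
x = np.arange(1, 366, dtype=float)
y = 5.0 - 10.0 * np.cos(2 * np.pi * x / 365) + rng.normal(0, 2, x.size)

# Design matrix with columns [1, x], so that y_model = X @ theta
X = np.column_stack([np.ones_like(x), x])

def squared_cost(theta, X, y):
    """Sum of squared deviations between the data and the model."""
    return np.sum((y - X @ theta) ** 2)

# Least-squares solution via numpy's solver
theta = np.linalg.lstsq(X, y, rcond=None)[0]

# The same answer from the normal equations (X^T X) theta = X^T y
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

y_fit = X @ theta  # fitted line, ready to plot over the data
```

The two solutions should agree to numerical precision; the fitted line still misses the seasonal shape, which is the bias the text refers to.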

4. More Complex Model: Sum of Sinusoids

Let's try a more complex model fit to our data. The data looks like it's sinusoidal, so we can fit a model that looks like this:

$$ y(x) = \theta_0 + \theta_1 \sin \left(\frac{2\pi x}{365}\right) + \theta_2 \cos\left(\frac{2\pi x}{365}\right) $$

Note that this is still a linear model (i.e. linear in the model parameters $\theta$), so it can also be solved either by iterative minimization or by matrix methods.
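Because the model is linear in $\theta$, the only change from the straight-line fit is the design matrix. A minimal sketch, again on synthetic data: seasonal_design is a hypothetical helper name, and true_theta holds made-up coefficients used only to check that the fit recovers them.

```python
import numpy as np

def seasonal_design(x, period=365):
    """Design matrix with columns [1, sin(2 pi x / period), cos(2 pi x / period)]."""
    x = np.asarray(x, dtype=float)
    phase = 2 * np.pi * x / period
    return np.column_stack([np.ones_like(x), np.sin(phase), np.cos(phase)])

# Synthetic observations generated from known (made-up) coefficients
rng = np.random.RandomState(1)
x = rng.randint(1, 366, size=2000)
true_theta = np.array([6.0, -3.0, -10.0])
y = seasonal_design(x) @ true_theta + rng.normal(0, 1.0, x.size)

# Linear least squares recovers theta_0, theta_1, theta_2
theta = np.linalg.lstsq(seasonal_design(x), y, rcond=None)[0]

# Amplitude of the seasonal swing implied by the sine and cosine terms
amplitude = np.hypot(theta[1], theta[2])
```

The sine and cosine coefficients together encode an amplitude and a phase, which is why two terms are needed even though the data show a single annual cycle.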

5. Fitting the Year-to-Year Trend

What we've done above is fit a simple model of the seasonal temperature cycle. Here we'll ask how the temperature varied from year to year.

  • add a column to the data with the de-trended temperature (that is, the true temperature minus the seasonal model from above)
  • from the index, compute the total number of days since the beginning of the dataset at each point (hint: try converting the dates to integer type: this will give you the time in nanoseconds)
  • fit a linear model to the detrended data over the entire period. What slope do you see?
  • How many degrees of change per year does this slope correspond to?
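The steps above might be sketched as follows. Everything here is synthetic: the 0.1 degrees-per-year trend is a made-up value planted in the data so we can check that the procedure finds it. Instead of the integer-nanosecond hint, this sketch divides a time difference by one day, which gives elapsed days directly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: ten years of daily data with a planted trend
dates = pd.date_range('1995-01-01', '2004-12-31', freq='D')
rng = np.random.RandomState(2)
days = np.asarray((dates - dates[0]) / pd.Timedelta(days=1), dtype=float)
doy = np.asarray(dates.dayofyear, dtype=float)

planted_per_year = 0.1  # made-up trend, degrees C per year
degC = (6.0 - 10.0 * np.cos(2 * np.pi * doy / 365)
        + planted_per_year / 365.25 * days
        + rng.normal(0, 1.0, days.size))
demo = pd.DataFrame({'degC': degC, 'dayofyear': doy, 'days': days}, index=dates)

# Step 1: fit the seasonal model and subtract it
phase = 2 * np.pi * demo['dayofyear'].values / 365
X_season = np.column_stack([np.ones_like(phase), np.sin(phase), np.cos(phase)])
theta_season = np.linalg.lstsq(X_season, demo['degC'].values, rcond=None)[0]
demo['detrended'] = demo['degC'] - X_season @ theta_season

# Step 2: fit a straight line to the de-trended temperatures
X_trend = np.column_stack([np.ones_like(days), days])
theta_trend = np.linalg.lstsq(X_trend, demo['detrended'].values, rcond=None)[0]

# The slope is in degrees per day; multiply by days per year to convert
slope_per_year = theta_trend[1] * 365.25
```

On the real data the slope comes out in whatever temperature unit the fitted column uses, so fitting degC gives degrees Celsius per year.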

6. Thinking about the model & the fit

It's important to think about the assumptions that we put into our models. Out of the three assumptions we mentioned in the lecture, which ones hold for this dataset? Which ones are probably suspect?

7. Bonus: Fitting the model all at once

The seasonal fit and the year-to-year trend were approached as two separate steps, but it's possible to build a single linear model which encompasses both the seasonal variation and the long-term trend. See if you can find the above temperature slope in a single step!
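One way to set this up, sketched on synthetic data with a planted (made-up) trend of 0.1 degrees per year: put the constant, the two seasonal terms, and the elapsed time into a single design matrix and solve once. The column ordering is our own choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: ten years of daily data with a planted trend
dates = pd.date_range('1995-01-01', '2004-12-31', freq='D')
rng = np.random.RandomState(3)
days = np.asarray((dates - dates[0]) / pd.Timedelta(days=1), dtype=float)
doy = np.asarray(dates.dayofyear, dtype=float)

planted_per_year = 0.1  # made-up trend, degrees C per year
degC = (6.0 - 10.0 * np.cos(2 * np.pi * doy / 365)
        + planted_per_year / 365.25 * days
        + rng.normal(0, 1.0, days.size))

# Single design matrix: [1, sin, cos, elapsed days]
phase = 2 * np.pi * doy / 365
X = np.column_stack([np.ones_like(days), np.sin(phase), np.cos(phase), days])
theta = np.linalg.lstsq(X, degC, rcond=None)[0]

# The last coefficient is the trend in degrees per day
slope_per_year = theta[3] * 365.25
```

Fitting everything at once lets the seasonal and trend terms be estimated jointly, rather than having the seasonal fit absorb part of the trend first.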