Breakout: Linear Models and Generalized Linear Models

Here we'll get a bit of practice constructing and fitting models to data.



In [1]:

    
from __future__ import print_function, division

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# use seaborn plotting defaults
import seaborn as sns; sns.set()

1. Download and Clean the Data

Since we're in Oslo, let's take a look at daily temperatures measured in Oslo. I found this data at the following website (uncomment this code to download the data):



In [2]:

    
# !curl -O http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NOOSLO.txt
# !mv NOOSLO.txt data

We'll read it this way:



In [3]:

    
data = pd.read_csv('../data/NOOSLO.txt', delim_whitespace=True,
                   names=['month', 'day', 'year', 'degF'])
data.describe()









    Out[3]:






  
    
      
      month
      day
      year
      degF
    
  
  
    
      count
       7618.000000
       7618.000000
       7618.000000
       7618.000000
    
    
      mean
          6.489367
         15.708847
       1997.234970
         40.977330
    
    
      std
          3.437654
          8.799580
        115.026933
         18.944501
    
    
      min
          1.000000
          1.000000
        200.000000
        -99.000000
    
    
      25%
          4.000000
          8.000000
       2000.000000
         30.900000
    
    
      50%
          7.000000
         16.000000
       2005.000000
         41.700000
    
    
      75%
          9.000000
         23.000000
       2009.000000
         55.000000
    
    
      max
         12.000000
         31.000000
       2014.000000
         75.700000

Cleaning the Data

Notice that there's some craziness going on here. First of all, some of the years seem to have been mis-typed as 200. We'll want to filter those out. Also, some of the temperatures are reported as -99. This is a common value used to indicate missing data. Let's remove both of those and re-check the data description:



In [4]:

    
# Filter bad years
data = data[data.year > 200]

# Filter missing data
data = data[data.degF > -99]

data.describe()









    Out[4]:






  
    
      
      month
      day
      year
      degF
    
  
  
    
      count
       7573.000000
       7573.000000
       7573.000000
       7573.000000
    
    
      mean
          6.464149
         15.702496
       2004.581011
         41.809098
    
    
      std
          3.425024
          8.799017
          5.639991
         15.617010
    
    
      min
          1.000000
          1.000000
       1995.000000
         -8.700000
    
    
      25%
          3.000000
          8.000000
       2000.000000
         31.000000
    
    
      50%
          6.000000
         16.000000
       2005.000000
         41.800000
    
    
      75%
          9.000000
         23.000000
       2009.000000
         55.000000
    
    
      max
         12.000000
         31.000000
       2014.000000
         75.700000

Looks much better! The next thing we'll want to do is to combine the month, day, and year columns into a single date index. We'll do this using Pandas to_datetime functionality:



In [5]:

    
# Create a date index
YMD = 10000 * data.year + 100 * data.month + data.day
data.index = pd.to_datetime(YMD, format='%Y%m%d').astype('datetime64[ns]')
data.head()

2. Inspect and visualize the data

convert the fahrenheit measurement to centigrade, and add a new column with this value
use the dataframe's plot() method to plot the temperature with time
add a column to the dataframe which contains the day of the year (from 0 to 365). You can use the data.index.dayofyear attribute.
scatter-plot the day of year vs the temperature.



In [6]:

    
# Convert Fahrenheit to Celsius
data['degC'] = (data.degF - 32) / 1.8
data.head()



In [7]:

    
data.degC.plot();



In [8]:

    
data['dayofyear'] = data.index.dayofyear

x = data['dayofyear'].values
y = data['degC'].values

plt.scatter(x, y, alpha=0.1);

3. Simple model: Line of Best-fit

Here we'll practice doing a simple model, even though it will not fit our data well: fit a line of best-fit to the data with the following model:

$$ y(x) = \theta_0 + \theta_1 x $$

You can use a cost function of your choice (squared deviation is the best motivated) and use either the scipy minimization or the matrix algebra formalism.

Once you've computed the best model, plot this fit over the original data. Note that the best possible fit with this model is not very good – the model is highly biased!

Solution via Iterative Solver:



In [9]:

    
from scipy import optimize

def linear_model(theta, x):
    return theta[0] + theta[1] * x

def linear_cost(theta, x, y):
    return np.sum((y - linear_model(theta, x)) ** 2)

theta_guess = [0, 0]
theta_best = optimize.fmin(linear_cost, theta_guess, args=(x, y))
print(theta_best)









    



Optimization terminated successfully.
         Current function value: 546750.895997
         Iterations: 84
         Function evaluations: 161
[ 2.41527357  0.01673266]



In [10]:

    
xfit = np.linspace(0, 365)

plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, linear_model(theta_best, xfit), '-k');

Solution via matrix methods



In [11]:

    
X = np.vstack([np.ones_like(x),
               x]).T

theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y))
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, linear_model(theta_best, xfit), '-k');

4. More Complex model: sum of sinusoids

Let's try a more complex model fit to our data. The data looks like it's sinusoidal, so we can fit a model that looks like this:

$$ y(x) = \theta_0 + \theta_1 \sin \left(\frac{2\pi x}{365}\right) + \theta_2 \cos\left(\frac{2\pi x}{365}\right) $$

Note that this is still a linear model (i.e. Linear in the model parameters $\theta$), so it can also be solved via either iterative minimization, or via matrix methods.

Solution via iterative solver



In [12]:

    
def seasonal_model(theta, x):
    phi = 2 * np.pi * x / 365
    return theta[0] + theta[1] * np.sin(phi) + theta[2] * np.cos(phi)

def seasonal_cost(theta, x, y):
    return np.sum((y - seasonal_model(theta, x)) ** 2)

theta_guess = [0, 0, 0]
theta_best = optimize.fmin(seasonal_cost, theta_guess, args=(x, y))
print(theta_best)









    



Optimization terminated successfully.
         Current function value: 118328.777864
         Iterations: 157
         Function evaluations: 285
[  5.37023125  -3.27780921 -10.42314371]



In [13]:

    
xfit = np.linspace(0, 365)

plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, seasonal_model(theta_best, xfit), '-k');

Solution via matrix methods



In [14]:

    
X = np.vstack([np.ones_like(x),
               np.sin(2 * np.pi * x / 365),
               np.cos(2 * np.pi * x / 365)]).T

theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y))
plt.scatter(x, y, alpha=0.1)
plt.plot(xfit, seasonal_model(theta_best, xfit), '-k');

5. Annual Temperature Trends

What we've done above is a simple model to fit the annual temperature trends. Here we'll ask how the temperature varied from year to year.

add a column to the data with the de-trended temperature (that is, the true temperature minus the seasonal model from above)
from the index, compute the total number of days since the beginning of the dataset at each point (hint: try converting the dates to integer type: this will give you the time in nanoseconds)
fit a linear model to the detrended data over the entire period. What slope do you see?
How many degrees per year change does this slope correspond to?



In [15]:

    
data['detrended_temp'] = y - seasonal_model(theta_best, x)
data.head()









    Out[15]:






  
    
      
      month
      day
      year
      degF
      degC
      dayofyear
      detrended_temp
    
  
  
    
      1995-01-01
       1
       1
       1995
       24.0
       -4.444444
       1
       0.663377
    
    
      1995-01-02
       1
       2
       1995
       22.4
       -5.333333
       2
      -0.173739
    
    
      1995-01-03
       1
       3
       1995
        9.3
      -12.611111
       3
      -7.402863
    
    
      1995-01-04
       1
       4
       1995
        6.1
      -14.388889
       4
      -9.135122
    
    
      1995-01-05
       1
       5
       1995
       26.2
       -3.222222
       5
       2.073915



In [16]:

    
data.detrended_temp.plot();



In [17]:

    
days = data.index.astype(int)
y_detrended = data.detrended_temp
days = (days - days[0]) / (days[1] - days[0])
plt.scatter(days, y_detrended, alpha=0.1);



In [18]:

    
X = np.vstack([np.ones_like(days),
               days]).T

theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y_detrended))

print("slope = {0:.2g} degrees per year".format(365 * theta_best[1]))

plt.scatter(days, y_detrended, alpha=0.1)
plt.plot(days, linear_model(theta_best, days), '-k');









    



slope = 0.033 degrees per year

So it looks like over the period in question, the average detrended temperature in oslo has been increasing ever so slightly.

This is certainly not a smoking gun for climate change (it's simply too short a baseline to say anything definitive) but this is the type of analysis that you can do to discover this sort of trend.

6. Thinking about the model & the fit

It's important to think about the assumptions that we put into our models. Out of the three assumptions we mentioned in the lecture, which ones hold for this dataset? Which ones are probably suspect?

Our assumptions for least squares linear regression were:

The error distribution of the observed points is Gaussian.
The errors in observed points are independently drawn.
The model which is most likely to have led to our data is the optimal model.

In this case, the "error distribution" of (1) refers to the deviation of the points from the model. We can actually compute and plot this rather easily:



In [19]:

    
plt.hist(y_detrended, 100);

That does not look very Gaussian! For a more careful analysis, we'd have to have some sort of better model for this.

Assumption (2) is that all the deviations are independent. This is also not true! We can see that there is a strong correlation between the detrended y values. In words, if it was cold yesterday, it's likely to be cold today as well.



In [20]:

    
plt.scatter(y_detrended[:-1], y_detrended[1:]);

Assumption (3) is an assumption of frequentism; a Bayesian would disagree with this assumption in most cases. That discussion is beyond the scope of this problem set.

So in summary, our model very obviously breaks two of our three assumptions! We'll have to be more careful if we want to publish this.

7. Bonus: Fitting the model all at once

Parts 2 and 3 were approached as two separate steps, but it's possible to build a single linear model which encompasses both a seasonal variation and an annual trend. See if you can find the above temperature slope in a single step!



In [21]:

    
# Define a combined model: offset + seasonal variation + linear trend
# (note: leap years might throw this off)
def combined_model(theta, days):
    return (theta[0] +
            theta[1] * days +
            theta[2] * np.sin(2 * np.pi * days / 365.) +
            theta[3] * np.cos(2 * np.pi * days / 365.))

# Solve this model using linear algebra approach
X = np.vstack([np.ones_like(days),
               days,
               np.sin(2 * np.pi * days / 365.),
               np.cos(2 * np.pi * days / 365.)]).T
theta_best = np.linalg.solve(np.dot(X.T, X),
                             np.dot(X.T, y))

# Plot the best model and find the corrected trend:
print("slope = {0:.2g} degrees per year".format(365 * theta_best[1]))

plt.scatter(days, y, alpha=0.1)
plt.plot(np.sort(days), combined_model(theta_best, np.sort(days)), '-k');









    



slope = 0.032 degrees per year

	month	day	year	degF
count	7618.000000	7618.000000	7618.000000	7618.000000
mean	6.489367	15.708847	1997.234970	40.977330
std	3.437654	8.799580	115.026933	18.944501
min	1.000000	1.000000	200.000000	-99.000000
25%	4.000000	8.000000	2000.000000	30.900000
50%	7.000000	16.000000	2005.000000	41.700000
75%	9.000000	23.000000	2009.000000	55.000000
max	12.000000	31.000000	2014.000000	75.700000

	month	day	year	degF
count	7573.000000	7573.000000	7573.000000	7573.000000
mean	6.464149	15.702496	2004.581011	41.809098
std	3.425024	8.799017	5.639991	15.617010
min	1.000000	1.000000	1995.000000	-8.700000
25%	3.000000	8.000000	2000.000000	31.000000
50%	6.000000	16.000000	2005.000000	41.800000
75%	9.000000	23.000000	2009.000000	55.000000
max	12.000000	31.000000	2014.000000	75.700000

	month	day	year	degF
1995-01-01	1	1	1995	24.0
1995-01-02	1	2	1995	22.4
1995-01-03	1	3	1995	9.3
1995-01-04	1	4	1995	6.1
1995-01-05	1	5	1995	26.2