In regression problems, we want to find a function that describes the relationship between observations and outputs. Our observations are recorded as rows in the matrix $X$, our response values as the column vector $Y$, and our estimated linear parameters as the column vector $\beta$. We think of our linear regression model as having a structural part and a random part: $Y = X\beta + \epsilon$, where $\epsilon \sim N(0, \sigma^{2})$. The first term, $X\beta$, is structural and captures the linear relationship; the second, $\epsilon$, is the noise, or unexplained variance, in the data.
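
For reference, the least-squares estimate is the $\beta$ that minimizes $\|Y - X\beta\|^{2}$; when $X^{T}X$ is invertible, it has the closed form $\hat{\beta} = (X^{T}X)^{-1}X^{T}Y$.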

Linear regression doesn't require that the residuals $r = Y - \hat{Y}$ are normally distributed, but if they are, the least-squares solution has many nice theoretical properties. Let's generate some fake data where the predictor and response are linearly related and the residuals are normally distributed with unit variance.


In [2]:
import numpy as np
from sklearn import cross_validation as cv
import matplotlib.pyplot as plt

%matplotlib inline
# use 100 samples
n_pts = 100

# true slope of the simulated relationship
beta = np.array([3.4])
X = np.expand_dims(np.linspace(0.001, 10, n_pts), axis=1)

def lin_resp_norm_err(beta, X):
    # linear response plus standard normal (unit-variance) noise
    return np.dot(X, beta[:, np.newaxis]) + np.random.randn(X.shape[0], 1)

Y_lin_norm_err = lin_resp_norm_err(beta, X)

Let's split our data into test and train sets so that we can examine how our linear model performs on held-out data.


In [3]:
X_train, X_test, Y_train, Y_test = cv.train_test_split(X, Y_lin_norm_err, test_size=0.33, random_state=42)

We will now train a linear model with an ordinary least squares predictor. We will calculate the mean squared error on the held-out set as well as the $R^{2}$ value for the fit.
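
Concretely, if there are $n$ held-out points with predictions $\hat{y}_{i}$, the mean squared error is $\frac{1}{n}\sum_{i}(\hat{y}_{i} - y_{i})^{2}$, and $R^{2} = 1 - \frac{\sum_{i}(y_{i} - \hat{y}_{i})^{2}}{\sum_{i}(y_{i} - \bar{y})^{2}}$, which is what sklearn's score method reports.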


In [4]:
from sklearn import linear_model

lin_regr = linear_model.LinearRegression()
lin_regr.fit(X_train, Y_train)

# evaluate on the held-out set
residuals = lin_regr.predict(X_test) - Y_test
mean_error = np.mean(residuals ** 2)
regr_score = lin_regr.score(X_test, Y_test)

plt.scatter(X, Y_lin_norm_err)
plt.plot(X, lin_regr.predict(X), color='red')
plt.text(1, 30, '$R^{2}=%.3f$' % regr_score)
plt.xlabel('Predictor Variable X')
plt.ylabel('Response Variable Y')
plt.show()
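
As a sanity check, the closed-form solution from the normal equations should agree with sklearn's fit. A minimal sketch, padding the design matrix with a column of ones to match the intercept sklearn fits by default:

In [ ]:
# solve min ||Y - X beta||^2 directly
X_design = np.hstack([np.ones_like(X_train), X_train])
beta_hat = np.linalg.lstsq(X_design, Y_train)[0]
print beta_hat.ravel()  # [intercept, slope]
print lin_regr.intercept_, lin_regr.coef_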


The residuals on the held-out set should follow a normal distribution.


In [5]:
# histogram of the held-out residuals
plt.hist(residuals, bins=25)
plt.show()
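
If we want a number instead of a picture, here is a sketch of a quantitative check using scipy.stats.normaltest, which tests the null hypothesis that a sample was drawn from a normal distribution:

In [ ]:
from scipy import stats

# a large p-value means no evidence against normality of the residuals
stat, p_value = stats.normaltest(residuals.ravel())
print 'normaltest p-value: %.3f' % p_value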


We commonly think of linear regression as containing a structural ($X\beta$) part and a random ($\epsilon$) part. In generalized linear models, we instead focus on three pieces: the structural component, the link function, and the response distribution. We write $g(\mu) = X\beta$, where $g$ is the link function, $X\beta$ is the structural component, and $\mu$ is the conditional mean of the response distribution at that point in feature space. Note that we can pad the feature vectors of our observations with a '1' to capture the offset in the model, exactly as in ordinary linear regression.
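
For example, Poisson regression uses the log link $g(\mu) = \log \mu$, so the conditional mean is $\mu = e^{X\beta}$, which is always positive, exactly what we want for count data. A minimal sketch of that inverse link:

In [ ]:
# the inverse of the log link maps the linear predictor to a positive mean
def poisson_mean(X, beta):
    return np.exp(np.dot(X, beta))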

If the link function $g$ is the identity and the response is conditionally normally distributed, we are back again at normal linear regression. Here is a pretty great explanation someone wrote about GLMs over on Stack Overflow.
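
Since our fake data was generated with exactly this structure, a Gaussian-family GLM should recover essentially the same fit as the OLS model above. A minimal sketch using statsmodels (with an explicit intercept column, since GLM does not add one for us):

In [ ]:
import statsmodels.api as sm

# Gaussian family with the default identity link is ordinary least squares
glm_gauss = sm.GLM(Y_train, sm.add_constant(X_train), family=sm.families.Gaussian())
print glm_gauss.fit().params  # should be close to lin_regr.intercept_ and lin_regr.coef_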

Count data is fundamentally different from continuous data: it is always discrete (integer-valued) and nonnegative. Let's grab some real data to analyze. Here, we are grabbing grouped individual data where each row is a demographic "cell", indexed by cell number, with the duration of marriage group (dur), place of residence (res: Suva, urban, or rural), education level of the mother (educ: none, lower primary, upper primary, or secondary+), the mean number of children ever born (mean), the variance in children ever born (var), the number of women in the cell (n), and the total number of children ever born, y = mean × n.


In [56]:
import urllib2
import pandas as pd
import StringIO

response = urllib2.urlopen('http://data.princeton.edu/wws509/datasets/ceb.dat')
children_ever_born = pd.read_csv(StringIO.StringIO(response.read()), sep='\s+')
# peek at the response column
children_ever_born[['mean']]


Out[56]:
mean
1 0.50
2 1.14
3 0.90
4 0.73
5 1.17
6 0.85
7 1.05
8 0.69
9 0.97
10 0.96
11 0.97
12 0.74
13 3.10
14 2.67
15 2.04
16 1.73
17 4.54
18 2.65
19 2.68
20 2.29
21 2.44
22 2.71
23 2.47
24 2.24
25 4.08
26 3.67
27 2.90
28 2.00
29 4.17
30 3.33
... ...
41 4.70
42 5.36
43 4.60
44 3.80
45 5.06
46 5.59
47 4.50
48 2.00
49 5.62
50 5.06
51 3.92
52 2.60
53 5.36
54 5.88
55 5.00
56 5.33
57 6.46
58 6.34
59 5.74
60 2.50
61 6.60
62 6.74
63 5.38
64 2.00
65 6.52
66 7.51
67 7.54
69 7.48
70 7.81
71 5.80

70 rows × 1 columns


In [62]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

# dur, res, and educ are categorical strings, so they can't go straight
# into a numeric design matrix; the formula interface (statsmodels >= 0.5)
# dummy-codes them for us. We leave out y, which is just mean * n and
# would leak the response.
glm_ceb = smf.glm('mean ~ dur + res + educ', data=children_ever_born,
                  family=sm.families.Poisson())
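
With the categorical predictors dummy-coded, fitting proceeds as usual; a sketch of the standard fit-and-summarize step:

In [ ]:
res_ceb = glm_ceb.fit()
print res_ceb.summary()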

If we plot the predictor and response variables, we can see what we would expect from the Poisson distribution: as the mean increases, so does the variance in the data.
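
First we need some Poisson-distributed counts to plot; here is a minimal sketch that draws them synthetically, with the log of the rate linear in the predictor X from earlier (the slope of 0.4 is an arbitrary choice):

In [ ]:
# counts whose log-rate is linear in X, matching the canonical log link
lam = np.exp(0.4 * X)
Y_poisson = np.random.poisson(lam)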


In [28]:
X_train, X_test, Y_p_train, Y_p_test = cv.train_test_split(X, Y_poisson, test_size=0.33, random_state=42)
plt.scatter(X_test, Y_p_test)
plt.show()

In [31]:
glm_poisson = sm.GLM(Y_p_train, X_train, family=sm.families.Poisson())
res = glm_poisson.fit()
print res.summary()
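
To see the mean-variance relationship numerically, we can bin the counts by predictor value and compare the within-bin mean and variance (a sketch; the unit-width bins are an arbitrary choice):

In [ ]:
# for Poisson data the variance should track the mean in each slice of X
bins = np.digitize(X.ravel(), np.arange(0, 11))
for b in np.unique(bins):
    y_bin = Y_poisson.ravel()[bins == b]
    print 'bin %d: mean=%.2f, var=%.2f' % (b, y_bin.mean(), y_bin.var())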

For reference, here is the full children ever born table we loaded earlier.


In [51]:
children_ever_born

Out[51]:
dur res educ mean var n y
1 0-4 Suva none 0.50 1.14 8 4.00
2 0-4 Suva lower 1.14 0.73 21 23.94
3 0-4 Suva upper 0.90 0.67 42 37.80
4 0-4 Suva sec+ 0.73 0.48 51 37.23
5 0-4 urban none 1.17 1.06 12 14.04
6 0-4 urban lower 0.85 1.59 27 22.95
7 0-4 urban upper 1.05 0.73 39 40.95
8 0-4 urban sec+ 0.69 0.54 51 35.19
9 0-4 rural none 0.97 0.88 62 60.14
10 0-4 rural lower 0.96 0.81 102 97.92
11 0-4 rural upper 0.97 0.80 107 103.79
12 0-4 rural sec+ 0.74 0.59 47 34.78
13 5-9 Suva none 3.10 1.66 10 31.00
14 5-9 Suva lower 2.67 0.99 30 80.10
15 5-9 Suva upper 2.04 1.87 24 48.96
16 5-9 Suva sec+ 1.73 0.68 22 38.06
17 5-9 urban none 4.54 3.44 13 59.02
18 5-9 urban lower 2.65 1.51 37 98.05
19 5-9 urban upper 2.68 0.97 44 117.92
20 5-9 urban sec+ 2.29 0.81 21 48.09
21 5-9 rural none 2.44 1.93 70 170.80
22 5-9 rural lower 2.71 1.36 117 317.07
23 5-9 rural upper 2.47 1.30 81 200.07
24 5-9 rural sec+ 2.24 1.19 21 47.04
25 10-14 Suva none 4.08 1.72 12 48.96
26 10-14 Suva lower 3.67 2.31 27 99.09
27 10-14 Suva upper 2.90 1.57 20 58.00
28 10-14 Suva sec+ 2.00 1.82 12 24.00
29 10-14 urban none 4.17 2.97 18 75.06
30 10-14 urban lower 3.33 2.99 43 143.19
... ... ... ... ... ... ... ...
41 15-19 urban none 4.70 7.40 23 108.10
42 15-19 urban lower 5.36 2.97 42 225.12
43 15-19 urban upper 4.60 3.83 20 92.00
44 15-19 urban sec+ 3.80 0.70 5 19.00
45 15-19 rural none 5.06 4.91 114 576.84
46 15-19 rural lower 5.59 3.23 86 480.74
47 15-19 rural upper 4.50 3.29 30 135.00
48 15-19 rural sec+ 2.00 0.00 1 2.00
49 20-24 Suva none 5.62 4.15 21 118.02
50 20-24 Suva lower 5.06 4.64 18 91.08
51 20-24 Suva upper 3.92 4.08 12 47.04
52 20-24 Suva sec+ 2.60 4.30 5 13.00
53 20-24 urban none 5.36 7.19 22 117.92
54 20-24 urban lower 5.88 4.44 25 147.00
55 20-24 urban upper 5.00 4.33 13 65.00
56 20-24 urban sec+ 5.33 0.33 3 15.99
57 20-24 rural none 6.46 8.20 117 755.82
58 20-24 rural lower 6.34 5.72 68 431.12
59 20-24 rural upper 5.74 5.20 23 132.02
60 20-24 rural sec+ 2.50 0.50 2 5.00
61 25-29 Suva none 6.60 12.40 47 310.20
62 25-29 Suva lower 6.74 11.66 27 181.98
63 25-29 Suva upper 5.38 4.27 8 43.04
64 25-29 Suva sec+ 2.00 0.00 1 2.00
65 25-29 urban none 6.52 11.45 46 299.92
66 25-29 urban lower 7.51 10.53 45 337.95
67 25-29 urban upper 7.54 12.60 13 98.02
69 25-29 rural none 7.48 11.34 195 1458.60
70 25-29 rural lower 7.81 7.57 59 460.79
71 25-29 rural upper 5.80 7.07 10 58.00

70 rows × 7 columns

