Generalized Additive Models (GAMs) are smooth semi-parametric models of the form:
$$ g(\mathbb{E}[y|X]) = \beta_0 + f_1(X_1) + f_2(X_2, X_3) + \ldots + f_M(X_N) $$
where $X^T = [X_1, X_2, \ldots, X_N]$ are the independent variables, $y$ is the dependent variable, and $g(\cdot)$ is the link function that relates our predictor variables to the expected value of the dependent variable.
The feature functions f_i() are built using penalized B-splines, which allow us to automatically model non-linear relationships without having to manually try out many different transformations on each variable.
GAMs extend generalized linear models by allowing non-linear functions of features while maintaining additivity. Since the model is additive, it is easy to examine the effect of each X_i on Y individually while holding all other predictors constant.
The result is a very flexible model, where it is easy to incorporate prior knowledge and control overfitting.
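For concreteness, here is a minimal sketch (synthetic data, not an example from the pyGAM docs) of fitting an additive model with one smooth term per feature:
import numpy as np
from pygam import LinearGAM, s
# synthetic data: y is an additive combination of a smooth function of feature 0
# and of feature 1, plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.2, size=500)
# identity link, normal distribution, one penalized spline term per feature
gam = LinearGAM(s(0) + s(1)).fit(X, y)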
More formally, a GAM assumes
$$ y \sim \mathrm{ExponentialFamily}(\mu \mid X) $$
where
$$ g(\mu \mid X) = \beta_0 + f_1(X_1) + f_2(X_2, X_3) + \ldots + f_M(X_N) $$
So we can see that a GAM has 3 components:
- distribution from the exponential family
- link function $g(\cdot)$
- functional form with an additive structure $\beta_0 + f_1(X_1) + f_2(X_2, X_3) + \ldots + f_M(X_N)$

The distribution is specified via GAM(distribution='...'). Currently you can choose from the following:
- 'normal'
- 'binomial'
- 'poisson'
- 'gamma'
- 'inv_gauss'

The link function is specified via GAM(link='...'). Link functions take the distribution mean to the linear prediction. So far, the following are available:
- 'identity'
- 'logit'
- 'inverse'
- 'log'
- 'inverse-squared'

The functional form is specified in GAM(terms=...) or, more simply, GAM(...). In pyGAM, we specify the functional form using terms:
- l() linear terms: for terms like $X_i$
- s() spline terms
- f() factor terms
- te() tensor products
- intercept

With these, we can quickly and compactly build models like:
In [12]:
from pygam import GAM, s, te
GAM(s(0, n_splines=200) + te(3,1) + s(2), distribution='poisson', link='log')
Out[12]:
which specifies that we want a:
- spline function on feature 0, with 200 basis functions
- tensor product of feature 3 and feature 1
- spline function on feature 2
- Poisson distribution and a log link
Note: models include an intercept by default (fit_intercept=True in the GAM constructor).
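As a quick sketch (assuming the fit_intercept constructor flag, as in recent pyGAM versions), the intercept can be dropped explicitly:
from pygam import GAM, s
gam_default = GAM(s(0))                            # intercept included by default
gam_no_intercept = GAM(s(0), fit_intercept=False)  # assumed flag to drop the constant term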
In pyGAM you can build custom models by specifying these 3 elements, or you can choose from common models:
- LinearGAM: identity link and normal distribution
- LogisticGAM: logit link and binomial distribution
- PoissonGAM: log link and Poisson distribution
- GammaGAM: log link and gamma distribution
- InvGaussGAM: log link and inverse Gaussian distribution

The benefit of the common models is that they have some extra features, apart from reducing boilerplate code.
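The convenience classes are shorthand for the corresponding distribution/link configuration of the base GAM class; as a small sketch, these two specifications describe the same model:
from pygam import GAM, LogisticGAM, s
custom = GAM(s(0) + s(1), distribution='binomial', link='logit')
common = LogisticGAM(s(0) + s(1))  # same model, less boilerplate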
pyGAM can also fit interactions using tensor products via te(). For example, on the chicago dataset:
In [58]:
from pygam import PoissonGAM, s, te
from pygam.datasets import chicago
X, y = chicago(return_X_y=True)
gam = PoissonGAM(s(0, n_splines=200) + te(3, 1) + s(2)).fit(X, y)
and plot a 3D surface:
In [60]:
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d
plt.ion()
plt.rcParams['figure.figsize'] = (12, 8)
In [61]:
XX = gam.generate_X_grid(term=1, meshgrid=True)
Z = gam.partial_dependence(term=1, X=XX, meshgrid=True)
ax = plt.axes(projection='3d')
ax.plot_surface(XX[0], XX[1], Z, cmap='viridis')
Out[61]:
For simple interactions, it is sometimes useful to add a by-variable to a term:
In [10]:
from pygam import LinearGAM, s
from pygam.datasets import toy_interaction
X, y = toy_interaction(return_X_y=True)
gam = LinearGAM(s(0, by=1)).fit(X, y)
gam.summary()
For regression problems, we can use a LinearGAM. For example, on the wage dataset:
In [17]:
from pygam import LinearGAM, s, f
from pygam.datasets import wage
X, y = wage(return_X_y=True)
## model
gam = LinearGAM(s(0) + s(1) + f(2))
gam.gridsearch(X, y)
## plotting
plt.figure();
fig, axs = plt.subplots(1,3);
titles = ['year', 'age', 'education']
for i, ax in enumerate(axs):
XX = gam.generate_X_grid(term=i)
ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
if i == 0:
ax.set_ylim(-30,30)
ax.set_title(titles[i]);
Even though our model has many spline coefficients per term, the smoothing penalty reduces us to just 19 effective degrees of freedom:
In [4]:
gam.summary()
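Besides reading the summary, the effective degrees of freedom can also be inspected programmatically; this sketch assumes the statistics_ attribute and its 'edof' key as found in recent pyGAM versions:
# fit statistics are stored on the model after fitting
print(gam.statistics_['edof'])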
With LinearGAMs, we can also check the prediction intervals:
In [56]:
from pygam import LinearGAM
from pygam.datasets import mcycle
X, y = mcycle(return_X_y=True)
gam = LinearGAM(n_splines=25).gridsearch(X, y)
XX = gam.generate_X_grid(term=0, n=500)
plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')
plt.scatter(X, y, facecolor='gray', edgecolors='none')
plt.title('95% prediction interval');
And simulate from the posterior:
In [57]:
# continuing last example with the mcycle dataset
for response in gam.sample(X, y, quantity='y', n_draws=50, sample_at_X=XX):
plt.scatter(XX, response, alpha=.03, color='k')
plt.plot(XX, gam.predict(XX), 'r--')
plt.plot(XX, gam.prediction_intervals(XX, width=.95), color='b', ls='--')
plt.title('draw samples from the posterior of the coefficients')
Out[57]:
For binary classification problems, we can use a LogisticGAM. For example, on the default dataset:
In [11]:
from pygam import LogisticGAM, s, f
from pygam.datasets import default
X, y = default(return_X_y=True)
gam = LogisticGAM(f(0) + s(1) + s(2)).gridsearch(X, y)
fig, axs = plt.subplots(1, 3)
titles = ['student', 'balance', 'income']
for i, ax in enumerate(axs):
XX = gam.generate_X_grid(term=i)
pdep, confi = gam.partial_dependence(term=i, X=XX, width=.95)
ax.plot(XX[:, i], pdep)
ax.plot(XX[:, i], confi, c='r', ls='--')
ax.set_title(titles[i]);
We can then check the accuracy:
In [8]:
gam.accuracy(X, y)
Out[8]:
Since the scale of the Binomial distribution is known, our gridsearch minimizes the Un-Biased Risk Estimator (UBRE) objective:
In [9]:
gam.summary()
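The objective can also be requested explicitly rather than relying on the automatic choice; this sketch assumes the objective keyword of gridsearch as in recent pyGAM versions:
# equivalent fit, but explicitly minimizing UBRE during the lambda search
gam = LogisticGAM(f(0) + s(1) + s(2)).gridsearch(X, y, objective='UBRE')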
We can also perform histogram smoothing by modeling the counts in each bin as Poisson-distributed via PoissonGAM. For example, on the faithful dataset:
In [10]:
from pygam import PoissonGAM
from pygam.datasets import faithful
X, y = faithful(return_X_y=True)
gam = PoissonGAM().gridsearch(X, y)
plt.hist(faithful(return_X_y=False)['eruptions'], bins=200, color='k');
plt.plot(X, gam.predict(X), color='r')
plt.title('Best Lambda: {0:.2f}'.format(gam.lam[0][0]));
GAMs with a Normal distribution suffer from the limitation of an assumed constant variance. Sometimes this is not an appropriate assumption, because we'd like the variance of our error distribution to vary.
In this case we can resort to modeling the expectiles of a distribution.
Expectiles are intuitively similar to quantiles, but model tail expectations instead of tail mass. Although they are less interpretable, expectiles are much faster to fit, and can also be used to non-parametrically model a distribution.
In [52]:
from pygam import ExpectileGAM
from pygam.datasets import mcycle
X, y = mcycle(return_X_y=True)
# let's fit the mean model first by CV
gam50 = ExpectileGAM(expectile=0.5).gridsearch(X, y)
# and copy the smoothing to the other models
lam = gam50.lam
# now fit a few more models
gam95 = ExpectileGAM(expectile=0.95, lam=lam).fit(X, y)
gam75 = ExpectileGAM(expectile=0.75, lam=lam).fit(X, y)
gam25 = ExpectileGAM(expectile=0.25, lam=lam).fit(X, y)
gam05 = ExpectileGAM(expectile=0.05, lam=lam).fit(X, y)
In [55]:
XX = gam50.generate_X_grid(term=0, n=500)
plt.scatter(X, y, c='k', alpha=0.2)
plt.plot(XX, gam95.predict(XX), label='0.95')
plt.plot(XX, gam75.predict(XX), label='0.75')
plt.plot(XX, gam50.predict(XX), label='0.50')
plt.plot(XX, gam25.predict(XX), label='0.25')
plt.plot(XX, gam05.predict(XX), label='0.05')
plt.legend()
Out[55]:
We fit the mean model by cross-validation in order to find the best smoothing parameter lam and then copy it over to the other models.
This practice makes the expectiles less likely to cross.
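As a quick sanity check (not part of the original example), we can count grid points where neighboring expectile curves cross:
import numpy as np
# with a shared lam, the fitted expectile curves should (almost) never cross
crossings = np.sum(gam25.predict(XX) > gam75.predict(XX))
print('{} of {} grid points cross'.format(crossings, len(XX)))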
It's also easy to build custom models by using the base GAM class and specifying the distribution and the link function. For example, a log-link gamma GAM on the trees dataset:
In [27]:
from pygam import GAM
from pygam.datasets import trees
X, y = trees(return_X_y=True)
gam = GAM(distribution='gamma', link='log')
gam.gridsearch(X, y)
plt.scatter(y, gam.predict(X))
plt.xlabel('true volume')
plt.ylabel('predicted volume')
Out[27]:
We can check the quality of the fit by looking at the Pseudo R-Squared:
In [28]:
gam.summary()
With GAMs we can encode prior knowledge and control overfitting by using penalties and constraints.
Available penalties:
- second derivative smoothing (the default on numerical features)
- L2 smoothing (the default on categorical features)

Available constraints:
- monotonic increasing / monotonic decreasing smoothing
- convex / concave smoothing
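As a sketch of the penalty side (assuming the lam and penalties keywords of s(), as in recent pyGAM versions), the penalty strength is set per term via lam, and the penalty type can be swapped:
from pygam import LinearGAM, s
wiggly = LinearGAM(s(0, lam=0.01))       # weak penalty: wigglier fit
smooth = LinearGAM(s(0, lam=100))        # strong penalty: smoother fit
ridge = LinearGAM(s(0, penalties='l2'))  # L2 penalty on the spline coefficients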
We can inject our intuition into our model by using monotonic and concave constraints:
In [29]:
from pygam import LinearGAM, s
from pygam.datasets import hepatitis
X, y = hepatitis(return_X_y=True)
gam1 = LinearGAM(s(0, constraints='monotonic_inc')).fit(X, y)
gam2 = LinearGAM(s(0, constraints='concave')).fit(X, y)
fig, ax = plt.subplots(1, 2)
ax[0].plot(X, y, label='data')
ax[0].plot(X, gam1.predict(X), label='monotonic fit')
ax[0].legend()
ax[1].plot(X, y, label='data')
ax[1].plot(X, gam2.predict(X), label='concave fit')
ax[1].legend()
Out[29]:
pyGAM is intuitive, modular, and adheres to a familiar API:
In [30]:
from pygam import LogisticGAM, s, f
from pygam.datasets import toy_classification
X, y = toy_classification(return_X_y=True, n=5000)
gam = LogisticGAM(s(0) + s(1) + s(2) + s(3) + s(4) + f(5))
gam.fit(X, y)
Out[30]:
Since GAMs are additive, it is also super easy to visualize each individual feature function, f_i(X_i). These feature functions describe the effect of each X_i on y individually while marginalizing out all other predictors:
In [31]:
plt.figure()
for i, term in enumerate(gam.terms):
if term.isintercept:
continue
plt.plot(gam.partial_dependence(term=i))
Callbacks are performed during each optimization iteration. The built-in callbacks are deviance, diffs, accuracy, and coef, and it's also easy to write your own.
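As a sketch (callback names assumed from the list above), specific callbacks are requested in the constructor and record their values while fit() runs:
from pygam import LogisticGAM, s, f
# reuse X, y from the toy_classification example above; logs end up in gam_cb.logs_
gam_cb = LogisticGAM(s(0) + s(1) + s(2) + s(3) + s(4) + f(5),
                     callbacks=['deviance', 'diffs', 'coef'])
gam_cb.fit(X, y)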
You can check a callback by inspecting:
In [32]:
_ = plt.plot(gam.logs_['deviance'])
pyGAM performs linear extrapolation outside of the range of the training data, which we can visualize by computing predictions and confidence intervals beyond the data:
In [33]:
import numpy as np
from pygam import LinearGAM
from pygam.datasets import mcycle
X, y = mcycle()
gam = LinearGAM()
gam.gridsearch(X, y)
XX = gam.generate_X_grid(term=0)
m = X.min()
M = X.max()
XX = np.linspace(m - 10, M + 10, 500)
Xl = np.linspace(m - 10, m, 50)
Xr = np.linspace(M, M + 10, 50)
plt.figure()
plt.plot(XX, gam.predict(XX), 'k')
plt.plot(Xl, gam.confidence_intervals(Xl), color='b', ls='--')
plt.plot(Xr, gam.confidence_intervals(Xr), color='b', ls='--')
_ = plt.plot(X, gam.confidence_intervals(X), color='r', ls='--')
Simon N. Wood, 2006. Generalized Additive Models: An Introduction with R.
Hastie, Tibshirani & Friedman. The Elements of Statistical Learning. http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
James, Witten, Hastie & Tibshirani. An Introduction to Statistical Learning. http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Sixth%20Printing.pdf
Paul Eilers & Brian Marx, 1996. Flexible Smoothing with B-splines and Penalties. http://www.stat.washington.edu/courses/stat527/s13/readings/EilersMarx_StatSci_1996.pdf
Kim Larsen, 2015. GAM: The Predictive Modeling Silver Bullet. http://multithreaded.stitchfix.com/assets/files/gam.pdf
Deva Ramanan, 2008. UCI Machine Learning: Notes on IRLS. http://www.ics.uci.edu/~dramanan/teaching/ics273a_winter08/homework/irls_notes.pdf
Paul Eilers & Brian Marx, 2015. International Biometric Society: A Crash Course on P-splines. http://www.ibschannel2015.nl/project/userfiles/Crash_course_handout.pdf
Keiding, Niels, 1991. Age-specific incidence and prevalence: a statistical perspective.