This tutorial comes from DataRobot's blog post on multiple regression using statsmodels. I only fixed the broken links to the data.

*This is part of a series of blog posts showing how to do common statistical learning techniques with Python. We provide only a small amount of background on the concepts and techniques we cover, so if you’d like a more thorough explanation check out Introduction to Statistical Learning or sign up for the free online course run by the book's authors here.*

Earlier we covered Ordinary Least Squares regression with a single variable. In this post we will build on that by extending Linear Regression to multiple input variables, giving rise to Multiple Regression, the workhorse of statistical learning.

We first describe Multiple Regression in an intuitive way, moving from a straight line in the single-predictor case to a 2-d plane in the case of two predictors. Next we explain how to deal with categorical variables in the context of linear regression. The final section of the post investigates basic extensions, including interaction terms and fitting non-linear relationships using polynomial regression.

In Ordinary Least Squares Regression with a single variable we described the relationship between the predictor and the response with a straight line. In the case of multiple regression we extend this idea by fitting a $p$-dimensional hyperplane to our $p$ predictors.

We can show this for two predictor variables in a three-dimensional plot. In the following example we will use the advertising dataset, which consists of the *sales* of products and their advertising budgets in three different media: *TV*, *radio*, and *newspaper*.

`In [1]:`

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

# load the advertising dataset (first column is the row index)
df_adv = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
X = df_adv[['TV', 'Radio']]
y = df_adv['Sales']
df_adv.head()
```

The multiple regression model describes the response as a weighted sum of the predictors:

$Sales = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio$
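Concretely, fitting the model means choosing the $\beta$ coefficients that minimize the sum of squared residuals. As an illustration (with made-up, noiseless data standing in for the real advertising figures), the same least-squares problem can be solved directly with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
tv = rng.uniform(0, 300, size=200)       # made-up TV budgets
radio = rng.uniform(0, 50, size=200)     # made-up radio budgets
sales = 3.0 + 0.05 * tv + 0.2 * radio    # a plane with known coefficients

# design matrix: a column of ones for the intercept, then the two predictors
X = np.column_stack([np.ones_like(tv), tv, radio])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
print(beta)  # recovers [3.0, 0.05, 0.2]
```

Because the made-up data lie exactly on a plane, the solver recovers the coefficients exactly; with real, noisy data the fit returns the best-approximating plane instead.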

This model can be visualized as a 2-d plane in 3-d space:

The plot above shows data points above the hyperplane in white and points below the hyperplane in black. The color of the plane is determined by the corresponding predicted *Sales* values (blue = low, red = high). The Python code to generate the 3-d plot can be found in the appendix.

Just as with the single variable case, calling `est.summary()` will give us detailed information about the model fit. You can find a description of each of the fields in the tables below in the previous blog post here.

`In [2]:`

```python
X = df_adv[['TV', 'Radio']]
y = df_adv['Sales']
# fit an OLS model with an intercept on TV and Radio
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
est.summary()
```

statsmodels also provides a convenient R-style formula interface through `statsmodels.formula.api`. In a formula string the response appears to the left of `~`, and multiple predictors are combined with the `'+'` symbol.

`In [3]:`

```python
# import the formula api under the alias smf
import statsmodels.formula.api as smf
# formula: response ~ predictor + predictor
est = smf.ols(formula='Sales ~ TV + Radio', data=df_adv).fit()
```
Linear regression requires numeric inputs, so categorical variables need to be encoded before they can be used. To see how this works, we load the South African heart disease dataset:

`In [4]:`

```python
import pandas as pd
df = pd.read_csv('http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data', index_col=0)
# copy the data and separate the predictors from the response
X = df.copy()
y = X.pop('chd')
df.head()
```

*famhist* indicates whether the patient has a family history of coronary artery disease. The percentage of patients with coronary heart disease (*chd*) for absent versus present family history is computed below:

`In [5]:`

```python
# percentage with coronary heart disease for each famhist level
y.groupby(X.famhist).mean()
```

*famhist* is a text column with the levels 'Absent' and 'Present'. One way to use it in a regression is to first convert it to a numeric variable with `pd.Categorical`.

`In [9]:`

```python
import statsmodels.formula.api as smf
# encode df.famhist as numeric codes via pd.Categorical (0 = Absent, 1 = Present)
df['famhist_ord'] = pd.Categorical(df.famhist).codes
est = smf.ols(formula="chd ~ famhist_ord", data=df).fit()
```
A more convenient alternative is to keep the column as-is and let patsy, the formula engine used by statsmodels, create the dummy encoding automatically by wrapping the variable in the `C()` function.
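As a minimal sketch of what that looks like, on a tiny made-up stand-in for the *famhist* column (a real fit would use the SAheart data loaded above):

```python
import pandas as pd
import statsmodels.formula.api as smf

# tiny made-up stand-in for the SAheart columns (hypothetical values)
df = pd.DataFrame({
    'famhist': ['Absent', 'Present', 'Absent', 'Present', 'Absent', 'Present'],
    'chd':     [0, 1, 0, 1, 1, 0],
})
# C() tells patsy to dummy-encode famhist; 'Absent' becomes the reference level
est = smf.ols(formula='chd ~ C(famhist)', data=df).fit()
# the intercept is the mean of chd when famhist == 'Absent'; the
# C(famhist)[T.Present] coefficient is the difference between the group means
print(est.params)
```

With a single binary predictor, the fitted coefficients are simply the group mean for the reference level and the difference between the two group means.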

After dummy encoding, the equation for the fit is now:

$\hat{y} = \text{Intercept} + C(\text{famhist})[\text{T.Present}] \times I(\text{famhist} = \text{Present})$

where $I$ is the indicator function that is 1 if the argument is true and 0 otherwise.

Hence the estimated percentage with coronary heart disease when *famhist* == 'Present' is 0.2370 + 0.2630 = 0.5000, and the estimated percentage when *famhist* == 'Absent' is 0.2370.

This same approach generalizes well to cases with more than two levels. For example, if there were entries in our dataset with *famhist* equal to 'Missing', we could create two dummy variables: one to check whether *famhist* equals 'Present', and another to check whether *famhist* equals 'Missing'.
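A quick sketch of that three-level encoding with pandas' `get_dummies`, using a hypothetical column that includes a 'Missing' level:

```python
import pandas as pd

# hypothetical famhist column with a third level, 'Missing'
famhist = pd.Series(['Present', 'Absent', 'Missing', 'Present', 'Absent'])
# dummy-encode, dropping the 'Absent' column so it acts as the reference level
dummies = pd.get_dummies(famhist, prefix='famhist').drop(columns='famhist_Absent')
print(list(dummies.columns))  # ['famhist_Missing', 'famhist_Present']
```

This is exactly the encoding `C()` would build for a three-level factor: one column per non-reference level.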

To illustrate interaction terms we switch to the RAND Health Insurance Experiment data, modeling the number of doctor visits (*mdvis*) as a function of log income and a poor-health indicator (*hlthp*):

`In [7]:`

```python
df = pd.read_csv('https://raw.githubusercontent.com/statsmodels/statsmodels/master/statsmodels/datasets/randhie/src/randhie.csv')
# log-transform income; log1p(x) = log(1 + x) handles zero incomes
df["logincome"] = np.log1p(df.income)
df[['mdvis', 'logincome', 'hlthp']].tail()
```

Since *hlthp* is a binary variable, we can visualize the linear regression model by plotting two lines: one for `hlthp == 0` and one for `hlthp == 1`.

`In [8]:`

```python
plt.scatter(df.logincome, df.mdvis, alpha=0.3)
plt.xlabel('Log income')
plt.ylabel('Number of visits')
income_linspace = np.linspace(df.logincome.min(), df.logincome.max(), 100)
est = smf.ols(formula='mdvis ~ logincome + hlthp', data=df).fit()
# one line per hlthp level; without an interaction both share the same slope
plt.plot(income_linspace,
         est.params['Intercept'] + est.params['logincome'] * income_linspace, 'r')
plt.plot(income_linspace,
         est.params['Intercept'] + est.params['hlthp']
         + est.params['logincome'] * income_linspace, 'g')
short_summary(est)  # short_summary is a small helper (not shown here) printing a condensed fit summary
```
Notice that the two lines are parallel: without an interaction term, the model gives both groups the same slope (the coefficient on `logincome`).

`In [10]:`

```python
plt.scatter(df.logincome, df.mdvis, alpha=0.3)
plt.xlabel('Log income')
plt.ylabel('Number of visits')
income_linspace = np.linspace(df.logincome.min(), df.logincome.max(), 100)
est = smf.ols(formula='mdvis ~ hlthp * logincome', data=df).fit()
# hlthp == 0: intercept and slope come from the main effects alone
plt.plot(income_linspace,
         est.params['Intercept'] + est.params['logincome'] * income_linspace, 'r')
# hlthp == 1: the hlthp and interaction terms shift the intercept and the slope
plt.plot(income_linspace,
         est.params['Intercept'] + est.params['hlthp']
         + (est.params['logincome'] + est.params['hlthp:logincome']) * income_linspace, 'g')
short_summary(est)  # short_summary is a small helper (not shown here) printing a condensed fit summary
```
The `*` in the formula means that we want the interaction term in addition to each term separately (the so-called main effects). If you want to include just the interaction, use `:` instead. Dropping the main effects is generally avoided in analysis because, if a variable is important through an interaction, it almost always has an effect by itself as well.
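To see the expansion concretely, here is a sketch using patsy (the formula engine behind statsmodels) on a couple of made-up columns:

```python
import pandas as pd
from patsy import dmatrix

df = pd.DataFrame({'hlthp': [0, 1, 0, 1], 'logincome': [1.0, 2.0, 3.0, 4.0]})
# 'hlthp * logincome' expands to both main effects plus the interaction
full = dmatrix('hlthp * logincome', df, return_type='dataframe')
print(list(full.columns))  # ['Intercept', 'hlthp', 'logincome', 'hlthp:logincome']
# 'hlthp : logincome' keeps only the interaction column
inter = dmatrix('hlthp : logincome', df, return_type='dataframe')
print(list(inter.columns))  # ['Intercept', 'hlthp:logincome']
```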

To summarize what is happening here:

- If we include the categorical variable without an interaction, we get two lines, one for hlthp == 1 and one for hlthp == 0, with the same slope but different intercepts.
- If we include the interaction, each of the lines can have a different slope. This captures the effect that variation with income may be different for people who are in poor health than for people who are in better health.
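In terms of the fitted coefficients of the interaction model $mdvis \sim hlthp * logincome$, the two lines are:

$\widehat{mdvis} = \beta_0 + \beta_2 \times logincome \quad (hlthp = 0)$

$\widehat{mdvis} = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) \times logincome \quad (hlthp = 1)$

where $\beta_1$ is the *hlthp* main effect and $\beta_3$ is the *hlthp:logincome* interaction coefficient, so the interaction shifts both the intercept and the slope of the second line.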

Despite its name, linear regression can be used to fit non-linear functions. A linear regression model is linear in the model parameters, not necessarily in the predictors. If you add non-linear transformations of your predictors to the linear regression model, the model will be non-linear in the predictors.
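For instance, a quadratic in $x$ is still linear in the coefficients, so ordinary least squares can fit it once $x^2$ is added as a column. A minimal sketch with made-up data:

```python
import numpy as np

# made-up data generated from a quadratic: y = 1 + 2x - 3x^2
x = np.linspace(0.0, 2.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2
# design matrix with the non-linear transform x**2 as an extra column;
# the model is non-linear in x but linear in the parameters
X = np.column_stack([np.ones_like(x), x, x ** 2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # recovers [1.0, 2.0, -3.0]
```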

A very popular non-linear regression technique is Polynomial Regression, which models the relationship between the response and the predictors as an $n$-th order polynomial. The higher the order of the polynomial, the more "wiggly" the functions you can fit. Using higher-order polynomials comes at a price, however. First, the computational complexity of model fitting grows as the number of adaptable parameters grows. Second, more complex models have a higher risk of **overfitting**. Overfitting refers to a situation in which the model fits the idiosyncrasies of the training data and loses the ability to generalize from the seen data to predict unseen data.
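A small sketch of this risk on made-up noisy data: raising the polynomial degree always drives the training error down, even though the extra wiggles chase the noise rather than the underlying trend.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 20))
# noisy samples from a smooth underlying function
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

train_mse = {}
for degree in (3, 9):
    coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
    resid = y - np.polyval(coeffs, x)
    train_mse[degree] = np.mean(resid ** 2)

# the degree-9 fit has lower training error than the degree-3 fit,
# but its extra flexibility fits noise and typically generalizes worse
print(train_mse)
```

Training error alone therefore cannot tell us which degree to prefer; held-out data or a criterion that penalizes complexity is needed for that.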

To illustrate polynomial regression we will consider the Boston housing dataset. We'll look at the task of predicting median house values in the Boston area using the predictor *lstat*, defined as the "proportion of adults without some high school education and proportion of male workers classified as laborers" (see Hedonic House Prices and the Demand for Clean Air, Harrison & Rubinfeld, 1978).

We can clearly see that the relationship between *medv* and *lstat* is non-linear: the blue (straight) line is a poor fit; a better fit can be obtained by including higher order terms.

`In [11]:`

```python
# load the Boston housing dataset - median house values in the Boston area
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Boston.csv')
# plot lstat (% lower status of the population) against median value
plt.figure(figsize=(6 * 1.618, 6))
plt.scatter(df.lstat, df.medv, s=10, alpha=0.3)
plt.xlabel('lstat')
plt.ylabel('medv')
# points linearly spaced over the range of lstat
x = pd.DataFrame({'lstat': np.linspace(df.lstat.min(), df.lstat.max(), 100)})
# 1st-order polynomial
poly_1 = smf.ols(formula='medv ~ 1 + lstat', data=df).fit()
plt.plot(x.lstat, poly_1.predict(x), 'b-', alpha=0.9,
         label='Poly n=1 $R^2$=%.2f' % poly_1.rsquared)
# 2nd-order polynomial
poly_2 = smf.ols(formula='medv ~ 1 + lstat + I(lstat ** 2.0)', data=df).fit()
plt.plot(x.lstat, poly_2.predict(x), 'g-', alpha=0.9,
         label='Poly n=2 $R^2$=%.2f' % poly_2.rsquared)
# 3rd-order polynomial
poly_3 = smf.ols(formula='medv ~ 1 + lstat + I(lstat ** 2.0) + I(lstat ** 3.0)', data=df).fit()
plt.plot(x.lstat, poly_3.predict(x), 'r-', alpha=0.9,
         label='Poly n=3 $R^2$=%.2f' % poly_3.rsquared)
plt.legend()
```