```
In [ ]:
```%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
from utils import decorate
from thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot

```
In [ ]:
```%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')

```
In [ ]:
```brfss.head()

```
In [ ]:
```rows = brfss['_VEGESU1'] > 8   # cap implausibly high values at 8 servings per day
brfss.loc[rows, '_VEGESU1'] = 8

We can use SciPy's `linregress` to estimate servings of vegetables as a function of income class.

```
In [ ]:
```from scipy.stats import linregress
subset = brfss.dropna(subset=['INCOME2', '_VEGESU1'])
xs = subset['INCOME2']
ys = subset['_VEGESU1']
res = linregress(xs, ys)
res

Increasing income class by 1 is associated with an increase of 0.07 vegetables per day.

So if we hypothesize that people with higher incomes eat more vegetables, this result would not get us too excited.

We can see what the regression looks like by plotting the line of best fit on top of the scatter plot.

```
In [ ]:
```x_jitter = xs + np.random.normal(0, 0.15, len(xs))
plt.plot(x_jitter, ys, 'o', markersize=1, alpha=0.02)
plt.xlabel('Income code')
plt.ylabel('Vegetable servings per day')
fx1 = np.array([xs.min(), xs.max()])
fy1 = res.intercept + res.slope * fx1
plt.plot(fx1, fy1, '-', color='C1');

Now let's do it the other way around, regressing income as a function of vegetable servings.

```
In [ ]:
```xs = subset['_VEGESU1']
ys = subset['INCOME2']
res = linregress(xs, ys)
res

Again, we can plot the line of best fit on top of the scatter plot.

```
In [ ]:
```y_jitter = ys + np.random.normal(0, 0.3, len(xs))
plt.plot(xs, y_jitter, 'o', markersize=1, alpha=0.02)
plt.ylabel('Income code')
plt.xlabel('Vegetable servings per day')
fx2 = np.array([xs.min(), xs.max()])
fy2 = res.intercept + res.slope * fx2
plt.plot(fx2, fy2, '-', color='C2');

The slope looks more impressive now. Each additional serving corresponds to 0.24 income codes, and each income code is several thousand dollars. So a result that seemed unimpressive in one direction seems more intriguing in the other direction.

But the primary point here is that regression is not symmetric. To see it more clearly, I'll plot both regression lines on top of the scatter plot.

The green line is income as a function of vegetables; the orange line is vegetables as a function of income.
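One way to see the asymmetry concretely: the slope of y on x is $r s_y / s_x$, while the slope of x on y is $r s_x / s_y$, so the two slopes are not reciprocals; their product is $r^2$. Here's a minimal sketch with synthetic data (my own example, not the notebook's dataset):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

res_xy = linregress(x, y)  # y as a function of x
res_yx = linregress(y, x)  # x as a function of y

# The two slopes differ, but their product equals the squared correlation
print(res_xy.slope, res_yx.slope)
print(res_xy.slope * res_yx.slope, res_xy.rvalue**2)
```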

```
In [ ]:
```y_jitter = ys + np.random.normal(0, 0.3, len(xs))
plt.plot(xs, y_jitter, 'o', markersize=1, alpha=0.02)
plt.ylabel('Income code')
plt.xlabel('Vegetable servings per day')
fx2 = np.array([xs.min(), xs.max()])
fy2 = res.intercept + res.slope * fx2
plt.plot(fx2, fy2, '-', color='C2')
plt.plot(fy1, fx1, '-', color='C1');

And here's the same thing the other way around.

```
In [ ]:
```xs = subset['INCOME2']
ys = subset['_VEGESU1']
res = linregress(xs, ys)
res
x_jitter = xs + np.random.normal(0, 0.15, len(xs))
plt.plot(x_jitter, ys, 'o', markersize=1, alpha=0.02)
plt.xlabel('Income code')
plt.ylabel('Vegetable servings per day')
fx1 = np.array([xs.min(), xs.max()])
fy1 = res.intercept + res.slope * fx1
plt.plot(fx1, fy1, '-', color='C1')
plt.plot(fy2, fx2, '-', color='C2');

```
In [ ]:
```import statsmodels.formula.api as smf
model = smf.ols('INCOME2 ~ _VEGESU1', data=brfss)
model

The result is an `OLS` object, which we have to `fit`:

```
In [ ]:
```results = model.fit()
results

`results` contains a lot of information about the regression, which we can view using `summary`.

```
In [ ]:
```results.summary()

The results object also provides `params`, which is a Pandas Series containing the estimated parameters.

```
In [ ]:
```results.params

And `rsquared` contains the coefficient of determination, $R^2$, which is pretty small in this case.

```
In [ ]:
```results.rsquared

We can confirm that $R^2 = \rho^2$ by comparing $\sqrt{R^2}$ to the correlation:

```
In [ ]:
```np.sqrt(results.rsquared)

```
In [ ]:
```columns = ['INCOME2', '_VEGESU1']
brfss[columns].corr()

**Exercise:** Run this regression in the other direction and confirm that you get the same estimated slope we got from `linregress`. Also confirm that $R^2$ is the same in either direction (which we know because correlation is the same in either direction).

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
```%time gss = pd.read_hdf('gss.hdf5', 'gss')
gss.shape

```
In [ ]:
```gss.head()

```
In [ ]:
```gss.describe()

Let's explore the relationship between income and education, starting with simple regression:

```
In [ ]:
```model = smf.ols('realinc ~ educ', data=gss)
model

```
In [ ]:
```results = model.fit()
results.params

It looks like people with more education have higher incomes, about $3586 per additional year of education.

Now that we are using StatsModels, it is easy to add explanatory variables. For example, we can add `age` to the model like this.

```
In [ ]:
```model = smf.ols('realinc ~ educ + age', data=gss)
results = model.fit()
results.params

It looks like the effect of `age` is small, and adding it to the model has only a small effect on the estimated parameter for education.

But it's possible we are getting fooled by a nonlinear relationship. To see what the age effect looks like, I'll group by age and plot the mean income in each age group.

```
In [ ]:
```grouped = gss.groupby('age')
grouped

```
In [ ]:
```mean_income_by_age = grouped['realinc'].mean()
plt.plot(mean_income_by_age, 'o', alpha=0.5)
plt.xlabel('Age (years)')
plt.ylabel('Income (1986 $)');

Yeah, that looks like a nonlinear effect.

We can model it by adding a quadratic term to the model.

```
In [ ]:
```gss['age2'] = gss['age']**2

```
In [ ]:
```model = smf.ols('realinc ~ educ + age + age2', data=gss)
results = model.fit()
results.summary()

With the quadratic term in the model, the estimated parameter for `age` is substantially larger. And the coefficient of the quadratic term is negative, which is consistent with the observation that the relationship has downward curvature.
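With income modeled as $a + b\,\mathrm{age} + c\,\mathrm{age}^2$ and $c < 0$, the fitted curve is a downward-opening parabola that peaks where the derivative is zero, at age $= -b/(2c)$. A quick sketch with made-up coefficients (hypothetical values, not the actual estimates):

```python
# Hypothetical coefficients, roughly in the spirit of the fitted model
b, c = 1700.0, -17.0      # c < 0 means downward curvature

# Vertex of the parabola: where d(income)/d(age) = b + 2*c*age = 0
peak_age = -b / (2 * c)
print(peak_age)           # 50.0
```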

**Exercise:** To see what the relationship between income and education looks like, group the dataset by `educ` and plot mean income at each education level.

```
In [ ]:
``````
# Solution goes here
```

**Exercise:** Maybe the relationship with education is nonlinear, too. Add a quadratic term for `educ` to the model and summarize the results.

```
In [ ]:
```gss['educ2'] = gss['educ']**2

```
In [ ]:
```model = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss)
results = model.fit()
results.summary()

The parameters of a nonlinear model can be hard to interpret, but maybe we don't have to. Sometimes it is easier to judge a model by its predictions rather than its parameters.

The results object provides a `predict` method that takes a `DataFrame` and uses the model to generate a prediction for each row. Here's how we can create the `DataFrame`:

```
In [ ]:
```df = pd.DataFrame()
df['age'] = np.linspace(18, 85)
df['age2'] = df['age']**2

`age` contains equally spaced points from 18 to 85, and `age2` contains those values squared.

Now we can set `educ` to 12 years of education and generate predictions:

```
In [ ]:
```plt.plot(mean_income_by_age, 'o', alpha=0.5)
df['educ'] = 12
df['educ2'] = df['educ']**2
pred12 = results.predict(df)
plt.plot(df['age'], pred12, label='High school')
plt.xlabel('Age (years)')
plt.ylabel('Income (1986 $)')
plt.legend();

This plot shows the structure of the model, which is a parabola. For comparison, we also plot the mean income in each age group.

**Exercise:** Generate the same plot, but show predictions for three levels of education: 12, 14, and 16 years.

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
```formula = 'realinc ~ educ + educ2 + age + age2 + C(sex)'
results = smf.ols(formula, data=gss).fit()
results.params

The estimated parameter indicates that `sex=2`, which indicates women, is associated with about \$4150 lower income, after controlling for age and education.
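For reference, `C(sex)` tells the formula parser to treat `sex` as categorical and expand it into indicator columns, with one level absorbed into the intercept; that's why the output shows a single parameter for `sex=2`, relative to `sex=1`. A rough sketch of the same encoding using pandas (toy data of my own, not the GSS):

```python
import pandas as pd

df = pd.DataFrame({'sex': [1, 2, 2, 1]})

# drop_first mimics treatment coding: sex=1 is the reference level,
# so only an indicator column for sex=2 remains
dummies = pd.get_dummies(df['sex'], prefix='sex', drop_first=True)
print(dummies)
```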

**Exercise**: Use `groupby` to group respondents by `educ`, then plot mean `realinc` for each education level.

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

**Exercise:** Make a `DataFrame` with a range of values for `educ` and constant `age=30`. Compute `age2` and `educ2` accordingly.

Use this `DataFrame` to generate predictions for each level of education, holding age constant. Generate and plot separate predictions for men and women.

Also plot the data for comparison.

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

Let's use logistic regression to see what factors are associated with support for gun control. The variable we'll use is `gunlaw`, which represents the response to this question: "Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?"

Here are the values.

```
In [ ]:
```gss['gunlaw'].value_counts()

1 means yes, 2 means no, 0 means the question wasn't asked; 8 and 9 mean the respondent doesn't know or refused to answer.

First I'll replace 0, 8, and 9 with NaN

```
In [ ]:
```gss['gunlaw'].replace([0, 8, 9], np.nan, inplace=True)

To put `gunlaw` on the left side of a regression, we have to recode it so 0 means no and 1 means yes.

```
In [ ]:
```gss['gunlaw'].replace(2, 0, inplace=True)

Here's what it looks like after recoding.

```
In [ ]:
```gss['gunlaw'].value_counts()

Now we can run a logistic regression model.

```
In [ ]:
```results = smf.logit('gunlaw ~ age + age2 + educ + educ2 + C(sex)', data=gss).fit()

Here are the results.

```
In [ ]:
```results.summary()

The coefficient of `sex=2` is positive, which indicates that women are more likely to support gun control, at least for this question.

```
In [ ]:
```results.params

The other parameters are not easy to interpret, but again we can use the regression results to generate predictions, which makes it possible to visualize the model.
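One reason the parameters are hard to read is that logistic regression works on a log-odds scale; predictions come back as probabilities via the logistic function. A minimal sketch of the transform (independent of the fitted model):

```python
from scipy.special import expit

# log-odds -> probability: expit(z) = 1 / (1 + exp(-z))
print(expit(0.0))   # log-odds of 0 corresponds to probability 0.5
print(expit(1.0))   # one unit of log-odds moves 0.5 to about 0.73
```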

I'll make a `DataFrame` with a range of ages and a fixed level of education, and generate predictions for men and women.

```
In [ ]:
```grouped = gss.groupby('age')
favor_by_age = grouped['gunlaw'].mean()
plt.plot(favor_by_age, 'o', alpha=0.5)
df = pd.DataFrame()
df['age'] = np.linspace(18, 89)
df['educ'] = 12
df['age2'] = df['age']**2
df['educ2'] = df['educ']**2
df['sex'] = 1
pred = results.predict(df)
plt.plot(df['age'], pred, label='Male')
df['sex'] = 2
pred = results.predict(df)
plt.plot(df['age'], pred, label='Female')
plt.xlabel('Age')
plt.ylabel('Probability of favoring gun law')
plt.legend();

Over the range of ages, women are more likely to support gun control than men, by about 15 percentage points.

**Exercise:** Generate a similar plot as a function of education, with constant `age=40`.

```
In [ ]:
``````
# Solution goes here
```

**Exercise:** Use the variable `grass` to explore support for legalizing marijuana. This variable records the response to this question: "Do you think the use of marijuana should be made legal or not?"

Recode `grass` for use with logistic regression. Run a regression model with age, education, and sex as explanatory variables.

Use the model to generate predictions for a range of ages, with education held constant, and plot the predictions for men and women. Also plot the mean level of support in each age group.

Use the model to generate predictions for a range of education levels, with age held constant, and plot the predictions for men and women. Also plot the mean level of support at each education level.

Note: This last graph might not look like a parabola. Why not?

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
``````
# Solution goes here
```

```
In [ ]:
```