In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style='white')
from utils import decorate
from thinkstats2 import Pmf, Cdf
import thinkstats2
import thinkplot
In [ ]:
%time brfss = pd.read_hdf('brfss.hdf5', 'brfss')
In [ ]:
brfss.head()
A few people report many vegetable servings per day. To simplify the visualization, I'm going to replace values greater than 8 with 8.
In [ ]:
rows = brfss['_VEGESU1'] > 8
brfss.loc[rows, '_VEGESU1'] = 8
We can use SciPy's linregress to compute a simple regression of vegetable servings as a function of income class.
In [ ]:
from scipy.stats import linregress
subset = brfss.dropna(subset=['INCOME2', '_VEGESU1'])
xs = subset['INCOME2']
ys = subset['_VEGESU1']
res = linregress(xs, ys)
res
Increasing income class by 1 is associated with an increase of 0.07 vegetable servings per day.
So if we hypothesize that people with higher incomes eat more vegetables, this result would not get us too excited.
We can see what the regression looks like by plotting the line of best fit on top of the scatter plot.
In [ ]:
x_jitter = xs + np.random.normal(0, 0.15, len(xs))
plt.plot(x_jitter, ys, 'o', markersize=1, alpha=0.02)
plt.xlabel('Income code')
plt.ylabel('Vegetable servings per day')
fx1 = np.array([xs.min(), xs.max()])
fy1 = res.intercept + res.slope * fx1
plt.plot(fx1, fy1, '-', color='C1');
Now let's do it the other way around, regressing income as a function of vegetable servings.
In [ ]:
xs = subset['_VEGESU1']
ys = subset['INCOME2']
res = linregress(xs, ys)
res
Again, we can plot the line of best fit on top of the scatter plot.
In [ ]:
y_jitter = ys + np.random.normal(0, 0.3, len(xs))
plt.plot(xs, y_jitter, 'o', markersize=1, alpha=0.02)
plt.ylabel('Income code')
plt.xlabel('Vegetable servings per day')
fx2 = np.array([xs.min(), xs.max()])
fy2 = res.intercept + res.slope * fx2
plt.plot(fx2, fy2, '-', color='C2');
The slope looks more impressive now. Each additional serving corresponds to 0.24 income codes, and each income code is several thousand dollars. So a result that seemed unimpressive in one direction seems more intriguing in the other direction.
But the primary point here is that regression is not symmetric. To see it more clearly, I'll plot both regression lines on top of the scatter plot.
The green line is income as a function of vegetables; the orange line is vegetables as a function of income.
In [ ]:
y_jitter = ys + np.random.normal(0, 0.3, len(xs))
plt.plot(xs, y_jitter, 'o', markersize=1, alpha=0.02)
plt.ylabel('Income code')
plt.xlabel('Vegetable servings per day')
fx2 = np.array([xs.min(), xs.max()])
fy2 = res.intercept + res.slope * fx2
plt.plot(fx2, fy2, '-', color='C2')
plt.plot(fy1, fx1, '-', color='C1');
And here's the same thing the other way around.
In [ ]:
xs = subset['INCOME2']
ys = subset['_VEGESU1']
res = linregress(xs, ys)
res
x_jitter = xs + np.random.normal(0, 0.15, len(xs))
plt.plot(x_jitter, ys, 'o', markersize=1, alpha=0.02)
plt.xlabel('Income code')
plt.ylabel('Vegetable servings per day')
fx1 = np.array([xs.min(), xs.max()])
fy1 = res.intercept + res.slope * fx1
plt.plot(fx1, fy1, '-', color='C1')
plt.plot(fy2, fx2, '-', color='C2');
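Incidentally, the asymmetry has a tidy algebraic form: the slope of y on x is r·s_y/s_x and the slope of x on y is r·s_x/s_y, so the two slopes multiply to r². Here's a quick check on made-up data (not the BRFSS data):

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data for illustration only
rng = np.random.default_rng(17)
x = rng.normal(5, 2, size=1000)
y = 0.5 * x + rng.normal(0, 3, size=1000)

res_yx = linregress(x, y)   # slope of y as a function of x
res_xy = linregress(y, x)   # slope of x as a function of y

# The slopes differ, but their product equals the squared correlation
r = np.corrcoef(x, y)[0, 1]
print(res_yx.slope * res_xy.slope, r**2)
```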
In [ ]:
import statsmodels.formula.api as smf
model = smf.ols('INCOME2 ~ _VEGESU1', data=brfss)
model
The result is an OLS object, which we have to fit:
In [ ]:
results = model.fit()
results
results contains a lot of information about the regression, which we can view using summary.
In [ ]:
results.summary()
One of the parts we're interested in is params, which is a Pandas Series containing the estimated parameters.
In [ ]:
results.params
And rsquared contains the coefficient of determination, $R^2$, which is pretty small in this case.
In [ ]:
results.rsquared
We can confirm that $R^2 = \rho^2$:
In [ ]:
np.sqrt(results.rsquared)
In [ ]:
columns = ['INCOME2', '_VEGESU1']
brfss[columns].corr()
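For context, $R^2$ can also be computed directly as the fraction of variance explained, $1 - SS_{res}/SS_{tot}$. Here's a self-contained check of that identity on made-up data:

```python
import numpy as np
from scipy.stats import linregress

# Made-up data for illustration only
rng = np.random.default_rng(42)
x = rng.normal(size=500)
y = 2 * x + rng.normal(size=500)

res = linregress(x, y)
pred = res.intercept + res.slope * x

# R^2 as 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2, res.rvalue ** 2)
```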
Exercise: Run this regression in the other direction and confirm that you get the same estimated slope we got from linregress. Also confirm that $R^2$ is the same in either direction (which we know because correlation is the same in either direction).
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
%time gss = pd.read_hdf('gss.hdf5', 'gss')
gss.shape
In [ ]:
gss.head()
In [ ]:
gss.describe()
Let's explore the relationship between income and education, starting with simple regression:
In [ ]:
model = smf.ols('realinc ~ educ', data=gss)
model
In [ ]:
results = model.fit()
results.params
It looks like people with more education have higher incomes, about $3586 per additional year of education.
Now that we are using StatsModels, it is easy to add explanatory variables. For example, we can add age to the model like this.
In [ ]:
model = smf.ols('realinc ~ educ + age', data=gss)
results = model.fit()
results.params
It looks like the effect of age is small, and adding it to the model has only a small effect on the estimated parameter for education.
But it's possible we are getting fooled by a nonlinear relationship. To see what the age effect looks like, I'll group by age and plot the mean income in each age group.
In [ ]:
grouped = gss.groupby('age')
grouped
In [ ]:
mean_income_by_age = grouped['realinc'].mean()
plt.plot(mean_income_by_age, 'o', alpha=0.5)
plt.xlabel('Age (years)')
plt.ylabel('Income (1986 $)');
Yeah, that looks like a nonlinear effect.
We can model it by adding a quadratic term to the model.
In [ ]:
gss['age2'] = gss['age']**2
In [ ]:
model = smf.ols('realinc ~ educ + age + age2', data=gss)
results = model.fit()
results.summary()
Now the coefficient associated with age is substantially larger. And the coefficient of the quadratic term is negative, which is consistent with the observation that the relationship has downward curvature.
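One consequence of the quadratic model is that predicted income, as a function of age, is a parabola that peaks where the derivative β_age + 2·β_age2·age is zero. Here's a sketch with hypothetical coefficient values; in the notebook, the actual values come from results.params:

```python
# Hypothetical coefficients for the linear and quadratic age terms;
# the actual values are results.params['age'] and results.params['age2'].
beta_age = 1748.0    # made-up value for illustration
beta_age2 = -17.0    # made-up value for illustration

# The parabola peaks where d/d(age) = beta_age + 2 * beta_age2 * age = 0
peak_age = -beta_age / (2 * beta_age2)
print(peak_age)
```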
Exercise: To see what the relationship between income and education looks like, group the dataset by educ and plot mean income at each education level.
In [ ]:
# Solution goes here
Exercise: Maybe the relationship with education is nonlinear, too. Add a quadratic term for educ to the model and summarize the results.
In [ ]:
gss['educ2'] = gss['educ']**2
In [ ]:
model = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss)
results = model.fit()
results.summary()
The parameters of a non-linear model can be hard to interpret, but maybe we don't have to. Sometimes it is easier to judge a model by its predictions rather than its parameters.
The results object provides a predict method that takes a DataFrame and uses the model to generate a prediction for each row. Here's how we can create the DataFrame:
In [ ]:
df = pd.DataFrame()
df['age'] = np.linspace(18, 85)
df['age2'] = df['age']**2
age contains equally-spaced points from 18 to 85, and age2 contains those values squared.
Now we can set educ to 12 years of education and generate predictions:
In [ ]:
plt.plot(mean_income_by_age, 'o', alpha=0.5)
df['educ'] = 12
df['educ2'] = df['educ']**2
pred12 = results.predict(df)
plt.plot(df['age'], pred12, label='High school')
plt.xlabel('Age (years)')
plt.ylabel('Income (1986 $)')
plt.legend();
This plot shows the structure of the model, which is a parabola. We also plot the data, averaged in each age group, for comparison.
Exercise: Generate the same plot, but show predictions for three levels of education: 12, 14, and 16 years.
In [ ]:
# Solution goes here
In [ ]:
formula = 'realinc ~ educ + educ2 + age + age2 + C(sex)'
results = smf.ols(formula, data=gss).fit()
results.params
The estimated parameter indicates that sex=2, which represents women, is associated with about \$4150 lower income, after controlling for age and education.
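Behind the scenes, C(sex) tells the formula language to treat sex as categorical and encode it with treatment (dummy) coding: the first level becomes the baseline, and each other level gets an indicator column. pandas' get_dummies does something analogous; here's a minimal sketch on a toy column standing in for gss['sex']:

```python
import pandas as pd

# Toy column standing in for gss['sex'] (1 = male, 2 = female)
toy = pd.DataFrame({'sex': [1, 2, 2, 1, 2]})

# drop_first=True mimics treatment coding: sex=1 is the baseline,
# and the remaining column indicates sex=2
dummies = pd.get_dummies(toy['sex'], prefix='sex', drop_first=True)
print(dummies)
```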
Exercise: Use groupby to group respondents by educ, then plot mean realinc for each education level.
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
Exercise: Make a DataFrame with a range of values for educ and constant age=30. Compute age2 and educ2 accordingly.
Use this DataFrame to generate predictions for each level of education, holding age constant. Generate and plot separate predictions for men and women.
Also plot the data for comparison.
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
Let's use logistic regression to see what factors are associated with support for gun control. The variable we'll use is gunlaw, which represents the response to this question: "Would you favor or oppose a law which would require a person to obtain a police permit before he or she could buy a gun?"
Here are the values.
In [ ]:
gss['gunlaw'].value_counts()
1 means yes, 2 means no, 0 means the question wasn't asked; 8 and 9 mean the respondent doesn't know or refused to answer.
First I'll replace 0, 8, and 9 with NaN.
In [ ]:
gss['gunlaw'] = gss['gunlaw'].replace([0, 8, 9], np.nan)
In order to put gunlaw on the left side of a regression, we have to recode it so 0 means no and 1 means yes.
In [ ]:
gss['gunlaw'] = gss['gunlaw'].replace(2, 0)
Here's what it looks like after recoding.
In [ ]:
gss['gunlaw'].value_counts()
Now we can run the logistic regression model.
In [ ]:
results = smf.logit('gunlaw ~ age + age2 + educ + educ2 + C(sex)', data=gss).fit()
Here are the results.
In [ ]:
results.summary()
Here are the parameters. The coefficient of sex=2 is positive, which indicates that women are more likely to support gun control, at least for this question.
In [ ]:
results.params
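Logistic regression coefficients are in log-odds units; exponentiating a coefficient turns it into an odds ratio, which is often easier to read. Here's a sketch with a hypothetical coefficient value; the actual value is in results.params:

```python
import numpy as np

# Hypothetical coefficient for sex=2 from a logistic regression
# (made-up log-odds value for illustration; see results.params for the real one)
coef = 0.75

# Exponentiating a log-odds coefficient gives an odds ratio: here the
# odds of favoring gun control would be about 2.1 times higher for sex=2
odds_ratio = np.exp(coef)
print(round(odds_ratio, 2))
```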
The other parameters are not easy to interpret, but again we can use the regression results to generate predictions, which makes it possible to visualize the model.
I'll make a DataFrame with a range of ages and a fixed level of education, and generate predictions for men and women.
In [ ]:
grouped = gss.groupby('age')
favor_by_age = grouped['gunlaw'].mean()
plt.plot(favor_by_age, 'o', alpha=0.5)
df = pd.DataFrame()
df['age'] = np.linspace(18, 89)
df['educ'] = 12
df['age2'] = df['age']**2
df['educ2'] = df['educ']**2
df['sex'] = 1
pred = results.predict(df)
plt.plot(df['age'], pred, label='Male')
df['sex'] = 2
pred = results.predict(df)
plt.plot(df['age'], pred, label='Female')
plt.xlabel('Age')
plt.ylabel('Probability of favoring gun law')
plt.legend();
Over the range of ages, women are more likely to support gun control than men, by about 15 percentage points.
Exercise: Generate a similar plot as a function of education, with constant age=40.
In [ ]:
# Solution goes here
Exercise: Use the variable grass to explore support for legalizing marijuana. This variable records the response to this question: "Do you think the use of marijuana should be made legal or not?"
Recode grass for use with logistic regression.
Run a regression model with age, education, and sex as explanatory variables.
Use the model to generate predictions for a range of ages, with education held constant, and plot the predictions for men and women. Also plot the mean level of support in each age group.
Use the model to generate predictions for a range of education levels, with age held constant, and plot the predictions for men and women. Also plot the mean level of support at each education level.
Note: This last graph might not look like a parabola. Why not?
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here
In [ ]:
# Solution goes here