Analysis based on data from the CIRP Freshman Survey

Copyright Allen Downey

```
In [1]:
```%matplotlib inline
#import warnings
#warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import thinkbayes2
import thinkplot
import statsmodels.formula.api as smf

```
In [2]:
```df = pd.read_csv('heri17.csv', skiprows=2, index_col='year')
df[df.columns] /= 10
df.tail()

```
Out[2]:
```

```
In [3]:
```df['time'] = df.index - 1966
df['time2'] = df.time**2

```
In [4]:
```def MakeErrorModel(df, y, formula, n=100):
"""Makes a model that captures sample error and residual error.
df: DataFrame
y: Series
formula: string representation of the regression model
n: number of simulations to run
returns: (fittedvalues, sample_error, total_error)
"""
# make the best fit
df['y'] = y
results = smf.ols(formula, data=df).fit()
fittedvalues = results.fittedvalues
resid = results.resid
# permute residuals and generate hypothetical fits
fits = []
for i in range(n):
df['y'] = fittedvalues + np.random.permutation(results.resid)
fake_results = smf.ols(formula, data=df).fit()
fits.append(fake_results.fittedvalues)
# compute the variance of the fits
fits = np.array(fits)
sample_var = fits.var(axis=0)
# add sample_var and the variance of the residuals
total_var = sample_var + resid.var()
# standard errors are square roots of the variances
return fittedvalues, np.sqrt(sample_var), np.sqrt(total_var)

Plot a region showing a confidence interval.

```
In [5]:
```def FillBetween(fittedvalues, stderr, **options):
"""Fills in the 95% confidence interval.
fittedvalues: series
stderr: standard error
"""
low = fittedvalues - 2 * stderr
high = fittedvalues + 2 * stderr
thinkplot.FillBetween(fittedvalues.index, low, high, **options)

```
In [6]:
```def PlotModel(y, fittedvalues, sample_error, total_error, **options):
"""Plots confidence intervals and the actual data.
y: Series of actual data
fittedvalues: Series of fitted values
sample_error: Series of standard errors due to random sampling
total_error: Series representing total error due to sampling and random variation
options: dictional of options used to plot the data
"""
FillBetween(fittedvalues, total_error, color='0.9')
FillBetween(fittedvalues, sample_error, color='0.7')
thinkplot.Plot(fittedvalues, color='0.5')
thinkplot.Plot(y, **options)

```
In [7]:
```def Plot(df, y, formula, **options):
"""Run a model and plot the results.
df: DataFrame
y: Series of actual data
formula: Patsy string for the regression model
options: dictional of options used to plot the data
"""
fittedvalues, sample_error, total_error = MakeErrorModel(df, y, formula)
PlotModel(y, fittedvalues, sample_error, total_error, **options)
thinkplot.Config(xlim=[1965, 2017])

Seaborn provides aesthetic colors and graphical style.

```
In [8]:
```import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('talk', font_scale=1.3)
current_palette = sns.color_palette()
sns.palplot(current_palette)
BLUE, GREEN, RED, PURPLE, YELLOW, SKY = current_palette

```
```

Plot the fraction of respondents with no religious preference along with a quadratic model.

```
In [9]:
```formula = 'y ~ time + time2'
y = df.noneall
Plot(df, y, formula, color=BLUE, alpha=1)
thinkplot.Config(title='No religious preference',
ylabel='Percent', loc='upper left', ylim=[0, 33])
thinkplot.Save(root='heri17.1', clf=False, formats=['png'])

```
```

```
In [10]:
```ps = df.noneall / 100
odds = ps / (1-ps)
log_odds = np.log(odds)
log_odds
Plot(df, log_odds, formula, color=BLUE, label='None')
thinkplot.Config(ylabel='Log odds')

```
```

```
In [11]:
```attend = df.attendedall
Plot(df, attend, formula, color=GREEN, alpha=1)
thinkplot.Config(title='Attendance at religious services', ylabel='Percent',
ylim=[60,100])
thinkplot.Save(root='heri17.3', clf=False, formats=['png'])

```
```

Plot the gender gap along with a quadratic model.

```
In [12]:
```diff = df.nonemen - df.nonewomen
diff = diff.loc[1973:]
Plot(df, diff, formula, color=PURPLE, alpha=1)
thinkplot.Config(title='Gender gap', ylabel='Difference (percentage points)')
thinkplot.Save(root='heri17.2', clf=False, formats=['png'])

```
```

```
In [13]:
```diff = df.nonemen - df.nonewomen
diff = diff.loc[1986:]
Plot(df, diff, formula, color=PURPLE, label='Gender gap')
thinkplot.Config(ylabel='Difference (percentage points)')

```
```

A linear model for the most recent data suggests that the gap is probably still growing.

```
In [14]:
```diff = df.nonemen - df.nonewomen
diff = diff.loc[1986:]
Plot(df, diff, 'y ~ time', color=PURPLE, label='Gender gap')
thinkplot.Config(ylabel='Difference (percentage points)')

```
```

```
In [ ]:
```