THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
In random effects models, we are fitting a model that includes panel data, either multiple observations for an individual, or multiple observations like a group.
Why do we use a different approach in this case? Groups and individuals are great examples of cases where the linear regression assumption of "independence of observations" does not apply. Imagine if we have observations of students from several schools. Observations of students(level 1) are not all independent from each other; we can assume that some of the variation in the observations comes from unobserved variables that students share within the same school, and that school experiences differ from each other at the group level(level 2) in ways that aren't observed in our dataset. Multilevel models allow us to account for the variation between individuals and also the variation between groups.
Another use of Multilevel models is to model change, where we have observations from many individuals over time and we want to identify change over time. The individual observed events are grouped by the person in question.
The dataset used here is a classic pedagogical dataset, from the High School and Beyond study by the National Center for Education Statistics, which followed high school students starting in 1980, continuing through 1982, 1984, 1986, and 1992. The High School and Beyond study has its own wikipedia page, which includes 48 published studies based on the data.
This example is drawn from the work of Andrew Ho and John Willett, from class examples in the S-052 class at the Harvard Graduate School of Education. It also roughly follows the course of Chapters 3 and 4 of Singer, Judith D., Willett, John B. (2003) Applied Longitudinal Data Analysis: Modeling Change and Event Occurrence. Oxford University Press.
In this study, we want to know if catholic schools and public schools (and individual students in those schools) differ in their math achievement, when controlling for SES. In order to answer this question, we turn to a random effects model, which assumes that:
It's this final assumption that when satisfied allows us to make claims about the population of groups, and not just the groups represented in this dataset. The population model for a random effects model is:
$$y_{ij} = \beta_{0} + \beta_{1}X_{ij} + u_{j} + \epsilon_{ij}$$$$u_{j} \space \widetilde\space \space i.i.d. N(0, \sigma^{2}_{u})$$$$\epsilon_{ij} \space \widetilde\space \space i.i.d. N(0, \sigma^{2}_{\epsilon})$$
In [97]:
# THINGS TO IMPORT
# This is a baseline set of libraries I import by default if I'm rushed for time.
%matplotlib inline
import codecs # load UTF-8 Content
import json # load JSON files
import pandas as pd # Pandas handles dataframes
import numpy as np # Numpy handles lots of basic maths operations
import matplotlib.pyplot as plt # Matplotlib for plotting
import seaborn as sns # Seaborn for beautiful plots
from dateutil import * # I prefer dateutil for parsing dates
import math # transformations
import statsmodels.formula.api as smf # for doing statistical regression
import statsmodels.api as sm # access to the wider statsmodels library, including R datasets
from collections import Counter # Counter is useful for grouping and counting
import scipy
from patsy import dmatrices
In [2]:
# High School and Beyond Dataset
# https://nces.ed.gov/surveys/hsb/
import urllib2
import os.path
if(os.path.isfile("hsb.dta")!=True):
response = urllib2.urlopen("http://www.stata-press.com/data/mlmus3/hsb.dta")
if(response.getcode()==200):
f = open("hsb.dta","w")
f.write(response.read())
f.close()
hsb_df = pd.read_stata("hsb.dta")
print hsb_df[['mathach','ses']].describe()
print
print "CROSSTAB"
print pd.crosstab(hsb_df['sector'], [hsb_df['female'],hsb_df['minority']])
Multilevel models make sense in cases where we might expect there to be variation between groups (or in the time case, variation/differences between individuals across multible observations).
One way to explore within and between group variation is to do boxplots of the outcome by group. When looking at the first plot, we try to gain an intuitive sense of how much the outcome varies by group and how much it varies within groups. In this case, it's not obvious that there are many differences between groups, since so many of the error bars overlap, so we'll have to find another way to assert that difference.
In the second plot, we show the de-meaned math achievement, which allows us to look at the variation within schools, next to each other.
Note that the Pandas boxplot method only shows us the median line, which is why there's some jitter in the second plot. (Matplotlib apparently allows us to specify the mean with meanline=True, but I couldn't get the argument to pass through from Pandas.)
In [21]:
#generate de-meaned mathach
sgp = school_gp.to_dict()
def school_mathach(f):
return float(f.mathach) - sgp['mathach'][f.schoolid]
hsb_df['school_mathach'] = hsb_df.apply(school_mathach, 1)
#make the Side-by-Side Boxplot
fig = plt.figure(num=None, figsize=(8, 20), dpi=80, edgecolor='k')
ax = fig.add_subplot(121)
hsb_df.boxplot("mathach", by="schoolid", ax=ax, vert=False)
plt.title("School Math Achievement", fontsize="16")
ax2 = fig.add_subplot(122)
hsb_df.boxplot("school_mathach", by="schoolid", ax=ax2, vert=False)
plt.title("De-Meaned School Math Achievement", fontsize="16")
plt.show()
Another way to look at the group variation, let's compare two basic linear models, one of individual students, one with school means. In the following linear models, we see that there is there is indeed a relationship between SES and math achivement between school mean math achievement. However, just as the student-level model doesn't explain the group-level variation between schools, the school-level model doesn't explain the individual-level variation within schools. This is especially evident at high and low levels of SES or math achievement, where the group level model doesn't extend.
In [3]:
result = smf.ols(formula = "mathach ~ ses",
data = hsb_df).fit()
print "==========================================================="
print "MODEL 1: Regressing Student Math Achievement on Student SES"
print result.summary()
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(hsb_df.ses, hsb_df.mathach, marker=".", color="c")
student_line, = plt.plot(hsb_df['ses'], result.predict(), "-", color="c")
#plt.title("Predicting Math Achievement from SES Across all 7185 students", fontsize="16")
school_gp = hsb_df.groupby("schoolid").aggregate(np.mean)
result = smf.ols(formula = "mathach ~ ses",
data = school_gp).fit()
print
print "==================================================================="
print "MODEL 2: Regressing Mean School Math Achievement on Mean School SES"
print result.summary()
#plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(school_gp.ses, school_gp.mathach, marker=".", color="r")
school_line, = plt.plot(school_gp.ses, result.predict(), "-", color="r")
plt.title("Predicting Math Achievement Scores from SES with Linear Regression", fontsize="16")
plt.legend([student_line, school_line], ['All Students', 'School Means'], fontsize="14")
plt.show()
In the fixed effects model, we add $u_{j}$ to the model, denoting the group level variance, absorbing all the variance between groups in order to better estimate the variance within groups.
$$y_{ij} = \beta_{0} + \beta_{1}X_{ij} + \mathbf{u_{j}} + \epsilon_{ij}$$$$\epsilon_{ij} \space \widetilde\space \space i.i.d. N(0, \sigma^{2}_{\epsilon})$$In practice, we can do this in several equivalent ways. I show two here:
In both cases, we effectively remove the variation between groups from the model. The resulting SES or demeaned_ses predictor models the within-school variation between students.
In the following models, notice how the slope for ses is the same as the slope for demeaned_ses in the two models.
In [31]:
# calculate the demeaned_ses for each student
def demeaned_ses(f):
return f.ses - school_gp.to_dict()['ses'][f['schoolid']]
# add the school mean SES to the dataframe for each student
def schoolmean_ses(f):
return school_gp.to_dict()['ses'][f['schoolid']]
hsb_df['demeaned_ses'] = hsb_df.apply(demeaned_ses, axis=1)
hsb_df['schoolmean_ses'] = hsb_df.apply(schoolmean_ses, axis=1)
In [49]:
result_school_covary = smf.ols(formula = "mathach ~ ses + schoolmean_ses",
data = hsb_df).fit()
print "MODEL: Regressing Student Math Achievement on De-meaned Student SES"
print result_school_covary.params
result = smf.ols(formula = "mathach ~ demeaned_ses",
data = hsb_df).fit()
print
print "MODEL: Regressing Student Math Achievement on De-meaned Student SES"
print result.params
print
print "Notice how the slope for *ses* is the same as the slope for *demeaned_ses* in the two models"
print
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(hsb_df.demeaned_ses, hsb_df.mathach, marker=".", color="darkgrey")
student_line, = plt.plot(hsb_df['demeaned_ses'], result.predict(), "-", color="darkgrey")
plt.title("Predicting Math Achievement Scores from De-meaned SES", fontsize="16")
plt.xlabel("De-meaned Socio-Economic Status", fontsize="14")
plt.ylabel("Math Achivement", fontsize="14")
plt.show()
We can plot the fixed effects model on a school-level basis using the model that uses the school mean as a covariate and then plotting the model for prototypical values of individual schools.
It's important to note however that this model makes no claims about the population of all schools -- it just models the relationship between student math achievement and student SES, holding constant the variation in SES between schools.
In [240]:
# highlight the maximum, and minimum
max_school = school_gp[school_gp['ses'] == school_gp.ses.max()].index[0]
min_school = school_gp[school_gp['ses'] == school_gp.ses.min()].index[0]
hsb_df['fixed_preds'] = result_school_covary.predict()
plt.figure(num=None, figsize=(12, 6), dpi=80, edgecolor='k')
for schoolid in hsb_df.schoolid.unique():
if(schoolid!=max_school and schoolid!=min_school):
plt.scatter(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].mathach, marker=".", color="lightgrey")
for schoolid in hsb_df.schoolid.unique():
if(schoolid == max_school):
plt.scatter(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].mathach, marker=".", color="r")
maxline, = plt.plot(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].fixed_preds, "-", color="r")
elif(schoolid == min_school):
plt.scatter(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].mathach, marker=".", color="b")
minline, = plt.plot(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].fixed_preds, "-", color="b")
plt.legend([maxline, minline], ['School with Max SES', 'School with Min SES'], fontsize="12")
plt.title("Fixed Effects Model Predicting Math Achievement Scores from SES & School Mean SES", fontsize="16")
plt.xlabel("Socio-Economic Status", fontsize="14")
plt.ylabel("Math Achivement", fontsize="14")
plt.show()
A random effects model makes the assumption that the variance attributable to groups is normal in the population (and centered on 0), allowing us to make claims about groups, and not just individual observations. In the school example, this would allow us to make claims about schools as well as the individuals within them. In a timeseries example, this is equally important, because it's only in the random effects world that we would be able to make claims about the population of things that we're observing over time. Correspondingly, our population model for the random effects model, although it has the same terms of the fixed effects model, has an additional assumption about $u_{j}$:
$$y_{ij} = \beta_{0} + \beta_{1}X_{ij} + u_{j} + \epsilon_{ij}$$$$\mathbf{u_{j} \space \widetilde\space \space i.i.d. N(0, \sigma^{2}_{u})}$$$$\epsilon_{ij} \space \widetilde\space \space i.i.d. N(0, \sigma^{2}_{\epsilon})$$
In [289]:
##http://statsmodels.sourceforge.net/devel/mixed_linear.html
md = smf.mixedlm("mathach ~ ses", data=hsb_df, groups=hsb_df["schoolid"])
result = md.fit()
print result.summary()
In the above results, the ses coefficient of 2.390 is:
In [298]:
plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
result = smf.ols(formula = "mathach ~ ses",
data = hsb_df).fit()
print "MODEL 1: Regressing Student Math Achievement on Student SES"
plt.scatter(hsb_df.ses, hsb_df.mathach, marker=".", color="c")
student_line, = plt.plot(hsb_df['ses'], result.predict(), "-", color="c")
school_gp = hsb_df.groupby("schoolid").aggregate(np.mean)
result = smf.ols(formula = "mathach ~ ses",
data = school_gp).fit()
print result.summary()
print
print "MODEL 2: Regressing Mean School Math Achievement on Mean School SES"
print result.summary()
#plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
plt.scatter(school_gp.ses, school_gp.mathach, marker=".", color="r")
school_line, = plt.plot(school_gp.ses, result.predict(), "-", color="r")
result = smf.ols(formula = "mathach ~ demeaned_ses",
data = hsb_df).fit()
print "MODEL 3: Regressing Student Math Achievement on De-meaned Student SES"
print result.summary()
#plt.figure(num=None, figsize=(12, 6), dpi=80, facecolor='w', edgecolor='k')
demeaned_line, = plt.plot(hsb_df['demeaned_ses'], result.predict(), "-", color="darkgrey")
print
print "MODEL 4: Regressing Student Math Achievement on Student SES Grouped by School in a Random Effects Model"
md = smf.mixedlm("mathach ~ ses", data=hsb_df, groups=hsb_df["schoolid"])
result = md.fit()
print result.summary()
def predict(x, key, result):
return result.params.Intercept + result.params['ses']*x
ses = np.linspace(hsb_df.ses.min(), hsb_df.ses.max(), 100)
preds = [predict(x, 'ses',result) for x in ses]
multi_line, = plt.plot(ses, preds, "-", color="m")
plt.title("Predicting Math Achievement Scores from SES (schools=160) (students=7185)", fontsize="16")
plt.legend([student_line, school_line, multi_line, demeaned_line], ['All Students (Total)', 'School Means (Between)', "Random Effects", "De-Meaned (within group, Fixed)"])
plt.show()
In a random effects model, The intraclass correlation can be interpreted as:
Intraclass correlation is apparently still being debated, as outlined in the Wikipedia page for intraclass correlation, and some people avoid this measure entirely.
The statsmodels MixedLM model doesn't include within it any analysis of residuals, so of we want to consider the intraclass correlation in the model, we have to do it ourselves. I've written a method to collect the individual and group residuals.
Note that I think the calculations presented here are correct, but I have only run them against a single test case, so you may want to doublecheck my work before lifting this code.
In [357]:
##http://statsmodels.sourceforge.net/devel/mixed_linear.html
md = smf.mixedlm("mathach ~ ses", data=hsb_df, groups=hsb_df["schoolid"])
result = md.fit()
print result.summary()
#store the model results to a variable
models = {}
m = "Model1"
models[m] = {}
models[m]['result'] = result
def individual_residuals(f):
observed_individual = f.mathach
predicted_individual = result.params.Intercept + result.params['ses']*f.ses
return observed_individual - predicted_individual
def group_residuals(f):
observed_group = school_gp.to_dict()['mathach'][f.schoolid]
predicted_group = result.params.Intercept + result.params['ses']*f.schoolmean_ses
return predicted_group - observed_group
group_count = school_gp.count()[0]
indiv_count = hsb_df.count()[0]
resid_u = hsb_df.apply(group_residuals, 1)
models[m]["sigma_u"] = np.std(resid_u)
models[m]["sigma_u_err"] = models[m]["sigma_u"]/math.sqrt(group_count)
resid_e = hsb_df.apply(individual_residuals, 1)
models[m]["sigma_e"] = np.std(resid_e)
models[m]["sigma_e_err"] = models[m]["sigma_e"]/math.sqrt(indiv_count)
models[m]["icc"] = math.pow(models[m]["sigma_u"],2)/(math.pow(models[m]["sigma_u"],2) + math.pow(models[m]["sigma_e"],2))
models[m]["icc_err"] = icc/math.sqrt(group_count)
print " stdev stderr"
print "sigma_u (between group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_u"],
'e':models[m]["sigma_u_err"]}
print "sigma_e (within group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_e"],
'e':models[m]["sigma_e_err"]}
print "intraclass correlation: %(i).04f %(e).04f" % {'i':models[m]["icc"],
'e':models[m]["icc_err"]}
print
print "Z-Test of intraclass correlation:"
print " H0: icc = 0 in the population"
print " test-statistic: z=icc/SE(icc)"
print " decision rule: z>z_crit"
print " critical value: 1.96"
print " z = %(z).04f" %{'z':models[m]["icc"] /models[m]["icc_err"]}
In this case, we see that there is a low intraclass correlation, suggesting that most of the variation in math achievement scores is within schools, but that there is a significant difference between the math achievement of schools on average in the model as well (as indicated by the Z test).
If we are interested in differences between the proportion of between-group or within group variance accounted for by competing models, we can generate a pseudo-$R^{2}$ measure by comparing the between and within group variances to those of a baseline model.
In [358]:
# now generate the baseline model
md = smf.mixedlm("mathach ~ 1", data=hsb_df, groups=hsb_df["schoolid"])
result = md.fit()
print result.summary()
def individual_residuals(f):
observed_individual = f.mathach
predicted_individual = result.params.Intercept
return observed_individual - predicted_individual
def group_residuals(f):
observed_group = school_gp.to_dict()['mathach'][f.schoolid]
predicted_group = result.params.Intercept
return predicted_group - observed_group
group_count = school_gp.count()[0]
indiv_count = hsb_df.count()[0]
m = "Model0"
models[m] = {}
models[m]['result'] = result
resid_u = hsb_df.apply(group_residuals, 1)
models[m]["sigma_u"] = np.std(resid_u)
models[m]["sigma_u_err"] = models[m]["sigma_u"]/math.sqrt(group_count)
resid_e = hsb_df.apply(individual_residuals, 1)
models[m]["sigma_e"] = np.std(resid_e)
models[m]["sigma_e_err"] = models[m]["sigma_e"]/math.sqrt(indiv_count)
models[m]["icc"] = math.pow(models[m]["sigma_u"],2)/(math.pow(models[m]["sigma_u"],2) + math.pow(models[m]["sigma_e"],2))
models[m]["icc_err"] = icc/math.sqrt(group_count)
print " stdev stderr"
print "sigma_u (between group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_u"],
'e':models[m]["sigma_u_err"]}
print "sigma_e (within group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_e"],
'e':models[m]["sigma_e_err"]}
print "intraclass correlation: %(i).04f %(e).04f" % {'i':models[m]["icc"],
'e':models[m]["icc_err"]}
print
print "Z-Test of intraclass correlation:"
print " H0: icc = 0 in the population"
print " test-statistic: z=icc/SE(icc)"
print " decision rule: z>z_crit"
print " critical value: 1.96"
print " z = %(z).04f" %{'z':models[m]["icc"] /models[m]["icc_err"]}
In [359]:
m0 = "Model0"
m1 = "Model1"
r2_u = math.pow(models[m0]['sigma_u'], 2) - math.pow(models[m1]['sigma_u'], 2)/math.pow(models[m0]['sigma_u'], 2)
print "Pseudo R^2 for group variation: %(r).03f%%" % {'r':r2_u}
r2_e = math.pow(models[m0]['sigma_e'], 2) - math.pow(models[m1]['sigma_e'], 2)/math.pow(models[m0]['sigma_e'], 2)
print "Pseudo R^2 for individual variation: %(r).03f%%" % {'r':r2_e}
In the above pseudo $R^{2}$ calculations, we see that our model of math achievement on SES accounts for 8.44% of the between-group variation and 46.43% of the within-group variation. This is consistent with our intraclass correlation, which shows that in the model there is much more within-group variation than betwen-group variation.
At the beginning of this example, we asked if Catholic Schools and Students in Catholic Schools have different Math Achievement from Public Schools, when Controlling For SES? To answer this question, we add another predictor, a "level-2," group-level predictor that contains information about schools rather than individual students.
In [360]:
# in this dataset, sector refers to whether the school is catholic(1) or public(0)
from patsy import dmatrices
md = smf.mixedlm("mathach ~ ses + sector", data=hsb_df, groups=hsb_df["schoolid"])
result = md.fit()
print result.summary()
def individual_residuals(f):
observed_individual = f.mathach
predicted_individual = result.params.Intercept + result.params['ses']*f.ses + result.params['sector']*f.sector
return observed_individual - predicted_individual
def group_residuals(f):
observed_group = school_gp.to_dict()['mathach'][f.schoolid]
predicted_group = result.params.Intercept + result.params['ses']*f.schoolmean_ses + result.params['sector']*f.sector
return predicted_group - observed_group
group_count = school_gp.count()[0]
indiv_count = hsb_df.count()[0]
m = "Model2"
models[m] = {}
models[m]['result'] = result
resid_u = hsb_df.apply(group_residuals, 1)
models[m]["sigma_u"] = np.std(resid_u)
models[m]["sigma_u_err"] = models[m]["sigma_u"]/math.sqrt(group_count)
resid_e = hsb_df.apply(individual_residuals, 1)
models[m]["sigma_e"] = np.std(resid_e)
models[m]["sigma_e_err"] = models[m]["sigma_e"]/math.sqrt(indiv_count)
models[m]["icc"] = math.pow(models[m]["sigma_u"],2)/(math.pow(models[m]["sigma_u"],2) + math.pow(models[m]["sigma_e"],2))
models[m]["icc_err"] = icc/math.sqrt(group_count)
print " stdev stderr"
print "sigma_u (between group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_u"],
'e':models[m]["sigma_u_err"]}
print "sigma_e (within group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_e"],
'e':models[m]["sigma_e_err"]}
print "intraclass correlation: %(i).04f %(e).04f" % {'i':models[m]["icc"],
'e':models[m]["icc_err"]}
print
print "Z-Test of intraclass correlation:"
print " H0: icc = 0 in the population"
print " test-statistic: z=icc/SE(icc)"
print " decision rule: z>z_crit"
print " critical value: 1.96"
print " z = %(z).04f" %{'z':models[m]["icc"] /models[m]["icc_err"]}
print
m0 = "Model0"
m1 = "Model2"
r2_u = math.pow(models[m0]['sigma_u'], 2) - math.pow(models[m1]['sigma_u'], 2)/math.pow(models[m0]['sigma_u'], 2)
print "Pseudo R^2 for group variation: %(r).03f%%" % {'r':r2_u}
r2_e = math.pow(models[m0]['sigma_e'], 2) - math.pow(models[m1]['sigma_e'], 2)/math.pow(models[m0]['sigma_e'], 2)
print "Pseudo R^2 for individual variation: %(r).03f%%" % {'r':r2_e}
In [361]:
# in this dataset, sector refers to whether the school is catholic(1) or public(0)
from patsy import dmatrices
md = smf.mixedlm("mathach ~ ses + sector + sector:ses", data=hsb_df, groups=hsb_df["schoolid"])
result = md.fit()
print result.summary()
def individual_residuals(f):
observed_individual = f.mathach
predicted_individual = result.params.Intercept + result.params['ses']*f.ses + result.params['sector']*f.sector + result.params['sector:ses']*f.sector*f.ses
return observed_individual - predicted_individual
def group_residuals(f):
observed_group = school_gp.to_dict()['mathach'][f.schoolid]
predicted_group = result.params.Intercept + result.params['ses']*f.schoolmean_ses + result.params['sector']*f.sector + result.params['sector:ses']*f.sector*f.ses
return predicted_group - observed_group
group_count = school_gp.count()[0]
indiv_count = hsb_df.count()[0]
m = "Model3"
models[m] = {}
models[m]['result'] = result
resid_u = hsb_df.apply(group_residuals, 1)
models[m]["sigma_u"] = np.std(resid_u)
models[m]["sigma_u_err"] = models[m]["sigma_u"]/math.sqrt(group_count)
resid_e = hsb_df.apply(individual_residuals, 1)
models[m]["sigma_e"] = np.std(resid_e)
models[m]["sigma_e_err"] = models[m]["sigma_e"]/math.sqrt(indiv_count)
models[m]["icc"] = math.pow(models[m]["sigma_u"],2)/(math.pow(models[m]["sigma_u"],2) + math.pow(models[m]["sigma_e"],2))
models[m]["icc_err"] = icc/math.sqrt(group_count)
print " stdev stderr"
print "sigma_u (between group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_u"],
'e':models[m]["sigma_u_err"]}
print "sigma_e (within group variation): %(s).04f %(e).04f" % {'s':models[m]["sigma_e"],
'e':models[m]["sigma_e_err"]}
print "intraclass correlation: %(i).04f %(e).04f" % {'i':models[m]["icc"],
'e':models[m]["icc_err"]}
print
print "Z-Test of intraclass correlation:"
print " H0: icc = 0 in the population"
print " test-statistic: z=icc/SE(icc)"
print " decision rule: z>z_crit"
print " critical value: 1.96"
print " z = %(z).04f" %{'z':models[m]["icc"] /models[m]["icc_err"]}
print
m0 = "Model0"
m1 = "Model3"
r2_u = math.pow(models[m0]['sigma_u'], 2) - math.pow(models[m1]['sigma_u'], 2)/math.pow(models[m0]['sigma_u'], 2)
print "Pseudo R^2 for group variation: %(r).02f%%" % {'r':r2_u}
r2_e = math.pow(models[m0]['sigma_e'], 2) - math.pow(models[m1]['sigma_e'], 2)/math.pow(models[m0]['sigma_e'], 2)
print "Pseudo R^2 for individual variation: %(r).02f%%" % {'r':r2_e}
In [400]:
#step one: find prototypical values of a catholic and a public school with an SES of 0.
school_gp['p_abs_ses']=school_gp[np.isclose(school_gp.sector, 0.)].ses.map(lambda x: abs(x))
school_gp['c_abs_ses']=school_gp[np.isclose(school_gp.sector, 1.)].ses.map(lambda x: abs(x))
#public school with SES closest to 0: 1946
print school_gp[(np.isclose(school_gp.p_abs_ses,school_gp.p_abs_ses.min())) & (np.isclose(school_gp.sector, 0.))].ses
#catholic school with SES closest to 0: 5650
print school_gp[(np.isclose(school_gp.c_abs_ses,school_gp.c_abs_ses.min())) & (np.isclose(school_gp.sector, 1.))].ses
p_school = 1946
c_school = 5650
def predict(f):
return result.params.Intercept + result.params['ses']*f.ses + result.params['sector']*f.sector + result.params['sector:ses']*f.sector*f.ses
hsb_df['interaction_preds'] = hsb_df.apply(predict, 1)
plt.figure(num=None, figsize=(12, 6), dpi=80, edgecolor='k')
# PLOT A PREDICTION OF INDIVIDUAL MATH ACHIEVEMENT SCORES
# FOR TWO SCHOOLS
for schoolid in hsb_df.schoolid.unique():
if(schoolid!=max_school and schoolid!=min_school):
plt.scatter(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].mathach, marker=".", color="lightgrey")
for schoolid in hsb_df.schoolid.unique():
if(schoolid == p_school):
plt.scatter(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].mathach, marker=".", color="r")
p_line, = plt.plot(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].interaction_preds, "-", color="r")
elif(schoolid == c_school):
plt.scatter(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].mathach, marker=".", color="b")
c_line, = plt.plot(hsb_df[hsb_df.schoolid == schoolid].ses, hsb_df[hsb_df.schoolid == schoolid].interaction_preds, "-", color="b")
plt.legend([c_line, p_line], ['Students in a Catholic School with Mean SES', 'Students in a Public School with Mean SES'], fontsize="12")
plt.suptitle("Predicting Individual Math Achievement Scores from SES & Sector", fontsize="16")
plt.title("in a Multi-Level Random Effects Model, where SES=0", fontsize="16")
plt.xlabel("Socio-Economic Status", fontsize="14")
plt.ylabel("Math Achivement", fontsize="14")
plt.show()
In [403]:
# PLOT SCHOOL MEAN CATHOLIC AND PUBLIC SCHOOL MATH ACHIEVEMENT
plt.figure(num=None, figsize=(12, 6), dpi=80, edgecolor='k')
plt.scatter(hsb_df.ses, hsb_df.mathach, marker=".", color="lightgrey")
plt.scatter(school_gp[school_gp.sector==0.].ses, school_gp[school_gp.sector==0.].mathach, color="r")
plt.scatter(school_gp[school_gp.sector==1.].ses, school_gp[school_gp.sector==1.].mathach, color="b")
school_gp['interaction_preds'] = school_gp.apply(predict, 1)
c_line, = plt.plot(school_gp[np.isclose(school_gp.sector, 1.)].ses, school_gp[np.isclose(school_gp.sector, 1.)].interaction_preds, "-", color="b")
p_line, = plt.plot(school_gp[np.isclose(school_gp.sector, 0.)].ses, school_gp[np.isclose(school_gp.sector, 0.)].interaction_preds, "-", color="r")
plt.suptitle("Predicting School Math Achievement Scores from SES & Sector", fontsize="16")
plt.title("in a Multi-Level Random Effects Model", fontsize="16")
plt.legend([c_line, p_line], ['Catholic Schools', 'Public Schools'], fontsize="12")
plt.xlabel("Socio-Economic Status", fontsize="14")
plt.ylabel("Math Achivement", fontsize="14")
plt.show()
In the random effects model we specified, the answer is yes. We see a significant difference between catholic and public school math achievement on average in the population (a claim supported by our RE model), and that this relationship is mediated by SES. here's where we would then go on to identify prototypical values and talk about the difference between a public and catholic schools at various levels of SES, etc
In [ ]: