Mediation Analysis

Written by Jin Cheong & Luke Chang

A mediation analysis is conducted when a researcher is interested in the mechanism underlying how variable X has an effect on variable Y. It attempts to make a causal inference that a direct effect might be better explained by an indirect effect through a mediating variable.

Consider the instance below where X has an effect c on Y:

1) $Y = \beta_1 + c \cdot X $

In this model, there may be a third variable M which mediates the effect of X on Y. In other words, the variable M is partially responsible for the effect X has on Y.

To conduct a mediation analysis one estimates two additional models: 1) the effect of X on M, and 2) the effect of X and M on Y.

2) $M = \beta_2 + a \cdot X $

3) $Y = \beta_3 + c' \cdot X + b \cdot M $

Now the direct effect of X on Y, denoted as C,can be broken down into two parts:

$c = a \cdot b + c' $

$ a \cdot b $ is the indirect effect of X on Y via the mediator. $c'$ is the remaining direct effect of X on Y controlling for M.

This relationship is depicted below. Note that M and Y both also have error included in the model.

Question Why does X not have error included in the model?

Answer Because X is only a regressor not an outcome variable in the models and standard regression does not estimate error on regressors. See orthogonal regression for a technique that models error on both X and Y.

Here are a few examples of mediations. Can you think of more?

1) The effect of failure on depressed feelings is mediated by internalization of failure.

2) Effect of CBT treatment is mediated by changes in cognition.

3) Effect of food intake on weight gain is mediated by metabolism.

For more information there is a nice tutorial on mediation by David Kenny, one of the author's of the original mediation paper.

Simulate a mediation

In this section we will simulate a mediation.

This is a case in which the true effect of X on Y is positive, but appears negative without testing for mediation.



In [ ]:

    
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy import stats

def sobel_test(a, b, se_a, se_b):
    '''
    Sobel test for significance of mediation
    
    Args: 
        a: coefficient from X to mediator variable, M
        b: coefficient from M to Y 
        se_a: Standard error of A
        se_b: Standard error fo B

    Returns: 
        t: Sobel's test statistic
        pval : Two-tailed probability assuming normal distribution
    '''
    
    SE = np.sqrt( (a**2)*(se_a**2) + (b**2)*(se_b**2))
    t = (a*b) / SE
    n = 100000000
    pval = stats.t.sf(np.abs(t), n-1)*2
    return t, pval



In [ ]:

    
# set random seed so everyone gets same results
np.random.seed(1)

# Determine effects
a = -3 # effect of x to M
b = 3 # effect of M to y
cq = .1 #  effect of x on y controlling for M

# Create a random data x
x = np.random.rand(100) 
m = x * a + np.random.rand(100)

# Create Y
y = np.dot(np.array([x,m]).T,[cq,b]) + np.random.rand(100)



In [ ]:

    
plt.scatter(x,y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('X -> Y')



In [ ]:

    
plt.scatter(x,m)
plt.xlabel('X')
plt.ylabel('M')
plt.title('X -> M')



In [ ]:

    
plt.scatter(m,y)
plt.xlabel('M')
plt.ylabel('Y')
plt.title('M -> Y')

1) Test effect of X on Y



In [ ]:

    
X = pd.DataFrame({'Intercept':np.ones(len(x)),'X':x})
lm1 = smf.OLS(y,X).fit()
print lm1.summary()
ec = lm1.params[1] # save total effect c to ec

2) Test effect of X on M



In [ ]:

    
lm2 = smf.OLS(m,X).fit()
print lm2.summary()
ea = lm2.params[1] # Save the effect of X on M, a, to ea
sea = lm2.bse[1]

3) Test effect of X and M on Y



In [ ]:

    
X['M'] = m
lm3 = smf.OLS(y,X).fit()
print lm3.summary()
ecq,eb = lm3.params[1:3]
seb = lm3.bse[2]

Show how the effect is broken down to direct and indirect effects

Recall how the overall effect C was decomposed to indirect effect (a*b) and direct effect (c')

$c = a \cdot b + c' $



In [ ]:

    
print('c : %.2f') % ec
print('a : %.2f') % ea
print('b : %.2f') % eb
print('c\' : %.2f') % ecq



In [ ]:

    
print('Total effect C: %.2f') % ec
print('is decomposed into the indirect(mediated) effect a*b: %.2f') % (ea*eb)
print('plus the direct effect c\': %.2f') % ecq
print('which adds up to %.2f') % (ea*eb+ecq)

Run a Sobel Test for Significance of Mediation

One way to test the significance of a mediation is to perform a Sobel test, where the indirect effect(a*b) is divided by an estimated standard error of the two. This assumes that the product would be normally distributed which may not always be the case.

An alternative method is to bootstrap with replacement on the observed data to generate a 95% confidence interval. You can try this by writing a for-loop that resamples from the data and generate a distribution of the indirect effects(a*b). If the confidence interval does not include 0, it can be considered as significant.



In [ ]:

    
t,p = sobel_test(ea,eb,sea,seb)
print('Sobel\'s test of significance t = %2.2f') % t
print('Two-tailed p-value p = %2.5f ') % p

Moderation Analysis

In a moderation analysis, the moderator modifies or changes the relationship between two variables, akin to an interaction term. Moderation is slightly different from an interaction due to the additional constraint that there is a causal relationship from X to Y, BUT not from Z to Y. Therefore, a moderation implies an interaction exists but an interaction does not imply a moderation.

Here is a schematic representation of a moderation relationship. This diagram hypothesize that Stress has a causal relationship to Depression but the effect of Stress is different for people with high or low Social Support

This can be reprsented by an interaction,

The pictures have been retrieved from here

Here are a few examples of moderations. Can you think of more?

1) The effect of compliments on future grades is moderated by growth mindset (Carol Dweck)

2) Effect of favorability on government behavior is moderated by political affiliation.

3) Effect of pressure on performance is moderated by confidence (choking vs boosting).

For more information look at homepage of Kenny who started all this.

Example

Here we examine whether the effect of buying books (Buy) on enjoyment of reading (Enjoy) is moderated by frequency of reading (Read).

The moderation effect exists if there is an interaction of buying and reading on enjoyment.



In [ ]:

    
df = pd.DataFrame({
    'Enjoy':[4, 16, 4, 12, 9, 5, 15, 21, 3, 4, 8, 11, 7, 5, 8, 19, 11, 9, 9, 13, 11, 21, 18, 12, 15, 3, 2, 10, 7, 9, 5, 6, 9, 12, 9, 5, 17, 15, 9, 7, 5, 10, 10, 6, 7, 9, 12, 2, 1, 5, 7, 5, 8, 5, 5, 11, 8, 9, 13, 9, 19, 8, 21, 1, 11, 8, 6, 23, 2, 9, 13, 4, 10, 12, 5, 7, 10, 11, 12, 13],
    'Buy':[5, 8, 0, 8, 3, 0, 8, 9, 8, 1, 7, 2, 2, 9, 3, 8, 8, 9, 5, 7, 7, 9, 6, 7, 8, 4, 4, 3, 1, 4, 1, 5, 5, 2, 9, 5, 7, 8, 2, 4, 1, 4, 0, 1, 0, 8, 2, 4, 0, 0, 0, 1, 2, 2, 2, 7, 5, 1, 9, 9, 8, 1, 7, 2, 5, 2, 4, 9, 1, 6, 3, 0, 7, 5, 2, 3, 1, 8, 6, 4],
    'Read':[0, 8, 0, 5, 4, 9, 6, 9, 1, 3, 3, 3, 8, 0, 9, 8, 6, 0, 5, 6, 1, 7, 7, 5, 7, 2, 0, 5, 1, 7, 7, 4, 6, 5, 3, 1, 7, 6, 0, 4, 0, 9, 5, 9, 2, 3, 5, 2, 5, 2, 9, 1, 1, 7, 9, 3, 0, 4, 4, 3, 8, 8, 8, 2, 3, 7, 1, 8, 6, 1, 7, 0, 3, 2, 5, 3, 8, 6, 9, 7] 
})

Importance of centering variables for interaction

The interaction effect can be VERY different if you don't center your variables



In [ ]:

    
df['Interaction'] = df.Read*df.Buy
np.corrcoef(df.Buy,df.Interaction)



In [ ]:

    
plt.scatter(df.Buy,df.Interaction)
plt.xlabel('Read * Buy')
plt.ylabel('Buy')



In [ ]:

    
center = lambda x: (x - x.mean())
df[['Read_Centered','Buy_Centered']] = df[['Read','Buy']].apply(center)
df['Interaction_Centered'] = df['Read_Centered'] * df['Buy_Centered']
np.corrcoef(df.Buy,df.Interaction_Centered)



In [ ]:

    
plt.scatter(df.Buy,df.Interaction_Centered)
plt.xlabel('Read * Buy')
plt.ylabel('Buy')



In [ ]:

    
mod = smf.ols(formula = "Enjoy ~ Buy + Read + Interaction", data = df).fit()
print mod.summary()



In [ ]:

    
mod = smf.ols(formula = "Enjoy ~ Buy + Read + Interaction_Centered", data = df).fit()
print mod.summary()



In [ ]:

Exercises

Exercise 1 - Mediation

For exercise 1, let's revisit the salary dataset. ('../Data/salary_exercise.csv')

These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:

sx = Sex, coded 1 for female and 0 for male
rk = Rank, coded
- 1 for assistant professor,
- 2 for associate professor, and
- 3 for full professor
yr = Number of years in current rank
dg = Highest degree, coded 1 if doctorate, 0 if masters
yd = Number of years since highest degree was earned
sl = Academic year salary, in dollars.

Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.

Make sure you check the data for missing values or other errors before beginning your analysis.

Question:

Evaluate whether academic rank mediates the relationship between salary and gender



In [ ]:

Exercise 2 - Moderation

For Exercise 2, we will use the alcohol dataset again ('../Data/student-mat.csv').

Below is the information about the dataset.

Attribute Information:
1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2 sex - student's sex (binary: 'F' - female or 'M' - male)
3 age - student's age (numeric: from 15 to 22)
4 address - student's home address type (binary: 'U' - urban or 'R' - rural)
5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 Ã¢â‚¬â€œ 5th to 9th grade, 3 Ã¢â‚¬â€œ secondary education or 4 Ã¢â‚¬â€œ higher education)
8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 Ã¢â‚¬â€œ 5th to 9th grade, 3 Ã¢â‚¬â€œ secondary education or 4 Ã¢â‚¬â€œ higher education)
9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
31 G1 - first period grade (numeric: from 0 to 20)
31 G2 - second period grade (numeric: from 0 to 20)
32 G3 - final grade (numeric: from 0 to 20, output target)

Question:

The amount you go out seems to be associated with increased daily alcohol consumption. Does being in a relationship dampen this association? Are there any other variables that appear to moderate this relationship?



In [ ]:



In [ ]: