Written by Jin Cheong & Luke Chang

A mediation analysis is conducted when a researcher is interested in the **mechanism** underlying how variable X has an effect on variable Y. It attempts to make a causal inference that a direct effect might be better explained by an indirect effect through a mediating variable.

Consider the instance below where X has an effect **c** on Y:

In this model, there may be a third variable **M** which mediates the effect of X on Y. In other words, the variable M is partially responsible for the effect X has on Y.

￼To conduct a mediation analysis one estimates two additional models: 1) the effect of X on M, and 2) the effect of X *and* M on Y.

Now the direct effect of X on Y, denoted as **C**,can be broken down into two parts:

$ a \cdot b $ is the indirect effect of X on Y via the mediator. $c'$ is the remaining direct effect of X on Y controlling for M.

This relationship is depicted below. Note that M and Y both also have error included in the model.

** Question** Why does X not have error included in the model?

** Answer** Because X is only a regressor not an outcome variable in the models and standard regression does not estimate error on regressors. See orthogonal regression for a technique that models error on both X and Y.

<img src="Figures/mediation1.png",width=500,align='center'>

1) The effect of failure on depressed feelings is mediated by internalization of failure.

2) Effect of CBT treatment is mediated by changes in cognition.

3) Effect of food intake on weight gain is mediated by metabolism.

```
In [ ]:
```%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.api as sm
from scipy import stats
def sobel_test(a, b, se_a, se_b):
'''
Sobel test for significance of mediation
Args:
a: coefficient from X to mediator variable, M
b: coefficient from M to Y
se_a: Standard error of A
se_b: Standard error fo B
Returns:
t: Sobel's test statistic
pval : Two-tailed probability assuming normal distribution
'''
SE = np.sqrt( (a**2)*(se_a**2) + (b**2)*(se_b**2))
t = (a*b) / SE
n = 100000000
pval = stats.t.sf(np.abs(t), n-1)*2
return t, pval

```
In [ ]:
```# set random seed so everyone gets same results
np.random.seed(1)
# Determine effects
a = -3 # effect of x to M
b = 3 # effect of M to y
cq = .1 # effect of x on y controlling for M
# Create a random data x
x = np.random.rand(100)
m = x * a + np.random.rand(100)
# Create Y
y = np.dot(np.array([x,m]).T,[cq,b]) + np.random.rand(100)

```
In [ ]:
```plt.scatter(x,y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('X -> Y')

```
In [ ]:
```plt.scatter(x,m)
plt.xlabel('X')
plt.ylabel('M')
plt.title('X -> M')

```
In [ ]:
```plt.scatter(m,y)
plt.xlabel('M')
plt.ylabel('Y')
plt.title('M -> Y')

```
In [ ]:
```X = pd.DataFrame({'Intercept':np.ones(len(x)),'X':x})
lm1 = smf.OLS(y,X).fit()
print lm1.summary()
ec = lm1.params[1] # save total effect c to ec

```
In [ ]:
```lm2 = smf.OLS(m,X).fit()
print lm2.summary()
ea = lm2.params[1] # Save the effect of X on M, a, to ea
sea = lm2.bse[1]

```
In [ ]:
```X['M'] = m
lm3 = smf.OLS(y,X).fit()
print lm3.summary()
ecq,eb = lm3.params[1:3]
seb = lm3.bse[2]

```
In [ ]:
```print('c : %.2f') % ec
print('a : %.2f') % ea
print('b : %.2f') % eb
print('c\' : %.2f') % ecq

```
In [ ]:
```print('Total effect C: %.2f') % ec
print('is decomposed into the indirect(mediated) effect a*b: %.2f') % (ea*eb)
print('plus the direct effect c\': %.2f') % ecq
print('which adds up to %.2f') % (ea*eb+ecq)

One way to test the significance of a mediation is to perform a Sobel test, where the indirect effect(a*b) is divided by an estimated standard error of the two. This assumes that the product would be normally distributed which may not always be the case.

An alternative method is to bootstrap with replacement on the observed data to generate a 95% confidence interval. You can try this by writing a for-loop that resamples from the data and generate a distribution of the indirect effects(a*b). If the confidence interval does not include 0, it can be considered as significant.

```
In [ ]:
```t,p = sobel_test(ea,eb,sea,seb)
print('Sobel\'s test of significance t = %2.2f') % t
print('Two-tailed p-value p = %2.5f ') % p

In a moderation analysis, the moderator modifies or changes the relationship between two variables, akin to an interaction term. Moderation is slightly different from an interaction due to the additional constraint that there is a causal relationship from X to Y, BUT not from Z to Y. Therefore, a moderation implies an interaction exists but an interaction does not imply a moderation.

Here is a schematic representation of a moderation relationship. This diagram hypothesize that Stress has a causal relationship to Depression but the effect of Stress is different for people with high or low Social Support

<img src="Figures/moderator2.gif",width=500,align='center'>

This can be reprsented by an interaction,

<img src="Figures/moderator3.jpeg",width=500,align='center'>

The pictures have been retrieved from here

1) The effect of compliments on future grades is moderated by growth mindset (Carol Dweck)

2) Effect of favorability on government behavior is moderated by political affiliation.

3) Effect of pressure on performance is moderated by confidence (choking vs boosting).

```
In [ ]:
```df = pd.DataFrame({
'Enjoy':[4, 16, 4, 12, 9, 5, 15, 21, 3, 4, 8, 11, 7, 5, 8, 19, 11, 9, 9, 13, 11, 21, 18, 12, 15, 3, 2, 10, 7, 9, 5, 6, 9, 12, 9, 5, 17, 15, 9, 7, 5, 10, 10, 6, 7, 9, 12, 2, 1, 5, 7, 5, 8, 5, 5, 11, 8, 9, 13, 9, 19, 8, 21, 1, 11, 8, 6, 23, 2, 9, 13, 4, 10, 12, 5, 7, 10, 11, 12, 13],
'Buy':[5, 8, 0, 8, 3, 0, 8, 9, 8, 1, 7, 2, 2, 9, 3, 8, 8, 9, 5, 7, 7, 9, 6, 7, 8, 4, 4, 3, 1, 4, 1, 5, 5, 2, 9, 5, 7, 8, 2, 4, 1, 4, 0, 1, 0, 8, 2, 4, 0, 0, 0, 1, 2, 2, 2, 7, 5, 1, 9, 9, 8, 1, 7, 2, 5, 2, 4, 9, 1, 6, 3, 0, 7, 5, 2, 3, 1, 8, 6, 4],
'Read':[0, 8, 0, 5, 4, 9, 6, 9, 1, 3, 3, 3, 8, 0, 9, 8, 6, 0, 5, 6, 1, 7, 7, 5, 7, 2, 0, 5, 1, 7, 7, 4, 6, 5, 3, 1, 7, 6, 0, 4, 0, 9, 5, 9, 2, 3, 5, 2, 5, 2, 9, 1, 1, 7, 9, 3, 0, 4, 4, 3, 8, 8, 8, 2, 3, 7, 1, 8, 6, 1, 7, 0, 3, 2, 5, 3, 8, 6, 9, 7]
})

```
In [ ]:
```df['Interaction'] = df.Read*df.Buy
np.corrcoef(df.Buy,df.Interaction)

```
In [ ]:
```plt.scatter(df.Buy,df.Interaction)
plt.xlabel('Read * Buy')
plt.ylabel('Buy')

```
In [ ]:
```center = lambda x: (x - x.mean())
df[['Read_Centered','Buy_Centered']] = df[['Read','Buy']].apply(center)
df['Interaction_Centered'] = df['Read_Centered'] * df['Buy_Centered']
np.corrcoef(df.Buy,df.Interaction_Centered)

```
In [ ]:
```plt.scatter(df.Buy,df.Interaction_Centered)
plt.xlabel('Read * Buy')
plt.ylabel('Buy')

```
In [ ]:
```mod = smf.ols(formula = "Enjoy ~ Buy + Read + Interaction", data = df).fit()
print mod.summary()

```
In [ ]:
```mod = smf.ols(formula = "Enjoy ~ Buy + Read + Interaction_Centered", data = df).fit()
print mod.summary()

```
In [ ]:
```

For exercise 1, let's revisit the salary dataset. ('../Data/salary_exercise.csv')

These are the salary data used in Weisberg's book, consisting of observations on six variables for 52 tenure-track professors in a small college. The variables are:

- sx = Sex, coded 1 for female and 0 for male
- rk = Rank, coded
- 1 for assistant professor,
- 2 for associate professor, and
- 3 for full professor

- yr = Number of years in current rank
- dg = Highest degree, coded 1 if doctorate, 0 if masters
- yd = Number of years since highest degree was earned
- sl = Academic year salary, in dollars.

Reference: S. Weisberg (1985). Applied Linear Regression, Second Edition. New York: John Wiley and Sons. Page 194.

**Make sure you check the data for missing values or other errors before beginning your analysis.**

Evaluate whether academic rank mediates the relationship between salary and gender

```
In [ ]:
```

For Exercise 2, we will use the alcohol dataset again ('../Data/student-mat.csv').

Below is the information about the dataset.

Attribute Information:

1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)

2 sex - student's sex (binary: 'F' - female or 'M' - male)

3 age - student's age (numeric: from 15 to 22)

4 address - student's home address type (binary: 'U' - urban or 'R' - rural)

5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)

6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)

7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 Ã¢â‚¬â€œ 5th to 9th grade, 3 Ã¢â‚¬â€œ secondary education or 4 Ã¢â‚¬â€œ higher education)

8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 Ã¢â‚¬â€œ 5th to 9th grade, 3 Ã¢â‚¬â€œ secondary education or 4 Ã¢â‚¬â€œ higher education)

9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')

12 guardian - student's guardian (nominal: 'mother', 'father' or 'other')

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

31 G1 - first period grade (numeric: from 0 to 20)

31 G2 - second period grade (numeric: from 0 to 20)

32 G3 - final grade (numeric: from 0 to 20, output target)

The amount you go out seems to be associated with increased daily alcohol consumption. Does being in a relationship dampen this association? Are there any other variables that appear to moderate this relationship?

```
In [ ]:
```

```
In [ ]:
```