---
title: 'Analysing mediation'
---

```{r}
library(tidyverse)
library(broom)
library(pander)
source('diagram.R')
```

# Mediation and covariance modelling

## Mediation {- #mediation}

Mediation is a complex topic, and the key message to take on board, before
starting to analyse your data, is that mediation analyses make many strong
assumptions about the data. These assumptions can often look pretty unreasonable
when spelled out, so be cautious in the interpretation of your data.

Put differently, mediation is a correlational technique aiming to provide a
causal interpretation of data; caveat emptor.

### Mediation with multiple regression {-}

One common (if outdated) way to analyse mediation is via the 3 steps described
by Baron and Kenny -@baron1986moderator (also see @zhao_reconsidering_2010).

Let's say we have a hypothesised situation such as this:

```{r}
knit_gv("
Lateness -> Crashes
Lateness -> Speeding
Speeding -> Crashes
")
```

Baron and Kenny propose 3 steps to establishing mediation. These steps
correspond to three separate regression models:

### Mediation Steps {-}

#### Step 1 (check distal variable predicts outcome) {-}

That is, show Lateness predicts Crashes

#### Step 2 (check distal variable predicts mediator) {-}

That is, show Lateness predicts Speeding

#### Step 3 (check for mediation) {-}

That is, show Speeding predicts Crashes, controlling for Lateness

An additional step, which allows us to test whether the effect is _completely_
mediated, also uses the final regression model:

#### Step 4 (check for total mediation) {-}

That is, check if Lateness still predicts Crashes, controlling for Speeding
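
Written as equations, the three regression models might look like this. The intercepts $i$ and residuals $e$ are notation assumed here for illustration, and $a$, $b$, $c$ and $c'$ are the conventional labels for the paths. Step 1 tests $c$, step 2 tests $a$, step 3 tests $b$, and step 4 tests $c'$:

$$
\begin{aligned}
\text{Crashes} &= i_1 + c \cdot \text{Lateness} + e_1 \\
\text{Speeding} &= i_2 + a \cdot \text{Lateness} + e_2 \\
\text{Crashes} &= i_3 + c' \cdot \text{Lateness} + b \cdot \text{Speeding} + e_3
\end{aligned}
$$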


### Mediation example after Baron and Kenny {-}

Using simulated data, we can work through the steps.

```{r, include=F, echo=F}
set.seed(12345)
N <- 200
# tibble() replaces the deprecated data_frame(); later columns can use earlier ones
smash <- tibble(
  person = 1:N, lateness = rpois(N, 10),
  speed = rnorm(N, 30 + .9 * lateness, 10),
  crashes = rpois(N, 2 + .3 * speed + .3 * lateness))
```

```{r}
smash %>% glimpse
```

Step 1: Does lateness predict crashes?

```{r}
step1 <- lm(crashes ~ lateness, data=smash)
tidy(step1) %>% pander()
```

Step 2: Does lateness predict speed?

```{r}
step2 <- lm(speed ~ lateness, data=smash)
tidy(step2, conf.int = T) %>% pander()
```

The coefficient for `lateness` is statistically significant, so we would say
yes.

Step 3: Does speed predict crashes, controlling for lateness?

```{r}
step3 <- lm(crashes ~ lateness+speed, data=smash)
tidy(step3) %>% pander()
```

The coefficient for `speed` is statistically significant so, following the Baron
and Kenny steps, we would say that mediation does occur.

Step 4: In the same model, does lateness predict crashes, controlling for speed?
That is to say, is the mediation via speed _total_?

Here, the coefficient for `lateness` in the step 3 model is still statistically
significant. According to the Baron and Kenny steps, this would indicate the
mediation is _partial_, although whether the p value falls on one side or the
other of .05 is not necessarily the best way to express this (see below for ways
to calculate the proportion of the effect which is mediated).
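
Step 4 needs no new model: the direct effect is simply the `lateness` row of the step 3 regression. As a minimal sketch, assuming data simulated along the same lines as above and stored under the hypothetical name `sim`, we could pull out that coefficient together with its 95% confidence interval, which says more about the size of the remaining direct effect than the p value alone:

```{r}
# Illustrative data, simulated in the same way as the example above
set.seed(12345)
n <- 200
lateness <- rpois(n, 10)
speed <- rnorm(n, 30 + .9 * lateness, 10)
crashes <- rpois(n, 2 + .3 * speed + .3 * lateness)
sim <- data.frame(lateness, speed, crashes)

# Step 3 model: outcome regressed on predictor and mediator
fit <- lm(crashes ~ lateness + speed, data = sim)

# Direct effect (c prime): the lateness coefficient with its 95% interval
direct <- cbind(estimate = coef(fit), confint(fit))["lateness", ]
round(direct, 2)
```

If the interval sits well away from zero, the direct effect is unlikely to be negligible, whichever side of .05 the p value happens to fall.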

We should also be concerned here with the degree to which predictor and mediator
are measured with error: if they are noisy measures, then the proportion of the
effect which appears to be mediated will be reduced artificially (see the SEM
chapter for more on this).
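
A small simulation can illustrate this attenuation. The sketch below is only an illustration: the variable names and effect sizes are made up, and extra noise is added to the mediator only. It compares the proportion mediated when the mediator is observed exactly with the proportion when it is observed with error:

```{r}
set.seed(2024)
n <- 5000
x <- rnorm(n)                          # predictor
m <- 0.6 * x + rnorm(n)                # true mediator
y <- 0.5 * m + 0.2 * x + rnorm(n)      # outcome
m_noisy <- m + rnorm(n, sd = 1)        # mediator measured with error

# Proportion mediated = indirect / (direct + indirect)
prop_mediated <- function(mediator) {
  a <- coef(lm(mediator ~ x))[["x"]]
  fit <- lm(y ~ x + mediator)
  b <- coef(fit)[["mediator"]]
  c_prime <- coef(fit)[["x"]]
  (a * b) / (c_prime + a * b)
}

res <- c(clean = prop_mediated(m), noisy = prop_mediated(m_noisy))
round(res, 2)
```

In this setup the noise shrinks the `b` path, so part of the effect that truly runs through the mediator is attributed to the direct path instead.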

## Testing the indirect effect {-}

Baron and Kenny also introduced conventions for labelling some of the
coefficients from the regressions described above.

Specifically, they described `a` as the path from the predictor to the mediator,
`b` as the path from the mediator to the outcome, and `c'` (`c` prime) as the
path from predictor to outcome, controlling for the mediator, as shown here:

```{r}
knit_gv("
Lateness -> Crashes[label='c`']
Lateness -> Speeding[label=a]
Speeding -> Crashes[label=b]
")
```
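
With these labels, the total effect of the predictor (the step 1 coefficient, often labelled `c`) splits into a direct and an indirect part: `c = c' + a*b`. For linear models fitted by least squares to the same rows of data, this identity holds exactly. The sketch below checks it on data simulated in the same way as the example above; the object names are illustrative only:

```{r}
set.seed(12345)
n <- 200
lateness <- rpois(n, 10)
speed <- rnorm(n, 30 + .9 * lateness, 10)
crashes <- rpois(n, 2 + .3 * speed + .3 * lateness)

c_total <- coef(lm(crashes ~ lateness))[["lateness"]]  # step 1: total effect
a       <- coef(lm(speed ~ lateness))[["lateness"]]    # step 2: path a
fit_y   <- lm(crashes ~ lateness + speed)              # step 3 model
b       <- coef(fit_y)[["speed"]]                      # path b
c_prime <- coef(fit_y)[["lateness"]]                   # direct effect

c(total = c_total, direct = c_prime, indirect = a * b)

# Total = direct + indirect, up to floating-point error
all.equal(c_total, c_prime + a * b)
```

The product `a * b` is the indirect effect, and `a * b / c_total` is the proportion of the total effect which is mediated.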

Subsequent authors wished to provide a test for whether the path through `a` and
`b` --- the indirect effect --- was statistically significant.
@preacher_spss_2004 published SPSS macros for computing this indirect effect and
providing a non-parametric (bootstrapped) test of this term. The same approach
is now implemented in a number of R packages.
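
To make the idea concrete, here is a hand-rolled sketch of a percentile bootstrap for the indirect effect, using base R only. It is not the implementation used by any particular package: the helper `indirect()` and the data frame `sim` are hypothetical names, and the data are simulated along the same lines as the example above.

```{r}
set.seed(1234)
n <- 200
lateness <- rpois(n, 10)
speed <- rnorm(n, 30 + .9 * lateness, 10)
crashes <- rpois(n, 2 + .3 * speed + .3 * lateness)
sim <- data.frame(lateness, speed, crashes)

# Indirect effect (a * b) for a single data set
indirect <- function(d) {
  a <- coef(lm(speed ~ lateness, data = d))[["lateness"]]
  b <- coef(lm(crashes ~ lateness + speed, data = d))[["speed"]]
  a * b
}

# Resample rows with replacement and re-estimate a * b each time
n_boot <- 1000
boot_ab <- replicate(n_boot, indirect(sim[sample(nrow(sim), replace = TRUE), ]))

estimate <- indirect(sim)
ci <- quantile(boot_ab, c(.025, .975))  # 95% percentile interval
estimate
ci
```

If the interval excludes zero, the indirect effect would be called statistically significant at the 5% level. Resampling avoids assuming that the product `a * b` is normally distributed, which it generally is not.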

The `mediation::mediate` function accepts the 2nd and 3rd regression models from
the 'Baron and Kenny' steps, along with arguments which identify which variables
are the predictor and the mediator. From this, the function calculates the
indirect effect, and the proportion of the total effect mediated. Each estimate
is accompanied by a simulation-based confidence interval and associated p value
(by default from a quasi-Bayesian Monte Carlo approximation; set `boot = TRUE`
for a nonparametric bootstrap).

For example, using the models we ran above, we can say:

```{r}
set.seed(1234)
crashes.mediation <- mediation::mediate(step2, step3, treat = "lateness", mediator = "speed")
summary(crashes.mediation)

```

From this output, we can see that the indirect effect is statistically
significant, and that around half of the total effect is mediated via speed.
Because lateness in itself is not a plausible cause of a crash, this suggests
that other factors (perhaps distraction or inattention) might be important in
mediating this residual direct effect.

## Mediation using path models {-}

An even more flexible approach to mediation can be taken using path models, a
type of [structural equation model](#covariance-modelling) covered in more
detail in the next section.

Using the `lavaan` package, path/SEM models can specify multiple variables to be
outcomes, and fit these models simultaneously. For example, we can fit both step
2 and step 3 in a single model, as in the example below:

```{r}
library(lavaan)

smash.model <- '
  crashes ~ speed + lateness
  speed ~ lateness
'

smash.model.fit <- sem(smash.model, data=smash)
summary(smash.model.fit)
```

The summary output gives us coefficients which correspond to the regression
coefficients in the step 2 and step 3 models --- but this time, from a single
model.

We can also use `lavaan` to compute the indirect effects by labelling the
relevant parameters, using the `*` and `:=` operators. See the
[`lavaan` syntax guide for mediation](http://lavaan.ugent.be/tutorial/mediation.html)
for more detail.

Note that the `*` operator does not have the same meaning as in formulas for
linear models in R. In `lavaan`, it attaches a modifier to a parameter: a name
(as below) labels the parameter, while a number fixes it to that value.

```{r}
smash.model <- '
  crashes ~ B*speed + C*lateness
  speed ~ A*lateness

  # computed parameters, see http://lavaan.ugent.be/tutorial/mediation.html
  indirect := A*B
  total := C + (A*B)
  proportion := indirect/total
'

smash.model.fit <- sem(smash.model, data=smash)
summary(smash.model.fit)
```
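
As a quick check that the labelled path model and the separate regressions agree, the sketch below compares the `indirect` parameter from `lavaan` with the product of the corresponding `lm()` coefficients. It assumes data simulated in the same way as above; `sim`, `path_model` and `path_fit` are illustrative names.

```{r}
library(lavaan)

set.seed(12345)
n <- 200
lateness <- rpois(n, 10)
speed <- rnorm(n, 30 + .9 * lateness, 10)
crashes <- rpois(n, 2 + .3 * speed + .3 * lateness)
sim <- data.frame(lateness, speed, crashes)

path_model <- '
  crashes ~ B*speed + C*lateness
  speed ~ A*lateness
  indirect := A*B
'
path_fit <- sem(path_model, data = sim)

# Defined parameters carry their name in the label column
est <- parameterEstimates(path_fit)
lavaan_indirect <- est$est[est$label == "indirect"]

# The same quantity from two separate least-squares regressions
a <- coef(lm(speed ~ lateness, data = sim))[["lateness"]]
b <- coef(lm(crashes ~ lateness + speed, data = sim))[["speed"]]

c(lavaan = lavaan_indirect, lm = a * b)
```

For a simple model like this one the point estimates should match to several decimal places; the standard errors can differ slightly, because `lavaan` uses maximum likelihood rather than least squares.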

We can again get a bootstrap interval for the indirect effect, and print a table
of just these computed effects like so:

```{r, error=F, warning=F}
set.seed(1234)
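# se = "bootstrap" gives bootstrapped standard errors and intervals;
# test = "bootstrap" would only bootstrap the model test statistic
smash.model.fit <- sem(smash.model, data=smash, se="bootstrap", bootstrap=100)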

parameterEstimates(smash.model.fit) %>%
  filter(op == ":=") %>%
  select(label, est, contains("ci")) %>%
  pander::pander()
```

Comparing these results with the `mediation::mediate()` output, we get similar
results. In both cases, it's possible to increase the number of resamples if
needed to increase the precision of the interval, via the `sims` argument to
`mediate()` or the `bootstrap` argument to `sem()` (the default is 1000 in both,
although the `lavaan` example above uses only 100; 5000 might be a good target
for publication).