---
title: 'Multiple comparisons'
---
```{r, include=FALSE}
library(tidyverse)
library(pander)
library(lmerTest)
panderOptions('table.split.table', Inf)
```
# Multiple comparisons {#multiple-comparisons}
People get confused about multiple comparisons and worry about 'doing things
right'.
There are many different tests and procedures, and thousands of pages of
tutorials and guides each of which recommends a slightly different approach.
Textbooks typically describe the tests themselves in detail, and list the
assumptions they make, but it sometimes feels like nobody will give a straight
answer.
The inconsistency arises because different researchers have different
priorities. It might help to re-address what a _p_ value is, and what it is
_for_.
### _p_ values and 'false discoveries' {-}
In frequentist statistics, _p_ values are defined as _the probability of
obtaining a test statistic at least as large as that observed, if the null
hypothesis is true_. That is, it's the chance that the data we have collected
are atypical and will mislead us into thinking there is a difference, when the
true effect is zero.
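To make this concrete, here is a small simulation: if we run many experiments in
which the null hypothesis really is true, around 5% of them will come out
'significant' at _p_ < .05 just by chance:
```{r}
set.seed(1)
# 10,000 two-group 'experiments' with no true difference between the groups
null.p <- replicate(10000, t.test(rnorm(30), rnorm(30))$p.value)
mean(null.p < .05)  # close to the nominal alpha of .05
```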
Let's pretend we are creative researchers and, over the course of our career, we
will develop 100 hypotheses, each of which we test in an experiment, represented
by squares in the plot below:
```{r, echo=F}
x <- c(experiments=100)
waffle::waffle(x, colors=c("lightgrey"))
```
We are conscientious and use [sample size calculations](#power) for our studies,
setting our desired power = .8, and using _p_ = .05 as our criterion for rejecting
the null hypotheses. As is common, we report tests which reject the null
hypothesis _as if our predictions had been supported, and the alternative
hypotheses were true_ (this is a bad idea, but run with it for the moment).
Let's be generous and say that, in reality, 50% of our hypotheses are true (the
other 50% are nice ideas, but are not correct). Because we set our power to be
.8 in our sample size calculation, this means that over the course of our career
we will detect around 40 'true' effects, and publish them. These are shown in
blue in the figure below:
```{r, echo=F}
waffle::waffle(c(40, 60), colors=c("blue", "lightgrey")) + guides(fill=F)
```
Because we set our alpha level to 0.05, we will also have some false positives,
shown in red:
```{r, echo=F}
waffle::waffle(c(40, 5, 55), colors=c("blue", "red", "lightgrey")) + guides(fill=F)
```
But this means that, _for the effects we publish as supporting our
hypotheses_ (i.e. the blue and red squares), we will be making false claims
5/40 = 12.5% of the time. This is obviously much higher than the nominal alpha
level implies. What's more, if we're not hotshot theorists (or work in an area
where theories are less rigorously developed) and only 20% of our hypotheses are
in fact true, then we will make even more false claims: 5/20 = 25%.
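This arithmetic is easy to reproduce in R (a rough sketch, using the same
simplifications as the figures above):
```{r}
n.experiments <- 100
power <- .8
alpha <- .05
prop.true <- .5                                       # half of our hypotheses are really true
true.positives <- n.experiments * prop.true * power   # ~40 'true' effects detected
false.positives <- n.experiments * alpha              # ~5 false positives, as in the figure
false.positives / true.positives                      # false claims relative to true ones: 12.5%
```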
### Multiple tests on the same data {-}
Another problem arises when we run multiple statistical tests on the same data.
The most common case is that we have multiple experimental conditions and want
to test whether any of the differences between them are significant. If we had
3 experimental groups, then there would be 3 possible comparisons between them
(A-B, A-C, B-C). This can be compounded if we have multiple outcome
measurements (for example, measuring depression _and_ anxiety in a clinical
trial; see @feise2002multiple).
<!-- TODO Add wonka ticket image -->
This works like a lucky dip or lottery: if you buy more tickets you have a
larger chance of winning a prize. In this case, with three comparisons between
the groups (tickets) we have roughly a 15% chance of winning a prize, rather
than the 5% we intended.
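We can check that figure directly:
```{r}
# chance of at least one 'significant' result across 3 tests when all nulls are true
1 - (1 - .05)^3
```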
Assuming each of our 100 experiments allows for 3 tests, any of which would be
'interesting' if significant (and we would therefore publish them), then our
plot looks like this:
```{r, echo=F}
x <- c(40, 15, 100-55)
waffle::waffle(x, colors=c("blue", "red", "lightgrey")) + guides(fill=F)
```
And our 'false discovery rate' (at the level of our published papers) is now
over one third: 15/40 = 37.5%.
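The numbers behind this plot can be reconstructed in the same way (keeping the
same simplifications as above):
```{r}
false.positives <- 100 * 3 * .05  # 3 tests per experiment, so ~15 false positives
true.positives <- 40              # as before
false.positives / true.positives  # 37.5%
```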
##### False discovery rates for single experiments {-}
We can also think about a false discovery rate for findings presented within a
particular paper, or set of analyses. Imagine we have a 2x4 factorial design,
and so we have 8 experimental groups, and 28 possible pairwise combinations:
```{r, echo=F, results="asis"}
expand.grid(A=1:4, B=0:1) %>%
  mutate(Cell.No=row_number()) %>%
  select(Cell.No, A, B) %>%
  pander(caption="8 cells in the 2x4 design")

# helper to abbreviate long tables, keeping only the first and last n rows
snip.middle <- function(df, n=6) {
  setdots <- function(x) NA
  mid <- df[1,] %>% mutate_all(funs(setdots))
  bind_rows(head(df, n=n), mid, tail(df, n=n))
}

# all 28 pairwise comparisons between the 8 cells
combn(1:8, 2) %>% t %>%
  as_data_frame %>%
  rownames_to_column(var="Comparison") %>%
  rename(`First cell`=V1, `Second cell`=V2) %>%
  mutate(` `="vs.") %>%
  select(Comparison, `First cell`, ` `, `Second cell`) %>%
  snip.middle(n=3) %>%
  pander(caption="Pairwise comparisons", missing="...")
```
Assuming there are 'real' differences between only one-third of these pairings,
then we have the same problem as we did when considering all the experiments in
a researcher's career: using _p_ < .05 as our criterion, and power of .8, we
will report around 9 significant differences, but 1 or 2 of these will be false
discoveries (16%).
```{r, echo=F, include=F}
disc <- 28 * .8 * .333   # expected true discoveries: 28 tests x power x proportion of real effects
fa <- 28 * .05           # expected false alarms at alpha = .05
fdr <- fa/(disc+fa)      # false discovery rate among the significant results
fd <- (disc+fa) * fdr    # expected number of false discoveries
disc + fa                # ~9 significant differences in total
fd                       # ~1 or 2 false discoveries
fdr                      # ~16%
```
### What to do about it? {-}
There are two decisions you need to make: First, which technique do you want to
use to mitigate the risks of multiple comparisons? Second, what is the 'family'
of tests you are going to apply this to?
[If you find the whole language around null hypothesis testing and p values
unhelpful, and the detail of multiple comparison adjustment confusing, there is
another way: Multiple comparison problems are largely a non-issue for Bayesian
analyses [@gelman2012we], and recent developments in the software make simple
models like Anova and regression
[easy to implement in a Bayesian framework in R](#bayes-mcmc).]{.explainer}
#### Which technique? {-}
Inevitably, it depends!
- If your worry is about _ever_ making a type 1 error, then you need to _control
  the familywise error rate_. The Bonferroni or Tukey corrections are likely the
  best solutions (see details below). Both have the disadvantage of increasing
  the type 2 error rate: you will falsely accept the null more frequently, and
  so neglect findings which are 'true' and might be theoretically or
  clinically relevant.
- If your main concern is the number of 'false discoveries' you make,
  and you think it's important to avoid discarding 'good' hypotheses, then
  you need a technique to control the _false discovery rate_. The Benjamini
  and Hochberg method, often abbreviated as BH or FDR correction, is what you
  need.
Principled arguments can be made for both approaches; there is no 'correct
choice'. Which you pick will depend on which concerns are most pressing to you,
and also (inevitably) what is conventional in your field.
[It's worth noting that in specific fields (for example neuroscience, where huge
numbers of comparisons must be made between voxels or regions in brain imaging
studies) more sophisticated methods to control for multiple comparisons are
often commonplace and expected [@nichols2003controlling]. This guide only deals
with some of the simpler but common cases.]{.explainer}
#### How big is the family? {-}
Whichever technique we employ, we need to determine how many tests we want to
correct for having run; i.e. what size the 'family' of tests is. Unfortunately,
deciding which set of tests constitutes a 'family' is an unresolved problem, and
reasonable people will come up with different solutions. As @feise2002multiple
put it:
> It is unclear how wide the operative term "family" should be [...] Does
> "family" include tests that were performed, but not published? Does it include
> a meta-analysis upon those tests? Should future papers on the same data set be
> accounted for in the first publication? Should each researcher have a
> career-wise adjusted p-value, or should there be a discipline-wise adjusted
> p-value? Should we publish an issue-wise adjusted p-value and a
> year-end-journal-wise adjusted p-value? Should our studies examine only one
> association at a time, thereby wasting valuable resources? No statistical
> theory provides answers for these practical issues, because it is impossible
> to formally account for an infinite number of potential inferences.
The 'right' answer, then, will vary from case to case, and local conventions in
particular disciplines will dictate what is normal and expected by supervisors
and reviewers. However, there are a few common cases for which we can make some
recommendations on what might constitute 'good practice'. We do not claim that
these are the 'correct' answers, but if you deviate from these recommendations
it would be worth having a reasonable justification to hand!
##### Multiple outcome variables {-}
In a clinical trial or other applied study you might want to measure more than
one variable to detect changes across a variety of domains (e.g. both depression
and anxiety).
Although no firm consensus exists, it is common practice to:
- Analyse multiple outcomes without correction (and indicate this in the text)
- Designate one outcome from the study to be 'primary', and report this both
in pre-trial protocols and reports of the results.
See @feise2002multiple for more details.
##### Multiple 'timepoints' for the outcome {-}
Another common case is where the same outcome variable is measured on multiple
occasions to track progress over time. For example, an RCT might measure outcome
at baseline, at the end of treatment, and again 12 months later.
In the cases where there are only a few measurements of outcome made then you
could choose between:
1. Bonferroni correction of _p_ values for the between-group difference at each
   timepoint (a minimal sketch of this is shown below), or
2. Designating one of the timepoints as the 'primary' outcome.
In practice option 2 is more common, and probably makes more sense in applied
settings where it is the longer-term prognosis of participants which matters
most.
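If you take option 1, the adjustment itself is a single call to `p.adjust()`. A
minimal sketch (the _p_ values here are invented for illustration):
```{r}
# hypothetical between-group p values at baseline, end of treatment, and 12 months
timepoint.p <- c(baseline = .21, post.treatment = .012, followup = .04)
p.adjust(timepoint.p, method = "bonferroni")
```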
##### Pairwise comparisons in factorial designs {-}
Let's say you have a complex factorial design and so multiple pairwise
comparisons and other contrasts are possible.
- If you are only interested in a small number of the possible pairwise
comparisons or specific contrasts then specify this up front. Either report
the comparisons without correction (noting this in the text) or use
FDR/Bonferroni correction for this small family of tests.
- If you are only interested in specific main effects or interactions from the
full factorial model, then you should specify this in advance and report
these F-tests or t-tests and associated _p_ values without correction.
- It is perfectly reasonable to be interested only in specific pairwise
comparisons contained within an Anova F-test. It is normal to report the F
from the Anova, but it is _not_ necessary for the F test to be significant
before reporting the pairwise comparison of interest (with or without
FDR/Bonferroni correction).
- If you have not specified the comparisons of interest in advance, then use
FDR or Tukey correction for the complete set of pairwise tests. If you feel
able to claim that some of these pairwise comparisons would _never_ have
been of interest then you could reduce the family size for this correction
accordingly, but you should make this justification in your report.
### Practical examples {- #contrasts-examples}
#### Factorial Anova and pairwise comparisons {-}
In the [Anova cookbook](#howell-factorial-example), we used a dataset from
Howell's textbook which recorded `Recall` among younger vs. older adults (`Age`)
for each of 5 conditions:
```{r, echo=F}
eysenck <- readRDS("data/eysenck.Rdata")
eysenck %>%
  ggplot(aes(Condition, Recall, group=Age, color=Age)) +
  stat_summary(geom="pointrange", fun.data = mean_cl_boot) +
  ylab("Recall (95% CI)") + xlab("")
```
In an ideal world we would have published a trial protocol before collecting
data, or at the least
[specified which comparisons were of interest to us](#register-predictions).
However, for the purposes of this example I'll assume you didn't do this, and
need to address the potential for multiple comparisons accordingly.
The design of the study suggests that we were interested in the effect of the
experimental conditions on recall, and we might have predicted that these
experimental manipulations would affect older and younger adults differently.
Because we didn't pre-specify our analyses, we should acknowledge that we would
likely report any differences between any of the 5 experimental conditions (10
possible comparisons) and any differences between younger and older adults in
any of those conditions (45 possible pairwise comparisons).
Because replicating this experiment is relatively cheap (testing might only take
30 minutes per participant), and we are confident that other labs would want to
replicate any significant findings we report, we are less concerned with the
absolute rate of type 1 errors, but would like to limit our 'false discovery
rate' (we have a reputation to maintain). We set our acceptable false
discovery rate to 5%.
We run an Anova model, including main effects for `Age` and `Condition` and
their interaction:
```{r}
eysenck.model <- lm(Recall ~ Age * Condition, data=eysenck)

car::Anova(eysenck.model, type=3) %>%
  pander()
```
We report that there are significant main effects for Age and Condition, and a
significant interaction for `Age:Condition`. At this point, if we hadn't already
plotted the raw data, we would certainly want to do that, or to
[make predictions for each cell in the design and plot them](#understanding-interactions).
Because the plot of the raw data (see above) suggested that the counting and
rhyming conditions produced lower recall rates than the other conditions, we
might want to report the relevant pairwise tests. We can use the `lsmeans()`
function from the `lsmeans::` package to compute these:
```{r}
# use the lsmeans function which returns an lsm.list object.
eysenck.lsm <- lsmeans::lsmeans(eysenck.model, pairwise~Condition)
# the `contrasts` element in this list is what we want for now
eysenck.lsm$contrasts
```
By default Tukey correction is applied for multiple comparisons, which is a
reasonable choice. If you wanted to adjust for the false discovery rate instead,
you can use the `adjust` argument:
```{r, eval=F}
# not run, just shown as an example
lsmeans::lsmeans(eysenck.model, pairwise~Condition, adjust="fdr")
```
The plot also suggested that older and younger adults appeared to differ for the
'adjective', 'imagery', and 'intention' conditions. We can compute these
detailed pairwise comparisons by including the interaction term, `Age:Condition`
in the call to `lsmeans()`:
```{r}
eysenck.age.condition.lsm <-
  lsmeans::lsmeans(eysenck.model,
                   pairwise~Age:Condition,
                   adjust="fdr")

eysenck.age.condition.lsm$contrasts %>%
  broom::tidy() %>%
  snip.middle(., 4) %>%
  pander(missing="...", caption="8 of the 45 possible contrasts, FDR corrected", split.tables=Inf)
```
You should note that FDR-adjusted p values do not represent probabilities in
the normal sense. Instead, each adjusted value indicates the _smallest false
discovery rate at which that contrast would be considered statistically
significant_. So, for example, if the adjusted p value = 0.09, then this
indicates the contrast _would_ be significant if the acceptable false discovery
rate is 10%.
Researchers often set their acceptable false discovery rate to be 5% out of
habit or confusion with conventional levels for alpha, but it would be
reasonable to set it to 10% or even higher in some circumstances. Whatever level
you set, you need to be able to satisfy both yourself and a reviewer that the
result is of interest: the particular values for alpha and FDR are arbitrary
conventions, and what matters is that you both agree the finding is worthy of
consideration!
###### Extracting pairwise comparisons to display in your report {- #extract-contrasts}
In the code above we request FDR-adjusted p values, and then use the
`broom::tidy()` function to convert the table into a dataframe, and then use
`pander` to display the results as a table. For more information on extracting
results from models and other R objects and displaying them properly see the
section: [models are data too](#models-are-data-too).
##### Adjusting for the FDR by hand {-}
If you have a set of tests for which you want to adjust for the FDR it's easy to
use R's `p.adjust()` to do this for you.
```{r}
set.seed(5)
p.examples <-
  data_frame(p = runif(4, min=.001, max=.07)) %>%
  mutate(p.fdr = p.adjust(p, method="fdr"),
         p.sig = ifelse(p < .05, "*", ""),
         p.fdr.sig = ifelse(p.fdr < .05, "*", ""))

p.examples %>%
  arrange(p) %>%
  pander(caption="Randomly generated 'p values', with and without FDR correction applied.")
```
We can plot these values to see the effect the adjustment has:
```{r, echo=F, message=F, warning=F, fig.cap="Example of the adjustments made to p values by FDR and Bonferroni methods. Unadjusted p values are shown in black. Note how conservative Bonferroni is with even this small number of comparisons to correct for."}
p.examples %>%
  select(-matches("sig")) %>%
  mutate(`Bonferroni` = p.adjust(p, method="bonferroni")) %>%
  arrange(`p`) %>%
  mutate(rank=1:n()) %>%
  reshape2::melt(id.var=c("p","rank")) %>%
  ggplot(aes(x=factor(rank), y=value, group=rank, color=variable)) +
  geom_hline(yintercept=0.05) +
  geom_line(color="grey", arrow=arrow(length = unit(0.25, "cm"))) +
  geom_point(aes(y=p), colour="black") +
  geom_segment(aes(x = rank, y = p, xend=rank, yend=value), color="grey",
               arrow=arrow(length = unit(0.25, "cm"))) +
  geom_point() +
  ylab("Original/Corrected p value") +
  xlab("Ranking of p values") +
  facet_wrap(~variable) +
  guides(color=F) + theme(aspect.ratio = 2)
```
#### Other contrasts {- #contrasts}
The `lsmeans::` package provides tools to compute contrasts for Anova and other
linear models:
```{r, message=F}
library(lsmeans)
```
We can see a number of these different contrasts below. First, we run a model
including `Days` as a factor:
```{r}
slope.model <- lmer(Reaction ~ factor(Days) + (1|Subject),
                    data=lme4::sleepstudy)
```
And then we create the `lsmeans` object:
```{r}
slope.model.lsm <- lsmeans::lsmeans(slope.model, ~Days)
```
We can use this `lsm` object to calculate various contrasts we might want to
see. First, consecutive contrasts where each level is compared with the next:
```{r}
lsmeans::contrast(slope.model.lsm, "consec")
```
Polynomial contrasts allow us to test whether the increase in RTs over the days
is linear or curved:
```{r}
lsmeans::contrast(slope.model.lsm, "poly", adjust="fdr", max.degree=3)
```
If we ask for every pairwise comparison and don't specify a type of correction,
Tukey adjustment is applied by default:
```{r, eval=F}
# 45 tests - results not shown because similar to those shown above
lsmeans::contrast(slope.model.lsm, "tukey")
```
If we want to compare particular levels of `Days` with a specific reference
level we can use the `trt.vs.ctrl` contrast type. In this example we are
comparing each day with day 0, because day 0 is the first in the list of days
and we specified `ref=1` (try running the comparisons against day 5).
```{r}
lsmeans::contrast(slope.model.lsm, "trt.vs.ctrl", ref=1, adjust="fdr")
```
And if we want to compare each level against the grand mean of the sample, we
can use the `eff` contrast type (short for 'effect'). A way of thinking of this
(in English) is that we are asking 'what is the effect of being on a particular
day, relative to the average of all days?'
```{r}
lsmeans::contrast(slope.model.lsm, "eff", adjust="fdr")
```
To list all the available contrast types, or see the relevant help files, you can
type:
```{r, eval=F}
# help file describing all the contrasts families
help(pairwise.lsmc)
```
#### Contrasts for interactions {- #interaction-pairwise-contrasts}
Let's recall [our example of factorial Anova](#howell-factorial-example):
```{r, include=F}
eysenck <- readRDS("data/eysenck.Rdata")
eysenck.model <- lm(Recall~Age*Condition, data=eysenck)
```
```{r}
car::Anova(eysenck.model, type=3) %>%
pander
```
We can compute pairwise comparisons within the interaction as follows:
```{r}
eysenck.lsm <- lsmeans::lsmeans(eysenck.model, ~Age:Condition)
```
And the various contrasts, here comparing each cell against the grand mean:
```{r}
lsmeans::contrast(eysenck.lsm, "eff", adjust="fdr")
```
Or all of the pairwise tests with Tukey adjustment:
```{r}
lsmeans::contrast(eysenck.lsm, "tukey") %>%
  broom::tidy() %>%
  snip.middle(4) %>%
  pander(missing="...", caption="Some of the Tukey corrected pairwise contrasts (table abbreviated)")
```