---
title: 'Regression in R'
---

```{r include=FALSE, cache=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE, comment=NA, message=FALSE)
library(tidyverse)
```
# Regression {#linear-models-simple}
#### {- .tip}
This section assumes most readers will have done an introductory statistics
course and had some practice running multiple regression and/or Anova in SPSS
or a similar package.
<!-- If you haven't, don't rush out to do that though XXX what should they do instead? Recommend some MOOC? XXX? -->
### Describing statistical models using formulae {- #formulae}
R requires that you are explicit about the statistical model you want to run but
provides a neat, concise way of describing models, called a `formula`. For
multiple regression and simple Anova, the formulas we write map closely onto the
underlying _linear model_. The formula syntax provides shortcuts to quickly
describe all the models you are likely to need.
Formulas have two parts: the left hand side and the right hand side, which are
separated by the tilde symbol: `~`. Here, the tilde just means 'is predicted
by'.
For example, this formula specifies a regression model where `height` is the
_outcome_, and `age` and `gender` are the _predictor_ variables.^[I avoid the
terms dependent/independent variables because they are confusing to many
students, and because they are misleading when discussing non-experimental
data.]
```
height ~ age + gender
```
There are lots more useful tricks to learn when writing formulas, which are
covered below. But in the interests of instant gratification let's work through
a simple example first:
### Running a linear model {-}
Linear models (including Anova and multiple regression) are run using the
`lm(...)` function, short for 'linear model'. We will use the `mtcars` dataset,
which is built into R, for our first example.
First, we have a quick look at the data. The pairs plot suggests that `mpg`
might be related to a number of the other variables including `disp` (engine
size) and `wt` (car weight):
```{r}
mtcars %>%
select(mpg, disp, wt) %>%
pairs
```
Before running any model, we should ask ourselves: "what question are we trying
to answer?"
In this instance, we can see that both weight (`wt`) and engine size (`disp`)
are related to `mpg`, but they are also correlated with one another. We might
want to know, then, "are weight and engine size independent predictors of
`mpg`?" That is, if we know a car's weight, do we gain additional information
about its `mpg` by measuring engine size?
To answer this, we could use multiple regression, including both `wt` and `disp`
as predictors of `mpg`. The formula for this model would be `mpg ~ wt + disp`.
The command below runs the model:
```{r}
lm(mpg ~ wt + disp, data=mtcars)
```
For readers used to wading through reams of SPSS output R might seem concise to
the point of rudeness. By default, the `lm` command displays very little, only
repeating the formula and listing the coefficients for each predictor in the
model.
So what next? Unlike SPSS, we must be explicit and tell R exactly what we want.
The most convenient way to do this is to first store the results of the `lm()`
function:
```{r}
m.1 <- lm(mpg ~ wt + disp, data=mtcars)
```
This stores the results of the `lm()` function in a variable named `m.1`. As an
aside, this is a pretty terrible variable name — try to give descriptive names
to your variables because this will prevent errors and make your code easier to
read.
We can then use other functions to get more information about the model. For
example:
```{r}
summary(m.1)
```
Although still compact, the `summary` function provides some familiar output,
including the estimate, _SE_, and _p_ value for each parameter.
Take a moment to find the following statistics in the output above:
- The coefficients and p values for each predictor
- The _R_^2^ for the overall model. What % of variance in `mpg` is explained?
Answer the original question: 'accounting for weight (`wt`), does engine size
(`disp`) tell us anything extra about a car's `mpg`?'
### More on formulas {-}
Above we briefly introduced R's formula syntax. Formulas for linear models have
the following structure:
```
left_hand_side ~ right_hand_side
```
For linear models _the left side is our outcome_, which must be a continuous
variable. For categorical or binary outcomes you need to use the `glm()`
function rather than `lm()`. See the section on
[generalised linear models](glm.html) for more details.
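As a quick taster, a logistic regression predicting the binary `am` variable in
`mtcars` (manual vs. automatic transmission) might look like the sketch below;
see the linked section for how to interpret the output:

```{r}
glm(am ~ wt, data=mtcars, family=binomial)
```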
_The right hand side of the formula lists our predictors_. In the example above
we used the `+` symbol to separate the predictors `wt` and `disp`. This told R
to simply add each predictor to the model. However, many times we want to
specify relationships _between_ our predictors, as well as between predictors
and outcomes.
For example, we might have an experiment with 2 categorical predictors, each
with 2 levels --- that is, a 2x2 between-subjects design.
Below, we define and run a linear model with both `vs` and `am` as predictors,
along with the interaction of `vs:am`. We save this model as `m.2`, and use the
`summary` command to print the coefficients.
```{r}
m.2 <- lm(mpg ~ vs + am + vs:am, data=mtcars)
summary(m.2)
```
We'd normally want to see the Anova table for this model, including the F-tests:
```{r}
car::Anova(m.2)
```
But before you do too much with Anova in R, [read this section](anova.html).
#### Other formula shortcuts {-}
In addition to the `+` symbol, we can use other shortcuts to create linear
models.
As seen above, the colon (`:`) operator indicates the interaction between two
terms. So `a:b` is equivalent to creating a new variable in the data frame where
`a` is multiplied by `b`.
The `*` symbol is a shorthand which expands to include the individual terms
plus their interaction. So, `a*b` is equivalent to `a + b + a:b`.
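This means the interaction model we saved as `m.2` above could have been
written more compactly (the name `m.2b` is just for illustration):

```{r}
# Identical to mpg ~ vs + am + vs:am
m.2b <- lm(mpg ~ vs * am, data=mtcars)
coef(m.2b)
```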
Finally, it's good to know that other functions can be used within R formulas to
save work. For example, if you wanted to transform your dependent variable then
`log(y) ~ x` will do what you might expect, and saves creating temporary
variables in your dataset.
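For example, to model the log of `mpg` without creating a new column:

```{r}
lm(log(mpg) ~ wt, data=mtcars)
```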
The formula syntax is very powerful, and the above only shows the basics, but
you can read the `formula` help page (type `?formula` in the R console) for
more details.
##### {- .exercise}
Run the following models using the mtcars dataset:
- With `mpg` as the outcome, and with `cyl` and `hp` as predictors
- As above, but adding the interaction of `cyl` and `hp`.
- Repeat the model above, but write the formula a different way (make the
formula either more or less explicit, but retaining the same predictors in
the model).
## Factors and variable codings {-}
#### {- #factors-vs-linear-inputs}
If you store categorical data as numbers (e.g. groups `1`, `2`, `3` ...) it's
important to make sure your predictors are entered correctly into your models.
In general, R works in a 'regressiony' way and will assume variables in a
formula are linear predictors. So, a `group` variable coded `1`...`4` will be
entered as a single parameter where 4 is considered twice as large as 2, etc.
See below for an example. In the first model `cyl` is entered as a 'linear
slope'; in the second, each value of `cyl` (4, 6, or 8) is treated as a
separate category. The predictions from each model could be very different:
```{r, eval=F}
linear.model <- lm(mpg ~ cyl, data=mtcars)
categorical.model <- lm(mpg ~ factor(cyl), data=mtcars)
```
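To see the difference, we can fit both models and compare their predictions at
each observed number of cylinders:

```{r}
linear.model <- lm(mpg ~ cyl, data=mtcars)
categorical.model <- lm(mpg ~ factor(cyl), data=mtcars)

# Predicted mpg at each observed value of cyl
newdata <- data.frame(cyl = c(4, 6, 8))
data.frame(newdata,
           linear = predict(linear.model, newdata),
           categorical = predict(categorical.model, newdata))
```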
In the case of different experimental groups, what you would normally want is
for `group` to be entered as a set of categorical parameters in your model. The
most common way of doing this is 'dummy coding', and this is what R implements
by default for [character or factor variables](#character-and-factor).
To make sure your categorical variables are entered into your model as
categories (and not a slope) you can either:
- Convert the variable to a character or factor type in the dataframe or
- Specify that the variable is a factor when you run the model
For example, here we specify `cyl` is a factor within the model formula:
```{r}
lm(mpg ~ factor(cyl), data=mtcars)
```
Whereas here we convert to a factor in the original dataset:
```{r}
mtcars$cyl.factor <- factor(mtcars$cyl)
lm(mpg ~ cyl.factor, data=mtcars)
```
[Neither option is universally better, but if you have variables which are
*definitely* factors (i.e. should never be used as slopes) it's probably better
to convert them in the original dataframe, before you start
modelling]{.admonition}
## Model specification {- #parameterisation}
It's helpful to think about regression and other statistical models as if they
were machines that do work for us --- perhaps looms in a cloth factory. We feed
the machines raw materials, and they busy themselves producing the finished
cloth. The nature of the finished cloth is dependent on two factors: the raw
material we feed it, and the setup and configuration of the machine itself.
In regression (and Anova) the same is true: Our finished results are the
parameter estimates the model weaves from our raw data. The pattern we see
depends on the configuration of the machine, and it's important to realise the
same data can provide very different outputs depending on the setup of the
machine.
#### Equivalent models {-}
In some cases, changing the 'setup' of the machine produces results which,
although they look very different, are in fact equivalent in some sense. Let's
say our weaving machine produces a lovely set of rugs, shown in the figure below:

Now imagine that we flip all the standard settings on the weaving machine. We
feed the same raw materials to the loom, but the results _look_ very different:

The second set of rugs are inversions of the first, _but the patterns remain the
same_. The same sort of thing happens when we recode variables before entering
them in our regression models. For example:
```{r}
coef(lm(mpg ~ wt, data=mtcars))
```
We can run a completely equivalent model if we 'flip' the weight (`wt`)
coefficient by multiplying by `-1`:
```{r}
mtcars$wt.reversed <- -1 * mtcars$wt
coef(lm(mpg ~ wt.reversed, data=mtcars))
```
These models are equivalent in all the important ways: the test statistics and
p values are identical; only the sign of the coefficient for weight has
changed.
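We can verify this by inspecting the coefficient table for `wt` in each model:

```{r}
# Estimate and t value flip sign; the SE and p value are unchanged
summary(lm(mpg ~ wt, data=mtcars))$coefficients["wt", ]
summary(lm(mpg ~ wt.reversed, data=mtcars))$coefficients["wt.reversed", ]
```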
The same kind of thing happens when we choose a
[different _coding scheme_ for categorical variables (see section below)](#regression-coding):
although the parameter estimates change when the coding format changes, the
_underlying model is equivalent because it would make the same predictions for
new data_.
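As a quick check, we can fit the same model under the two coding schemes
described [below](#regression-coding) and confirm the predictions are identical:

```{r}
# Different parameter estimates, identical fitted values
m.treat <- lm(mpg ~ C(factor(cyl), contr.treatment), data=mtcars)
m.sum <- lm(mpg ~ C(factor(cyl), contr.sum), data=mtcars)
all.equal(predict(m.treat), predict(m.sum))
```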
#### Non-equivalent models {-}
In the case above we saw models which were equivalent in the sense that they
produced identical predictions for new observations.
Now we need to stretch our rug analogy a little, but imagine the output of our
machine is now an image of the northern lights, as in image A below, and that
by changing some settings of the machine we might instead produce images B or C:

You might reasonably ask why we would prefer image B or C to image A? The answer
is that, when running statistical models, we must remember that they are always
_simplifications_ of reality that are (hopefully) _useful_ to us.
For example, our goal might be to use images of the northern lights to track the
position of the aurora in the sky. If so, we might find that thresholding the
picture in this way makes it easier to see where the centre of mass of the light
is. By smoothing out many of the 'ripples' and surface-level patterning in the
light, the overall shape becomes clearer. In fact, this is one of the
techniques computer vision systems use to pre-process image inputs.
Likewise, we face similar problems when analysing psychological data. For
example, when we measure an outcome repeatedly over a period of seconds, days or
months, we are probably interested in the overall _shape_ of the change, rather
than in the surface-level patterning of these changes.
Let's stretch the images analogy again and simplify the image above by taking a
single pixel high 'slice' through the centre of each image (A, B and C). In the
figure below I've stretched these slices vertically so that you can see the
banding in the colour, but each of these images is just a single pixel high:

Let's simplify this even further, and convert these images to greyscale:

We might think of these 3 images as follows:
- 'Raw' represents our raw data (image A above), and the greys represent the
value of our 'outcome' (intensity of light). The x-axis in this case is
position in the sky, but could just as well be time or some other continuous
variable.
- 'Threshold' represents some kind of categorical model for these data (akin
to image B above), in which we 'chunk' up the x-axis and make predictions
for each chunk.
- 'Smooth' (akin to image C above) represents some kind of linear model with
terms which represent the gradual changes in the outcome across the range of
the x-axis (like [slopes or polynomial terms in regression](#polynomials)).
Because a digital image is just a list of `intensity` values for each pixel we
can read the image into R and plot the raw values like any other:
```{r, fig.cap="Plot of the intensity of light in the single pixel slice from the Aurora image. Intensity of 1 corresponds to white, and 0 to black in the original image."}
intensity.vec <- as.vector(png::readPNG('media/aurora-1-px-raw.png'))

aurora.image <- data_frame(Intensity = intensity.vec) %>%
  mutate(x = row_number())

aurora.image %>%
  ggplot(aes(x, Intensity)) +
  geom_point(size=.5)
```
We can fit linear models to these data, just like any other. So we might start
by predicting intensity with a simple slope:
```{r}
aurora.linear <- lm(Intensity ~ x, data=aurora.image)
```
And we could make predictions from this model and plot them against the
original:
```{r}
aurora.image %>%
  mutate(linear.prediction = predict(aurora.linear)) %>%
  reshape2::melt(id.var='x') %>%
  ggplot(aes(x, value, group=variable, color=variable)) +
  geom_point(size=.5)
```
As we can see, our predictions are pretty terrible, because the linear model
only allows for a simple slope over the range of `x`.
To improve the model, we can go in one of two ways:
1. Fit slopes and curves for `x`
2. Break `x` up into chunks
#### Chunks {-}
```{r}
# Create a new chunked x variable (a factor)
x.in.chunks <- cut(aurora.image$x, breaks=8)

# Run a model with these chunks as a factor
aurora.chunks <- lm(Intensity ~ x.in.chunks, data=aurora.image)

# Plot the predictions again
aurora.image %>%
  mutate(
    linear.prediction = predict(aurora.linear),
    chunked.prediction = predict(aurora.chunks)
  ) %>%
  reshape2::melt(id.var='x') %>%
  ggplot(aes(x, value, group=variable, color=variable)) +
  geom_point(size=.5)
```
That's somewhat better, although we can still see that the extremes of our
observed data are not well predicted by either the linear model (the flat line)
or the chunked model.
[Try cutting the `x` variable into more chunks. What are the pros and cons of
doing this? How many chunks would you need to reproduce the original data
faithfully?]{.exercise}
#### Slopes and curves {-}
An alternative strategy at this point is to try to fit smooth curves through
the data. One way of doing this
([explained in greater detail in the section on polynomials](#polynomials)) is
to fit multiple parameters which represent the initial slope, and then changes
in slope, across the values of `x`. In general, we need one parameter for each
change in 'direction' we want our curve to take.
For example, we can fit a curve with 3 changes of direction by fitting a
'third degree' polynomial:
```{r}
aurora.curve <- lm(Intensity ~ poly(x, 3), data=aurora.image)

aurora.image %>%
  mutate(curve.prediction = predict(aurora.curve)) %>%
  reshape2::melt(id.var='x') %>%
  ggplot(aes(x, value, group=variable, color=variable)) +
  geom_point(size=.5)
```
Or we could increase the number of parameters in our curve to allow a tighter
fit with the raw data and plot all the models together:
```{r}
aurora.curve <- lm(Intensity ~ poly(x, 7), data=aurora.image)

all.predictions <- aurora.image %>%
  mutate(
    linear.prediction = predict(aurora.linear),
    chunked.prediction = predict(aurora.chunks),
    curve.prediction = predict(aurora.curve)
  ) %>%
  reshape2::melt(id.var='x')

all.predictions %>%
  ggplot(aes(x, value, group=variable, color=variable)) +
  geom_point(size=.5)
```
We can see that this curved model is a better approximation to the raw data than
our 'chunked' model in some places (e.g. x = 100), but worse in others (e.g. x =
625). Overall though, the R^2^ is much higher for the curves model here:
```{r}
summary(aurora.chunks)$r.squared
summary(aurora.curve)$r.squared
```
And this is the case even though our curved model contains only 8 parameters,
making it just as parsimonious as the chunked model above.
```{r}
# count the number of parameters in the chunked and curved models
length(coef(aurora.chunks))
length(coef(aurora.curve))
```
[Try to plot a curve that fits even more closely to the data. There are 1200
pixels in our original image. How many parameters would you need for the model
to fit the image exactly? What happens in practice if you try and fit this
model?]{.exercise}
For fun, we can even plot our data back in image form and see which is closest
to matching the original:
```{r, echo=F}
all.predictions %>%
  mutate(variable=stringr::str_replace_all(variable, "Intensity", "raw.image")) %>%
  mutate(variable=forcats::fct_rev(variable)) %>%
  rename(Intensity=value) %>%
  ggplot(aes(x, factor(1), fill=Intensity)) +
  geom_raster() +
  facet_wrap(~variable, ncol=1) +
  xlab("") + ylab("") + guides(fill=F) +
  scale_x_continuous(breaks=NULL) +
  scale_fill_gradient(low="black", high="white") +
  theme_minimal() +
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank())
```
[There is no 'right answer' here: each model has pros and cons. You need to
think about what the purpose of your model is, how you want to simplify your
data, and then set up your models appropriately.]{.admonition}
### Effect/dummy coding and contrasts {- #regression-coding}
R lets you choose how categorical predictors are converted into model
parameters by setting a _contrast_ scheme. With the default 'treatment' (dummy)
coding, the intercept is the mean of the reference group, and each coefficient
is the difference between a group and that reference. With 'sum' (effect)
coding, the intercept is the unweighted mean of the group means, and the
coefficients are differences between each group and that mean. The estimates
look quite different, but the underlying model is equivalent:
```{r}
options(contrasts = c("contr.treatment", "contr.poly"))
lm(mpg~factor(cyl), data=mtcars)
options(contrasts = c("contr.sum", "contr.poly"))
lm(mpg~factor(cyl), data=mtcars)
```
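To see what each scheme actually does, we can print the contrast matrices for a
three-level factor (such as `cyl`): rows are the factor levels, and columns are
the parameters that enter the model:

```{r}
contr.treatment(3)
contr.sum(3)
```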
### Centering (is often helpful) {- #centering}
When interpreting regression coefficients, and especially when interactions are
present in a model, it's often overlooked that the regression parameters are the
effect on the outcome of a 1-unit change in the predictor, _when all the other
predictors are zero_. This applies to the intercept too: it is the predicted
value of the outcome when _all_ of the predictors are zero.
This is unhelpful because it makes the intercept mostly meaningless, and the
other coefficients harder to interpret.
It's often a good idea to _center_ your predictors so that the intercept of the
model can be interpreted as the predicted outcome for an average member of your
sample.
<!-- TODO FLESH THIS OUT -->
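As a minimal sketch using `mtcars`: after centering `wt`, the intercept becomes
the predicted `mpg` for a car of average weight, rather than for an impossible
car weighing zero:

```{r}
mtcars$wt.centered <- mtcars$wt - mean(mtcars$wt)
coef(lm(mpg ~ wt.centered, data=mtcars))
```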
### Scaling inputs {- #scaling-regression-inputs}
Interpreting regression coefficients requires that we think in the _units_ of
the predictor.
For example, if we include 'age in years' in our model, then the `age`
coefficient gives us the change in the outcome for each additional year.
However, we're often not interested in the effect of a single year. If we are
dealing with the effect of age in the general population, we're unlikely to care
about the effect of 1 year, and it might be more useful and natural to think
about 10-year differences in age. In contrast, if our research is on
adolescence, then changes over the course of a year might be too crude, and we
may want to think about changes over months instead.
It's important to realise there is no general or 'correct' solution to this
problem. Regression models don't care about the scale of our variables (within
limits), but we do need to make a choice about how we scale inputs. These
choices should aim to:
- make regression coefficients easily interpretable, and
- make results comparable across studies.
These two goals will not always be 100% aligned, and there will be tradeoffs as
you select your strategy, which will normally be one of:
1. Putting coefficients on a 'natural' scale or relating them to meaningful quantities
2. Standardising coefficients.
##### Using a 'natural' scale {-}
This will often mean just leaving your predictors 'as-is', but it might also
mean dividing your predictor by some number to put it into more convenient
units. For example, dividing age in years by 10 would mean that you can
interpret the coefficient as the change over a decade, which might be easier to
think about.
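For example (borrowing `mtcars`, since we have no age variable to hand), the
`I()` function lets us rescale a predictor inside the formula itself:

```{r}
# hp is in horsepower, so dividing by 100 makes the coefficient
# the change in mpg per additional 100 horsepower
coef(lm(mpg ~ I(hp / 100), data=mtcars))
```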
##### Standardising {-}
Gelman [@gelman2008scaling] recommends standardising coefficients by centering
and dividing by two standard deviations. This can be useful because binary
variables like sex (male/female) will then be on a _similar_ scale to numeric
inputs.
However, be cautious when standardising. You will sometimes see people interpret
standardised coefficients in terms of 'relative importance' of the predictors.
For example, they might say that if $\beta^{1}= .2$ and $\beta^{2} = .4$ then
$\beta^{2}$ is twice as important as $\beta^{1}$. Although this is appealing,
it's not always valid.
The main problem is that you don't always know whether you have a full range of
values of the predictors in your sample. For example, imagine a case where the
effect of `age` was linear, with a coefficient of 0.5, in a sample from the
general population.
```{r, include=F}
set.seed(1234)
incomes <- data_frame(
    age = round(rnorm(1000, 40, 15)),
    gender = sample(0:1, 1000, replace=T),
    income = round(rnorm(1000, 30000, 1000) + 30 * age + 1000 * gender)
  ) %>%
  mutate(gender = factor(gender, labels=c("F", "M")))
```
We can plot these data to show the effect of age, and gender:
```{r}
ggplot(incomes, aes(age, income, group=gender, color=gender)) +
  geom_point() +
  geom_smooth(se=F, method="lm")
```
Older people earn more than younger people, and men earn slightly more than
women (in this simulated dataset), but this gender gap doesn't change with age.
We can model this and print the effects of `age` and `gender`:
```{r}
m1 <- lm(income~age+gender, data=incomes)
coef(m1)
```
And we can standardise these effects using the `standardize` function from the
`arm` package:
```{r}
coef(arm::standardize(m1))
```
Based on these standardised coefficients we might say that age and gender are of
roughly equal importance in predicting income.
If we re-fit the model on a subset of the data, for example only individuals
under 40, the regression coefficients won't change much because the effect was
constant across the range of ages:
```{r}
younger.incomes <- incomes %>%
  filter(age < 40)

m2 <- lm(income ~ age + gender, data=younger.incomes)
coef(m2)
```
But, the standardised coefficients _do_ change, because we have restricted the
range of ages in the sample:
```{r}
coef(arm::standardize(m2))
```
The standardised effect of `age` is now roughly half that of `gender`.
[The take home message here is that standardisation can be useful to put
predictors on a similar scale, but it's not a panacea and can't be interpreted
as a simple measure of 'importance'. You still need to think!]{.admonition}
### Alternatives to rescaling {-}
A nice alternative to scaling the inputs of your regression is to set aside the
raw coefficients and instead make
[predictions for values of the predictors that are of theoretical or practical interest](#predictions-and-margins).
The section on [predictions and marginal effects](#predictions-and-margins) has
lots more detail on this.
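For example, rather than puzzling over the raw `age` coefficient from `m1`
above, we could report predicted incomes at ages of particular interest:

```{r}
# Predicted income for 25 and 45 year-old women in the simulated data
predict(m1, newdata = data.frame(age = c(25, 45), gender = "F"))
```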
### What next {-}
**_It is strongly recommended that you read the [section on Anova](anova.html)
before doing anything else._**
As noted above, R differs from packages like Stata or SPSS in a number of its
default settings. These defaults can substantially change the way you interpret
the output of linear models, especially Anova-type models with categorical
predictors.