---
title: 'Making predictions from models'
bibliography: bibliography.bib
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, message = FALSE, warning = FALSE)
library(tidyverse)
library(pander)
library(tufte)
library(ggthemes)
theme_set(theme_tufte(base_size = 18))
```
# Making predictions {#predictions-and-margins}
Objectives of this section:
- Distinguish predicted means (predictions) from predicted effects ('margins')
- Calculate both predictions and marginal effects for a `lm()`
- Plot predictions and margins
- Think about how to plot effects in meaningful ways
### Predictions vs margins {-}
Before we start, let's consider what we're trying to achieve in making
predictions from our models. We need to make a distinction between:
- Predicted means
- Predicted effects or _marginal effects_
Consider the example used in a previous section where we measured
`injury.severity` after road accidents, plus two predictor variables: `gender`
and `age`.
### Predicted means {-}
'Predicted means' (or predictions) refers to our best estimate for each category
of person we're interested in. For example, if `age` were categorical (i.e.
young vs. older people) then we might have 4 predictions to calculate from our
model:
```{r, echo=F}
expand.grid(Age=c("Young", "Old"), Gender=c("Male", "Female")) %>% as_data_frame() %>%
mutate(mean="?") %>% pander()
```
And as before, we might plot these data:
```{r, fig.cap="Point and line plot of injury severity by age and gender.", echo=F}
load('data/injury.Rdata')
inter.df <- inter.df %>% filter(interaction=="Interaction")
means.plot <- inter.df %>%
ggplot(aes(older, severity.of.injury, group=female, color=female)) +
geom_point() +
geom_line() +
scale_color_discrete(name="") +
ylab("Injury severity") + xlab("")
means.plot
```
This plot uses the raw data, but these points could equally have been estimated
from a statistical model which adjusted for other predictors.
### _Effects_ (margins) {-}
Terms like _predicted effects_, _margins_, or _marginal effects_ refer, instead,
to the effect of one predictor on the outcome.
There may be more than one marginal effect because _the effect of one predictor
can change across the range of another predictor_.
Extending the example above, if we take the difference between men and women for
each category of age, we can plot these differences. The steps we need to go
through are:
- Reshape the data to a wider format, with separate columns for the injury
  scores of men and women
- Subtract the score for men from that of women, to calculate the effect of
being female
- Plot this difference score
```{r, echo=T}
margins.plot <- inter.df %>%
# reshape the data to a wider format
reshape2::dcast(older~female) %>%
# calculate the difference between men and women for each age
mutate(effect.of.female = Female - Male) %>%
# plot the difference
ggplot(aes(older, effect.of.female, group=1)) +
geom_point() +
geom_line() +
ylab("Effect of being female") + xlab("") +
geom_hline(yintercept = 0)
margins.plot
```
As before, these differences use the raw data, but _could_ have been calculated
from a statistical model. In the section below we do this, making predictions
for means and marginal effects from a `lm()`.
### Continuous predictors {-}
In the examples above, our data were all categorical, which meant it was
straightforward to identify categories of people for whom we might want to make
a prediction (i.e. young men, young women, older men, older women).
However, `age` is typically measured as a continuous variable, and we would want
to use a grouped scatter plot to see this:
```{r, echo=F, include=F}
# in this block we simulate some more data to illustrate the points below
set.seed(1234)
injuries <- expand.grid(female=0:1, age=30:50, person=1:100) %>%
as_data_frame() %>%
mutate(severity.of.injury = 50 + -10 * female + .2 * age + .4 * female*age +
rnorm(length(.$female), 0, 5),
age = age+rnorm(n(), 0, 1)) %>%
mutate(gender=factor(female, labels=c("Male", "Female"))) %>%
mutate(age.category = cut(age, breaks=2, labels=c("young", "older"))) %>%
sample_n(1000)
```
```{r}
injuries %>%
ggplot(aes(age, severity.of.injury, group=gender, color=gender)) +
geom_point(size=1) +
scale_color_discrete(name="")
```
But to make predictions from these continuous data we need to fit a line through
the points (i.e. run a model). We can do this graphically by calling
`geom_smooth()`, which attempts to fit a smooth line through the data we observe:
```{r, fig.cap="Scatter plot overlaid with smooth best-fit lines", message=F}
injuries %>%
ggplot(aes(age, severity.of.injury, group=gender, color=gender)) +
geom_point(alpha=.2, size=1) +
geom_smooth(se=F)+
scale_color_discrete(name="")
```
And if we are confident that the relationships between predictor and outcome are
sufficiently _linear_, then we can ask ggplot to fit a straight line using
linear regression:
```{r, fig.cap="Scatter plot overlaid with smoothed lines (dotted) and linear predictions (coloured)", message=F}
injuries %>%
ggplot(aes(age, severity.of.injury, group=gender, color=gender)) +
geom_point(alpha = .1, size = 1) +
geom_smooth(se = F, linetype="dashed") +
geom_smooth(method = "lm", se = F) +
scale_color_discrete(name="")
```
These plots illustrate the steps a researcher might take _before_ fitting a
regression model. The straight lines in the final plot represent our
best guess for a person of a given age and gender, assuming a linear regression.
We can read from these lines to make a point prediction for men and women of a
specific age, and use the information about our uncertainty in the prediction,
captured by the model, to estimate the likely error.
To make our findings simpler to communicate, we might want to make estimates at
specific ages and plot these. These ages could be:
- Values with biological or cultural meaning: for example 18 (new driver) vs.
  65 (retirement age)
- Statistical convention (e.g. median, 25th, and 75th centile, or mean +/- 1
SD)
We'll see examples of both below.
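For example, with the simulated `injuries` data used above, we might construct candidate ages like this (a sketch; the specific values are illustrative):
```{r}
# Ages with cultural meaning (note these fall outside the simulated data,
# which spans roughly ages 30 to 50, so predictions here would extrapolate)
ages.cultural <- c(18, 65)

# Ages chosen by statistical convention: the 25th, 50th and 75th centiles
ages.statistical <- quantile(injuries$age, probs = c(.25, .5, .75))
ages.statistical
```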
## Predicted means and margins using `lm()` {-}
The section above details two types of predictions: predictions for means, and
predictions for margins (effects). We can use the figure below as a way of
visualising the difference:
```{r, fig.width=4, fig.height=1.5, fig.cap="Example of predicted means vs. margins. Note, the margin plotted in the second panel is the difference between the coloured lines in the first. A horizontal line is added at zero in panel 2 by convention."}
gridExtra::grid.arrange(means.plot+ggtitle("Means"), margins.plot+ggtitle("Margins"), ncol=2)
```
### Running the model {-}
Let's say we want to run a linear model which predicts injury severity from gender and
a categorical measurement of age (young vs. older).

Our model formula would be: `severity.of.injury ~ age.category * gender`. Here
we fit it and request the Anova table, which enables us to test the main effects
and interaction^[Because these are simulated data, the main effects and
interactions all have tiny p values.]:
```{r}
injurymodel <- lm(severity.of.injury ~ age.category * gender, data=injuries)
anova(injurymodel)
```
Having saved the regression model in the variable `injurymodel` we can use this
to make predictions for means and estimate marginal effects:
### Making predictions for means {-}
When making predictions, the key question to bear in mind is 'predictions for
what?' That is, what values of the predictor variables are we going to use to
estimate the outcome?
It goes like this:
1. Create a new dataframe which contains the values of the predictors we want to
make predictions at
2. Make the predictions using the `predict()` function.
3. Convert the output of `predict()` to a dataframe and plot the numbers.
#### Step 1: Make a new dataframe {-}
```{r}
prediction.data <- data_frame(
age.category = c("young", "older", "young", "older"),
gender = c("Male", "Male", "Female", "Female")
)
prediction.data
```
#### Step 2: Make the predictions {-}
The R `predict()` function has two useful arguments:
- `newdata`, which we set to our new data frame containing the predictor
values of interest
- `interval`, which we here set to `"confidence"`^[This gives us the confidence
interval for the prediction, which is the range within which we would expect
the true value to fall, 95% of the time, if we replicated the study. We
could ask instead for the `prediction` interval, which would be the range
within which 95% of new observations with the same predictor values would
fall. For more on this see the section on
[confidence vs. prediction intervals](confidence-vs-prediction-intervals.html)]
```{r}
injury.predictions <- predict(injurymodel, newdata=prediction.data, interval="confidence")
injury.predictions
```
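#### Step 3: Convert the predictions to a dataframe and plot {-}
One way of doing this (a minimal sketch; exactly how to present the predictions is a matter of taste) is to bind the predictions to the dataframe we made in step 1, and plot the point estimates with their confidence intervals:
```{r}
prediction.data %>%
  bind_cols(as_data_frame(injury.predictions)) %>%
  ggplot(aes(age.category, fit, ymin=lwr, ymax=upr, color=gender)) +
  geom_pointrange() +
  ylab("Predicted injury severity") + xlab("")
```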
### Making predictions for margins (_effects_ of predictors) {-}
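The examples below switch to the built-in `mtcars` data. We first make predictions for means, both for every car in the original data and at specific values of the predictors, and then use the `margins` package to estimate the effects themselves.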
```{r}
library('tidyverse')

# Fit a model predicting fuel economy from engine type (vs) and weight (wt)
m <- lm(mpg ~ vs + wt, data=mtcars)

# Predictions (with confidence intervals) for every car in the original data
m.predictions <- predict(m, interval='confidence')
mtcars.plus.predictions <- bind_cols(
  mtcars,
  m.predictions %>% as_data_frame()
)

# Predictions at specific values of the predictors: vs = 0 or 1, with wt held at 2
prediction.frame <- expand.grid(vs=0:1, wt=2) %>%
  as_data_frame()
prediction.frame.plus.predictions <- bind_cols(
  prediction.frame,
  predict(m, newdata=prediction.frame, interval='confidence') %>% as_data_frame()
)

# Plot the fitted values for the observed cars, summarised at each level of vs
mtcars.plus.predictions %>%
  ggplot(aes(vs, fit, ymin=lwr, ymax=upr)) +
  stat_summary(geom="pointrange")
```
```{r}
# Plot the predictions made at the specific values of vs and wt
prediction.frame.plus.predictions %>%
  ggplot(aes(vs, fit, ymin=lwr, ymax=upr)) +
  geom_pointrange()
```
```{r}
# Predictions at the chosen values of vs and wt
prediction.frame.plus.predictions

# Averages of the fitted values (and interval limits) at each level of vs
mtcars.plus.predictions %>%
  group_by(vs) %>%
  summarise_at(vars(fit, lwr, upr), mean)
```
### Marginal effects {-}
What is the effect of being black or female on the chance of you getting
diabetes?
There are two ways of computing this, depending on which of these two you hate least:
- Calculate the effect of being black for someone who is 50% female (the marginal
  effect at the means, MEM)
- For each person in the sample, calculate the prediction first pretending they
  are black, then pretending they are white, take the difference between these
  estimates, and average the differences across the sample (the average
  marginal effect, AME)
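The `margins` package computes these average marginal effects from a fitted model, and can evaluate them at chosen values of other predictors. Below we apply it to the `mtcars` models from above (the values of `wt` are illustrative):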
```{r}
library(margins)

# Average marginal effects from the additive model, evaluated at wt = 1 and 2
margins(m, at = list(wt = 1:2))

# A model which adds the vs-by-wt interaction
m2 <- lm(mpg ~ vs * wt, data=mtcars)
summary(m2)

# With an interaction, the effect of vs changes with wt, so we evaluate the
# marginal effects at several values of wt
m2.margins <- margins(m2, at = list(wt = 1.5:4.5))
summary(m2.margins)

# Plot the average marginal effect of vs across the chosen values of wt
summary(m2.margins) %>% as_data_frame() %>%
  filter(factor=="vs") %>%
  ggplot(aes(wt, AME)) +
  geom_point() + geom_line()
```
## Predictions with continuous covariates {-}
- Run a 2 x continuous Anova
- Predict at different levels of X (see the sketch below)
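A minimal sketch of these steps, using the simulated `injuries` data from above (the model, and the values of `age` at which we predict, are illustrative):
```{r}
# A 2 (gender) x continuous (age) model, including the interaction
m.gender.age <- lm(severity.of.injury ~ gender * age, data=injuries)
anova(m.gender.age)

# Predict at different levels of age (here the mean and +/- 1 SD), for each gender
age.levels <- mean(injuries$age) + c(-1, 0, 1) * sd(injuries$age)
new.data <- expand.grid(gender=c("Male", "Female"), age=age.levels)

bind_cols(
  new.data,
  predict(m.gender.age, newdata=new.data, interval="confidence") %>% as_data_frame()
)
```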
## Visualising interactions {-}
<!-- check this:
https://strengejacke.wordpress.com/2013/10/31/visual-interpretation-of-interaction-terms-in-linear-models-with-ggplot-rstats/
Also - you can interpret main effect when there are interactions
http://www.theanalysisfactor.com/interpret-main-effects-interaction/
Show them `granova`?
-->
Steps this page will work through (a compact sketch follows this list):
- Running the model (the first will be a 2x2 between-subjects Anova)
- Using `predict()`.
- Creating predictions at specific values
- Binding predictions and the original data together.
- Using ggplot to layer points, lines and error bars.
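A compact sketch of these steps, reusing the simulated `injuries` data from earlier (here the predictions are bound to the prediction grid rather than the original data, and the plot layers are illustrative):
```{r}
# 1. Run the model: a 2 (age category) x 2 (gender) between-subjects design
injurymodel <- lm(severity.of.injury ~ age.category * gender, data=injuries)

# 2. Create predictions at specific values of the predictors
new.data <- expand.grid(age.category=c("young", "older"), gender=c("Male", "Female"))
predictions <- predict(injurymodel, newdata=new.data, interval="confidence")

# 3. Bind the predictions and the prediction grid together
plot.data <- bind_cols(new.data, as_data_frame(predictions))

# 4. Use ggplot to layer points, lines and error bars
plot.data %>%
  ggplot(aes(age.category, fit, ymin=lwr, ymax=upr, group=gender, color=gender)) +
  geom_pointrange() +
  geom_line() +
  ylab("Predicted injury severity") + xlab("")
```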