---
title: 'Generalized linear models'
---
```{r, include=FALSE}
library(tidyverse)
```
# Generalized linear models
Linear regression is suitable for outcomes which are continuous numerical
scores. In practice this requirement is often relaxed slightly, for example for
data which are slightly skewed, or where scores are somewhat censored (e.g.
questionnaire scores which have a minimum or maximum).
However, for some types of outcomes standard linear models are unsuitable.
Examples here include binary (zero or one) or count data (i.e. positive integers
representing frequencies), or proportions (e.g. proportion of product failures
per batch). This section is primarily concerned with binary outcomes, but many
of the same principles apply to these other types of outcome.
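As a preview of what follows, it is the `family` argument to `glm()` that distinguishes these model types. A minimal sketch, using small simulated vectors (not the data analysed below) purely to illustrate the argument:

```r
# simulated data purely to illustrate the family argument
set.seed(42)
df <- data.frame(
  x = rnorm(100),
  pass_fail = rbinom(100, 1, 0.5),   # binary outcome
  n_errors  = rpois(100, 2)          # count outcome
)

# binary outcomes: logistic regression
glm(pass_fail ~ x, data = df, family = binomial)

# count outcomes: Poisson regression
glm(n_errors ~ x, data = df, family = poisson)
```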
### Logistic regression {- #logistic-regression}
In R we fit logistic regression with the `glm()` function which is built into R,
or if we have a [multilevel model](#multilevel-models) with a binary outcome we
use `glmer()` from the `lme4::` package.
Fitting the model is very similar to linear regression, except we need to
specify the `family="binomial"` parameter to let R know what type of data we are
using.
Here we use the `titanic` dataset (you can download this from
[Kaggle](https://www.kaggle.com/c/titanic/data), although you need to sign up
for an account).
Before we start fitting models, it's best to plot the data to give us a feel for
what is happening.
[Figure 1 reveals that, across all fare categories, women were more likely to
survive the disaster than men. Ticket class also appears to be related to
outcome: those with third class tickets were less likely to survive than those
with first or second class tickets. However, differences in survival rates for
men and women differed across ticket classes: women with third class tickets
appear to have been less advantaged (compared to men) than women with first or
second class tickets.]{.apa-example}
```{r, message=F, fig.cap="Survival probabilities by Sex and ticket class."}
titanic <- read.csv('data/titanic.csv')
titanic %>%
  ggplot(aes(factor(Pclass), Survived,
             group=Sex, color=Sex)) +
  stat_summary() +
  stat_summary(geom="line") +
  xlab("Ticket class")
```
Given the plot above, it seems reasonable to predict survival from `Sex` and
`Pclass`, and also to include the interaction between these variables.
To run a logistic regression we specify the model as we would with `lm()`, but
instead use `glm()` and specify the `family` parameter:
```{r}
m <- glm(Survived ~ Sex * factor(Pclass),
         data=titanic, family = binomial(link="logit"))
```
#### {- #helper-function-logistic}
Because it can become repetitive to write out the `family` parameter in full
each time, I usually write a ['helper function'](#helper-functions) called
`logistic()` which simply calls `glm` with the right settings. For example:
```{r}
# define a helper function for logistic regression. The '...'
# means 'all arguments', so this function passes all its
# arguments on to the glm function, but sets the family correctly
logistic <- function(...) {
  glm(..., family = binomial(link="logit"))
}
```
Which you can use like so:
```{r}
logistic(Survived ~ Sex * factor(Pclass), data=titanic)
```
#### Tests of parameters {-}
As with `lm()` models, we can use the `summary()` function to get p values for
parameters in `glm` objects:
```{r}
titanic.model <- logistic(Survived ~ Sex * factor(Pclass), data=titanic)
summary(titanic.model)
```
[You might have spotted in this table that `summary` reports *z* tests rather
than *t* tests for parameters in the `glm` model. These can be interpreted as
you would the t-test in a linear model, however.]{.explainer}
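Because the coefficients in this summary are on the log-odds (logit) scale, a common next step is to exponentiate them to obtain odds ratios, which are often easier to communicate. A sketch, assuming the `titanic.model` fitted above:

```r
# exponentiating log-odds coefficients gives odds ratios
exp(coef(titanic.model))

# and the same transformation applied to profile-likelihood
# confidence intervals gives intervals on the odds-ratio scale
exp(confint(titanic.model))
```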
##### Tests of categorical predictors {-}
Where there are categorical predictors we can also reuse the `car::Anova`
function to get the equivalent of the F test from a linear model
([with type 3 sums of squares](#sums-squares); remember not to use the built in
`anova` function unless you want type 1 sums of squares):
```{r}
car::Anova(titanic.model, type=3)
```
[Note that the Anova table for a `glm` model provides $\chi^2$ tests in place of
F tests. Although they are calculated differently, you can interpret these
$\chi^2$ tests and *p* values as you would for F tests in a regular
Anova.]{.explainer}
#### Predictions after `glm` {- #glm-predictions}
As with linear models, we can make predictions from `glm` models for our current
or new data.
One twist here though is that we have to choose whether to make predictions in
units of the response (i.e. probability of survival), or of the transformed
response (logit) that is actually the 'outcome' in a `glm` (see the
[explainer on transformations and link functions](#link-functions)).
_You will almost always want predictions in the units of your response_, which
means you need to add `type="response"` to the `predict()` function call. Here
we predict the chance of survival for a new female passenger with a first class
ticket:
```{r}
new.passenger <- expand.grid(Pclass=1, Sex=c("female"))
predict.glm(titanic.model, newdata=new.passenger, type="response")
```
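To see what `type="response"` is doing, we can compare the default prediction on the link (log-odds) scale: applying the inverse-logit function (`plogis()` in R) to the log-odds reproduces the probability. A sketch using the same `titanic.model` and `new.passenger` as above:

```r
# prediction on the link (log-odds) scale, the default for glm
logit.pred <- predict.glm(titanic.model,
                          newdata=new.passenger, type="link")

# converting by hand with the inverse logit gives the same
# number that type="response" returns directly
plogis(logit.pred)
```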
And we could plot probabilities for each gender and class with a standard error
for this prediction if desired:
```{r}
new.passengers <- expand.grid(Pclass=1:3, Sex=c("female", "male"))
# with se.fit=TRUE, predict.glm returns a list with two vectors:
# $fit, which contains predicted probabilities, and $se.fit
preds <- predict.glm(titanic.model,
                     newdata=new.passengers,
                     type="response",
                     se.fit=TRUE)
new.passengers %>%
  mutate(fit = preds$fit,
         lower = fit - preds$se.fit,
         upper = fit + preds$se.fit) %>%
  ggplot(aes(factor(Pclass), fit,
             ymin=lower, ymax=upper,
             group=Sex, color=Sex)) +
  geom_pointrange() +
  geom_line() +
  xlab("Ticket class") +
  ylab("Probability of survival")
```
#### Evaluating logistic regression models {- #glm-cross-validation}
`glm` models don't provide an _R_^2^ statistic, but it is possible to evaluate
how well the model fits the data in other ways.
[Although there are various pseudo-*R*^2^ statistics available for `glm`; see
https://www.r-bloggers.com/evaluating-logistic-regression-models/]{.explainer}
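For example, because the deviance of a binary-outcome `glm` plays a role analogous to the residual sum of squares in `lm`, McFadden's pseudo-*R*^2^ can be computed directly from the fitted model. A sketch, assuming the `titanic.model` fitted above:

```r
# McFadden's pseudo-R2: 1 minus the ratio of the fitted model's
# deviance to the deviance of an intercept-only (null) model
1 - titanic.model$deviance / titanic.model$null.deviance
```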
One common technique, however, is to build a model using a 'training' dataset
(sometimes a subset of your data) and evaluate how well this model predicts new
observations in a 'test' dataset. See http://r4ds.had.co.nz/model-assess.html
for an introduction.
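A minimal sketch of this idea with the `titanic` data (the 70/30 split used here is an arbitrary choice): fit the model on a random 'training' subset, then check how often predicted probabilities above .5 match actual survival among the held-out 'test' rows:

```r
set.seed(1234)  # arbitrary seed, for reproducibility
in.train <- sample(c(TRUE, FALSE), nrow(titanic),
                   replace=TRUE, prob=c(.7, .3))

train.model <- logistic(Survived ~ Sex * factor(Pclass),
                        data=titanic[in.train, ])

# classify held-out passengers as predicted survivors if
# their predicted probability exceeds .5
test <- titanic[!in.train, ]
predicted <- predict(train.model, newdata=test, type="response") > .5

# proportion of correct classifications in the test set
mean(predicted == (test$Survived == 1))
```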
<!-- TODO
- interactions and weirdness between probability and logit space
-->
<!-- TODO
### Poisson regression {-}
- fitting
- predicted counts
-->