generated from r4ds/bookclub-template
-
Notifications
You must be signed in to change notification settings - Fork 5
/
08_statistical-models.Rmd
327 lines (217 loc) · 13.9 KB
/
08_statistical-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
# Statistical Models
**Learning objectives:**
- describe graphs that can help you interpret the results of statistical models, focusing on models that have a single response variable that is either quantitative (a number) or binary (yes/no)
## Correlation plots
- [Correlation coefficients](https://www.statisticshowto.com/probability-and-statistics/correlation-coefficient-formula/) are used to measure how strong a relationship is between two variables.
- There are several types of correlation coefficient, but the most popular is Pearson’s Product-Moment correlation coefficient, which is commonly used in linear regression.
- Correlation plots help you to visualize the pairwise relationships between a set of quantitative variables by displaying their correlations using color or shading.
## Saratoga Houses Dataset
- Let's look at the [Saratoga Houses dataset](https://rkabacoff.github.io/datavis/Data.html#SaratogaHousing), which contains the sale price and characteristics of Saratoga County, NY homes in 2006 in order to explore the relationships among the quantitative variables,
```{r 08-01}
data(SaratogaHouses, package="mosaicData")
# select numeric variables
df <- dplyr::select_if(SaratogaHouses, is.numeric)
# calulate the correlations
r <- cor(df, use="complete.obs")
round(r,2)
```
## [`ggcorrplot`](https://www.rdocumentation.org/packages/ggcorrplot/versions/0.1.1/topics/ggcorrplot) package
- The `ggcorrplot` function in the `ggcorrplot` package can be used to visualize these correlations.
- By default, it creates a `ggplot2` graph were darker red indicates stronger positive correlations, darker blue indicates stronger negative correlations and white indicates no correlation.
```{r 08-02}
library(ggplot2)
library(ggcorrplot)
ggcorrplot(r)
```
## `ggcorrplot` function
The `ggcorrplot` function has a number of options for customizing the output. For example,
- `hc.order = TRUE` reorders the variables, placing variables with similar correlation patterns together.
- `type = "lower"` plots the lower portion of the correlation matrix.
- `lab = TRUE` overlays the correlation coefficients (as text) on the plot.
```{r 08-03}
ggcorrplot(r,
hc.order = TRUE,
type = "lower",
lab = TRUE)
```
## Linear Regression
- [Linear regression](https://towardsdatascience.com/linear-regression-detailed-view-ea73175f6e86) allows us to explore the relationship between a quantitative response variable and an explanatory variable while other variables are held constant.
- Below is a model to predict home prices, the response variable, by the explanatory variables, lot size ($ft^2$), age (yrs), land value ($\$1000s$), living area ($ft^2$), number of bedrooms and bathrooms and whether the home is on the waterfront or not.
```{r 08-04}
data(SaratogaHouses, package="mosaicData")
houses_lm <- lm(price ~ lotSize + age + landValue +
livingArea + bedrooms + bathrooms +
waterfront,
data = SaratogaHouses)
summary(houses_lm)$coefficients
```
- We estimate that an increase of one square foot of living area is associated with a home price increase of $\$75$, holding the other variables constant.
- We estimate that a waterfront home costs approximately $\$120,726$ more than non-waterfront home, again controlling for the other variables in the model.
## [`visreg`](http://pbreheny.github.io/visreg/index.html) package
- The `visreg` package provides tools for visualizing these conditional relationships.
- The `visreg` function takes (1) the model and (2) the variable of interest and plots the conditional relationship, controlling for the other variables.
- The option `gg = TRUE` is used to produce a `ggplot2` graph.
```{r 08-05}
# conditional plot of price vs. living area
library(ggplot2)
library(visreg)
visreg(houses_lm, "livingArea", gg = TRUE)
```
- The graph suggests that, holding all else equal, sales price increases with living area in a linear fashion.
## Another linear regression example
- Continuing the example, the price difference between waterfront and non-waterfront homes is plotted, controlling for the other seven variables.
- Since a `ggplot2` graph is produced, other `ggplot2` functions can be added to customize the graph.
```{r 08-06}
# conditional plot of price vs. waterfront location
visreg(houses_lm, "waterfront", gg = TRUE) +
scale_y_continuous(label = scales::dollar) +
labs(title = "Relationship between price and location",
subtitle = "controlling for lot size, age, land value, bedrooms and bathrooms",
caption = "source: Saratoga Housing Data (2006)",
y = "Home Price",
x = "Waterfront")
```
- We see that there are far fewer homes on the water, and they tend to be more expensive (even controlling for size, age, and land value).
## [Logistic Regression](https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148)
- Logistic regression can be used to explore the relationship between a binary response variable and an explanatory variable while other variables are held constant.
- Binary response variables have two levels (yes/no, lived/died, pass/fail, malignant/benign).
- As with linear regression, we can use the `visreg` package to visualize these relationships.
## [CPS85 dataset](https://rkabacoff.github.io/datavis/Data.html#CPS85) and linear regression
- The CPS85 dataset from the `mosaicData` package, contains 1985 data on wages and other characteristics of workers.
- Using the CPS85 data let’s predict the [log-odds](https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704) of being married, given one’s sex, age, race and job sector.
```{r 08-07}
# fit logistic model for predicting
# marital status: married/single
data(CPS85, package = "mosaicData")
cps85_glm <- glm(married ~ sex + age + race + sector,
family="binomial",
data=CPS85)
```
## Visualizing logistic regression
- Using the fitted model, we visualize the relationship between age and the probability of being married, holding the other variables constant.
- The `scale = "response"` option creates a plot based on a probability (rather than log-odds) scale.
```{r 08-08}
# plot results
library(ggplot2)
library(visreg)
visreg(cps85_glm, "age",
gg = TRUE,
scale="response") +
labs(y = "Prob(Married)",
x = "Age",
title = "Relationship of age and marital status",
subtitle = "controlling for sex, race, and job sector",
caption = "source: Current Population Survey 1985")
```
- Here we see that the probability of being married is estimated to be roughly 0.5 at age 20 and decreases to 0.1 at age 60, controlling for the other variables.
## Visualizing multiple conditional logistic regression plots
- We can create multiple conditional plots by adding a `by` option.
- The following code plots the probability of being married given age, seperately for men and women (`by="sex"`), controlling for race and job sector.
```{r 08-09}
# plot results
library(ggplot2)
library(visreg)
visreg(cps85_glm, "age",
by = "sex",
gg = TRUE,
scale="response") +
labs(y = "Prob(Married)",
x = "Age",
title = "Relationship of age and marital status",
subtitle = "controlling for race and job sector",
caption = "source: Current Population Survey 1985")
```
- In this data, the probability of marriage is very similar for men and women.
## [Survival Analysis](https://towardsdatascience.com/the-complete-introduction-to-survival-analysis-in-python-7523e17737e6)
- In many research settings, especially healthcare research, the response variable is the time to an event i.e. time to recovery, time to death, or time to relapse.
- If the event has not occurred for an observation (either because the study ended or the patient dropped out) the observation is said to be *censored*.
- The [NCCTG Lung Cancer](https://rkabacoff.github.io/datavis/Data.html#Lung) dataset in the `survival` package provides data on the survival times of patients with advanced lung cancer following treatment for up to 34 months.
- The outcome for each patient is measured by two variables (1) *time*, survival time in days, (2) *status*, 1 = censored, 2 = dead.
- Thus a patient with time = 305 & status = 2 lived 305 days following treatment. Another patient with time = 400 & status = 1, lived at least 400 days but then dropped out of the study. A patient with time = 1022 & status = 1, survived to the end of the study (34 months).
## Survival Plots using [`ggsurvplot`](https://exts.ggplot2.tidyverse.org/gallery/)
- A survival plot (also called a Kaplan-Meier Curve) can be used to illustrates the probability that an individual survives up to and including time *t*.
```{r 08-10, warning=FALSE, message=FALSE}
# plot survival curve
library(survival)
library(survminer)
data(lung)
sfit <- survfit(Surv(time, status) ~ 1, data=lung)
ggsurvplot(sfit,
title="Kaplan-Meier curve for lung cancer survival")
```
- Roughly 50% of patients are still alive 300 days post treatment.
- We can run `summary(sfit)` for more details.
## Comparing survival probabilities
- It is frequently of great interest whether groups of patients have the same survival probabilities.
- In this graph, the survival curve for men and women are compared.
```{r 08-11}
# plot survival curve for men and women
sfit <- survfit(Surv(time, status) ~ sex, data=lung)
ggsurvplot(sfit,
conf.int=TRUE,
pval=TRUE,
legend.labs=c("Male", "Female"),
legend.title="Sex",
palette=c("cornflowerblue", "indianred3"),
title="Kaplan-Meier Curve for lung cancer survival",
xlab = "Time (days)")
```
- A couple `ggsurvplot` options: (1) `conf.int` provides confidence intervals, (2) `pval` provides a log-rank test comparing the survival curves.
- The p-value (0.0013) provides strong evidence that men and women have different survival probabilities following treatment.
## Mosaic Plots
- Mosaic charts can display the relationship between categorical variables using rectangles whose areas represent the proportion of cases for any given combination of levels.
- He we look at the [Titanic dataset](https://rkabacoff.github.io/datavis/Data.html#Titanic) and visualize the relationship between the three categorical variables in the code below.
```{r 08-12, message=FALSE}
# input data
library(readr)
titanic <- read_csv("data/titanic.csv")
# create a table
tbl <- xtabs(~Survived + Class + Sex, titanic)
ftable(tbl)
```
## Mosaic plots and the `vcd` package
- Although mosaic charts can be created with `ggplot2` using the [`ggmosaic`](https://cran.r-project.org/web/packages/ggmosaic/vignettes/ggmosaic.html) package, the author recommends using the `vcd` package instead, which won’t create `ggplot2` graphs, but provides a more comprehensive approach to visualizing categorical data.
- In a mosaic plot, the size of the tile is proportional to the percentage of cases in that combination of levels.
- For example, as seen below, more Titanic passengers perished than survived, and those that perished were primarily 3rd class male passengers and male crew (the largest group).
```{r 08-13, message=FALSE}
# create a mosaic plot from the table
library(vcd)
mosaic(tbl, main = "Titanic data")
```
## Adding color to mosaic plot
- In mosaic plots, the color of the tiles can also be used to indicate the degree relationship among the variables.
- If we assume that these three variables are independent, we can examine the residuals (the error between a predicted value and the observed actual value) from the model and shade the tiles to match.
- In the graph below, dark blue represents more cases than expected given independence, and dark red represents less cases than expected if independence holds.
```{r 08-14}
mosaic(tbl,
shade = TRUE,
legend = TRUE,
labeling_args = list(set_varnames = c(Sex = "Gender",
Survived = "Survived",
Class = "Passenger Class")),
set_labels = list(Survived = c("No", "Yes"),
Class = c("1st", "2nd", "3rd", "Crew"),
Sex = c("F", "M")),
main = "Titanic data")
```
- We can see that if class, gender, and survival are independent, we are seeing many more male crew perishing, and 1st, 2nd and 3rd class females surviving than would be expected.
- Conversely, far fewer 1st class passengers (both male and female) died than would be expected by chance, thus the assumption of independence is rejected.
- NOTE: For complicated tables, labels can easily overlap. See [`labeling_border`](https://www.rdocumentation.org/packages/vcd/versions/1.4-4/topics/labeling_border), for plotting options.
## Resources
- [R Graph Gallery: Correlogram](https://r-graph-gallery.com/correlogram.html)
- [ggplot2 extension gallery](https://exts.ggplot2.tidyverse.org/gallery/)
- The [`lm`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm) function is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although aov may provide a more convenient interface for these).
- The [`glm`](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm) function is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
## Meeting Videos {-}
### Cohort 1 {-}
`r knitr::include_url("https://www.youtube.com/embed/AEKNf7p2hYA")`
<details>
<summary> Meeting chat log </summary>
```
00:38:11 Kotomi Oda: Putting color makes it much easier to read
00:40:24 Tiffany Kollah: Reacted to "Putting color makes ..." with ❤️
00:41:16 Lydia Gibson: http://pbreheny.github.io/visreg/index.html
00:42:51 Lydia Gibson: https://www.rdocumentation.org/packages/glmnet/versions/4.1-7
00:43:25 Kotomi Oda: yes
```
</details>