-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathcolumn-types-and-missing.Rmd
344 lines (203 loc) · 11.6 KB
/
column-types-and-missing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
---
title: 'Variable types and missing data'
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=F, message=F, warning=F)
library(tufte)
library(tidyverse)
library(broom)
library(pander)
```
## Types of variable {- #factors-and-numerics}
When working with data in Excel or other packages like SPSS you've probably become aware that different types of data get treated differently.
For example, in Excel you can't set up a formula like `=SUM(...)` on cells which include letters (rather than just numbers). It does't make sensible.
However, Excel and many other programmes will sometimes make guesses about what to do if you combine different types of data.
For example, in Excel, if you add `28` to `1 Feb 2017` the result is `1 March 2017`. This is sometimes what you want, but can often lead to [unexpected results and errors in data analyses](http://www.sciencemag.org/news/sifter/one-five-genetics-papers-contains-errors-thanks-microsoft-excel).
R is much more strict about not mixing types of data. Vectors (or columns in dataframes) can only contain one type of thing. In general, there are probably 4 types of data you will encounter in your data analysis:
- Numeric variables
- Character variables
- Factors
- Dates
```{r, echo=F, include=F}
# process the lakers data from lubridate to illustrate some points
Sys.setenv(TZ="Europe/London")
lakers <- lubridate::lakers %>%
select(date, opponent, team, points) %>%
mutate(date=lubridate::ymd(date), team=factor(team))
saveRDS(lakers, file="data/lakers.RDS")
```
The file `data/lakers.RDS` contains a dataset adapted from the `lubridate::lakers` dataset (this is a dataset built into [an add-on package for R](#packages)).
This dataset contains four variables to illustrate the common variable types (a subset of the original dataset which provides scores and other information from each Los Angeles Lakers basketball game in the 2008-2009 season). We have the `date`, `opponent`, `team`, and `points` variables.
```{r}
lakers <- readRDS("data/lakers.RDS")
lakers %>%
glimpse()
```
One thing to note here is that the `glimpse()` command tells us the *type* of each variable. So we have
- `points`: type `int`, short for integer (i.e. whole numbers).
- `date`: type `date`
- `opponent`: type `chr`, short for 'character', or alaphanumeric data
- `team`: type `fctr`, short for factor and
### Differences in *quantity*: numeric variables {- #diffsinquantity}
We've already seem numeric variables in the section on [vectors and lists](#vectors). These behave pretty much as you'd expect, and we won't expand on them here.
#### {- .tip}
There are different types of numeric variable. Integers (whole numbers) are stored as type `int` but other types, like `dbl`, can can store numbers with a decimal place. For most purposes (in doing analyses of psychological data) the differences won't matter.
### Differences in *quality or kind* {- #character-and-factor}
In many cases variables will be used to identify values which are *qualitatively different*. For example, different groups or measurement occasions in an experimental study, or perhaps different genders or countries in survey data.
In practice, these qualitative differences get stored in a range of different variable types, including:
- Numeric variables (e.g. `time = 1`, or `time = 2`...)
- Character variables (e.g. `time = "time 1"`, `time = "time 2"`...)
- Boolean or logical variables (e.g. `time1 == TRUE` or `time1 == FALSE`)
- 'Factors'
Storing categories as numeric variables can produce [confusing results when running regression models](#factors-vs-linear-inputs).
For this reason, it's normally best to store your categorical variables as descriptive strings of letters and numbers (e.g. "Treatment", "Control") and avoid simple numbers (e.g. 1, 2, 3). Or as a factor.
### Factors for categorical data {- #facsforcategs}
Factors are R's answer to the problem of storing categorical data.
Factors assign one number for each unique value in a variable, and allow you to attach a label to it.
This means the categories are stored as numbers 'under the hood', but you can also work with factors as though they were strings of letters and numbers, and they display nicely when making tables and graphs.
For example:
```{r}
1:10
group.factor <- factor(1:10)
group.factor
group.labelled <- factor(1:10, labels = paste("Group", 1:10))
group.labelled
```
We can see this 'underlying' number which represents each category by using `as.numeric`:
```{r}
# note, there is no guarantee that "Group 1" == 1 (although it is here)
as.numeric(group.labelled)
```
For simple analyses it's often best to store everything as the `character` type (letters and numbers), but factors can still be useful for making tables or graphs where the list of categories is known and needs to be in a particular order. For more about factors, and lots of useful functions for working with them, see the `forcats::` package: <https://github.com/tidyverse/forcats>
### Dates {- #storingdates}
Internally, R stores dates as the number of days since January 1, 1970. This means that we can work with dates just like other numbers, and it makes sense to have the `min()`, or `max()` of a series of dates:
```{r}
# the first few dates in the sequence
head(lakers$date)
# first and last dates
min(lakers$date)
max(lakers$date)
```
Because dates are numbers we can also do arithmetic with them, and R will give us a difference (in this case, in days):
```{r}
max(lakers$date) - min(lakers$date)
```
However, R does treat dates slightly differently from other numbers, and will format plot axes appropriately, which is helpful (see more on this in the [graphics section](#graphics)):
```{r}
hist(lakers$date, breaks=7)
```
## Missing values {- #missingvalues}
Missing values aren't a data type as such, but are an important concept in R; the way different functions handle missing values can be both helpful and frustrating in equal measure.
Missing values in a vector are denoted by the letters `NA`, but notice that these letters are unquoted. That is to say `NA` is not the same as `"NA"`!
To check for missing values in a vector (or dataframe column) we use the `is.na()` function:
```{r}
nums.with.missing <- c(1, 2, NA)
nums.with.missing
is.na(nums.with.missing)
```
Here the `is.na()` function has tested whether each item in our vector called `nums.with.missing` is missing. It returns a new vector with the results of each test: either `TRUE` or `FALSE`.
We can also use the negation operator, the `!` symbol to reverse the meaning of `is.na`. So we can read `!is.na(nums)` as "test whether the values in `nums` are NOT missing":
```{r}
# test if missing
is.na(nums.with.missing)
# test if NOT missing (note the exclamation mark in front of the function)
!is.na(nums.with.missing)
```
We can use the `is.na()` function as part of dplyr filters:
```{r}
airquality %>%
filter(is.na(Solar.R)) %>%
head(3) %>%
pander
```
Or to select only cases without missing values for a particular variable:
```{r}
airquality %>%
filter(!is.na(Solar.R)) %>%
head(3) %>%
pander
```
#### Complete cases {- #complete-cases}
Sometimes we want to select only rows which have no missing values --- i.e. *complete cases*.
The `complete.cases` function accepts a dataframe (or matrix) and tests whether each *row* is complete. It returns a vector with a `TRUE/FALSE` result for each row:
```{r}
complete.cases(airquality) %>%
head
```
This can also be useful in dplyr filters. Here we show all the rows which are *not* complete (note the exclamation mark):
```{r}
airquality %>%
filter(!complete.cases(airquality))
```
#### {- .tip}
Sometimes it's convenient to use the `.` (period) to represent the output from the previous pipe command. For example, we could rewrite the previous example as:
```{r}
airquality %>%
filter(!complete.cases(.)) # note the . (period) here in place of `airmiles`
```
This is nice because we can apply the `complete.cases` function to the output of the previous pipe. For example, if we wanted to select complete cases for a subset of the variables we could write:
```{r}
airquality %>%
select(Ozone, Solar.R) %>%
filter(!complete.cases(.))
```
Or alternatively:
```{r}
rows.to.keep <- !complete.cases(select(airquality, Ozone, Solar.R))
airquality %>%
filter(rows.to.keep) %>%
head(3) %>%
pander
```
#### Missing data and R functions {- #narm}
It's normally good practice to pre-process your data and select the rows you want to analyse *before* passing dataframes to R functions.
The reason for this is that different functions behave differently with missing data.
For example:
```{r}
mean(airquality$Solar.R)
```
Here the default for `mean()` is to return NA if any of the values are missing. We can explicitly tell R to ignore missing values by setting `na.rm=TRUE`
```{r}
mean(airquality$Solar.R, na.rm=TRUE)
```
In contrast some other functions, for example the `lm()` which runs a linear regression will ignore missing values by default. If we run `summary` on the call to `lm` then we can see the line near the bottom of the output which reads: "(7 observations deleted due to missingness)"
```{r}
lm(Solar.R ~ Temp, data=airquality) %>%
summary
```
[Normally R will do the 'sensible thing' when there are missing values, but it's always worth checking whether you do have any missing data, and addressing this explicitly in your code]{.tip}
#### Patterns of missingness {-}
The `mice` package has some nice functions to describe patterns of missingness in the data. These can be useful both at the exploratory stage, when you are checking and validating your data, but can also be used to create tables of missingness for publication:
```{r}
mice::md.pattern(airquality)
```
In this table, `md.pattern` list the number of cases with particular patterns of missing data.
- Each row describes a misisng data 'pattern'
- The first column indicates the number of cases
- The central columns indicate whether a particular variable is missing for the pattern (0=missing)
- The last column counts the number of values missing for the pattern
- The final row counts the number of missing values for each variable.
##### Visualising missingness {-}
Graphics can also be useful to explore patterns in missingness.
`rct.data` contains data from an RCT of functional imagery training (FIT) for weight loss, which measured outcome (weight in kg) at baseline and two followups (`kg1`, `kg2`, `kg3`). The trial also measured global quality of life (`gqol`).
As is common, there were some missing data at the follouwp:
```{r}
fit.data <- readRDS("data/fit-weight.RDS") %>%
select(kg1, kg2, kg3, age, gqol1)
mice::md.pattern(fit.data)
```
We might be interested to explore patterns in which observations were missing. Here we use colour to identify missing observations as a function of the data recorded at baseline:
```{r}
fit.data %>%
mutate(missing.followup = is.na(kg2)) %>%
ggplot(aes(kg1, age, color=missing.followup)) +
geom_point()
```
There's a clear trend here for lighter patients (at baseline) to have more missing data at followup. There's also a suggestion that younger patients are more likely to have been lost to followup.
If needed, we could perform [inferential tests](#common-stats) for these differences:
```{r}
t.test(kg1 ~ is.na(kg2), data=fit.data)
t.test(age ~ is.na(kg2), data=fit.data)
```
However, given the small number of missing values and the post-hoc nature of these analyses these tests are rather underpowered and we might prefer to report and comment on the plot alone.
For some nice missing data visualisation techniques, including those for repeated measures data, see @zhang2015missing.