-
Notifications
You must be signed in to change notification settings - Fork 9
/
Copy pathreshaping.Rmd
195 lines (143 loc) · 5.98 KB
/
reshaping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
---
title: Reshaping
---
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse=TRUE, cache=TRUE, message=F, warning=F)
library(tidyverse)
library(tufte)
library(pander)
```
## Reshaping {- #reshaping}
<!-- <div style="width:100%;height:0;padding-bottom:75%;position:relative;"><iframe src="https://giphy.com/embed/J42u1BTrks9eU" width="100%" height="100%" style="position:absolute" frameBorder="0" class="giphy-embed" allowFullScreen></iframe></div><p><a href="https://giphy.com/gifs/funny-transformer-J42u1BTrks9eU">via GIPHY</a></p>
-->
This section will probably require more attention than any other in the guide,
but will likely be one of the most useful things you learn in R.
As previously discussed, most things work best in R if you have data in _long
format_. This means we prefer data that look like this:
```{r, echo=F}
d <- expand.grid(person=1:3, time=paste("Time", 1:4)) %>%
mutate(outcome = rnorm(n(), 20, 3))
d %>% pander
```
And NOT like this:
```{r, echo=F}
d %>%
reshape2::dcast(person~time, value.var = "outcome") %>%
pander
```
In long format data:
- each row of the dataframe corresponds to a single measurement occasion
- each column corresponds to a variable which is measured
Fortunately it's fairly easy to move between the two formats, provided your
variables are named in a consistent way.
#### Wide to long format {- #wide-to-long}
This is the most common requirement. Often you will have several columns which
actually measure the same thing, and you will need to convert these two two
columns - a 'key', and a value.
For example, let's say we measure patients on 10 days:
```{r, include=F}
sleep.wide <- lme4::sleepstudy %>%
reshape2::dcast(Subject~paste0("Day.", Days), value.var = "Reaction") %>%
mutate(Subject=as.numeric(Subject)) %>%
arrange(Subject)
saveRDS(sleep.wide, 'data/sleep.wide.RDS')
```
```{r}
sleep.wide %>%
head(4) %>%
pander(caption="Data for the first 4 subjects")
```
We want to convert RT measurements on each Day to a single variable, and create
a new variable to keep track of what `Day` the measurement was taken:
The `melt()` function in the `reshape2::` package does this for us:
```{r}
library(reshape2)
sleep.long <- sleep.wide %>%
melt(id.var="Subject") %>%
arrange(Subject, variable)
sleep.long %>%
head(12) %>%
pander
```
```{r include=F}
saveRDS(sleep.long, 'data/sleep.long.RDS')
```
Here melt has created two new variable: `variable`, which keeps track of what
was measured, and `value` which contains the score. This is the format we need
when [plotting graphs](#graphics) and running
[regression and Anova models](#linear-models-simple).
#### Long to wide format {- #long-to-wide}
To continue the example from above, these are long form data we just made:
```{r}
sleep.long %>%
head(3) %>%
pander(caption="First 3 rows in the long format dataset")
```
We can convert these back to the original wide format using `dcast`, again in
the `reshape2` package. The name of the `dcast` function indicates we can 'cast'
a dataframe (the d prefix). So here, casting means the opposite of 'melting'.
Using `dcast` is a little more fiddly than `melt` because we have to say _how_
we want the data spread wide. In this example we could either have:
- Columns for each day, with rows for each subject
- Columns for each subject, with rows for each day
Although it's obvious to _us_ which format we want, we have to be explicit for R
to get it right.
We do this using a [formula](#formulae), which we'll see again in the regression
section.
Each formula has two sides, left and right, separated by the tilde (`~`) symbol.
On the left hand side we say which variable we want to keep in rows. On the
right hand side we say which variables to convert to columns. So, for example:
```{r}
# rows per subject, columns per day
sleep.long %>%
dcast(Subject~variable) %>%
head(3)
```
To compare, we can convert so each Subject has a column by reversing the
formula:
```{r}
# note we select only the first 7 Subjects to
# keep the table to a manageable size
sleep.long %>%
filter(Subject < 8) %>%
dcast(variable~Subject)
```
###### {- .tip}
One neat trick when casting is to use `paste` to give your columns nicer names.
So for example:
```{r}
sleep.long %>%
filter(Subject < 4) %>%
dcast(variable~paste0("Person.", Subject))
```
Notice we used `paste0` rather than `paste` to avoid spaces in variable names,
which is allowed but can be a pain.
[See more on working with character strings in a later section](#string-handling).
##### {-}
For a more detailed explanation and various other methods for reshaping data,
see: http://r4ds.had.co.nz/tidy-data.html
### Which package should you use to reshape data? {- #which-reshape-package}
There are three main options:
- `tidyr::`, which comes as part of the `tidyverse`, using `gather` and
`spread()`
- `reshape2::` using `melt()` and `dcast()`
- `data.table::` also using functions called `melt()` and `dcast()` (but which
are slightly different from those in `reshape2`)
This post walks through some of the differences:
https://www.r-bloggers.com/how-to-reshape-data-in-r-tidyr-vs-reshape2/ but the
short answer is whichever you find simplest and easiest to remember (for me
that's `melt` and `dcast`).
`
### Aggregating and reshaping at the same time {-}
One common trick when reshaping is to convert a datafile which has multiple rows
and columns per person to one with only a single row per person. That is, we
aggregae by using a summary (perhaps the mean) and reshape at the same time.
Although useful this isn't covered in this section, because it is combining two
techniques:
- Reshaping (i.e. from long to wide or back)
- Aggregating or summarising (converting multiple rows to one)
In the next section we cover [summarising data](#summarising-data), and
introduce the 'split-apply-combine' method for summarising.
Once you have a good grasp of this, you could check out the
['fancy reshaping' section](#fancy-reshaping) which does provide examples of
aggregating and reshaping simultaneously.