---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# unheadr <img src="man/figures/logosmall.png" align="right" />
<!-- badges: start -->
[DOI: 10.4404/hystrix-00133-2018](https://doi.org/10.4404/hystrix-00133-2018)
[unheadr on CRAN](https://cran.r-project.org/package=unheadr)
[Codecov report](https://app.codecov.io/gh/luisDVA/unheadr?branch=master)
<!-- badges: end -->
The goal of `unheadr` is to help wrangle data that has embedded subheaders or values wrapped across several rows. Documentation is available at <https://unheadr.liomys.mx/>.
## Installation
You can install the CRAN release or the development version with:
``` r
# Install unheadr from CRAN:
install.packages("unheadr")
# Or install the development version from GitHub with:
# install.packages("remotes")
remotes::install_github("luisDVA/unheadr")
```
The reasoning behind the package and some of the possible uses of `unheadr` are described in this publication:
Verde Arregoitia, L. D., Cooper, N., D'Elía, G. (2018). Good practices for sharing analysis-ready data in mammalogy and biodiversity research. _Hystrix, the Italian Journal of Mammalogy_, 29(2), 155-161. [Open Access, DOI 10.4404/hystrix-00133-2018](https://doi.org/10.4404/hystrix-00133-2018)
## Usage
Load the package first.
```{r}
library(unheadr)
```
### Main functions
**`untangle2()`**
`untangle2()` puts embedded subheaders into their own variable, using regular expressions to identify them.
In the data below (a subset of a bundled dataset that can be loaded with `data(primates2017)`), some rows hold values of grouping variables. These values belong in their own columns, but they are embedded within the data rectangle instead. This is common practice in many disciplines: the layout is easy to read but hard to work with (for example, when calculating group-wise summaries).
In this example, values for an implicit "geographic region" variable and an implicit "taxonomic family" variable are embedded in the column that contains the observational units (the scientific names of various primates).
|scientific_name |common_name |red_list_status | mass_kg|
|:----------------------------|:----------------------------|:---------------|-------:|
|Asia |NA |NA | NA|
|CERCOPITHECIDAE |NA |NA | NA|
|Trachypithecus obscurus |Dusky Langur |NT | 7.13|
|Presbytis sumatra |Black Sumatran Langur |EN | 6.00|
|Rhinopithecus roxellana |Golden Snub-nosed Monkey |EN | NA|
|HYLOBATIDAE |NA |NA | NA|
|Hylobates funereus |East Bornean Gray Gibbon |EN | NA|
|Hylobates klossii |Kloss's Gibbon |EN | 5.80|
|Nomascus concolor |Western Black Crested Gibbon |CR | 7.71|
For a tidier structure, the subheaders embedded in the _scientific\_name_ column need to be plucked out and placed in their own variable. This was initially the main objective of `unheadr` and what `untangle2()` was made for. The function can be used with `magrittr` pipes as a `dplyr`-type verb.
If these subheaders can be matched in bulk with a regular expression because they share a prefix, suffix, or anything in common, we can save a lot of time. Otherwise, they can be matched by name. For more details, see the examples and vignette.
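As a minimal sketch of both approaches, the chunk below rebuilds the subset shown above by hand and untangles it in two steps. The object names `primates_subset`, `with_family`, and `untangled` are only illustrative. It assumes that family names can be matched in bulk by their shared `DAE` suffix, while the single region is matched by name.

```r
library(unheadr)

# Hand-typed copy of the subset shown in the table above
primates_subset <- data.frame(
  stringsAsFactors = FALSE,
  scientific_name = c(
    "Asia", "CERCOPITHECIDAE", "Trachypithecus obscurus",
    "Presbytis sumatra", "Rhinopithecus roxellana", "HYLOBATIDAE",
    "Hylobates funereus", "Hylobates klossii", "Nomascus concolor"
  ),
  common_name = c(
    NA, NA, "Dusky Langur", "Black Sumatran Langur",
    "Golden Snub-nosed Monkey", NA, "East Bornean Gray Gibbon",
    "Kloss's Gibbon", "Western Black Crested Gibbon"
  ),
  red_list_status = c(NA, NA, "NT", "EN", "EN", NA, "EN", "EN", "CR"),
  mass_kg = c(NA, NA, 7.13, 6.00, NA, NA, NA, 5.80, 7.71)
)

# Families share a suffix, so one pattern matches them in bulk
with_family <- untangle2(
  primates_subset,
  regex = "DAE$", orig = scientific_name, new = family
)

# The region shares nothing with the other subheaders, so match it by name
untangled <- untangle2(
  with_family,
  regex = "^Asia$", orig = scientific_name, new = region
)
untangled
```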
The 'untangled' version of the data:
|scientific_name |common_name |red_list_status | mass_kg|family |region |
|:----------------------------|:----------------------------|:---------------|-------:|:---------------|:----------|
|Trachypithecus obscurus |Dusky Langur |NT | 7.13|CERCOPITHECIDAE |Asia |
|Presbytis sumatra |Black Sumatran Langur |EN | 6.00|CERCOPITHECIDAE |Asia |
|Rhinopithecus roxellana |Golden Snub-nosed Monkey |EN | NA|CERCOPITHECIDAE |Asia |
|Hylobates funereus |East Bornean Gray Gibbon |EN | NA|HYLOBATIDAE |Asia |
|Hylobates klossii |Kloss's Gibbon |EN | 5.80|HYLOBATIDAE |Asia |
|Nomascus concolor |Western Black Crested Gibbon |CR | 7.71|HYLOBATIDAE |Asia |
Now we can easily perform grouping operations and summarize the data (e.g. calculating average body mass by Family).
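For instance, a group-wise mean can be sketched in base R as follows. The data frame `untangled` is typed in by hand from the table above (only the columns needed here) so that the chunk runs on its own. Note that `aggregate()` with a formula drops rows with a missing `mass_kg` by default.

```r
# Hand-typed copy of the relevant columns of the untangled table
untangled <- data.frame(
  stringsAsFactors = FALSE,
  scientific_name = c(
    "Trachypithecus obscurus", "Presbytis sumatra", "Rhinopithecus roxellana",
    "Hylobates funereus", "Hylobates klossii", "Nomascus concolor"
  ),
  mass_kg = c(7.13, 6.00, NA, NA, 5.80, 7.71),
  family = c(rep("CERCOPITHECIDAE", 3), rep("HYLOBATIDAE", 3))
)

# Mean body mass per family; rows with NA in mass_kg are omitted
mass_by_family <- aggregate(mass_kg ~ family, data = untangled, FUN = mean)
mass_by_family
```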
**`unbreak_vals()`**
This function uses regex to fix values that are broken across two rows. This usually happens when we are formatting a table and we need to fit it on a page.
```{r}
# Set up a toy dataset
dogsDesc <-
  data.frame(
    stringsAsFactors = FALSE,
    dogs = c(
      "Retriever", "(Golden)",
      "Retriever", "(Labrador)", "Bulldog", "(French)"
    ),
    coat = c("long", NA, "short", NA, "short", NA)
  )
dogsDesc
```
We can match the opening brackets with regex.
```{r}
unbreak_vals(df = dogsDesc, regex = "^\\(", ogcol = dogs, newcol = dogs_unbroken)
```
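Before merging, it can help to check which rows a pattern flags. A quick sketch with base R's `grepl()`, using the same column values and the same pattern (the vector `dogs` and the name `is_broken` exist only for this check):

```r
dogs <- c(
  "Retriever", "(Golden)",
  "Retriever", "(Labrador)", "Bulldog", "(French)"
)

# TRUE marks the values that start with an opening bracket
is_broken <- grepl("^\\(", dogs)
is_broken

# The fragments that would be pasted onto the preceding value
dogs[is_broken]
```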
**`unwrap_cols()`**
Use this function to unwrap and glue values that have been wrapped across multiple rows for presentation purposes, with an inconsistent number of empty or `NA` values padding out the columns.
```{r}
# Set up the data
nyk <-
  data.frame(
    stringsAsFactors = FALSE,
    player = c(
      "Marcus Camby", NA, NA,
      NA, NA, NA, NA, "Allan Houston", NA,
      "Latrell Sprewell", NA, NA
    ),
    listed_height_m. = c(
      2.11, NA, NA, NA, NA, NA,
      NA, 1.98, NA, 1.96, NA, NA
    ),
    teams_chronological = c(
      "Raptors", "Knicks",
      "Nuggets", "Clippers", "Trail Blazers",
      "Rockets", "Knicks", "Pistons",
      "Knicks", "Warriors", "Knicks",
      "Timberwolves"
    ),
    position = c(
      "Power forward", "Center",
      NA, NA, NA, NA, NA,
      "Shooting guard", NA, "Small forward", NA, NA
    )
  )
nyk
```
Unwrap the values within each group defined by `player`, separating the glued elements with commas.
```{r}
unwrap_cols(nyk, groupingVar = player, separator = ", ")
```
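The underlying idea can be sketched in base R for a single column. This only illustrates the unwrapping logic under a simplifying assumption (the first value of the grouping variable is not `NA`); it is not how `unwrap_cols()` is implemented, and the names `last_seen`, `filled`, and `unwrapped` are local to this sketch.

```r
player <- c(
  "Marcus Camby", NA, NA, NA, NA, NA, NA,
  "Allan Houston", NA, "Latrell Sprewell", NA, NA
)
teams <- c(
  "Raptors", "Knicks", "Nuggets", "Clippers", "Trail Blazers",
  "Rockets", "Knicks", "Pistons", "Knicks", "Warriors", "Knicks",
  "Timberwolves"
)

# Index of the most recent non-missing player name at each row
last_seen <- cummax(seq_along(player) * !is.na(player))
filled <- player[last_seen]

# Glue the teams within each player, keeping the original player order
unwrapped <- tapply(
  teams,
  factor(filled, levels = unique(filled)),
  function(x) paste(x[!is.na(x)], collapse = ", ")
)
unwrapped
```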
**`unbreak_rows()`**
This function merges sets of two contiguous rows upwards by pasting the values of the lagging row to the values of the leading row (identified using regular expressions).
The following table of basketball records has two sets of header rows with values broken across two contiguous rows.
```{r}
bball <- data.frame(
  stringsAsFactors = FALSE,
  v1 = c(
    "Player", NA, "Sleve McDichael", "Dean Wesrey",
    "Karl Dandleton", "Player",
    NA,
    "Mike Sernandez",
    "Glenallen Mixon",
    "Rey McSriff"
  ),
  v2 = c(
    "Most points", "in a game", "55", "43", "41", "Most varsity",
    "games played", "111", "109",
    "104"
  ),
  v3 = c(
    "Season", "(year ending)", "2001", "2000", "2010", "Season",
    "(year ending)", "2005",
    "2004", "2002"
  )
)
```
`unbreak_rows()` merges these rows if we can match them with a common pattern.
```{r}
# Match with regex on variable v2
unbreak_rows(bball, regex = "^Most", ogcol = v2)
```
**`mash_colnames()`**
When column names are broken up across the top _n_ rows of a data frame or tibble, `mash_colnames()` makes many header rows into column names. Existing names can be kept or ignored.
```{r}
# Data with broken headers
babies <-
  data.frame(
    stringsAsFactors = FALSE,
    Baby = c(NA, NA, "Angie", "Yean", "Pierre"),
    Age = c("in", "months", "11", "9", "7"),
    Weight = c("kg", NA, "2", "3", "4"),
    Ward = c(NA, NA, "A", "B", "C")
  )
babies
```
```{r}
# Mash, including the object names
mash_colnames(babies, n_name_rows = 2, keep_names = TRUE)
```
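Conceptually, each new name is the existing name pasted together with the non-missing values found in the header rows of that column. A rough base R sketch of that idea follows. It is not the package's implementation; `header_rows` and `new_names` are names chosen for this sketch, and the `"_"` separator is an assumption made here.

```r
babies <- data.frame(
  stringsAsFactors = FALSE,
  Baby = c(NA, NA, "Angie", "Yean", "Pierre"),
  Age = c("in", "months", "11", "9", "7"),
  Weight = c("kg", NA, "2", "3", "4"),
  Ward = c(NA, NA, "A", "B", "C")
)

# The top two rows hold the broken-up header fragments
header_rows <- babies[1:2, ]

# Paste each column name with the non-NA fragments below it
new_names <- vapply(
  names(babies),
  function(nm) {
    parts <- header_rows[[nm]]
    paste(c(nm, parts[!is.na(parts)]), collapse = "_")
  },
  character(1),
  USE.NAMES = FALSE
)
new_names
```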
For inputs with ragged column names (NA values in the first row), the first row can be filled row-wise before mashing.
```{r}
# Data with ragged headers
survey <-
  data.frame(
    stringsAsFactors = FALSE,
    X1 = c("Participant", NA, "12", "34", "45", "123"),
    X2 = c(
      "How did you hear about us?",
      "TV", "TRUE", "FALSE", "FALSE", "FALSE"
    ),
    X3 = c(NA, "Social Media", "FALSE", "TRUE", "FALSE", "FALSE"),
    X4 = c(NA, "Radio", "FALSE", "TRUE", "FALSE", "TRUE"),
    X5 = c(NA, "Flyer", "FALSE", "FALSE", "FALSE", "FALSE"),
    X6 = c("Age", NA, "31", "23", "19", "24")
  )
survey
```
``` {r}
# Ignoring names and using sliding headers
mash_colnames(survey, n_name_rows = 2, keep_names = FALSE, sliding_headers = TRUE, sep = "_")
```
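The sliding-headers step amounts to filling the first header row from left to right before pasting it with the second. A base R sketch of that logic, using the two header rows from `survey`, is shown below. It is an illustration that assumes the first header cell is not `NA`, not the package's own code; `top`, `second`, `top_filled`, and `mashed` are names chosen for this sketch.

```r
# The two header rows of the survey data, typed in as vectors
top <- c("Participant", "How did you hear about us?", NA, NA, NA, "Age")
second <- c(NA, "TV", "Social Media", "Radio", "Flyer", NA)

# Fill the top row rightwards with the last non-missing value
top_filled <- top[cummax(seq_along(top) * !is.na(top))]

# Paste with the second row wherever the second row has a value
mashed <- ifelse(
  is.na(second),
  top_filled,
  paste(top_filled, second, sep = "_")
)
mashed
```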
**`annotate_mf()` and `annotate_mf_all()`**
Sometimes embedded subheaders can't be matched by content or context, but they share the same formatting in a spreadsheet file.
`annotate_mf()` flattens four common kinds of meaningful cell formatting into text annotations, which are added as a character string to the values of the target variable.
``` r
example_spreadsheet <- system.file("extdata/dog_test.xlsx", package = "unheadr")
annotate_mf(example_spreadsheet, orig = Task, new = Task_annotated)
```
`annotate_mf_all()` applies the same approach to all values in the dataset.
``` r
example_spreadsheet_all <- system.file("extdata/boutiques.xlsx", package = "unheadr")
annotate_mf_all(example_spreadsheet_all)
```
Lastly, `regex_valign()` can adjust the whitespace (padding) within a character vector with one element per line, for easier parsing with `readr`.
```{r}
guests <-
unlist(strsplit(c("6 COAHUILA 20/03/2020
712 COAHUILA 20/03/2020"),"\n"))
guests
regex_valign(guests, "\\b(?=[A-Z])")
```
The inconsistent whitespace between the elements in each line can be adjusted after matching a position of interest through regular expressions.
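To make the matching step concrete, the base R sketch below locates the matched position in each line and pads the shorter prefixes so that the positions line up. It illustrates the idea only and is not how `regex_valign()` is implemented; `pos` and `aligned` are names chosen for this sketch.

```r
guests <- c("6 COAHUILA 20/03/2020", "712 COAHUILA 20/03/2020")

# Position of the first word boundary that is followed by an uppercase letter
pos <- as.integer(regexpr("\\b(?=[A-Z])", guests, perl = TRUE))
pos

# Insert spaces before the match so it starts at the same column in every line
aligned <- paste0(
  substr(guests, 1, pos - 1),
  strrep(" ", max(pos) - pos),
  substr(guests, pos, nchar(guests))
)
aligned
```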