Description
In dplyr 1.0.3, a function passed to across() can reference other columns of the same data frame/tibble (or of the current group) by name. This functionality is broken in 1.0.4.
To reproduce, the following example works in 1.0.3:
library(dplyr, warn.conflicts = FALSE)
storms %>% mutate(across(c('wind', 'pressure'), function(x) {
  return(x / lat)
}))
In 1.0.4, running the above example results in the following error message:
> rlang::last_error()
<error/dplyr:::mutate_error>
Problem with `mutate()` input `..1`.
x object 'lat' not found
In other words, other columns can no longer be referenced by name inside the function. A possible workaround is to use cur_data()$name_of_column, but this is slower, as the following benchmark demonstrates:
library(dplyr, warn.conflicts = FALSE)
# 1000 rows: two integer key columns plus 100 columns of random normals
df <- tibble(cbind.data.frame(
  grp_1 = sort(rep(1:250, 4)),
  grp_2 = rep(1:4, 250),
  matrix(rnorm(1000 * 100), nrow = 1000)))
bench::mark(iterations = 100,
            filter_gc = FALSE,
            # Workaround: look up the other columns through cur_data()
            use_cur_data = df %>% summarise(across(is.numeric, function(x) {
              rows = cur_data()
              mask = (rows$grp_1 %% 2) == 0
              return(mean(x[mask] / rows$grp_2[mask]))
            })),
            # Original approach: reference the other columns directly by name
            direct_reference = df %>% summarise(across(is.numeric, function(x) {
              mask = (grp_1 %% 2) == 0
              return(mean(x[mask] / grp_2[mask]))
            }))) %>%
  select(expression, min, median, `itr/sec`, `gc/sec`)
which results in the following output:
# A tibble: 2 x 5
  expression            min   median `itr/sec` `gc/sec`
  <bch:expr>       <bch:tm> <bch:tm>     <dbl>    <dbl>
1 use_cur_data       17.2ms  18.29ms      51.1    10.7
2 direct_reference   8.12ms   8.59ms     108.      9.76
TL;DR: In dplyr 1.0.3, referencing columns via cur_data()$column_name instead of directly by name is considerably slower (roughly 2x in the benchmark above: 18.29ms vs. 8.59ms median). In 1.0.4, referencing columns by name inside across(), i.e. without cur_data(), is currently broken.
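For reference, here is a minimal sketch of the cur_data() workaround applied to the storms reproduction example. It assumes a dplyr version in which cur_data() is available; the variable name `res` is only illustrative and not part of the original report.

```r
library(dplyr, warn.conflicts = FALSE)

# Workaround sketch: fetch `lat` through cur_data() instead of by bare name.
# `res` is a hypothetical name used only for this illustration.
res <- storms %>%
  mutate(across(c("wind", "pressure"), function(x) {
    x / cur_data()$lat
  }))

# Inspect the rescaled columns next to the latitude they were divided by
head(select(res, name, lat, wind, pressure))
```

This avoids the "object 'lat' not found" error but pays the cost of the cur_data() call for every column that across() processes, which is the overhead the benchmark measures.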