Right now count() and add_count() eventually just do summarise(n = n()) or mutate(n = n()) at some point, which can be very slow with many groups. We already have vec_count(), which should be much faster than count() when there are many groups. For add_count(), we could either add some kind of vctrs primitive that works like a windowed count, or just build on top of vec_count()'s result plus an additional call to vec_match().
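To make the vec_count() plus vec_match() idea concrete, here is a minimal sketch on toy data. The data frame `df` and its columns `g` and `h` are made up for illustration, and this is only one way the pieces could fit together, not a proposed implementation.

```r
library(vctrs)

# Hypothetical toy data; `g` and `h` play the role of the grouping columns.
df <- data.frame(
  g = c("a", "b", "a", "c", "a"),
  h = c(1, 2, 1, 3, 1)
)
keys <- df[c("g", "h")]

# One row per unique key, in order of first appearance.
counts <- vec_count(keys, sort = "location")

# count()-like result: the unique keys plus an `n` column.
count_out <- counts$key
count_out$n <- counts$count

# add_count()-like result: match every row back to its key
# and broadcast that key's count onto the original rows.
loc <- vec_match(keys, counts$key)
add_count_out <- df
add_count_out$n <- counts$count[loc]
```

The extra vec_match() pass is what a dedicated windowed-count primitive would avoid, since the group of each row is already known while counting.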
We'd have to think through how weighted counts would work; maybe vec_count() would need to support a weight argument (a double vector).
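For reference, the semantics a weighted count would need to reproduce can be sketched with existing vctrs functions. vec_count() has no weight argument today, so this sketch uses vec_group_loc() and sums the weights per group instead. The data and the `wt` vector are made up, and the per-group vapply() is there to show the intended result, not to be fast.

```r
library(vctrs)

# Hypothetical toy data with a double weight vector.
keys <- data.frame(g = c("a", "b", "a", "c", "a"))
wt <- c(0.5, 2, 1.5, 1, 1)

# `key` holds the unique keys, `loc` the row positions of each group.
groups <- vec_group_loc(keys)

# Weighted count: sum the weights at each group's locations.
weighted_out <- groups$key
weighted_out$n <- vapply(groups$loc, function(i) sum(wt[i]), numeric(1))
```

With all weights equal to 1 this reduces to the unweighted count, which is the behaviour a weight argument would presumably default to.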
Motivation is something like this, and flights isn't even that big. Roughly 55k groups here.
library(dplyr)
library(nycflights13)
bench::mark(
count(flights, dep_time, dep_delay),
vctrs::vec_count(flights[c("dep_time", "dep_delay")]),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median itr/s…¹
#> <bch:expr> <bch:tm> <bch:t> <dbl>
#> 1 count(flights, dep_time, dep_delay) 419.6ms 441.4ms 2.27
#> 2 vctrs::vec_count(flights[c("dep_time", "dep_delay")]) 17.3ms 21.5ms 42.7
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>, and abbreviated
#> # variable name ¹`itr/sec`

Also need to handle the fact that ... and wt are data-masking, probably with add_computed_columns() like distinct().
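As a rough illustration of why data-masking matters here: expressions passed through ... have to be evaluated into real columns before vctrs can see them. The sketch below uses the exported transmute() for that step rather than the internal add_computed_columns(), whose signature is not assumed here. `count_sketch()` is a hypothetical name, and it ignores grouped data frames, wt, sort, and name handling.

```r
library(dplyr)
library(vctrs)

# Hypothetical helper: materialise the data-masked expressions first,
# then count the resulting key columns with vec_count().
count_sketch <- function(data, ...) {
  keys <- transmute(ungroup(data), ...)
  counts <- vec_count(keys, sort = "key")
  out <- counts$key
  out$n <- counts$count
  out
}

df <- tibble(a = 1:6)
res <- count_sketch(df, odd = a %% 2 == 1)
```

Here `odd` does not exist in `df`; it is computed from `a` and then used as the counting key, which is the case plain column selection cannot cover.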