count() and add_count() could be much faster #6806

@DavisVaughan

Right now these eventually call summarise(n = n()) or mutate(n = n()), which can be very slow with many groups. We already have vec_count(), which should be much faster than count() in that case. For add_count(), we could either add a vctrs primitive that works like a windowed count, or build on vec_count()'s result plus an additional call to vec_match().
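A minimal sketch of the vec_count() plus vec_match() idea, using a small made-up data frame (`df` and its columns are hypothetical, and this is not how dplyr currently implements either function):

```r
library(vctrs)

# Hypothetical example data; `g` and `h` are the columns to count by.
df <- data.frame(
  g = c("a", "b", "a", "c", "a", "b"),
  h = c(1, 2, 1, 3, 1, 2)
)
keys <- df[c("g", "h")]

# count()-like result: one row per unique key combination.
# sort = "key" orders the result by the key values.
counts <- vec_count(keys, sort = "key")
counted <- counts$key
counted$n <- counts$count

# add_count()-like result: look up each row's key in the unique keys
# and broadcast the matching count back to the original rows.
loc <- vec_match(keys, counts$key)
added <- df
added$n <- counts$count[loc]
```

Here `counted` has one row per unique (g, h) pair, while `added` keeps all six original rows and gains an `n` column. A dedicated windowed-count primitive could avoid the separate vec_match() pass.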

We'd have to think through how weighted counts would work. Maybe vec_count() needs support for a weight argument (a double vector).
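Until something like that exists, a weighted count can be sketched with vec_group_loc(), which returns each unique key together with the row locations that share it. The data and the `wt` vector below are made up, and summing in an R-level loop like this is only illustrative, not the fast path the issue is asking for:

```r
library(vctrs)

# Hypothetical keys and a double vector of weights, one per row.
keys <- data.frame(g = c("a", "b", "a", "c"))
wt <- c(0.5, 2, 1.5, 1)

# One row per unique key, in order of first appearance;
# `loc` is a list of integer vectors of row positions.
groups <- vec_group_loc(keys)

# Sum the weights within each group.
weighted <- groups$key
weighted$n <- vapply(groups$loc, function(i) sum(wt[i]), numeric(1))
```

A weight argument on vec_count() would presumably do the same summation in C, in a single pass over the data.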

The motivation is something like this. flights isn't even that big, and there are roughly 55k groups here.

library(dplyr)
library(nycflights13)

bench::mark(
  count(flights, dep_time, dep_delay),
  vctrs::vec_count(flights[c("dep_time", "dep_delay")]),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                                                 min  median itr/s…¹
#>   <bch:expr>                                            <bch:tm> <bch:t>   <dbl>
#> 1 count(flights, dep_time, dep_delay)                    419.6ms 441.4ms    2.27
#> 2 vctrs::vec_count(flights[c("dep_time", "dep_delay")])   17.3ms  21.5ms   42.7 
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>, and abbreviated
#> #   variable name ¹​`itr/sec`

We also need to handle the fact that ... and wt are data-masking, probably with add_computed_columns(), like distinct() does.
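To illustrate the data-masking concern: the expressions in ... have to be evaluated against the data before vec_count() ever sees them. The sketch below uses the exported transmute() as a stand-in for the internal add_computed_columns(); `count_fast()` is a hypothetical name, and it ignores wt, grouped data frames, and the empty ... case:

```r
library(dplyr)
library(vctrs)

# Hypothetical helper, not part of dplyr.
count_fast <- function(data, ...) {
  # Evaluate the data-masked expressions to get the key columns.
  keys <- transmute(data, ...)

  # Count unique key combinations, ordered by key.
  counts <- vec_count(keys, sort = "key")

  out <- counts$key
  out$n <- counts$count
  out
}

# Hypothetical usage with a computed key.
df <- data.frame(a = 1:5)
res <- count_fast(df, odd = a %% 2 == 1)
```

Here `res` has one row for `odd = FALSE` and one for `odd = TRUE`, which should match what count(df, odd = a %% 2 == 1) returns on the same data.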
