
Request for a group_subset() function #7625

@marcuslehr

Hi, I frequently find myself trying to pull a single group out of a grouped data frame, usually for troubleshooting. There's already a set of group_ helper functions, which is where I usually look first. You can make these select a group, or call filter() and manually filter down to a single group, but either way it's tedious, especially when you just want to quickly grab a random group or two for dev/debugging purposes. The most efficient way I can find to do this is:
grouped_df[group_rows(grouped_df)[[1]],]

This returns the rows of the first group. However, it's hard to remember, and it doesn't work well with pipes because the data frame must be referenced twice (and pipes don't play well with bracket subsetting in the first place). For demonstration, the piped equivalent is:
grouped_df %>% group_rows() %>% .[[1]] %>% grouped_df[.,]

Both of these are ugly and hard to remember, so I think it would be nice to have a helper function specifically for this purpose. It could be called group_subset() or group_select(), though the latter could be confused with select() (groups are row-based, but I can see why one might want to avoid the name). Heck, I would actually argue for replacing group_data(), as you'd be forgiven for thinking that returning a group's data is what group_data() is for. But it's not: it returns row numbers, not data, which is misleading imo. In fact, group_data() is so similar to group_rows() that I would argue they're basically redundant and group_data() could simply be repurposed.

Anyways, my envisioned syntax to replace the above calls is:
grouped_df %>% group_subset(1)
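
For illustration, here is a minimal sketch of what such a helper might look like if written on top of the existing group_rows(). The name group_subset() and its argument `i` are hypothetical and taken from the proposal; nothing here is part of dplyr, and mtcars is only used as example data.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr: return the rows of the i-th group.
group_subset <- function(.data, i = 1) {
  stopifnot(is_grouped_df(.data))
  rows <- group_rows(.data)  # list of integer row indices, one element per group
  stopifnot(length(i) == 1, i >= 1, i <= length(rows))
  .data[rows[[i]], ]         # same bracket subsetting as the call shown earlier
}

# Pipe-friendly usage matching the envisioned syntax
first_group <- mtcars %>% group_by(cyl) %>% group_subset(1)

# For comparison, a pipe-friendly alternative using existing dplyr verbs
same_rows <- mtcars %>% group_by(cyl) %>% filter(cur_group_id() == 1)
```

The filter(cur_group_id() == 1) form already works today, but it is arguably no easier to remember than the bracket version, which is the point of the request.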

This would be a nice, clean way to return a single group's rows via a group index. If you're highly averse to adding new functions or making breaking changes, then group_data() could at least be modified to return a data column. Then you could do
grouped_df %>% group_data() %>% slice(1) %>% pull(.data)

This would at least make group_data() true to its name and be an improvement. But I still like the dedicated function option better (e.g. group_subset()), and it seems reasonable given there's already a suite of helper functions.
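
To make the group_data() variant concrete, the following sketch emulates a data column by hand using only current dplyr functions. The column name `data` is an assumption made for illustration (the proposal above writes `.data`); group_data() itself does not return such a column today.

```r
library(dplyr)

grouped_df <- mtcars %>% group_by(cyl)

# group_data() returns one row per group: the grouping keys plus a `.rows` list column
gd <- group_data(grouped_df)

# Emulate the proposed behaviour: attach each group's rows as a list column.
# `data` is a hypothetical column name, not something dplyr provides.
flat <- ungroup(grouped_df)
gd$data <- lapply(gd$.rows, function(idx) flat[idx, ])

# Pull the first group's data; the result is a list holding one tibble
first_data <- gd %>% slice(1) %>% pull(data)
```

Because pull() on a list column returns a list, one more `[[1]]` is needed to get at the tibble itself, which is part of why the dedicated helper reads more cleanly.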
