Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement GroupBy.value_counts to match pandas API #14114

Merged
merged 6 commits into from
Sep 20, 2023

Conversation

stmio
Copy link
Contributor

@stmio stmio commented Sep 17, 2023

Description

This PR implements GroupBy.value_counts, matching the pandas equivalent method.

Tests currently ignore the returned Series/DataFrame's name, as this was added to pandas in v2.0.0. This can be removed if tests are against pandas>=2.0.0.

Closes #12789

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Sep 17, 2023

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 17, 2023
@stmio stmio marked this pull request as ready for review September 18, 2023 11:36
@stmio stmio requested a review from a team as a code owner September 18, 2023 11:36
@shwina
Copy link
Contributor

shwina commented Sep 18, 2023

/ok to test

@shwina shwina added feature request New feature or request non-breaking Non-breaking change labels Sep 18, 2023
@galipremsagar
Copy link
Contributor

/ok to test

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this feature! I have a few questions.

Can we make sure to cover errors raised by pandas?

  1. pandas raises an error about "clashing" if the subset of columns and groupby columns have a common element: https://github.com/pandas-dev/pandas/blob/f4f598fb36c0809da01cade2d5d832ee09564101/pandas/core/groupby/groupby.py#L2761-L2763
  2. pandas raises an error about "doesn't exist" if the subset keys don't exist in the dataframe. https://github.com/pandas-dev/pandas/blob/f4f598fb36c0809da01cade2d5d832ee09564101/pandas/core/groupby/groupby.py#L2767-L2769
  3. pandas raises if there is already a column with the name count or proportion in the dataframe. https://github.com/pandas-dev/pandas/blob/f4f598fb36c0809da01cade2d5d832ee09564101/pandas/core/groupby/groupby.py#L2851-L2852

Can we cover cases like non-observed categorical values? https://github.com/pandas-dev/pandas/blob/f4f598fb36c0809da01cade2d5d832ee09564101/pandas/core/groupby/groupby.py#L2805

This is a great first pass -- but I think I'd like to see some coverage of at least some of the above features before approving (it's probably okay to leave some as TODO).

python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
@stmio
Copy link
Contributor Author

stmio commented Sep 18, 2023

Thank you for the review @bdice 😃
I have covered the three error cases in the same way as pandas and will have a quick look at the possibility of non-observed categorical values.

@shwina
Copy link
Contributor

shwina commented Sep 18, 2023

/ok to test

@bdice
Copy link
Contributor

bdice commented Sep 18, 2023

@stmio Awesome, thanks. Can you add tests for the error cases, too?

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me!

@bdice
Copy link
Contributor

bdice commented Sep 18, 2023

/ok to test

Copy link
Contributor

@galipremsagar galipremsagar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests show some errors. Try replacing .columns with ._column_names.

python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/groupby/groupby.py Outdated Show resolved Hide resolved
@bdice
Copy link
Contributor

bdice commented Sep 19, 2023

/ok to test

@bdice
Copy link
Contributor

bdice commented Sep 20, 2023

/merge

@rapids-bot rapids-bot bot merged commit 2d4f22a into rapidsai:branch-23.10 Sep 20, 2023
54 checks passed
@shwina
Copy link
Contributor

shwina commented Sep 20, 2023

Nice work, @stmio!

@stmio
Copy link
Contributor Author

stmio commented Sep 20, 2023

Thank you! 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[FEA] GroupBy.value_counts
4 participants