Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interchange dataframe protocol #9071

Merged
merged 67 commits into from
Nov 17, 2021
Merged

Conversation

iskode
Copy link
Contributor

@iskode iskode commented Aug 19, 2021

This PR is a basic implementation of the interchange dataframe protocol for cudf.
As well-known, there are many dataframe libraries out there where one's weakness is handle by another. To work across these libraries, we rely on pandas with method like from_pandas and to_pandas.
This is a bad design as libraries should maintain an additional dependency to pandas peculiarities.
This protocol provides a high level API that must be implemented by dataframe libraries to allow communication between them.
Thus, we get rid of the high coupling with pandas and depend only on the protocol API where each library has the freedom of its implementation details.
To illustrate:

  • df_obj = cudf_dataframe.__dataframe__()

df_obj can be consumed by any library implementing the protocol.

  • df = cudf.from_dataframe(any_supported_dataframe)

here we create a cudf dataframe from any dataframe object supporting the protocol.

So far, it supports the following:

  • Column dtypes: uint8, int, float, bool and categorical.
  • Missing values are handled for all these dtypes.
  • string support is on the way.

Additionally, we support dataframe from CPU device like pandas. But it is not testable here as pandas has not yet adopted the protocol. We've tested it locally with a pandas monkey patched implementation of the protocol.

@iskode iskode requested a review from a team as a code owner August 19, 2021 14:57
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

1 similar comment
@GPUtester
Copy link
Collaborator

Can one of the admins verify this patch?

@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 19, 2021
@iskode iskode marked this pull request as draft August 19, 2021 14:58
@vyasr
Copy link
Contributor

vyasr commented Aug 19, 2021

Hi @iskode thanks for the contribution! We're aware of the Data API consortium and are on board with eventually moving in that direction. However, I don't think that the DataFrame standard is quite stable enough yet for us to be adding support (unlike the array API that has been released). @shwina is the RAPIDS representative in that group, though, and can probably speak to this in more detail.

@shwina
Copy link
Contributor

shwina commented Aug 21, 2021

Ok to test

@shwina shwina marked this pull request as ready for review August 23, 2021 15:43
@shwina shwina added feature request New feature or request non-breaking Non-breaking change labels Aug 23, 2021
@shwina
Copy link
Contributor

shwina commented Sep 13, 2021

Hi @iskode - to unblock CI, could you please resolve the conflicts on this PR when you have a chance? Specifically cudf/core/___init__.py should be left empty and the imports changed accordingly. Thanks!

@iskode
Copy link
Contributor Author

iskode commented Sep 22, 2021

It's Ok now I think.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @iskode! Thanks for working through the mypy bits, I apologize for not getting back to you sooner. The primary change I'd like to see is removing the aliases of _k = _DtypeKind in favor of using _DtypeKind directly. I marked some but not all instances of that pattern.

Otherwise we're down to pretty minor comments. I would be happy enough with the current state, so I'm approving. I'll let @shwina merge when ready.

python/cudf/cudf/core/df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/core/df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_df_protocol.py Outdated Show resolved Hide resolved
python/cudf/cudf/tests/test_df_protocol.py Outdated Show resolved Hide resolved
@shwina
Copy link
Contributor

shwina commented Nov 11, 2021

rerun tests

@bdice
Copy link
Contributor

bdice commented Nov 12, 2021

@shwina I'm going to move this to 22.02 since we're in code freeze for 21.12.

edit: the previous failed tests were from HTTP errors and an issue in Java changes from #9616 that should be resolved by re-running. This bump to 22.02 will also rerun tests.

@bdice bdice changed the base branch from branch-21.12 to branch-22.02 November 12, 2021 16:21
@bdice
Copy link
Contributor

bdice commented Nov 17, 2021

rerun tests

@iskode
Copy link
Contributor Author

iskode commented Nov 17, 2021

Hi @shwina and @bdice , thank you for your continuous effort to merge this PR.
Some Java unit tests (58) are failing.

@iskode
Copy link
Contributor Author

iskode commented Nov 17, 2021

Java unit tests are not passing.... it is like a regression from Java code. Between the two builds, number of failures (unit test) went down from 58 to 55. So recent changes made 3 more tests passing. Is it possible to mute Java tests to see if the dataframe protocol implementation pass all tests and the gpuCI ?

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry about Java tests. There are lots of other changes in flight right now that are causing problems there, but none of them are related to you. If everything other than Java tests pass then I think we're good to go.

@shwina the PR has gone through a few rounds of review since you approved, so I'll let you do the honors and merge when you're ready.

@shwina
Copy link
Contributor

shwina commented Nov 17, 2021

I'm going to go ahead and merge this. Thank you @iskode for all your hard work on this PR and for your patience with multiple rounds of reviews! Fantastic job!

For anyone wanting to follow up on this work, here are a couple of suggestions for the future:

  • Because the cuDFColumn and cuDFBuffer types here so closely match existing Buffer and Column types, I think it'd be good to merge cuDFColumn into cudf.core.column.ColumnandcudfBufferintocudf.core.buffer.Buffer` and just use those here.

  • I think with better use of isinstance, we can avoid some of the uses of typing.cast we're currently doing. But it could also be indicative of larger typing issues that I know exist in cuDF.

@shwina
Copy link
Contributor

shwina commented Nov 17, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 32bacfa into rapidsai:branch-22.02 Nov 17, 2021
@iskode
Copy link
Contributor Author

iskode commented Nov 19, 2021

I'm going to go ahead and merge this. Thank you @iskode for all your hard work on this PR and for your patience with multiple rounds of reviews! Fantastic job!

Thank you so much your investment (in particular @shwina and the rest of the team) in assisting me during this process. It has been a great pleasure and rewarding adventure to work with you. I've learned many things along the way.

Very interesting and justified next steps.
Currently I'm busy on other projects but if I get anytime I'll head on to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants