Interchange dataframe protocol #9071

iskode · 2021-08-19T14:57:16Z

This PR is a basic implementation of the interchange dataframe protocol for cudf.
As is well known, there are many dataframe libraries, and one library's weakness is often handled by another. To work across these libraries, we currently rely on pandas, through methods like from_pandas and to_pandas.
This is a poor design, because each library must maintain an additional dependency on pandas and its peculiarities.
This protocol provides a high-level API that dataframe libraries implement to communicate with one another.
We thus get rid of the tight coupling with pandas and depend only on the protocol API, which leaves each library free to choose its own implementation details.
To illustrate:

df_obj = cudf_dataframe.__dataframe__()

df_obj can be consumed by any library implementing the protocol.

df = cudf.from_dataframe(any_supported_dataframe)

Here we create a cudf dataframe from any dataframe object that supports the protocol.
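To make the producer/consumer split concrete, here is a minimal pure-Python sketch of the idea. It is not the cudf implementation: ToyInterchangeFrame, ProducerFrame, ConsumerFrame and this from_dataframe are hypothetical stand-ins, and the real protocol returns column and buffer objects rather than plain lists. Only the `__dataframe__` entry point and its nan_as_null / allow_copy parameters are taken from the PR.

```python
from typing import Any, Dict, List


class ToyInterchangeFrame:
    """Hypothetical interchange object; the real protocol exposes a richer API."""

    def __init__(self, columns: Dict[str, List[Any]], allow_copy: bool = True):
        self._columns = columns
        self._allow_copy = allow_copy

    def column_names(self) -> List[str]:
        return list(self._columns)

    def get_column_by_name(self, name: str) -> List[Any]:
        return self._columns[name]


class ProducerFrame:
    """Stand-in for a producer library's dataframe (hypothetical)."""

    def __init__(self, data: Dict[str, List[Any]]):
        self._data = dict(data)

    def __dataframe__(self, nan_as_null: bool = False, allow_copy: bool = True):
        # The producer only hands out the interchange object; it knows
        # nothing about which library will consume it.
        return ToyInterchangeFrame(self._data, allow_copy=allow_copy)


class ConsumerFrame:
    """Stand-in for a consumer library's dataframe (hypothetical)."""

    def __init__(self, data: Dict[str, List[Any]]):
        self.data = data


def from_dataframe(obj: Any, allow_copy: bool = True) -> ConsumerFrame:
    # The consumer depends only on the protocol method, not on the
    # producer's concrete type.
    if not hasattr(obj, "__dataframe__"):
        raise TypeError("object does not support the interchange protocol")
    xchg = obj.__dataframe__(allow_copy=allow_copy)
    return ConsumerFrame(
        {name: list(xchg.get_column_by_name(name)) for name in xchg.column_names()}
    )


source = ProducerFrame({"a": [1, 2], "b": [0.5, None]})
result = from_dataframe(source)
```

Neither class imports the other; the only shared contract is the `__dataframe__` method, which is the decoupling the description argues for.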

So far, it supports the following:

Column dtypes: uint8, int, float, bool and categorical.
Missing values are handled for all these dtypes.
String support is on the way.

Additionally, we support dataframes that live on a CPU device, such as pandas dataframes. However, this is not testable here because pandas has not yet adopted the protocol. We've tested it locally with a monkey-patched pandas implementation of the protocol.

GPUtester · 2021-08-19T14:57:17Z

Can one of the admins verify this patch?

vyasr · 2021-08-19T18:10:32Z

Hi @iskode thanks for the contribution! We're aware of the Data API consortium and are on board with eventually moving in that direction. However, I don't think that the DataFrame standard is quite stable enough yet for us to be adding support (unlike the array API that has been released). @shwina is the RAPIDS representative in that group, though, and can probably speak to this in more detail.

shwina · 2021-08-21T11:58:59Z

Ok to test

shwina · 2021-09-13T13:27:37Z

Hi @iskode - to unblock CI, could you please resolve the conflicts on this PR when you have a chance? Specifically, cudf/core/__init__.py should be left empty and the imports changed accordingly. Thanks!

iskode · 2021-09-22T04:18:28Z

It's Ok now I think.

bdice

Great work @iskode! Thanks for working through the mypy bits, I apologize for not getting back to you sooner. The primary change I'd like to see is removing the aliases of _k = _DtypeKind in favor of using _DtypeKind directly. I marked some but not all instances of that pattern.

Otherwise we're down to pretty minor comments. I would be happy enough with the current state, so I'm approving. I'll let @shwina merge when ready.

python/cudf/cudf/core/df_protocol.py

python/cudf/cudf/tests/test_df_protocol.py

shwina · 2021-11-11T20:56:38Z

rerun tests

bdice · 2021-11-12T16:20:47Z

@shwina I'm going to move this to 22.02 since we're in code freeze for 21.12.

edit: the previous failed tests were from HTTP errors and an issue in Java changes from #9616 that should be resolved by re-running. This bump to 22.02 will also rerun tests.

bdice · 2021-11-17T03:41:18Z

rerun tests

iskode · 2021-11-17T08:57:59Z

Hi @shwina and @bdice, thank you for your continued effort to merge this PR.
Some Java unit tests (58) are failing.

iskode · 2021-11-17T11:55:57Z

The Java unit tests are still not passing; it looks like a regression in the Java code. Between the two builds, the number of failing unit tests went down from 58 to 55, so recent changes made 3 more tests pass. Is it possible to mute the Java tests to see whether the dataframe protocol implementation passes all the other tests and gpuCI?

vyasr

Don't worry about the Java tests. There are lots of other changes in flight right now that are causing problems there, but none of them are related to you. If everything other than the Java tests passes, then I think we're good to go.

@shwina the PR has gone through a few rounds of review since you approved, so I'll let you do the honors and merge when you're ready.

shwina · 2021-11-17T23:23:48Z

I'm going to go ahead and merge this. Thank you @iskode for all your hard work on this PR and for your patience with multiple rounds of reviews! Fantastic job!

For anyone wanting to follow up on this work, here are a couple of suggestions for the future:

Because the cuDFColumn and cuDFBuffer types here so closely match the existing Column and Buffer types, I think it'd be good to merge cuDFColumn into cudf.core.column.Column and cuDFBuffer into cudf.core.buffer.Buffer, and just use those here.
I think with better use of isinstance, we can avoid some of the uses of typing.cast we're currently doing. But it could also be indicative of larger typing issues that I know exist in cuDF.
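As an illustration of the second suggestion, the sketch below contrasts typing.cast with isinstance narrowing. NumericColumn and CategoricalColumn are hypothetical classes made up for this example; they are not cudf types, and the PR's actual cast call sites are not shown here.

```python
from typing import List, Union, cast


class NumericColumn:
    """Hypothetical column type used only for this illustration."""

    def __init__(self, values: List[float]):
        self.values = list(values)

    def total(self) -> float:
        return sum(self.values)


class CategoricalColumn:
    """Hypothetical column type used only for this illustration."""

    def __init__(self, codes: List[int], categories: List[str]):
        self.codes = list(codes)
        self.categories = list(categories)

    def labels(self) -> List[str]:
        return [self.categories[code] for code in self.codes]


AnyColumn = Union[NumericColumn, CategoricalColumn]


def describe_with_cast(col: AnyColumn, is_categorical: bool) -> str:
    # cast() only informs the type checker; nothing is verified at runtime,
    # so a wrong flag would surface later as an AttributeError.
    if is_categorical:
        return ",".join(cast(CategoricalColumn, col).labels())
    return str(cast(NumericColumn, col).total())


def describe_with_isinstance(col: AnyColumn) -> str:
    # isinstance() is checked at runtime, and type checkers such as mypy
    # narrow the Union in each branch, so no cast is needed.
    if isinstance(col, CategoricalColumn):
        return ",".join(col.labels())
    return str(col.total())


numeric = NumericColumn([1, 2, 3])
categorical = CategoricalColumn([0, 1, 0], ["x", "y"])
```

Both functions return the same strings for well-formed input; the isinstance version simply removes the separate flag that the cast version has to trust.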

shwina · 2021-11-17T23:24:02Z

@gpucibot merge

iskode · 2021-11-19T15:58:17Z

"I'm going to go ahead and merge this. Thank you @iskode for all your hard work on this PR and for your patience with multiple rounds of reviews! Fantastic job!"

Thank you so much (in particular @shwina and the rest of the team) for the time you invested in assisting me during this process. It has been a great pleasure and a rewarding adventure to work with you. I've learned many things along the way.

These next steps are very interesting and well justified.
I'm currently busy with other projects, but if I find any time I'll take them on.

iskode added 4 commits August 19, 2021 11:30

dataframe protocol implementation, support only int, float, categoric…

8fc7e17

…al without missing values

refactor to call from_dataframe on cudf directly and __dataframe__() …

4367d8f

…on the dataframe object

remove commented monkeypatch

331c696

refactor test cases

c83b4d7

iskode requested a review from a team as a code owner August 19, 2021 14:57

iskode requested review from trxcllnt and charlesbluca August 19, 2021 14:57

github-actions bot added the Python label Aug 19, 2021

iskode marked this pull request as draft August 19, 2021 14:58

propagate nan_as_null from DataFrame to Column class

defcbc5

shwina marked this pull request as ready for review August 23, 2021 15:43

shwina added the feature request and non-breaking labels Aug 23, 2021

start missing value supports + int missing value tests

7c19720

apply protocol update changes

89d00f2

iskode added 7 commits September 13, 2021 17:56

missing values support for int, float and categorical

8450d7e

add boolean support w/ missing values

ec842d6

refactor 'convert_column_to_cupy_ndarray' and replace 'cp.array' to '…

13e0b95

…cp.asarray' to enforce zero-copy

change 'copy' to 'allow_copy' code wide

dfa02a2

fix merging issues with rapidsai/cudf updates

8586daa

make "from_dataframe" accessible cudf: cudf.from_dataframe

d0cd04c

fix merge with recent changes on cudf repo

f0aede4

bdice approved these changes Nov 10, 2021

iskode and others added 10 commits November 10, 2021 07:14

Remove _DTypeKind alias _k

9a3549c

Co-authored-by: Bradley Dice

Remove _DTypeKind alias _k

b76d419

Co-authored-by: Bradley Dice

Remove _DTypeKind alias _k

a34f118

Co-authored-by: Bradley Dice

Remove _DTypeKind alias _k

99ca31d

Co-authored-by: Bradley Dice

Remove _DTypeKind alias _k

87bba32

Co-authored-by: Bradley Dice

add space to multi-line comment.

535e56e

Co-authored-by: Bradley Dice

Remove _DTypeKind alias _k

b421a29

Co-authored-by: Bradley Dice

remove remaining _DtypeKind aliases

6ae5ee0

import DataFrameObject from df_protocol

581153f

rename assertion methods

c1231cb

bdice changed the base branch from branch-21.12 to branch-22.02 November 12, 2021 16:21

Merge remote-tracking branch 'upstream/branch-22.02' into df-protocol

043333c

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

0423cc6

…df-protocol

vyasr approved these changes Nov 17, 2021

rapids-bot bot merged commit 32bacfa into rapidsai:branch-22.02 Nov 17, 2021
