Skip to content

feat(python): Native implementation of dataframe interchange protocol #10267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Aug 10, 2023

Conversation

stinodego
Copy link
Contributor

@stinodego stinodego commented Aug 3, 2023

Closes #10035

Changes:

  • Implement DataFrame.__dataframe__ natively, rather than relying on pyarrow.
  • Move existing from_dataframe implementation to the new interchange module. For this function, we still rely on pyarrow's implementation (for now).

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Aug 3, 2023
@stinodego stinodego force-pushed the dataframe-interchange-protocol branch from 4ed0c65 to a111410 Compare August 4, 2023 11:58
@stinodego
Copy link
Contributor Author

@MarcoGorelli from your experience with the DataFrame API, could you give me a hint as to where the nan_as_null option should be implemented? I imagine we could do a fill_null of the data buffer when Column.get_buffers is called? Or should I be thinking about this differently?

@stinodego
Copy link
Contributor Author

Do we allocate a validity buffer if there are no missing values?

@ritchie46 Could you answer this one for me?

@ritchie46
Copy link
Member

Do we allocate a validity buffer if there are no missing values?

@ritchie46 Could you answer this one for me?

No, we don't.

@MarcoGorelli
Copy link
Collaborator

MarcoGorelli commented Aug 7, 2023

How to implement nan_as_null option? Initial idea would be to call fill_null on the data buffer

I think that's the idea, but in practice it looks like for both pyarrow and for pandas it's not implemented

https://github.com/pandas-dev/pandas/blob/b864b8eb54605124827e6af84aa7fbc816cfc1e0/pandas/core/interchange/dataframe.py#L30-L42

https://github.com/apache/arrow/blob/7cbbd3ee95bc0f7d20bb358e2978e2eb18a05304/python/pyarrow/interchange/dataframe.py#L56-L63

cc @AlenkaF - do you know about how nan_as_null is meant to be implemented?

@rgommers
Copy link

rgommers commented Aug 7, 2023

I imagine we could do a fill_null of the data buffer when Column.get_buffers is called?

That sounds right to me, assuming that that is the call where you ensure the data is accessible in memory. In case calling __dataframe__ will already trigger materialization for whole columns, then I'd suggest doing it at that point.

I'm not sure why it isn't implemented in other libraries yet, probably just skipped for convenience. It shouldn't be a big burden to do so I'd think though. And at least in the case of pandas with numpy floating-point dtypes, it is in principle useful - nan is used as a sentinel for missing data there, and those dtypes don't have separate NA/null support.

@stinodego
Copy link
Contributor Author

Thanks for the input.

I guess setting nan_as_null=True and allow_copy=False should fail if there are any missing values to fill?

Implementing this for our Float types is trivial. I guess we can do this for Integer types as well though it will trigger a cast to Float64 and no longer be zero copy. Utf8 types can be filled with a string "NaN", and for Categorical I guess we can add a category called "NaN".

I don't know what to do with Boolean types or Datetime types though. Your input is appreciated!

Though now I am thinking I should probably save it for a separate PR. It's not trivial and this PR is already huge.

@rgommers
Copy link

rgommers commented Aug 7, 2023

I guess setting nan_as_null=True and allow_copy=False should fail if there are any missing values to fill?

I think so, yes.

Implementing this for our Float types is trivial. I guess we can do this for Integer types as well though it will trigger a cast to Float64 and no longer be zero copy. Utf8 types can be filled with a string "NaN", and for Categorical I guess we can add a category called "NaN".

My first thought here was that it should only apply to columns with floating-point dtypes. However, after doing some digging to confirm that, I instead found data-apis/dataframe-api#125. So perhaps don't do anything here, and if we confirm on that issue that no one needs it, we should instead follow through and remove the keyword.

@stinodego
Copy link
Contributor Author

So perhaps don't do anything here

Great, thanks for the insights. In that case, I'll not implement anything for this right now.

@stinodego stinodego force-pushed the dataframe-interchange-protocol branch from f0ff253 to 72a8200 Compare August 7, 2023 14:01
@stinodego stinodego added the highlight Highlight this PR in the changelog label Aug 7, 2023
@stinodego stinodego force-pushed the dataframe-interchange-protocol branch 4 times, most recently from 8f365d3 to 9d3c5c8 Compare August 7, 2023 20:38
@stinodego stinodego force-pushed the dataframe-interchange-protocol branch from ede4ba3 to b3364c6 Compare August 8, 2023 06:22
@stinodego stinodego force-pushed the dataframe-interchange-protocol branch from 97c73cd to ff89cdd Compare August 8, 2023 09:55
@stinodego stinodego force-pushed the dataframe-interchange-protocol branch from ff89cdd to 6ea1c54 Compare August 9, 2023 11:03
@stinodego stinodego force-pushed the dataframe-interchange-protocol branch from 30c2eac to e2c9ebe Compare August 9, 2023 11:44
@stinodego stinodego marked this pull request as ready for review August 9, 2023 11:46
@stinodego stinodego requested a review from ritchie46 as a code owner August 9, 2023 11:46
@stinodego
Copy link
Contributor Author

Discussed with Ritchie and this is going in 🚀

Next up: from_dataframe

@stinodego stinodego merged commit 878da45 into main Aug 10, 2023
@stinodego stinodego deleted the dataframe-interchange-protocol branch August 10, 2023 14:34
@AlenkaF
Copy link

AlenkaF commented Aug 14, 2023

cc @AlenkaF - do you know about how nan_as_null is meant to be implemented?

We have a default set to False in PyArrow implementation, otherwise it results in an error. I saw you have already found a connected issue data-apis/dataframe-api#125 and submitted a PR for it data-apis/dataframe-api#226. Great!

And sorry for late response!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-interchange Area: Python dataframe interchange protocol enhancement New feature or an improvement of an existing feature highlight Highlight this PR in the changelog python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Python dataframe interchange protocol natively
5 participants