feat(python): Native implementation of dataframe interchange protocol #10267

stinodego · 2023-08-03T09:52:53Z

Changes:

Implement DataFrame.__dataframe__ natively, rather than relying on pyarrow.
Move existing from_dataframe implementation to the new interchange module. For this function, we still rely on pyarrow's implementation (for now).

stinodego · 2023-08-06T18:15:26Z

@MarcoGorelli from your experience with the DataFrame API, could you give me a hint as to where the nan_as_null option should be implemented? I imagine we could do a fill_null of the data buffer when Column.get_buffers is called? Or should I be thinking about this differently?

stinodego · 2023-08-06T19:35:22Z

Do we allocate a validity buffer if there are no missing values?

@ritchie46 Could you answer this one for me?

ritchie46 · 2023-08-06T19:52:05Z

Do we allocate a validity buffer if there are no missing values?

@ritchie46 Could you answer this one for me?

No, we don't.

MarcoGorelli · 2023-08-07T07:49:04Z

How to implement nan_as_null option? Initial idea would be to call fill_null on the data buffer

I think that's the idea, but in practice it looks like for both pyarrow and for pandas it's not implemented

https://github.com/pandas-dev/pandas/blob/b864b8eb54605124827e6af84aa7fbc816cfc1e0/pandas/core/interchange/dataframe.py#L30-L42

https://github.com/apache/arrow/blob/7cbbd3ee95bc0f7d20bb358e2978e2eb18a05304/python/pyarrow/interchange/dataframe.py#L56-L63

cc @AlenkaF - do you know about how nan_as_null is meant to be implemented?

rgommers · 2023-08-07T09:01:46Z

I imagine we could do a fill_null of the data buffer when Column.get_buffers is called?

That sounds right to me, assuming that that is the call where you ensure the data is accessible in memory. In case calling __dataframe__ will already trigger materialization for whole columns, then I'd suggest doing it at that point.

I'm not sure why it isn't implemented in other libraries yet, probably just skipped for convenience. It shouldn't be a big burden to do so I'd think though. And at least in the case of pandas with numpy floating-point dtypes, it is in principle useful - nan is used as a sentinel for missing data there, and those dtypes don't have separate NA/null support.

stinodego · 2023-08-07T09:59:03Z

Thanks for the input.

I guess setting nan_as_null=True and allow_copy=False should fail if there are any missing values to fill?

Implementing this for our Float types is trivial. I guess we can do this for Integer types as well though it will trigger a cast to Float64 and no longer be zero copy. Utf8 types can be filled with a string "NaN", and for Categorical I guess we can add a category called "NaN".

I don't know what to do with Boolean types or Datetime types though. Your input is appreciated!

Though now I am thinking I should probably save it for a separate PR. It's not trivial and this PR is already huge.

rgommers · 2023-08-07T10:23:09Z

I guess setting nan_as_null=True and allow_copy=False should fail if there are any missing values to fill?

I think so, yes.

Implementing this for our Float types is trivial. I guess we can do this for Integer types as well though it will trigger a cast to Float64 and no longer be zero copy. Utf8 types can be filled with a string "NaN", and for Categorical I guess we can add a category called "NaN".

My first thought here was that it should only apply to columns with floating-point dtypes. However, after doing some digging to confirm that, I instead found data-apis/dataframe-api#125. So perhaps don't do anything here, and if we confirm on that issue that no one needs it, we should instead follow through and remove the keyword.

stinodego · 2023-08-07T10:29:42Z

So perhaps don't do anything here

Great, thanks for the insights. In that case, I'll not implement anything for this right now.

py-polars/polars/dataframe/frame.py

py-polars/polars/interchange/column.py

py-polars/polars/interchange/dataframe.py

py-polars/polars/interchange/utils.py

stinodego · 2023-08-10T14:34:36Z

Discussed with Ritchie and this is going in 🚀

Next up: from_dataframe

AlenkaF · 2023-08-14T07:25:14Z

cc @AlenkaF - do you know about how nan_as_null is meant to be implemented?

We have a default set to False in PyArrow implementation, otherwise it results in an error. I saw you have already found a connected issue data-apis/dataframe-api#125 and submitted a PR for it data-apis/dataframe-api#226. Great!

And sorry for late response!

github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Aug 3, 2023

stinodego force-pushed the dataframe-interchange-protocol branch from 4ed0c65 to a111410 Compare August 4, 2023 11:58

stinodego mentioned this pull request Aug 6, 2023

feat(rust!,python): Add Series.cat.uses_lexical_ordering #10325

Merged

stinodego force-pushed the dataframe-interchange-protocol branch from f0ff253 to 72a8200 Compare August 7, 2023 14:01

stinodego added the highlight Highlight this PR in the changelog label Aug 7, 2023

stinodego force-pushed the dataframe-interchange-protocol branch 4 times, most recently from 8f365d3 to 9d3c5c8 Compare August 7, 2023 20:38

stinodego mentioned this pull request Aug 7, 2023

Remove nan_as_null parameter for DataFrame protocol data-apis/dataframe-api#226

Closed

stinodego force-pushed the dataframe-interchange-protocol branch from ede4ba3 to b3364c6 Compare August 8, 2023 06:22

stinodego mentioned this pull request Aug 8, 2023

Add to_local method to categorical namespace #10360

Closed

stinodego force-pushed the dataframe-interchange-protocol branch from 97c73cd to ff89cdd Compare August 8, 2023 09:55

stinodego mentioned this pull request Aug 9, 2023

feat(rust,python): Add is_local and to_local to categorical namespace #10372

Merged

stinodego force-pushed the dataframe-interchange-protocol branch from ff89cdd to 6ea1c54 Compare August 9, 2023 11:03

stinodego added 3 commits August 9, 2023 13:43

Move from_dataframe

7510bc2

Implement interchange protocol natively

cb24709

Add tests

e2c9ebe

stinodego force-pushed the dataframe-interchange-protocol branch from 30c2eac to e2c9ebe Compare August 9, 2023 11:44

stinodego marked this pull request as ready for review August 9, 2023 11:46

stinodego requested a review from ritchie46 as a code owner August 9, 2023 11:46

stinodego requested a review from alexander-beedie as a code owner August 9, 2023 11:46

ritchie46 reviewed Aug 9, 2023

View reviewed changes

py-polars/polars/dataframe/frame.py Show resolved Hide resolved

py-polars/polars/interchange/column.py Show resolved Hide resolved

py-polars/polars/interchange/dataframe.py Show resolved Hide resolved

py-polars/polars/interchange/utils.py Show resolved Hide resolved

stinodego added 4 commits August 10, 2023 14:46

Add a note about slicing in get_chunks

f4a8890

Minor optimization for dtype conversion calls

3d1a2f5

Improve dtype conversion performance

7abb47b

Fix return type of base_type util

dee3f12

stinodego merged commit 878da45 into main Aug 10, 2023

stinodego deleted the dataframe-interchange-protocol branch August 10, 2023 14:34

stinodego added the A-interchange Area: Python dataframe interchange protocol label Sep 7, 2023

alexander-beedie mentioned this pull request Nov 14, 2023

docs(python): make "DataFrame" and "Series" case more consistent across docs/comments/errors #12428

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(python): Native implementation of dataframe interchange protocol #10267

feat(python): Native implementation of dataframe interchange protocol #10267

stinodego commented Aug 3, 2023 •

edited

Loading

stinodego commented Aug 6, 2023

stinodego commented Aug 6, 2023

ritchie46 commented Aug 6, 2023

MarcoGorelli commented Aug 7, 2023 •

edited

Loading

rgommers commented Aug 7, 2023

stinodego commented Aug 7, 2023

rgommers commented Aug 7, 2023

stinodego commented Aug 7, 2023

stinodego commented Aug 10, 2023

AlenkaF commented Aug 14, 2023

feat(python): Native implementation of dataframe interchange protocol #10267

feat(python): Native implementation of dataframe interchange protocol #10267

Conversation

stinodego commented Aug 3, 2023 • edited Loading

stinodego commented Aug 6, 2023

stinodego commented Aug 6, 2023

ritchie46 commented Aug 6, 2023

MarcoGorelli commented Aug 7, 2023 • edited Loading

rgommers commented Aug 7, 2023

stinodego commented Aug 7, 2023

rgommers commented Aug 7, 2023

stinodego commented Aug 7, 2023

stinodego commented Aug 10, 2023

AlenkaF commented Aug 14, 2023

stinodego commented Aug 3, 2023 •

edited

Loading

MarcoGorelli commented Aug 7, 2023 •

edited

Loading