[Protocol] Make Column.get_buffers() docstring more explicit #272

pitrou · 2023-10-03T11:38:31Z

Several implementations got Column.get_buffers() wrong by assuming the buffers' dtypes would be the same as the column dtype. Clarify the docstring to eliminate any ambiguity.

See apache/arrow#37598 for example.

Closes #273

pitrou · 2023-10-03T11:38:56Z

@rgommers @MarcoGorelli @AlenkaF @jorisvandenbossche @stinodego

MarcoGorelli · 2023-10-03T11:42:46Z

I do agree, but some care needs to be taken. Currently, from_dataframe in pandas assumes the buffer dtype is the same as the column dtype, so if the buffer dtype were fixed immediately, this would break libraries that use the interchange protocol (e.g. plotly).

Thoughts on the approach suggested in pandas-dev/pandas#54781 (comment)? I think we should probably wait 1-2 years before updating implementations' buffer dtype.

jorisvandenbossche · 2023-10-03T11:44:08Z

Thanks @pitrou! I was also just opening #273 about this, so this PR can close that issue then ;)

jorisvandenbossche · 2023-10-03T11:46:39Z

I agree we should be careful in fixing this. But I assume a necessary first step is 1) agreeing that this is the correct interpretation and 2) if so, updating all libraries' from_dataframe function to handle both ways of specifying the buffers' dtypes?

And once that is done, we can see how to go about updating the return value of the buffer's dtype (or how long that should take, etc.).

stinodego · 2023-10-03T12:01:48Z

> updating all libraries' from_dataframe function to handle both ways of specifying the buffers' dtypes?

Implementations of from_dataframe should just disregard the data buffer dtype entirely. column.dtype already tells you what to expect in the data buffer (e.g. column dtype STRING will mean an 8-bit UINT data buffer).

If it's implemented like this (everywhere), the data buffer dtype can be changed without breaking from_dataframe.
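To make the suggestion above concrete, here is a minimal sketch of a consumer that derives the data buffer dtype from the column dtype alone. The helper names `implied_data_buffer_dtype` and `resolve_data_buffer_dtype` are hypothetical and not part of the protocol; the dtype tuples follow the protocol's `(kind, bitwidth, format string, endianness)` layout, and only the cases mentioned in this thread are covered.

```python
from enum import IntEnum


class DtypeKind(IntEnum):
    """Kind codes as listed in the interchange protocol's DtypeKind enum."""
    INT = 0
    UINT = 1
    FLOAT = 2
    BOOL = 20
    STRING = 21
    DATETIME = 22
    CATEGORICAL = 23


def implied_data_buffer_dtype(column_dtype):
    """Hypothetical helper: infer the data buffer dtype from the column dtype."""
    kind, bitwidth, format_str, endianness = column_dtype
    if kind == DtypeKind.STRING:
        # String data is a buffer of UTF-8 code units: 8-bit unsigned integers
        # (Arrow C format string "C").
        return (DtypeKind.UINT, 8, "C", endianness)
    if kind in (DtypeKind.INT, DtypeKind.UINT, DtypeKind.FLOAT):
        # Assumption: fixed-width numeric columns store values as-is.
        return column_dtype
    raise NotImplementedError(f"kind {kind!r} is not covered by this sketch")


def resolve_data_buffer_dtype(column_dtype, reported_buffer_dtype):
    """Ignore what the producer reported for the data buffer on purpose."""
    return implied_data_buffer_dtype(column_dtype)


string_column = (DtypeKind.STRING, 8, "u", "=")
buggy_report = string_column                    # producer repeats the column dtype
fixed_report = (DtypeKind.UINT, 8, "C", "=")    # producer reports the real buffer dtype

# Both producers are read the same way, so fixing the producer cannot break
# a consumer written like this.
print(resolve_data_buffer_dtype(string_column, buggy_report))
print(resolve_data_buffer_dtype(string_column, fixed_report))
```

This only applies to the data buffer; it says nothing about how the offsets or validity buffers should be read.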

pitrou · 2023-10-03T12:04:20Z

> Implementations of from_dataframe should just disregard the data buffer dtype entirely. column.dtype already tells you what to expect in the data buffer (e.g. dtype STRING will mean an 8bit UINT data buffer).

Where does the spec spell the expected per-buffer dtypes for each column dtype?

Arrow, for example, has different string-like types: one with 32-bit offsets, and another with 64-bit offsets. If the dataframe consumer disregards the "offsets" buffer dtype, then it will misread at least some of the columns exported by Arrow.
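The offsets concern can be illustrated with a small sketch that uses only the standard library. The function `decode_strings` is a hypothetical stand-in for what a consumer does with the data and offsets buffers; it assumes little-endian offsets and UTF-8 data.

```python
import struct


def decode_strings(data: bytes, offsets: bytes, offsets_bitwidth: int) -> list[str]:
    """Split a UTF-8 data buffer into strings using an offsets buffer."""
    fmt_char = {32: "i", 64: "q"}[offsets_bitwidth]
    count = len(offsets) // (offsets_bitwidth // 8)
    # "<" selects little-endian with standard sizes (4 bytes for i, 8 for q).
    positions = struct.unpack(f"<{count}{fmt_char}", offsets)
    return [data[start:end].decode("utf-8")
            for start, end in zip(positions, positions[1:])]


data = b"foobarbaz"
offsets64 = struct.pack("<4q", 0, 3, 6, 9)  # 64-bit offsets, as in a large string column

correct = decode_strings(data, offsets64, 64)
misread = decode_strings(data, offsets64, 32)  # consumer wrongly assumes 32-bit offsets

print(correct)  # ['foo', 'bar', 'baz']
print(misread)  # a different, longer list: the column was misread
```

A consumer that ignores the offsets buffer dtype has no way to choose between the two interpretations.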

stinodego · 2023-10-03T12:12:03Z

> > Implementations of from_dataframe should just disregard the data buffer dtype entirely. column.dtype already tells you what to expect in the data buffer (e.g. dtype STRING will mean an 8bit UINT data buffer).
>
> Where does the spec spell the expected per-buffer dtypes for each column dtype?
>
> Arrow, for example, has different string-like types: one with 32-bit offsets, and another with 64-bit offsets. If the dataframe consumer disregards the "offsets" buffer dtype, then it will misread at least some of the columns exported by Arrow.

Yes, I was talking specifically about the data buffer dtype. The offsets buffer and the validity buffer dtypes are very relevant.

MarcoGorelli · 2023-10-03T12:26:46Z

> If it's implemented like this (everywhere), the data buffer dtype can be changed without breaking from_dataframe.

I think we should still take some care though. For example, if polars 1.0.0 were to fix its buffer dtype and pandas 2.2.0 were to fix from_dataframe to not use the buffer dtype, then anyone with pandas 2.1.2 and polars 1.0.0 installed who tried plotting with plotly would run into issues.

stinodego · 2023-10-03T12:30:56Z

> I think we should still take some care though

For sure! Let's first get the from_dataframe implementations fixed, then we can update the data buffer dtype whenever we feel comfortable (no real rush).

I have a PR for Polars ready (pola-rs/polars#10787); as you can see, the change can be very small. I was planning on giving it about 6 months after the from_dataframe fixes from pyarrow and pandas come in. But I can be persuaded to hold off on that a bit longer 😬

pitrou · 2023-10-03T13:17:59Z

Is the spec supposed to be stable? There are a bunch of "TODO" and "TBD" statements, and implementations are generally very recent.

jorisvandenbossche · 2023-10-05T09:50:31Z

> Implementations of from_dataframe should just disregard the data buffer dtype entirely. column.dtype already tells you what to expect in the data buffer

Thinking further on this path, if we update all implementations to disregard that information (because indeed you don't need it), shouldn't we then rather remove that information from the protocol?
It's probably a much more difficult change to make (and for implementations to correctly update for that). But at the same time, if all our main (and cross-compat tested) implementations disregard that, we are also not really testing that we return the correct information there.
(Although we could of course quite easily add some tests for exactly those dtypes to https://github.com/data-apis/dataframe-interchange-tests, to ensure there is consistent behaviour.)

pitrou · 2023-10-05T10:00:22Z

Well, if you have a DATETIME column, for example, what is the implied dtype for the data buffer? Is it INT64 perhaps (but it might also be INT32 for a Unix timestamp)? It might be spelled out in the spec, but I'm certainly missing where.

Sidenote: it would be good to list all potential buffer dtypes for each column dtype somewhere, for reference and to let consumers be sure they handle all cases. Also all potential bitwidths and format strings for each DTypeKind.

stinodego · 2023-10-05T10:01:41Z

> Well, if you have a DATETIME column, for example, what is the implied dtype for the data buffer? It might be spelled out in the spec, but I'm certainly missing where.

This is defined by the Arrow C types.
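As a rough illustration of that answer, the sketch below reads the storage width of a temporal column off its Arrow C Data Interface format string. The helper `temporal_storage_bitwidth` is hypothetical; the prefix table reflects the Arrow format strings for dates, times, timestamps and durations, and it is an assumption here that a producer's DATETIME column dtype carries one of them.

```python
# Storage width in bits, keyed by the first three characters of an
# Arrow C Data Interface temporal format string.
_TEMPORAL_PREFIX_BITWIDTH = {
    "tdD": 32,  # date32 [days]
    "tdm": 64,  # date64 [milliseconds]
    "tts": 32,  # time32 [seconds]
    "ttm": 32,  # time32 [milliseconds]
    "ttu": 64,  # time64 [microseconds]
    "ttn": 64,  # time64 [nanoseconds]
    "tss": 64,  # timestamp [seconds], followed by ":" and a timezone
    "tsm": 64,  # timestamp [milliseconds]
    "tsu": 64,  # timestamp [microseconds]
    "tsn": 64,  # timestamp [nanoseconds]
    "tDs": 64,  # duration [seconds]
    "tDm": 64,  # duration [milliseconds]
    "tDu": 64,  # duration [microseconds]
    "tDn": 64,  # duration [nanoseconds]
}


def temporal_storage_bitwidth(format_str: str) -> int:
    """Hypothetical helper: integer width backing a temporal format string."""
    try:
        return _TEMPORAL_PREFIX_BITWIDTH[format_str[:3]]
    except KeyError:
        raise ValueError(f"not a known temporal format string: {format_str!r}") from None


print(temporal_storage_bitwidth("tsn:UTC"))  # 64
print(temporal_storage_bitwidth("tdD"))      # 32
```

So a 32-bit data buffer is possible for dates and time-of-day values, while Arrow timestamps are always backed by 64-bit integers.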

pitrou · 2023-10-05T10:03:40Z

> > Well, if you have a DATETIME column, for example, what is the implied dtype for the data buffer? It might be spelled out in the spec, but I'm certainly missing where.
>
> This is defined by the Arrow C types.

Hmm, I see. Really, this spec is hard to understand as very important details are hidden in docstrings for various properties and methods.

jorisvandenbossche · 2023-10-05T10:10:35Z

In addition, it is also rather under-tested... (all those details about expected buffer dtypes should ideally have shared tests in https://github.com/data-apis/dataframe-interchange-tests IMO)

stinodego · 2023-10-05T11:52:03Z

> > Implementations of from_dataframe should just disregard the data buffer dtype entirely. column.dtype already tells you what to expect in the data buffer
>
> Thinking further on this path, if we update all implementation to disregard that information (because indeed you don't need it), shouldn't we then rather remove that information from the protocol? It's probably a much more difficult change to make (and for implementations to correctly update for that). But at the same time, if all our main (and cross-compat tested) implementations disregard that, we are also not really testing that we return the correct information there. (although we could of course quite easily add some tests for exactly those dtypes to data-apis/dataframe-interchange-tests, to ensure there is consistent behaviour)

Column.get_buffers should still return the dtype of the data buffer. This is valid information. Just because a from_dataframe implementation does not rely on this information does not mean it should be discarded.

In fact, I would like to use the data buffer dtype when implementing from_dataframe for Polars (first create a Series out of each buffer, then construct a new Series out of the data/validity/offsets Series). And I would have, if it weren't for the fact that all existing implementations have the wrong dtype there. This led me to discover the issue in the first place.

[Protocol] Make Column.get_buffers() docstring more explicit

9cea28e