-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Protocol] Make Column.get_buffers() docstring more explicit #272
base: main
Are you sure you want to change the base?
Conversation
Several implementations got ``Column.get_buffers()`` wrong by assuming the buffers dtypes would be the same as the column dtype. Clarify to eliminate any ambiguity. See apache/arrow#37598 for example.
I do agree, but some care needs to be taken - currently Thoughts on the approach suggested in pandas-dev/pandas#54781 (comment) ? I think we should probably wait 1-2 years before updating implementations' buffer dtype |
I agree we should be careful in fixing this. But I assume a start that is needed anyway is 1) agreeing that this is the correct interpretation and 2) if so, updating all libraries' And once that is done, we can see how to go about updating the return value of the buffer's dtype (or how long that should take, etc) |
Implementations of If it's implemented like this (everywhere), the data buffer dtype can be changed without breaking |
Where does the spec spell the expected per-buffer dtypes for each column dtype? Arrow, for example, has different string-like types: one with 32-bit offsets, and another with 64-bit offsets. If the dataframe consumer disregards the "offsets" buffer dtype, then it will misread at least some of the columns exported by Arrow. |
Yes, I was talking specifically about the data buffer dtype. The offsets buffer and the validity buffer dtypes are very relevant. |
I think we should still take some care though - e.g. if polars 1.0.0 were to fix its buffer dtype and pandas 2.2.0 were to fix |
For sure! Let's first get the I have a PR for Polars ready (pola-rs/polars#10787), as you can see the change can be very small. I was planning on giving it about 6 months after the |
Is the spec supposed to be stable? There are a bunch of "TODO" and "TBD" statements, and implementations are generally very recent. |
Thinking further on this path, if we update all implementation to disregard that information (because indeed you don't need it), shouldn't we then rather remove that information from the protocol? |
Well, if you have a Sidenote: it would be good to list all potential buffer dtypes for each column dtype somewhere, for reference and to let consumers be sure they handle all cases. Also all potential bitwidths and format strings for each DTypeKind. |
This is defined by the Arrow C types. |
Hmm, I see. Really, this spec is hard to understand as very important details are hidden in docstrings for various properties and methods. |
In addition it also rather under-tested .. (all those details about expected buffer dtypes should ideally have shared tests in https://github.com/data-apis/dataframe-interchange-tests IMO) |
In fact, I would like to use the data buffer dtype when implementing |
Several implementations got
Column.get_buffers()
wrong by assuming the buffers dtypes would be the same as the column dtype. Clarify to eliminate any ambiguity.See apache/arrow#37598 for example.
Closes #273