BUG: Assigning pd.NA to StringDtype column causes data corruption and pyarrow error #64339
Conversation
…pping .. failed' after pd.concat with PyArrow
pandas/core/arrays/arrow/array.py
```python
# TODO: Remove this part when pa.if_else is fixed (GH#64320)
if isinstance(left, pa.ChunkedArray) and (
    pa.types.is_string(left.type) or pa.types.is_large_string(left.type)
):
    left = left.combine_chunks()

if isinstance(right, pa.ChunkedArray) and (
    pa.types.is_string(right.type) or pa.types.is_large_string(right.type)
):
    right = right.combine_chunks()
```
PyArrow's pc.if_else misbehaves with chunked string arrays, causing PyArrow errors and data corruption; this fix calls combine_chunks() on string/large_string ChunkedArrays in _if_else before invoking pc.if_else.
Can you check whether combine_chunks() might fail if the string chunks together would exceed the 2 GB limit of what string (in contrast to large_string) can represent in a single array? If so, we would potentially have to put this in a try-except.
Alternatively, we might want to check whether one of the chunks has a non-zero offset, because the bug only happens in that case, not for chunked string arrays in general.
@jorisvandenbossche Thanks for your review! I've applied the suggested changes.
Thanks for the update, that looks good now!
Co-authored-by: Joris Van den Bossche <[email protected]>
@meeseeksdev backport to 3.0.x
…umn causes data corruption and pyarrow error
Thanks @kjmin622 !
…pe column causes data corruption and pyarrow error) (#64526) Co-authored-by: Jeongmin Gil <[email protected]> Co-authored-by: Joris Van den Bossche <[email protected]>
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
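For illustration, a bug-fix entry of this kind in the whatsnew file typically looks like the following (the section placement and exact wording here are assumptions, not the actual entry from this PR):

```rst
Strings
^^^^^^^

- Bug in :class:`Series` with ``StringDtype`` backed by PyArrow where assigning ``pd.NA`` after :func:`concat` could corrupt data or raise a PyArrow error (:issue:`64339`)
```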