BUG: Assigning pd.NA to StringDtype column causes data corruption and pyarrow error #64339

Merged
jorisvandenbossche merged 9 commits into pandas-dev:main from kjmin622:issue64320
Mar 11, 2026

@kjmin622
Contributor

Comment on lines +2725 to +2734
# TODO: Remove this part when pa.if_else is fixed (GH#64320)
if isinstance(left, pa.ChunkedArray) and (
    pa.types.is_string(left.type) or pa.types.is_large_string(left.type)
):
    left = left.combine_chunks()

if isinstance(right, pa.ChunkedArray) and (
    pa.types.is_string(right.type) or pa.types.is_large_string(right.type)
):
    right = right.combine_chunks()
Contributor Author

PyArrow's pc.if_else misbehaves with chunked string arrays, producing PyArrow errors and corrupted data. This fix calls combine_chunks() on string/large_string ChunkedArrays in _if_else before invoking pc.if_else.

Member

Can you check if combine_chunks() might fail if the string chunks together would go over the 2GB limit of what string (in contrast to large_string) can represent in a single array?

So potentially we have to put this in a try-except.
Or, we might also want to check if one of the chunks has a non-zero offset (because the bug only happens in that case, and not for chunked string arrays in general)

Contributor Author

@jorisvandenbossche Thanks for your review! I've applied the review.

Member

Thanks for the update, that looks good now!

@kjmin622 kjmin622 marked this pull request as draft February 27, 2026 15:40
@kjmin622 kjmin622 marked this pull request as ready for review February 27, 2026 17:25
@jorisvandenbossche jorisvandenbossche changed the title from "BUG: Assigning pd.NA to StringDtype column causes "Unknown error: Wrapping .. failed" after pd.concat with PyArrow" to "BUG: Assigning pd.NA to StringDtype column causes data corruption and pyarrow error" Mar 11, 2026
@jorisvandenbossche jorisvandenbossche merged commit 84c7b19 into pandas-dev:main Mar 11, 2026
42 of 45 checks passed
@jorisvandenbossche jorisvandenbossche added this to the 3.0.2 milestone Mar 11, 2026
@jorisvandenbossche jorisvandenbossche added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels Mar 11, 2026
@jorisvandenbossche
Member

@meeseeksdev backport to 3.0.x

@jorisvandenbossche
Member

Thanks @kjmin622 !

Development

Successfully merging this pull request may close these issues.

BUG: Assigning pd.NA to StringDtype column causes "Unknown error: Wrapping .. failed" after pd.concat with PyArrow
