BUG(string dtype): Arithmetic operations between Series with string dtype index #61425

rhshadrach · 2025-05-10T14:43:33Z

Similar to #61099, but concerning lhs + rhs. Alignment in general is heavily involved here as well. One thing to note is that, unlike in comparison operations, arithmetic operations favor the lhs.index dtype, assuming no coercion is necessary.

import numpy as np
import pandas as pd
import pyarrow as pa

dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string()),
]
# Pick any two entries of dtypes to compare how their indexes align.
idx1 = pd.Series(["a", np.nan, "b"], dtype=dtypes[1])
idx2 = pd.Series(["a", np.nan, "b"], dtype=dtypes[3])
df1 = pd.DataFrame({"idx": idx1, "value": [1, 2, 3]}).set_index("idx")
df2 = pd.DataFrame({"idx": idx2, "value": [1, 2, 3]}).set_index("idx")
# Run the addition in both orders, since the result can depend on operand order.
print(df1["value"] + df2["value"])
print(df2["value"] + df1["value"])

For string dtypes, I've observed the following:

- NaN vs NA generally aligns; the value propagated is always NA
- NaN vs NA does not align when the NA arises from ArrowExtensionArray
- NaN vs None (object) aligns; the value propagated is from lhs
- NA vs None does not align
- PyArrow-NA + ArrowExtensionArray results in object dtype (NAs do align)
- Python-NA + PyArrow-NA results in PyArrow-NA; contrary to the left being preferred
- Python-NA + PyArrow-NA results in object type (NAs do align)

When lhs and rhs have indices that are both object dtype:

- NaN vs None aligns and propagates the lhs value.
- NA vs None does not align
- NA vs NaN does not align
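
To make observations like these easier to compare across operand orders, a small helper along the following lines could be used. This is only a sketch: `summarize_add` is a hypothetical name, not a pandas API, and the example sticks to object-dtype indexes so it runs without pyarrow. No expected output is shown because the result may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def summarize_add(lhs: pd.Series, rhs: pd.Series) -> dict:
    """Add two Series and report how their indexes ended up aligned.

    Hypothetical helper for this discussion; not part of pandas.
    """
    result = lhs + rhs
    return {
        "index_dtype": result.index.dtype,
        "index_values": list(result.index),
        "n_rows": len(result),
        "values": result.tolist(),
    }


# Object-dtype indexes: NaN on one side, None on the other.
lhs = pd.Series([1, 2, 3], index=pd.Index(["a", np.nan, "b"], dtype=object))
rhs = pd.Series([1, 2, 3], index=pd.Index(["a", None, "b"], dtype=object))

# Compare both operand orders, since propagation may depend on which side is lhs.
for label, (left, right) in {"lhs + rhs": (lhs, rhs), "rhs + lhs": (rhs, lhs)}.items():
    print(label, summarize_add(left, right))
```

If the missing-value keys align, `n_rows` stays at 3 and `index_values` shows which missing value was propagated; if they do not align, the result gains an extra row.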

I think the two main things we need to decide are:

- How should NA vs NaN vs None align?
- When they do align, which value should be propagated?

A few properties I think are crucial:

- Alignment should only depend on value and left-vs-right operand, not storage.
- Alignment should be transitive.
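
As a rough illustration of the transitivity property, the pairwise observations reported for string dtypes can be written down as a table and checked mechanically. This is a sketch under simplifying assumptions: the table encodes only the general string-dtype observations from this issue (NaN vs NA aligns, NaN vs None aligns, NA vs None does not), ignores the ArrowExtensionArray exception, and does not call pandas at all. The names `ALIGNS`, `aligns`, and `transitivity_violations` are made up for this example.

```python
from itertools import permutations

# Pairwise alignment as reported in this issue for string dtypes.
# This table is transcribed from the observations, not re-measured here.
ALIGNS = {
    frozenset({"NaN", "NA"}): True,
    frozenset({"NaN", "None"}): True,
    frozenset({"NA", "None"}): False,
}


def aligns(x: str, y: str) -> bool:
    """Return True if the two missing-value kinds are treated as the same key."""
    return x == y or ALIGNS[frozenset({x, y})]


def transitivity_violations(values):
    """List triples (x, y, z) where x~y and y~z hold but x~z does not."""
    return [
        (x, y, z)
        for x, y, z in permutations(values, 3)
        if aligns(x, y) and aligns(y, z) and not aligns(x, z)
    ]


violations = transitivity_violations(["None", "NaN", "NA"])
print(violations)
```

Under this table, None aligns with NaN and NaN aligns with NA, but None does not align with NA, so the check reports violations.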

If we do decide on aligning between different values, a natural order is None < NaN < NA. However, the most backwards-compatible option would be to have None vs NaN be operand-dependent, with NA always propagating when present.
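
The two propagation options above can be sketched in plain Python to make the difference concrete. This is illustrative only: `NA` below is a hypothetical stand-in for `pd.NA`, and `propagate_ordered` / `propagate_compat` are made-up names, not pandas behavior or API.

```python
import math


class _NAType:
    """Hypothetical stand-in for pd.NA, so the sketch needs no pandas."""

    def __repr__(self) -> str:
        return "NA"


NA = _NAType()


def missing_rank(value) -> int:
    """Rank missing values according to the order None < NaN < NA."""
    if value is None:
        return 0
    if isinstance(value, float) and math.isnan(value):
        return 1
    if value is NA:
        return 2
    raise ValueError(f"not a missing value: {value!r}")


def propagate_ordered(lhs, rhs):
    """Option 1: the higher-ranked missing value wins, regardless of side."""
    # max() returns the first argument on ties, so equal kinds keep lhs.
    return max(lhs, rhs, key=missing_rank)


def propagate_compat(lhs, rhs):
    """Option 2: NA always wins; otherwise (None vs NaN) the lhs value wins."""
    if lhs is NA or rhs is NA:
        return NA
    return lhs


nan = float("nan")
print(propagate_ordered(None, nan), propagate_compat(None, nan))
print(propagate_ordered(nan, None), propagate_compat(nan, None))
```

The two options only differ for None on the left and NaN on the right: the ordered rule yields NaN, while the operand-dependent rule keeps None.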


samruddhibaviskar11 · 2025-05-11T17:28:52Z

take

samruddhibaviskar11 · 2025-05-11T23:53:36Z

Hi @rhshadrach
I’ve dug into this issue on pandas 2.2.2 and here’s what I’ve confirmed:
import numpy as np
import pandas as pd
import pyarrow as pa

# Check pandas version
print(pd.__version__)

dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow"),  # na_value removed for older pandas versions
    pd.StringDtype("python"),  # na_value removed for older pandas versions
    pd.StringDtype("pyarrow"),  # na_value removed for older pandas versions
    pd.StringDtype("python"),  # na_value removed for older pandas versions
    pd.ArrowDtype(pa.string()),
]
idx1 = pd.Series(["a", np.nan, "b"], dtype=dtypes[1])
idx2 = pd.Series(["a", np.nan, "b"], dtype=dtypes[3])
df1 = pd.DataFrame({"idx": idx1, "value": [1, 2, 3]}).set_index("idx")
df2 = pd.DataFrame({"idx": idx2, "value": [1, 2, 3]}).set_index("idx")
print(df1["value"] + df2["value"])
print(df2["value"] + df1["value"])

Output:
2.2.2

idx
a 2
NA 4
b 6
Name: value, dtype: int64
idx
a 2
NA 4
b 6
Name: value, dtype: int64

While the arithmetic operations are working in my environment, I noticed that the index dtypes for df1 and df2 are slightly different despite using pd.StringDtype("pyarrow") for both, which might contribute to the potential inconsistencies when using the 'pyarrow' storage backend.

I'll share any additional findings or reproducible examples I come across. Looking forward to contributing to a resolution for this issue.

rhshadrach added Bug Strings Needs Discussion API - Consistency labels May 10, 2025

github-actions bot assigned samruddhibaviskar11 May 11, 2025
