BUG(string dtype): Arithmetic operations between Series with string dtype index #61425

Open
rhshadrach opened this issue May 10, 2025 · 2 comments
Assignees
Labels
API - Consistency Internal Consistency of API/Behavior Bug Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@rhshadrach
Member

rhshadrach commented May 10, 2025

Similar to #61099, but concerning lhs + rhs. Alignment in general is heavily involved here as well. One thing to note is that, unlike comparison operations, arithmetic operations favor the lhs.index dtype, assuming no coercion is necessary.

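import numpy as np
import pandas as pd
import pyarrow as pa

# Candidate index dtypes to compare against each other.
dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string()),
]
# Same labels in both indexes; only the index dtype (and so the missing value) differs.
idx1 = pd.Series(["a", np.nan, "b"], dtype=dtypes[1])
idx2 = pd.Series(["a", np.nan, "b"], dtype=dtypes[3])
df1 = pd.DataFrame({"idx": idx1, "value": [1, 2, 3]}).set_index("idx")
df2 = pd.DataFrame({"idx": idx2, "value": [1, 2, 3]}).set_index("idx")
# Swap the operands to see whether the result depends on which side is lhs.
print(df1["value"] + df2["value"])
print(df2["value"] + df1["value"])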

For string dtypes, I've observed the following:

  • NaN vs NA generally aligns, the value propagated is always NA
  • NaN vs NA does not align when the NA arises from ArrowExtensionArray
  • NaN vs None (object) aligns, the value propagated is from lhs
  • NA vs None does not align
  • PyArrow-NA + ArrowExtensionArray results in object dtype (NAs do align)
  • Python-NA + PyArrow-NA results in PyArrow-NA; contrary to the left being preferred
  • Python-NA + PyArrow-NA results in object type (NAs do align)
  • When lhs and rhs have indices that are both object dtype:
    • NaN vs None aligns and propagates the lhs value.
    • NA vs None does not align
    • NA vs NaN does not align
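To make the object-dtype observations above easier to re-check on a given pandas version, here is a minimal sketch. The helper name `probe` is hypothetical (it is not part of pandas), and the sketch only covers the case where both indexes are object dtype. It reports whether the two missing labels were treated as the same key and which missing labels ended up in the result index. No expected output is shown because the behavior is exactly what is under discussion and may differ between versions.

```python
import numpy as np
import pandas as pd


def probe(lhs_missing, rhs_missing):
    """Add two Series whose object-dtype indexes differ only in the missing label.

    Hypothetical helper for illustration; not a pandas API.
    """
    lhs = pd.Series([1, 2, 3], index=pd.Index(["a", lhs_missing, "b"], dtype=object))
    rhs = pd.Series([1, 2, 3], index=pd.Index(["a", rhs_missing, "b"], dtype=object))
    result = lhs + rhs
    # If the two missing labels were treated as the same key, the result keeps
    # three rows; otherwise the union of the labels has four rows.
    aligned = len(result) == 3
    # Collect the missing labels that survive in the result index.
    missing_labels = [label for label in result.index if pd.isna(label)]
    return aligned, missing_labels


if __name__ == "__main__":
    pairs = [(np.nan, None), (None, np.nan), (pd.NA, None), (pd.NA, np.nan)]
    for lhs_missing, rhs_missing in pairs:
        aligned, labels = probe(lhs_missing, rhs_missing)
        print(f"{lhs_missing!r} vs {rhs_missing!r}: aligned={aligned}, labels={labels!r}")
```

Swapping the order inside each pair shows whether the propagated label depends on the lhs operand.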

I think the two main things we need to decide are:

  1. How should NA vs NaN vs None align?
  2. When they do align, which value should be propagated?

A few properties I think are crucial:

  • Alignment should only depend on value and left-vs-right operand, not storage.
  • Alignment should be transitive.
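As a rough illustration of the transitivity property, the sketch below encodes the "aligns / does not align" observations from the list earlier in this issue as plain Python tables and checks them. It simplifies in two ways, both assumptions on my part: alignment is treated as symmetric (operand order is ignored), and the string-dtype table merges observations that come from different dtype combinations. The names `OBSERVED_STRING`, `OBSERVED_OBJECT`, and `is_transitive` are hypothetical.

```python
from itertools import permutations

# True means "the two missing labels aligned", as reported in this issue.
OBSERVED_STRING = {
    frozenset({"NaN", "NA"}): True,    # "NaN vs NA generally aligns"
    frozenset({"NaN", "None"}): True,  # "NaN vs None (object) aligns"
    frozenset({"NA", "None"}): False,  # "NA vs None does not align"
}
OBSERVED_OBJECT = {
    frozenset({"NaN", "None"}): True,   # "NaN vs None aligns"
    frozenset({"NA", "None"}): False,   # "NA vs None does not align"
    frozenset({"NA", "NaN"}): False,    # "NA vs NaN does not align"
}


def is_transitive(aligns):
    """Return True if a~b and b~c always imply a~c in the given table."""
    labels = sorted({label for pair in aligns for label in pair})
    for a, b, c in permutations(labels, 3):
        if (
            aligns[frozenset({a, b})]
            and aligns[frozenset({b, c})]
            and not aligns[frozenset({a, c})]
        ):
            return False
    return True


print("string dtypes transitive:", is_transitive(OBSERVED_STRING))
print("object dtype transitive:", is_transitive(OBSERVED_OBJECT))
```

Under these simplifications the string-dtype table fails the check, because NA aligns with NaN and NaN aligns with None while NA does not align with None.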

If we do decide on aligning between different values, a natural order is None < NaN < NA. However, the most backwards-compatible option would be to have None vs NaN be operand-dependent, with NA always propagating when present.
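The two propagation options can be written down as small pure-Python rules to compare them side by side. This is only a sketch of the proposals as I read them, not pandas behavior: `Missing`, `propagate_ordered`, and `propagate_backcompat` are hypothetical names, and both functions assume the two labels have already been aligned.

```python
from enum import IntEnum


class Missing(IntEnum):
    """Missing-value kinds, ranked by the proposed order None < NaN < NA."""

    NONE = 0
    NAN = 1
    NA = 2


def propagate_ordered(lhs, rhs):
    """Option 1: the higher-ranked missing value always wins."""
    return max(lhs, rhs)


def propagate_backcompat(lhs, rhs):
    """Option 2: NA wins when present; otherwise the lhs value is kept."""
    if Missing.NA in (lhs, rhs):
        return Missing.NA
    return lhs


for lhs, rhs in [(Missing.NONE, Missing.NAN), (Missing.NAN, Missing.NONE)]:
    print(lhs.name, "+", rhs.name, "->",
          propagate_ordered(lhs, rhs).name, "(ordered),",
          propagate_backcompat(lhs, rhs).name, "(backwards-compatible)")
```

The two rules only differ for None vs NaN: the ordered rule gives NaN regardless of operand order, while the backwards-compatible rule returns whichever value is on the left.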

@rhshadrach rhshadrach added Bug Strings String extension data type and string data Needs Discussion Requires discussion from core team before further action API - Consistency Internal Consistency of API/Behavior labels May 10, 2025
@samruddhibaviskar11

take

@samruddhibaviskar11

samruddhibaviskar11 commented May 11, 2025

Hi @rhshadrach
I’ve dug into this issue on pandas 2.2.2 and here’s what I’ve confirmed:
import numpy as np
import pandas as pd
import pyarrow as pa

# Check pandas version
print(pd.__version__)

dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow"),  # na_value removed for older pandas versions
    pd.StringDtype("python"),  # na_value removed for older pandas versions
    pd.StringDtype("pyarrow"),  # na_value removed for older pandas versions
    pd.StringDtype("python"),  # na_value removed for older pandas versions
    pd.ArrowDtype(pa.string()),
]
idx1 = pd.Series(["a", np.nan, "b"], dtype=dtypes[1])
idx2 = pd.Series(["a", np.nan, "b"], dtype=dtypes[3])
df1 = pd.DataFrame({"idx": idx1, "value": [1, 2, 3]}).set_index("idx")
df2 = pd.DataFrame({"idx": idx2, "value": [1, 2, 3]}).set_index("idx")
print(df1["value"] + df2["value"])
print(df2["value"] + df1["value"])

output
2.2.2

idx
a       2
<NA>    4
b       6
Name: value, dtype: int64
idx
a       2
<NA>    4
b       6
Name: value, dtype: int64

While the arithmetic operations are working in my environment, I noticed that the index dtypes for df1 and df2 are slightly different despite using pd.StringDtype("pyarrow") for both, which might contribute to the potential inconsistencies when using the 'pyarrow' storage backend.

I'll share any additional findings or reproducible examples I come across. Looking forward to contributing to a resolution for this issue.
