
[SPARK-55363][PS][TESTS] Make ops tests with "decimal_nan" columns ignore NaN vs. None #54146

Closed
ueshin wants to merge 1 commit into apache:master from ueshin:issues/SPARK-55363/ignore_null

Conversation


@ueshin ueshin commented Feb 4, 2026

What changes were proposed in this pull request?

Makes ops tests with "decimal_nan" columns ignore NaN vs. None.

Why are the changes needed?

pandas 3 made assert_frame_equal strictly check NaN vs. None: comparisons that previously only raised a FutureWarning now fail with an AssertionError.

>>> pdf = pd.DataFrame([decimal.Decimal(np.nan)])
>>> psdf = ps.from_pandas(pdf)
>>>
>>> pdf
     0
0  NaN
>>> psdf
      0
0  None
  • pandas < 3
>>> pd.__version__
'2.3.3'
>>> assert_frame_equal(pdf, psdf.to_pandas())
<stdin>:1: FutureWarning: Mismatched null-like values NaN and None found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise.
  • pandas == 3
>>> pd.__version__
'3.0.0'
>>> assert_frame_equal(pdf, psdf.to_pandas())
Traceback (most recent call last):
...
AssertionError: DataFrame.iloc[:, 0] (column name="0") are different

DataFrame.iloc[:, 0] (column name="0") values are different (100.0 %)
[index]: [0]
[left]:  [NaN]
[right]: [None]
At positional index 0, first diff: NaN != None
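As an illustration (not the actual test change), one way for a comparison to tolerate the NaN vs. None difference is to normalize all null-like values to a single sentinel before calling assert_frame_equal; the helper name normalize_nulls below is made up for this sketch.

import decimal

import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

def normalize_nulls(df):
    # Replace every null-like value (None, NaN, Decimal("NaN")) with np.nan
    # so that pandas 3's strict NaN-vs-None check no longer trips.
    return df.where(df.notna(), np.nan)

left = pd.DataFrame([decimal.Decimal(np.nan)])  # shows up as NaN in pandas
right = pd.DataFrame([None], dtype=object)      # what the Spark round-trip produces
assert_frame_equal(normalize_nulls(left), normalize_nulls(right))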

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Updated the related tests.

Was this patch authored or co-authored using generative AI tooling?

No.


github-actions bot commented Feb 4, 2026

JIRA Issue Information

=== Sub-task SPARK-55363 ===
Summary: Make ops tests with "decimal_nan" columns ignore NaN vs. None.
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions


ueshin commented Feb 4, 2026

self.assert_eq(left, right)

def ignore_null(self, col):
    return LooseVersion(pd.__version__) >= LooseVersion("3.0") and col == "decimal_nan"
Contributor

Is decimal_nan the only case where this happens? I think this can matter any time we do some calculation that results in a null-like value.

Member Author

Yes, as far as I've observed so far, it only happens with decimal_nan.

Member Author

We can use subTest to check the parameters in the test loops. WDYT?
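For reference, this is the subTest pattern being suggested, as a minimal self-contained example (the column list and assertion are placeholders, not the real fixtures):

import unittest

class OpsTestExample(unittest.TestCase):
    def test_per_column(self):
        for col in ["float", "decimal", "decimal_nan"]:
            # subTest reports which parameter failed and keeps the loop running.
            with self.subTest(col=col):
                self.assertIsInstance(col, str)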

Contributor

I think most of the issues happen because decimal_nan actually has null-like values, while some of the other columns don't. We probably shouldn't treat the column name as the deciding fact. A better way is probably to check whether the column contains any null-like value, e.g. with col.isna().to_numpy().any()?
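For illustration, a self-contained version of that suggested check (the helper name has_null_like is made up):

import decimal

import numpy as np
import pandas as pd

def has_null_like(col: pd.Series) -> bool:
    # Key the relaxed comparison on the data rather than on the column name.
    return bool(col.isna().to_numpy().any())

has_null_like(pd.Series([decimal.Decimal("1.0"), decimal.Decimal(np.nan)]))  # True
has_null_like(pd.Series([decimal.Decimal("1.0"), decimal.Decimal("2.0")]))   # False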

Member Author

The actual issue is rather how decimal.Decimal(np.nan) is handled, isn't it?

For the other numeric types, None becomes NaN when converting to pandas, which is handled well.
For the other types, None stays None anyway.

But decimal.Decimal(np.nan) is a kind of special value that Spark can't handle well anyway.

It becomes None in pandas API on Spark, since Spark doesn't have a concept of NaN for the decimal type.

>>> pdf = pd.DataFrame([decimal.Decimal(np.nan), None])
>>> pdf
      0
0   NaN
1  None
>>>
>>> psdf = ps.from_pandas(pdf)
>>> psdf
      0
0  None
1  None

Contributor

Okay so for these tests, we are doing operations on a "decimal column" - which is really just object in pandas because pandas does not have a decimal dtype. The psdf output, unfortunately, also has type object because object + anything is object. So there is no way for us to know that we should convert this None to np.nan.

Then I guess this change is fine - we should ignore the null differences from operations on "decimal" data - which is really just object.
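A quick sketch of the dtype situation described above (the data values are made up; "decimal_nan" is the fixture column name from the PR description):

import decimal

import numpy as np
import pandas as pd

pdf = pd.DataFrame({"decimal_nan": [decimal.Decimal("1.5"), decimal.Decimal(np.nan)]})
print(pdf.dtypes)          # decimal_nan    object
print((pdf + pdf).dtypes)  # object + anything is still object, so the null flavor can't be inferred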


ueshin commented Feb 4, 2026

The link to the build in the build status is missing for some reason.

@gaogaotiantian
Contributor

Actually, I'm doing this in #54100. I think I've seen other cases that are caused by null comparison, but maybe I'm wrong?

@zhengruifeng
Contributor

thanks, merged to master

