[SPARK-55321][PYTHON][TESTS] Ignore null difference when comparing ps df/series #54100

Open
gaogaotiantian wants to merge 2 commits into apache:master from gaogaotiantian:fix-numeric-null
@gaogaotiantian
Contributor

What changes were proposed in this pull request?

For all the numeric tests in data_type_ops, ignore the difference in null values (None vs np.nan vs pd.NA etc.).
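As a rough illustration of what "ignoring the difference in null values" can mean, here is a minimal sketch in plain pandas. The helper names `null_insensitive_values` and `assert_equal_ignoring_nulls` are hypothetical and are not taken from this PR's diff; the sketch also assumes the values are numeric scalars.

```python
import numpy as np
import pandas as pd


def null_insensitive_values(series: pd.Series) -> list:
    """Hypothetical helper: map every missing value (None, np.nan, pd.NA) to None.

    Assumes scalar elements, as in the numeric tests discussed here.
    """
    return [None if pd.isna(v) else v for v in series.astype(object).tolist()]


def assert_equal_ignoring_nulls(left: pd.Series, right: pd.Series) -> None:
    """Hypothetical helper: compare values while treating all nulls as equal."""
    assert null_insensitive_values(left) == null_insensitive_values(right)


# Three series that differ only in which null marker they use.
with_none = pd.Series([1.0, None, 3.0], dtype=object)
with_nan = pd.Series([1.0, np.nan, 3.0])
with_na = pd.Series([1, pd.NA, 3], dtype="Int64")

assert_equal_ignoring_nulls(with_none, with_nan)
assert_equal_ignoring_nulls(with_nan, with_na)
```

This sketch compares values only and deliberately ignores dtype, index, and name, so it is looser than a real test utility would need to be.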

Why are the changes needed?

pyspark.pandas always generates a different null value than pandas does (PySpark has only one null value internally). However, pandas 3 makes its internal testing utility stricter, so our tests have started to fail. We can relax the comparison on our side for now.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Locally with pandas 3; a lot of tests now pass because of this change.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions

github-actions bot commented Feb 2, 2026

JIRA Issue Information

=== Test SPARK-55321 ===
Summary: Ignore null difference when we compare results from numeric operations
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@ueshin
Member

ueshin commented Feb 5, 2026

I'm afraid this ignores the nulls too broadly, and I'm worried we may miss things we can or should fix.
I'd go with #54146 first and see how many cases it fixes. WDYT? Also cc @HyukjinKwon @zhengruifeng

@zhengruifeng
Contributor

@ueshin +1, I also feel we should try to fix as many cases as possible before we ignore this difference.

@HyukjinKwon
Member

Yeah ..

@gaogaotiantian
Contributor Author

gaogaotiantian commented Feb 5, 2026

Sure, we can do a stricter check for now. But we should also be fully aware that this comparison is what we already do today (pandas 2.x). We don't check null differences now, because ignoring them is the default behavior of the pandas testing util. After the upgrade to pandas 3, the testing util changed, so it now reports all the null differences. We are not dealing with a behavior difference between pandas 2 and pandas 3; we are trying to change the once-expected behavior of pyspark.pandas.

Basically, even with pandas 2.x, we already generate None where pandas generates np.nan, but the pandas testing util considers them the same. Now we still generate None where pandas generates np.nan, but the pandas testing util thinks it's wrong.
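To make the None versus np.nan point concrete, here is a small sketch in plain pandas. The object-dtype series `spark_like` is only a stand-in for what a pyspark.pandas result might look like after conversion, which is an assumption made for illustration; it does not call pyspark or the pandas testing util.

```python
import numpy as np
import pandas as pd

# Stand-in for a pyspark.pandas result: the null is the Python object None.
spark_like = pd.Series([1.0, None], dtype=object)
# Plain pandas result: float64 column where the null is NaN.
pandas_like = pd.Series([1.0, np.nan])

spark_null = spark_like.iloc[1]
pandas_null = pandas_like.iloc[1]

# The two null markers are different objects of different types...
print(type(spark_null).__name__, type(pandas_null).__name__)

# ...but both count as missing, and they sit in the same positions.
print(pd.isna(spark_null), pd.isna(pandas_null))
print(spark_like.isna().equals(pandas_like.isna()))
```

Comparing the `isna()` masks like this checks that nulls appear in the same places without caring which marker represents them, which is one possible middle ground between a fully strict and a fully relaxed comparison.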


4 participants