[SPARK-55321][PYTHON][TESTS] Ignore null difference when comparing ps df/series #54100
gaogaotiantian wants to merge 2 commits into apache:master from
Conversation
JIRA Issue Information: === Test SPARK-55321 === (This comment was automatically generated by GitHub Actions.)
I'm afraid this ignores the nulls too broadly, and I'm worried we may miss things we can / should fix.
@ueshin +1, I also feel we should try to fix as many as possible before we ignore this difference.
Yeah .. |
Sure, we can do a more strict check for now. But we should also be fully aware of what this comparison does today (pandas 2.x): we don't check null differences now, because that is the default behavior of the pandas testing util. After we upgraded to pandas 3, the testing util changed, so it now surfaces all the null differences. We are not fighting the behavior difference between pandas 2 and pandas 3; we are trying to change the once-expected behavior for pyspark.pandas. Basically, even with pandas 2.x, we already generate
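For context on the null differences being discussed, here is a minimal illustration (not from the PR itself) of why the distinct missing-value sentinels do not compare equal even though pandas recognizes all of them as missing:

```python
import numpy as np
import pandas as pd

# Three different "missing value" sentinels, depending on where a value
# came from (plain Python, NumPy, or pandas' nullable dtypes).
print(None is np.nan)    # False
print(np.nan is pd.NA)   # False
print(None == np.nan)    # False -- distinct objects, and NaN never compares equal

# What they share is that pandas recognizes all of them as missing:
s = pd.Series([None, np.nan, pd.NA], dtype="object")
print(s.isna().tolist())  # [True, True, True]
```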
What changes were proposed in this pull request?
For all the numeric tests in data_type_ops, ignore the difference in null values (None vs np.nan vs pd.NA, etc.).
Why are the changes needed?
pyspark.pandas always generates a different null value than pandas (PySpark only has one null value internally). However, pandas 3 made its internal testing utility stricter, so our tests started to fail. We can relax the comparison on our side for now, along the lines sketched below.
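A rough sketch of this kind of relaxation, assuming a hypothetical helper (`assert_eq_ignore_nulls` is not the name used in the PR; the actual change lives in the data_type_ops test utilities): check that the missingness masks match, then compare only the non-null values.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_series_equal


def assert_eq_ignore_nulls(left: pd.Series, right: pd.Series) -> None:
    """Hypothetical helper: compare two Series while ignoring *which* null
    sentinel (None, np.nan, pd.NA, ...) marks the missing entries."""
    # The missingness pattern itself must still match exactly.
    assert_series_equal(left.isna(), right.isna(), check_names=False)
    # Then compare only the non-null values, with indexes reset so the
    # remaining positions line up.
    assert_series_equal(
        left.dropna().reset_index(drop=True),
        right.dropna().reset_index(drop=True),
        check_dtype=False,
        check_names=False,
    )


# Same data, different null sentinels: this passes, while a plain
# assert_series_equal on the original object-dtype Series may not.
assert_eq_ignore_nulls(
    pd.Series([1, None, 3], dtype="object"),
    pd.Series([1, np.nan, 3], dtype="object"),
)
```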
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tested locally with pandas 3; a lot of tests now pass because of this change.
Was this patch authored or co-authored using generative AI tooling?
No.