[SPARK-55363][PS][TESTS] Make ops tests with "decimal_nan" columns ignore NaN vs. None #54146
ueshin wants to merge 1 commit into apache:master
Conversation
JIRA Issue Information: Sub-task SPARK-55363
(This comment was automatically generated by GitHub Actions.)
self.assert_eq(left, right)

def ignore_null(self, col):
    return LooseVersion(pd.__version__) >= LooseVersion("3.0") and col == "decimal_nan"
Is decimal_nan the only case where this happens? I think this can matter any time we do a calculation that results in a null-like value.
Yes, as far as I've observed so far, it only happens with decimal_nan.
We can use subTest to check the parameters in the test loops. WDYT?
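A minimal, self-contained sketch of the subTest idea (the column list and the loop body are placeholders, not the actual test code):

import unittest

class OpsTestSketch(unittest.TestCase):
    # Hypothetical column fixtures; the real tests iterate over pandas-on-Spark columns.
    cols = ["int", "float", "decimal_nan"]

    def test_ops(self):
        for col in self.cols:
            with self.subTest(col=col):
                # On failure, unittest reports which col was being tested,
                # and the loop continues with the remaining columns.
                self.assertIsNotNone(col)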
I think most of the issues happened because decimal_nan actually has null-like values; some of the columns don't, so we probably shouldn't rely on the column name as a given. A better way is probably to check whether the column has any null-like value, e.g. with col.isna().to_numpy().any()?
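A sketch of that check (assuming col is a pandas or pandas-on-Spark Series; the helper name is made up):

def has_null_like(col):
    # True if the column contains any null-like value (NaN, None, NaT).
    return bool(col.isna().to_numpy().any())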
The actual issue should be how decimal.Decimal(np.nan) is handled?
- For the other numeric types, None will be NaN when converting to pandas, which is well handled (see the sketch after this list).
- For the other types, None will be None anyway.
  - The string type is a bit different; previously it was object dtype with None, but now it's StringDtype with NaN for null, which was fixed in [SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types #54015.
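A quick illustration of the numeric case (a sketch, assuming a running Spark session; ps is pyspark.pandas):

import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1.0, None]})   # double column containing a null
pdf = psdf.to_pandas()                    # the null comes back as NaN
print(pdf["a"].tolist())                  # [1.0, nan]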
But decimal.Decimal(np.nan) is a kind of special value that Spark can't handle well anyway?
It will be None in pandas API on Spark, as Spark doesn't have a concept of NaN for the decimal type.
>>> pdf = pd.DataFrame([decimal.Decimal(np.nan), None])
>>> pdf
      0
0   NaN
1  None
>>> psdf = ps.from_pandas(pdf)
>>> psdf
      0
0  None
1  None
Okay, so for these tests we are doing operations on a "decimal" column, which is really just object in pandas because pandas does not have a decimal dtype. The psdf output, unfortunately, also has dtype object because object + anything is object, so there is no way for us to know that we should convert this None to np.nan.
Then I guess this change is fine: we should ignore the null differences from operations on "decimal" data, which is just object.
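If it helps to picture it, one way to ignore that difference in a comparison (a sketch, not the exact change in this PR) is to normalize null-like values on both sides before asserting:

import numpy as np
import pandas as pd

def normalize_nulls(pobj):
    # Replace any null-like value (None or NaN) with np.nan so that
    # assert_frame_equal no longer reports a NaN-vs-None mismatch.
    return pobj.mask(pobj.isna(), np.nan)

left = pd.DataFrame({"a": [np.nan, 1]}, dtype=object)
right = pd.DataFrame({"a": [None, 1]}, dtype=object)
pd.testing.assert_frame_equal(normalize_nulls(left), normalize_nulls(right))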
The link to the build in the build status is missing for some reason.
Actually, I'm doing this in #54100. I think I've seen other cases that are caused by null comparison, but maybe I'm wrong?
thanks, merged to master
What changes were proposed in this pull request?
Makes ops tests with "decimal_nan" columns ignore NaN vs. None.

Why are the changes needed?
pandas 3 made assert_frame_equal strictly check NaN vs. None.
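For illustration, a minimal repro sketch (not taken from the test suite):

import numpy as np
import pandas as pd

left = pd.DataFrame({"a": [np.nan, 1]}, dtype=object)   # null-like stored as NaN
right = pd.DataFrame({"a": [None, 1]}, dtype=object)    # null-like stored as None

# Passes on pandas < 3, where both values are treated as equivalent nulls;
# raises on pandas 3, where NaN and None are compared strictly.
pd.testing.assert_frame_equal(left, right)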
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Updated the related tests.
Was this patch authored or co-authored using generative AI tooling?
No.