
fix: cast numbers to float64 in descriptive statistics to avoid integer summation overflow errors #3527


Open · wants to merge 1 commit into base: main

Conversation

@ikrommyd ikrommyd commented Jun 4, 2025

Fixes #3525

As explained in the issue, we should cast ints and bools to float64 to avoid integer summation overflow in descriptive statistics, like NumPy does. The tests pass, but I'm not 100% sure I'm not missing an edge case that ak.values_astype doesn't cover.
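The overflow being fixed is easy to reproduce with plain NumPy integer sums (a minimal sketch of the failure mode; the actual Awkward fix is not shown here):

```python
import numpy as np

# Three values of 2**62: their exact sum (3 * 2**62) exceeds the
# int64 maximum (2**63 - 1), so an integer-dtype sum silently wraps.
a = np.array([2**62, 2**62, 2**62], dtype=np.int64)

wrapped = a.sum(dtype=np.int64)       # wraps around to -(2**62)
correct = a.astype(np.float64).sum()  # 3 * 2**62, exactly representable in float64
```

Casting to float64 first trades exact integer arithmetic for a much larger range, which is the same trade-off NumPy makes in its reductions.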

@ianna ianna left a comment

@ikrommyd - Thanks, but I don't think we should do this. I'd rather leave it to the user, who knows their data better. Firstly, binary floats cannot represent decimal fractional values accurately, except for those which are an integer divided by a power of 2. Secondly, this implementation would trigger device-to-host data copies on GPU.

ikrommyd commented Jun 4, 2025

Why would it trigger copies to host? It's using self._backend.nplike.asarray(self._data, dtype=dtype) internally, and I believe cupy.asarray does the cast on the device without copying to host.

Regarding the binary float representation, I get that, but NumPy does it anyway.
NumPy explicitly does this for such functions: https://github.com/numpy/numpy/blob/ff1d6cc78322898c02339a529005eb358aeba327/numpy/_core/_methods.py#L160-L170
If we leave it to the user, you expect users to know when an integer summation is going to overflow and take care of it themselves, which I don't think is going to happen. They will mostly get incorrect results without realizing it.
In this specific issue it produced a NaN, which was pretty striking, but you can also get a valid-looking positive variance/std that is simply wrong, and users will trust it. In the mean calculation there isn't even an invalid value to judge from. I think it's dangerous to spit out incorrect numbers and expect users to figure it out.
I don't think a library should expect its users to account for integer overflow. It took me a good few minutes to understand why ak.var was giving an incorrect result.
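For context, the linked NumPy code boils down to something like the following (a simplified sketch, not NumPy's actual implementation; mean_like_numpy is a hypothetical name used only for illustration):

```python
import numpy as np

def mean_like_numpy(arr):
    # Sketch of NumPy's _mean dtype handling: integer and bool inputs
    # are accumulated as float64 so the running sum cannot wrap around.
    dtype = arr.dtype
    if issubclass(dtype.type, (np.integer, np.bool_)):
        dtype = np.dtype(np.float64)
    return arr.sum(dtype=dtype) / arr.size
```

With this, mean_like_numpy(np.array([2**62, 2**62], dtype=np.int64)) returns 2**62 as a float, whereas an int64-accumulated sum would wrap first and give a wrong answer.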

I'm fine with leaving it as is, but I think the point was also to mimic NumPy behavior when these functions were implemented, especially since we do a NEP 18 dispatch to them.
Per the project's README:
"Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they're not."

agoose77 commented Jun 4, 2025

@ianna I defer to your judgement on this one, but my two cents are that we should probably follow NumPy here, given that it doesn't seem particularly sensible to quantise std and var as integers. We naturally promote x / 2 to a floating-point array, so it would seem reasonable to do the same for var et al.
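The promotion precedent is easy to check in plain NumPy:

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.int64)

# True division always yields a floating-point result in NumPy, even for
# integer inputs; the argument is that var/std (and the sums inside them)
# should follow the same convention.
print((x / 2).dtype)  # float64
```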

ikrommyd commented Jun 4, 2025

My whole point here is that even a simple mean calculation, sum(values) / number_of_values, can give you a wrong number if sum(values) overflows int64 (even worse for int32). The user will just trust that this is the right mean when it isn't.
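Concretely (NumPy shown for brevity; the same arithmetic applies to any integer-accumulated mean):

```python
import numpy as np

# Four values of 2**62 sum to exactly 2**64, which wraps to 0 in int64.
a = np.full(4, 2**62, dtype=np.int64)

naive_mean = a.sum(dtype=np.int64) / a.size  # 0.0 -- silently wrong
true_mean  = a.mean()                        # np.mean upcasts to float64 first
```

The naive mean is 0.0 with no warning or NaN to tip the user off, while np.mean returns the correct 2**62.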

ikrommyd commented Jun 4, 2025

@agoose77 mean, var, and std are not quantized as integers in any case; you get a float back either way. The whole point is how you sum the elements of the array. For a mean calculation, if you sum all the elements as integers, you get a wrong mean when the values are large enough to overflow the integer type. That's why NumPy casts to float64 before doing the summation.
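The NaN in the linked issue comes from exactly this: each squared value still fits in int64, but their integer sum wraps negative, so the variance goes negative and the square root produces NaN (a minimal NumPy reconstruction of the failure, not the Awkward code path):

```python
import numpy as np

a = np.array([2**31, 2**31 + 1, 2**31 + 2], dtype=np.int64)

sum_x  = a.sum(dtype=np.int64)        # fine: about 3 * 2**31
sum_x2 = (a * a).sum(dtype=np.int64)  # each square (~2**62) fits in int64,
                                      # but the sum of three wraps negative
n = a.size
var_int = sum_x2 / n - (sum_x / n) ** 2  # comes out negative
with np.errstate(invalid="ignore"):
    std_int = np.sqrt(var_int)           # NaN: sqrt of a negative number

std_ok = a.astype(np.float64).std()      # ~0.816, the correct answer
```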

ianna commented Jun 4, 2025

> @ianna I defer to your judgement on this one, but my two-cents are that we probably should follow NumPy here given that it doesn't seem particularly sensible to quantise std and var as integers. We naturally promote x / 2 to a floating-point array, so it would seem reasonable to do the same for var et al.

Thanks, @agoose77

ikrommyd commented Jun 4, 2025

@ianna I take your "copying to host memory" point seriously, but I don't think it actually happens here. I can't find anything explicit in the documentation, though. Do you have a source? I was under the impression that dtype casting can be done on the GPU.

Successfully merging this pull request may close these issues.

ak.std returns nan wrongly when being applied on int array