
fix: cast numbers to float64 in descriptive statistics to avoid integer summation overflow errors #3527


Open · wants to merge 1 commit into base: main

Conversation

@ikrommyd ikrommyd commented Jun 4, 2025

Fixes #3525

As explained in the issue, we should cast ints and bools to float64 to avoid integer summation overflow in descriptive statistics, like NumPy does. The tests pass, but I'm not 100% sure I'm not missing an edge case that ak.values_astype doesn't cover.
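The overflow being fixed is easy to reproduce with plain NumPy integer sums (a minimal sketch of the failure mode; the actual Awkward fix is not shown here):

```python
import numpy as np

# Three values of 2**62: their exact sum (3 * 2**62) exceeds the
# int64 maximum (2**63 - 1), so an integer-dtype sum silently wraps.
a = np.array([2**62, 2**62, 2**62], dtype=np.int64)

wrapped = a.sum(dtype=np.int64)       # wraps around to -(2**62)
correct = a.astype(np.float64).sum()  # 3 * 2**62, exactly representable in float64
```

Casting to float64 first trades exact integer arithmetic for a much larger range, which is the same trade-off NumPy makes in its reductions.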

@ianna ianna left a comment

@ikrommyd - Thanks, but I don't think we should do this. I'd rather leave it to the user, who knows their data better. Firstly, binary floats cannot represent decimal fractional values accurately, except for those which are an integer divided by a power of 2. Secondly, this implementation would trigger device-to-host data copies on GPU.

ikrommyd commented Jun 4, 2025

Why would it trigger copies to host? It's using self._backend.nplike.asarray(self._data, dtype=dtype) internally, and I believe cupy.asarray does the cast on the device without copying to host.

Regarding the binary float representation, I get that, but NumPy does it anyway.
NumPy explicitly does this for such functions: https://github.com/numpy/numpy/blob/ff1d6cc78322898c02339a529005eb358aeba327/numpy/_core/_methods.py#L160-L170
If we leave it to the user, you expect users to know when an integer summation is going to overflow and take care of it themselves, which I don't think is going to happen. They will mostly get incorrect results without realizing it.
In this specific issue it produced a NaN, which was pretty striking, but you can also get a valid-looking positive variance/std that is simply wrong, and users will trust it. In the mean calculation there isn't even an invalid value to judge from. I think it's dangerous to spit out incorrect numbers and expect users to figure it out.
I don't think a library should expect its users to account for integer overflow. It took me a good few minutes to understand why ak.var was giving an incorrect result.
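For context, the linked NumPy code boils down to something like the following (a simplified sketch, not NumPy's actual implementation; mean_like_numpy is a hypothetical name used only for illustration):

```python
import numpy as np

def mean_like_numpy(arr):
    # Sketch of NumPy's _mean dtype handling: integer and bool inputs
    # are accumulated as float64 so the running sum cannot wrap around.
    dtype = arr.dtype
    if issubclass(dtype.type, (np.integer, np.bool_)):
        dtype = np.dtype(np.float64)
    return arr.sum(dtype=dtype) / arr.size
```

With this, mean_like_numpy(np.array([2**62, 2**62], dtype=np.int64)) returns 2**62 as a float, whereas an int64-accumulated sum would wrap first and give a wrong answer.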

I'm fine with leaving it as is, but I think the point was also to mimic NumPy behavior when these functions were implemented, especially since we do a NEP 18 dispatch to them.
Per the project's README:
"Arrays are dynamically typed, but operations on them are compiled and fast. Their behavior coincides with NumPy when array dimensions are regular and generalizes when they're not."

agoose77 commented Jun 4, 2025

@ianna I defer to your judgement on this one, but my two cents are that we should probably follow NumPy here, given that it doesn't seem particularly sensible to quantise std and var as integers. We naturally promote x / 2 to a floating-point array, so it would seem reasonable to do the same for var et al.
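The promotion precedent is easy to check in plain NumPy:

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.int64)

# True division always yields a floating-point result in NumPy, even for
# integer inputs; the argument is that var/std (and the sums inside them)
# should follow the same convention.
print((x / 2).dtype)  # float64
```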

ikrommyd commented Jun 4, 2025

My whole point here is that even a simple mean calculation, sum(values) / number_of_values, can give you a wrong number if sum(values) overflows int64 (even worse for int32). The user will just trust that this is the right mean when it isn't.
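Concretely (NumPy shown for brevity; the same arithmetic applies to any integer-accumulated mean):

```python
import numpy as np

# Four values of 2**62 sum to exactly 2**64, which wraps to 0 in int64.
a = np.full(4, 2**62, dtype=np.int64)

naive_mean = a.sum(dtype=np.int64) / a.size  # 0.0 -- silently wrong
true_mean  = a.mean()                        # np.mean upcasts to float64 first
```

The naive mean is 0.0 with no warning or NaN to tip the user off, while np.mean returns the correct 2**62.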

ikrommyd commented Jun 4, 2025

@agoose77 mean, var, and std are not quantized as integers in any case; you get a float back either way. The whole point is how you sum the elements of the array. For a mean calculation, if you sum all the elements as integers, you get a wrong mean when the values are large enough to overflow the integer type. That's why NumPy casts to float64 before doing the summation.
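The NaN in the linked issue comes from exactly this: each squared value still fits in int64, but their integer sum wraps negative, so the variance goes negative and the square root produces NaN (a minimal NumPy reconstruction of the failure, not the Awkward code path):

```python
import numpy as np

a = np.array([2**31, 2**31 + 1, 2**31 + 2], dtype=np.int64)

sum_x  = a.sum(dtype=np.int64)        # fine: about 3 * 2**31
sum_x2 = (a * a).sum(dtype=np.int64)  # each square (~2**62) fits in int64,
                                      # but the sum of three wraps negative
n = a.size
var_int = sum_x2 / n - (sum_x / n) ** 2  # comes out negative
with np.errstate(invalid="ignore"):
    std_int = np.sqrt(var_int)           # NaN: sqrt of a negative number

std_ok = a.astype(np.float64).std()      # ~0.816, the correct answer
```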

ianna commented Jun 4, 2025

> @ianna I defer to your judgement on this one, but my two-cents are that we probably should follow NumPy here given that it doesn't seem particularly sensible to quantise std and var as integers. We naturally promote x / 2 to a floating-point array, so it would seem reasonable to do the same for var et al.

Thanks, @agoose77

ikrommyd commented Jun 4, 2025

@ianna I take your "copying to host memory" point seriously, but I don't think it actually happens here. I can't find anything explicit in the documentation, though. Do you have a source? I was under the impression that dtype casting can be done on the GPU.

Successfully merging this pull request may close these issues.

ak.std returns nan wrongly when being applied on int array