fix: cast numbers to float64 in descriptive statistics to avoid integer summation overflow errors #3527
base: main
@ikrommyd - Thanks, but I don't think we should do this. I'd rather leave it to the user who knows their data better. Firstly, binary floats cannot represent decimal fractional values accurately, except for those which are an integer divided by a power of 2. Secondly, this implementation would trigger device to host data copies on GPU.
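The representability point above can be checked with a short sketch using only the Python standard library. The sample values (0.75 and 0.1) are illustrative choices, not taken from the PR.

```python
from fractions import Fraction

# Fraction(float) recovers the exact value stored in the binary float.
stored_three_quarters = Fraction(0.75)  # 3 / 2**2, a power-of-2 denominator
stored_one_tenth = Fraction(0.1)        # 1/10 has no finite binary expansion

# 0.75 is stored exactly; 0.1 is stored as the nearest representable value.
is_exact_075 = stored_three_quarters == Fraction(3, 4)
is_exact_01 = stored_one_tenth == Fraction(1, 10)

print(is_exact_075, is_exact_01)
```

Note that this only concerns fractional inputs; it does not by itself say anything about how the summation is accumulated.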
Why would it trigger copies to host? It's using [...] Regarding the binary float representation, I get that, but NumPy is doing it anyway. I'm fine with leaving it as is, but I think the point was also to mimic NumPy behavior when these functions were implemented, especially since we're doing a NEP 18 dispatch to such functions.
@ianna I defer to your judgement on this one, but my two cents are that we probably should follow NumPy here given that it doesn't seem particularly sensible to quantise [...]
My whole point here is that even a simple mean calculation [...]
@agoose77 mean, var, and std are not quantized as integers in any case. You get a float back either way. The whole point is how you sum the elements of the array. For a mean calculation, if you sum all the elements of the array as integers, you get the wrong mean when the values are large enough to overflow the integer type. That's why NumPy casts to float64 before doing the summation.
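A minimal NumPy-only sketch of the overflow described above. The array contents are an illustrative choice (three copies of 2**62 as int64), not data from the issue, and the code does not use Awkward Array itself.

```python
import numpy as np

# Three int64 values whose true sum (3 * 2**62) exceeds the int64 maximum.
values = np.array([2**62, 2**62, 2**62], dtype=np.int64)

# Integer accumulation wraps around silently for array reductions.
int_sum = values.sum()
naive_mean = int_sum / len(values)

# Casting to float64 before summing avoids the wraparound.
float_sum = values.astype(np.float64).sum()
float_mean = float_sum / len(values)

# np.mean on integer input accumulates in float64 by default.
numpy_mean = np.mean(values)

print(int_sum, naive_mean, float_mean, numpy_mean)
```

Here the integer sum comes out negative, so the mean derived from it has the wrong sign, while the float64 path returns 2**62 as expected.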
Thanks, @agoose77
@ianna I agree that copying to host memory would be a concern, but I don't think it actually happens. I can't find anything explicit in the documentation, though. Do you have a source? I was under the impression that you can do dtype casting on the GPU.
Fixes #3525
As explained in the issue, we should cast ints and bools to float64 in descriptive statistics, as NumPy does, to avoid integer summation overflow errors. The tests pass, but I'm not 100% sure that I'm not missing an edge case that `ak.values_astype` doesn't cover.