feat: add rounding logic and scale zero fix parse_decimal to match parse_string_to_decimal_native behavior #7179
Conversation
Force-pushed from bbd54d4 to 7b33527
Force-pushed from 7e598c9 to bef2992
The title of this PR now says it adds rounding logic and makes the logic consistent, which sounds good. However, I don't see a ticket that describes the problem.
This PR's description still says it Closes apache/datafusion#10315, and I did see changes in the e-notation parsing code, but I didn't see any tests 🤔 I don't think we can merge code without tests.
I am sorry to be so pedantic, but arrow-rs is used by many projects now, so evaluating and minimizing downstream impacts is very important. I am trying to avoid the overhead of dealing with releasing regressions.
I also apologize for the length between review cycles, but as we have mentioned many times, our review bandwidth is very limited.
@@ -850,7 +850,16 @@ fn parse_e_notation<T: DecimalType>(
    }

    if exp < 0 {
-       result = result.div_wrapping(base.pow_wrapping(-exp as _));
+       let result_with_scale = result.div_wrapping(base.pow_wrapping(-exp as _));
Does this change the behavior of parsing e-notation? If so, I didn't see any tests.
Yes. I missed porting the tests while splitting the large PR. It now rounds instead of the current behavior, which truncates. I'll add the tests.
@@ -598,7 +599,20 @@ mod tests {
    0_i128
);
assert_eq!(
    parse_decimal::<Decimal128Type>("0", 38, 0)?,
    parse_string_to_decimal_native::<Decimal128Type>("0", 0)?,
Having the same behavior in these two functions seems like a reasonable change to me
Once we are able to move to using parse_decimal for casting and deprecate parse_string_to_decimal_native, these tests will be changed to assert the value shown in the message section of the assert.
@@ -1286,7 +1286,7 @@ mod tests {
    assert_eq!("53.002666", lat.value_as_string(1));
    assert_eq!("52.412811", lat.value_as_string(2));
    assert_eq!("51.481583", lat.value_as_string(3));
-   assert_eq!("12.123456", lat.value_as_string(4));
+   assert_eq!("12.123457", lat.value_as_string(4));
@alamb you can see the behavior change in this test of the arrow-csv reader, which uses parse_decimal.
Added an issue in arrow-rs #7355
Added e-notation tests
I apologize for creating this extra overhead; I will be more careful in the future.
I understand, and will keep this in mind in the future.
Thanks @himadripal
-       result = result.div_wrapping(base.pow_wrapping(-exp as _));
+       let result_with_scale = result.div_wrapping(base.pow_wrapping(-exp as _));
+       let result_with_one_scale_up =
+           result.div_wrapping(base.pow_wrapping(-exp.add_wrapping(1) as _));
I assume this logic is correct, but just for me to understand: e.g. for 12345e-5, would exp be -5? Why is this adding 1?
exp in the parse_e_notation method is overridden a couple of times based on which direction the decimal point needs to shift and whether the original string has a fractional part (e.g. 1.23e-2 has 2 fractional digits).
Before this check, exp represents the number of digits to be removed or added. In this case, exp = -3.
Now, result_with_scale = 12 and result_with_one_scale_up = 123.
To round up or down, we need to capture the digit that comes right after the last digit of the result, in this case 3. We get it as:
rounding_digit = result_with_one_scale_up - result_with_scale * 10
rounding_digit = 123 - 12 * 10 = 3
If rounding_digit >= 5, we add 1 to the result; otherwise the result remains intact.
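As a rough sketch of the arithmetic described above (plain i128 math with illustrative names, not the actual generic DecimalType code from the PR, and ignoring sign handling and overflow):

// Sketch of the rounding idea: truncate to the target scale, peek at the
// first dropped digit, and add 1 when it is >= 5. Illustrative only.
fn round_half_up(digits: i128, exp: i32) -> i128 {
    let base: i128 = 10;
    // Truncated to the target scale, e.g. 12345 / 10^3 = 12.
    let result_with_scale = digits / base.pow((-exp) as u32);
    // Same value with one extra digit kept, e.g. 12345 / 10^2 = 123.
    let result_with_one_scale_up = digits / base.pow((-(exp + 1)) as u32);
    // The digit right after the last kept digit, e.g. 123 - 12 * 10 = 3.
    let rounding_digit = result_with_one_scale_up - result_with_scale * 10;
    if rounding_digit >= 5 {
        result_with_scale + 1
    } else {
        result_with_scale
    }
}

fn main() {
    assert_eq!(round_half_up(12345, -3), 12); // rounding digit 3, stays 12
    assert_eq!(round_half_up(1265, -1), 127); // rounding digit 5, rounds up
}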
result = result.div_wrapping(base.pow_wrapping(fractionals as u32))
}
//add one if >=5
if rounding_digit >= 5 {
Wondering where >= 5 came from?
First we figure out the rounding_digit, the digit that comes right after the last digit of the final result (before rounding is applied). If the rounding_digit is >= 5, we add 1 to round the result up; otherwise it remains the same.
"1265E-4" with scale 3 -> 0.127: at scale 3 the unrounded number would be 0.126 and the rounding digit is 5; since the rounding digit is >= 5, the result becomes 0.127.
"1264E-4" with scale 3 -> 0.126: here the rounding digit is 4, which is less than 5, so there is no need to add 1.
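Expressed against the parse_decimal API, the two examples above would hypothetically look like this (a sketch, not a test taken from the PR; it assumes the crate paths arrow_cast::parse::parse_decimal and arrow_array::types::Decimal128Type and the rounding behavior this PR introduces):

use arrow_array::types::Decimal128Type;
use arrow_cast::parse::parse_decimal;

fn main() {
    // "1265E-4" at precision 38, scale 3: rounding digit is 5, so 0.126
    // rounds up to 0.127, stored as the raw value 127.
    assert_eq!(parse_decimal::<Decimal128Type>("1265E-4", 38, 3).unwrap(), 127);
    // "1264E-4" at precision 38, scale 3: rounding digit is 4, value stays 0.126.
    assert_eq!(parse_decimal::<Decimal128Type>("1264E-4", 38, 3).unwrap(), 126);
}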
>= 5 is being used for rounding half up to the nearest value at the target scale. With scale 1:
2.47 -> 2.5
2.44 -> 2.4
Ah ok, makes sense
Perhaps we could add a comment in the code to explain this point to future readers who may have the same question
Thanks @himadripal, I do like tests. Perhaps we can also add tests for negative decimal roundings?
Tests for very big or very small numbers would also be beneficial. Here is what comes to mind, with the help of ChatGPT:
// **Very Large Numbers**
assert_eq!(round_to_places(1e15, 2), 1e15); // Large integer should remain unchanged
assert_eq!(round_to_places(9999999999999.987, 2), 10000000000000.00);
assert_eq!(round_to_places(-1e15, 3), -1e15);
// **Very Small Numbers (Near Zero)**
assert_eq!(round_to_places(1e-15, 10), 0.0000000000); // Rounds to zero at precision 10
assert_eq!(round_to_places(-1e-15, 10), -0.0000000000); // Rounds to zero
assert_eq!(round_to_places(0.000000000123456, 12), 0.000000000123); // Should retain up to 12 decimal places
// **Extreme Edge Cases**
assert_eq!(round_to_places(f64::MAX, 2), f64::MAX); // Maximum f64 value should remain the same
assert_eq!(round_to_places(f64::MIN, 2), f64::MIN); // Minimum f64 value should remain the same
assert!(round_to_places(f64::NAN, 2).is_nan()); // NaN should remain NaN
assert_eq!(round_to_places(f64::INFINITY, 2), f64::INFINITY); // Infinity should remain Infinity
assert_eq!(round_to_places(f64::NEG_INFINITY, 2), f64::NEG_INFINITY); // Negative Infinity should remain unchanged
@himadripal -- I am preparing to create a new release hopefully tomorrow. Can you please address @comphead 's testing suggestions soon so we can get this PR into that release?
@comphead and @alamb there are existing edge case tests here; I'll add more from @comphead's list today. One more clarifying point: although it is not mandatory for these to go together, my goal for this change was to enable scientific notation support in DataFusion (datafusion#10315). For that we also need to merge #7191, which moves the cast to use parse_decimal.
fix the type of precision and scale
fix the clippy errors
Force-pushed from 648a7a0 to 136c0a6
@Omega359 suggested on a call this morning that we also test this PR using the DataFusion tests to see if we can see any differences, including the extended tests
I can try to run these later today if no one has a chance to do so. I have a PR to do this partially here:
("-1e-15", "-0.0000000000", 10), | ||
("1e-15", "0.0000000000", 10), | ||
("1e15", "1000000000000000", 2), | ||
]; |
Let's also add a couple more tests on negatives:
// Edge cases
assert_eq!(round_to_places(-2.5, 0), -3.0); // Negative half rounding up
assert_eq!(round_to_places(-2.49, 0), -2.0); // Just below rounding negative
// Whole numbers (should remain unchanged)
assert_eq!(round_to_places(-100.0, 2), -100.0);
there are existing tests for such scenarios here. Nevertheless, I'll add these as well.
Thanks @himadripal, lgtm.
However I agree with @alamb and @Omega359 that we should have another set of tests running in DataFusion on top of this branch before the merge.
Can anyone re-trigger the build? It failed with
I restarted the integration test job to get a clean run
I also tested this branch with DataFusion's tests locally and they seem to have worked so far. I am running the extended tests now too:

warning: `parquet` (lib) generated 2 warnings
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.30s
     Running bin/sqllogictests.rs (target/debug/deps/sqllogictests-924fc434a23455ca)
Completed 113 test files in 4 seconds
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$
My benchmark results seem to show that this PR significantly slows down decimal parsing, anywhere from 3% to 17% slower than main. I will rerun to see if I can reproduce these numbers.

cargo bench --bench parse_decimal

Here are the results I got:

group                      fix_parse_decimal_for_rounding_scale_zero    main
-----                      -----------------------------------------    ----
-.123 1.10 22.8±0.01ns ? ?/sec 1.00 20.7±0.03ns ? ?/sec
-00.1 1.06 33.5±0.02ns ? ?/sec 1.00 31.6±0.08ns ? ?/sec
-12. 1.03 38.6±0.04ns ? ?/sec 1.00 37.3±0.14ns ? ?/sec
-123 1.07 39.2±0.07ns ? ?/sec 1.00 36.8±0.16ns ? ?/sec
-123. 1.05 42.1±0.09ns ? ?/sec 1.00 40.3±0.24ns ? ?/sec
-123.1 1.07 41.1±0.04ns ? ?/sec 1.00 38.4±0.27ns ? ?/sec
-123.123 1.12 32.5±0.03ns ? ?/sec 1.00 29.0±0.03ns ? ?/sec
-123.1234 1.14 34.3±0.09ns ? ?/sec 1.00 30.0±0.02ns ? ?/sec
-12345678912345678.1234 1.21 88.8±0.13ns ? ?/sec 1.00 73.4±0.06ns ? ?/sec
-99999999999999999.999 1.19 86.8±0.08ns ? ?/sec 1.00 72.8±0.08ns ? ?/sec
.123 1.09 21.8±0.03ns ? ?/sec 1.00 19.9±0.02ns ? ?/sec
0.0000123 1.14 26.3±0.08ns ? ?/sec 1.00 23.1±0.03ns ? ?/sec
00.1 1.08 33.1±0.16ns ? ?/sec 1.00 30.7±0.06ns ? ?/sec
12. 1.07 38.4±0.06ns ? ?/sec 1.00 36.0±0.13ns ? ?/sec
123 1.10 39.1±0.16ns ? ?/sec 1.00 35.7±0.09ns ? ?/sec
123. 1.10 43.3±5.95ns ? ?/sec 1.00 39.4±0.14ns ? ?/sec
123.1 1.11 41.2±0.13ns ? ?/sec 1.00 37.2±0.09ns ? ?/sec
123.123 1.14 32.3±0.03ns ? ?/sec 1.00 28.2±0.09ns ? ?/sec
123.1234 1.17 34.2±0.04ns ? ?/sec 1.00 29.3±0.04ns ? ?/sec
12345678912345678.1234 1.24 90.0±0.09ns ? ?/sec 1.00 72.8±0.13ns ? ?/sec
99999999999999999.999 1.22 88.1±0.11ns ? ?/sec 1.00 72.0±0.07ns ? ?/sec
Thanks @alamb for pointing it out. Trying to profile it to find some hot spots.
@alamb are there instructions on how to get this benchmark result against the main branch? I can then make changes and test locally to see if I can improve the time.
nvm, found the instructions
Which issue does this PR close?
A few important considerations:
- parse_string_to_decimal_native does not have support for e-notation.
- parse_string_to_decimal_native does rounding at scale, not truncation.
- parse_decimal, an existing method, has e-notation support and is used elsewhere.
- The plan is to move casting to use parse_decimal to get support for e-notation.

This PR is the 2nd one to break up #6905; this one adds rounding logic to parse_decimal to match the behavior of the existing parse_string_to_decimal_native.
Closes #.
Rationale for this change
At present, string to decimal conversion does not support e-notation; in arrow, parse_string_to_decimal_native is called for generic string to decimal conversion. parse_decimal, on the other hand, is used from the generic parse method and has e-notation support. This PR adds rounding and scale-zero handling to match the behavior of the parse_string_to_decimal_native method. Then we can replace the parse_string_to_decimal_native call with parse_decimal. This way, we will get e-notation support too.
What changes are included in this PR?
Are there any user-facing changes?
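As a rough illustration of the user-facing change, based on the 2.47 -> 2.5 example from the review discussion above (a sketch, not an actual test in this PR; it assumes the arrow_cast::parse::parse_decimal signature and the new rounding behavior):

use arrow_array::types::Decimal128Type;
use arrow_cast::parse::parse_decimal;

fn main() {
    // Parsing "2.47" at precision 38, scale 1:
    //   previously parse_decimal truncated to 2.4 (raw value 24);
    //   with the rounding logic in this PR it rounds to 2.5 (raw value 25).
    assert_eq!(parse_decimal::<Decimal128Type>("2.47", 38, 1).unwrap(), 25);
    // "2.44" stays 2.4 (raw value 24), since the rounding digit is below 5.
    assert_eq!(parse_decimal::<Decimal128Type>("2.44", 38, 1).unwrap(), 24);
}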