-
Notifications
You must be signed in to change notification settings - Fork 2k
fix: correct weight handling in approx_percentile_cont_with_weight #19941
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
4b35f82
11b5eac
f3596d7
f969a63
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2029,11 +2029,12 @@ statement ok | |
| INSERT INTO t1 VALUES (TRUE); | ||
|
|
||
| # ISSUE: https://github.com/apache/datafusion/issues/12716 | ||
| # This test verifies that approx_percentile_cont_with_weight does not panic when given 'NaN' and returns 'inf' | ||
| # This test verifies that approx_percentile_cont_with_weight does not panic when given 'NaN' | ||
| # With weight=0, the data point does not contribute, so result is NULL | ||
| query R | ||
| SELECT approx_percentile_cont_with_weight(0, 0) WITHIN GROUP (ORDER BY 'NaN'::DOUBLE) FROM t1 WHERE t1.v1; | ||
| ---- | ||
| Infinity | ||
| NULL | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this conform with other implementations? What does clickhouse do in this case?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With weight=0 and NaN value, both systems would effectively have no data (weight=0 means no contribution in both). ClickHouse would return NaN (for a Float64 result) while DataFusion returns NULL. |
||
|
|
||
| statement ok | ||
| DROP TABLE t1; | ||
|
|
@@ -2352,21 +2353,21 @@ e 115 | |
| query TI | ||
| SELECT c1, approx_percentile_cont_with_weight(c2, 0.95) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1 | ||
| ---- | ||
| a 74 | ||
| a 65 | ||
| b 68 | ||
| c 123 | ||
| d 124 | ||
| e 115 | ||
| c 122 | ||
| d 123 | ||
| e 110 | ||
sesteves marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
|
||
| # approx_percentile_cont_with_weight with centroids | ||
| query TI | ||
| SELECT c1, approx_percentile_cont_with_weight(c2, 0.95, 200) WITHIN GROUP (ORDER BY c3) AS c3_p95 FROM aggregate_test_100 GROUP BY 1 ORDER BY 1 | ||
| ---- | ||
| a 74 | ||
| a 65 | ||
| b 68 | ||
| c 123 | ||
| d 124 | ||
| e 115 | ||
| c 122 | ||
| d 123 | ||
| e 110 | ||
|
|
||
| # csv_query_sum_crossjoin | ||
| query TTI | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need to change the count to a f64? isn't u64 more precise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using approx_percentile_cont_with_weight, weights can be:
The TDigest algorithm computes percentiles using the total weight/count to determine rank:
let rank = q * self.count;
If count were a u64, it couldn't accurately represent the sum of fractional weights, leading to inaccurate percentile calculations.
This is also consistent with ClickHouse's TDigest implementation: https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L47
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sesteves
Interestingly in ClickHouse I see the count is updated with the current sum
https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L191
count = sum + l_count; // Update count, it might be different due to += inaccuracyI see now this is similar to how
TDigest::new_with_centroidsets count to the existing weighthttps://github.com/alamb/datafusion/blob/4b35f827725ddd47f739b3731b6a3ff2a10a3c2d/datafusion/functions-aggregate-common/src/tdigest.rs#L125-L124
So it seems like
countis actually some sort of weighted total rather than a count which confused meNo changes needed. Thank you for the explanation