-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Description
Describe the bug
The approx_percentile_cont_with_weight function produces incorrect weighted percentile results. The bug is in the TDigest implementation where new_with_centroid() sets count: 1 regardless of the actual centroid weight, while the weight is used elsewhere in centroid merging. This mismatch corrupts the percentile calculation.
Root cause
In datafusion/functions-aggregate-common/src/tdigest.rs, the new_with_centroid function incorrectly hardcodes count: 1:
pub fn new_with_centroid(max_size: usize, centroid: Centroid) -> Self {
TDigest {
centroids: vec![centroid.clone()],
max_size,
sum: centroid.mean * centroid.weight,
count: 1, // BUG: should be centroid.weight
max: centroid.mean,
min: centroid.mean,
}
}
The count field should reflect the actual weight of the centroid, as it's used in estimate_quantile() to compute the rank: let rank = q * self.count. Since the weight is correctly used in sum but not in count, the algorithm produces inconsistent results.
Additionally, count should be f64 (not u64) to properly support fractional weights, consistent with the ClickHouse TDigest implementation (https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L71-L87).
To Reproduce
-- Weighted percentile with increasing weights (1,2,3,...,9)
-- Higher values have higher weights, so weighted median should be > 5
SELECT approx_percentile_cont_with_weight(value, weight, 0.5) as weighted
FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8), (9,9)) AS t(value, weight);
-- Compare with unweighted percentile
SELECT approx_percentile_cont(value, 0.5) as unweighted
FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8), (9,9)) AS t(value, weight);
Actual Behavior
weighted: 3
unweighted: 5
The weighted result (3) is lower than the unweighted median (5), even though higher values have proportionally higher weights. The result is neither correct weighted percentile nor correct unweighted percentile.
Expected behavior
weighted: 6 or 7
unweighted: 5
With weights (1,2,3,4,5,6,7,8,9):
- Total weight = 45
- At 0.5 quantile, rank = 22.5
- Cumulative weights: 1→1, 2→3, 3→6, 4→10, 5→15, 6→21, 7→28
- Rank 22.5 falls between value 6 (cumulative 21) and value 7 (cumulative 28)
- Expected weighted median ≈ 6-7
The weighted median should be pulled toward higher values since they have higher weights.
Additional context
No response