Skip to content

approx_percentile_cont_with_weight produces incorrect results due to wrong weight handling in TDigest #19940

@sesteves

Description

@sesteves

Describe the bug

The approx_percentile_cont_with_weight function produces incorrect weighted percentile results. The bug is in the TDigest implementation where new_with_centroid() sets count: 1 regardless of the actual centroid weight, while the weight is used elsewhere in centroid merging. This mismatch corrupts the percentile calculation.

Root cause

In datafusion/functions-aggregate-common/src/tdigest.rs, the new_with_centroid function incorrectly hardcodes count: 1:
pub fn new_with_centroid(max_size: usize, centroid: Centroid) -> Self {
TDigest {
centroids: vec![centroid.clone()],
max_size,
sum: centroid.mean * centroid.weight,
count: 1, // BUG: should be centroid.weight
max: centroid.mean,
min: centroid.mean,
}
}
The count field should reflect the actual weight of the centroid, as it's used in estimate_quantile() to compute the rank: let rank = q * self.count. Since the weight is correctly used in sum but not in count, the algorithm produces inconsistent results.
Additionally, count should be f64 (not u64) to properly support fractional weights, consistent with the ClickHouse TDigest implementation (https://github.com/ClickHouse/ClickHouse/blob/927af1255adb37ace1b95cc3ec4316553b4cb4b4/src/AggregateFunctions/QuantileTDigest.h#L71-L87).

To Reproduce

-- Weighted percentile with increasing weights (1,2,3,...,9)
-- Higher values have higher weights, so weighted median should be > 5
SELECT approx_percentile_cont_with_weight(value, weight, 0.5) as weighted
FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8), (9,9)) AS t(value, weight);
-- Compare with unweighted percentile
SELECT approx_percentile_cont(value, 0.5) as unweighted
FROM (VALUES (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6,6), (7,7), (8,8), (9,9)) AS t(value, weight);

Actual Behavior

weighted: 3
unweighted: 5
The weighted result (3) is lower than the unweighted median (5), even though higher values have proportionally higher weights. The result is neither correct weighted percentile nor correct unweighted percentile.

Expected behavior

weighted: 6 or 7
unweighted: 5
With weights (1,2,3,4,5,6,7,8,9):

  • Total weight = 45
  • At 0.5 quantile, rank = 22.5
  • Cumulative weights: 1→1, 2→3, 3→6, 4→10, 5→15, 6→21, 7→28
  • Rank 22.5 falls between value 6 (cumulative 21) and value 7 (cumulative 28)
  • Expected weighted median ≈ 6-7

The weighted median should be pulled toward higher values since they have higher weights.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions