Skip to content

Conversation

@sesteves
Copy link

The approx_percentile_cont_with_weight function was producing incorrect results due to wrong weight handling in the TDigest implementation.

Root cause: In TDigest::new_with_centroid(), the count field was hardcoded to 1 regardless of the actual centroid weight, while the weight was correctly used in the sum calculation. This mismatch caused incorrect percentile calculations since estimate_quantile() uses count to compute the rank.

Changes:

  • Changed TDigest::count from u64 to f64 to properly support fractional weights (consistent with ClickHouse's TDigest implementation)
  • Fixed new_with_centroid() to use centroid.weight for count
  • Updated state_fields() in approx_percentile_cont and approx_median to use Float64 for the count field
  • Added early return in merge_digests() when all centroids have zero weight to prevent panic
  • Updated test expectations to reflect correct weighted percentile behavior

Which issue does this PR close?

Rationale for this change

The approx_percentile_cont_with_weight function produces incorrect weighted percentile results. The bug is in the TDigest implementation where new_with_centroid() sets count: 1 regardless of the actual centroid weight, while the weight is used elsewhere in centroid merging. This mismatch corrupts the percentile calculation.

What changes are included in this PR?

  • Changed TDigest::count from u64 to f64 to properly support fractional weights (consistent with ClickHouse's TDigest implementation)
  • Fixed new_with_centroid() to use centroid.weight for count
  • Updated state_fields() in approx_percentile_cont and approx_median to use Float64 for the count field
  • Added early return in merge_digests() when all centroids have zero weight to prevent panic
  • Updated test expectations to reflect correct weighted percentile behavior

Are these changes tested?

Yes.

  • All existing unit tests in tdigest.rs pass (7 tests)
  • All SQL logic tests for aggregate functions pass
  • Manual testing confirms correct behavior with various weight distributions (equal weights, heavy low/high values, linear weights, fractional weights)

Are there any user-facing changes?

Yes, this is a breaking change:

  1. Result changes: approx_percentile_cont_with_weight now returns correct weighted percentiles. Queries relying on the previous (incorrect) behavior will see different results.
  2. Serialized state format change: The TDigest state field count changes from UInt64 to Float64. Any existing serialized/checkpointed TDigest state will be incompatible and cannot be restored.
  3. Edge case behavior change: When all weights are zero, the function now returns NULL instead of the previous undefined behavior.

@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels Jan 22, 2026
@Jefffrey
Copy link
Contributor

Thanks @sesteves

Would you be able to address the test issues? I'd suggest running cargo test locally to ensure it passes successfully before pushing to the PR.

The approx_percentile_cont_with_weight function was producing incorrect
results due to wrong weight handling in the TDigest implementation.

Root cause: In TDigest::new_with_centroid(), the count field was
hardcoded to 1 regardless of the actual centroid weight, while the
weight was correctly used in the sum calculation. This mismatch caused
incorrect percentile calculations since estimate_quantile() uses count
to compute the rank.

Changes:
- Changed TDigest::count from u64 to f64 to properly support fractional
  weights (consistent with ClickHouse's TDigest implementation)
- Fixed new_with_centroid() to use centroid.weight for count
- Updated state_fields() in approx_percentile_cont and approx_median to
  use Float64 for the count field
- Added early return in merge_digests() when all centroids have zero
  weight to prevent panic
- Updated test expectations to reflect correct weighted percentile
  behavior
@sesteves sesteves force-pushed the fix-approx-percentile-cont-with-weight branch from 4570151 to 4b35f82 Compare January 23, 2026 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

approx_percentile_cont_with_weight produces incorrect results due to wrong weight handling in TDigest

2 participants