Statistics: Implement SampledDistribution variant to Distribution to … #16614

cj-zhukov · 2025-06-29T11:45:54Z

…support estimated distributions (#14897)

Which issue does this PR close?

Closes #14897 (Statistics: Implement SampledDistribution variant to Distribution to support estimated distributions).

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?


alamb · 2025-06-30T22:40:15Z

Thanks @cj-zhukov !

I think it would be super helpful to add an example / test showing how to use this new distribution to estimate cardinality.

For example, perhaps you could set up a SampledDistribution like:

[0, 10]: 100 samples
[20,30]: 200 samples

And then estimate the cardinality of a predicate like x > 25.

I would expect the estimate to be 1/3 (half of the 20-30 bucket and none of the 0-10 bucket).
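
The arithmetic behind that expectation can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Bin` struct and the function signature are hypothetical (only the name `estimate_selectivity_gt` is taken from the discussion), and it assumes values are spread uniformly inside each bucket.

```rust
/// One histogram bucket: a value range and the number of samples observed in it.
/// Hypothetical type for illustration; not the struct added by this PR.
#[derive(Debug, Clone, Copy)]
struct Bin {
    lower: f64,
    upper: f64,
    count: u64,
}

/// Estimated fraction of samples satisfying `x > threshold`,
/// assuming values are uniformly distributed inside each bin.
fn estimate_selectivity_gt(bins: &[Bin], threshold: f64) -> f64 {
    let total: u64 = bins.iter().map(|b| b.count).sum();
    if total == 0 {
        return 0.0;
    }
    let matching: f64 = bins
        .iter()
        .map(|b| {
            // Fraction of this bin that lies above the threshold.
            let fraction = if threshold >= b.upper {
                0.0
            } else if threshold <= b.lower {
                1.0
            } else {
                (b.upper - threshold) / (b.upper - b.lower)
            };
            fraction * b.count as f64
        })
        .sum();
    matching / total as f64
}
```

With the two buckets above, the [0, 10] bucket contributes 0 of its 100 samples and the [20, 30] bucket contributes half of its 200, so the estimate is 100 / 300 = 1/3. Multiplying that selectivity by the input row count would give the cardinality estimate.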

alamb · 2025-06-30T22:40:30Z

FYI @ozankabak and @berkaysynnada as you may be interested in this feature too

cj-zhukov · 2025-07-02T09:36:43Z

> Thanks @cj-zhukov !
>
> I think it would be super helpful to make an example / test showing how to use this new distribution to estimate cardinality
>
> For example perhaps you could set up a SampledDistribution like
> [0, 10]: 100 samples
> [20, 30]: 200 samples
> And then estimate the cardinality of a predicate like x > 25
>
> I would expect the estimate to be 1/3 (half of the 20-30 bucket and none of the 0-10 bucket)

Hi @alamb, I wanted to clarify one thing. I implemented general-purpose methods like mean(), median(), and variance() for SampledDistribution, similar to the other Distribution variants. These are designed to summarize the entire distribution. To answer your question about estimating cardinality for predicates like x > 25, I implemented a separate method, estimate_selectivity_gt(), that works specifically for that use case: it calculates how many values match the condition based on the bin layout and counts. Let me know if you think those general-purpose methods should be reused here, or if you'd prefer to keep predicate-based estimation separate. Happy to adjust based on your guidance.
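
For readers following along, here is one way whole-distribution summaries could be computed from binned samples. It is a rough sketch under stated assumptions, not the implementation in this PR: the `Bin` type and the free functions are hypothetical, bins are assumed to be sorted and non-overlapping, and values are assumed to be uniform within each bin.

```rust
/// Hypothetical bucket type for illustration; not the struct added by this PR.
#[derive(Debug, Clone, Copy)]
struct Bin {
    lower: f64,
    upper: f64,
    count: u64,
}

fn total_count(bins: &[Bin]) -> u64 {
    bins.iter().map(|b| b.count).sum()
}

/// Mean: each bin contributes its midpoint, weighted by its sample count.
fn mean(bins: &[Bin]) -> Option<f64> {
    let total = total_count(bins);
    if total == 0 {
        return None;
    }
    let weighted: f64 = bins
        .iter()
        .map(|b| 0.5 * (b.lower + b.upper) * b.count as f64)
        .sum();
    Some(weighted / total as f64)
}

/// Median: linear interpolation inside the bin where the cumulative count
/// crosses half of the total. Assumes bins are sorted and non-overlapping.
fn median(bins: &[Bin]) -> Option<f64> {
    let total = total_count(bins);
    if total == 0 {
        return None;
    }
    let half = total as f64 / 2.0;
    let mut seen = 0.0;
    for b in bins {
        let count = b.count as f64;
        if count > 0.0 && seen + count >= half {
            let fraction = (half - seen) / count;
            return Some(b.lower + fraction * (b.upper - b.lower));
        }
        seen += count;
    }
    None
}

/// Population variance: spread between bin midpoints plus the within-bin
/// variance of a uniform distribution (width^2 / 12).
fn variance(bins: &[Bin]) -> Option<f64> {
    let mu = mean(bins)?;
    let total = total_count(bins) as f64;
    let sum: f64 = bins
        .iter()
        .map(|b| {
            let mid = 0.5 * (b.lower + b.upper);
            let width = b.upper - b.lower;
            b.count as f64 * (width * width / 12.0 + (mid - mu) * (mid - mu))
        })
        .sum();
    Some(sum / total)
}
```

For the buckets [0, 10]: 100 and [20, 30]: 200, this gives a mean of 55/3 and a median of 22.5. Neither summary answers a range predicate such as x > 25 on its own, which is one argument for keeping predicate-based estimation as a separate method.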

Statistics: Implement SampledDistribution variant to Distribution to …

147458e

…support estimated distributions (apache#14897)

github-actions bot added the logical-expr (Logical plan and expressions) label Jun 29, 2025

Sergey Zhukov added 2 commits June 29, 2025 15:00

Fix formatting issues (cargo fmt)

bfc0a30

Fix non-exhaustive patterns

f638acb

github-actions bot added the physical-expr (Changes to the physical-expr crates) label Jun 29, 2025

Sergey Zhukov added 2 commits June 29, 2025 15:59

Fix formatting issues (cargo clippy)

5755b77

Fix formatting issues (cargo clippy)

2f75452

Test showing how to use new distribution to estimate cardinality

2cb19f9

Fix formatting issues (cargo fmt)

22b9248
