Description
Is your feature request related to a problem or challenge?
The MetricValue enum currently exposes only single-value statistics: counts, gauges, timers, timestamps, and a few hard-coded variants such as SpillCount or OutputRows.
For many operational questions, we care about the shape of a metric’s distribution (e.g. "What is the p99 elapsed-compute time?" or "How skewed is memory usage across partitions?").
This is especially true when the ExecutionPlan is dispatched to multiple nodes / workers in a distributed system as part of multiple requests.
Because there is no “distribution” metric type right now, we can only track simple aggregates such as avg / min / max.
This makes it hard to pinpoint outliers in latency or memory usage.
Describe the solution you'd like
Add a new Distribution variant to the MetricValue enum.
That would look like:
Distribution {
/// The provided name of this metric
name: Cow<'static, str>,
/// The t-digest accumulating the observed values of this metric.
value: Arc<Mutex<TDigest>>,
},
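To illustrate how such a metric might be recorded, merged across partitions, and queried, here is a minimal sketch. It is not the proposed implementation: `ExactSketch` is a hypothetical stand-in that keeps every sample instead of a real t-digest, and the `Distribution` struct with its `record`, `merge_from` and `quantile` methods is assumed purely for illustration.

```rust
use std::borrow::Cow;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a t-digest: it keeps every sample, so quantiles
/// are exact but memory is unbounded. A real t-digest compresses samples
/// into a bounded number of centroids.
#[derive(Debug, Default)]
struct ExactSketch {
    samples: Vec<f64>,
}

impl ExactSketch {
    /// Nearest-rank quantile for `q` in [0, 1]; `None` when no samples exist.
    fn quantile(&self, q: f64) -> Option<f64> {
        if self.samples.is_empty() {
            return None;
        }
        let mut sorted = self.samples.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let rank = (q.clamp(0.0, 1.0) * sorted.len() as f64).ceil() as usize;
        Some(sorted[rank.saturating_sub(1)])
    }
}

/// Assumed shape of the proposed metric, written as a struct for brevity.
#[derive(Debug, Clone)]
struct Distribution {
    name: Cow<'static, str>,
    value: Arc<Mutex<ExactSketch>>,
}

impl Distribution {
    fn new(name: impl Into<Cow<'static, str>>) -> Self {
        Self {
            name: name.into(),
            value: Arc::new(Mutex::new(ExactSketch::default())),
        }
    }

    /// Record one observation (e.g. elapsed compute time of one partition).
    fn record(&self, v: f64) {
        self.value.lock().unwrap().samples.push(v);
    }

    /// Fold the samples of `other` into `self`, e.g. when aggregating the
    /// metrics reported by several partitions or workers.
    fn merge_from(&self, other: &Distribution) {
        // Snapshot first so the two locks are never held at the same time.
        let snapshot = other.value.lock().unwrap().samples.clone();
        self.value.lock().unwrap().samples.extend(snapshot);
    }

    fn quantile(&self, q: f64) -> Option<f64> {
        self.value.lock().unwrap().quantile(q)
    }
}
```

With something like this, one slow partition stays visible in the upper quantiles after merging, whereas an average would dilute it.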
Describe alternatives you've considered
An alternative would be to expose something more generic to allow everyone to define their own ways of accumulating metrics throughout the plan execution:
Custom {
/// The provided name of this metric
name: Cow<'static, str>,
/// A custom implementation of the metric value.
value: Arc<dyn CustomMetricValue>,
},
}
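trait CustomMetricValue: Debug + Send + Sync {
/// Returns a new, empty value of the same kind as this one.
fn new_empty(self: Arc<Self>) -> Arc<dyn CustomMetricValue>;
/// Combines this value with `other` and returns the aggregated result.
fn aggregate(
self: Arc<Self>,
other: &dyn CustomMetricValue,
) -> Arc<dyn CustomMetricValue>;
}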
This would allow more complex aggregations of metrics. For instance, in the context of an execution plan issuing multiple requests, we could track the 5 slowest requests along with their metadata.
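As a rough sketch of the "slowest requests" idea, the trait could be implemented along these lines. The `Request` and `SlowestRequests` types are hypothetical, and the `as_any` method is an assumption that is not part of the trait proposed above: it is added here only because `aggregate` needs some way to downcast `other` to the concrete type.

```rust
use std::any::Any;
use std::fmt::Debug;
use std::sync::Arc;

trait CustomMetricValue: Debug + Send + Sync {
    fn new_empty(self: Arc<Self>) -> Arc<dyn CustomMetricValue>;
    fn aggregate(self: Arc<Self>, other: &dyn CustomMetricValue) -> Arc<dyn CustomMetricValue>;
    /// Assumption: not in the proposal, added so implementations can downcast.
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical metadata kept for each tracked request.
#[derive(Debug, Clone, PartialEq)]
struct Request {
    url: String,
    elapsed_ms: u64,
}

/// Keeps at most `limit` requests, slowest first.
#[derive(Debug, Clone)]
struct SlowestRequests {
    limit: usize,
    requests: Vec<Request>,
}

impl SlowestRequests {
    fn new(limit: usize) -> Self {
        Self { limit, requests: Vec::new() }
    }

    /// Insert a request, then keep only the `limit` slowest ones.
    fn push(&mut self, request: Request) {
        self.requests.push(request);
        self.requests.sort_by(|a, b| b.elapsed_ms.cmp(&a.elapsed_ms));
        self.requests.truncate(self.limit);
    }

    /// Builder-style helper for constructing values in examples.
    fn with_request(mut self, url: &str, elapsed_ms: u64) -> Self {
        self.push(Request { url: url.to_string(), elapsed_ms });
        self
    }
}

impl CustomMetricValue for SlowestRequests {
    fn new_empty(self: Arc<Self>) -> Arc<dyn CustomMetricValue> {
        Arc::new(SlowestRequests::new(self.limit))
    }

    fn aggregate(self: Arc<Self>, other: &dyn CustomMetricValue) -> Arc<dyn CustomMetricValue> {
        let mut merged = (*self).clone();
        // Values of a different concrete type are ignored in this sketch.
        if let Some(other) = other.as_any().downcast_ref::<SlowestRequests>() {
            for request in &other.requests {
                merged.push(request.clone());
            }
        }
        Arc::new(merged)
    }

    fn as_any(&self) -> &dyn Any {
        self
    }
}
```

Aggregating two such values with a limit of 5 would yield the 5 slowest requests across both, which is the kind of result a fixed set of built-in variants cannot express.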
Additional context
Happy to draft a PR if you think this would fit the Metric model and would be a nice addition.