-
Notifications
You must be signed in to change notification settings - Fork 2k
Pluggable operator-level statistics propagation (StatisticsRegistry) #21443
Description
Is your feature request related to a problem or challenge?
Is your feature request related to a problem or challenge?
DataFusion currently computes operator statistics via each operator's built-in partition_statistics. While this provides basic cross-operator propagation, several gaps limit the effectiveness of cost-based optimizations:
- No way to override statistics for existing operators: if the built-in estimation for, say, a
HashJoinExecorAggregateExecis inaccurate for your workload, there is no way to plug in a better model without modifying DataFusion's source code - No support for richer statistical representations: the framework only carries basic statistics (row count, NDV, min/max); there is no mechanism to propagate richer metadata like histograms or sketches through the plan tree
- No external stats injection: there is no way to feed richer statistics from external catalogs (Hive Metastore, Iceberg stats, custom catalogs) into the physical plan's statistics propagation
Without an extension point, advancing CBO in DataFusion becomes increasingly difficult: more sophisticated estimation strategies are hard to land in core because they add complexity that not all users need, and downstream projects embedding DataFusion have no way to bring their own estimation without forking.
See this discussion for context on how pluggable statistics would help DataFusion evolve its CBO beyond any single reference system while keeping the framework extensible.
Related: #8227 (statistics improvements epic), #20184 (child_stats in partition_statistics)
Describe the solution you'd like
A pluggable chain-of-responsibility framework for operator-level statistics, following the same pattern as RelationPlanner for SQL parsing and ExpressionAnalyzer (#21120) for expression-level statistics:
StatisticsProvidertrait: chain element that computes statistics for a specificExecutionPlanoperator, returning either computed stats or delegating to the next providerStatisticsRegistry: chains providers in priority order (first computed result wins), lives inSessionStateStatsCache: caller-owned per-pass memoization cache, created by the optimizer rule before a pass and dropped afterExtendedStatistics: standardStatisticsplus a type-erased extension map for custom metadata (histograms, sketches, etc.)
The registry walks the plan tree bottom-up: for each node, it first recursively computes enhanced child stats, then passes them to the provider chain. Each provider can use the enhanced child stats to produce a better estimate than the built-in partition_statistics, or delegate. Results are cached per-node in the StatsCache to avoid redundant walks within the same pass.
The framework should:
- Ship with built-in providers for all standard physical operators (Filter, Projection, Aggregate, Join, Limit, Union, etc.)
- Allow users to register custom providers via
SessionStatefor overriding built-in estimation - Provide at least one optimizer rule integration as an example (e.g.,
JoinSelection) - Compose with the expression-level
ExpressionAnalyzer(Pluggable expression-level statistics estimation (ExpressionAnalyzer) #21120): the registry feeds enhanced child stats into operators, which internally use the expression analyzer for expression-level estimation - Be purely additive, gated by a config flag
The design avoids breaking changes, but some upstream changes would simplify the architecture if the community is open to them:
- If
partition_statisticsacceptedchild_stats(Let partition_statistics accept pre-computed children statistics #20184), the registry could feed enhanced stats into the built-in path directly, eliminating the separate bottom-up tree walk and ensuring column-level statistics (NDV, min/max) propagate through all operators - If
Statisticscarried a type-erased extension map (similar toExtendedStatistics), extensions (histograms, sketches) would flow naturally throughpartition_statisticsand the separate registry walk for extension propagation could be dropped entirely
Describe alternatives you've considered
Storing enhanced statistics directly on plan nodes via PlanProperties. This would avoid the external walk but requires modifying a core type that every ExecutionPlan implementation depends on, making the change significantly more invasive. The external registry is purely additive and follows standard practice in query optimizers (Calcite, Trino, Spark all compute stats on demand after plan transformations).
Planned work
Framework
- StatisticsProvider trait, StatisticsRegistry, StatsCache, ExtendedStatistics
- DefaultStatisticsProvider (fallback to partition_statistics)
- SessionState integration, config flag
Built-in providers
- FilterStatisticsProvider (selectivity + post-filter NDV adjustment)
- ProjectionStatisticsProvider (column mapping through projections)
- PassthroughStatisticsProvider (schema-preserving, cardinality-preserving operators: Sort, Repartition, CoalesceBatches, etc.)
- AggregateStatisticsProvider (NDV-product estimation for GROUP BY)
- JoinStatisticsProvider (NDV-based join estimation, multi-key, join-type-aware bounds, hash/sort-merge/nested-loop/cross)
- LimitStatisticsProvider (local + global, skip + fetch)
- UnionStatisticsProvider (row count summation with
Absentpropagation)
Optimizer integration
- Provide at least one optimizer rule integration as an example (e.g.,
JoinSelection)
Other physical optimizer decisions that could benefit from enhanced statistics include aggregate mode selection (single vs two-phase based on estimated group count), dynamic join algorithm selection (hash vs sort-merge), and dynamic filter activation based on estimated selectivity, among others.
Additional context
- Related: Pluggable expression-level statistics estimation (ExpressionAnalyzer) #21120 (
ExpressionAnalyzer, expression-level complement), Let partition_statistics accept pre-computed children statistics #20184 (child_statsinpartition_statistics, prerequisite for column-level propagation), Estimate aggregate output rows using existing NDV statistics #20926 (aggregate NDV estimation), Epic: Statistics improvements #8227 (statistics improvements epic), Implementpartition_statisticsAPI for more operators #15873 (partition_statisticsoperator support tracking) PhysicalOptimizerRule::optimizecurrently receives onlyConfigOptions; rules needing additional context (statistics registry, expression analyzer) require a richer context parameter or a workaround- RelMetadataQuery / RelMetadataProvider pattern, adapted for Rust's ownership model