Skip to content

Pluggable operator-level statistics propagation (StatisticsRegistry) #21443

@asolimando

Description

@asolimando

Is your feature request related to a problem or challenge?

Is your feature request related to a problem or challenge?

DataFusion currently computes operator statistics via each operator's built-in partition_statistics. While this provides basic cross-operator propagation, several gaps limit the effectiveness of cost-based optimizations:

  • No way to override statistics for existing operators: if the built-in estimation for, say, a HashJoinExec or AggregateExec is inaccurate for your workload, there is no way to plug in a better model without modifying DataFusion's source code
  • No support for richer statistical representations: the framework only carries basic statistics (row count, NDV, min/max); there is no mechanism to propagate richer metadata like histograms or sketches through the plan tree
  • No external stats injection: there is no way to feed richer statistics from external catalogs (Hive Metastore, Iceberg stats, custom catalogs) into the physical plan's statistics propagation

Without an extension point, advancing CBO in DataFusion becomes increasingly difficult: more sophisticated estimation strategies are hard to land in core because they add complexity that not all users need, and downstream projects embedding DataFusion have no way to bring their own estimation without forking.

See this discussion for context on how pluggable statistics would help DataFusion evolve its CBO beyond any single reference system while keeping the framework extensible.

Related: #8227 (statistics improvements epic), #20184 (child_stats in partition_statistics)

Describe the solution you'd like

A pluggable chain-of-responsibility framework for operator-level statistics, following the same pattern as RelationPlanner for SQL parsing and ExpressionAnalyzer (#21120) for expression-level statistics:

  • StatisticsProvider trait: chain element that computes statistics for a specific ExecutionPlan operator, returning either computed stats or delegating to the next provider
  • StatisticsRegistry: chains providers in priority order (first computed result wins), lives in SessionState
  • StatsCache: caller-owned per-pass memoization cache, created by the optimizer rule before a pass and dropped after
  • ExtendedStatistics: standard Statistics plus a type-erased extension map for custom metadata (histograms, sketches, etc.)

The registry walks the plan tree bottom-up: for each node, it first recursively computes enhanced child stats, then passes them to the provider chain. Each provider can use the enhanced child stats to produce a better estimate than the built-in partition_statistics, or delegate. Results are cached per-node in the StatsCache to avoid redundant walks within the same pass.

The framework should:

  • Ship with built-in providers for all standard physical operators (Filter, Projection, Aggregate, Join, Limit, Union, etc.)
  • Allow users to register custom providers via SessionState for overriding built-in estimation
  • Provide at least one optimizer rule integration as an example (e.g., JoinSelection)
  • Compose with the expression-level ExpressionAnalyzer (Pluggable expression-level statistics estimation (ExpressionAnalyzer) #21120): the registry feeds enhanced child stats into operators, which internally use the expression analyzer for expression-level estimation
  • Be purely additive, gated by a config flag

The design avoids breaking changes, but some upstream changes would simplify the architecture if the community is open to them:

  • If partition_statistics accepted child_stats (Let partition_statistics accept pre-computed children statistics #20184), the registry could feed enhanced stats into the built-in path directly, eliminating the separate bottom-up tree walk and ensuring column-level statistics (NDV, min/max) propagate through all operators
  • If Statistics carried a type-erased extension map (similar to ExtendedStatistics), extensions (histograms, sketches) would flow naturally through partition_statistics and the separate registry walk for extension propagation could be dropped entirely

Describe alternatives you've considered

Storing enhanced statistics directly on plan nodes via PlanProperties. This would avoid the external walk but requires modifying a core type that every ExecutionPlan implementation depends on, making the change significantly more invasive. The external registry is purely additive and follows standard practice in query optimizers (Calcite, Trino, Spark all compute stats on demand after plan transformations).

Planned work

Framework

  • StatisticsProvider trait, StatisticsRegistry, StatsCache, ExtendedStatistics
  • DefaultStatisticsProvider (fallback to partition_statistics)
  • SessionState integration, config flag

Built-in providers

  • FilterStatisticsProvider (selectivity + post-filter NDV adjustment)
  • ProjectionStatisticsProvider (column mapping through projections)
  • PassthroughStatisticsProvider (schema-preserving, cardinality-preserving operators: Sort, Repartition, CoalesceBatches, etc.)
  • AggregateStatisticsProvider (NDV-product estimation for GROUP BY)
  • JoinStatisticsProvider (NDV-based join estimation, multi-key, join-type-aware bounds, hash/sort-merge/nested-loop/cross)
  • LimitStatisticsProvider (local + global, skip + fetch)
  • UnionStatisticsProvider (row count summation with Absent propagation)

Optimizer integration

  • Provide at least one optimizer rule integration as an example (e.g., JoinSelection)

Other physical optimizer decisions that could benefit from enhanced statistics include aggregate mode selection (single vs two-phase based on estimated group count), dynamic join algorithm selection (hash vs sort-merge), and dynamic filter activation based on estimated selectivity, among others.

Additional context

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request
No fields configured for Feature.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions