feat: transitive predicate propagation across multi-table join chains #21423
Draft
xanderbailey wants to merge 1 commit into apache:main
Which issue does this PR close?
Rationale for this change
Right now, PushDownFilter does single-hop predicate inference through join equi-conditions. If you have a.x = b.y as a join key and WHERE a.x > 5, it'll infer b.y > 5 and push it down. Great.

But for multi-table join chains, where a third table c is joined on a column c.z that shares the same key, it never derives c.z > 5. The problem is that b.y > 5 gets pushed below the first join and becomes invisible when processing the second join. This means table c gets scanned without any filter, which is a missed optimization that matters for any query joining 3+ tables on shared keys.

Spark and Calcite both do transitive predicate closure -- DataFusion should too.
- Spark: InferFiltersFromConstraints in the Catalyst optimizer
- Calcite: JoinPushTransitivePredicatesRule

What changes are included in this PR?
Rather than adding a new optimizer rule, this extends the existing PushDownFilter rule with equivalence-class-based inference.

The key additions in push_down_filter.rs:
- ColumnEquivalences -- a simple union-find over Column that tracks which columns are transitively equal
- collect_descendant_equalities() -- walks the plan subtree collecting column equalities from descendant INNER join ON clauses. Only collects from INNER joins (outer join equalities don't unconditionally hold). Stops at projections, aggregates, limits, and other nodes that change column identity or row cardinality.
- infer_predicates_from_equivalence_classes() -- for each single-column predicate, generates equivalent predicates for all columns in the same equivalence class
- infer_join_predicates() is modified to build equivalence classes from the current join's keys + descendant joins + WHERE-clause column equalities, then use them for inference

The existing ON-filter inference path (infer_join_predicates_from_on_filters) is kept as-is since it correctly handles join-type-specific directionality for ON clause predicates.

Also adds push_down_filter_transitive.slt with end-to-end SQL tests that verify both EXPLAIN plans and query correctness.

Are these changes tested?
Yes -- both unit tests and sqllogictests:

Unit tests (new tests in push_down_filter.rs):
- WHERE a.x = b.y AND a.x > 5
- a.x + 1 = b.y is not treated as column equality
- ColumnEquivalences unit tests (basic operations + path compression)

SQL logic tests (push_down_filter_transitive.slt)

Are there any user-facing changes?
No API changes.
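
For reviewers who want the equivalence-class idea in isolation, here is a minimal standalone sketch of a union-find over column names plus the "copy a single-column predicate to every equivalent column" step. It is not the code in this PR: the `Col` alias is a plain `String` stand-in for DataFusion's `Column`, predicates are plain strings, and `infer_predicates` is a hypothetical helper. Only the name `ColumnEquivalences` is taken from the description above. The sketch also assumes the caller feeds it only equalities that hold unconditionally (for example, INNER join keys).

```rust
use std::collections::HashMap;

/// Illustrative stand-in for a qualified column such as "a.x".
/// (Assumption: a String keeps the sketch free of DataFusion types.)
type Col = String;

/// Union-find over column names: columns in the same class are transitively equal.
#[derive(Default)]
struct ColumnEquivalences {
    ids: HashMap<Col, usize>,
    names: Vec<Col>,
    parent: Vec<usize>,
}

impl ColumnEquivalences {
    /// Return the id for `col`, registering it as its own class if unseen.
    fn id(&mut self, col: &str) -> usize {
        if let Some(&i) = self.ids.get(col) {
            return i;
        }
        let i = self.parent.len();
        self.ids.insert(col.to_string(), i);
        self.names.push(col.to_string());
        self.parent.push(i);
        i
    }

    /// Find the class representative, compressing the path on the way back.
    fn find(&mut self, i: usize) -> usize {
        let p = self.parent[i];
        if p == i {
            return i;
        }
        let root = self.find(p);
        self.parent[i] = root;
        root
    }

    /// Record an equality such as `a.x = b.y`.
    fn union(&mut self, a: &str, b: &str) {
        let (ia, ib) = (self.id(a), self.id(b));
        let (ra, rb) = (self.find(ia), self.find(ib));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }

    /// All other columns in the same class as `col`, sorted for stable output.
    fn equivalents(&mut self, col: &str) -> Vec<Col> {
        let Some(&i) = self.ids.get(col) else {
            return vec![];
        };
        let root = self.find(i);
        let mut out = Vec::new();
        for j in 0..self.parent.len() {
            if j != i && self.find(j) == root {
                out.push(self.names[j].clone());
            }
        }
        out.sort();
        out
    }
}

/// Hypothetical helper: given a predicate `col <rest>` (e.g. rest = "> 5"),
/// produce the same predicate on every column equivalent to `col`.
fn infer_predicates(eq: &mut ColumnEquivalences, col: &str, rest: &str) -> Vec<String> {
    eq.equivalents(col)
        .into_iter()
        .map(|c| format!("{c} {rest}"))
        .collect()
}
```

With a hypothetical chain where `union("a.x", "b.y")` and `union("b.y", "c.z")` have been recorded, `infer_predicates(&mut eq, "a.x", "> 5")` yields `b.y > 5` and `c.z > 5`. The second of these is the predicate that single-hop inference misses.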