feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044
feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044yongchul wants to merge 3 commits intosubstrait-io:mainfrom
Conversation
| // An optional boolean filter applied to each output record after join | ||
| // matching and null-padding (for outer/mark joins) are complete. | ||
| // Semantically equivalent to placing a FilterRel directly above this join. | ||
| Expression post_join_filter = 6; |
There was a problem hiding this comment.
Is there any known engine that has this feature?
There was a problem hiding this comment.
Great question. I know one that I can fix (it is under my control lol). I will check data fusion, duckdb, and velox.
@benbellick @vbarua @nielspardon any impact on your sides?
There was a problem hiding this comment.
- datafusion: not supported
- duckdb: does not populate post_join_filter
- gluten: does use post_join_filter but not sure whether it is consistent with this PR.
There was a problem hiding this comment.
The engines we are currently dealing with would also handle this as a separate FilterRel instead of merging this into the Join operation. We may have some use cases in the future for representing fused operators in Substrait though I would look at this separately.
For me I'm fine with both this improved documentation or removing it.
There was a problem hiding this comment.
I just checked and we do not use it internally.
Clarifying the documentation is a good idea, though I would prefer removing it if no user of substrait is actually using it.
There was a problem hiding this comment.
@nielspardon @benbellick any final thoughts on the PR? This one I can wait until next week meeting.
Background
JoinRel, HashJoinRel, and MergeJoinRel have a field named
post_join_filter, causing confusion. The confusing point is whetherpost_join_filteris part of the join predicate (i.e., two rows are matched when it evaluates totrue) or not (i.e., as the name says, this is post join filter, thusfilterevaluates after the join operation).Substrait FAQ calls out that the
post_join_filteris part of join predicate but @yongchul was deeply discontent about the explanation and the naming.After lengthy discussion in the Slack channel, #807 was created. After more back-and-forth, the original intent of
post_join_filterwas indeed post join filter, not part of join predicate.Also, it appears that Calcite has precisely the same name and same semantic, post join filter. Following code point shows placing the filter on top of the join.
https://github.com/apache/calcite/blob/0a39568b167592ded8db1128b5838982ffe264f3/core/src/main/java/org/apache/calcite/rel/rules/LoptOptimizeJoinRule.java#L553-L559
What this changes do?
TL;DR; Make
post_join_filteraspost_join_filteragain.post_join_filteris not join condition but semantically filter on top of join.residual_expressionto HashJoinRel and MergeJoinRel so that they can express non-equality join predicate -- I guess this was the confusion who tried to shove non-equality join predicate toHashJoinandMergeJoinsomewhere and distorted the meaning without much thought.equifrom hash join and merge join documentation as withresidual_expressionthey are capable of supporting arbitrary join predicate.Breaking change?
This is debatable. According to previous FAQ,
post_join_filterwasresidual_expressionintroduced in this PR. If we take the FAQ as part of the "spec", then this is a breaking change.If we don't consider FAQ as a spec, then this is not a breaking change but clarification and adding a new feature -- HashJoinRel, MergeJoinRel now supports arbitrary join predicate.
AI disclaimer
The PR is assisted by Claud Opus 4.6.
This change is