Skip to content

feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044

Open
yongchul wants to merge 3 commits intosubstrait-io:mainfrom
yongchul:residual_join_expression
Open

feat: clarify post_join_filter. add residual_expressions to hash join and merge join#1044
yongchul wants to merge 3 commits intosubstrait-io:mainfrom
yongchul:residual_join_expression

Conversation

@yongchul
Copy link
Copy Markdown
Contributor

@yongchul yongchul commented Apr 13, 2026

Background

JoinRel, HashJoinRel, and MergeJoinRel have a field namedpost_join_filter, causing confusion. The confusing point is whether post_join_filter is part of the join predicate (i.e., two rows are matched when it evaluates to true) or not (i.e., as the name says, this is post join filter, thus filter evaluates after the join operation).

Substrait FAQ calls out that the post_join_filter is part of join predicate but @yongchul was deeply discontent about the explanation and the naming.

After lengthy discussion in the Slack channel, #807 was created. After more back-and-forth, the original intent of post_join_filter was indeed post join filter, not part of join predicate.

Also, it appears that Calcite has precisely the same name and same semantic, post join filter. Following code point shows placing the filter on top of the join.
https://github.com/apache/calcite/blob/0a39568b167592ded8db1128b5838982ffe264f3/core/src/main/java/org/apache/calcite/rel/rules/LoptOptimizeJoinRule.java#L553-L559

What this changes do?

TL;DR; Make post_join_filter as post_join_filter again.

  1. Documents post_join_filter is not join condition but semantically filter on top of join.
  2. Introduce residual_expression to HashJoinRel and MergeJoinRel so that they can express non-equality join predicate -- I guess this was the confusion who tried to shove non-equality join predicate to HashJoin and MergeJoin somewhere and distorted the meaning without much thought.
  3. Drop equi from hash join and merge join documentation as with residual_expression they are capable of supporting arbitrary join predicate.
  4. Fix the FAQ.

Breaking change?

This is debatable. According to previous FAQ, post_join_filter was residual_expression introduced in this PR. If we take the FAQ as part of the "spec", then this is a breaking change.

If we don't consider FAQ as a spec, then this is not a breaking change but clarification and adding a new feature -- HashJoinRel, MergeJoinRel now supports arbitrary join predicate.

AI disclaimer

The PR is assisted by Claud Opus 4.6.


This change is Reviewable

Comment thread site/docs/faq.md
Comment thread site/docs/relations/physical_relations.md Outdated
@yongchul yongchul added the PMC Ready PRs ready for review by PMCs label Apr 13, 2026
Comment on lines +866 to 869
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 6;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any known engine that has this feature?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. I know one that I can fix (it is under my control lol). I will check data fusion, duckdb, and velox.

@benbellick @vbarua @nielspardon any impact on your sides?

Copy link
Copy Markdown
Contributor Author

@yongchul yongchul Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • datafusion: not supported
  • duckdb: does not populate post_join_filter
  • gluten: does use post_join_filter but not sure whether it is consistent with this PR.

Copy link
Copy Markdown
Member

@nielspardon nielspardon Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The engines we are currently dealing with would also handle this as a separate FilterRel instead of merging this into the Join operation. We may have some use cases in the future for representing fused operators in Substrait though I would look at this separately.

For me I'm fine with both this improved documentation or removing it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked and we do not use it internally.

Clarifying the documentation is a good idea, though I would prefer removing it if no user of substrait is actually using it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielspardon @benbellick any final thoughts on the PR? This one I can wait until next week meeting.

Copy link
Copy Markdown
Contributor

@jacques-n jacques-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PMC Ready PRs ready for review by PMCs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants