Skip to content
Open
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 37 additions & 7 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -255,12 +255,18 @@ message ProjectRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// The binary JOIN relational operator left-join-right, including various join types, a join condition and post_join_filter expression
// The binary JOIN relational operator left-join-right, including various join
// types, a join condition, and an optional filter on the joined output.
message JoinRel {
RelCommon common = 1;
Rel left = 2;
Rel right = 3;
// A boolean condition evaluated over the left and right inputs that
// determines whether a pair of records is a match.
Expression expression = 4;
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 5;

JoinType type = 6;
Expand Down Expand Up @@ -828,10 +834,12 @@ message ComparisonJoinKey {
}
}

// The hash equijoin operator will build a hash table out of one input (default `right`) based on a set of join keys.
// It will then probe that hash table for the other input (default `left`), finding matches.
// The hash equijoin operator will build a hash table out of one input (default
// `right`) based on a set of join keys. It will then probe that hash table for
// the other input (default `left`), finding matches.
//
// Two rows are a match if the comparison function returns true for all keys
// Two rows are a match if every key comparison returns true and, when
// specified, `residual_expression` also evaluates to true.
message HashJoinRel {
RelCommon common = 1;
Rel left = 2;
Expand All @@ -855,12 +863,23 @@ message HashJoinRel {
// hash function for a given comparsion function or to reject the plan if it cannot
// do so.
repeated ComparisonJoinKey keys = 8;
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 6;
Comment on lines +866 to 869
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any known engine that has this feature?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. I know one that I can fix (it is under my control lol). I will check data fusion, duckdb, and velox.

@benbellick @vbarua @nielspardon any impact on your sides?

Copy link
Copy Markdown
Contributor Author

@yongchul yongchul Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • datafusion: not supported
  • duckdb: does not populate post_join_filter
  • gluten: does use post_join_filter but not sure whether it is consistent with this PR.

Copy link
Copy Markdown
Member

@nielspardon nielspardon Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The engines we are currently dealing with would also handle this as a separate FilterRel instead of merging this into the Join operation. We may have some use cases in the future for representing fused operators in Substrait though I would look at this separately.

For me I'm fine with both this improved documentation or removing it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked and we do not use it internally.

Clarifying the documentation is a good idea, though I would prefer removing it if no user of substrait is actually using it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielspardon @benbellick any final thoughts on the PR? This one I can wait until next week meeting.


JoinType type = 7;

// Specifies which side of input to build the hash table for this hash join. Default is `BUILD_INPUT_RIGHT`.
// Specifies which side of input to build the hash table for this hash join.
// Defaults to `BUILD_INPUT_RIGHT`.
BuildInput build_input = 9;
// An optional boolean expression evaluated on each candidate key-match to
// determine whether it is a true match. A candidate is only considered a
// match when all `keys` comparisons AND this expression evaluate to true.
// This expression interacts with join-type semantics: for example, in a
// left outer join, a left row that has key matches but no candidate
// satisfying this expression will be emitted with nulls for the right side.
Expression residual_expression = 11;

enum JoinType {
JOIN_TYPE_UNSPECIFIED = 0;
Expand All @@ -887,8 +906,9 @@ message HashJoinRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys.
// This allows the join operation to be done in a streaming fashion.
// The merge equijoin does a join by taking advantage of two sets that are
// sorted on the join keys. This allows the join operation to be done in a
// streaming fashion.
message MergeJoinRel {
RelCommon common = 1;
Rel left = 2;
Expand All @@ -914,9 +934,19 @@ message MergeJoinRel {
// is free to do so as well). If possible, the consumer should verify the sort
// order and reject invalid plans.
repeated ComparisonJoinKey keys = 8;
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 6;

JoinType type = 7;
// An optional boolean expression evaluated on each candidate key-match to
// determine whether it is a true match. A candidate is only considered a
// match when all `keys` comparisons AND this expression evaluate to true.
// This expression interacts with join-type semantics: for example, in a
// left outer join, a left row that has key matches but no candidate
// satisfying this expression will be emitted with nulls for the right side.
Expression residual_expression = 9;

enum JoinType {
JOIN_TYPE_UNSPECIFIED = 0;
Expand Down
7 changes: 5 additions & 2 deletions site/docs/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,12 @@ title: FAQ

## What is the purpose of the post-join filter field on Join relations?

The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join.
`post_join_filter` is an optional filter on the output records of a join. It is applied after matching and any join-type–specific post-processing (e.g., null-padding unmatched rows for outer joins, emitting unmatched rows for anti joins) are complete. It is semantically equivalent to placing a separate `Filter` relation directly above the join. Because it does not participate in matching, it has no effect on which rows are considered matched or unmatched for the purpose of outer, anti, semi, or mark join semantics. When omitted it defaults to `true`.
Comment thread
yongchul marked this conversation as resolved.

See the example [here](https://facebookincubator.github.io/velox/develop/joins.html#hash-join-implementation) that highlights how the post-join filter behaves differently than a Filter relation in the case of a left join.
This is distinct from predicates that *do* participate in match determination:

- In `JoinRel`, all match predicates belong in `expression`.
- In `HashJoinRel` and `MergeJoinRel`, equijoin predicates go in `keys` and any remaining match predicates go in `residual_expression`.

## Why does the project relation keep existing columns?

Expand Down
2 changes: 1 addition & 1 deletion site/docs/relations/logical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -247,7 +247,7 @@ The join operation will combine two separate inputs into a single output, based
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the input order of the data. | Required. Can be the literal True. |
| Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional |
| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](#filter-operation) directly above the join. Does not influence which rows are considered matches. | Optional, defaults to True. |
| Join Type | One of the join types defined below. | Required |

### Join Types
Expand Down
6 changes: 4 additions & 2 deletions site/docs/relations/physical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@ The hash equijoin join operator will build a hash table out of one input (defaul
| Build Input | Specifies which input is the `Build`. | Optional, defaults to build `Right`, probe `Left`. |
| Left Keys | References to the fields to join on in the left input. | Required |
| Right Keys | References to the fields to join on in the right input. | Required |
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. |
| Residual Expression | An optional boolean expression evaluated on each candidate key-match to determine whether it is a true match. Use this for join predicates that cannot be expressed as equijoin keys (e.g., range or inequality conditions). This expression participates in join-type semantics (outer, anti, mark, etc.). | Optional, defaults to True. |
| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](logical_relations.md#filter-operation) directly above the join. Does not influence which rows are considered matches. | Optional, defaults to True. |
| Join Type | One of the join types defined in the Join operator. | Required |

## NLJ (Nested Loop Join) Operator
Expand Down Expand Up @@ -67,7 +68,8 @@ The merge equijoin does a join by taking advantage of two sets that are sorted o
| Right Input | A relational input. | Required |
| Left Keys | References to the fields to join on in the left input. | Required |
| Right Keys | References to the fields to join on in the right input. | Required |
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. |
| Residual Expression | An optional boolean expression evaluated on each candidate key-match to determine whether it is a true match. Use this for join predicates that cannot be expressed as equijoin keys (e.g., range or inequality conditions). This expression participates in join-type semantics (outer, anti, mark, etc.). | Optional, defaults to True. |
| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](logical_relations.md#filter-operation) directly above the join. Does not influence which rows are considered matches. | Optional, defaults to True. |
| Join Type | One of the join types defined in the Join operator. | Required |

## Exchange Operator
Expand Down