Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 37 additions & 7 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -255,12 +255,18 @@ message ProjectRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// The binary JOIN relational operator left-join-right, including various join types, a join condition and post_join_filter expression
// The binary JOIN relational operator left-join-right, including various join
// types, a join condition, and an optional filter on the joined output.
message JoinRel {
RelCommon common = 1;
Rel left = 2;
Rel right = 3;
// A boolean condition evaluated over the left and right inputs that
// determines whether a pair of records is a match.
Expression expression = 4;
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 5;

JoinType type = 6;
Expand Down Expand Up @@ -828,10 +834,12 @@ message ComparisonJoinKey {
}
}

// The hash equijoin operator will build a hash table out of one input (default `right`) based on a set of join keys.
// It will then probe that hash table for the other input (default `left`), finding matches.
// The hash join operator will build a hash table out of one input (default
// `right`) based on a set of join keys. It will then probe that hash table for
// the other input (default `left`), finding matches.
//
// Two rows are a match if the comparison function returns true for all keys
// Two rows are a match if every key comparison returns true and, when
// specified, `residual_expression` also evaluates to true.
message HashJoinRel {
RelCommon common = 1;
Rel left = 2;
Expand All @@ -855,12 +863,23 @@ message HashJoinRel {
// hash function for a given comparsion function or to reject the plan if it cannot
// do so.
repeated ComparisonJoinKey keys = 8;
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 6;
Comment on lines +866 to 869
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any known engine that has this feature?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. I know one that I can fix (it is under my control lol). I will check data fusion, duckdb, and velox.

@benbellick @vbarua @nielspardon any impact on your sides?

Copy link
Copy Markdown
Contributor Author

@yongchul yongchul Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • datafusion: not supported
  • duckdb: does not populate post_join_filter
  • gluten: does use post_join_filter but not sure whether it is consistent with this PR.

Copy link
Copy Markdown
Member

@nielspardon nielspardon Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The engines we are currently dealing with would also handle this as a separate FilterRel instead of merging this into the Join operation. We may have some use cases in the future for representing fused operators in Substrait though I would look at this separately.

For me I'm fine with both this improved documentation or removing it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked and we do not use it internally.

Clarifying the documentation is a good idea, though I would prefer removing it if no user of substrait is actually using it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielspardon @benbellick any final thoughts on the PR? This one I can wait until next week meeting.


JoinType type = 7;

// Specifies which side of input to build the hash table for this hash join. Default is `BUILD_INPUT_RIGHT`.
// Specifies which side of input to build the hash table for this hash join.
// Defaults to `BUILD_INPUT_RIGHT`.
BuildInput build_input = 9;
// An optional boolean expression evaluated on each candidate key-match to
// determine whether it is a true match. A candidate is only considered a
// match when all `keys` comparisons AND this expression evaluate to true.
// This expression interacts with join-type semantics: for example, in a
// left outer join, a left row that has key matches but no candidate
// satisfying this expression will be emitted with nulls for the right side.
Expression residual_expression = 11;

enum JoinType {
JOIN_TYPE_UNSPECIFIED = 0;
Expand All @@ -887,8 +906,9 @@ message HashJoinRel {
substrait.extensions.AdvancedExtension advanced_extension = 10;
}

// The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys.
// This allows the join operation to be done in a streaming fashion.
// The merge join does a join by taking advantage of two sets that are
// sorted on the join keys. This allows the join operation to be done in a
// streaming fashion.
message MergeJoinRel {
RelCommon common = 1;
Rel left = 2;
Expand All @@ -914,9 +934,19 @@ message MergeJoinRel {
// is free to do so as well). If possible, the consumer should verify the sort
// order and reject invalid plans.
repeated ComparisonJoinKey keys = 8;
// An optional boolean filter applied to each output record after join
// matching and null-padding (for outer/mark joins) are complete.
// Semantically equivalent to placing a FilterRel directly above this join.
Expression post_join_filter = 6;

JoinType type = 7;
// An optional boolean expression evaluated on each candidate key-match to
// determine whether it is a true match. A candidate is only considered a
// match when all `keys` comparisons AND this expression evaluate to true.
// This expression interacts with join-type semantics: for example, in a
// left outer join, a left row that has key matches but no candidate
// satisfying this expression will be emitted with nulls for the right side.
Expression residual_expression = 9;

enum JoinType {
JOIN_TYPE_UNSPECIFIED = 0;
Expand Down
7 changes: 5 additions & 2 deletions site/docs/faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,9 +6,12 @@ title: FAQ

## What is the purpose of the post-join filter field on Join relations?

The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join.
`post_join_filter` is an optional filter on the output records of a join. It is applied after matching and any join-type–specific post-processing (e.g., null-padding unmatched rows for outer joins, emitting unmatched rows for anti joins) are complete. It is semantically equivalent to placing a separate `Filter` relation directly above the join. Because it does not participate in matching, it has no effect on which rows are considered matched or unmatched for the purpose of outer, anti, semi, or mark join semantics. When omitted it defaults to `true`.
Comment thread
yongchul marked this conversation as resolved.

See the example [here](https://facebookincubator.github.io/velox/develop/joins.html#hash-join-implementation) that highlights how the post-join filter behaves differently than a Filter relation in the case of a left join.
This is distinct from predicates that *do* participate in match determination:

- In `JoinRel`, all match predicates belong in `expression`.
- In `HashJoinRel` and `MergeJoinRel`, equijoin predicates go in `keys` and any remaining match predicates go in `residual_expression`.

## Why does the project relation keep existing columns?

Expand Down
2 changes: 1 addition & 1 deletion site/docs/relations/logical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -247,7 +247,7 @@ The join operation will combine two separate inputs into a single output, based
| Left Input | A relational input. | Required |
| Right Input | A relational input. | Required |
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the input order of the data. | Required. Can be the literal True. |
| Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional |
| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](#filter-operation) directly above the join. Does not influence which rows are considered matches. | Optional, defaults to True. |
| Join Type | One of the join types defined below. | Required |

### Join Types
Expand Down
16 changes: 9 additions & 7 deletions site/docs/relations/physical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,9 @@

There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can be potentially used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the "physical plan" but this is a characteristic of the system consuming substrait as opposed to a definition within Substrait.

## Hash Equijoin Operator
## Hash Join Operator

The hash equijoin join operator will build a hash table out of one input (default `right`) based on a set of join keys. It will then probe that hash table for the other input (default `left`), finding matches.
The hash join operator will build a hash table out of one input (default `right`) based on a set of join keys. It will then probe that hash table for the other input (default `left`), finding matches, then use `residual_expression` to determine whether the matches are true matches.

| Signature | Value |
| -------------------- | ----------------------------------------------------------------- |
Expand All @@ -14,7 +14,7 @@ The hash equijoin join operator will build a hash table out of one input (defaul
| Input Order | Same as the [Join](logical_relations.md#join-operation) operator. |
| Direct Output Order | Same as the [Join](logical_relations.md#join-operation) operator. |

### Hash Equijoin Properties
### Hash Join Properties

| Property | Description | Required |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- |
Expand All @@ -23,7 +23,8 @@ The hash equijoin join operator will build a hash table out of one input (defaul
| Build Input | Specifies which input is the `Build`. | Optional, defaults to build `Right`, probe `Left`. |
| Left Keys | References to the fields to join on in the left input. | Required |
| Right Keys | References to the fields to join on in the right input. | Required |
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. |
| Residual Expression | An optional boolean expression evaluated on each candidate key-match to determine whether it is a true match. Use this for join predicates that cannot be expressed as equijoin keys (e.g., range or inequality conditions). This expression participates in join-type semantics (outer, anti, mark, etc.). | Optional, defaults to True. |
| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](logical_relations.md#filter-operation) directly above the join. Does not influence which rows are considered matches. | Optional, defaults to True. |
| Join Type | One of the join types defined in the Join operator. | Required |

## NLJ (Nested Loop Join) Operator
Expand All @@ -47,9 +48,9 @@ The nested loop join operator does a join by holding the entire right input and
| Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. | Optional. Defaults to true (a Cartesian join). |
| Join Type | One of the join types defined in the Join operator. | Required |

## Merge Equijoin Operator
## Merge Join Operator

The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion.
The merge join does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion. Once the join keys are matched, then use `residual_expression` to determine whether the matches are true matches.

| Signature | Value |
| -------------------- | ----------------------------------------------------------------- |
Expand All @@ -67,7 +68,8 @@ The merge equijoin does a join by taking advantage of two sets that are sorted o
| Right Input | A relational input. | Required |
| Left Keys | References to the fields to join on in the left input. | Required |
| Right Keys | References to the fields to join on in the right input. | Required |
| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. |
| Residual Expression | An optional boolean expression evaluated on each candidate key-match to determine whether it is a true match. Use this for join predicates that cannot be expressed as equijoin keys (e.g., range or inequality conditions). This expression participates in join-type semantics (outer, anti, mark, etc.). | Optional, defaults to True. |
| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](logical_relations.md#filter-operation) directly above the join. Does not influence which rows are considered matches. | Optional, defaults to True. |
| Join Type | One of the join types defined in the Join operator. | Required |

## Exchange Operator
Expand Down