diff --git a/proto/substrait/algebra.proto b/proto/substrait/algebra.proto index 6fa504dc8..ad2091c59 100644 --- a/proto/substrait/algebra.proto +++ b/proto/substrait/algebra.proto @@ -261,12 +261,18 @@ message ProjectRel { substrait.extensions.AdvancedExtension advanced_extension = 10; } -// The binary JOIN relational operator left-join-right, including various join types, a join condition and post_join_filter expression +// The binary JOIN relational operator left-join-right, including various join +// types, a join condition, and an optional filter on the joined output. message JoinRel { RelCommon common = 1; Rel left = 2; Rel right = 3; + // A boolean condition evaluated over the left and right inputs that + // determines whether a pair of records is a match. Expression expression = 4; + // An optional boolean filter applied to each output record after + // join-type-specific output formation is complete. + // Semantically equivalent to placing a FilterRel directly above this join. Expression post_join_filter = 5; JoinType type = 6; @@ -834,10 +840,12 @@ message ComparisonJoinKey { } } -// The hash equijoin operator will build a hash table out of one input (default `right`) based on a set of join keys. -// It will then probe that hash table for the other input (default `left`), finding matches. +// The hash join operator will build a hash table out of one input (default +// `right`) based on a set of join keys. It will then probe that hash table for +// the other input (default `left`), finding matches. // -// Two rows are a match if the comparison function returns true for all keys +// Two rows are a match if every key comparison returns true and, when +// specified, `residual_expression` also evaluates to true. message HashJoinRel { RelCommon common = 1; Rel left = 2; @@ -861,12 +869,23 @@ message HashJoinRel { // hash function for a given comparsion function or to reject the plan if it cannot // do so. repeated ComparisonJoinKey keys = 8; + // An optional boolean filter applied to each output record after + // join-type-specific output formation is complete. + // Semantically equivalent to placing a FilterRel directly above this join. Expression post_join_filter = 6; JoinType type = 7; - // Specifies which side of input to build the hash table for this hash join. Default is `BUILD_INPUT_RIGHT`. + // Specifies which side of input to build the hash table for this hash join. + // Defaults to `BUILD_INPUT_RIGHT`. BuildInput build_input = 9; + // An optional boolean expression evaluated on each candidate key-match to + // determine whether it is a true match. A candidate is only considered a + // match when all `keys` comparisons AND this expression evaluate to true. + // This expression interacts with join-type semantics: for example, in a + // left outer join, a left row that has key matches but no candidate + // satisfying this expression will be emitted with nulls for the right side. + Expression residual_expression = 11; enum JoinType { JOIN_TYPE_UNSPECIFIED = 0; @@ -893,8 +912,9 @@ message HashJoinRel { substrait.extensions.AdvancedExtension advanced_extension = 10; } -// The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. -// This allows the join operation to be done in a streaming fashion. +// The merge join does a join by taking advantage of two sets that are +// sorted on the join keys. This allows the join operation to be done in a +// streaming fashion. message MergeJoinRel { RelCommon common = 1; Rel left = 2; @@ -920,9 +940,19 @@ message MergeJoinRel { // is free to do so as well). If possible, the consumer should verify the sort // order and reject invalid plans. repeated ComparisonJoinKey keys = 8; + // An optional boolean filter applied to each output record after + // join-type-specific output formation is complete. + // Semantically equivalent to placing a FilterRel directly above this join. Expression post_join_filter = 6; JoinType type = 7; + // An optional boolean expression evaluated on each candidate key-match to + // determine whether it is a true match. A candidate is only considered a + // match when all `keys` comparisons AND this expression evaluate to true. + // This expression interacts with join-type semantics: for example, in a + // left outer join, a left row that has key matches but no candidate + // satisfying this expression will be emitted with nulls for the right side. + Expression residual_expression = 9; enum JoinType { JOIN_TYPE_UNSPECIFIED = 0; diff --git a/site/docs/faq.md b/site/docs/faq.md index c6859c67a..424a4e886 100644 --- a/site/docs/faq.md +++ b/site/docs/faq.md @@ -6,9 +6,12 @@ title: FAQ ## What is the purpose of the post-join filter field on Join relations? -The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join. +`post_join_filter` is an optional filter on the output records of a join. It is applied after matching and any join-type–specific post-processing (e.g., null-padding unmatched rows for outer joins, emitting unmatched rows for anti joins) are complete. It is semantically equivalent to placing a separate `Filter` relation directly above the join. Because it does not participate in matching, it has no effect on which rows are considered matched or unmatched for the purpose of outer, anti, semi, or mark join semantics. When omitted it defaults to `true`. -See the example [here](https://facebookincubator.github.io/velox/develop/joins.html#hash-join-implementation) that highlights how the post-join filter behaves differently than a Filter relation in the case of a left join. +This is distinct from predicates that *do* participate in match determination: + +- In `JoinRel`, all match predicates belong in `expression`. +- In `HashJoinRel` and `MergeJoinRel`, equijoin predicates go in `keys` and any remaining match predicates go in `residual_expression`. ## Why does the project relation keep existing columns? diff --git a/site/docs/relations/logical_relations.md b/site/docs/relations/logical_relations.md index 9d08e37f3..dcbbd2159 100644 --- a/site/docs/relations/logical_relations.md +++ b/site/docs/relations/logical_relations.md @@ -237,8 +237,8 @@ The join operation will combine two separate inputs into a single output, based | Inputs | 2 | | Outputs | 1 | | Property Maintenance | Distribution is maintained. Orderedness is empty post operation. Physical relations may provide better property maintenance. | -| Input Order | The input order is the left input followed by the right input. All field references of [Join Properties](#join-properties) are over this order. | -| Direct Output Order | For semi joins and anti joins, the emit order is either left or right only. For mark joins, the emit order is either left or right with a "mark" column appended at the end. See [Join Types](#join-types) for detail. Otherwise, the same as `Input Order`. | +| Input Order | The input order is the left input followed by the right input. All field references of [Join Properties](#join-properties) except Post-Join Filter are over this order. | +| Direct Output Order | For semi joins and anti joins, the emit order is either left or right only. For mark joins, the emit order is either left or right with a "mark" column appended at the end. See [Join Types](#join-types) for detail. Otherwise, the same as `Input Order`. All field references of Post-Join Filter are over this order. | ### Join Properties @@ -247,7 +247,7 @@ The join operation will combine two separate inputs into a single output, based | Left Input | A relational input. | Required | | Right Input | A relational input. | Required | | Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. Field references correspond to the input order of the data. | Required. Can be the literal True. | -| Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional | +| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](#filter-operation) directly above the join. Does not influence which rows are considered matches. Field references correspond to the direct output order of the join operation. | Optional, defaults to True. | | Join Type | One of the join types defined below. | Required | ### Join Types diff --git a/site/docs/relations/physical_relations.md b/site/docs/relations/physical_relations.md index e98c2f7ae..4b05890d7 100644 --- a/site/docs/relations/physical_relations.md +++ b/site/docs/relations/physical_relations.md @@ -2,9 +2,9 @@ There is no true distinction between logical and physical operations in Substrait. By convention, certain operations are classified as physical, but all operations can be potentially used in any kind of plan. A particular set of transformations or target operators may (by convention) be considered the "physical plan" but this is a characteristic of the system consuming substrait as opposed to a definition within Substrait. -## Hash Equijoin Operator +## Hash Join Operator -The hash equijoin join operator will build a hash table out of one input (default `right`) based on a set of join keys. It will then probe that hash table for the other input (default `left`), finding matches. +The hash join operator will build a hash table out of one input (default `right`) based on a set of join keys. It will then probe that hash table for the other input (default `left`), finding matches, then use `residual_expression` to determine whether the matches are true matches. | Signature | Value | | -------------------- | ----------------------------------------------------------------- | @@ -14,7 +14,7 @@ The hash equijoin join operator will build a hash table out of one input (defaul | Input Order | Same as the [Join](logical_relations.md#join-operation) operator. | | Direct Output Order | Same as the [Join](logical_relations.md#join-operation) operator. | -### Hash Equijoin Properties +### Hash Join Properties | Property | Description | Required | | ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------- | @@ -23,7 +23,8 @@ The hash equijoin join operator will build a hash table out of one input (defaul | Build Input | Specifies which input is the `Build`. | Optional, defaults to build `Right`, probe `Left`. | | Left Keys | References to the fields to join on in the left input. | Required | | Right Keys | References to the fields to join on in the right input. | Required | -| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. | +| Residual Expression | An optional boolean expression evaluated on each candidate key-match to determine whether it is a true match. Use this for join predicates that cannot be expressed as equijoin keys (e.g., range or inequality conditions). This expression participates in join-type semantics (outer, anti, mark, etc.). | Optional, defaults to True. | +| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](logical_relations.md#filter-operation) directly above the join. Does not influence which rows are considered matches. All field references are over direct output order. | Optional, defaults to True. | | Join Type | One of the join types defined in the Join operator. | Required | ## NLJ (Nested Loop Join) Operator @@ -47,9 +48,9 @@ The nested loop join operator does a join by holding the entire right input and | Join Expression | A boolean condition that describes whether each record from the left set "match" the record from the right set. | Optional. Defaults to true (a Cartesian join). | | Join Type | One of the join types defined in the Join operator. | Required | -## Merge Equijoin Operator +## Merge Join Operator -The merge equijoin does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion. +The merge join does a join by taking advantage of two sets that are sorted on the join keys. This allows the join operation to be done in a streaming fashion. Once the join keys are matched, then use `residual_expression` to determine whether the matches are true matches. | Signature | Value | | -------------------- | ----------------------------------------------------------------- | @@ -67,7 +68,8 @@ The merge equijoin does a join by taking advantage of two sets that are sorted o | Right Input | A relational input. | Required | | Left Keys | References to the fields to join on in the left input. | Required | | Right Keys | References to the fields to join on in the right input. | Required | -| Post Join Predicate | An additional expression that can be used to reduce the output of the join operation post the equality condition. Minimizes the overhead of secondary join conditions that cannot be evaluated using the equijoin keys. | Optional, defaults true. | +| Residual Expression | An optional boolean expression evaluated on each candidate key-match to determine whether it is a true match. Use this for join predicates that cannot be expressed as equijoin keys (e.g., range or inequality conditions). This expression participates in join-type semantics (outer, anti, mark, etc.). | Optional, defaults to True. | +| Post-Join Filter | An optional boolean condition applied to the output of the join. Semantically equivalent to placing a [Filter](logical_relations.md#filter-operation) directly above the join. Does not influence which rows are considered matches. All field references are over direct output order. | Optional, defaults to True. | | Join Type | One of the join types defined in the Join operator. | Required | ## Exchange Operator