feat: introduce lateral join in the JoinRel for a correlated subquery evaluation #973
yongchul wants to merge 6 commits into substrait-io:main
vbarua
left a comment
I'm not super familiar with LATERAL in SQL, but if I were to rephrase the problem we're trying to solve, it's that to express LATERAL-style joins in a join like:
JoinRel
/ \
LHS RHS
we need to be able to reference LHS outputs in the RHS (or vice-versa).
This type of capability has come up before for me in the context of dynamic filtering, where we need to be able to reference fields from other join inputs in the field at substrait/proto/substrait/algebra.proto (line 109 in dc27654) to express a (potential) data dependency.
I'm a little wary of re-using Outer References like you've suggested, because as I understand it, nothing about the construction we're talking about prevents putting actual subqueries in the subtree between the Join relation and where the field is referenced.
To avoid this ambiguity, would it make sense to have a specialized reference type like JoinReference in the same vein as OuterReference and LambdaParameterReference?
// When true, the right input is evaluated once per row of the left input
// (lateral join / correlated subquery). The right input may reference fields
// from the current left row using FieldReference with OuterReference as the
// root_type and steps_out = 1.
How does this work if there are also proper subqueries in the right input? I don't think a reference with steps_out = 1 in the RHS will always point to the join.
If there is a subquery, references inside it will need to use steps_out = 2, or N as appropriate for the nesting level. Yes, 1 only works when the reference is not nested inside a subquery.
JoinRel
/ \
LHS FilterRel
    |
    AND ( outerRef {steps_out=1, 0} == fieldRef[1], // outer reference to LHS column
          EXISTS ( fieldRef[2], subquery {
              ... Filter { fieldRef[7] == outerRef {steps_out=2, 2} } } ) ) // outer reference to LHS column; steps_out is 2 now
The language should be updated. Good catch!
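To make the depth bookkeeping concrete, here is a minimal Python sketch of offset-based resolution. It is not Substrait code: `OuterRef`, `resolve`, and the stack of bound rows are hypothetical names chosen for illustration, and the sketch assumes each correlation boundary pushes exactly one row onto a stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OuterRef:
    """Hypothetical stand-in for an offset-based outer reference."""
    steps_out: int  # number of correlation boundaries to cross (1 = nearest)
    field: int      # column offset in the row bound at that boundary


def resolve(ref, outer_rows):
    """Resolve a reference against the rows bound at each enclosing
    correlation boundary, ordered outermost first, innermost last."""
    if not 1 <= ref.steps_out <= len(outer_rows):
        raise LookupError(
            f"steps_out={ref.steps_out} exceeds nesting depth {len(outer_rows)}"
        )
    return outer_rows[-ref.steps_out][ref.field]


lhs_row = ("lhs0", "lhs1", "lhs2")
filter_row = ("f0", "f1", "f2")

# Directly under the lateral join, only the left row is bound.
under_join = [lhs_row]
# Inside the EXISTS subquery, the filter's current row is bound on top of it.
inside_exists = [lhs_row, filter_row]

print(resolve(OuterRef(steps_out=1, field=0), under_join))     # LHS column 0
print(resolve(OuterRef(steps_out=2, field=2), inside_exists))  # LHS column 2
print(resolve(OuterRef(steps_out=1, field=2), inside_exists))  # filter column 2
```

Under these assumptions the same LHS row is reached with steps_out = 1 in one place and steps_out = 2 in another, which is the context dependence raised in the question above.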
I just removed the use of steps_out in favor of the id-based approach (#1031) for the lateral join. Let me know what you think. We can add back the description of how to use it with steps_out.
No, not vice-versa. The correlation typically goes in one direction (from left to right) due to scoping.
I wish we could just ditch the current outer reference, honestly, and do it differently, with something more like a global id reference, to avoid the hair-pulling wiring of outer references in the other PR. I don't understand "putting actual subqueries in the subtree join relation"; could you elaborate on your concern? Or it may be better to do a quick call if you have time. Anyhow, each subquery adds a "depth" (today only in the context of a set predicate), and the lateral join proposed in this PR adds another depth to the nested stack of schemas. So I don't see any problem.
I don't know. OuterReference is OuterReference: you are referencing something outside of your usual column resolution scope. LambdaParameterReference makes sense because there is a clear binding point. OuterReference is generic and based on the "correlation" boundary: either the Rel containing the set predicate or the left input Rel of the lateral join in this PR. If you prefer something more explicit, like LambdaParameterReference, we could explore id-based declaration and binding. Basically, the idea is for outer references to use a name- or identifier-based approach rather than an offset-based approach. Yes, it is a departure from the basic referencing scheme, but it makes it much simpler to reuse expressions and bind the correct columns/fields without the hassle of massaging offsets and depths, which depend on context. I can draft this as well and we can see what it looks like.
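One way to picture the identifier-based binding described in this comment is the Python sketch below. It is an illustration under assumptions, not the design in #1031: `IdOuterRef`, `resolve_by_id`, and the dictionary of bound rows are hypothetical, and the only names taken from this PR are `RelCommon.id` and `OuterReference.id_reference`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdOuterRef:
    """Hypothetical stand-in for an id-based outer reference."""
    id_reference: int  # id declared on the relation that binds the row
    field: int         # column offset within that relation's row


def resolve_by_id(ref, bound_rows):
    """Look up the row bound by the relation whose id matches the reference.

    bound_rows maps a relation id to the row currently bound by that relation.
    """
    try:
        row = bound_rows[ref.id_reference]
    except KeyError:
        raise LookupError(f"no relation with id {ref.id_reference} is in scope") from None
    return row[ref.field]


LHS_ID = 42     # assumed id on the join's left input
FILTER_ID = 43  # assumed id on the relation containing the EXISTS predicate

ref = IdOuterRef(id_reference=LHS_ID, field=2)

under_join = {LHS_ID: ("lhs0", "lhs1", "lhs2")}
inside_exists = {LHS_ID: ("lhs0", "lhs1", "lhs2"), FILTER_ID: ("f0", "f1", "f2")}

# The same reference object resolves identically at either nesting level.
print(resolve_by_id(ref, under_join))
print(resolve_by_id(ref, inside_exists))
```

In this sketch an expression can be moved into or out of a nested subquery without rewriting its references, at the cost of requiring ids to be unique within the plan.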
- Add bool lateral field to CrossRel for lateral cross product semantics
- Add RelCommon.id and OuterReference.id_reference from PR substrait-io#1031
- Update JoinRel and CrossRel lateral comments to use id_reference
- Update logical_relations.md with lateral docs for both JoinRel and CrossRel

Depends-on: substrait-io#1031
Background
A SQL or relational query plan may contain subqueries. Some subqueries can easily be decorrelated (i.e., rewritten as joins), but for complex subqueries this rewriting is non-trivial.
ApplyRel is one way to model generic correlated subquery execution, and it is implemented in multiple systems (e.g., Google Spanner, Oracle, the SQL Server family).
The semantics of apply are that the subquery is executed once per input row, rather than operating on a set of rows at a time as a join does. The subquery can reference columns of the input row as outer references (thus each subquery execution is correlated with one input row).
The apply operation is inherently expensive, so it is typically used in special contexts such as index lookup joins, or as a fallback when a system cannot decorrelate the operation.
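The per-row semantics can be sketched in a few lines of Python. This is an illustration only, not how any of the systems named above implement apply: `lateral_join`, `right_for`, `keep_unmatched`, and the sample data are all hypothetical.

```python
def lateral_join(left_rows, right_for, keep_unmatched=False, right_width=0):
    """Evaluate the right side once per left row (apply / lateral semantics).

    right_for(left_row) plays the role of the correlated subquery: it receives
    the current left row and returns an iterable of right rows.
    keep_unmatched=True mimics a left outer lateral join, padding with None.
    """
    for left in left_rows:
        matched = False
        for right in right_for(left):
            matched = True
            yield left + right
        if keep_unmatched and not matched:
            yield left + (None,) * right_width


customers = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, 50), (1, 70), (2, 20)]  # (customer_id, amount)


def largest_order(customer):
    """Correlated subquery: the single largest order amount for this customer."""
    amounts = [(amount,) for cid, amount in orders if cid == customer[0]]
    return sorted(amounts, reverse=True)[:1]


inner = list(lateral_join(customers, largest_order))
outer = list(lateral_join(customers, largest_order, keep_unmatched=True, right_width=1))
print(inner)
print(outer)
```

The cost concern is visible in the structure: the right side is re-evaluated for every left row, so without an index or decorrelation the work grows with the product of the two inputs.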
Proposal
Extend JoinRel with a lateral flag to represent a lateral join.
The LATERAL keyword was introduced in SQL:1999 as part of the table reference. When a table source is wrapped with LATERAL, the subquery MAY reference columns and tables preceding the LATERAL table reference, which allows correlated subqueries.
Given this scoping of tables and references, we propose to extend JoinRel with a lateral flag to denote a lateral join, with the right input acting as the correlated subquery, which can reference fields from the left input.
This way, we reuse all the join semantics automatically without introducing a dedicated Rel.
The same change was made to CrossRel.
The change is backward compatible.
This PR depends on #1031.
AI disclaimer
This PR was assisted by Claude Opus 4.6.