feat: clarify the expected behavior, and rationale, of the post join filter#807

Closed
westonpace wants to merge 1 commit into substrait-io:main from westonpace:feat/clarify-post-join-filter

Conversation

@westonpace
Member

@westonpace westonpace commented Apr 23, 2025

The post join filter has very little explanation. It can also be confusing because, from a purely logical perspective, it is possible to see the post join filter as redundant. This PR attempts to clarify the description of the post join filter.


- | Post-Join Filter | A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition. | Optional |
+ | Post-Join Filter | A boolean condition to be applied to each potential match between the left and right
+ inputs. If it evaluates to false then the potential match is not considered a match. A join relation with
+ Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y. | Optional |
Contributor

Thank you! Much better than my local draft! :)

Two more things.

  • Align Hash/MergeJoin post-join filter description with this. We could refer to JoinRel there and keep only what's different.
  • Should this be Optional, default True like hash/merge join?

Member Author

> Align Hash/MergeJoin post-join filter description with this. We could refer to JoinRel there and keep only what's different.

Can you expand on what you mean here? The PR does currently update the hash/merge join descriptions. I don't include the statement "A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y" because it is not true for hash/merge join (the join expression for these relations is a series of equality conditions).

Contributor

I meant the language of the description. The way you describe it makes it more explicit that the post_join_filter IS part of the join condition, by stating what "matches" and what "does not match". It is not there to try to reduce the output.

Also, can we drop Equi from HashEquiJoin? :)

@drin
Member

drin commented Apr 23, 2025

> A join relation with Join Expression X and Post-Join Filter Y is equivalent to a join relation with Join Expression X AND Y

Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, I think it'd be nice to add a comment in the .proto file to the effect of "post_join_filter should be resolved in conjunction (AND) with expression."

Comment on lines +258 to +260
// The post-join filter is a filter that is applied to the result of the join before an output
// record is produced. If the filter evaluates to false then the record is not considered a
// match.
Member

I think this is ambiguous for functions that aggregate over many tuples. I think a "simple" example is:

  1. the post_join_filter is a comparison (lte) that uses a window function (count)
  2. the expression is a predicate with selectivity between 0 and 100%
  3. the expression produces many tuples from one input for a single tuple of the other input

(2) and (3) are necessary for ambiguous scenarios to occur and (1) is where the ambiguity is expressed.

Member

@drin drin Apr 23, 2025

So, I think "applied to the result of the join before an output record is produced" lends itself to being misunderstood, because the "result" of the join sounds like the result of applying expression. To be accurate to "equivalent to a join relation with Join Expression X AND Y", you must evaluate post_join_filter on the inputs to expression, even if you only evaluate its truthiness on joined records for which expression evaluates to true.

Maybe something more like "applied to the inputs of the join, before an output record is produced" is better and equally concise?

Contributor

Agree that it's better to clarify that the predicates are evaluated over the inputs. I like @drin's suggestion.

Contributor

+1 to "evaluated over the inputs". As for when it's applied, I'm still not too sure what the intended behavior is, tbh. Let's say you're joining two tables a LEFT JOIN b ON ... with a post-join filter that has a.Col1 = b.Col2. Is the a.Col1 = b.Col2 expression also supposed to follow join type semantics and leave the unmatched records from the left side in the output? Or will the result be as if it had been an inner join instead of a left join?

Member Author

@westonpace westonpace Apr 23, 2025

I get confused very easily when talking about when this filter is applied. Here is my understanding, in naive pseudocode, of how it is applied. I'm omitting right joins, full outer joins, single joins, and mark joins for simplicity.

for left_record in left_records:
  has_match = False
  for right_record in right_records:
    if join_expression(left_record, right_record) and post_join_filter(left_record, right_record):
      has_match = True
      if join_type == Inner or join_type == Left:
        emit(combine(left_record, right_record))
  if has_match and join_type == LeftSemi:
    emit(left_record)
  elif not has_match and join_type == LeftAnti:
    emit(left_record)
  elif not has_match and join_type == Left:
    emit(combine(left_record, null))

If someone has an alternate proposal, is it possible to share your own pseudocode representation?
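
As a rough illustration, the pseudocode above can be turned into a small runnable Python sketch. This is not taken from the Substrait spec or from any consumer; the names `JoinType`, `nested_loop_join`, and `right_width`, and the use of tuples as records, are assumptions made only for this example.

```python
from enum import Enum, auto


class JoinType(Enum):
    INNER = auto()
    LEFT = auto()
    LEFT_SEMI = auto()
    LEFT_ANTI = auto()


def nested_loop_join(left, right, join_type, join_expression,
                     post_join_filter=lambda l, r: True, right_width=1):
    """Naive nested-loop join where the filter takes part in matching.

    Records are tuples; `right_width` is the number of right-side columns,
    used to null-pad unmatched left records (hypothetical parameter).
    """
    out = []
    for l in left:
        has_match = False
        for r in right:
            # Both predicates are evaluated on the pair of input records.
            if join_expression(l, r) and post_join_filter(l, r):
                has_match = True
                if join_type in (JoinType.INNER, JoinType.LEFT):
                    out.append(l + r)
        if join_type is JoinType.LEFT_SEMI and has_match:
            out.append(l)
        elif join_type is JoinType.LEFT_ANTI and not has_match:
            out.append(l)
        elif join_type is JoinType.LEFT and not has_match:
            out.append(l + (None,) * right_width)
    return out


# Example data: a(id), b(id, other_field)
a = [(1,), (2,)]
b = [(1, "x"), (2, None)]


def ids_equal(l, r):
    return l[0] == r[0]


def other_not_null(l, r):
    return r[1] is not None


left_rows = nested_loop_join(a, b, JoinType.LEFT, ids_equal, other_not_null, right_width=2)
anti_rows = nested_loop_join(a, b, JoinType.LEFT_ANTI, ids_equal, other_not_null)
```

Under this reading, the left record with id 2 has no match, so the left join null-pads it and the anti join emits it.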

Contributor

@tokoko the predicates in the join condition do not follow the join type. The join type comes into play depending on whether there is a matching row (i.e., an intersection) or not. To behave correctly, outer joins and anti-semi joins must establish that no row matches according to JoinRel.expression AND JoinRel.post_join_filter before deciding whether to produce null-padded rows (outer) or include the row in the output (anti-semi).

Contributor

makes sense. For JoinRel, can't we write something like "post-join filter is supposed to be evaluated as if it's part of the join expression" or something similar? It would be a lot simpler to understand imho rather than thinking through when during the operation it's supposed to be applied/evaluated.

Contributor

@tokoko that's why I initially proposed to drop post_join_filter from JoinRel in the slack discussion. :)

Member

@drin drin Apr 23, 2025

I agree with that pseudocode weston. My intention is to make the wording clearly reflect that post_join_filter(left_record, right_record) is valid and post_join_filter(combine(left_record, right_record)) is invalid.

Note (for completeness) that my naive reading of "post join" was incorrect and would have been to implement:

# above pseudocode here
...
if post_join_filter(emitted_record):
  really_emit(emitted_record)
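
To make the gap between the two readings concrete, here is a minimal Python sketch that runs a left join both ways on the same inputs. The helper `left_join` and the sample tables are hypothetical and exist only for this comparison; they are not part of the spec or of any consumer.

```python
def left_join(left, right, on, right_width):
    """Left outer join over tuples; `on` is evaluated on (left, right) pairs."""
    out = []
    for l in left:
        matches = [l + r for r in right if on(l, r)]
        # Null-pad the left record when nothing on the right matched.
        out.extend(matches or [l + (None,) * right_width])
    return out


a = [(1,), (2,)]              # columns: id
b = [(1, "x"), (2, None)]     # columns: id, other_field


def ids_equal(l, r):
    return l[0] == r[0]


def matches_with_filter(l, r):
    return ids_equal(l, r) and r[1] is not None


# Reading A: the filter is evaluated on the input pair and affects matching.
during_join = left_join(a, b, matches_with_filter, right_width=2)

# Reading B (the "naive" reading): the filter runs on already-emitted records.
joined = left_join(a, b, ids_equal, right_width=2)
after_join = [row for row in joined if row[2] is not None]
```

With this data, reading A keeps a null-padded row for id 2, while reading B drops that row entirely, so the two readings are observably different for outer joins.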

Contributor

Yup, 'post_join' really tripped me up and is the reason I started the thread. The systems I have worked on used "residual join condition/predicate" rather than "post".

@jacques-n
Contributor

It's difficult to follow the threads in this discussion.

One can think of a join with a post join filter as a composite operation that is a join followed by a filter relation. It is an entirely valid translation to take a post join filter out of the join and put it in a filter relation directly afterwards, and vice versa.

The post join filter does not logically interact with the join type at all. The composite exists because many systems have it and it can be a beneficial physical pattern. The reason it has to be stated separately from the join predicate is to have covering behavior of all possible filter conditions. I always have to remind myself of which conditions can and cannot be moved into a join evaluation clause.

I'm supportive of clarifying the text if people are unclear as to what post join filter means.

@westonpace
Member Author

westonpace commented May 5, 2025

> One can think of a join with a post join filter as a composite operation that is a join followed by a filter relation. It is an entirely valid translation to take a post join filter out of the join and put it in a filter relation directly afterwards, and vice versa.

@jacques-n

This is not the conclusion we came to. I believe the content of the PR is still accurate with the threads, so you can just review the content and ignore the discussion.

For example:

SELECT * FROM a LEFT OUTER JOIN b ON a.id = b.id AND b.other_field IS NOT NULL

As it stands, this query will emit at least one row for each row in a. If the filter is moved into the WHERE clause then it can emit fewer rows (at most the number of rows emitted by an inner join).

From your GPT link this matches:

> Conditions on the non-preserved side that would otherwise eliminate rows that should remain when there's no match
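
The row-count difference described here can be checked with a short script using Python's built-in `sqlite3` module. The table contents are made up for illustration; only the two query shapes (condition in the ON clause versus in the WHERE clause) come from the comment above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (id INTEGER, other_field TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 'x'), (2, NULL);
    """
)

# Condition kept in the ON clause: every row of a survives.
in_on = conn.execute(
    "SELECT a.id, b.id, b.other_field "
    "FROM a LEFT OUTER JOIN b ON a.id = b.id AND b.other_field IS NOT NULL "
    "ORDER BY a.id"
).fetchall()

# Same condition moved to WHERE: null-padded and NULL-valued rows are dropped.
in_where = conn.execute(
    "SELECT a.id, b.id, b.other_field "
    "FROM a LEFT OUTER JOIN b ON a.id = b.id "
    "WHERE b.other_field IS NOT NULL "
    "ORDER BY a.id"
).fetchall()

conn.close()
```

Here `in_on` has three rows (two of them null-padded) and `in_where` has a single row, which is the same result an inner join with both conditions would give.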

@drin
Member

drin commented May 5, 2025

There is a thread that discusses the description that would go in the website: discussion on website description

I think further discussion on this can be deferred until agreement on post_join_filter semantics is finalized.

Then, there's a thread discussing the comment in the .proto file for post_join_filter: discussion on comment in spec

This discussion assumes that Weston's assertion in the description is the correct semantics of post_join_filter. As Weston points out, that description is in contradiction to Jacques's comment.

> This is not the conclusion we came to

In this PR, we never collectively discussed what it should be versus what it is. The description says:

> The post join filter has very little explanation... from a purely logical perspective, it is possible to see the post join filter as redundant.

And I asked:

> Is this strictly true? As in a consumer must resolve both expressions on the same inputs? If so, ...

One thing that was referenced in slack is the substrait FAQ: "The post-join filter on the various Join relations is not always equivalent to an explicit Filter relation AFTER the Join." This FAQ then references velox hash-join implementation, which says: "Filter is optional. If specified it can be any expression over the results of the join."

It occurs to me that the FAQ says "post-join filter... is not always equivalent to an explicit Filter relation AFTER the join," yet the referenced velox documentation says "If specified, it can be any expression over the results of the join." These seem directly contradictory to me, since "the results of the join" sounds quite a bit like "AFTER the join".

@github-actions

This PR has been automatically marked as stale because it has not had
recent activity. It will be closed in 7 days if no further activity occurs.

@github-actions github-actions Bot added the Stale label Dec 18, 2025
@github-actions

This PR has been automatically closed due to inactivity.
If you believe this was closed in error, please reopen it.

@yongchul
Contributor

Resurrecting this PR as I really can't get past this post join filter. 😆 @westonpace do you mind if I carry this forward?

@westonpace
Member Author

Oh yes, please carry on.

@github-actions github-actions Bot removed the Stale label Apr 11, 2026
@jacques-n
Contributor

> This is not the conclusion we came to...

I think the conclusion and content updates are redefining the original purpose of this field when we should simply add another field. The changes do not align with my intentions when I added the field.

I think the spec content was clear (albeit a bit terse). The key bit is (my bold added):

> A boolean condition to be applied to each result record after the inputs have been joined, yielding only the records that satisfied the condition.

IOW, this filter is done after the join algorithm is completed. It does not interplay with join's matching decision.

To clarify things, I think there are two different concepts being discussed:

  1. A filter that could be run in a separate relation but just happens to be included in the join operator.
  2. A filter that is part of the join matching condition and influences matching.

This patch tries to redefine the behavior as (2). My original intention was (1).

Both (1) and (2) are useful. While I'm not an originalist, I think it is important to point out that the addition of a field for (2) only makes sense for HashJoinRel. It doesn't make sense for JoinRel or MergeJoinRel (since those already carry arbitrary expressions that are applied as part of matching). I would further say that the name isn't quite right if it applies as part of matching, as that isn't "post join", it's during join. If we stick with post_join_filter equating to (1), the field serves a useful purpose in all three of JoinRel, MergeJoinRel and HashJoinRel.

My recommendation is to improve the content to clarify that the purpose of this field is (1) and to introduce a separate field for (2) in HashJoinRel if so desired. If we do that, the following statements in this patch are incorrect:

  • For purely logical plans, this field is redundant, as the post_join_filter could be included in
    expression
  • A boolean condition to be applied to each potential match between the left and right
    inputs. If it evaluates to false then the potential match is not considered a match

@yongchul
Contributor

yongchul commented Apr 12, 2026

@jacques-n Thanks for the much-needed clarification! So, the original intent was really (join - filter), i.e., a filter that happens to be squeezed into the join. This was my interpretation when I first saw the name, but unfortunately the FAQ was implying that this is actually the residual join condition (my way of describing something that is part of the join condition but not natively incorporated by the join algorithms, i.e., non-equality predicates).

Having said that, I would much prefer to do the following.

  1. keep the post_join_filter as post_join_filter with much clearer documentation.
  2. introduce a residual_expression field to HashJoinRel and MergeJoinRel to represent the extra predicates not accounted for by the algorithms.
  3. rewrite or remove the confusing FAQ entry (or refer to the latest documentation)

How does that sound? @jacques-n @drin @tokoko @westonpace
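
For readers unfamiliar with the term, the "residual join condition" described in this comment can be sketched as an extra predicate checked during the probe phase of a hash join. This is only an illustration under assumptions: `hash_left_join`, its parameters, and the sample data are hypothetical, tuples stand in for records, and SQL NULL-key semantics are ignored.

```python
from collections import defaultdict


def hash_left_join(left, right, left_key, right_key, residual=None, right_width=1):
    """Left outer hash join with an optional residual predicate.

    The hash table only handles the equality keys; `residual` (hypothetical
    name) covers whatever part of the match condition the keys cannot express.
    """
    # Build phase: index the right input by its equality key.
    table = defaultdict(list)
    for r in right:
        table[right_key(r)].append(r)

    out = []
    for l in left:
        matched = False
        # Probe phase: equal-key candidates still have to pass the residual.
        for r in table.get(left_key(l), []):
            if residual is None or residual(l, r):
                matched = True
                out.append(l + r)
        if not matched:
            out.append(l + (None,) * right_width)
    return out


orders = [(1, 50), (2, 500), (3, 5)]   # columns: customer_id, amount
limits = [(1, 100), (2, 100)]          # columns: customer_id, cap

rows = hash_left_join(
    orders,
    limits,
    left_key=lambda l: l[0],
    right_key=lambda r: r[0],
    residual=lambda l, r: l[1] <= r[1],  # non-equality part of the match condition
    right_width=2,
)
```

Because the residual is part of matching, the order for customer 2 (amount above the cap) is null-padded rather than removed, which is the behavior a filter applied after the join would not reproduce.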

@yongchul
Contributor

Created #1044, which keeps post_join_filter as a post-join filter and introduces residual_expression to incorporate non-equality join predicates into HashJoinRel and MergeJoinRel.

@westonpace westonpace closed this Apr 15, 2026