add join compatibility based on physical_location to destination configurations#3905
add join compatibility based on physical_location to destination configurations#3905Travior wants to merge 4 commits into
physical_location to destination configurations#3905Conversation
Deploying with
|
| Status | Name | Latest Commit | Updated (UTC) |
|---|---|---|---|
| ❌ Deployment failed View logs |
docs | a5c8b91 | May 19 2026, 11:02 AM |
ce0fc0d to
274bcbb
Compare
rudolfix
left a comment
There was a problem hiding this comment.
- I think you wanted to name
physical_destination->physical_location? fingerprintcontract was changed. let's discuss that. impact is potentially pretty big- why sqlalchemy fingerprint test is passing? was it compatible or we do not test it at all?
- we need another ticket to extend #3747 - to allow to ATTACH one duckdb client to another. I can take care of that
- Test setup in
load/pipelineis great. let's reuse it for end to end tests for #3747
| @@ -498,12 +498,11 @@ def get_dataset_sql_client(dataset: dlt.Dataset) -> SqlClientBase[Any]: | |||
|
|
|||
|
|
|||
| def is_same_physical_destination(dataset1: dlt.Dataset, dataset2: dlt.Dataset) -> bool: | |||
There was a problem hiding this comment.
let's deprecate this (we have decorator somewhere) and then remove when dlthub is fixed
|
|
||
| def physical_destination(self) -> str: | ||
| """Returns the database file path or ':memory:'.""" | ||
| if self.credentials and self.credentials.database: |
There was a problem hiding this comment.
duckdb is an interesting case. we could easily join any database with another (including this new "hosted" duckdb available via host:port) if we do ATTACH command. that is something to do in the future ie. by extending #3747 implementation - to do ATTACH on demand when foreign dataset is detected.
| # FilesystemConfiguration.fingerprint() (which hashes the raw bucket URL) | ||
| # over DestinationClientConfiguration.fingerprint() (which hashes | ||
| # physical_destination()). Do not remove. | ||
| return DestinationClientStagingConfiguration.fingerprint(self) |
There was a problem hiding this comment.
please tests this. MRO can change ie. someone may add another base class to it
| return "" | ||
|
|
||
| if self.is_local_path(self.bucket_url): | ||
| return "" |
There was a problem hiding this comment.
I think we should return normalized (abs) path. AFAIK fingerprint didn't do it - it is by design forcing all local paths into a single fingerprint not to inflate telemetry destinations. here it is not a concern
| can access multiple storage backends in a single query, so join | ||
| compatibility is determined by the engine, not by the storage location. | ||
| """ | ||
| if isinstance(other, FilesystemDestinationClientConfiguration): |
There was a problem hiding this comment.
NOTE: this will require extension of #3747 - ATTACH foreign duckdb with all its views and tables to current duckdb in the dataset and using 3 part qualification when binding queries. should be "easily" doable. not this ticket though!
|
|
||
| def can_join_with(self, other: DestinationClientConfiguration) -> bool: | ||
| """Returns True for MotherDuck configs with the same token.""" | ||
| if not isinstance(other, MotherDuckClientConfiguration): |
There was a problem hiding this comment.
good for now. but I'm pretty sure we'll be able to join databases from different accounts via ATTACH
| return host | ||
| return "" | ||
|
|
||
| def can_join_with(self, other: DestinationClientConfiguration) -> bool: |
There was a problem hiding this comment.
you could eliminate some code below (IMO) via 2 helper classes:
- we can join two physical destinations
- as above but also database name must mach
helper 1 can call helper 2. also helper 1 looks like super().can_join_with() where we just compare physical locations.
|
|
||
| __recommended_sections__: ClassVar[Sequence[str]] = (known_sections.DESTINATION, "") | ||
|
|
||
| def physical_destination(self) -> str: |
There was a problem hiding this comment.
physical_location? the original called for location
| """Returns a non-secret destination identity, or "" when unavailable.""" | ||
| return "" | ||
|
|
||
| def fingerprint(self) -> str: |
There was a problem hiding this comment.
I know that it would be really good to share some code with fingerprint() but this one is used by telemetry and if we do that - several reports will be affected. let's talk about it
| pytestmark = pytest.mark.essential | ||
|
|
||
|
|
||
| SAME_DATABASE_JOIN_COMPATIBILITY_CONFIGS = destinations_configs( |
There was a problem hiding this comment.
this is good. can be reused for for external/foreign join test!
|
Automated review brought this (I didn't verify):
|
These are the currently enforced rules, together with the rules that diverge from the original ticket:
Another note: the accessor is called
physical_destination(), as big query already defines a config property calleddestination.Here are the tables formatted in Markdown:
Compatibility Overview vs Spec
can_join_withRuleSQLAlchemy Overview vs Spec
Additional Current Definitions Not In Spec
Still TODO/unclear:
destination.snowflake.join_compatibility_database