add join compatibility based on physical_location to destination configurations #3905

Open
Travior wants to merge 4 commits into devel from feat/join_compat

Conversation

Contributor

@Travior Travior commented Apr 29, 2026

The tables below list the currently enforced rules and flag where they diverge from the original ticket.
One more note: the accessor is called physical_destination() because BigQuery already defines a config property called destination.

Compatibility Overview vs Spec

| Destination | Current Location / Identity | Current can_join_with Rule | Divergence From Spec |
| --- | --- | --- | --- |
| Postgres | host:port | Same host:port and same database | Matches |
| Redshift | host:port | Same host:port and same database | Matches |
| Snowflake | account host | Same account host | Matches |
| BigQuery | project_id | Same project ID | Matches |
| MSSQL | host:port | Same host:port; database may differ | Minor: spec says host, implementation includes port |
| Synapse | host:port | Same host:port; database may differ | Minor: spec says inherited host, implementation includes port |
| ClickHouse | host:port | Same host:port; database may differ | Minor: spec says host, implementation includes port |
| Databricks | server_hostname | Same server hostname | Matches |
| Athena | region/catalog | Same AWS region and same data catalog | Diverges: spec says location is only aws_data_catalog; implementation also requires same AWS region |
| Dremio | host:port | Same host:port | Minor: spec says host, implementation includes port |
| DuckDB | database file path or :memory: | Same database path | Matches |
| MotherDuck | physical_destination() is empty; fingerprint/token used separately | Same non-empty access token | Diverges structurally: spec says location is access token and default same-location rule; implementation intentionally does not expose token as location and overrides can_join_with |
| Filesystem | remote: scheme://netloc; local: "" | Always True with another filesystem config | Matches |
| DuckLake | credential-free catalog identity + ducklake_name | Same catalog identity and same DuckLake name | Matches |

SQLAlchemy Overview vs Spec

| Dialect | Current Rule | Divergence From Spec |
| --- | --- | --- |
| postgresql | Same host:port and same database | Matches |
| mysql | Same host:port; database may differ | Matches |
| mssql | Same host:port; database may differ | Matches |
| oracle | Same host:port; database may differ | Matches |
| db2 | Same host:port; database may differ | Matches |
| sqlite | Same database file path | Matches |
| unknown/other | Same host:port and same database | Matches |
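
To make the dialect rules above concrete, here is a minimal sketch of how they could be expressed as one function. `SqlaTarget`, `CROSS_DATABASE_DIALECTS` and `can_join` are hypothetical names for illustration and not taken from this PR; the sketch also assumes that targets on different dialects are never joinable, which the table does not state.

```python
from dataclasses import dataclass
from typing import Optional

# Dialects whose rule in the table is "same host:port; database may differ".
CROSS_DATABASE_DIALECTS = frozenset({"mysql", "mssql", "oracle", "db2"})


@dataclass(frozen=True)
class SqlaTarget:
    """Hypothetical stand-in for the parts of a SQLAlchemy URL the rules use."""

    dialect: str
    database: str
    host: Optional[str] = None
    port: Optional[int] = None


def can_join(a: SqlaTarget, b: SqlaTarget) -> bool:
    # Assumption (not in the table): different dialects are never joinable.
    if a.dialect != b.dialect:
        return False
    if a.dialect == "sqlite":
        # sqlite has no server; the database file path is the identity.
        return a.database == b.database
    same_server = (a.host, a.port) == (b.host, b.port)
    if a.dialect in CROSS_DATABASE_DIALECTS:
        # The database may differ as long as the server is the same.
        return same_server
    # postgresql and unknown dialects use the strict rule.
    return same_server and a.database == b.database
```

Defaulting unknown dialects to the strict rule keeps the check conservative: a false "not joinable" is cheaper than a join that fails at query time.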

Additional Current Definitions Not In Spec

| Destination | Current Rule |
| --- | --- |
| LanceDB | Same LanceDB URI and same dataset_separator; dataset name may differ |
| Lance | Same catalog root and same bound dataset name |
| Weaviate | Never joinable |
| Qdrant | Never joinable |

Still TODO/unclear:

  • CI secrets.toml (or config?) needs destination.snowflake.join_compatibility_database
  • Integration tests intentionally don't execute explicit joins yet, because the query APIs don't properly support cross-destination/cross-database data access (e.g. the Snowflake qualified name doesn't include the database, and the filesystem client doesn't expose an option to register a "foreign" file as a DuckDB view)
  • Should we explicitly add more databases to the tests (where the destination supports the join, like Snowflake)? If so, we need to manually set up additional databases/catalogs for those providers


cloudflare-workers-and-pages Bot commented Apr 29, 2026

Deploying with Cloudflare Workers

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ❌ Deployment failed | docs | a5c8b91 | May 19 2026, 11:02 AM |

@Travior self-assigned this Apr 29, 2026
@Travior force-pushed the feat/join_compat branch 3 times, most recently from ce0fc0d to 274bcbb, April 29, 2026 13:53
@burnash linked an issue May 11, 2026 that may be closed by this pull request
@Travior force-pushed the feat/join_compat branch from ededae6 to ca6f0d4, May 19, 2026 10:51
@Travior force-pushed the feat/join_compat branch from ca6f0d4 to a5c8b91, May 19, 2026 10:54
@Travior marked this pull request as ready for review May 19, 2026 12:57
@Travior requested a review from rudolfix May 19, 2026 12:58
Collaborator

@rudolfix rudolfix left a comment


  1. I think you wanted to name it physical_location rather than physical_destination?
  2. the fingerprint contract was changed. let's discuss that. the impact is potentially pretty big
  3. why is the sqlalchemy fingerprint test passing? was it compatible, or do we not test it at all?
  4. we need another ticket to extend #3747 - to allow ATTACHing one duckdb client to another. I can take care of that
  5. the test setup in load/pipeline is great. let's reuse it for end-to-end tests for #3747

Comment thread dlt/dataset/dataset.py
@@ -498,12 +498,11 @@ def get_dataset_sql_client(dataset: dlt.Dataset) -> SqlClientBase[Any]:


def is_same_physical_destination(dataset1: dlt.Dataset, dataset2: dlt.Dataset) -> bool:
Collaborator

let's deprecate this (we have decorator somewhere) and then remove when dlthub is fixed


def physical_destination(self) -> str:
    """Returns the database file path or ':memory:'."""
    if self.credentials and self.credentials.database:
Collaborator

duckdb is an interesting case. we could easily join any database with another (including the new "hosted" duckdb available via host:port) if we run an ATTACH command. that is something to do in the future, i.e. by extending the #3747 implementation to ATTACH on demand when a foreign dataset is detected.

# FilesystemConfiguration.fingerprint() (which hashes the raw bucket URL)
# over DestinationClientConfiguration.fingerprint() (which hashes
# physical_destination()). Do not remove.
return DestinationClientStagingConfiguration.fingerprint(self)
Collaborator

please test this. MRO can change, i.e. someone may add another base class to it
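
A minimal sketch of what such a test could look like, using made-up stand-in classes rather than the real dlt configuration hierarchy. `LocationConfig`, `BucketConfig` and `CombinedConfig` are hypothetical; the point is only that the test pins the expected hash directly, so it keeps passing (or fails loudly) regardless of how base classes are reordered.

```python
import hashlib


class LocationConfig:
    """Hypothetical base whose fingerprint hashes a physical location."""

    location = "host:5432"

    def fingerprint(self) -> str:
        return hashlib.sha256(self.location.encode()).hexdigest()


class BucketConfig:
    """Hypothetical base whose fingerprint hashes the raw bucket URL."""

    bucket_url = "s3://bucket/path"

    def fingerprint(self) -> str:
        return hashlib.sha256(self.bucket_url.encode()).hexdigest()


class CombinedConfig(LocationConfig, BucketConfig):
    def fingerprint(self) -> str:
        # Explicit call on purpose: the result must not depend on the MRO.
        return BucketConfig.fingerprint(self)


def test_fingerprint_pinned_to_bucket_url() -> None:
    config = CombinedConfig()
    expected = hashlib.sha256(config.bucket_url.encode()).hexdigest()
    # The override must keep hashing the bucket URL.
    assert config.fingerprint() == expected
    # Guard: plain MRO resolution would pick the location-based fingerprint,
    # so removing the override would change the result.
    assert super(CombinedConfig, config).fingerprint() != expected
```

The second assertion documents why the override exists: if it ever starts failing, the MRO changed and the explicit call may no longer be needed, or may need to point elsewhere.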

return ""

if self.is_local_path(self.bucket_url):
    return ""
Collaborator

I think we should return the normalized (absolute) path. AFAIK fingerprint didn't do that: by design it forces all local paths into a single fingerprint so as not to inflate telemetry destinations. here that is not a concern
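
One way the normalization could look, as a sketch only. `local_physical_location` is a hypothetical helper name, and it assumes the input is a plain local path (not a file:// URL, which would need to be unwrapped first).

```python
import os


def local_physical_location(path: str) -> str:
    """Return a normalized absolute form of a local path."""
    # expanduser handles "~"; abspath resolves "." and ".." segments and
    # anchors relative paths at the current working directory; normcase
    # folds case and separators on Windows and is a no-op on POSIX.
    return os.path.normcase(os.path.abspath(os.path.expanduser(path)))
```

Note that `os.path.abspath` does not resolve symlinks; `os.path.realpath` would, at the cost of touching the filesystem.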

can access multiple storage backends in a single query, so join
compatibility is determined by the engine, not by the storage location.
"""
if isinstance(other, FilesystemDestinationClientConfiguration):
Collaborator

NOTE: this will require an extension of #3747: ATTACH the foreign duckdb, with all its views and tables, to the current duckdb in the dataset, and use 3-part qualification when binding queries. should be "easily" doable. not this ticket though!


def can_join_with(self, other: DestinationClientConfiguration) -> bool:
    """Returns True for MotherDuck configs with the same token."""
    if not isinstance(other, MotherDuckClientConfiguration):
Collaborator

good for now. but I'm pretty sure we'll be able to join databases from different accounts via ATTACH

return host
return ""

def can_join_with(self, other: DestinationClientConfiguration) -> bool:
Collaborator

you could eliminate some code below (IMO) via 2 helper classes:

  • helper 1: we can join two destinations when their physical locations match
  • helper 2: as above, but the database name must also match

helper 2 can call helper 1. also, helper 1 looks like super().can_join_with(), where we just compare physical locations.
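
A rough sketch of that split, under stated assumptions. All class names here are hypothetical and none of this is the PR's code; the sketch also assumes an empty location means "unknown" and therefore never joins, and that only configs of the same type are comparable. In this version the stricter check delegates to the location-only one via super().

```python
from dataclasses import dataclass


class JoinByLocation:
    """Hypothetical location-only helper: joinable when locations match."""

    def physical_destination(self) -> str:
        return ""

    def can_join_with(self, other: object) -> bool:
        if not isinstance(other, type(self)):
            return False
        location = self.physical_destination()
        # Assumption: an empty location means "unknown", so it never joins.
        return bool(location) and location == other.physical_destination()


class JoinByLocationAndDatabase(JoinByLocation):
    """Hypothetical stricter helper: the database name must also match."""

    database: str  # provided by concrete subclasses

    def can_join_with(self, other: object) -> bool:
        # Reuse the location check, then add the database comparison.
        return super().can_join_with(other) and self.database == other.database


@dataclass
class MssqlLike(JoinByLocation):
    host: str
    port: int
    database: str

    def physical_destination(self) -> str:
        return f"{self.host}:{self.port}"


@dataclass
class PostgresLike(JoinByLocationAndDatabase):
    host: str
    port: int
    database: str

    def physical_destination(self) -> str:
        return f"{self.host}:{self.port}"
```

With this shape a destination only picks a base class and implements its location accessor; overrides of `can_join_with` remain for the structurally different cases such as MotherDuck or the filesystem.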


__recommended_sections__: ClassVar[Sequence[str]] = (known_sections.DESTINATION, "")

def physical_destination(self) -> str:
Collaborator

physical_location? the original called for location

"""Returns a non-secret destination identity, or "" when unavailable."""
return ""

def fingerprint(self) -> str:
Collaborator

I know it would be really good to share some code with fingerprint(), but that one is used by telemetry, and if we do that several reports will be affected. let's talk about it

pytestmark = pytest.mark.essential


SAME_DATABASE_JOIN_COMPATIBILITY_CONFIGS = destinations_configs(
Collaborator

this is good. can be reused for the external/foreign join test!

Collaborator

rudolfix commented May 26, 2026

Automated review flagged this (I didn't verify):

Duplicate parametrize cases (tests/destinations/test_join_compatibility.py:407-420): fabric_port and fabric_default_port are byte-identical: same factory, same expected value. One should set an explicit port (e.g., credentials.port = 1433) to actually distinguish "default port" from "explicit port".
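
A sketch of how the two cases could be told apart, plus a small guard that catches identical inputs. Everything here is illustrative: `FakeCredentials`, `physical_location`, `CASES` and `find_duplicate_inputs` are hypothetical, the host name is a placeholder, and 1433 is assumed to be the default port because it is the value named above. The case list is shaped so it could be passed to `pytest.mark.parametrize`, but the sketch itself does not need pytest.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DEFAULT_PORT = 1433  # assumed default, taken from the port named in the review


@dataclass(frozen=True)
class FakeCredentials:
    host: str
    port: Optional[int] = None


def physical_location(credentials: FakeCredentials) -> str:
    port = credentials.port if credentials.port is not None else DEFAULT_PORT
    return f"{credentials.host}:{port}"


# Same expected location, but the inputs now differ: one relies on the
# default port, the other sets it explicitly.
CASES: List[Tuple[str, FakeCredentials, str]] = [
    ("fabric_default_port", FakeCredentials("fabric.example.com"), "fabric.example.com:1433"),
    ("fabric_port", FakeCredentials("fabric.example.com", port=1433), "fabric.example.com:1433"),
]


def find_duplicate_inputs(cases: List[Tuple[str, FakeCredentials, str]]) -> List[Tuple[str, str]]:
    """Return pairs of case ids whose inputs are identical."""
    seen = {}
    duplicates = []
    for case_id, credentials, _expected in cases:
        if credentials in seen:
            duplicates.append((seen[credentials], case_id))
        else:
            seen[credentials] = case_id
    return duplicates
```

Running `find_duplicate_inputs` over the case list in a meta-test is optional, but it would have flagged the byte-identical pair automatically.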

Development

Successfully merging this pull request may close these issues.

feat: implement join compatibility check for destinations

2 participants