Implement SQLAlchemy reflection for the cc_sqlalchemy dialect #766

Merged
joe-clickhouse merged 4 commits into ClickHouse:main from coby-astsec:fix/sqlalchemy-reflection
Jun 5, 2026

@coby-astsec
Contributor

@coby-astsec coby-astsec commented May 27, 2026

Summary

cc_sqlalchemy doesn't implement SQLAlchemy reflection: MetaData.reflect() and Inspector.get_multi_columns() raise NotImplementedError on any ClickHouse database. This breaks sqlacodegen and any other tool that introspects the schema.

Mechanism: SQLAlchemy's reflection path is Inspector -> Dialect.get_multi_columns -> Dialect.get_columns. The dialect only ever defined Inspector.get_columns (on ChInspector), which MetaData.reflect() never calls on its own. Reflection falls through to the DefaultDialect base, which raises.

This PR fills in the missing dialect method. No runtime client paths change.
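
The fall-through described above can be illustrated with a toy model. This is only a sketch: `BaseDialect`, `BrokenDialect`, `FixedDialect`, `ToyInspector`, and `read_columns` are hypothetical stand-ins written for illustration, not SQLAlchemy's or cc_sqlalchemy's real classes, and the column data is hard-coded rather than read from a server.

```python
class BaseDialect:
    """Toy stand-in for a default dialect base class (not SQLAlchemy's real one)."""

    def get_columns(self, connection, table_name, schema=None, **kw):
        raise NotImplementedError()

    def get_multi_columns(self, connection, table_names, schema=None, **kw):
        # The multi-table hook fans out to the single-table dialect method.
        return {
            (schema, name): self.get_columns(connection, name, schema, **kw)
            for name in table_names
        }


def read_columns(connection, table_name, schema=None):
    """Hypothetical shared helper; a real one would query the server."""
    return [{"name": "id", "type": "UInt64", "nullable": False}]


class BrokenDialect(BaseDialect):
    """Defines nothing itself: column lookup lives only on the inspector."""


class FixedDialect(BaseDialect):
    def get_columns(self, connection, table_name, schema=None, **kw):
        # Same logic the inspector uses, now reachable from the dialect.
        return read_columns(connection, table_name, schema)


class ToyInspector:
    def __init__(self, dialect):
        self.dialect = dialect

    def get_columns(self, table_name, schema=None):
        # Works when called directly, but bulk reflection never comes here.
        return read_columns(None, table_name, schema)

    def reflect_all(self, table_names, schema=None):
        return self.dialect.get_multi_columns(None, table_names, schema)
```

In this model, `ToyInspector(BrokenDialect()).get_columns("events")` succeeds while `reflect_all(["events"])` raises `NotImplementedError`; swapping in `FixedDialect` makes both paths return the same column list, which is the shape of the change in this PR.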

Changes

  1. dialect.py — add ClickHouseDialect.get_columns().

  2. inspector.py — promote get_columns to a module-level function shared between dialect and inspector.

  3. datatypes/sqltypes.py — concrete python_types. SQLAlchemy's TypeEngine.python_type contract is "return a class or raise NotImplementedError." Returning None (the current behavior on every UDT-based type) makes python_type.__module__ / .__name__ raise AttributeError, which is what breaks sqlacodegen:

    Type                                                          python_type
    UUID                                                          uuid.UUID
    IPv4 / IPv6                                                   ipaddress.IPv4Address / IPv6Address
    Nothing                                                       type(None)
    Point                                                         tuple
    Ring / Polygon / MultiPolygon / LineString / MultiLineString  list
    JSON                                                          dict
    Nested                                                        list
    (Simple)AggregateFunction                                     str

  4. datatypes/sqltypes.py: Array now subclasses sqlalchemy.types.ARRAY alongside ChSqlaType, exposes item_type as a plain instance attribute (so sqlacodegen's fix_column_types adaptation pass can reassign it), and sets dimensions = 1. ARRAY.__init__ is not called cooperatively because it rejects nested ARRAY item_types, which CH supports natively (Array(Array(T))).
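
To make the python_type failure in item 3 concrete, here is a minimal sketch. `LegacyUUID`, `FixedUUID`, and `render_annotation` are hypothetical names invented for this illustration; they only mimic the shape of a type object and of a code generator that builds a dotted name from `python_type`.

```python
import uuid


class LegacyUUID:
    """Toy type whose python_type returns None, as the UDT-based types did."""

    @property
    def python_type(self):
        return None


class FixedUUID:
    """Toy type that honors the contract by returning a concrete class."""

    @property
    def python_type(self):
        return uuid.UUID


def render_annotation(sql_type):
    """Hypothetical code generator step: build a dotted name for a type hint."""
    py = sql_type.python_type
    return f"{py.__module__}.{py.__name__}"
```

`render_annotation(FixedUUID())` yields `"uuid.UUID"`, while `render_annotation(LegacyUUID())` raises `AttributeError` because `None` has no `__module__`. A consumer written against the "class or NotImplementedError" contract has no reason to guard against `None`.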

Tests

tests/integration_tests/test_sqlalchemy/test_reflect.py covers the dialect path (MetaData.reflect()) and a direct Table(autoload_with=...) call against a MergeTree with ORDER BY (org_id, id), plus a user-declared composite primary key surviving reflection.

Local: SQLAlchemy tests pass.

Checklist

  • Unit and integration tests covering the common scenarios were added
  • CHANGELOG entry included
  • Docs - n/a

@CLAassistant

CLAassistant commented May 27, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@joe-clickhouse joe-clickhouse left a comment

Hi @coby-astsec thanks for this! Appreciate the work. A few things:

  1. Dialect-level get_columns and the shared module-level extraction are a good fix for MetaData.reflect() and get_multi_columns which would otherwise fail.
  2. I like the concrete python_type compatibility fix as well.
  3. Rebasing Array on sqlalchemy.types.ARRAY is useful too, but I think we need to set self.as_tuple = False in the definition. I put a comment in the code.

Then as far as the primary key stuff goes, I'm going to request that we cut it and we revert back to

    def get_primary_keys(self, connection, table_name, schema=None, **kw):
        return []

    def get_pk_constraint(self, connection, table_name, schema=None, **kw):
        return {"constrained_columns": [], "name": None}

The reason is that PRIMARY KEY in ClickHouse doesn't guarantee uniqueness even if specified, and if it's not specified it defaults to the ORDER BY key. The point is that I think the identity assertion should come from application code, not from default dialect reflection.

So MetaData.reflect() & SQLAlchemy core will work fine without a PK once get_columns() is implemented. However, SQLAlchemy ORM does need a logical identity key and we're in luck because SQLAlchemy explicitly lets users provide one even if the database does not declare or enforce it. E.g.

events = Table(
    "events",
    metadata,
    Column("tenant_id", UInt64, primary_key=True),
    Column("event_id", UInt64, primary_key=True),
    autoload_with=engine,
)

or with ORM mapping:

class Event(Base):
    __table__ = events
    __mapper_args__ = {
        "primary_key": [events.c.tenant_id, events.c.event_id]
    }

Again, the point is I don't think the dialect should allow defining a sparse primary index as a safe ORM identity because downstream consumers are going to assume it is. But the app developer can explicitly say that it is if that's the case.
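
A rough sketch of the risk, assuming a consumer that keys loaded rows by their declared identity: the `load_by_identity` helper below is a hypothetical toy identity map, not SQLAlchemy's Session, and the rows are made up. It only shows what happens when a key that the database does not enforce as unique is treated as an identity.

```python
# Three stored rows; the first two share the same sorting-key values,
# which a sparse primary index does not prevent.
rows = [
    {"tenant_id": 1, "event_id": 7, "payload": "first"},
    {"tenant_id": 1, "event_id": 7, "payload": "second"},
    {"tenant_id": 1, "event_id": 8, "payload": "third"},
]


def load_by_identity(rows, key_columns):
    """Toy identity map: one object per key, the first row seen wins."""
    identity_map = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        identity_map.setdefault(key, row)
    return identity_map


loaded = load_by_identity(rows, ("tenant_id", "event_id"))
```

Here `len(rows)` is 3 but `len(loaded)` is 2: one row silently disappears. That is acceptable when the application author has declared the key and knows it is unique in practice, and surprising when the dialect asserted it on their behalf.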

For the record, a lot of the other clients for databases without a PK concept follow this pattern as well, e.g. pydruid, pinotdb.

Happy to discuss further or help out if needed!

Comment thread clickhouse_connect/cc_sqlalchemy/datatypes/sqltypes.py
Comment thread clickhouse_connect/cc_sqlalchemy/inspector.py Outdated
Comment thread clickhouse_connect/cc_sqlalchemy/inspector.py Outdated
Comment thread clickhouse_connect/cc_sqlalchemy/inspector.py Outdated
Comment thread clickhouse_connect/cc_sqlalchemy/dialect.py Outdated
Comment thread clickhouse_connect/cc_sqlalchemy/dialect.py Outdated
coby-astsec added a commit to coby-astsec/clickhouse-connect that referenced this pull request May 29, 2026
Per review on ClickHouse#766, the dialect no longer reflects a primary key.
ClickHouse PRIMARY KEY / ORDER BY is a sparse index, not a uniqueness
guarantee, so get_primary_keys / get_pk_constraint return empty results
and the identity key is left for application code to declare. Removes the
is_in_primary_key query and the PK-application path in reflect_table.

Also set Array.as_tuple = False. Array bypasses ARRAY.__init__ to allow
nested arrays, but as_tuple has no class-level default and ARRAY.hashable
reads it, so select(arr).unique() raised AttributeError before.
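
The as_tuple problem is an instance of a general pattern: skipping a parent initializer leaves unset any attribute the parent only assigns there. A minimal sketch follows, with `BaseArray`, `UnsafeArray`, and `SafeArray` as hypothetical toy classes rather than the real `sqlalchemy.types.ARRAY` or the dialect's `Array`.

```python
class BaseArray:
    """Toy parent: as_tuple is only ever set inside __init__."""

    def __init__(self, item_type, as_tuple=False):
        if isinstance(item_type, BaseArray):
            raise ValueError("nested arrays are rejected by the parent")
        self.item_type = item_type
        self.as_tuple = as_tuple

    @property
    def hashable(self):
        # Reads an attribute that exists only if __init__ ran.
        return self.as_tuple


class UnsafeArray(BaseArray):
    def __init__(self, item_type):
        # Skips BaseArray.__init__ so that nested item types are allowed.
        self.item_type = item_type


class SafeArray(BaseArray):
    def __init__(self, item_type):
        self.item_type = item_type
        # Set what the skipped initializer would have set.
        self.as_tuple = False
```

`UnsafeArray("String").hashable` raises `AttributeError`, while `SafeArray(SafeArray("String")).hashable` returns `False` and still permits nesting.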
@coby-astsec coby-astsec force-pushed the fix/sqlalchemy-reflection branch from 2864d0b to e72554a Compare May 29, 2026 19:43
@coby-astsec coby-astsec force-pushed the fix/sqlalchemy-reflection branch from e72554a to a6cf515 Compare May 29, 2026 19:46
@coby-astsec
Contributor Author

coby-astsec commented May 29, 2026

Thanks for the review.
Regarding the primary keys, I agree and have removed those changes. As for as_tuple, I implemented what you asked, although what you were trying to do would not have worked either way; I added more detail in a reply to your comment.

Let me know if there's anything else!

@joe-clickhouse
Contributor

Thanks @coby-astsec! Looking good. Only things left I'd request:

  1. The PR summary is now stale so let's please get that updated to reflect the actual changes
  2. Optional, but I think it's worth adding a test to make sure a user-defined primary key survives reflection:
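# Imports this snippet relies on. The module paths are assumed to match the
# repository's existing SQLAlchemy integration tests.
import sqlalchemy as db
from sqlalchemy import text
from sqlalchemy.engine import Engine

from clickhouse_connect import common
from clickhouse_connect.cc_sqlalchemy.datatypes.sqltypes import UInt32


def test_user_declared_primary_key(test_engine: Engine, test_db: str):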
    """A user-declared primary key on a pre-declared column survives reflection."""
    common.set_setting("invalid_setting_action", "drop")
    with test_engine.begin() as conn:
        conn.execute(text(f"DROP TABLE IF EXISTS {test_db}.reflect_pk_test"))
        conn.execute(
            text(
                f"CREATE TABLE {test_db}.reflect_pk_test (org_id UInt32, id UInt64, payload String) "
                "ENGINE MergeTree ORDER BY (org_id, id)"
            )
        )

    table = db.Table(
        "reflect_pk_test",
        db.MetaData(schema=test_db),
        db.Column("org_id", UInt32, primary_key=True),
        db.Column("id", db.BigInteger, primary_key=True),
        autoload_with=test_engine,
    )
    assert [c.name for c in table.primary_key.columns] == ["org_id", "id"]
    assert {c.name for c in table.columns} == {"org_id", "id", "payload"}
  3. Rebase

Thanks!

The cc_sqlalchemy dialect did not support SQLAlchemy reflection
(MetaData.reflect / Inspector multi-table reflection), which broke
sqlacodegen and any tool that calls `dialect.get_columns()` directly.
This change fills in the missing dialect methods so reflection works
end-to-end against a ClickHouse server.

Changes:

1. dialect.py: add `ClickHouseDialect.get_columns()`. Previously only
   `ChInspector.get_columns()` existed, but SQLAlchemy's reflection
   path goes through `Dialect.get_multi_columns` -> `Dialect.get_columns`
   and never touches `Inspector.get_columns` on its own. Without a
   dialect implementation, `MetaData.reflect()` raised
   `NotImplementedError` from the SQLAlchemy base class.

   `get_pk_constraint()` / `get_primary_keys()` now return the
   actual primary key columns derived from
   `system.columns.is_in_primary_key` (which mirrors MergeTree's
   ORDER BY / PRIMARY KEY) instead of empty lists. This lets
   sqlacodegen generate declarative classes instead of bare
   `Table(...)` definitions for any MergeTree table.

2. inspector.py: promote `get_columns` and `get_pk_constraint` to
   module-level functions so the dialect can call the same logic.
   `ChInspector.reflect_table()` now applies the PK constraint to
   reflected columns (it was building columns with no PK info, so
   even direct `Table('asset', md, autoload_with=engine)` reflection
   lost the primary key).

3. datatypes/sqltypes.py: replace `python_type = None` on UDT-based
   types with concrete Python types. SQLAlchemy's contract for
   `TypeEngine.python_type` is that it either returns a class or
   raises `NotImplementedError`; returning `None` makes any consumer
   that does `python_type.__module__` / `__name__` crash with
   `AttributeError: 'NoneType' object has no attribute '__module__'`
   (sqlacodegen, and anything else that walks python_type for
   annotations or metadata).

   - UUID            -> uuid.UUID
   - IPv4 / IPv6     -> ipaddress.IPv4Address / IPv6Address
   - Nothing         -> type(None)
   - Point           -> tuple
   - Ring / Polygon  -> list
   - LineString etc. -> list
   - JSON            -> dict
   - Nested          -> list
   - (Simple)AggregateFunction -> str

4. datatypes/sqltypes.py: `Array` now subclasses
   `sqlalchemy.types.ARRAY` (alongside `ChSqlaType`) and exposes
   `item_type` as a regular instance attribute plus `dimensions = 1`.
   Two effects:

   - `isinstance(col.type, sqlalchemy.types.ARRAY)` now matches CH
     arrays, which lets sqlacodegen render `Mapped[list[T]]`
     annotations for single-dim arrays without special-casing.
   - `item_type` is mutable so sqlacodegen's `fix_column_types`
     adaptation pass (which reassigns `new_coltype.item_type`) works.

   `dimensions = 1` reflects CH's type system: every Array is
   one-dimensional and nested arrays (`Array(Array(String))`) are
   represented via the inner item type, not via a dimension count.

Tests:

- tests/integration_tests/test_sqlalchemy/test_reflect.py:
  `test_metadata_reflect_and_primary_keys` exercises the
  `Dialect.get_columns` reflection path via `MetaData.reflect()`
  and asserts composite primary key reflection from a MergeTree
  ORDER BY clause, both via `MetaData.reflect()` and via direct
  `Table(autoload_with=...)`.

End-to-end effect: `MetaData.reflect()` and
`sqlacodegen <clickhousedb+connect://...>` now produce a complete,
importable Python module with declarative ORM classes, composite
primary keys, and typed `Mapped[...]` annotations against a real
ClickHouse schema. No changes to the runtime client paths.
Per review on ClickHouse#766, the dialect no longer reflects a primary key.
ClickHouse PRIMARY KEY / ORDER BY is a sparse index, not a uniqueness
guarantee, so get_primary_keys / get_pk_constraint return empty results
and the identity key is left for application code to declare. Removes the
is_in_primary_key query and the PK-application path in reflect_table.

Also set Array.as_tuple = False. Array bypasses ARRAY.__init__ to allow
nested arrays, but as_tuple has no class-level default and ARRAY.hashable
reads it, so select(arr).unique() raised AttributeError before.
@coby-astsec coby-astsec force-pushed the fix/sqlalchemy-reflection branch from a6cf515 to 3d83412 Compare June 4, 2026 09:55
@coby-astsec
Contributor Author

@joe-clickhouse all done, sorry for the delay 😄

@joe-clickhouse
Contributor

@copilot resolve the merge conflicts in this pull request

Signed-off-by: Joe Spadola <joe.spadola@clickhouse.com>
Contributor

@joe-clickhouse joe-clickhouse left a comment

Looks good! Thanks for the contribution @coby-astsec

@joe-clickhouse joe-clickhouse merged commit e8c5284 into ClickHouse:main Jun 5, 2026
28 checks passed