Implement SQLAlchemy reflection for the cc_sqlalchemy dialect#766
Conversation
joe-clickhouse
left a comment
There was a problem hiding this comment.
Hi @coby-astsec thanks for this! Appreciate the work. A few things:
- Dialect-level
get_columnsand the shared module-level extraction are a good fix forMetaData.reflect()andget_multi_columnswhich would otherwise fail. - I like the concrete
python_typecompatibility fix as well. - Rebasing
Arrayonsqlalchemy.types.ARRAYis useful too, but I think we need to setself.as_tuple = Falsein the definition. I put a comment in the code.
Then as far as the primary key stuff goes, I'm going to request that we cut it and we revert back to
def get_primary_keys(self, connection, table_name, schema=None, **kw):
return []
def get_pk_constraint(self, connection, table_name, schema=None, **kw):
return {"constrained_columns": [], "name": None}The reason is because PRIMARY KEY in ClickHouse doesn't guarantee uniqueness even if specified. And if it's not specified it defaults from ORDER BY. The point is I think the identity assertion should come from application code, not from default dialect reflection.
So MetaData.reflect() & SQLAlchemy core will work fine without a PK once get_columns() is implemented. However, SQLAlchemy ORM does need a logical identity key and we're in luck because SQLAlchemy explicitly lets users provide one even if the database does not declare or enforce it. E.g.
events = Table(
"events",
metadata,
Column("tenant_id", UInt64, primary_key=True),
Column("event_id", UInt64, primary_key=True),
autoload_with=engine,
)or with ORM mapping:
class Event(Base):
__table__ = events
__mapper_args__ = {
"primary_key": [events.c.tenant_id, events.c.event_id]
}Again, the point is I don't think the dialect should allow defining a sparse primary index as a safe ORM identity because downstream consumers are going to assume it is. But the app developer can explicitly say that it is if that's the case.
For the record, a lot of the other clients for db's without a PK concept follow this pattern as well, e.g. pydruid, pinotdb.
Happy to discuss further or help out if needed!
Per review on ClickHouse#766, the dialect no longer reflects a primary key. ClickHouse PRIMARY KEY / ORDER BY is a sparse index, not a uniqueness guarantee, so get_primary_keys / get_pk_constraint return empty results and the identity key is left for application code to declare. Removes the is_in_primary_key query and the PK-application path in reflect_table. Also set Array.as_tuple = False. Array bypasses ARRAY.__init__ to allow nested arrays, but as_tuple has no class-level default and ARRAY.hashable reads it, so select(arr).unique() raised AttributeError before.
2864d0b to
e72554a
Compare
Per review on ClickHouse#766, the dialect no longer reflects a primary key. ClickHouse PRIMARY KEY / ORDER BY is a sparse index, not a uniqueness guarantee, so get_primary_keys / get_pk_constraint return empty results and the identity key is left for application code to declare. Removes the is_in_primary_key query and the PK-application path in reflect_table. Also set Array.as_tuple = False. Array bypasses ARRAY.__init__ to allow nested arrays, but as_tuple has no class-level default and ARRAY.hashable reads it, so select(arr).unique() raised AttributeError before.
e72554a to
a6cf515
Compare
|
Thanks for the review, Let me know if there's anything else! |
|
Thanks @coby-astsec! Looking good. Only things left I'd request:
def test_user_declared_primary_key(test_engine: Engine, test_db: str):
"""A user-declared primary key on a pre-declared column survives reflection."""
common.set_setting("invalid_setting_action", "drop")
with test_engine.begin() as conn:
conn.execute(text(f"DROP TABLE IF EXISTS {test_db}.reflect_pk_test"))
conn.execute(
text(
f"CREATE TABLE {test_db}.reflect_pk_test (org_id UInt32, id UInt64, payload String) "
"ENGINE MergeTree ORDER BY (org_id, id)"
)
)
table = db.Table(
"reflect_pk_test",
db.MetaData(schema=test_db),
db.Column("org_id", UInt32, primary_key=True),
db.Column("id", db.BigInteger, primary_key=True),
autoload_with=test_engine,
)
assert [c.name for c in table.primary_key.columns] == ["org_id", "id"]
assert {c.name for c in table.columns} == {"org_id", "id", "payload"}
Thanks! |
The cc_sqlalchemy dialect did not support SQLAlchemy reflection
(MetaData.reflect / Inspector multi-table reflection), which broke
sqlacodegen and any tool that calls `dialect.get_columns()` directly.
This change fills in the missing dialect methods so reflection works
end-to-end against a ClickHouse server.
Changes:
1. dialect.py: add `ClickHouseDialect.get_columns()`. Previously only
`ChInspector.get_columns()` existed, but SQLAlchemy's reflection
path goes through `Dialect.get_multi_columns` -> `Dialect.get_columns`
and never touches `Inspector.get_columns` on its own. Without a
dialect implementation, `MetaData.reflect()` raised
`NotImplementedError` from the SQLAlchemy base class.
`get_pk_constraint()` / `get_primary_keys()` now return the
actual primary key columns derived from
`system.columns.is_in_primary_key` (which mirrors MergeTree's
ORDER BY / PRIMARY KEY) instead of empty lists. This lets
sqlacodegen generate declarative classes instead of bare
`Table(...)` definitions for any MergeTree table.
2. inspector.py: promote `get_columns` and `get_pk_constraint` to
module-level functions so the dialect can call the same logic.
`ChInspector.reflect_table()` now applies the PK constraint to
reflected columns (it was building columns with no PK info, so
even direct `Table('asset', md, autoload_with=engine)` reflection
lost the primary key).
3. datatypes/sqltypes.py: replace `python_type = None` on UDT-based
types with concrete Python types. SQLAlchemy's contract for
`TypeEngine.python_type` is that it either returns a class or
raises `NotImplementedError`; returning `None` makes any consumer
that does `python_type.__module__` / `__name__` crash with
`AttributeError: 'NoneType' object has no attribute '__module__'`
(sqlacodegen, and anything else that walks python_type for
annotations or metadata).
- UUID -> uuid.UUID
- IPv4 / IPv6 -> ipaddress.IPv4Address / IPv6Address
- Nothing -> type(None)
- Point -> tuple
- Ring / Polygon -> list
- LineString etc. -> list
- JSON -> dict
- Nested -> list
- (Simple)AggregateFunction -> str
4. datatypes/sqltypes.py: `Array` now subclasses
`sqlalchemy.types.ARRAY` (alongside `ChSqlaType`) and exposes
`item_type` as a regular instance attribute plus `dimensions = 1`.
Two effects:
- `isinstance(col.type, sqlalchemy.types.ARRAY)` now matches CH
arrays, which lets sqlacodegen render `Mapped[list[T]]`
annotations for single-dim arrays without special-casing.
- `item_type` is mutable so sqlacodegen's `fix_column_types`
adaptation pass (which reassigns `new_coltype.item_type`) works.
`dimensions = 1` reflects CH's type system: every Array is
one-dimensional and nested arrays (`Array(Array(String))`) are
represented via the inner item type, not via a dimension count.
Tests:
- tests/integration_tests/test_sqlalchemy/test_reflect.py:
`test_metadata_reflect_and_primary_keys` exercises the
`Dialect.get_columns` reflection path via `MetaData.reflect()`
and asserts composite primary key reflection from a MergeTree
ORDER BY clause, both via `MetaData.reflect()` and via direct
`Table(autoload_with=...)`.
End-to-end effect: `MetaData.reflect()` and
`sqlacodegen <clickhousedb+connect://...>` now produce a complete,
importable Python module with declarative ORM classes, composite
primary keys, and typed `Mapped[...]` annotations against a real
ClickHouse schema. No changes to the runtime client paths.
Per review on ClickHouse#766, the dialect no longer reflects a primary key. ClickHouse PRIMARY KEY / ORDER BY is a sparse index, not a uniqueness guarantee, so get_primary_keys / get_pk_constraint return empty results and the identity key is left for application code to declare. Removes the is_in_primary_key query and the PK-application path in reflect_table. Also set Array.as_tuple = False. Array bypasses ARRAY.__init__ to allow nested arrays, but as_tuple has no class-level default and ARRAY.hashable reads it, so select(arr).unique() raised AttributeError before.
a6cf515 to
3d83412
Compare
|
@joe-clickhouse all done, sorry for the delay 😄 |
|
@copilot resolve the merge conflicts in this pull request |
Signed-off-by: Joe Spadola <joe.spadola@clickhouse.com>
joe-clickhouse
left a comment
There was a problem hiding this comment.
Looks good! Thanks for the contribution @coby-astsec
Summary
cc_sqlalchemydoesn't implement SQLAlchemy reflection —MetaData.reflect()andInspector.get_multi_columns()raiseNotImplementedErroron any CH database. This breaks tools likesqlacodegenor other tools which introspect schema.Mechanism: SQLAlchemy's reflection path is
Inspector -> Dialect.get_multi_columns -> Dialect.get_columns. The dialect only ever definedInspector.get_columns(onChInspector), whichMetaData.reflect()never calls on its own. Reflection falls through to theDefaultDialectbase, which raises.This PR fills in the missing dialect method. No runtime client paths change.
Changes
dialect.py— addClickHouseDialect.get_columns().inspector.py— promoteget_columnsto a module-level function shared between dialect and inspector.datatypes/sqltypes.py— concretepython_types. SQLAlchemy'sTypeEngine.python_typecontract is "return a class or raiseNotImplementedError." ReturningNone(the current behavior on every UDT-based type) makespython_type.__module__/.__name__raiseAttributeError, which is what breaks sqlacodegen:python_typeUUIDuuid.UUIDIPv4/IPv6ipaddress.IPv4Address/IPv6AddressNothingtype(None)PointtupleRing/Polygon/MultiPolygon/LineString/MultiLineStringlistJSONdictNestedlist(Simple)AggregateFunctionstrdatatypes/sqltypes.py—Arraynow subclassessqlalchemy.types.ARRAYalongsideChSqlaType, exposesitem_typeas a plain instance attribute (so sqlacodegen'sfix_column_typesadaptation pass can reassign it), and setsdimensions = 1.ARRAY.__init__is not called cooperatively because it rejects nested ARRAYitem_types, which CH supports natively (Array(Array(T))).Tests
tests/integration_tests/test_sqlalchemy/test_reflect.pycovers the dialect path (MetaData.reflect()) and a directTable(autoload_with=...)call against a MergeTree withORDER BY (org_id, id), plus a user-declared composite primary key surviving reflection.Local: SQLAlchemy tests pass.
Checklist