Skip to content

fix(clickhouse): extend MV TO-clause regex to cover POPULATE, UUID, and three-part names (#26265)#27832

Open
vaishnavidixit12 wants to merge 4 commits intoopen-metadata:mainfrom
vaishnavidixit12:vdixit/26265-clickhouse-mv-lineage-improvements
Open

fix(clickhouse): extend MV TO-clause regex to cover POPULATE, UUID, and three-part names (#26265)#27832
vaishnavidixit12 wants to merge 4 commits intoopen-metadata:mainfrom
vaishnavidixit12:vdixit/26265-clickhouse-mv-lineage-improvements

Conversation

@vaishnavidixit12
Copy link
Copy Markdown

Why

Fixes #26265

ClickHouse CREATE MATERIALIZED VIEW ... TO <target> DDL has several optional clauses that the existing regex in get_mv_to_target_table() did not handle, causing downstream lineage edges to be silently dropped.

What changed

Extended MATERIALIZED_VIEW_TO_PATTERN in clickhouse/utils.py to handle:

Gap Example DDL
POPULATE keyword before TO CREATE MATERIALIZED VIEW mv POPULATE TO target AS …
UUID clause (system.tables DDL) CREATE MATERIALIZED VIEW db.mv UUID '550e8…' TO db.target AS …
Three-part qualified names CREATE MATERIALIZED VIEW cat.schema.mv TO cat2.schema2.target AS …
FROM as valid post-target token CREATE MATERIALIZED VIEW mv TO target FROM source

Also fixed a typing.Tuple shadowing issue — removed Tuple from the typing import and used the built-in tuple[str, str] annotation (Python 3.10+).

Test plan

  • 4 new tests added to test_clickhouse_utils.py:
    • test_populate_keyword_before_to
    • test_on_cluster_and_populate_before_to
    • test_uuid_clause
    • test_three_part_qualified_names
  • All 13 existing TestGetMvToTargetTable tests continue to pass

🤖 Generated with Claude Code

abhay-codes07 and others added 4 commits April 20, 2026 16:45
… clause (open-metadata#26265)

Signed-off-by: Abhay Singh <abhaysingh0293@gmail.com>
Signed-off-by: Abhay Singh <abhaysingh0293@gmail.com>
…nd three-part names

Fixes gaps in MATERIALIZED_VIEW_TO_PATTERN missed by PR open-metadata#27553:
- POPULATE keyword before TO (valid ClickHouse DDL form)
- UUID clause emitted by system.tables DDL output
- Three-part qualified names (catalog.schema.table)
- FROM keyword after target (bare FROM without AS keyword)

Adds four regression tests for each new case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… annotation

Tuple from typing was shadowed by Tuple = create_sqlalchemy_type("Tuple")
on line 81. Using the built-in tuple[str, str] (Python 3.10+) removes the
ambiguity entirely and is safe under from __future__ import annotations.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vaishnavidixit12 vaishnavidixit12 requested a review from a team as a code owner April 29, 2026 17:17
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment on lines +59 to +60
def _strip_quotes(identifier: str) -> str:
return ".".join(part.strip('`"') for part in identifier.split("."))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Edge Case: _strip_quotes mishandles quoted identifiers containing dots

The _strip_quotes function splits on . then strips quotes from each part. If a backtick-quoted identifier contains a literal dot (e.g., `my.db`.mv), split('.') will split inside the quoted portion, producing ['my', 'db', 'mv'] instead of ['my.db', 'mv']. The resulting synthetic query INSERT INTO my.db.mv … would be parsed as a three-part name rather than two-part, potentially breaking lineage resolution.

This is an uncommon edge case (dots inside quoted identifiers are rare in ClickHouse), but worth noting for correctness.

Suggested fix:

def _strip_quotes(identifier: str) -> str:
    import re
    parts = re.findall(r'`[^`]+`|"[^"]+"|[^.]+', identifier)
    return '.'.join(p.strip('`"') for p in parts if p != '.')

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

return ".".join(part.strip('`"') for part in identifier.split("."))


def get_mv_to_target_table(query: str) -> Optional[tuple[str, str]]:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Quality: Type annotation allows None at runtime but not statically

get_mv_to_target_table(query: str) is annotated as accepting str, but the guard if not query on line 70 and the test on line 224 (get_mv_to_target_table(None)) show it intentionally accepts None. Static type checkers (mypy, pyright) would flag the test as an error.

Suggested fix:

def get_mv_to_target_table(query: Optional[str]) -> Optional[tuple[str, str]]:

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Apr 29, 2026

Code Review 👍 Approved with suggestions 0 resolved / 2 findings

Expands the ClickHouse regex to correctly parse complex TO-clause structures. Please update the _strip_quotes identifier handling for dotted strings and align the get_mv_to_target_table type annotations with its runtime None support.

💡 Edge Case: _strip_quotes mishandles quoted identifiers containing dots

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/utils.py:59-60

The _strip_quotes function splits on . then strips quotes from each part. If a backtick-quoted identifier contains a literal dot (e.g., `my.db`.mv), split('.') will split inside the quoted portion, producing ['my', 'db', 'mv'] instead of ['my.db', 'mv']. The resulting synthetic query INSERT INTO my.db.mv … would be parsed as a three-part name rather than two-part, potentially breaking lineage resolution.

This is an uncommon edge case (dots inside quoted identifiers are rare in ClickHouse), but worth noting for correctness.

Suggested fix
def _strip_quotes(identifier: str) -> str:
    import re
    parts = re.findall(r'`[^`]+`|"[^"]+"|[^.]+', identifier)
    return '.'.join(p.strip('`"') for p in parts if p != '.')
💡 Quality: Type annotation allows None at runtime but not statically

📄 ingestion/src/metadata/ingestion/source/database/clickhouse/utils.py:63 📄 ingestion/tests/unit/topology/database/test_clickhouse_utils.py:224

get_mv_to_target_table(query: str) is annotated as accepting str, but the guard if not query on line 70 and the test on line 224 (get_mv_to_target_table(None)) show it intentionally accepts None. Static type checkers (mypy, pyright) would flag the test as an error.

Suggested fix
def get_mv_to_target_table(query: Optional[str]) -> Optional[tuple[str, str]]:
🤖 Prompt for agents
Code Review: Expands the ClickHouse regex to correctly parse complex TO-clause structures. Please update the _strip_quotes identifier handling for dotted strings and align the `get_mv_to_target_table` type annotations with its runtime `None` support.

1. 💡 Edge Case: `_strip_quotes` mishandles quoted identifiers containing dots
   Files: ingestion/src/metadata/ingestion/source/database/clickhouse/utils.py:59-60

   The `_strip_quotes` function splits on `.` then strips quotes from each part. If a backtick-quoted identifier contains a literal dot (e.g., `` `my.db`.mv ``), `split('.')` will split *inside* the quoted portion, producing `['my', 'db', 'mv']` instead of `['my.db', 'mv']`. The resulting synthetic query `INSERT INTO my.db.mv …` would be parsed as a three-part name rather than two-part, potentially breaking lineage resolution.
   
   This is an uncommon edge case (dots inside quoted identifiers are rare in ClickHouse), but worth noting for correctness.

   Suggested fix:
   def _strip_quotes(identifier: str) -> str:
       import re
       parts = re.findall(r'`[^`]+`|"[^"]+"|[^.]+', identifier)
       return '.'.join(p.strip('`"') for p in parts if p != '.')

2. 💡 Quality: Type annotation allows `None` at runtime but not statically
   Files: ingestion/src/metadata/ingestion/source/database/clickhouse/utils.py:63, ingestion/tests/unit/topology/database/test_clickhouse_utils.py:224

   `get_mv_to_target_table(query: str)` is annotated as accepting `str`, but the guard `if not query` on line 70 and the test on line 224 (`get_mv_to_target_table(None)`) show it intentionally accepts `None`. Static type checkers (mypy, pyright) would flag the test as an error.

   Suggested fix:
   def get_mv_to_target_table(query: Optional[str]) -> Optional[tuple[str, str]]:

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Clickhouse Linage] Missing downsteam for MATERIALIZED VIEW

2 participants