
Enhance deduplication of parts #324

Open
EilRoviSoft wants to merge 3 commits into yandex:main from EilRoviSoft:main

Conversation

@EilRoviSoft

@EilRoviSoft EilRoviSoft commented May 20, 2026

Summary by Sourcery

Improve deduplication handling for ClickHouse parts whose stored names may differ from their current names, including mutation suffix changes, and propagate the original storage name through metadata and restore flows.

New Features:

  • Track the original stored part name separately from the current part name in part metadata to support deduplication when names differ by mutation suffix.

Enhancements:

  • Relax deduplication matching logic to consider parts equal when their names only differ by mutation-related suffix segments.
  • Adjust data download, integrity checks, and disk copy routines to use the stored part name when accessing backup storage while keeping the current name for local operations.
  • Extend deduplication info queries to join on checksum and select current part name, deduplicating rows per current part.
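
The relaxed matching rule above can be illustrated with a minimal sketch. It is not the PR's actual helper (the body of `_is_mutation_renamed` is not shown in this conversation); the function names here are hypothetical. It assumes the usual ClickHouse part name layout `<partition>_<min_block>_<max_block>_<level>[_<mutation>]` and a partition ID without underscores.

```python
from typing import Optional, Tuple


def split_part_name(name: str) -> Tuple[Tuple[str, ...], Optional[str]]:
    """Split a part name into its base segments and an optional mutation suffix.

    Assumption: the name looks like <partition>_<min>_<max>_<level>[_<mutation>]
    and the partition ID itself contains no underscores.
    """
    segments = name.split("_")
    if len(segments) == 4:
        return tuple(segments), None
    if len(segments) == 5:
        return tuple(segments[:4]), segments[4]
    raise ValueError(f"unexpected part name: {name!r}")


def differs_only_by_mutation(current_name: str, stored_name: str) -> bool:
    """Return True when the names share a base and differ only in the mutation suffix."""
    current_base, current_mutation = split_part_name(current_name)
    stored_base, stored_mutation = split_part_name(stored_name)
    return current_base == stored_base and current_mutation != stored_mutation


# A mutated part keeps its base segments and gains (or changes) the last one.
print(differs_only_by_mutation("all_1_1_0_5", "all_1_1_0"))
# A merged part has different block numbers and level, so it does not match.
print(differs_only_by_mutation("all_1_2_1", "all_1_1_0"))
```

Identical names return False in this sketch because they need no special handling; only the mutation-renamed case is treated as a relaxed match.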

@sourcery-ai
Contributor

sourcery-ai Bot commented May 20, 2026

Reviewer's Guide

This PR improves deduplication handling by distinguishing between the logical (current) part name and the stored source part name for deduplicated ClickHouse parts. The distinction is propagated through metadata, backup layout operations, ClickHouse deduplication queries, and disk copy routines, so that data is read and written using the correct underlying storage identifiers while logs and metadata keep the logical names.

Sequence diagram for using storage_name for deduplicated parts

sequenceDiagram
    participant Deduplication as deduplication_py
    participant PartMetadata as PartMetadata
    participant BackupLayout as layout_py
    participant ClickHouseDisks as disks_py
    participant StorageLoader as StorageLoader

    Deduplication->>PartMetadata: deduplicate_parts(database, table, existing_parts)
    Note over Deduplication,PartMetadata: For deduplicated parts:
    Note over Deduplication,PartMetadata: name = current_name
    Note over Deduplication,PartMetadata: link = backup_name
    Note over Deduplication,PartMetadata: link_part_name = stored source part name

    BackupLayout->>BackupLayout: download_data_part(backup_name, part)
    BackupLayout->>BackupLayout: storage_name = part.link_part_name or part.name
    BackupLayout->>StorageLoader: get_backup_path(backup_name)
    BackupLayout->>StorageLoader: download_files(remote_dir_path using storage_name)

    BackupLayout->>BackupLayout: check_data_part(backup_name, part)
    BackupLayout->>BackupLayout: storage_name = part.link_part_name or part.name
    BackupLayout->>StorageLoader: list_dir(remote_dir_path using storage_name)

    ClickHouseDisks->>ClickHouseDisks: _run_copy_command(table, part, backup_meta)
    ClickHouseDisks->>ClickHouseDisks: storage_name = part.link_part_name or part.name
    ClickHouseDisks->>ClickHouseDisks: build target_path and source_path with storage_name

File-Level Changes

Change Details Files
Track and expose the stored source part name separately from the logical part name in part metadata.
  • Extend RawMetadata and PartMetadata to include a link_part_name field representing the source part name used in storage for deduplicated parts.
  • Wire link_part_name through PartMetadata construction so it is available anywhere part metadata is consumed.
  • Document link_part_name semantics as the backup source part name for deduplicated parts, or None otherwise.
ch_backup/backup/metadata/part_metadata.py
Tighten deduplication logic to distinguish mutation renames from other name changes and map logical names to stored names.
  • Introduce _is_mutation_renamed helper to compare part name segments and detect when a part differs only by mutation suffix.
  • Skip deduplication when current_name and backup_name differ by more than a mutation suffix, logging a debug message instead of reusing the part.
  • Construct PartMetadata for deduplicated parts using current_name as the logical part name, backup_name as the source backup link, and the original system part name as link_part_name, and enhance logging to show both logical and storage part identifiers.
ch_backup/backup/deduplication.py
Use the stored source part name when accessing data files in backup layout operations while preserving logical names in logs.
  • Derive a storage_name per part (link_part_name if present, otherwise name) and use it for directory and tarball paths in download_data_part.
  • Apply the same storage_name logic in check_data_part when resolving remote paths, listing files, and validating sizes, while keeping log messages informative about the actual stored name.
ch_backup/backup/layout.py
Adjust ClickHouse deduplication metadata query to join on checksum and avoid duplicate mappings while exposing the current logical part name.
  • Modify GET_DEDUPLICATED_PARTS_SQL to join _deduplication_info with _deduplication_info_current on checksum instead of name+checksum, selecting current_name explicitly.
  • Add LIMIT 1 BY current_name to prevent duplicate rows per logical part when multiple deduplicated parts share the same checksum.
ch_backup/clickhouse/control.py
Use the stored source part name in disk copy routines so filesystem operations target the correct on-disk directory.
  • Compute storage_name from link_part_name or name in _run_copy_command and use it in the routine_tag for logging/identification.
  • Update target and source paths in the copy command to use storage_name, ensuring ClickHouse disk operations refer to the actual stored directory name for deduplicated parts.
ch_backup/clickhouse/disks.py


@kirillgarbar kirillgarbar self-requested a review May 20, 2026 10:23
@aalexfvk aalexfvk marked this pull request as ready for review May 25, 2026 10:00
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 1 issue, and left some high level feedback:

  • The storage_name = part.link_part_name if part.link_part_name else part.name pattern is repeated in multiple places (layout.download_data_part/check_data_part and disks._run_copy_command); consider adding a PartMetadata.storage_name (or similar) property to centralize this logic and avoid future divergence.
  • In GET_DEDUPLICATED_PARTS_SQL the join only on checksum plus LIMIT 1 BY current_name can arbitrarily choose a source row when multiple historical parts share the same checksum; consider adding an extra disambiguator (e.g., min/max by backup name or timestamp) to make the mapping deterministic.
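
The first suggestion could look roughly like the sketch below. It uses a simplified stand-in class, not the real `PartMetadata` from `ch_backup/backup/metadata/part_metadata.py`; the class name `PartMetadataSketch` and its reduced field set are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PartMetadataSketch:
    """Simplified stand-in for PartMetadata (hypothetical, illustration only)."""

    name: str  # current (logical) part name
    link: Optional[str] = None  # source backup name for deduplicated parts
    link_part_name: Optional[str] = None  # part name as stored in that backup

    @property
    def storage_name(self) -> str:
        """Name to use when addressing backup storage.

        Falls back to the logical name when the part is not deduplicated.
        """
        return self.link_part_name or self.name


deduplicated = PartMetadataSketch(
    name="all_1_1_0_5", link="backup_1", link_part_name="all_1_1_0"
)
regular = PartMetadataSketch(name="all_2_2_0")
print(deduplicated.storage_name, regular.storage_name)
```

With such a property, the call sites in `layout.py` and `disks.py` would read `part.storage_name` instead of repeating the conditional expression.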
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The `storage_name = part.link_part_name if part.link_part_name else part.name` pattern is repeated in multiple places (layout.download_data_part/check_data_part and disks._run_copy_command); consider adding a `PartMetadata.storage_name` (or similar) property to centralize this logic and avoid future divergence.
- In `GET_DEDUPLICATED_PARTS_SQL` the join only on `checksum` plus `LIMIT 1 BY current_name` can arbitrarily choose a source row when multiple historical parts share the same checksum; consider adding an extra disambiguator (e.g., min/max by backup name or timestamp) to make the mapping deterministic.

## Individual Comments

### Comment 1
<location path="ch_backup/clickhouse/disks.py" line_range="263" />
<code_context>
                 target_path = os.path.join(table_path, "detached")
                 if self._ch_ctl.ch_version_ge("23.7"):
-                    target_path = os.path.join(target_path, part.name, "")
+                    target_path = os.path.join(target_path, storage_name, "")
                 source_path = os.path.join(
                     "shadow",
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `storage_name` for the target path may attach the part under the wrong name after mutation-based renames.

For CH ≥ 23.7, the directory under `detached/` must match the logical part name (`part.name`) that ClickHouse knows about. Here, `storage_name` may be `link_part_name` (backup name), which can differ from `part.name` for mutation-renamed parts. This would restore the directory under the old backup name instead of the current logical name and can break `ATTACH`. To avoid that, keep `target_path` based on `part.name` and only use `storage_name` for `source_path` in `shadow/`.
</issue_to_address>


                 target_path = os.path.join(table_path, "detached")
                 if self._ch_ctl.ch_version_ge("23.7"):
-                    target_path = os.path.join(target_path, part.name, "")
+                    target_path = os.path.join(target_path, storage_name, "")

issue (bug_risk): Using storage_name for the target path may attach the part under the wrong name after mutation-based renames.

For CH ≥ 23.7, the directory under detached/ must match the logical part name (part.name) that ClickHouse knows about. Here, storage_name may be link_part_name (backup name), which can differ from part.name for mutation-renamed parts. This would restore the directory under the old backup name instead of the current logical name and can break ATTACH. To avoid that, keep target_path based on part.name and only use storage_name for source_path in shadow/.
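
If the concern above holds, the split between the two names might be sketched as follows. This is not the code from `disks.py`: the function `build_copy_paths`, its parameters, and the `source_root` argument (standing in for the `shadow/...` prefix, which is truncated in the diff) are all hypothetical.

```python
import os
from typing import Optional, Tuple


def build_copy_paths(
    table_path: str,
    part_name: str,
    link_part_name: Optional[str],
    source_root: str,
    per_part_detached_dir: bool,
) -> Tuple[str, str]:
    """Build (target_path, source_path) for copying one part.

    The target uses the logical part name so the detached directory matches
    what will be attached; the source uses the stored name so the copy reads
    the directory that actually exists in the backup.
    """
    storage_name = link_part_name or part_name

    target_path = os.path.join(table_path, "detached")
    if per_part_detached_dir:  # corresponds to the ch_version_ge("23.7") branch
        # The trailing "" keeps a trailing separator on the directory path.
        target_path = os.path.join(target_path, part_name, "")

    # source_root is supplied by the caller; its real layout is not shown here.
    source_path = os.path.join(source_root, storage_name)
    return target_path, source_path


target, source = build_copy_paths(
    table_path="/data/tbl",
    part_name="all_1_1_0_5",
    link_part_name="all_1_1_0",
    source_root="shadow",
    per_part_detached_dir=True,
)
print(target, source)
```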
