Enhance dediplication of parts#324
Conversation
Reviewer's GuideThis PR improves deduplication handling by distinguishing between the logical (current) part name and the stored source part name for deduplicated ClickHouse parts, propagating this distinction through metadata, backup layout operations, ClickHouse deduplication queries, and disk copy routines so that data is read and written using the correct underlying storage identifiers while keeping logical names in logs and metadata. Sequence diagram for using storage_name for deduplicated partssequenceDiagram
participant Deduplication as deduplication_py
participant PartMetadata as PartMetadata
participant BackupLayout as layout_py
participant ClickHouseDisks as disks_py
participant StorageLoader as StorageLoader
Deduplication->>PartMetadata: deduplicate_parts(database, table, existing_parts)
Note over Deduplication,PartMetadata: For deduplicated parts:
Note over Deduplication,PartMetadata: name = current_name
Note over Deduplication,PartMetadata: link = backup_name
Note over Deduplication,PartMetadata: link_part_name = stored source part name
BackupLayout->>BackupLayout: download_data_part(backup_name, part)
BackupLayout->>BackupLayout: storage_name = part.link_part_name or part.name
BackupLayout->>StorageLoader: get_backup_path(backup_name)
BackupLayout->>StorageLoader: download_files(remote_dir_path using storage_name)
BackupLayout->>BackupLayout: check_data_part(backup_name, part)
BackupLayout->>BackupLayout: storage_name = part.link_part_name or part.name
BackupLayout->>StorageLoader: list_dir(remote_dir_path using storage_name)
ClickHouseDisks->>ClickHouseDisks: _run_copy_command(table, part, backup_meta)
ClickHouseDisks->>ClickHouseDisks: storage_name = part.link_part_name or part.name
ClickHouseDisks->>ClickHouseDisks: build target_path and source_path with storage_name
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The
storage_name = part.link_part_name if part.link_part_name else part.namepattern is repeated in multiple places (layout.download_data_part/check_data_part and disks._run_copy_command); consider adding aPartMetadata.storage_name(or similar) property to centralize this logic and avoid future divergence. - In
GET_DEDUPLICATED_PARTS_SQLthe join only onchecksumplusLIMIT 1 BY current_namecan arbitrarily choose a source row when multiple historical parts share the same checksum; consider adding an extra disambiguator (e.g., min/max by backup name or timestamp) to make the mapping deterministic.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The `storage_name = part.link_part_name if part.link_part_name else part.name` pattern is repeated in multiple places (layout.download_data_part/check_data_part and disks._run_copy_command); consider adding a `PartMetadata.storage_name` (or similar) property to centralize this logic and avoid future divergence.
- In `GET_DEDUPLICATED_PARTS_SQL` the join only on `checksum` plus `LIMIT 1 BY current_name` can arbitrarily choose a source row when multiple historical parts share the same checksum; consider adding an extra disambiguator (e.g., min/max by backup name or timestamp) to make the mapping deterministic.
## Individual Comments
### Comment 1
<location path="ch_backup/clickhouse/disks.py" line_range="263" />
<code_context>
target_path = os.path.join(table_path, "detached")
if self._ch_ctl.ch_version_ge("23.7"):
- target_path = os.path.join(target_path, part.name, "")
+ target_path = os.path.join(target_path, storage_name, "")
source_path = os.path.join(
"shadow",
</code_context>
<issue_to_address>
**issue (bug_risk):** Using `storage_name` for the target path may attach the part under the wrong name after mutation-based renames.
For CH ≥ 23.7, the directory under `detached/` must match the logical part name (`part.name`) that ClickHouse knows about. Here, `storage_name` may be `link_part_name` (backup name), which can differ from `part.name` for mutation-renamed parts. This would restore the directory under the old backup name instead of the current logical name and can break `ATTACH`. To avoid that, keep `target_path` based on `part.name` and only use `storage_name` for `source_path` in `shadow/`.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| target_path = os.path.join(table_path, "detached") | ||
| if self._ch_ctl.ch_version_ge("23.7"): | ||
| target_path = os.path.join(target_path, part.name, "") | ||
| target_path = os.path.join(target_path, storage_name, "") |
There was a problem hiding this comment.
issue (bug_risk): Using storage_name for the target path may attach the part under the wrong name after mutation-based renames.
For CH ≥ 23.7, the directory under detached/ must match the logical part name (part.name) that ClickHouse knows about. Here, storage_name may be link_part_name (backup name), which can differ from part.name for mutation-renamed parts. This would restore the directory under the old backup name instead of the current logical name and can break ATTACH. To avoid that, keep target_path based on part.name and only use storage_name for source_path in shadow/.
Summary by Sourcery
Improve deduplication handling for ClickHouse parts whose stored names may differ from their current names, including mutation suffix changes, and propagate the original storage name through metadata and restore flows.
New Features:
Enhancements: