WIP: Add workload scheduling backup#303
Conversation
Reviewer's GuideAdds backup and restore support for ClickHouse workload scheduling entities (WORKLOADs and RESOURCEs), including storage abstraction (local/ZooKeeper, optional encryption), metadata tracking, layout and control-plane operations, and integration tests. Sequence diagram for workload entities backup flowsequenceDiagram
participant User
participant CHBackup
participant BackupSources
participant BackupContext
participant WorkloadEntitiesBackup
participant ClickhouseControl
participant ClickhouseConfig
participant WorkloadEntitiesStorageConfig
participant BackupMetadata
participant BackupLayout
participant ClickHouseEncryption
participant ZookeeperCTL
participant Storage
User->>CHBackup: backup(sources)
CHBackup->>BackupSources: for_backup(..., workload_entities=true, ...)
CHBackup->>WorkloadEntitiesBackup: backup(context)
WorkloadEntitiesBackup->>ClickhouseControl: ch_version_ge("24.11")
ClickhouseControl-->>WorkloadEntitiesBackup: bool
alt version < 24.11
WorkloadEntitiesBackup-->>CHBackup: return
else version >= 24.11
WorkloadEntitiesBackup->>WorkloadEntitiesStorageConfig: from_ch_config(ch_ctl_conf, ch_config)
WorkloadEntitiesStorageConfig-->>WorkloadEntitiesBackup: we_config
WorkloadEntitiesBackup->>ClickhouseControl: get_workload_entities_query()
ClickhouseControl-->>WorkloadEntitiesBackup: workload_entities:list
alt no workload entities
WorkloadEntitiesBackup-->>CHBackup: return
else workload entities exist
loop for each entity_name
WorkloadEntitiesBackup->>BackupMetadata: add_workload_entity(entity_name)
end
WorkloadEntitiesBackup->>WorkloadEntitiesStorageConfig: is_local_storage()
alt local storage
WorkloadEntitiesBackup->>Storage: copy_directory_content(storage_path, backup_tmp_path)
else zookeeper storage
WorkloadEntitiesBackup->>ZookeeperCTL: _copy_directory_content_from_zookeeper(...)
ZookeeperCTL->>Storage: read nodes into backup_tmp_path
end
WorkloadEntitiesBackup->>WorkloadEntitiesStorageConfig: is_encrypted()
alt encrypted
WorkloadEntitiesBackup->>ClickHouseEncryption: decrypt_directory_content(backup_tmp_path, key_hex)
ClickHouseEncryption->>Storage: decrypt files
end
loop for each entity_name
WorkloadEntitiesBackup->>BackupLayout: upload_workload_entity_ddl_from_file(local_path, backup_name, entity_name)
BackupLayout->>Storage: upload_file(encrypted)
end
end
end
Sequence diagram for workload entities restore flowsequenceDiagram
participant User
participant CHBackup
participant BackupSources
participant BackupContext
participant WorkloadEntitiesBackup
participant ClickhouseControl
participant BackupMetadata
participant BackupLayout
User->>CHBackup: restore(sources)
CHBackup->>BackupSources: for_restore(..., workload_entities=true, ...)
CHBackup->>WorkloadEntitiesBackup: restore(context)
WorkloadEntitiesBackup->>ClickhouseControl: ch_version_ge("24.11")
ClickhouseControl-->>WorkloadEntitiesBackup: bool
alt version < 24.11
WorkloadEntitiesBackup-->>CHBackup: return
else version >= 24.11
WorkloadEntitiesBackup->>BackupMetadata: get_workload_entities()
BackupMetadata-->>WorkloadEntitiesBackup: we_list
alt we_list empty
WorkloadEntitiesBackup-->>CHBackup: return
else we_list not empty
WorkloadEntitiesBackup->>ClickhouseControl: get_workload_entities_query()
ClickhouseControl-->>WorkloadEntitiesBackup: we_on_clickhouse_list
loop for each entity_name in we_list
WorkloadEntitiesBackup->>BackupLayout: get_workload_entity_create_statement(backup_meta, entity_name)
BackupLayout-->>WorkloadEntitiesBackup: statement
alt entity exists on ClickHouse
WorkloadEntitiesBackup->>BackupLayout: get_local_we_create_statement(entity_name)
BackupLayout-->>WorkloadEntitiesBackup: we_on_clickhouse_statement or null
alt statements differ
WorkloadEntitiesBackup->>ClickhouseControl: drop_workload_entity(entity_name)
ClickhouseControl->>ClickhouseControl: try DROP WORKLOAD then DROP RESOURCE
WorkloadEntitiesBackup->>ClickhouseControl: restore_workload_entity(statement)
else statements equal
Note over WorkloadEntitiesBackup,ClickhouseControl: No action taken
end
else entity missing on ClickHouse
WorkloadEntitiesBackup->>ClickhouseControl: restore_workload_entity(statement)
end
end
end
end
Class diagram for workload entities backup and related structuresclassDiagram
class BackupMetadata {
-List~str~ _workload_entities
+add_workload_entity(entity_name: str) void
+get_workload_entities() List~str~
+dump(light: bool) dict
+load(data: dict) BackupMetadata
}
class WorkloadEntitiesBackup {
+backup(context: BackupContext) void
+restore(context: BackupContext) void
+get_workload_entities_list(context: BackupContext) List~str~
-_copy_directory_content_from_zookeeper(zk_ctl: ZookeeperCTL, from_path_dir: str, to_path_dir: str) void
}
class WorkloadEntitiesStorageConfig {
+storage_type: StorageType
+storage_path: str
+encryption_key_hex: str
+is_local_storage() bool
+is_storage_zookeeper() bool
+is_encrypted() bool
+from_ch_config(ch_backup_config: dict, ch_config: ClickhouseConfig) WorkloadEntitiesStorageConfig
}
class StorageType {
<<enumeration>>
LOCAL
LOCAL_ENCRYPTED
ZOOKEEPER
ZOOKEEPER_ENCRYPTED
}
class BackupSources {
+access: bool
+data: bool
+schema: bool
+udf: bool
+named_collections: bool
+workload_entities: bool
+for_backup(access: bool, data: bool, schema: bool, udf: bool, named_collections: bool, workload_entities: bool, schema_only: bool) BackupSources
+for_restore(access: bool, data: bool, schema: bool, udf: bool, named_collections: bool, workload_entities: bool, schema_only: bool) BackupSources
+schemas_included() bool
}
class ClickhouseControl {
+get_workload_entities_query() List~str~
+restore_workload_entity(entity_statement: str) void
+drop_workload_entity(entity_name: str) void
+ch_version_ge(version: str) bool
}
class BackupLayout {
+upload_workload_entity_ddl_from_file(local_path: str, backup_name: str, entity_name: str) void
+get_local_we_create_statement(entity_name: str) str
+get_workload_entity_create_statement(backup_meta: BackupMetadata, filename: str) str
}
class BackupContext {
+backup_meta: BackupMetadata
+backup_layout: BackupLayout
+ch_ctl: ClickhouseControl
+ch_ctl_conf: dict
+ch_config: ClickhouseConfig
+zk_ctl: ZookeeperCTL
}
class CHBackup {
-_we_backup_manager: WorkloadEntitiesBackup
+backup(sources: BackupSources, databases: dict) void
+_restore(sources: BackupSources) void
}
class ZookeeperCTL {
+zk_client
+zk_root_path: str
}
class ClickhouseConfig {
+config: dict
}
class BackupManager
class ClickHouseEncryption
WorkloadEntitiesBackup ..|> BackupManager
WorkloadEntitiesStorageConfig o-- StorageType
WorkloadEntitiesBackup --> BackupContext
WorkloadEntitiesBackup --> WorkloadEntitiesStorageConfig
WorkloadEntitiesBackup --> ClickHouseEncryption
WorkloadEntitiesBackup --> ZookeeperCTL
BackupContext --> BackupMetadata
BackupContext --> BackupLayout
BackupContext --> ClickhouseControl
CHBackup --> WorkloadEntitiesBackup
CHBackup --> BackupSources
BackupMetadata --> BackupLayout
ClickhouseControl --> BackupLayout
Flow diagram for workload entities storage configuration selectionflowchart TD
A["Start from_ch_config"] --> B["Read workload_entity_storage from ClickhouseConfig"]
B --> C{"workload_entity_storage present?"}
C -- "No" --> D["Read workload_path from ch_backup_config"]
D --> E{"workload_path set?"}
E -- "No" --> F["Assert error: workload_path missing"]
E -- "Yes" --> G["Create WorkloadEntitiesStorageConfig with<br/>storage_type=LOCAL<br/>storage_path=workload_path<br/>encryption_key_hex=''"]
G --> Z["Return config"]
C -- "Yes" --> H["Read type, path, key_hex from workload_entity_storage"]
H --> I{"type present?"}
I -- "No" --> J["storage_type=LOCAL"]
I -- "Yes" --> K["storage_type=StorageType(type)"]
J --> L["storage_path=path"]
K --> L
L --> M["encryption_key_hex=key_hex"]
M --> N["Create WorkloadEntitiesStorageConfig(storage_type, storage_path, encryption_key_hex)"]
N --> Z
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Hey - I've found 4 issues, and left some high level feedback:
- In
WorkloadEntitiesStorageConfig.from_ch_config, whenworkload_entity_storageis present butpathis missing,storage_pathcan becomeNonewithout any validation; consider asserting or raising a clear error ifpath(and when required,key_hex) are absent to avoid hard-to-debug runtime issues. - In
ClickhouseCTL.drop_workload_entity, you currently fall back to dropping a RESOURCE on anyExceptionfrom the WORKLOAD drop; it would be safer to restrict the fallback to expected errors (e.g. entity-not-found) rather than swallowing all failures, so that real errors (permissions, syntax, connectivity) are not masked.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In `WorkloadEntitiesStorageConfig.from_ch_config`, when `workload_entity_storage` is present but `path` is missing, `storage_path` can become `None` without any validation; consider asserting or raising a clear error if `path` (and when required, `key_hex`) are absent to avoid hard-to-debug runtime issues.
- In `ClickhouseCTL.drop_workload_entity`, you currently fall back to dropping a RESOURCE on any `Exception` from the WORKLOAD drop; it would be safer to restrict the fallback to expected errors (e.g. entity-not-found) rather than swallowing all failures, so that real errors (permissions, syntax, connectivity) are not masked.
## Individual Comments
### Comment 1
<location path="ch_backup/backup/layout.py" line_range="347-354" />
<code_context>
+ local_path, remote_path=remote_path, encryption=True
+ )
+ except Exception as e:
+ msg = f"Failed to create async upload of {remote_path}"
+ raise StorageError(msg) from e
+
</code_context>
<issue_to_address>
**suggestion:** Error message mentions 'async upload' although the call is synchronous, which can mislead debugging.
Since `upload_file` is used synchronously, this message is misleading for anyone reading logs or debugging. Please update it to describe the actual failure (for example, `Failed to upload workload entity DDL to {remote_path}`) without implying async behavior.
```suggestion
try:
logging.debug('Uploading workload entity create statement "{}"', entity_name)
self._storage_loader.upload_file(
local_path, remote_path=remote_path, encryption=True
)
except Exception as e:
msg = f"Failed to upload workload entity DDL to {remote_path}"
raise StorageError(msg) from e
```
</issue_to_address>
### Comment 2
<location path="ch_backup/clickhouse/control.py" line_range="1006-1008" />
<code_context>
+ remote_path = _workload_entities_data_path(
+ self.get_backup_path(backup_name), entity_name
+ )
+ try:
+ logging.debug('Uploading workload entity create statement "{}"', entity_name)
+ self._storage_loader.upload_file(
</code_context>
<issue_to_address>
**issue (bug_risk):** Catching a broad Exception when dropping workload entities may hide real errors.
Because any `DROP WORKLOAD` failure is treated as “must be a RESOURCE”, unrelated errors (permissions, syntax, connection issues, etc.) get silently misclassified and retried. If ClickHouse provides a specific code or message for “WORKLOAD not found”, please check for that explicitly before falling back to `DROP RESOURCE`, and re-raise all other errors.
</issue_to_address>
### Comment 3
<location path="ch_backup/logic/workload_entities.py" line_range="233-234" />
<code_context>
+ return WorkloadEntitiesStorageConfig(storage_path=storage_path)
+
+ storage_type_from_config = we_config.get("type")
+ storage_path_from_config = we_config.get("path")
+ encryption_key_hex_from_config = we_config.get("key_hex")
+
+ storage_type = WorkloadEntitiesStorageConfig.StorageType.LOCAL
</code_context>
<issue_to_address>
**issue:** Storage path from `workload_entity_storage` config is not validated and may be missing.
If `workload_entity_storage` exists but `path` is missing or empty, `storage_path_from_config` becomes `None` and is passed into `WorkloadEntitiesStorageConfig`. Subsequent operations that expect a valid directory (e.g., `copy_directory_content`, ZooKeeper reads) will then fail opaquely. Consider validating `storage_path_from_config` here and either raising a clear error or falling back to `workload_path` when it is not set.
</issue_to_address>
### Comment 4
<location path="ch_backup/logic/workload_entities.py" line_range="121-127" />
<code_context>
+ )
+
+ if entity_name in we_on_clickhouse_list:
+ we_on_clickhouse_statement = (
+ context.backup_layout.get_local_we_create_statement(entity_name)
+ )
+ if we_on_clickhouse_statement != statement:
+ context.ch_ctl.drop_workload_entity(entity_name)
</code_context>
<issue_to_address>
**suggestion:** Handling of a missing local workload entity statement relies on implicit `None != statement` comparison.
If `get_local_we_create_statement` returns `None` (e.g., missing file), this condition treats it as a mismatch and always drops/recreates the entity. If that behavior is desired, consider handling the `None` case explicitly (e.g., log a specific message, then drop/restore) so the intent and expectations around missing local files are clearer than relying on `None != statement`.
```suggestion
if entity_name in we_on_clickhouse_list:
we_on_clickhouse_statement = (
context.backup_layout.get_local_we_create_statement(entity_name)
)
if we_on_clickhouse_statement is None:
logging.warning(
"Local workload entity definition for %s is missing; "
"dropping and restoring %s from backup",
entity_name,
entity_name,
)
context.ch_ctl.drop_workload_entity(entity_name)
context.ch_ctl.restore_workload_entity(statement)
elif we_on_clickhouse_statement != statement:
logging.info(
"Local workload entity definition for %s differs from backup; "
"recreating entity from backup definition",
entity_name,
)
context.ch_ctl.drop_workload_entity(entity_name)
context.ch_ctl.restore_workload_entity(statement)
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| try: | ||
| self._ch_client.query(DROP_WORKLOAD_SQL.format(entity_name=escape(entity_name))) | ||
| except Exception: |
There was a problem hiding this comment.
issue (bug_risk): Catching a broad Exception when dropping workload entities may hide real errors.
Because any DROP WORKLOAD failure is treated as “must be a RESOURCE”, unrelated errors (permissions, syntax, connection issues, etc.) get silently misclassified and retried. If ClickHouse provides a specific code or message for “WORKLOAD not found”, please check for that explicitly before falling back to DROP RESOURCE, and re-raise all other errors.
| storage_path_from_config = we_config.get("path") | ||
| encryption_key_hex_from_config = we_config.get("key_hex") |
There was a problem hiding this comment.
issue: Storage path from workload_entity_storage config is not validated and may be missing.
If workload_entity_storage exists but path is missing or empty, storage_path_from_config becomes None and is passed into WorkloadEntitiesStorageConfig. Subsequent operations that expect a valid directory (e.g., copy_directory_content, ZooKeeper reads) will then fail opaquely. Consider validating storage_path_from_config here and either raising a clear error or falling back to workload_path when it is not set.
| if entity_name in we_on_clickhouse_list: | ||
| we_on_clickhouse_statement = ( | ||
| context.backup_layout.get_local_we_create_statement(entity_name) | ||
| ) | ||
| if we_on_clickhouse_statement != statement: | ||
| context.ch_ctl.drop_workload_entity(entity_name) | ||
| context.ch_ctl.restore_workload_entity(statement) |
There was a problem hiding this comment.
suggestion: Handling of a missing local workload entity statement relies on implicit None != statement comparison.
If get_local_we_create_statement returns None (e.g., missing file), this condition treats it as a mismatch and always drops/recreates the entity. If that behavior is desired, consider handling the None case explicitly (e.g., log a specific message, then drop/restore) so the intent and expectations around missing local files are clearer than relying on None != statement.
| if entity_name in we_on_clickhouse_list: | |
| we_on_clickhouse_statement = ( | |
| context.backup_layout.get_local_we_create_statement(entity_name) | |
| ) | |
| if we_on_clickhouse_statement != statement: | |
| context.ch_ctl.drop_workload_entity(entity_name) | |
| context.ch_ctl.restore_workload_entity(statement) | |
| if entity_name in we_on_clickhouse_list: | |
| we_on_clickhouse_statement = ( | |
| context.backup_layout.get_local_we_create_statement(entity_name) | |
| ) | |
| if we_on_clickhouse_statement is None: | |
| logging.warning( | |
| "Local workload entity definition for %s is missing; " | |
| "dropping and restoring %s from backup", | |
| entity_name, | |
| entity_name, | |
| ) | |
| context.ch_ctl.drop_workload_entity(entity_name) | |
| context.ch_ctl.restore_workload_entity(statement) | |
| elif we_on_clickhouse_statement != statement: | |
| logging.info( | |
| "Local workload entity definition for %s differs from backup; " | |
| "recreating entity from backup definition", | |
| entity_name, | |
| ) | |
| context.ch_ctl.drop_workload_entity(entity_name) | |
| context.ch_ctl.restore_workload_entity(statement) |
| remote_path = _named_collections_data_path(backup_meta.path, filename) | ||
| return self._storage_loader.download_data(remote_path, encryption=True) | ||
|
|
||
| def upload_workload_entity_ddl_from_file( |
There was a problem hiding this comment.
Let's move this function above next to the other upload_ functions
| msg = f"Failed to create async upload of {remote_path}" | ||
| raise StorageError(msg) from e | ||
|
|
||
| def get_local_we_create_statement(self, entity_name: str) -> Optional[str]: |
There was a problem hiding this comment.
we is misleading. Lets use workload_entity
| """ | ||
| try: | ||
| self._ch_client.query(DROP_WORKLOAD_SQL.format(entity_name=escape(entity_name))) | ||
| except Exception: |
There was a problem hiding this comment.
Exception is too broad here.
Consider keeping entities (workload and resources) separate, or keeping the type together with each entity.
| ) | ||
| if we_on_clickhouse_statement != statement: | ||
| context.ch_ctl.drop_workload_entity(entity_name) | ||
| context.ch_ctl.restore_workload_entity(statement) |
There was a problem hiding this comment.
А workload can reference some resource or another workload through the parent field, forming a hierarchy.
If you restore objects in an arbitrary order, a command like CREATE WORKLOAD child PARENT parent will fail if the parent has not been created yet.
Named collections do not have dependencies on each other, so this problem does not occur with them.
In short resources must be created before workloads, and workloads must be created in topological order (with parents before their children).
|
Pls, resolve the Sourcery comments (either agree and fix or explain and deny as appropriate). |
https://clickhouse.com/docs/operations/workload-scheduling
Summary by Sourcery
Add backup and restore support for ClickHouse workload scheduling entities (WORKLOADs and RESOURCEs) for servers running ClickHouse 24.11 and later.
New Features:
Tests: