Add StorageVersionMigrator controller (opt-in, default off)#5362
Add StorageVersionMigrator controller (opt-in, default off)#5362ChrisJBurns wants to merge 5 commits into
Conversation
Adds a StorageVersionMigrator controller to the operator behind a default-off feature flag (ENABLE_STORAGE_VERSION_MIGRATOR). The controller reconciles status.storedVersions on opted-in toolhive.stacklok.dev CRDs by issuing a Get+Update against each CR, which re-encodes etcd objects to the current storage version. This is required before a future release can drop a deprecated apiVersion (e.g. v1alpha1) from spec.versions without orphaning rows. The flag defaults to false, so this PR has no functional impact for any user until they explicitly opt in via operator.env. Helm chart exposure (values.yaml flag entry and deployment env-var wiring) is deferred to a follow-up PR to reduce accidental enablement during the deprecation window. RBAC for the controller's verbs ships in this PR so the flag can be flipped on at any time without permission-denied errors. Opt-in label markers on the 12 v1beta1 root types and the marker-coverage CI test are deferred to a follow-up PR; the envtest suite creates its own CRDs at runtime and is independent of those changes. Part of #4969. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5362 +/- ##
==========================================
- Coverage 68.72% 68.48% -0.25%
==========================================
Files 625 626 +1
Lines 63422 63624 +202
==========================================
- Hits 43587 43572 -15
- Misses 16585 16809 +224
+ Partials 3250 3243 -7 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
ChrisJBurns
left a comment
There was a problem hiding this comment.
Multi-Agent Consensus Review
Agents consulted: kubernetes-go-expert, code-reviewer, go-architect, toolhive-expert (go-security-reviewer's sandbox couldn't read the diff — its threat-model checklist was independently covered by the others)
Consensus Summary
| # | Finding | Consensus | Severity | Action |
|---|---|---|---|---|
| F1 | Doc comment "Enabled by default" contradicts default-OFF | 10/10 | HIGH | Fix |
| F2 | Silent env-var parse fallback (fail-loudly rule) | 9/10 | MEDIUM | Fix |
| F3 | Wildcard RBAC widens group-wide permissions | 7/10 | MEDIUM | Discuss |
| F6 | Status().Patch exception to MutateAndPatchStatus needs in-code note |
7/10 | INFO | Polish |
| F7 | Disabled-state INFO log should be DEBUG | 5/10 | LOW | Optional |
| F8 | Env var name lacks TOOLHIVE_ prefix |
5/10 | LOW | Optional |
Overall
The mechanism is sound and the test design is genuinely strong. The cross-version RV-bump test ("re-encodes CRs that are stored at a prior storage version") is the load-bearing empirical proof that the apiserver actually re-encodes etcd bytes on cross-version Update — that's the only thing reviewers really need to see to trust the design, and it's there with explicit pre/post resourceVersion assertions. The pagination test wraps APIReader with a list-call counter for direct continue-token proof; the partial-failure and conflict-retry tests pin the storedVersions invariant; the use of APIReader (informer bypass) plus runtime isManagedCRD re-verification correctly handles the TOCTOU window between watch-time and reconcile-time. The defense-in-depth around the wildcard RBAC (RBAC bounds the API group; the label gate narrows within it) is well-reasoned and the errMigrationRetriedDueToConflicts sentinel correctly prevents trimming storedVersions when any per-CR write is unverified.
The findings here are mostly polish, not architectural. One HIGH and two MEDIUMs:
- F1 (HIGH) is straightforward doc/code drift — the type doc says one thing, the env-var gate does another, and the referenced
operator.features.storageVersionMigratorhelm path doesn't exist yet. Fixable in a single comment block. - F2 (MEDIUM) is the only operability concern:
ENABLE_STORAGE_VERSION_MIGRATOR=turesilently disables the feature with an INFO log. Per.claude/rules/go-style.md"Constructor Validation: Fail Loudly", that misconfiguration should fail startup or at minimum surface at WARN. - F3 (MEDIUM) is the wildcard RBAC. The runtime gate (
isManagedCRD) defends against the broad RBAC at execution time, but the regenerated chart now shipsgroups=toolhive.stacklok.dev,resources=*on every install regardless of whether the migrator is on. Worth surfacing in the PR's user-facing-change section even if the design is intentional.
Nothing here blocks the merge in a strict sense; all findings have unambiguous, contained fixes. The HIGH is HIGH only because doc/code drift on the activation mechanism is the kind of thing that causes operator escalations during a deprecation window.
Documentation
cmd/thv-operator/controllers/storageversionmigrator_controller.go:97-101 carries the contradicting doc comment (F1). No other documentation files in this PR need updating, but I'd suggest noting the operator SA permission widening (F3) in the PR's "Does this introduce a user-facing change?" section so chart consumers see it on review.
Generated with Claude Code
- F1: fix doc comment claiming "Enabled by default" — the controller is default-off in this release, opt-in via env var only; chart surface lands in a follow-up PR - F2: fail loudly on unparsable env-var value instead of silently defaulting to disabled (per go-style "Fail Loudly") - F6: document why patchStoredVersions cannot use the operator-wide MutateAndPatchStatus helper (CRD storedVersions is co-owned with kube-apiserver, optimistic lock is load-bearing) - F7: demote disabled-state log from INFO to V(1) (silent success) - F8: prefix env var with TOOLHIVE_ to match sibling operator env vars and avoid cross-operator collisions - Codespell: unparseable→unparsable, overrideable→overridable - Chainsaw RBAC fixture: include the migrator's apiextensions verbs and the wildcard toolhive.stacklok.dev rules so the multi-tenancy setup assertion matches the regenerated ClusterRole Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
Large PR justification has been provided. Thank you!
Same update as the multi-tenancy fixture in the previous commit — single-tenancy/setup also snapshots the operator ClusterRole and the migrator's RBAC rules need to appear in its expected YAML too. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds a
StorageVersionMigratorcontroller to the operator behind an opt-in feature flag (TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR). The flag defaults to false — this PR has no functional impact until an admin explicitly opts in. Required before a future release can drop a deprecatedapiVersion(e.g.v1alpha1) fromspec.versionswithout orphaning rows in etcd.This is the first of three sequential PRs that together deliver the migrator described in #4969. The full work lives in #5011 as a reference; this PR extracts just the controller code.
Part of #4969.
Medium level
StorageVersionMigratorReconcileratcmd/thv-operator/controllers/storageversionmigrator_controller.go— issues a Get+Update against each CR on opted-in CRDs, which re-encodes etcd bytes at the current storage version, then trimsstatus.storedVersionscmd/thv-operator/app/app.go:apiextensions/v1scheme registration,setupStorageVersionMigrator()setup,isStorageVersionMigratorEnabled()helper defaulting tofalseand failing loudly on an unparsable env-var valuecmd/thv-operator/test-integration/storageversionmigrator/with 8 scenarios including a cross-version re-encode testdeployment.yamlenv-var wiring — deferred to PR-C alongside the default-on flip. Reduces the chance of accidental enablement during the deprecation window. Early adopters who know about the feature can opt in viaoperator.env.Low level
cmd/thv-operator/controllers/storageversionmigrator_controller.gocmd/thv-operator/app/app.goapiextensions/v1in scheme; addsetupStorageVersionMigrator()+isStorageVersionMigratorEnabled()helper (defaultsfalse, fails loudly on unparsable values); gate registration on the flagcmd/thv-operator/test-integration/storageversionmigrator/suite_test.gocmd/thv-operator/test-integration/storageversionmigrator/controller_test.godeploy/charts/operator/templates/clusterrole/role.yamltask operator-manifests— picks up the controller's+kubebuilder:rbac:markerstest/e2e/chainsaw/operator/multi-tenancy/setup/assert-rbac-clusterrole.yamlType of change
Does this introduce a user-facing change?
Yes — operator ServiceAccount permissions widen on this PR. The regenerated
toolhive-operator-manager-roleClusterRole now grants:get/list/watchonapiextensions.k8s.io/customresourcedefinitionspatch/updateoncustomresourcedefinitions/statusget/list/updateontoolhive.stacklok.dev/*(wildcard within the operator's own API group)updateontoolhive.stacklok.dev/*/status(wildcard within the operator's own API group)These ship with the chart on every install regardless of whether the migrator is opted in. The runtime gate (
isManagedCRDrequiring the opt-in label) prevents the controller from touching anything other than labeled CRDs, but the RBAC widening is unconditional — this is intentional so admins can flip the env var on at any time without hitting permission-denied. Templating the RBAC behind a chart value is deferred to PR-C alongside the chart surface for the feature flag.No CR schema changes. No behavior change for the operator until an admin sets
TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=true.Test plan
task buildpassesgo test ./cmd/thv-operator/... | grep -v /test-integrationgo test ./cmd/thv-operator/test-integration/storageversionmigrator/...TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=turereturns a startup error rather than silently disablingtask lint-fix— blocked by a pre-existing tooling issue (golangci-lintbuilt with go1.25 vs. go1.26 target); not introduced by this PRLarge PR Justification
This PR exceeds 1000 lines, but the bulk is non-production code that cannot reasonably be split from the production code it covers:
controller_test.go+suite_test.go) — 8 ginkgo specs including a cross-version re-encode probe that empirically verifies the apiserver's storage-encoder behavior. Shipping the controller without its integration tests would be poor practice.role.yaml) — driven by+kubebuilder:rbac:markers in the controller; cannot be authored separately.The PR ships behind a default-off feature flag, so the production code is dormant unless explicitly opted in. PR-B (opt-in labels + marker-coverage CI) and PR-C (default-on flip + chart surface + docs) follow as separate, smaller PRs.
Special notes for reviewers
helm install --setwithout understanding the semantics.TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR(with prefix) matches sibling operator env vars likeTOOLHIVE_DEFAULT_IMAGE_PULL_SECRETSand avoids collisions with cluster-wide variables set by other operators or CI tooling.v1beta1root types + marker-coverage CI testvalues.yamlflag +deployment.yamlenv var) + user docs + upgrade-guide walkthroughGenerated with Claude Code