Add StorageVersionMigrator controller (opt-in, default off)#5362

Open
ChrisJBurns wants to merge 5 commits into
mainfrom
chris/svm-controller

Conversation

@ChrisJBurns
Collaborator

@ChrisJBurns ChrisJBurns commented May 21, 2026

Summary

Adds a StorageVersionMigrator controller to the operator behind an opt-in feature flag (TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR). The flag defaults to false, so this PR has no functional impact until an admin explicitly opts in. The controller is required before a future release can drop a deprecated apiVersion (e.g. v1alpha1) from spec.versions without orphaning rows in etcd.

This is the first of three sequential PRs that together deliver the migrator described in #4969. The full work lives in #5011 as a reference; this PR extracts just the controller code.

Part of #4969.

Medium level
  • New StorageVersionMigratorReconciler at cmd/thv-operator/controllers/storageversionmigrator_controller.go — issues a Get+Update against each CR on opted-in CRDs, which re-encodes etcd bytes at the current storage version, then trims status.storedVersions
  • App wiring in cmd/thv-operator/app/app.go: apiextensions/v1 scheme registration, setupStorageVersionMigrator() setup, isStorageVersionMigratorEnabled() helper defaulting to false and failing loudly on an unparsable env-var value
  • RBAC regenerated to include the controller's verbs (so the flag can be flipped at any time without permission errors)
  • envtest suite under cmd/thv-operator/test-integration/storageversionmigrator/ with 8 scenarios including a cross-version re-encode test
  • No helm chart flag entry and no deployment.yaml env-var wiring — deferred to PR-C alongside the default-on flip. Reduces the chance of accidental enablement during the deprecation window. Early adopters who know about the feature can opt in via operator.env.
  • No documentation — also lands in PR-C
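The Get+Update mechanism in the first bullet can be illustrated with a small in-memory model. This is only a sketch, not the controller's code: `fakeAPIServer`, `object`, and `migrate` are hypothetical names, and the fake assumes that an unchanged write to an object already stored at the current version is a no-op, while a write to an object stored at an older version is re-encoded and bumps its resourceVersion.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a custom resource as persisted in etcd.
type object struct {
	Name            string
	StoredVersion   string // apiVersion the persisted bytes are encoded at
	ResourceVersion int
}

// fakeAPIServer models only the behavior the migrator relies on.
type fakeAPIServer struct {
	storageVersion string
	objects        map[string]*object
}

// Get returns a copy of the stored object.
func (s *fakeAPIServer) Get(name string) (object, error) {
	o, ok := s.objects[name]
	if !ok {
		return object{}, fmt.Errorf("%s: not found", name)
	}
	return *o, nil
}

// Update applies optimistic locking, then re-encodes at the storage version.
// Assumption: a write that would not change the stored bytes is skipped.
func (s *fakeAPIServer) Update(in object) error {
	cur, ok := s.objects[in.Name]
	if !ok {
		return fmt.Errorf("%s: not found", in.Name)
	}
	if cur.ResourceVersion != in.ResourceVersion {
		return fmt.Errorf("%s: conflict", in.Name)
	}
	if cur.StoredVersion == s.storageVersion {
		return nil // already current: nothing to rewrite
	}
	cur.StoredVersion = s.storageVersion
	cur.ResourceVersion++
	return nil
}

// migrate issues a Get followed by an unmodified Update for every name.
func migrate(s *fakeAPIServer, names []string) error {
	for _, n := range names {
		o, err := s.Get(n)
		if err != nil {
			return err
		}
		if err := s.Update(o); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	s := &fakeAPIServer{
		storageVersion: "v1beta1",
		objects: map[string]*object{
			"a": {Name: "a", StoredVersion: "v1alpha1", ResourceVersion: 1},
			"b": {Name: "b", StoredVersion: "v1beta1", ResourceVersion: 4},
		},
	}
	if err := migrate(s, []string{"a", "b"}); err != nil {
		fmt.Println("migration failed:", err)
		return
	}
	for _, n := range []string{"a", "b"} {
		o := s.objects[n]
		fmt.Println(o.Name, o.StoredVersion, o.ResourceVersion)
	}
}
```

In this model, object `a` ends up at `v1beta1` with a bumped resourceVersion while `b` is left untouched. Only after every object has been rewritten would it be safe to reduce status.storedVersions to the single current storage version.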
Low level
| File | Change |
| --- | --- |
| cmd/thv-operator/controllers/storageversionmigrator_controller.go | New controller (reconciler, migration cache, RBAC markers, SetupWithManager) |
| cmd/thv-operator/app/app.go | Register apiextensions/v1 in scheme; add setupStorageVersionMigrator() + isStorageVersionMigratorEnabled() helper (defaults false, fails loudly on unparsable values); gate registration on the flag |
| cmd/thv-operator/test-integration/storageversionmigrator/suite_test.go | envtest harness |
| cmd/thv-operator/test-integration/storageversionmigrator/controller_test.go | 8 ginkgo scenarios |
| deploy/charts/operator/templates/clusterrole/role.yaml | Regenerated by task operator-manifests; picks up the controller's +kubebuilder:rbac: markers |
| test/e2e/chainsaw/operator/multi-tenancy/setup/assert-rbac-clusterrole.yaml | Updated to match the regenerated ClusterRole |

Type of change

  • New feature
  • Bug fix
  • Breaking change
  • Refactoring
  • Documentation
  • Other

Does this introduce a user-facing change?

Yes — operator ServiceAccount permissions widen on this PR. The regenerated toolhive-operator-manager-role ClusterRole now grants:

  • get/list/watch on apiextensions.k8s.io/customresourcedefinitions
  • patch/update on customresourcedefinitions/status
  • get/list/update on toolhive.stacklok.dev/* (wildcard within the operator's own API group)
  • update on toolhive.stacklok.dev/*/status (wildcard within the operator's own API group)

These ship with the chart on every install regardless of whether the migrator is opted in. The runtime gate (isManagedCRD requiring the opt-in label) prevents the controller from touching anything other than labeled CRDs, but the RBAC widening is unconditional — this is intentional so admins can flip the env var on at any time without hitting permission-denied. Templating the RBAC behind a chart value is deferred to PR-C alongside the chart surface for the feature flag.
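The runtime gate described above can be sketched as a pure predicate. This is a hedged illustration only: the PR does not state the actual label key or value, so `optInLabel` and the `"true"` value below are hypothetical, and checking the API group inside the gate is an assumption based on the operator's group being toolhive.stacklok.dev.

```go
package main

import "fmt"

const (
	// managedGroup is the operator's own API group, as named in the RBAC rules.
	managedGroup = "toolhive.stacklok.dev"
	// optInLabel is a hypothetical label key; the real key is not given in the PR.
	optInLabel = "example.toolhive.stacklok.dev/storage-version-migration"
)

// crd carries only the fields the gate inspects.
type crd struct {
	Group  string
	Labels map[string]string
}

// isManagedCRD reports whether a CRD belongs to the operator's API group
// and carries the opt-in label. Both conditions must hold.
func isManagedCRD(c crd) bool {
	return c.Group == managedGroup && c.Labels[optInLabel] == "true"
}

func main() {
	labeled := crd{Group: managedGroup, Labels: map[string]string{optInLabel: "true"}}
	unlabeled := crd{Group: managedGroup}
	foreign := crd{Group: "example.com", Labels: map[string]string{optInLabel: "true"}}

	fmt.Println(isManagedCRD(labeled), isManagedCRD(unlabeled), isManagedCRD(foreign))
}
```

Re-running a check like this at reconcile time, rather than trusting the watch-time filter alone, is what keeps the broad RBAC from translating into broad behavior.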

No CR schema changes. No behavior change for the operator until an admin sets TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=true.

Test plan

  • task build passes
  • Operator unit tests pass: go test ./cmd/thv-operator/... | grep -v /test-integration
  • envtest suite passes 8/8: go test ./cmd/thv-operator/test-integration/storageversionmigrator/...
  • Default-OFF verified: with the flag unset, the controller is not registered and the disabled-state message is logged only at V(1), so nothing appears at INFO
  • Fails loudly on an unparsable env-var value: TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR=ture returns a startup error rather than silently disabling
  • task lint-fix — blocked by a pre-existing tooling issue (golangci-lint built with go1.25 vs. go1.26 target); not introduced by this PR
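The default-off and fail-loudly behavior in the test plan could look roughly like the following. This is a minimal sketch that assumes the helper delegates to `strconv.ParseBool`; the function name `parseEnabled` is hypothetical and the real isStorageVersionMigratorEnabled() may be structured differently.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

const envVar = "TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR"

// parseEnabled interprets the raw env-var value.
// An unset or empty value means "disabled"; anything else must parse as a
// boolean, otherwise an error is returned so startup can fail.
func parseEnabled(raw string) (bool, error) {
	if raw == "" {
		return false, nil
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("invalid value %q for %s: %w", raw, envVar, err)
	}
	return enabled, nil
}

func main() {
	enabled, err := parseEnabled(os.Getenv(envVar))
	if err != nil {
		// Fail startup instead of silently running with the feature off.
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("storage version migrator enabled:", enabled)
}
```

With this shape, a typo such as `ture` stops the process at startup, whereas an unset variable leaves the migrator disabled.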

Large PR Justification

This PR exceeds 1000 lines, but the bulk is non-production code that cannot reasonably be split from the production code it covers:

  • ~712 lines of envtest scenarios (controller_test.go + suite_test.go) — 8 ginkgo specs including a cross-version re-encode probe that empirically verifies the apiserver's storage-encoder behavior. Shipping the controller without its integration tests would be poor practice.
  • ~481 lines of controller production code — the reconciler is one cohesive unit (migration cache, list pagination, conflict counting, optimistic-locked status patch). Splitting it further would create no independently reviewable boundary.
  • ~52 lines of regenerated RBAC YAML (role.yaml) — driven by +kubebuilder:rbac: markers in the controller; cannot be authored separately.
  • ~76 lines of app wiring + chainsaw fixture update.

The PR ships behind a default-off feature flag, so the production code is dormant unless explicitly opted in. PR-B (opt-in labels + marker-coverage CI) and PR-C (default-on flip + chart surface + docs) follow as separate, smaller PRs.

Special notes for reviewers

  • Why no chart surface: per design discussion, we want to keep this feature deliberately obscure until docs and the default-on flip ship together in PR-C. Reduces the chance of someone toggling it via casual helm install --set without understanding the semantics.
  • Why ship RBAC now: if RBAC landed only with the default-on flip, anyone manually opting in between PR-A and PR-C would hit permission-denied. Pre-shipping RBAC is harmless when the controller isn't registered.
  • Env-var name: TOOLHIVE_ENABLE_STORAGE_VERSION_MIGRATOR (with prefix) matches sibling operator env vars like TOOLHIVE_DEFAULT_IMAGE_PULL_SECRETS and avoids collisions with cluster-wide variables set by other operators or CI tooling.
  • Follow-ups:
    1. PR-B: opt-in label markers on the 12 v1beta1 root types + marker-coverage CI test
    2. PR-C: flip default to ON + add helm chart surface (values.yaml flag + deployment.yaml env var) + user docs + upgrade-guide walkthrough

Generated with Claude Code

Adds a StorageVersionMigrator controller to the operator behind a
default-off feature flag (ENABLE_STORAGE_VERSION_MIGRATOR). The
controller reconciles status.storedVersions on opted-in
toolhive.stacklok.dev CRDs by issuing a Get+Update against each CR,
which re-encodes etcd objects to the current storage version. This
is required before a future release can drop a deprecated apiVersion
(e.g. v1alpha1) from spec.versions without orphaning rows.

The flag defaults to false, so this PR has no functional impact for
any user until they explicitly opt in via operator.env. Helm chart
exposure (values.yaml flag entry and deployment env-var wiring) is
deferred to a follow-up PR to reduce accidental enablement during
the deprecation window.

RBAC for the controller's verbs ships in this PR so the flag can be
flipped on at any time without permission-denied errors. Opt-in
label markers on the 12 v1beta1 root types and the marker-coverage
CI test are deferred to a follow-up PR; the envtest suite creates
its own CRDs at runtime and is independent of those changes.

Part of #4969.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@github-actions github-actions Bot left a comment


Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions github-actions Bot added the size/XL Extra large PR: 1000+ lines changed label May 21, 2026
@codecov

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 0% with 202 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.48%. Comparing base (9a28521) to head (b491b9a).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...r/controllers/storageversionmigrator_controller.go | 0.00% | 174 Missing ⚠️ |
| cmd/thv-operator/app/app.go | 0.00% | 28 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5362      +/-   ##
==========================================
- Coverage   68.72%   68.48%   -0.25%     
==========================================
  Files         625      626       +1     
  Lines       63422    63624     +202     
==========================================
- Hits        43587    43572      -15     
- Misses      16585    16809     +224     
+ Partials     3250     3243       -7     


ChrisJBurns

This comment was marked as low quality.

Collaborator Author

@ChrisJBurns ChrisJBurns left a comment


Multi-Agent Consensus Review

Agents consulted: kubernetes-go-expert, code-reviewer, go-architect, toolhive-expert (go-security-reviewer's sandbox couldn't read the diff — its threat-model checklist was independently covered by the others)

Consensus Summary

| # | Finding | Consensus | Severity | Action |
| --- | --- | --- | --- | --- |
| F1 | Doc comment "Enabled by default" contradicts default-OFF | 10/10 | HIGH | Fix |
| F2 | Silent env-var parse fallback (fail-loudly rule) | 9/10 | MEDIUM | Fix |
| F3 | Wildcard RBAC widens group-wide permissions | 7/10 | MEDIUM | Discuss |
| F6 | Status().Patch exception to MutateAndPatchStatus needs in-code note | 7/10 | INFO | Polish |
| F7 | Disabled-state INFO log should be DEBUG | 5/10 | LOW | Optional |
| F8 | Env var name lacks TOOLHIVE_ prefix | 5/10 | LOW | Optional |

Overall

The mechanism is sound and the test design is genuinely strong. The cross-version RV-bump test ("re-encodes CRs that are stored at a prior storage version") is the load-bearing empirical proof that the apiserver actually re-encodes etcd bytes on cross-version Update — that's the only thing reviewers really need to see to trust the design, and it's there with explicit pre/post resourceVersion assertions. The pagination test wraps APIReader with a list-call counter for direct continue-token proof; the partial-failure and conflict-retry tests pin the storedVersions invariant; the use of APIReader (informer bypass) plus runtime isManagedCRD re-verification correctly handles the TOCTOU window between watch-time and reconcile-time. The defense-in-depth around the wildcard RBAC (RBAC bounds the API group; the label gate narrows within it) is well-reasoned and the errMigrationRetriedDueToConflicts sentinel correctly prevents trimming storedVersions when any per-CR write is unverified.

The findings here are mostly polish, not architectural. One HIGH and two MEDIUMs:

  • F1 (HIGH) is straightforward doc/code drift — the type doc says one thing, the env-var gate does another, and the referenced operator.features.storageVersionMigrator helm path doesn't exist yet. Fixable in a single comment block.
  • F2 (MEDIUM) is the only operability concern: ENABLE_STORAGE_VERSION_MIGRATOR=ture silently disables the feature with an INFO log. Per .claude/rules/go-style.md "Constructor Validation: Fail Loudly", that misconfiguration should fail startup or at minimum surface at WARN.
  • F3 (MEDIUM) is the wildcard RBAC. The runtime gate (isManagedCRD) defends against the broad RBAC at execution time, but the regenerated chart now ships groups=toolhive.stacklok.dev,resources=* on every install regardless of whether the migrator is on. Worth surfacing in the PR's user-facing-change section even if the design is intentional.

Nothing here blocks the merge in a strict sense; all findings have unambiguous, contained fixes. The HIGH is HIGH only because doc/code drift on the activation mechanism is the kind of thing that causes operator escalations during a deprecation window.

Documentation

cmd/thv-operator/controllers/storageversionmigrator_controller.go:97-101 carries the contradicting doc comment (F1). No other documentation files in this PR need updating, but I'd suggest noting the operator SA permission widening (F3) in the PR's "Does this introduce a user-facing change?" section so chart consumers see it on review.


Generated with Claude Code

Comment thread cmd/thv-operator/controllers/storageversionmigrator_controller.go Outdated
Comment thread cmd/thv-operator/app/app.go Outdated
Comment thread cmd/thv-operator/controllers/storageversionmigrator_controller.go
Comment thread cmd/thv-operator/controllers/storageversionmigrator_controller.go
Comment thread cmd/thv-operator/app/app.go Outdated
Comment thread cmd/thv-operator/app/app.go Outdated
- F1: fix doc comment claiming "Enabled by default" — the controller is
  default-off in this release, opt-in via env var only; chart surface
  lands in a follow-up PR
- F2: fail loudly on unparsable env-var value instead of silently
  defaulting to disabled (per go-style "Fail Loudly")
- F6: document why patchStoredVersions cannot use the operator-wide
  MutateAndPatchStatus helper (CRD storedVersions is co-owned with
  kube-apiserver, optimistic lock is load-bearing)
- F7: demote disabled-state log from INFO to V(1) (silent success)
- F8: prefix env var with TOOLHIVE_ to match sibling operator env vars
  and avoid cross-operator collisions
- Codespell: unparseable→unparsable, overrideable→overridable
- Chainsaw RBAC fixture: include the migrator's apiextensions verbs
  and the wildcard toolhive.stacklok.dev rules so the multi-tenancy
  setup assertion matches the regenerated ClusterRole

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 21, 2026
@github-actions
Contributor

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions Bot dismissed their stale review May 21, 2026 19:22

Large PR justification has been provided. Thank you!

Same update as the multi-tenancy fixture in the previous commit —
single-tenancy/setup also snapshots the operator ClusterRole and
the migrator's RBAC rules need to appear in its expected YAML too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 21, 2026
@github-actions github-actions Bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels May 22, 2026