resourcemanager: make controller config updates atomic #10504

Open
okJiang wants to merge 1 commit into tikv:master from okJiang:codex/issue-10335-atomic-controller-config

okJiang (Member) commented Mar 27, 2026

What problem does this PR solve?

Issue Number: Close #10335

POST /resource-manager/api/v1/config/controller validates request keys before
applying updates, but it still persists each field one at a time through
UpdateControllerConfigItem. A mixed valid/invalid payload can therefore write
an earlier field before a later invalid value returns 400.

What is changed and how does it work?

resourcemanager: make controller config updates atomic

Batch controller config updates so mixed valid/invalid payloads no longer
persist earlier fields before later validation errors are returned.
  • collect all resolved controller-config fields before applying any update
  • add UpdateControllerConfigItems to clone the current controller config,
    apply every requested field to the clone, and persist once on success
  • route the existing single-item helper through the batch path so callers keep
    the same entry point
  • add unit and integration regression coverage for mixed valid/invalid payloads

Check List

Tests

  • Unit test
  • Integration test

Release note

Fix a bug where the resource manager controller config API could partially
persist a multi-field update before returning a validation error.

Summary by CodeRabbit

  • Bug Fixes

    • Controller configuration updates are now atomic: either all requested changes apply successfully, or the entire operation fails without modifying any settings.
    • Enhanced error handling to prevent partial configuration state when invalid values are provided.
  • Tests

    • Added validation tests for atomic configuration update behavior and API error response handling.

ti-chi-bot bot added the release-note, dco-signoff: yes, and size/L labels on Mar 27, 2026
coderabbitai bot commented Mar 27, 2026

Walkthrough

This pull request refactors controller configuration updates to be atomic at the API level. Previously, individual config items were updated one-at-a-time through separate storage operations. The changes introduce a batch update method that validates all items before persisting, ensuring that mixed valid/invalid payloads either fully succeed or fully fail without partial persistence.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Batch Update Logic: pkg/mcs/resourcemanager/server/manager.go | Introduced UpdateControllerConfigItems method to atomically update multiple controller config keys in a single locked transaction. Existing UpdateControllerConfigItem now delegates to the batch method. Saves occur once per batch operation, and validation is applied per item through a shared applyControllerConfigItem helper. |
| Configuration Service Layer: pkg/mcs/resourcemanager/metadataapi/config_service.go | Extended ConfigStore interface with UpdateControllerConfigItems method. Updated SetControllerConfig to invoke batch updates instead of iterating per-item, simplifying error handling and ensuring atomicity at the API boundary. |
| Unit Tests: pkg/mcs/resourcemanager/server/manager_test.go, pkg/mcs/resourcemanager/metadataapi/config_service_test.go | Added atomic behavior verification in manager unit test; added mock UpdateControllerConfigItems method to test store to track bulk updates. |
| Integration Tests: tests/integrations/mcs/resourcemanager/api_test.go | Refactored test helpers to separate request/response handling. Added TestControllerConfigAPIAllOrNothing to validate that mixed valid/invalid config payloads fail atomically without partial state changes, and that error responses contain relevant validation details. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

lgtm, approved

Suggested reviewers

  • nolouch
  • lhy1024

Poem

🐰 Batch by batch, we now hold tight,
All-or-nothing, updates done right,
No partial saves when values slip,
Atomic dances, atomicity's grip!
Config updates, strong and bright! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'resourcemanager: make controller config updates atomic' clearly and concisely describes the main change - making controller config updates atomic. |
| Description check | ✅ Passed | The PR description includes the required issue number (Close #10335), explains the problem being solved, details the changes made, lists included tests (unit and integration), and provides a release note. |
| Linked Issues check | ✅ Passed | The PR fully addresses all requirements from issue #10335: makes controller config updates atomic, prevents partial persistence on validation errors, and includes regression tests for mixed valid/invalid payloads. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to implementing atomic controller config updates: new UpdateControllerConfigItems method, refactored UpdateControllerConfigItem to delegate to it, updated API layer to use batch updates, and added comprehensive test coverage. |

Signed-off-by: okjiang <819421878@qq.com>
okJiang force-pushed the codex/issue-10335-atomic-controller-config branch from ae1bc34 to f7de7cc on March 27, 2026 10:20
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/integrations/mcs/resourcemanager/api_test.go`:
- Around line 318-323: The test sends "true" as a string for the
"enable-controller-trace-log" field which makes the request's failure
non-deterministic; update the call to tryToSetControllerConfig so the map value
for "enable-controller-trace-log" is a boolean true (not the string "true")
while keeping "ltb-max-wait-duration" as the invalid string "not-a-duration" so
the request deterministically exercises the valid+invalid mix; locate the call
to tryToSetControllerConfig in the test and change that map entry accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f0d9242a-516b-4128-81df-8fb780263953

📥 Commits

Reviewing files that changed from the base of the PR and between 99eb5b5 and ae1bc34.

📒 Files selected for processing (4)
  • pkg/mcs/resourcemanager/server/apis/v1/api.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • tests/integrations/mcs/resourcemanager/api_test.go

Comment on lines +318 to +323
	resp, statusCode := tryToSetControllerConfig(re, suite.cluster.GetLeaderServer().GetAddr(), map[string]any{
		"enable-controller-trace-log": "true",
		"ltb-max-wait-duration":       "not-a-duration",
	})
	re.Equal(http.StatusBadRequest, statusCode)
	re.Contains(resp, "time:")

⚠️ Potential issue | 🟠 Major

Use a real boolean here to keep the regression deterministic.

Line 319 sends "true" as a string, so this request can fail on either field instead of exercising the intended valid+invalid mix. That makes the "time:" assertion order-dependent and weakens the atomicity regression.

Suggested fix
 	resp, statusCode := tryToSetControllerConfig(re, suite.cluster.GetLeaderServer().GetAddr(), map[string]any{
-		"enable-controller-trace-log": "true",
+		"enable-controller-trace-log": true,
 		"ltb-max-wait-duration":       "not-a-duration",
 	})
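
The distinction the reviewer draws can be illustrated with a small standalone sketch. It assumes the request body is decoded with `encoding/json` into a `map[string]any` and that the handler type-checks the decoded values; `isJSONBool` is a hypothetical helper written for this illustration and is not part of the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// isJSONBool reports whether the value stored under key in a JSON object
// decodes to a Go bool when the object is unmarshaled into map[string]any.
func isJSONBool(body []byte, key string) (bool, error) {
	var m map[string]any
	if err := json.Unmarshal(body, &m); err != nil {
		return false, err
	}
	_, ok := m[key].(bool)
	return ok, nil
}

func main() {
	for _, body := range []string{
		`{"enable-controller-trace-log": "true"}`, // JSON string
		`{"enable-controller-trace-log": true}`,   // JSON boolean
	} {
		ok, err := isJSONBool([]byte(body), "enable-controller-trace-log")
		fmt.Println(body, "decodes to bool:", ok, "err:", err)
	}
}
```

Under that assumption, the quoted form arrives as a Go `string` rather than a `bool`, so a payload using it may contain two invalid fields instead of one.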

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
tests/integrations/mcs/resourcemanager/api_test.go (1)

343-348: ⚠️ Potential issue | 🟠 Major

Use a real boolean in the mixed-payload regression.

Line 344 sends "true" as a string, so this request contains two invalid values instead of one valid field plus one invalid field. That weakens the all-or-nothing regression and can make the "time:" assertion fail for the wrong reason.

Suggested fix
 	resp, statusCode := tryToSetControllerConfig(re, suite.cluster.GetLeaderServer().GetAddr(), map[string]any{
-		"enable-controller-trace-log": "true",
+		"enable-controller-trace-log": true,
 		"ltb-max-wait-duration":       "not-a-duration",
 	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/resourcemanager/api_test.go` around lines 343 - 348,
The test is sending the boolean as a string which creates two invalid fields;
update the payload in the call to tryToSetControllerConfig so
"enable-controller-trace-log" is sent as a real boolean true (not the string
"true") while leaving "ltb-max-wait-duration": "not-a-duration" unchanged, so
the request has one valid field and one invalid duration field and the existing
assertion against tryToSetControllerConfig's response containing "time:" remains
meaningful.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/resourcemanager/metadataapi/config_service.go`:
- Around line 251-257: The current handler collapses all non-permission errors
from s.configStore.UpdateControllerConfigItems(resolvedConf) into 400 Bad
Request; change the error handling to distinguish validation errors from
persistence/storage failures (e.g., errors returned by SaveControllerConfig in
the store). Specifically, detect validation-related errors (the same error
type/value returned by your validation code) and continue to return 400 for
those, but treat store persistence/etcd/write errors (wrap/inspect errors coming
from UpdateControllerConfigItems/SaveControllerConfig or provide a
store.IsPersistenceError helper) as server-side failures and return an
appropriate 5xx (e.g., 500 or 503) with a clear log message; keep the existing
IsMetadataWriteDisabledError check for forbidden. Ensure the store layer
wraps/save failures so the handler can reliably distinguish the error kinds.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 01871f43-b5bb-429e-80b5-dbc8ca11a1a6

📥 Commits

Reviewing files that changed from the base of the PR and between ae1bc34 and f7de7cc.

📒 Files selected for processing (5)
  • pkg/mcs/resourcemanager/metadataapi/config_service.go
  • pkg/mcs/resourcemanager/metadataapi/config_service_test.go
  • pkg/mcs/resourcemanager/server/manager.go
  • pkg/mcs/resourcemanager/server/manager_test.go
  • tests/integrations/mcs/resourcemanager/api_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/mcs/resourcemanager/server/manager_test.go

Comment on lines +251 to +257
	if err := s.configStore.UpdateControllerConfigItems(resolvedConf); err != nil {
		if rmserver.IsMetadataWriteDisabledError(err) {
			c.String(http.StatusForbidden, err.Error())
			return
		}
		c.String(http.StatusBadRequest, err.Error())
		return

⚠️ Potential issue | 🟠 Major

Don't collapse persistence failures into 400 Bad Request.

Line 251 now calls the batch path, and that path can fail after validation when SaveControllerConfig hits storage. Returning 400 for every non-permission error will mislabel etcd/write failures as client input problems; please distinguish validation errors from persistence errors here, even if that means wrapping save failures from the store layer. As per coding guidelines, "HTTP handlers must validate payloads and return proper status codes; avoid panics".

ti-chi-bot bot added the needs-1-more-lgtm label on Apr 2, 2026
ti-chi-bot bot (Contributor) commented Apr 2, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024

ti-chi-bot bot (Contributor) commented Apr 2, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-02 03:30:41 UTC: ☑️ agreed by lhy1024.

ti-chi-bot bot added the approved label on Apr 2, 2026
okJiang (Member, Author) commented Apr 8, 2026

/retest

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.95%. Comparing base (3eb99ae) to head (f7de7cc).
⚠️ Report is 16 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10504      +/-   ##
==========================================
+ Coverage   78.88%   78.95%   +0.07%     
==========================================
  Files         530      531       +1     
  Lines       71548    71660     +112     
==========================================
+ Hits        56439    56578     +139     
+ Misses      11092    11058      -34     
- Partials     4017     4024       +7     
| Flag | Coverage Δ |
|---|---|
| unittests | 78.95% <91.66%> (+0.07%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

ti-chi-bot bot (Contributor) commented Apr 8, 2026

@okJiang: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-error-log-review | f7de7cc | link | true | /test pull-error-log-review |
| pull-unit-test-next-gen-3 | f7de7cc | link | true | /test pull-unit-test-next-gen-3 |
| pull-unit-test-next-gen-2 | f7de7cc | link | true | /test pull-unit-test-next-gen-2 |

Labels

approved, dco-signoff: yes, needs-1-more-lgtm, release-note, size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rm: make controller config metadata update atomic

2 participants