client: remove meta revision #10543

Open
bufferflies wants to merge 5 commits into tikv:master from bufferflies:pr-merge/c3cd07c6-remove-meta-revision
Conversation

Contributor

@bufferflies bufferflies commented Apr 1, 2026

Issue Number: ref #10516, close #10542

author: @disksing

cp c3cd07c6

What problem does this PR solve?

This removes the resource-group meta revision cursor from the client watch path.

What is changed and how does it work?

  • stop loading the initial resource-group revision only for watch startup
  • stop passing WithRev(metaRevision) when creating or recreating the watch
  • stop updating metaRevision from each watched event

Check List

Tests

  • go test . -run "Test.*ResourceManager.*" -count=1
  • go test ./resource_group/controller -run TestDoesNotExist -count=1
  • make check

Side effects

  • Possible performance regression

Release note

None

Summary by CodeRabbit

  • Refactor
    • Simplified resource-group metadata watcher and retry behavior by removing startup pre-load and local revision tracking. The watcher is now created without anchoring to a specific revision, streamlining initialization and retry flows.

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/needs-triage-completed labels Apr 1, 2026
Contributor

ti-chi-bot bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign overvenus for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 1, 2026

coderabbitai bot commented Apr 1, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Removed startup loading and per-event revision tracking from the resource-group meta watcher; watcher creation (initial and retry) no longer uses a specific revision.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Meta watch revision removal<br>`client/resource_group/controller/global_controller.go` | Deleted the initial provider.LoadResourceGroups(ctx) call and the metaRevision variable; removed opt.WithRev(metaRevision) from watcher creation (initial and retry) and stopped updating metaRevision on received events. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I nibble through the revision vine,
No more anchors hold my time.
Watches wake without a chain,
Lighter code, a cleaner brain. 🐇

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: removing the meta revision tracking from the client watch path. |
| Description check | ✅ Passed | The description covers the problem statement, changes made, test execution, and side effects as per the template requirements. |
| Linked Issues check | ✅ Passed | The PR implements the objective stated in linked issues by removing meta revision cursor from client watch path [#10542], addressing the root cause referenced in #10516. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: the modifications target only the global controller's meta revision handling as specified in linked issues. |

// Use WithPrevKV() to get the previous key-value pair when get Delete Event.
prefix := pd.GroupSettingsPathPrefixBytes(c.keyspaceID)
-watchMetaChannel, err = c.provider.Watch(ctx, prefix, opt.WithRev(metaRevision), opt.WithPrefix(), opt.WithPrevKV())
+watchMetaChannel, err = c.provider.Watch(ctx, prefix, opt.WithPrefix(), opt.WithPrevKV())
Contributor Author

This changes the reconnect semantics of the resource-group meta watch. Before this PR, the controller resumed from the last processed metaRevision, so updates that happened while the watch stream was broken could still be replayed. After removing WithRev(metaRevision), a re-created watch starts from "now", which can silently skip PUT/DELETE events that landed during the disconnect window. That means the local controller cache can diverge from RM state after a transient watch failure. Please keep a resume revision (or reload a fresh snapshot before recreating the watch) so reconnects do not lose intermediate resource-group changes.

Contributor Author

Good catch. I updated the retry path to reload a fresh resource-group snapshot, resync the cached controllers, and recreate the watch from snapshot revision + 1 so reconnects do not skip the disconnect window. I also added TestReloadResourceGroupMetaWatch to cover the retry behavior.
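
The retry behavior described above (reload a snapshot, resync the cache, resume the watch from snapshot revision + 1) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the code from this PR: `groupMeta`, `snapshot`, `cache`, and `resync` are hypothetical names, and the real controller's types, locking, and provider calls are not modeled.

```go
package main

import "fmt"

// groupMeta is a hypothetical stand-in for a resource group's settings.
type groupMeta struct {
	Name     string
	FillRate int
}

// snapshot is a hypothetical result of reloading all groups: the groups
// themselves plus the store revision at which they were read.
type snapshot struct {
	Groups   []groupMeta
	Revision int64
}

// cache is a hypothetical local controller cache keyed by group name.
type cache struct {
	groups     map[string]groupMeta
	tombstoned map[string]bool
}

func newCache() *cache {
	return &cache{groups: map[string]groupMeta{}, tombstoned: map[string]bool{}}
}

// resync reconciles the cache with a fresh snapshot and returns the revision
// from which a re-created watch should start (snapshot revision + 1).
func (c *cache) resync(s snapshot) int64 {
	seen := make(map[string]bool, len(s.Groups))
	for _, g := range s.Groups {
		seen[g.Name] = true
		c.groups[g.Name] = g // updates an existing group or creates a snapshot-only one
		delete(c.tombstoned, g.Name)
	}
	// Groups that vanished from the snapshot were deleted while disconnected.
	for name := range c.groups {
		if !seen[name] {
			delete(c.groups, name)
			c.tombstoned[name] = true
		}
	}
	return s.Revision + 1
}

func main() {
	c := newCache()
	c.groups["a"] = groupMeta{Name: "a", FillRate: 100}
	c.groups["b"] = groupMeta{Name: "b", FillRate: 200}

	next := c.resync(snapshot{
		Groups:   []groupMeta{{Name: "a", FillRate: 150}, {Name: "c", FillRate: 300}},
		Revision: 41,
	})
	fmt.Println(len(c.groups), c.tombstoned["b"], next)
}
```

Starting the new watch at revision + 1 rather than at the snapshot revision itself avoids replaying the change that the snapshot already reflects.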

Signed-off-by: bufferflies <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Apr 1, 2026
Signed-off-by: bufferflies <1045931706@qq.com>
Contributor Author

Re-reviewed the latest follow-up on top of commit 64409c56374adbb6092ceb6e0e16052da1093706.

The previous reconnect/snapshot finding is now addressed:

  • snapshot sync still updates existing groups
  • snapshot sync still tombstones groups that disappeared from the fresh snapshot
  • snapshot-only groups are now also created in the local controller cache before the watch resumes from revision + 1

I do not have a new finding on this follow-up delta.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/resource_group/controller/global_controller.go`:
- Around line 334-335: Replace the direct initial call to c.provider.Watch(ctx,
...) with a call to reloadResourceGroupMetaWatch(c.loopCtx) so the first
meta-watch is bootstrapped with the controller's loop context (c.loopCtx) and
performs the snapshot/ready barrier before Start() returns; keep retries using
c.loopCtx when re-opening watches, ensure watchMetaChannel is derived from
reloadResourceGroupMetaWatch, and make Stop() cancel/close via c.loopCtx (or an
errgroup tied to it) to prevent goroutine leaks and ensure the initial watch is
established before request traffic can create controllers.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 049cc121-4f58-4ef6-ac80-5be97c35319b

📥 Commits

Reviewing files that changed from the base of the PR and between 4ad4e84 and daef237.

📒 Files selected for processing (2)
  • client/resource_group/controller/global_controller.go
  • client/resource_group/controller/global_controller_test.go

// Use WithPrevKV() to get the previous key-value pair when get Delete Event.
prefix := pd.GroupSettingsPathPrefixBytes(c.keyspaceID)
-watchMetaChannel, err = c.provider.Watch(ctx, prefix, opt.WithRev(metaRevision), opt.WithPrefix(), opt.WithPrevKV())
+watchMetaChannel, err = c.provider.Watch(ctx, prefix, opt.WithPrefix(), opt.WithPrevKV())
Contributor

@lhy1024 lhy1024 Apr 2, 2026

The initial watch here no longer uses a revision, which leaves a startup window. If a resource group changes after the controller starts but before the watch is actually established, the local cache can miss that update. The reconnect path already rebuilds from a snapshot/revision, but the initial startup path still needs the same barrier.

Contributor Author

Yes, but we just need the latest event and can ignore all the MVCC events.

Contributor

I agree we do not need the whole MVCC history. The issue is that without a startup barrier, we may miss the latest update itself if it happens before the watch is established and no later event arrives. In that case the local cache can stay stale indefinitely.
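
To make the startup window concrete, here is a small simulation of the scenario discussed in this thread. It is a sketch under stated assumptions: `store`, `event`, `eventsFrom`, and `startup` are hypothetical names for an in-memory stand-in, not the PD or etcd client API, and the cache is assumed to be populated from a read that happens before the watch is established.

```go
package main

import "fmt"

// event is a hypothetical change record: a key set to a value at a revision.
type event struct {
	Rev   int64
	Key   string
	Value string
}

// store is a hypothetical in-memory stand-in for a revisioned KV store.
type store struct {
	rev  int64
	data map[string]string
	log  []event
}

func newStore() *store { return &store{data: map[string]string{}} }

// put writes a key and records the change at a new revision.
func (s *store) put(key, value string) {
	s.rev++
	s.data[key] = value
	s.log = append(s.log, event{Rev: s.rev, Key: key, Value: value})
}

// load returns a copy of the current data and the revision it reflects.
func (s *store) load() (map[string]string, int64) {
	out := make(map[string]string, len(s.data))
	for k, v := range s.data {
		out[k] = v
	}
	return out, s.rev
}

// eventsFrom returns recorded events with revision >= from, which is what a
// watch anchored at `from` would replay once it is established.
func (s *store) eventsFrom(from int64) []event {
	var out []event
	for _, e := range s.log {
		if e.Rev >= from {
			out = append(out, e)
		}
	}
	return out
}

// startup simulates controller start: read the current state, let one write
// land before the watch exists, then open the watch. With anchored=true the
// watch starts at the read revision + 1; otherwise it starts from "now".
func startup(s *store, anchored bool) map[string]string {
	cache, rev := s.load()
	s.put("rg1", "v2") // update that lands inside the startup window
	from := s.rev + 1  // unanchored: only events after the current revision
	if anchored {
		from = rev + 1
	}
	for _, e := range s.eventsFrom(from) {
		cache[e.Key] = e.Value
	}
	return cache
}

func main() {
	for _, anchored := range []bool{false, true} {
		s := newStore()
		s.put("rg1", "v1")
		fmt.Printf("anchored=%v cache[rg1]=%s\n", anchored, startup(s, anchored)["rg1"])
	}
}
```

In the unanchored run the cache keeps the old value until some later event for the same key arrives, which is the stale-indefinitely case raised above.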

@bufferflies bufferflies requested a review from disksing April 3, 2026 02:06
@ti-chi-bot ti-chi-bot bot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed do-not-merge/needs-triage-completed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 8, 2026
@bufferflies bufferflies requested a review from lhy1024 April 8, 2026 07:24
Contributor Author

@coderabbitai full review


coderabbitai bot commented Apr 8, 2026

✅ Actions performed

Full review triggered.

This reverts commit 64409c5.

Signed-off-by: bufferflies <1045931706@qq.com>
This reverts commit daef237.

Signed-off-by: bufferflies <1045931706@qq.com>
@bufferflies bufferflies force-pushed the pr-merge/c3cd07c6-remove-meta-revision branch from 2cfa22f to 1228f9d Compare April 8, 2026 07:26
@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
client/resource_group/controller/global_controller.go (1)

334-335: ⚠️ Potential issue | 🟠 Major

Use c.loopCtx (not parent ctx) for meta watch lifecycle.

Both initial and retry meta-watch calls are still bound to ctx, while Stop() only cancels c.loopCtx. That can leave watch streams running after controller stop if the parent context is still alive.

Suggested fix
-			watchMetaChannel, err = c.provider.Watch(ctx, prefix, opt.WithPrefix(), opt.WithPrevKV())
+			watchMetaChannel, err = c.provider.Watch(c.loopCtx, prefix, opt.WithPrefix(), opt.WithPrevKV())
...
-					watchMetaChannel, err = c.provider.Watch(ctx, prefix, opt.WithPrefix(), opt.WithPrevKV())
+					watchMetaChannel, err = c.provider.Watch(c.loopCtx, prefix, opt.WithPrefix(), opt.WithPrevKV())

As per coding guidelines, "Prevent goroutine leaks: pair with cancellation; consider errgroup".

Also applies to: 362-363
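
The lifecycle concern can be illustrated with a standalone sketch. It assumes, as the comment states, that Stop() cancels only the loop context; `watch`, `stopped`, and `demo` are hypothetical helpers, not the provider API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// watch is a hypothetical stand-in for a provider watch: it starts a goroutine
// that runs until ctx is cancelled and then closes the returned channel.
func watch(ctx context.Context) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		<-ctx.Done()
	}()
	return done
}

// stopped reports whether the watch goroutine exits within the timeout.
func stopped(done <-chan struct{}, timeout time.Duration) bool {
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false
	}
}

// demo opens one watch on the parent context and one on the loop context,
// cancels only the loop context (as Stop() is assumed to do), and reports
// which watch goroutines have exited.
func demo() (loopStopped, parentStopped bool) {
	parent, cancelParent := context.WithCancel(context.Background())
	defer cancelParent() // cleans up the leaked watch when the demo returns
	loopCtx, stop := context.WithCancel(parent)

	onParent := watch(parent)
	onLoop := watch(loopCtx)

	stop()
	return stopped(onLoop, time.Second), stopped(onParent, 50*time.Millisecond)
}

func main() {
	loopStopped, parentStopped := demo()
	fmt.Println("watch on loopCtx stopped:", loopStopped)
	fmt.Println("watch on parent ctx stopped:", parentStopped)
}
```

The watch bound to the parent context keeps running after the loop context is cancelled, which is the leak the suggested fix is meant to avoid.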

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@client/resource_group/controller/global_controller.go` around lines 334 -
335, The meta-watch calls are using the parent ctx instead of the controller
lifecycle context, so change the Watch invocations to use c.loopCtx (not ctx) so
the watch stream is cancelled when Stop() cancels c.loopCtx; specifically update
the c.provider.Watch(...) calls (both the initial call near the
prefix/opt.WithPrevKV() and the retry/loop call around the same logic) to pass
c.loopCtx and ensure any goroutine handling the watch is tied to c.loopCtx for
proper cancellation and no-leak pairing.
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dc8ad108-2f5e-44e3-b3d9-0e4a354ae8d7

📥 Commits

Reviewing files that changed from the base of the PR and between dca466b and 1228f9d.

📒 Files selected for processing (1)
  • client/resource_group/controller/global_controller.go

Contributor

ti-chi-bot bot commented Apr 8, 2026

@bufferflies: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-2 | 1228f9d | link | true | /test pull-unit-test-next-gen-2 |
| pull-unit-test-next-gen-3 | 1228f9d | link | true | /test pull-unit-test-next-gen-3 |


codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78.89%. Comparing base (3eb99ae) to head (1228f9d).
⚠️ Report is 16 commits behind head on master.

❌ Your patch check has failed because the patch coverage (50.00%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10543      +/-   ##
==========================================
+ Coverage   78.88%   78.89%   +0.01%     
==========================================
  Files         530      532       +2     
  Lines       71548    71858     +310     
==========================================
+ Hits        56439    56694     +255     
- Misses      11092    11133      +41     
- Partials     4017     4031      +14     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 78.89% <50.00%> (+0.01%) ⬆️ |

Contributor Author

/ping @disksing @lhy1024

@bufferflies bufferflies requested a review from okJiang April 10, 2026 10:18
Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

c3cd07c6 remove meta revision

2 participants