
keyspace, scheduling: watch keyspace meta for suspect range updates #10506

Open
bufferflies wants to merge 4 commits into tikv:master from bufferflies:decouple_keyspace_utils

Conversation

Contributor

@bufferflies bufferflies commented Mar 27, 2026

What problem does this PR solve?

Issue Number: ref #10123

What is changed and how does it work?

Add keyspace range parsing and cached keyspace metadata helpers so that scheduling
can detect regions that span multiple keyspaces and calculate split points.

Watch keyspace metadata changes in scheduling, refresh the keyspace cache from
etcd events, and add suspect key ranges for the affected raw and txn bounds.
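
The parsing helpers described above can be sketched roughly as follows. The key layout assumed here (a mode byte 'r' for raw or 'x' for txn, followed by a 3-byte big-endian keyspace ID) and all function names are illustrative assumptions, not the PR's actual signatures.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// extractKeyspaceID is a hypothetical sketch of the helper added in
// pkg/keyspace/util.go, assuming an API-v2-style key layout: one mode
// byte ('r' raw, 'x' txn) followed by a 3-byte big-endian keyspace ID.
func extractKeyspaceID(key []byte) (uint32, bool) {
	if len(key) < 4 || (key[0] != 'r' && key[0] != 'x') {
		return 0, false
	}
	// Decode the 3-byte ID into the low 24 bits of a uint32.
	return binary.BigEndian.Uint32([]byte{0, key[1], key[2], key[3]}), true
}

// regionSpansMultipleKeyspaces reports whether a region's start and end
// keys decode to different keyspace IDs, which is what makes the region
// a split candidate.
func regionSpansMultipleKeyspaces(startKey, endKey []byte) bool {
	s, okS := extractKeyspaceID(startKey)
	e, okE := extractKeyspaceID(endKey)
	return okS && okE && s != e
}

func main() {
	start := []byte{'x', 0, 0, 1, 'a'}
	end := []byte{'x', 0, 0, 2}
	fmt.Println(regionSpansMultipleKeyspaces(start, end)) // true: IDs 1 and 2 differ
}
```

A region whose start and end keys decode to the same ID stays untouched; only cross-keyspace regions feed the split-point computation.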

Check List

Tests

  • Unit test
  • Integration test

Side effects

  • Increased code complexity

Release note

None.

Summary by CodeRabbit

  • New Features

    • In-memory keyspace cache and live watcher for keyspace metadata; cluster now exposes the cache for scheduling.
    • Automatic detection of regions that span multiple keyspaces and computation of keyspace split points to improve region management.
  • Bug Fixes

    • Prevented unintended retention of key-range references to reduce memory leaks.
  • Tests

    • Expanded unit and integration tests covering keyspace extraction, split detection, cache behavior, and metadata watch flow.

Signed-off-by: tongjian <1045931706@qq.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Mar 27, 2026

coderabbitai bot commented Mar 27, 2026

📝 Walkthrough

Walkthrough

Adds keyspace parsing/boundary utilities, an in-memory keyspace cache, a keyspace-meta etcd watcher that publishes suspect key ranges to the checker controller, integrates the cache and watcher into the scheduling cluster/server, and augments tests for parsing, cache, watcher, and end-to-end behavior.

Changes

Cohort / File(s) Summary
Keyspace utilities & cache
pkg/keyspace/util.go, pkg/keyspace/util_test.go
Introduce KeyType, ExtractKeyspaceID, Checker interface, region-bound helpers (RegionSpansMultipleKeyspaces, GetKeyspaceSplitKeys), a btree-backed Cache + API (NewCache, Save, DeleteKeyspace, KeyspaceExist, GetKeyspaceIDInRange), and NewKeyspaceMeta. Extensive unit tests added.
Keyspace manager integration
pkg/keyspace/keyspace.go, pkg/keyspace/keyspace_test.go
Manager now holds and populates the keyspace Cache; added GetKeyspaceIDInRange and KeyspaceExist methods; tests for existence and range queries added.
Keyspace metadata watcher
pkg/mcs/scheduling/server/keyspace_meta/watcher.go, pkg/mcs/scheduling/server/keyspace_meta/watch_test.go
New Watcher to monitor etcd keyspace meta prefix, parse meta keys, update cache, derive txn/raw region bounds, accumulate suspect ranges and forward them to checker controller; Close cancels watcher.
Scheduling cluster & server integration
pkg/mcs/scheduling/server/cluster.go, pkg/mcs/scheduling/server/server.go
Cluster now holds keyspaceCache (keyspace.NewCache() exposed via GetKeyspaceCache()); Server initializes the keyspace meta Watcher on startup and closes it on stop.
Key range utilities
pkg/utils/keyutil/keyrange.go
Added (*KeyRanges).Clean() to clear internal accumulator slice between batches.
Rule watcher
pkg/mcs/scheduling/server/rule/watcher.go
Added failpoint skipRuleCommit in postEventsFn to optionally skip committing rule patches and subsequent suspect-range additions.
Integration tests
tests/integrations/mcs/scheduling/meta_test.go
Added TestKeyspaceMetaWatch to exercise end-to-end flow: create keyspace, start cluster, and assert watcher emits expected suspect key ranges; minor setup change disabling keyspace split wait.
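
The (*KeyRanges).Clean() entry above is small but load-bearing for the watcher's batching. Below is a minimal sketch of the accumulator pattern it supports; the type shape and method bodies are assumptions for illustration, not the real pkg/utils/keyutil code.

```go
package main

import "fmt"

// KeyRange mirrors the start/end pair the watcher accumulates.
type KeyRange struct {
	StartKey, EndKey []byte
}

// KeyRanges is a sketch of the accumulator; the real type's fields may differ.
type KeyRanges struct {
	krs []*KeyRange
}

// Append records one suspect range for the current event batch.
func (rs *KeyRanges) Append(start, end []byte) {
	rs.krs = append(rs.krs, &KeyRange{StartKey: start, EndKey: end})
}

// Ranges returns the accumulated ranges.
func (rs *KeyRanges) Ranges() []*KeyRange { return rs.krs }

// Clean drops the slice reference between batches so the key data can be
// garbage-collected, matching the PR's note about preventing unintended
// retention of key-range references.
func (rs *KeyRanges) Clean() {
	rs.krs = nil
}

func main() {
	var rs KeyRanges
	rs.Append([]byte("a"), []byte("b"))
	fmt.Println(len(rs.Ranges())) // 1
	rs.Clean()
	fmt.Println(len(rs.Ranges())) // 0
}
```

Setting the slice to nil (rather than reslicing to [:0]) is what actually releases the backing array and the keys it holds.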

Sequence Diagram(s)

sequenceDiagram
    participant Etcd as Etcd (keyspace meta)
    participant Watcher as Keyspace Watcher
    participant Cache as Keyspace Cache
    participant Checker as Checker Controller
    participant Server as Scheduling Server

    Server->>Watcher: NewWatcher(ctx, etcdClient, checkerController, keyspaceCache)
    Watcher->>Etcd: Watch(keypath.KeyspaceMetaPrefix())
    Etcd-->>Watcher: put/delete event (meta key)
    Watcher->>Watcher: extractKeyspaceIDFromMetaKey()
    Watcher->>Cache: Save/DeleteKeyspace(id, meta)
    Watcher->>Watcher: MakeRegionBound(id, mode) -> {left,right} bounds
    Watcher->>Checker: AddSuspectKeyRange(start, end)
    Checker-->>Server: Suspect ranges queued/available
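
The sequence above can be condensed into a toy Go program. The cache and checker types and the meta-key format here are simplified stand-ins for PD's real interfaces, used only to show the put-event flow.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cache stands in for the keyspace cache: ID -> serialized meta.
type cache map[uint32]string

// checker stands in for the checker controller's suspect-range queue.
type checker struct{ suspect [][2]string }

func (c *checker) AddSuspectKeyRange(start, end string) {
	c.suspect = append(c.suspect, [2]string{start, end})
}

// extractKeyspaceIDFromMetaKey assumes meta keys shaped like
// ".../keyspaces/meta/<id>"; the real prefix comes from keypath.
func extractKeyspaceIDFromMetaKey(key string) (uint32, error) {
	idx := strings.LastIndex(key, "/")
	id, err := strconv.ParseUint(key[idx+1:], 10, 32)
	return uint32(id), err
}

// onPut mirrors the watcher's putFn: update the cache, derive bounds,
// and queue suspect ranges so the checker re-examines affected regions.
func onPut(c cache, ch *checker, key, value string) error {
	id, err := extractKeyspaceIDFromMetaKey(key)
	if err != nil {
		return err
	}
	c[id] = value
	// The real code computes txn and raw bounds via MakeRegionBound;
	// here both are faked with string prefixes.
	ch.AddSuspectKeyRange(fmt.Sprintf("x%03d", id), fmt.Sprintf("x%03d", id+1))
	ch.AddSuspectKeyRange(fmt.Sprintf("r%03d", id), fmt.Sprintf("r%03d", id+1))
	return nil
}

func main() {
	c := cache{}
	ch := &checker{}
	if err := onPut(c, ch, "/keyspaces/meta/7", "meta-bytes"); err != nil {
		panic(err)
	}
	fmt.Println(len(ch.suspect)) // 2: one txn range and one raw range
}
```

Each keyspace change yields two suspect ranges (txn and raw), which is why the integration test below is asked to verify both.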

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm, approved, ok-to-test

Suggested reviewers

  • rleungx
  • lhy1024

Poem

🐰
I nibble bytes where keyspaces lie,
I map their bounds and watch them fly,
Cache warm beneath a comfy paw,
Suspect ranges hop — I send them off! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 55.56%, below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title directly reflects the main objective of the PR: watching keyspace metadata changes to update suspect ranges for scheduling purposes.
  • Description check ✅ Passed: the description includes the required issue reference, explains what changed and how it works, and completes most template sections, including tests and side effects.


@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 27, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
pkg/mcs/scheduling/server/keyspace_meta/watcher.go (1)

17-36: Reorder the etcd imports into the third-party block.

go.etcd.io/... is currently grouped after the github.com/tikv/pd/... imports, so gci/goimports will keep this file dirty.

As per coding guidelines, **/*.go: Use gci/goimports ordering: stdlib | third-party | pingcap | tikv/pd | blank

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 17 - 36, The
import ordering violates gci/goimports: move the etcd imports
(go.etcd.io/etcd/api/v3/mvccpb and go.etcd.io/etcd/client/v3 as clientv3) into
the third-party block so imports follow stdlib | third-party |
github.com/pingcap | github.com/tikv/pd | blank; update the import block around
the existing imports in watcher.go so the two go.etcd.io entries appear with
other third-party imports before the github.com/pingcap and github.com/tikv/pd
groups.
tests/integrations/mcs/scheduling/meta_test.go (1)

159-165: Verify both raw and txn suspect ranges.

The watcher appends two ranges per keyspace change. This assertion returns on the first matching pair, so the test still passes if one of the two append paths regresses.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integrations/mcs/scheduling/meta_test.go` around lines 159 - 165, The
test currently exits as soon as PopOneSuspectKeyRange returns a matching pair,
which can hide regressions because the watcher should append both txn and raw
ranges; update the test inside testutil.Eventually to drain/pop suspect ranges
until none remain (calling
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange repeatedly),
record whether you observed the txn pair (bytes.Equal(kr[0], bound.TxnLeftBound)
&& bytes.Equal(kr[1], bound.TxnRightBound)) and the raw pair (bytes.Equal(kr[0],
bound.RawLeftBound) && bytes.Equal(kr[1], bound.RawRightBound)), and only return
true when both flags are seen; use the same
GetCoordinator().GetCheckerController().PopOneSuspectKeyRange(), kr variable and
bound constants to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/mcs/scheduling/server/cluster.go`:
- Around line 221-223: NewCluster never initializes the Cluster.keyspaceCache
field, so GetKeyspaceCache() returns nil and startCluster passes a nil cache
into keyspace_meta.NewWatcher; initialize keyspaceCache inside NewCluster (e.g.,
create a new keyspace.Cache instance and assign it to c.keyspaceCache) before
returning the Cluster so GetKeyspaceCache() returns a valid cache used by
startCluster and keyspace_meta.NewWatcher to populate metadata and enable
keyspace-aware split checks.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go`:
- Around line 124-130: The code calls
rw.checkerController.AddSuspectKeyRange(...) but the checker controller API is
AddSuspectKeyRanges(start, end []byte); update the postEventsFn in watcher.go to
build two slices (e.g., starts, ends) by iterating suspectKeyRanges.Ranges() and
appending kr.StartKey and kr.EndKey, then call
rw.checkerController.AddSuspectKeyRanges(starts, ends) (keeping the nil check
for rw.checkerController). This ensures you use the correct method name and
parameter shapes (AddSuspectKeyRanges and suspectKeyRanges.Ranges()) so the code
compiles against the controller.

In `@pkg/mcs/scheduling/server/server.go`:
- Around line 537-541: The keyspace watcher created with
keyspace_meta.NewWatcher using s.Context() (symbol: keyspaceWatcher) is not
closed during primary shutdown; update stopWatcher() to explicitly call
keyspaceWatcher.Close() (or Stop/Cancel method provided by the watcher) and/or
construct the watcher with the primary callback context instead of s.Context()
so it is canceled on primary exit; locate the creation site
(keyspace_meta.NewWatcher) and the stopWatcher() function and ensure the watcher
is closed and any errors ignored/handled when the primary steps down to prevent
goroutine leaks and duplicate updates.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e5f0673a-3a4f-4ceb-84a2-98eaf6beeb61

📥 Commits

Reviewing files that changed from the base of the PR and between 99eb5b5 and 93b8c72.

📒 Files selected for processing (8)
  • pkg/keyspace/util.go
  • pkg/keyspace/util_test.go
  • pkg/mcs/scheduling/server/cluster.go
  • pkg/mcs/scheduling/server/keyspace_meta/watch_test.go
  • pkg/mcs/scheduling/server/keyspace_meta/watcher.go
  • pkg/mcs/scheduling/server/server.go
  • pkg/utils/keyutil/keyrange.go
  • tests/integrations/mcs/scheduling/meta_test.go

Signed-off-by: bufferflies <1045931706@qq.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
pkg/mcs/scheduling/server/keyspace_meta/watcher.go (3)

89-108: Consider using log.Debug for routine watch events.

log.Info on every PUT event (line 90) could generate significant log volume in clusters with frequent keyspace metadata changes. Operational logs at Info level are typically reserved for less frequent, notable events.

♻️ Suggested adjustment
 	putFn := func(kv *mvccpb.KeyValue) error {
-		log.Info("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value)))
+		log.Debug("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value)))

The same consideration applies to deleteFn on line 111.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 89 - 108,
The watcher currently logs every PUT (in putFn) and DELETE (in deleteFn) event
with log.Info which can flood logs; change these to log.Debug (or conditional
logging) for routine watch events and keep Info for notable occurrences. Update
the calls to log.Info in putFn and deleteFn to use log.Debug (or guard with
log.Enabled(zap.DebugLevel) if supported) so routine keyspace meta updates use
debug level while preserving existing context fields (e.g., key and value) and
keeping any error logs at Info/Error levels.

30-36: Import ordering: move third-party import to correct group.

"go.etcd.io/etcd/api/v3/mvccpb" on line 35 is a third-party package but is placed in the tikv/pd import group. Per project conventions, it should be grouped with other third-party imports.

♻️ Suggested fix
 import (
 	"context"
 	"errors"
 	"fmt"
 	"strconv"
 	"strings"
 	"sync"
 
 	clientv3 "go.etcd.io/etcd/client/v3"
+	"go.etcd.io/etcd/api/v3/mvccpb"
 	"go.uber.org/zap"
 
 	"github.com/pingcap/log"
 
 	"github.com/tikv/pd/pkg/keyspace"
 	"github.com/tikv/pd/pkg/schedule/checker"
 	"github.com/tikv/pd/pkg/utils/etcdutil"
 	"github.com/tikv/pd/pkg/utils/keypath"
 	"github.com/tikv/pd/pkg/utils/keyutil"
-	"go.etcd.io/etcd/api/v3/mvccpb"
 )

As per coding guidelines: "Use gci/goimports ordering: stdlib | third-party | pingcap | tikv/pd | blank".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 30 - 36, The
import grouping is incorrect: the third-party package
"go.etcd.io/etcd/api/v3/mvccpb" is currently inside the tikv/pd group; move that
import into the third-party import group (between stdlib and pingcap groups) in
the import block of watcher.go so it follows the project's gci/goimports order;
ensure the import name mvccpb remains unchanged and the remaining tikv/pd
imports (keyspace, checker, etcdutil, keypath, keyutil) stay together.

60-81: Consider validating checkerController at construction time.

If checkerController is nil, postEventsFn returns an error on every event batch, which per etcdutil.LoopWatcher behavior (relevant snippet 4) gets logged but doesn't stop the watcher. This results in repeated error logs without surfacing the misconfiguration clearly.

Either validate non-nil at construction:

 func NewWatcher(
 	ctx context.Context,
 	etcdClient *clientv3.Client,
 	checkerController *checker.Controller,
 	keyspaceCache *keyspace.Cache) (*Watcher, error) {
+	if checkerController == nil {
+		return nil, errors.New("checkerController cannot be nil")
+	}
 	ctx, cancel := context.WithCancel(ctx)

Or handle nil gracefully in postEventsFn without returning an error (just skip adding ranges with a debug log).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 60 - 81,
NewWatcher must validate that checkerController is non-nil to avoid postEventsFn
returning errors on every batch; update NewWatcher to check checkerController
and return a descriptive error if nil (referencing NewWatcher and
checkerController), or alternatively modify postEventsFn to treat a nil
checkerController as a no-op (log debug and skip adding ranges) and not return
an error so etcdutil.LoopWatcher doesn't repeatedly log failures—prefer
validating in NewWatcher to fail-fast.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 24b83b47-b99e-494a-8e3e-9dfc08469666

📥 Commits

Reviewing files that changed from the base of the PR and between 93b8c72 and 0e0c8d1.

📒 Files selected for processing (4)
  • pkg/mcs/scheduling/server/keyspace_meta/watcher.go
  • pkg/mcs/scheduling/server/rule/watcher.go
  • pkg/mcs/scheduling/server/server.go
  • tests/integrations/mcs/scheduling/meta_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/integrations/mcs/scheduling/meta_test.go
  • pkg/mcs/scheduling/server/server.go

Signed-off-by: tongjian <1045931706@qq.com>
Signed-off-by: tongjian <1045931706@qq.com>

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
pkg/mcs/scheduling/server/keyspace_meta/watcher.go (2)

89-91: Consider using DEBUG level for routine keyspace meta updates.

Logging every keyspace meta update at INFO level could be verbose in clusters with frequent keyspace changes. Consider downgrading to DEBUG, or add a rate limiter for INFO-level summary logs.

♻️ Suggested change
 putFn := func(kv *mvccpb.KeyValue) error {
-	log.Info("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value)))
+	log.Debug("update keyspace meta", zap.String("key", string(kv.Key)), zap.String("value", string(kv.Value)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 89 - 91, The
putFn handler currently logs every keyspace meta update at INFO using log.Info
in watcher.go; change that to a debug-level log (log.Debug) or implement an
INFO-level rate-limited summary instead (e.g., increment a counter and emit an
aggregated log periodically) so routine updates aren’t noisy; locate the putFn
and the log.Info(...) call around the call to rw.extractKeyspaceIDFromMetaKey
and replace or wrap it with the chosen lower-level logging or a simple
rate-limiter/aggregator tied to that handler.

124-131: Returning error when checkerController is nil may disrupt watcher operation.

While the checker controller should be non-nil at construction time (per server.go:538), returning an error from postEventsFn could cause the LoopWatcher to log errors or retry unexpectedly. Consider logging a warning and continuing gracefully, or asserting this invariant at construction time instead.

♻️ Suggested defensive handling
 postEventsFn := func([]*clientv3.Event) error {
 	if rw.checkerController == nil {
-		return errors.New("checkerController is nil, cannot add suspect key ranges")
+		log.Warn("checkerController is nil, skipping suspect key ranges")
+		return nil
 	}
 	for _, kr := range suspectKeyRanges.Ranges() {
 		rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey)
 	}
 	return nil
 }

Alternatively, validate checkerController != nil in NewWatcher and return an error there to fail fast.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/mcs/scheduling/server/keyspace_meta/watcher.go` around lines 124 - 131,
postEventsFn currently returns an error when rw.checkerController is nil which
can cause LoopWatcher to error/retry; change this to log a warning (use the
existing logger) and continue processing instead of returning an error, and
additionally enforce the invariant at construction by validating
checkerController != nil in NewWatcher and returning an error from NewWatcher if
nil so the watcher fails fast; reference postEventsFn and rw.checkerController
for the runtime change and NewWatcher for the constructor validation.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c069a1d4-b8b7-4d3c-a0e4-c427a32f922c

📥 Commits

Reviewing files that changed from the base of the PR and between 0e0c8d1 and f70e3c6.

📒 Files selected for processing (5)
  • pkg/keyspace/keyspace.go
  • pkg/keyspace/keyspace_test.go
  • pkg/keyspace/util_test.go
  • pkg/mcs/scheduling/server/cluster.go
  • pkg/mcs/scheduling/server/keyspace_meta/watcher.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/mcs/scheduling/server/cluster.go
  • pkg/keyspace/util_test.go


codecov bot commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 80.47809% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.93%. Comparing base (3eb99ae) to head (f70e3c6).
⚠️ Report is 17 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10506      +/-   ##
==========================================
+ Coverage   78.88%   78.93%   +0.05%     
==========================================
  Files         530      532       +2     
  Lines       71548    71889     +341     
==========================================
+ Hits        56439    56749     +310     
- Misses      11092    11093       +1     
- Partials     4017     4047      +30     
Flag Coverage Δ
unittests 78.93% <80.47%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.

@ystaticy
Contributor

/test pull-unit-test-next-gen-2

@bufferflies bufferflies requested a review from disksing March 31, 2026 10:09
@bufferflies
Contributor Author

/ping @ystaticy

@bufferflies bufferflies requested a review from lhy1024 April 1, 2026 02:45
Comment on lines +128 to +130
for _, kr := range suspectKeyRanges.Ranges() {
rw.checkerController.AddSuspectKeyRange(kr.StartKey, kr.EndKey)
}
Contributor


It is a TTL cache whose TTL is 3 minutes. Is there a scenario where we have so many keyspaces that it is impossible to process them all within three minutes?

Contributor Author


Good catch; the cache TTL should use GetWaitRegionSplitTimeout.

Contributor


That makes sense. Aligning the TTL with GetWaitRegionSplitTimeout should help reduce the gap.

I'm still worried that the suspect queue is still TTL-based / best-effort, so I would view this more as an incremental improvement than a strict guarantee that the range will definitely be split later.

Contributor


After discussing, for the suspect-queue part, we can remove the TTL behavior for the follow-up keyspace handling in a separate PR.

if manager == nil {
return []uint32{start}, true
}
return manager.cache.GetKeyspaceIDInRange(start, end, limit)
Contributor


That's a lazy loading mechanism, meaning we might not see historical keyspaces.

It could also be contaminated by failed saves.

Contributor


func TestReviewRestartedManagerMissesPersistedKeyspaceInRangeCache(t *testing.T) {
	re := require.New(t)
	ctx := context.Background()
	store := endpoint.NewStorageEndpoint(kv.NewMemoryKV(), nil)

	manager1 := NewKeyspaceManager(ctx, store, nil, mockid.NewIDAllocator(), &mockConfig{}, nil)
	meta := &keyspacepb.KeyspaceMeta{
		Id:             100,
		Name:           "restart-proof",
		State:          keyspacepb.KeyspaceState_ENABLED,
		CreatedAt:      time.Now().Unix(),
		StateChangedAt: time.Now().Unix(),
	}
	re.NoError(manager1.saveNewKeyspace(meta))

	manager2 := NewKeyspaceManager(ctx, store, nil, mockid.NewIDAllocator(), &mockConfig{}, nil)
	loaded, err := manager2.LoadRangeKeyspace(meta.Id, 1)
	re.NoError(err)
	re.Len(loaded, 1)
	re.Equal(meta.Id, loaded[0].GetId())

	ids, ok := manager2.GetKeyspaceIDInRange(meta.Id, meta.Id, 1)
	re.False(ok)
	re.Empty(ids)
}

func TestReviewFailedSaveLeavesPhantomKeyspaceInRangeCache(t *testing.T) {
	re := require.New(t)
	ctx := context.Background()
	baseStore := endpoint.NewStorageEndpoint(kv.NewMemoryKV(), nil)
	store := &failingSaveMetaStore{KeyspaceStorage: baseStore}
	manager := NewKeyspaceManager(ctx, store, nil, mockid.NewIDAllocator(), &mockConfig{}, nil)

	meta := &keyspacepb.KeyspaceMeta{
		Id:             200,
		Name:           "phantom-proof",
		State:          keyspacepb.KeyspaceState_ENABLED,
		CreatedAt:      time.Now().Unix(),
		StateChangedAt: time.Now().Unix(),
	}
	re.Error(manager.saveNewKeyspace(meta))

	var loaded []*keyspacepb.KeyspaceMeta
	err := baseStore.RunInTxn(ctx, func(txn kv.Txn) error {
		var err error
		loaded, err = baseStore.LoadRangeKeyspace(txn, meta.Id, 1)
		return err
	})
	re.NoError(err)
	re.Empty(loaded)

	ids, ok := manager.GetKeyspaceIDInRange(meta.Id, meta.Id, 1)
	re.True(ok)
	re.Equal([]uint32{meta.Id}, ids)
}

I added two tests for it.

Contributor Author


Yes, both the split and merge checkers first use KeyspaceExist to load the keyspace metadata, then call GetKeyspaceIDInRange later.
We must ensure the new keyspace can't be merged, and then split the new keyspace's region. The old keyspaces we can ignore.

Contributor


I still have a concern that, in the scheduling path of this PR, the check is still backed by the in-memory cache, so the result may still be affected by cases like restart without warmup or failed-save scenarios. Because of that, I am not fully comfortable treating this as “no wrong merge” yet.

Contributor


For example "pd transfer leader"?

Contributor


After discussing, for the cache issue, pd is okay with the scheduling server doing one full load. For PD leader transfer, the worst case seems to be a missed split, instead of an incorrect merge.

Contributor Author


The scheduling microservice must load all keyspaces from etcd after it campaigns for primary successfully.
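
That full load on campaign might look roughly like this; metaStore, warmUpOnCampaign, and the revision handling are illustrative assumptions, since the real code would go through clientv3 and PD's keyspace storage endpoint.

```go
package main

import "fmt"

// metaStore abstracts etcd for this sketch. The real code would use a
// ranged clientv3.Get over the keyspace meta prefix and then start the
// watcher from resp.Header.Revision+1 so no events are missed.
type metaStore interface {
	LoadAll() (metas map[uint32]string, revision int64)
}

type fakeStore struct{}

func (fakeStore) LoadAll() (map[uint32]string, int64) {
	return map[uint32]string{1: "ks-1", 2: "ks-2"}, 42
}

// warmUpOnCampaign performs the full load discussed above: populate the
// cache before the watcher starts, so range queries never miss
// historical keyspaces after a restart or primary transfer.
func warmUpOnCampaign(s metaStore, cache map[uint32]string) int64 {
	metas, rev := s.LoadAll()
	for id, m := range metas {
		cache[id] = m
	}
	return rev // the watcher should resume from rev+1
}

func main() {
	cache := map[uint32]string{}
	rev := warmUpOnCampaign(fakeStore{}, cache)
	fmt.Println(len(cache), rev) // 2 42
}
```

Loading first and watching from the load revision is the standard way to close the restart-without-warmup gap raised in this thread.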

Contributor

@lhy1024 lhy1024 left a comment


the rest LGTM


@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Apr 9, 2026
Contributor

ti-chi-bot bot commented Apr 9, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-09 10:10:10.494279219 +0000 UTC m=+1037415.699639276: ☑️ agreed by lhy1024.

Contributor

ti-chi-bot bot commented Apr 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lhy1024

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Apr 9, 2026
