Skip to content

[https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg sever#12803

Open
yingguo-trt wants to merge 2 commits intoNVIDIA:mainfrom
yingguo-trt:rcca_6018647
Open

[https://nvbugs/6018647][test] Add unit test for Lifecycle Race Condition error in disagg sever#12803
yingguo-trt wants to merge 2 commits intoNVIDIA:mainfrom
yingguo-trt:rcca_6018647

Conversation

@yingguo-trt
Copy link
Copy Markdown
Collaborator

@yingguo-trt yingguo-trt commented Apr 7, 2026

Add regression test StoreBlocksForReuseWithPinDoesNotCreateGhostFreeBlocks to verify that storeBlocksForReuse(pin=true) correctly calls claimBlock() on zero-ref blocks before incRefCount(), preventing duplicate free queue entries that caused hangs and crashes in disaggregated serving with block reuse enabled.

Summary by CodeRabbit

  • Tests
    • Added regression test for KV-cache block lifecycle management to validate correct behavior when enabling block reuse with pinned blocks, ensuring free-block counts remain accurate and preventing resource leaks.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…m lifecycle

Add regression test StoreBlocksForReuseWithPinDoesNotCreateGhostFreeBlocks
to verify that storeBlocksForReuse(pin=true) correctly calls claimBlock()
on zero-ref blocks before incRefCount(), preventing duplicate free queue
entries that caused hangs and crashes in disaggregated serving with block
reuse enabled.

Signed-off-by: yingguo-trt <244492186+yingguo-trt@users.noreply.github.com>
@yingguo-trt yingguo-trt requested a review from LarryXFly April 7, 2026 09:32
@yingguo-trt yingguo-trt requested a review from a team as a code owner April 7, 2026 09:32
@yingguo-trt
Copy link
Copy Markdown
Collaborator Author

/bot run

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 7, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c37dcf4a-8a59-4882-afef-c68b3bba36af

📥 Commits

Reviewing files that changed from the base of the PR and between 63cb1f9 and 9c96699.

📒 Files selected for processing (1)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp

📝 Walkthrough

Walkthrough

Added a regression test case that validates KV-cache block lifecycle behavior when block reuse and pinning are enabled together, ensuring no "ghost free" blocks inflate free-block counts during storeBlocksForReuse operations.

Changes

Cohort / File(s) Summary
KV-Cache Block Management Test
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
Added regression test StoreBlocksForReuseWithPinDoesNotCreateGhostFreeBlocks that validates block lifecycle during reuse with pinning, checking that free-block counts remain consistent.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The description explains the test's purpose and what it validates, but required template sections (Description, Test Coverage, and PR Checklist details) are largely unfilled or left as comments. Fill in the Description section with details about the lifecycle race condition bug, expand Test Coverage to describe which test safeguards the changes, and complete the PR Checklist items appropriately.
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding a regression unit test for a lifecycle race condition in disaggregated serving with block reuse.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42122 [ run ] triggered by Bot. Commit: 9c96699 Link to invocation

@yingguo-trt yingguo-trt closed this Apr 7, 2026
@yingguo-trt yingguo-trt reopened this Apr 7, 2026
@yingguo-trt
Copy link
Copy Markdown
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42134 [ run ] triggered by Bot. Commit: 9c96699 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42134 [ run ] completed with state SUCCESS. Commit: 9c96699
/LLM/main/L0_MergeRequest_PR pipeline #32969 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yingguo-trt
Copy link
Copy Markdown
Collaborator Author

/bot run --skip-test

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42226 [ run ] triggered by Bot. Commit: 9c96699 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42226 [ run ] completed with state SUCCESS. Commit: 9c96699
/LLM/main/L0_MergeRequest_PR pipeline #33039 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@yingguo-trt
Copy link
Copy Markdown
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42244 [ reuse-pipeline ] triggered by Bot. Commit: 9c96699 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #42244 [ reuse-pipeline ] completed with state SUCCESS. Commit: 9c96699
Reusing PR_Github #42226 (Partly Tested) for commit 9c96699

Link to invocation

@yingguo-trt yingguo-trt enabled auto-merge (squash) April 8, 2026 03:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants