[https://nvbugs/6025177][fix] Fix KV cache issue (cherry-pick to release/1.3.0rc5.post2)#12819

Open
thorjohnsen wants to merge 3 commits into NVIDIA:release/1.3.0rc5.post2 from
thorjohnsen:user/tjohnsen/fix_cross_contamination_release_1.3.0rc5.post2

Conversation

@thorjohnsen
Collaborator

@thorjohnsen thorjohnsen commented Apr 7, 2026

Summary

  • Cherry-pick of commit b444973 into release/1.3.0rc5.post2

Test plan

  • Verify KV cache fix applies cleanly
  • Run KV cache manager unit tests: cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • Run cache transceiver multi-GPU tests: cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Improved KV cache reuse logic to accurately compute materialized token counts, ensuring proper context extent handling during chunked prefill and generation phases.
  • Tests

    • Added test utilities for KV cache manager validation.
    • Added new test cases validating materialized context extent behavior.
    • Expanded test coverage across KV cache reuse, block management, and cache transceivers.

@thorjohnsen thorjohnsen requested a review from a team as a code owner April 7, 2026 22:03
…ase/1.3.0rc5.post2)

Cherry-pick of NVIDIA#12673 onto release/1.3.0rc5.post2.

Fixes KV cache corruption caused by storing blocks with over-counted unique
token extent during chunked prefill. Introduces getUsableUniqueTokenCountForReuse()
and getMaterializedUniqueTokenCountForReuse() helpers to correctly cap the
number of tokens stored for reuse.
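
The capping logic the commit describes can be sketched in isolation. This is a simplified model, not the actual TensorRT-LLM code: the struct fields and function signatures are hypothetical stand-ins for the request/context state the real helpers in kvCacheManager.cpp consult.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model of the fix: during chunked prefill only the
// tokens up to the context current position are materialized in KV cache, so
// the number of unique tokens stored for reuse must be capped accordingly.
struct RequestState
{
    std::size_t uniqueTokenCount;       // all unique tokens seen so far
    std::size_t contextCurrentPosition; // tokens actually materialized in KV cache
    bool isGenerationInProgress;        // past the context (prefill) phase?
};

// Tokens that are actually backed by KV cache blocks.
std::size_t getMaterializedUniqueTokenCount(RequestState const& req)
{
    return req.isGenerationInProgress
        ? req.uniqueTokenCount
        : std::min(req.uniqueTokenCount, req.contextCurrentPosition);
}

// Tokens safe to store for reuse. The old logic hardcoded
// uniqueTokenCount - 1, which over-counts mid-chunked-prefill and lets
// unmaterialized tokens be stored, corrupting reused blocks.
std::size_t getUsableUniqueTokenCountForReuse(RequestState const& req)
{
    std::size_t const materialized = getMaterializedUniqueTokenCount(req);
    return materialized == 0 ? 0 : materialized - 1;
}
```

With 100 unique tokens but only 60 materialized mid-prefill, the usable count is 59 rather than the over-counted 99 the old `uniqueTokens.size() - 1` logic would report.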

Also moves simulatePrefillCompletion to a KvCacheManagerTestUtil test utility
class to avoid polluting the production interface.
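
The shape of that test utility can be sketched as below. `LlmRequestStub` is an illustrative stand-in for `batch_manager::LlmRequest` (the real class carries far more state); only the mirrored helper pattern reflects the actual `KvCacheManagerTestUtil`.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for batch_manager::LlmRequest; only the fields the
// utility touches are modeled here.
struct LlmRequestStub
{
    std::int32_t promptLen = 0;
    std::int32_t contextCurrentPosition = 0;

    std::int32_t getPromptLen() const { return promptLen; }
    void setContextCurrentPosition(std::int32_t pos) { contextCurrentPosition = pos; }
};

// Mirrors the pattern of KvCacheManagerTestUtil: a static helper that
// advances the context position to the prompt length, so unit tests can
// pretend prefill finished without running a forward pass. Note (per the
// review discussion below) that it does NOT change LlmRequestState or any
// other post-prefill bookkeeping.
class KvCacheManagerTestUtilSketch
{
public:
    static void simulatePrefillCompletion(LlmRequestStub& llmRequest)
    {
        llmRequest.setContextCurrentPosition(llmRequest.getPromptLen());
    }
};
```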

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

This PR refactors KV cache reuse token counting logic by introducing helper functions that compute accurate usable unique token counts based on context state, replacing hardcoded assumptions. It adds test utilities to simulate prefill completion for unit tests, exposes these utilities via nanobind Python bindings, and updates multiple unit tests to use the new simulation utility.

Changes

Cohort / File(s) Summary
Core KV Cache Manager Logic
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Added getMaterializedUniqueTokenCountForReuse and getUsableUniqueTokenCountForReuse helper functions to compute context-aware reusable token counts. Updated reuse token chopping logic throughout to use computed usable count instead of uniqueTokens.size() - 1.
Testing Utility (C++)
cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h
New utility header defining KvCacheManagerTestUtil class with static method simulatePrefillCompletion that sets context current position to prompt length for unit test simulation.
Nanobind Binding Infrastructure
cpp/tensorrt_llm/nanobind/CMakeLists.txt, cpp/tensorrt_llm/nanobind/bindings.cpp, cpp/tensorrt_llm/nanobind/testing/kvCacheManagerTestUtilBinding.*
Added new nanobind source file to build, extended module initialization to register initKvCacheTestUtilBindings, and created bindings module exposing simulate_prefill_completion_only_use_for_testing Python function.
C++ Unit Tests
cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp, cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp, cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
Integrated calls to simulatePrefillCompletion in existing tests. Added two new test cases (KVCacheManagerStoreContextBlocksUsesMaterializedContextExtent, KVCacheManagerReleaseBlocksUsesMaterializedContextExtent) validating materialized context extent behavior.
Python Unit Tests
tests/unittest/_torch/executor/test_resource_manager.py, tests/unittest/llmapi/test_llm_kv_cache_events.py
Imported and integrated simulate_prefill_completion_only_use_for_testing binding into test flows immediately before resource cleanup operations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 11.54%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed. The title clearly identifies the PR as a fix for a KV cache issue (NVBugs #6025177) and indicates it is a cherry-pick to a release branch, which matches the changeset's purpose.
  • Description check — ✅ Passed. The PR description includes a concise summary of the cherry-pick, a test plan with specific test files, and verification steps, but lacks the detailed explanation of the issue and solution recommended by the template.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)

5184-5189: ⚠️ Potential issue | 🟠 Major

Apply the simulated prefill completion to the stored requests, not a copy.

Line 5184 iterates llmRequests by value, so Line 5188 only updates a temporary. The vector returned from fillKvCacheManager(...) still contains requests with stale prefill state, and the later removeSequence(...) calls in FillKvCacheAndCompleteRequestsTest will keep exercising the wrong state.

🔧 Suggested fix
-    for (auto request : llmRequests)
+    for (auto& request : llmRequests)
     {
         kvCacheManager.addSequence(request.mRequestId, request.getPromptLen(), maxBeamWidth, request);
         request.mState = LlmRequestState::kGENERATION_IN_PROGRESS;
         tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(request);
         kvCacheManager.storeContextBlocks(request);
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 5184
- 5189, The loop over llmRequests is iterating by value so
simulatePrefillCompletion and the state change only affect a copy; change the
loop to iterate by reference (e.g., for (auto& request : llmRequests)) so
kvCacheManager.addSequence, setting request.mState =
LlmRequestState::kGENERATION_IN_PROGRESS,
tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(request),
and kvCacheManager.storeContextBlocks(request) modify the actual entries
returned by fillKvCacheManager(...) and later removeSequence(...) will see the
updated prefill state in the FillKvCacheAndCompleteRequestsTest.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h (1)

29-40: Narrow this helper's contract.

It only calls setContextCurrentPosition(getPromptLen()); LlmRequestState and the rest of the post-prefill bookkeeping stay unchanged. Either drive the full post-prefill transition here or rename/doc it as a materialization-only shim so tests do not assume stronger behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h` around lines 29 - 40, The
helper simulatePrefillCompletion only sets
llmRequest.setContextCurrentPosition(llmRequest.getPromptLen()) but leaves
LlmRequestState and other post-prefill bookkeeping unchanged; either make it
perform the full post-prefill transition (update LlmRequestState and any other
post-prefill flags/members) or narrow its contract by renaming and documenting
it as a materialization-only shim. Specifically, either implement the full
post-prefill steps on batch_manager::LlmRequest (matching what production
prefill completion does) inside simulatePrefillCompletion, or rename the
function (e.g., simulatePrefillMaterialization) and update its comment to state
it only calls setContextCurrentPosition/getPromptLen and does NOT modify
LlmRequestState or other bookkeeping, and update all tests to use the new
name/expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp`:
- Around line 403-406: The call to
tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(*)
inside the if (llmReq->getContextRemainingLength() == 0) branch is a no-op
because remaining==0 already implies llmReq->getContextCurrentPosition() ==
llmReq->getPromptLen(); to actually exercise the materialized-prefix path either
remove the simulatePrefillCompletion() call from this branch or change the test
setup/condition so simulatePrefillCompletion() runs when
contextRemainingLength() > 0 (or set llmReq so getContextCurrentPosition() <
getPromptLen() before calling simulatePrefillCompletion()), then call
kvCacheManager->storeContextBlocks(*llmReq) — update the condition or request
initialization accordingly to ensure simulatePrefillCompletion() meaningfully
mutates the request state.

In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Line 34: Update the NVIDIA copyright header in the modified file
kvCacheManagerTest.cpp to reflect 2026 (the latest meaningful modification
year); locate the file header at the top of kvCacheManagerTest.cpp (the test
file that includes "tensorrt_llm/testing/kvCacheManagerTestUtil.h") and change
the year range/end year to 2026 so the header is current.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0855f8a5-dac7-4461-8c53-cb3568dacef1

📥 Commits

Reviewing files that changed from the base of the PR and between fc1f38b and f30dbf2.

📒 Files selected for processing (11)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/nanobind/CMakeLists.txt
  • cpp/tensorrt_llm/nanobind/bindings.cpp
  • cpp/tensorrt_llm/nanobind/testing/kvCacheManagerTestUtilBinding.cpp
  • cpp/tensorrt_llm/nanobind/testing/kvCacheManagerTestUtilBinding.h
  • cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h
  • cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
  • cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
  • tests/unittest/_torch/executor/test_resource_manager.py
  • tests/unittest/llmapi/test_llm_kv_cache_events.py

@thorjohnsen thorjohnsen force-pushed the user/tjohnsen/fix_cross_contamination_release_1.3.0rc5.post2 branch from f30dbf2 to 20131a8 Compare April 7, 2026 22:13
@thorjohnsen
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42198 [ run ] triggered by Bot. Commit: 20131a8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42198 [ run ] completed with state SUCCESS. Commit: 20131a8
/LLM/release-1.3.0rc5.post2/L0_MergeRequest_PR pipeline #7 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@eopXD
Collaborator

eopXD commented Apr 8, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42322 [ run ] triggered by Bot. Commit: 20131a8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42322 [ run ] completed with state SUCCESS. Commit: 20131a8
/LLM/release-1.3.0rc5.post2/L0_MergeRequest_PR pipeline #10 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
@thorjohnsen thorjohnsen requested a review from a team as a code owner April 8, 2026 16:51
@thorjohnsen
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42365 [ run ] triggered by Bot. Commit: b9b7ab4 Link to invocation

@thorjohnsen
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #42389 [ kill ] triggered by Bot. Commit: 059e4c4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42389 [ kill ] completed with state SUCCESS. Commit: 059e4c4
Successfully killed previous jobs for commit 059e4c4

Link to invocation

@juney-nvidia
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42797 [ run ] triggered by Bot. Commit: 059e4c4 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42797 [ run ] completed with state SUCCESS. Commit: 059e4c4
/LLM/release-1.3.0rc5.post2/L0_MergeRequest_PR pipeline #18 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation
