
[TRTLLM-10939][feat] Enable block reuse with overlap scheduler #12816

Open
chienchunhung wants to merge 1 commit into NVIDIA:main from chienchunhung:trtllm-10939-block-reuse-v2

Conversation

Collaborator

@chienchunhung chienchunhung commented Apr 7, 2026

Summary by CodeRabbit

  • New Features

    • KV cache block reuse is now supported with the overlap scheduler in PyTorch, expanding efficient memory reuse scenarios.
  • Bug Fixes

    • Deterministic request termination behavior improved to avoid redundant cleanup paths and ensure boolean-based termination outcomes.
  • Tests

    • Test coverage expanded and test configs updated to exercise block-reuse scenarios and validate scheduler consistency and cache-hit behavior.

Description

Re-enable KV cache block reuse (prefix caching) when the overlap scheduler is active. Block reuse and the overlap scheduler were previously mutually exclusive due to an explicit guard in base_worker.py that rejected context-only requests in disaggregated serving when both features were enabled.

Problem

Block reuse (enable_block_reuse, default True) and the overlap scheduler (disable_overlap_scheduler=False, default) are both enabled by default, but their combination was explicitly blocked for disaggregated context-only requests.

Root cause

Removing the guard exposed a latent issue: both _handle_responses (early termination with pinned blocks) and _end_transfer_and_maybe_terminate (after KV transfer completes) call _terminate_request, which frees the request's resources, on the same request. Under the non-overlap scheduler this was benign because the transfer typically completed before _handle_responses ran, so only one path fired. Under the overlap scheduler, the deferred processing creates a window where the transfer is still in flight when _handle_responses terminates the request, causing the end-transfer path to terminate it again.
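The hazard can be shown with a toy model. This is not the TRT-LLM code, only the shape of the bug: two independent completion paths both try to free the same request's resources, which is safe only if at most one of them fires.

```python
class ResourceManager:
    """Toy stand-in for the executor's resource manager."""

    def __init__(self):
        self.freed = set()

    def free_resources(self, req_id):
        if req_id in self.freed:
            raise RuntimeError(f"double free of request {req_id}")
        self.freed.add(req_id)

rm = ResourceManager()

def handle_responses(req_id):
    # Early-termination path (blocks pinned for reuse).
    rm.free_resources(req_id)

def end_transfer_and_maybe_terminate(req_id):
    # KV-transfer completion path.
    rm.free_resources(req_id)

# Non-overlap scheduler: the transfer finishes first, so only one path fires.
end_transfer_and_maybe_terminate(1)

# Overlap scheduler: deferred processing lets both paths fire for request 2.
handle_responses(2)
try:
    end_transfer_and_maybe_terminate(2)
except RuntimeError as exc:
    print(exc)  # double free of request 2
```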

Fix

In _end_transfer_and_maybe_terminate, skip the redundant _terminate_request call when should_store_blocks is True, since the early-termination path in _handle_responses already handled it. This preserves the existing early-termination + pin/unpin mechanism while preventing the double-termination crash.
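A minimal sketch of the guard, using the names from this PR; the bodies and signatures are assumptions, and the real logic lives in PyExecutor._end_transfer_and_maybe_terminate.

```python
def end_transfer_and_maybe_terminate(request, should_store_blocks,
                                     terminate_request):
    """terminate_request stands in for PyExecutor._terminate_request."""
    if not should_store_blocks:
        terminate_request(request)
    # When should_store_blocks is True, _handle_responses already terminated
    # the request on its early-termination path, so terminating again here
    # would double-free its KV blocks.

terminated = []
end_transfer_and_maybe_terminate("req-1", should_store_blocks=True,
                                 terminate_request=terminated.append)
end_transfer_and_maybe_terminate("req-2", should_store_blocks=False,
                                 terminate_request=terminated.append)
print(terminated)  # ['req-2']
```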

Changes

tensorrt_llm/_torch/pyexecutor/py_executor.py

  • In _end_transfer_and_maybe_terminate, guard the non-fast-transfer _terminate_request call with if not should_store_blocks. When should_store_blocks is True, _handle_responses already terminated the request via the enable_partial_reuse_for_disagg early-termination branch.
  • Fix end_transfer to return False (instead of bare return) on KeyError, preventing unintended termination by the caller.
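The boolean contract can be sketched with a toy transfer manager. The real AsyncTransferManager tracks more state; this only illustrates why `return False` is preferable to a bare `return` on the KeyError path.

```python
class ToyTransferManager:
    """Illustrative only; not the TRT-LLM AsyncTransferManager."""

    def __init__(self):
        self._in_flight = {}  # request_id -> number of pending transfers

    def start_transfer(self, req_id):
        self._in_flight[req_id] = self._in_flight.get(req_id, 0) + 1

    def end_transfer(self, req_id):
        try:
            self._in_flight[req_id] -= 1
        except KeyError:
            # Unknown request: a bare `return` would yield None. None is
            # falsy, but returning False keeps the contract explicit and
            # prevents the caller from terminating a request it never tracked.
            return False
        if self._in_flight[req_id] == 0:
            del self._in_flight[req_id]
            return True  # last transfer done; caller may terminate
        return False

mgr = ToyTransferManager()
mgr.start_transfer(7)
mgr.start_transfer(7)
print(mgr.end_transfer(7))  # False: one transfer still pending
print(mgr.end_transfer(7))  # True: last transfer completed
print(mgr.end_transfer(7))  # False: unknown request (KeyError path)
```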

tensorrt_llm/executor/base_worker.py

  • Remove the ValueError guard that rejected context-only requests when overlap scheduler, block reuse, and KV cache transceiver were all active.
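For context, the removed guard had roughly this shape. This is a hypothetical reconstruction from the description above, not code copied from base_worker.py.

```python
def reject_context_only_combo(is_context_only, overlap_scheduler_enabled,
                              block_reuse_enabled, kv_cache_transceiver):
    # Reconstruction of the deleted check; after this PR the combination
    # is simply accepted (the whole function is gone).
    if (is_context_only and overlap_scheduler_enabled and block_reuse_enabled
            and kv_cache_transceiver):
        raise ValueError(
            "Context-only requests are not supported when the overlap "
            "scheduler, KV cache block reuse, and the KV cache transceiver "
            "are all enabled")

# Any single disabled feature let the request through:
reject_context_only_combo(True, True, True, False)
```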

tests/unittest/_torch/executor/test_overlap_scheduler.py

  • Add enable_block_reuse parameter to create_llm helper and to test_overlap_scheduler_consistency as a parametrized axis ([False, True]), so the existing consistency test now covers both block reuse configurations.
  • Add test_overlap_scheduler_block_reuse_cache_hit — sends the same prompt twice and verifies blocks are actually reused (cached_tokens > 0 on second pass).
  • Add strict=True to zip() call for length-mismatch safety.

tests/integration/defs/accuracy/test_disaggregated_serving.py

  • Remove pytest.skip for the overlap + block reuse combination on context servers.
  • Enable enable_block_reuse unconditionally in _test_chunked_prefill_helper (was gated on ctx_pp == 1; regular block reuse has no PP restriction).

tests/integration/defs/disaggregated/test_configs/disagg_config_overlap.yaml

  • Enable enable_block_reuse: true and enable_partial_reuse: true on both context and generation servers to exercise the full block reuse path in the disaggregated overlap test.

Test Coverage

Test                                                What it covers
test_overlap_scheduler_consistency[no_reuse-*]      Existing: overlap vs. non-overlap consistency (block reuse OFF)
test_overlap_scheduler_consistency[block_reuse-*]   New parameter: overlap vs. non-overlap consistency (block reuse ON)
test_overlap_scheduler_block_reuse_cache_hit        New: verifies blocks are actually reused (cached_tokens > 0)
test_auto_dtype (disaggregated)                     Unblocked: overlap + block reuse + context-only + disaggregated serving
disagg_config_overlap.yaml test                     Updated: now exercises block reuse in disaggregated overlap config

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@chienchunhung chienchunhung force-pushed the trtllm-10939-block-reuse-v2 branch 2 times, most recently from 2b7c7a5 to 27c188b Compare April 7, 2026 21:05
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42188 [ run ] triggered by Bot. Commit: 27c188b Link to invocation

@chienchunhung chienchunhung marked this pull request as ready for review April 7, 2026 21:38
@chienchunhung chienchunhung requested review from a team as code owners April 7, 2026 21:38
@coderabbitai
Contributor

coderabbitai bot commented Apr 7, 2026

📝 Walkthrough

Walkthrough

Adjusts PyExecutor termination flow to make end_transfer boolean-consistent and avoid redundant terminations when blocks are stored; removes a backend-only validation blocking block-reuse with overlap scheduling; and enables/parametrizes block-reuse in related integration and unit tests and configs.

Changes

Cohort / File(s) Summary
PyExecutor termination
tensorrt_llm/_torch/pyexecutor/py_executor.py
AsyncTransferManager.end_transfer(request) now returns False when request.py_request_id is absent; termination helper skips _terminate_request(request) when async_transfer_manager.should_store_blocks is true to avoid double resource-free paths.
Enqueue validation removal
tensorrt_llm/executor/base_worker.py
Removed backend-specific ValueError check in _enqueue_request() that previously blocked REQUEST_TYPE_CONTEXT_ONLY disaggregated requests when overlap scheduler + KV cache block reuse were in use.
Tests & configs enabling block reuse
tests/integration/defs/accuracy/test_disaggregated_serving.py, tests/integration/defs/disaggregated/test_configs/disagg_config_overlap.yaml, tests/unittest/_torch/executor/test_overlap_scheduler.py
Removed a conditional test skip so tests run when block reuse + overlap scheduler are enabled; made chunked prefill always enable kv_cache_config["enable_block_reuse"]; updated YAML to set enable_block_reuse: true and enable_partial_reuse: true; added enable_block_reuse parametrization and a cache-hit test in overlap scheduler unit tests (also updated create_llm signature to accept the new flag).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): coverage is 42.86%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): the title clearly and specifically identifies the main change, enabling block reuse with the overlap scheduler, matching the primary feature goal of the PR.
  • Description check (✅ Passed): the PR description comprehensively explains the problem, root cause, solution, and all code changes, with detailed test coverage mapping.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests/integration/defs/disaggregated/test_configs/disagg_config_overlap.yaml (1)

13-15: Make enable_partial_reuse explicit in this overlap config.

This config only exercises the disaggregated reuse path because enable_partial_reuse currently defaults to true. Making that explicit keeps the coverage stable if the default changes later.

📝 Suggested change
 context_servers:
   num_instances: 1
   max_batch_size: 1
   max_num_tokens: 3000
   max_seq_len: 4096
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   kv_cache_config:
     enable_block_reuse: true
+    enable_partial_reuse: true
     free_gpu_memory_fraction: 0.2
@@
 generation_servers:
   num_instances: 1
   tensor_parallel_size: 1
   pipeline_parallel_size: 1
   max_batch_size: 256
   max_num_tokens: 4096
   max_seq_len: 4096
   kv_cache_config:
     enable_block_reuse: true
+    enable_partial_reuse: true
     free_gpu_memory_fraction: 0.2

Also applies to: 25-27


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a1c7f708-6bee-4c8e-82be-00cfa11a8b12

📥 Commits

Reviewing files that changed from the base of the PR and between a1777fd and 27c188b.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/executor/base_worker.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/defs/disaggregated/test_configs/disagg_config_overlap.yaml
  • tests/unittest/_torch/executor/test_overlap_scheduler.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/executor/base_worker.py

Comment on lines +609 to +616

            if self.async_transfer_manager.end_transfer(request):
                self.active_requests.remove(request)
                self._terminate_request(request)
            self._terminated_request_ids.discard(request.py_request_id)
            return
        if self.async_transfer_manager.end_transfer(request):
            self._terminate_request(request)
        self._terminated_request_ids.discard(request.py_request_id)
Contributor

⚠️ Potential issue | 🔴 Critical

Don't drop the termination guard before the last transfer.

AsyncTransferManager can track more than one transfer for the same request. Clearing _terminated_request_ids even when end_transfer() returns False drops the guard after the first completion, so a request with both connector and transceiver transfers can hit _do_terminate_request() again on the final completion and double-free resources.

🐛 Proposed fix
             if response:
                 response.result.cached_tokens = request.cached_tokens
                 self._enqueue_responses([(request.py_request_id, response)])
             if self.async_transfer_manager.end_transfer(request):
                 self.active_requests.remove(request)
                 self._terminate_request(request)
-            self._terminated_request_ids.discard(request.py_request_id)
+                self._terminated_request_ids.discard(request.py_request_id)
             return
         if self.async_transfer_manager.end_transfer(request):
             self._terminate_request(request)
-        self._terminated_request_ids.discard(request.py_request_id)
+            self._terminated_request_ids.discard(request.py_request_id)

Comment on lines +3406 to +3413

        req_id = request.py_request_id
        if req_id in self._terminated_request_ids:
            return
        self._terminated_request_ids.add(req_id)
        self.resource_manager.free_resources(request)

        if self.gather_all_responses or self.dist.rank == 0:
            self.result_wait_queues.pop(req_id, None)
Contributor

⚠️ Potential issue | 🟠 Major

Bound _terminated_request_ids outside the transfer callback too.

Line 3409 records every terminated request, but the only visible cleanup is in _end_transfer_and_maybe_terminate(). Ordinary completions—and PP requests whose real _do_terminate_request() runs later via DisaggPPTerminationHandler—never remove their IDs, so this set grows for the lifetime of the executor and will turn termination into a no-op if request IDs are ever reused.

🧹 Proposed fix
     def _do_terminate_request(self, request: LlmRequest):
         req_id = request.py_request_id
         if req_id in self._terminated_request_ids:
             return
-        self._terminated_request_ids.add(req_id)
-        self.resource_manager.free_resources(request)
+        keep_guard = req_id in self.async_transfer_manager.requests_in_transfer()
+        self._terminated_request_ids.add(req_id)
+        try:
+            self.resource_manager.free_resources(request)
+        finally:
+            if not keep_guard:
+                self._terminated_request_ids.discard(req_id)
 
         if self.gather_all_responses or self.dist.rank == 0:
             self.result_wait_queues.pop(req_id, None)

Comment on lines +120 to +149

    prompts = test_case["prompts"]
    max_new_tokens = test_case["max_new_tokens"]
    temperature = test_case["temperature"]
    top_p = test_case["top_p"]
    stop_words = test_case["stop_words"]

    sampling_config = SamplingParams(max_tokens=max_new_tokens,
                                     stop=stop_words,
                                     temperature=temperature,
                                     top_p=top_p,
                                     n=1,
                                     use_beam_search=True)

    with create_llm(model_path,
                    disable_overlap_scheduler=False,
                    sampler_type=sampler_type,
                    enable_block_reuse=True) as llm:
        outputs_first = llm.generate(prompts,
                                     sampling_params=sampling_config,
                                     use_tqdm=True)
        for output in outputs_first:
            assert output.cached_tokens == 0, (
                "First pass should have no cached tokens (cold cache)")

        outputs_second = llm.generate(prompts,
                                      sampling_params=sampling_config,
                                      use_tqdm=True)
        for output in outputs_second:
            assert output.cached_tokens > 0, (
                "Second pass should reuse cached blocks")
Contributor

⚠️ Potential issue | 🟡 Minor

The cold-cache assertion is batch-order dependent.

With max_num_tokens=128, the first generate(prompts, ...) call can legitimately produce cache hits for later prompts in the same batch once earlier requests finish chunked prefill. That makes cached_tokens == 0 on every first-pass output flaky instead of a pure cold-cache check.

💡 Proposed fix
-    prompts = test_case["prompts"]
+    prompt = test_case["prompts"][0]
@@
-        outputs_first = llm.generate(prompts,
-                                     sampling_params=sampling_config,
-                                     use_tqdm=True)
-        for output in outputs_first:
-            assert output.cached_tokens == 0, (
-                "First pass should have no cached tokens (cold cache)")
+        output_first = llm.generate([prompt],
+                                    sampling_params=sampling_config,
+                                    use_tqdm=True)[0]
+        assert output_first.cached_tokens == 0, (
+            "First pass should have no cached tokens (cold cache)")
 
-        outputs_second = llm.generate(prompts,
-                                      sampling_params=sampling_config,
-                                      use_tqdm=True)
-        for output in outputs_second:
-            assert output.cached_tokens > 0, (
-                "Second pass should reuse cached blocks")
+        output_second = llm.generate([prompt],
+                                     sampling_params=sampling_config,
+                                     use_tqdm=True)[0]
+        assert output_second.cached_tokens > 0, (
+            "Second pass should reuse cached blocks")

Make _do_terminate_request idempotent to prevent double-termination
when both _handle_responses (early termination) and
_end_transfer_and_maybe_terminate fire on the same request under
the overlap scheduler.

- Add _terminated_request_ids tracking set to skip redundant
  free_resources calls
- Remove ValueError guard in base_worker.py that blocked
  context-only + overlap + block_reuse + disagg
- Remove pytest.skip for overlap + block_reuse in disagg test
- Add enable_block_reuse parameter to overlap scheduler tests
- Add cache-hit verification test
- Fix end_transfer bare return -> return False

Signed-off-by: Chien-Chun Hung <2679986+chienchunhung@users.noreply.github.com>
Made-with: Cursor
@chienchunhung chienchunhung force-pushed the trtllm-10939-block-reuse-v2 branch from 27c188b to 1b3b9d8 Compare April 8, 2026 17:45
@chienchunhung
Collaborator Author

/bot run --disable-fail-fast

@chienchunhung
Collaborator Author

@CodeRabbit review

@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@tensorrt-cicd
Collaborator

PR_Github #42374 [ run ] triggered by Bot. Commit: 1b3b9d8 Link to invocation

@chienchunhung chienchunhung requested a review from Tabrizian April 8, 2026 18:09