
[None][fix] Add bounded timeout to gen-side KV transfer in C++ CacheTransceiver#12820

Draft
yifjiang wants to merge 2 commits into NVIDIA:main from yifjiang:fix/cpp-gen-transfer-timeout-v2


@yifjiang yifjiang commented Apr 7, 2026

Summary

  • Replace unbounded future.get() in checkGenTransferStatus with future.wait_for() that branches on ready/timeout/error status
  • Fix the non-blocking path, which could still block indefinitely when kv_transfer_sender_future_timeout_ms is unset (std::nullopt); the decision to block now uses the blockAll flag instead of !receiverFutureTimeoutMs.has_value()
  • Guard the updateKVCacheTransferBW timing collective so it only runs when all ranks block together (blockAll) or the request was confirmed ready on every rank in the initial poll, preventing allgather hangs when a peer timed out and skipped the request
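
The second and third points can be pictured with a small sketch. It is not the code from this PR: `WaitPolicy`, `chooseWaitPolicy`, and `shouldRunTimingCollective` are hypothetical names, the per-rank readiness vector stands in for whatever the ranks exchange in the initial poll, and the behavior for `blockAll` with no configured timeout (wait without a deadline) is an assumption.

```cpp
#include <algorithm>
#include <chrono>
#include <optional>
#include <vector>

// Hypothetical wait policy for one poll of a receive future (not a TensorRT-LLM type).
struct WaitPolicy
{
    bool blockUntilReady;             // true: wait without a deadline
    std::chrono::milliseconds budget; // wait budget when blockUntilReady is false
};

// Decide how long to wait from blockAll, not from whether the timeout option is set.
inline WaitPolicy chooseWaitPolicy(bool blockAll, std::optional<int> const& timeoutMs)
{
    if (!blockAll)
    {
        // Non-blocking path: a zero-length poll, even when the timeout is std::nullopt.
        return {false, std::chrono::milliseconds(0)};
    }
    if (timeoutMs.has_value())
    {
        // Blocking path with a configured bound.
        return {false, std::chrono::milliseconds(*timeoutMs)};
    }
    // Assumption: blockAll with no configured timeout waits until the transfer finishes.
    return {true, std::chrono::milliseconds(0)};
}

// Run the timing collective only if every rank is guaranteed to take part in it.
inline bool shouldRunTimingCollective(bool blockAll, std::vector<bool> const& readyInInitialPoll)
{
    bool const readyEverywhere = !readyInInitialPoll.empty()
        && std::all_of(readyInInitialPoll.begin(), readyInInitialPoll.end(), [](bool ready) { return ready; });
    return blockAll || readyEverywhere;
}
```

With this shape, a rank that timed out on a request never enters a collective that its peers also skip, which is the hang the third point describes.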

Supersedes #12476 and addresses the CodeRabbit review feedback from that PR.

Test plan

  • Existing disaggregated serving unit tests pass
  • Manual test with kv_transfer_sender_future_timeout_ms set and unset
  • Multi-rank test with TRTLLM_KVCACHE_TIME_OUTPUT_PATH enabled to verify no allgather hang on timeout

…ransceiver

Replace the unbounded future.get() in checkGenTransferStatus with
future.wait_for() that branches on ready/timeout/error status.
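
A minimal sketch of that ready/timeout/error branching, using only the standard library. `PollResult` and `pollTransfer` are hypothetical names for illustration and do not come from the CacheTransceiver sources.

```cpp
#include <chrono>
#include <future>
#include <stdexcept>

// Hypothetical outcome type for this sketch; not a TensorRT-LLM type.
enum class PollResult
{
    kReady,
    kTimedOut,
    kFailed
};

// Wait on a transfer future for at most `timeout` instead of calling get() unconditionally.
inline PollResult pollTransfer(std::future<void>& future, std::chrono::milliseconds timeout)
{
    // wait_for returns ready, timeout, or deferred. Only ready guarantees that get()
    // returns immediately; this sketch reports a deferred future as not ready too.
    if (future.wait_for(timeout) != std::future_status::ready)
    {
        return PollResult::kTimedOut;
    }
    try
    {
        future.get(); // does not block here; rethrows an exception stored in the shared state
        return PollResult::kReady;
    }
    catch (std::exception const&)
    {
        return PollResult::kFailed;
    }
}
```

A caller can keep a timed-out request pending and retry it on the next poll, while the failed branch lets it mark the request as errored.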

Key improvements over the initial approach:
- Use `blockAll` flag (not `!receiverFutureTimeoutMs.has_value()`) to decide
  whether to block, preventing indefinite hangs when the timeout config is
  unset (std::nullopt).
- Guard the updateKVCacheTransferBW timing collective so it only runs when
  either all ranks block together (blockAll) or the request was confirmed
  ready on every rank in the initial poll, avoiding allgather hangs when a
  peer timed out and skipped the request.
- Erase completed/errored futures inside each branch to avoid dangling
  iterator access after exceptions.
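
The last point, erasing inside each branch, can be sketched as below. The container type, `SweepResult`, and `sweepFutures` are assumptions made for illustration; the real code may store its futures differently.

```cpp
#include <chrono>
#include <cstdint>
#include <future>
#include <map>
#include <stdexcept>
#include <vector>

// Hypothetical summary of one sweep over the pending transfers.
struct SweepResult
{
    std::vector<std::uint64_t> completed;
    std::vector<std::uint64_t> failed;
};

// Poll every pending future once; erase entries in the success and error branches
// and always continue from the iterator that erase() returns.
inline SweepResult sweepFutures(
    std::map<std::uint64_t, std::future<void>>& pending, std::chrono::milliseconds timeout)
{
    SweepResult result;
    for (auto it = pending.begin(); it != pending.end();)
    {
        if (it->second.wait_for(timeout) != std::future_status::ready)
        {
            ++it; // still in flight: keep the entry for the next sweep
            continue;
        }
        auto const requestId = it->first; // copy the key before the entry is erased
        try
        {
            it->second.get();
            result.completed.push_back(requestId);
            it = pending.erase(it); // success branch erases its own entry
        }
        catch (std::exception const&)
        {
            result.failed.push_back(requestId);
            it = pending.erase(it); // error branch erases too, so `it` never dangles
        }
    }
    return result;
}
```

Because each branch takes the iterator returned by `erase`, the loop never touches an entry that was already removed, whether or not `get()` threw.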

Signed-off-by: Yifan Jiang <yifanj@nvidia.com>
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
@svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Apr 8, 2026
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
