
cancel request and block new inputs when sleeping#4541

Open
grimoire wants to merge 1 commit into InternLM:main from grimoire:better-sleep-wake

Conversation

@grimoire
Collaborator

PR: Guard PyTorch Engine Sleep Against In-flight and New Requests

Summary

This PR fixes a race in PyTorch engine sleep: sleep() can release model and KV-cache resources while requests are still active, or while new EngineInstance inputs are still being accepted. After sleep, those requests can resume, or start inference, against released resources and break generation.

The fix is scoped to lmdeploy/pytorch/engine.

Problem

Before this change, PyTorch engine sleep only delegated to executor/model-agent sleep. Direct PyTorch engine instances could still enqueue new inference work around the sleep transition, and existing scheduler sessions could remain alive even though sleep may release KV cache.

This creates unsafe cases:

  • A new request arrives after sleep starts but before resources are restored.
  • An active request is still attached to scheduler state while KV cache is released.
  • A prefetched next batch from before sleep survives the drain point and may be used after wakeup.
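The race can be reproduced in miniature with a toy asyncio engine. Everything below (ToyEngine, its fields, the "CRASH"/"CANCEL" markers) is a hypothetical sketch for illustration, not the lmdeploy implementation; it only shows why releasing resources without first gating admission lets an in-flight request resume against freed state:

```python
import asyncio

# Toy reproduction of the race: "sleep" frees the KV cache while a request
# admitted around the sleep transition still expects it to exist.

class ToyEngine:
    def __init__(self):
        self.kv_cache = {"blocks": [0] * 4}  # stand-in for KV-cache memory
        self.admitting = True                # admission gate (what the fix adds)

    async def handle_request(self, req_id):
        if not self.admitting:
            return "CANCEL"                  # rejected instead of running unsafely
        await asyncio.sleep(0)               # yield point: sleep() may run here
        if self.kv_cache is None:
            return "CRASH"                   # resumed against released resources
        return "OK"

    async def sleep_unsafe(self):
        self.kv_cache = None                 # release with no gate or drain

    async def sleep_guarded(self):
        self.admitting = False               # 1. block new inputs first
        await asyncio.sleep(0)               # 2. stand-in for the drain/cancel step
        self.kv_cache = None                 # 3. only then release resources

async def demo(guarded: bool) -> str:
    eng = ToyEngine()
    task = asyncio.ensure_future(eng.handle_request(1))
    await (eng.sleep_guarded() if guarded else eng.sleep_unsafe())
    return await task
```

Running `demo(guarded=False)` lets the pending request resume after the cache is freed, while `demo(guarded=True)` cancels it at the gate before anything is released.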

Changes

  • Add a request-admission gate in the PyTorch request manager.

    • Blocks only ADD_SESSION and ADD_MESSAGE during explicit PyTorch engine sleep.
    • Keeps cleanup requests such as STOP_SESSION and END_SESSION enabled.
    • Rejects blocked requests immediately with ResponseType.CANCEL and wakes the sender.
    • Also rejects already-queued add requests if the gate is enabled before processing.
  • Make Engine.sleep() perform PyTorch-engine cleanup before resource release.

    • Blocks new inference inputs before awaiting anything.
    • Drains the engine loop to a safe scheduling boundary.
    • Cancels active request responses.
    • Ends all remaining scheduler sessions because sleep invalidates KV cache, including sessions that requested cache preservation.
    • Calls executor sleep only after the engine loop is drained and sessions are removed.
  • Make Engine.wakeup() re-enable inference only after all sleeping tags are restored.

    • Partial wakeup, such as weights-only, keeps inference blocked while KV cache is still sleeping.
    • Full wakeup unblocks ADD_SESSION and ADD_MESSAGE and resumes scheduling.
  • Add sleep-drain coordination in EngineLoop.

    • Engine.sleep() requests a drain and waits for the main loop to acknowledge a safe boundary.
    • The main loop drops prefetched work from before sleep because those batches are stale after sessions are ended and KV cache is released.
    • Resume uses scheduler-aware runnable-event logic rather than forcing runnable state.
  • Add logs and comments for the sleep lifecycle.

    • Logs request blocking/unblocking, blocked request rejection, sleep start, drain, cleanup counts, wakeup progress, and partial-wakeup blocking.
    • Comments document the event handshake and why all scheduler sessions are ended during sleep.
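The gate and the drain handshake described above can be condensed into one sketch. Class and member names here (RequestManager, Engine, sleeping_tags, engine_loop_step) follow the wording of this description, but they are assumptions, not the actual lmdeploy/pytorch sources:

```python
import asyncio
from enum import Enum, auto

class RequestType(Enum):
    ADD_SESSION = auto()
    ADD_MESSAGE = auto()
    STOP_SESSION = auto()
    END_SESSION = auto()

class ResponseType(Enum):
    SUCCESS = auto()
    CANCEL = auto()

BLOCKED_WHEN_SLEEPING = {RequestType.ADD_SESSION, RequestType.ADD_MESSAGE}

class RequestManager:
    """Admission gate: rejects new inference inputs while sleeping."""
    def __init__(self):
        self._blocked = False

    def block_new_inputs(self):
        self._blocked = True

    def unblock_new_inputs(self):
        self._blocked = False

    def admit(self, req_type: RequestType) -> ResponseType:
        # Cleanup requests (STOP_SESSION / END_SESSION) stay enabled so
        # sleep can still tear sessions down.
        if self._blocked and req_type in BLOCKED_WHEN_SLEEPING:
            return ResponseType.CANCEL
        return ResponseType.SUCCESS

class Engine:
    def __init__(self):
        self.req_manager = RequestManager()
        self._drain_requested = asyncio.Event()
        self._drained = asyncio.Event()
        self._prefetched_batch = None
        self.sleeping_tags = set()

    async def engine_loop_step(self):
        # Main-loop side of the handshake: at a safe scheduling boundary,
        # drop any stale prefetched batch and acknowledge the drain.
        if self._drain_requested.is_set():
            self._prefetched_batch = None
            self._drained.set()

    async def sleep(self):
        self.req_manager.block_new_inputs()  # block before awaiting anything
        self._drain_requested.set()          # ask the loop to drain
        await self._drained.wait()           # wait for the safe boundary
        # (real code: cancel active responses, end all scheduler sessions,
        # then call executor sleep)
        self.sleeping_tags = {"weights", "kv_cache"}

    async def wakeup(self, tags=("weights", "kv_cache")):
        self.sleeping_tags -= set(tags)
        # Partial wakeup (e.g. weights only) keeps inference blocked while
        # the KV cache is still sleeping; full wakeup re-admits inputs.
        if not self.sleeping_tags:
            self._drain_requested.clear()
            self._drained.clear()
            self.req_manager.unblock_new_inputs()

async def scenario():
    eng = Engine()
    sleep_task = asyncio.ensure_future(eng.sleep())
    await asyncio.sleep(0)                # sleep() blocks inputs, requests drain
    await eng.engine_loop_step()          # loop acknowledges the boundary
    await sleep_task
    blocked = eng.req_manager.admit(RequestType.ADD_MESSAGE)
    cleanup = eng.req_manager.admit(RequestType.END_SESSION)
    await eng.wakeup(tags=("weights",))   # partial wakeup: still blocked
    partial = eng.req_manager.admit(RequestType.ADD_MESSAGE)
    await eng.wakeup(tags=("kv_cache",))  # full wakeup: re-admitted
    restored = eng.req_manager.admit(RequestType.ADD_MESSAGE)
    return blocked, cleanup, partial, restored
```

The key ordering choice the sketch mirrors is that `sleep()` closes the gate synchronously, before its first `await`, so no new input can slip in between the decision to sleep and the drain.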

Tests

Focused unit tests were added/updated for:

  • Blocked ADD_SESSION and ADD_MESSAGE returning ResponseType.CANCEL immediately.
  • Cleanup requests staying allowed while add requests are blocked.
  • Already-queued add requests being cancelled after the gate is enabled.
  • Request admission being restored after unblock/wakeup.
  • Engine.sleep() blocking input before executor sleep.
  • Active responses being cancelled and scheduler sessions being ended before executor sleep.
  • Direct EngineInstance requests returning CANCEL while sleeping.
  • Partial wakeup keeping requests blocked until all sleeping tags are restored.
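The "already-queued" case from the list above can be exercised with a tiny stand-alone test. `GatedQueue` and its names are illustrative stand-ins, not the real request-manager API; the point is that the gate is re-checked at processing time, not only at submission time:

```python
from collections import deque
from enum import Enum, auto

class ResponseType(Enum):
    SUCCESS = auto()
    CANCEL = auto()

class GatedQueue:
    """Minimal stand-in for a gated request queue."""
    def __init__(self):
        self.queue = deque()
        self.blocked = False
        self.responses = {}

    def submit(self, req_id, req_type):
        if self.blocked and req_type in ("ADD_SESSION", "ADD_MESSAGE"):
            self.responses[req_id] = ResponseType.CANCEL  # rejected immediately
            return
        self.queue.append((req_id, req_type))

    def process(self):
        while self.queue:
            req_id, req_type = self.queue.popleft()
            # Re-check the gate here: requests queued before sleep started
            # are cancelled rather than executed against released resources.
            if self.blocked and req_type in ("ADD_SESSION", "ADD_MESSAGE"):
                self.responses[req_id] = ResponseType.CANCEL
            else:
                self.responses[req_id] = ResponseType.SUCCESS

def test_queued_adds_cancelled_after_gate():
    q = GatedQueue()
    q.submit(1, "ADD_MESSAGE")  # enqueued while still awake
    q.blocked = True            # sleep starts before it is processed
    q.submit(2, "ADD_SESSION")  # rejected immediately at submission
    q.submit(3, "END_SESSION")  # cleanup stays allowed
    q.process()
    assert q.responses[1] is ResponseType.CANCEL
    assert q.responses[2] is ResponseType.CANCEL
    assert q.responses[3] is ResponseType.SUCCESS
```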

A real Qwen3-8B corner-case smoke test was also run and passed. It covers:

  • Baseline request before sleep.
  • Sleep while a streaming request is active.
  • New request while sleeping.
  • Direct PyTorch EngineInstance request while sleeping.
  • Weights-only partial wakeup.
  • Full wakeup recovery.
  • A second sleep/wakeup cycle.

Notes / Scope

  • This PR intentionally does not change lmdeploy/serve, AsyncEngine, Turbomind, public response enums, or HTTP middleware behavior.
  • empty_init is not treated as PyTorch engine sleep. The sleep guard is driven only by explicit Engine.sleep() / Engine.wakeup() calls.
  • Migration/disaggregated serving sleep behavior is not covered by the smoke test in this PR. The tested runtime path is PyTorch Hybrid role with MP-engine wrapper.
