
cancel request and block new inputs when sleeping#4541

Open
grimoire wants to merge 1 commit into InternLM:main from grimoire:better-sleep-wake

Conversation

@grimoire
Collaborator

PR: Guard PyTorch Engine Sleep Against In-flight and New Requests

Summary

This PR fixes a race in PyTorch engine sleep: sleep() can release model and KV-cache resources while requests are still active, or while new EngineInstance inputs are still being accepted. After sleep, those requests can resume, or start inference, against released resources and break generation.

The fix is scoped to lmdeploy/pytorch/engine.

Problem

Before this change, PyTorch engine sleep only delegated to executor/model-agent sleep. Direct PyTorch engine instances could still enqueue new inference work around the sleep transition, and existing scheduler sessions could remain alive even though sleep may release KV cache.

This creates unsafe cases:

  • A new request arrives after sleep starts but before resources are restored.
  • An active request is still attached to scheduler state while KV cache is released.
  • A prefetched next batch from before sleep survives the drain point and may be used after wakeup.
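The race can be reproduced in miniature with a toy asyncio engine. Everything below (ToyEngine, its fields, the "CRASH"/"CANCEL" markers) is a hypothetical sketch for illustration, not the lmdeploy implementation; it only shows why releasing resources without first gating admission lets an in-flight request resume against freed state:

```python
import asyncio

# Toy reproduction of the race: "sleep" frees the KV cache while a request
# admitted around the sleep transition still expects it to exist.

class ToyEngine:
    def __init__(self):
        self.kv_cache = {"blocks": [0] * 4}  # stand-in for KV-cache memory
        self.admitting = True                # admission gate (what the fix adds)

    async def handle_request(self, req_id):
        if not self.admitting:
            return "CANCEL"                  # rejected instead of running unsafely
        await asyncio.sleep(0)               # yield point: sleep() may run here
        if self.kv_cache is None:
            return "CRASH"                   # resumed against released resources
        return "OK"

    async def sleep_unsafe(self):
        self.kv_cache = None                 # release with no gate or drain

    async def sleep_guarded(self):
        self.admitting = False               # 1. block new inputs first
        await asyncio.sleep(0)               # 2. stand-in for the drain/cancel step
        self.kv_cache = None                 # 3. only then release resources

async def demo(guarded: bool) -> str:
    eng = ToyEngine()
    task = asyncio.ensure_future(eng.handle_request(1))
    await (eng.sleep_guarded() if guarded else eng.sleep_unsafe())
    return await task
```

Running `demo(guarded=False)` lets the pending request resume after the cache is freed, while `demo(guarded=True)` cancels it at the gate before anything is released.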

Changes

  • Add a request-admission gate in the PyTorch request manager.

    • Blocks only ADD_SESSION and ADD_MESSAGE during explicit PyTorch engine sleep.
    • Keeps cleanup requests such as STOP_SESSION and END_SESSION enabled.
    • Rejects blocked requests immediately with ResponseType.CANCEL and wakes the sender.
    • Also rejects already-queued add requests if the gate is enabled before processing.
  • Make Engine.sleep() perform PyTorch-engine cleanup before resource release.

    • Blocks new inference inputs before awaiting anything.
    • Drains the engine loop to a safe scheduling boundary.
    • Cancels active request responses.
    • Ends all remaining scheduler sessions because sleep invalidates KV cache, including sessions that requested cache preservation.
    • Calls executor sleep only after the engine loop is drained and sessions are removed.
  • Make Engine.wakeup() re-enable inference only after all sleeping tags are restored.

    • Partial wakeup, such as weights-only, keeps inference blocked while KV cache is still sleeping.
    • Full wakeup unblocks ADD_SESSION and ADD_MESSAGE and resumes scheduling.
  • Add sleep-drain coordination in EngineLoop.

    • Engine.sleep() requests a drain and waits for the main loop to acknowledge a safe boundary.
    • The main loop drops prefetched work from before sleep because those batches are stale after sessions are ended and KV cache is released.
    • Resume uses scheduler-aware runnable-event logic rather than forcing runnable state.
  • Add logs and comments for the sleep lifecycle.

    • Logs request blocking/unblocking, blocked request rejection, sleep start, drain, cleanup counts, wakeup progress, and partial-wakeup blocking.
    • Comments document the event handshake and why all scheduler sessions are ended during sleep.
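The gate and the drain handshake described above can be condensed into one sketch. Class and member names here (RequestManager, Engine, sleeping_tags, engine_loop_step) follow the wording of this description, but they are assumptions, not the actual lmdeploy/pytorch sources:

```python
import asyncio
from enum import Enum, auto

class RequestType(Enum):
    ADD_SESSION = auto()
    ADD_MESSAGE = auto()
    STOP_SESSION = auto()
    END_SESSION = auto()

class ResponseType(Enum):
    SUCCESS = auto()
    CANCEL = auto()

BLOCKED_WHEN_SLEEPING = {RequestType.ADD_SESSION, RequestType.ADD_MESSAGE}

class RequestManager:
    """Admission gate: rejects new inference inputs while sleeping."""
    def __init__(self):
        self._blocked = False

    def block_new_inputs(self):
        self._blocked = True

    def unblock_new_inputs(self):
        self._blocked = False

    def admit(self, req_type: RequestType) -> ResponseType:
        # Cleanup requests (STOP_SESSION / END_SESSION) stay enabled so
        # sleep can still tear sessions down.
        if self._blocked and req_type in BLOCKED_WHEN_SLEEPING:
            return ResponseType.CANCEL
        return ResponseType.SUCCESS

class Engine:
    def __init__(self):
        self.req_manager = RequestManager()
        self._drain_requested = asyncio.Event()
        self._drained = asyncio.Event()
        self._prefetched_batch = None
        self.sleeping_tags = set()

    async def engine_loop_step(self):
        # Main-loop side of the handshake: at a safe scheduling boundary,
        # drop any stale prefetched batch and acknowledge the drain.
        if self._drain_requested.is_set():
            self._prefetched_batch = None
            self._drained.set()

    async def sleep(self):
        self.req_manager.block_new_inputs()  # block before awaiting anything
        self._drain_requested.set()          # ask the loop to drain
        await self._drained.wait()           # wait for the safe boundary
        # (real code: cancel active responses, end all scheduler sessions,
        # then call executor sleep)
        self.sleeping_tags = {"weights", "kv_cache"}

    async def wakeup(self, tags=("weights", "kv_cache")):
        self.sleeping_tags -= set(tags)
        # Partial wakeup (e.g. weights only) keeps inference blocked while
        # the KV cache is still sleeping; full wakeup re-admits inputs.
        if not self.sleeping_tags:
            self._drain_requested.clear()
            self._drained.clear()
            self.req_manager.unblock_new_inputs()

async def scenario():
    eng = Engine()
    sleep_task = asyncio.ensure_future(eng.sleep())
    await asyncio.sleep(0)                # sleep() blocks inputs, requests drain
    await eng.engine_loop_step()          # loop acknowledges the boundary
    await sleep_task
    blocked = eng.req_manager.admit(RequestType.ADD_MESSAGE)
    cleanup = eng.req_manager.admit(RequestType.END_SESSION)
    await eng.wakeup(tags=("weights",))   # partial wakeup: still blocked
    partial = eng.req_manager.admit(RequestType.ADD_MESSAGE)
    await eng.wakeup(tags=("kv_cache",))  # full wakeup: re-admitted
    restored = eng.req_manager.admit(RequestType.ADD_MESSAGE)
    return blocked, cleanup, partial, restored
```

The key ordering choice the sketch mirrors is that `sleep()` closes the gate synchronously, before its first `await`, so no new input can slip in between the decision to sleep and the drain.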

Tests

Focused unit tests were added/updated for:

  • Blocked ADD_SESSION and ADD_MESSAGE returning ResponseType.CANCEL immediately.
  • Cleanup requests staying allowed while add requests are blocked.
  • Already-queued add requests being cancelled after the gate is enabled.
  • Request admission being restored after unblock/wakeup.
  • Engine.sleep() blocking input before executor sleep.
  • Active responses being cancelled and scheduler sessions being ended before executor sleep.
  • Direct EngineInstance requests returning CANCEL while sleeping.
  • Partial wakeup keeping requests blocked until all sleeping tags are restored.
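The "already-queued" case from the list above can be exercised with a tiny stand-alone test. `GatedQueue` and its names are illustrative stand-ins, not the real request-manager API; the point is that the gate is re-checked at processing time, not only at submission time:

```python
from collections import deque
from enum import Enum, auto

class ResponseType(Enum):
    SUCCESS = auto()
    CANCEL = auto()

class GatedQueue:
    """Minimal stand-in for a gated request queue."""
    def __init__(self):
        self.queue = deque()
        self.blocked = False
        self.responses = {}

    def submit(self, req_id, req_type):
        if self.blocked and req_type in ("ADD_SESSION", "ADD_MESSAGE"):
            self.responses[req_id] = ResponseType.CANCEL  # rejected immediately
            return
        self.queue.append((req_id, req_type))

    def process(self):
        while self.queue:
            req_id, req_type = self.queue.popleft()
            # Re-check the gate here: requests queued before sleep started
            # are cancelled rather than executed against released resources.
            if self.blocked and req_type in ("ADD_SESSION", "ADD_MESSAGE"):
                self.responses[req_id] = ResponseType.CANCEL
            else:
                self.responses[req_id] = ResponseType.SUCCESS

def test_queued_adds_cancelled_after_gate():
    q = GatedQueue()
    q.submit(1, "ADD_MESSAGE")  # enqueued while still awake
    q.blocked = True            # sleep starts before it is processed
    q.submit(2, "ADD_SESSION")  # rejected immediately at submission
    q.submit(3, "END_SESSION")  # cleanup stays allowed
    q.process()
    assert q.responses[1] is ResponseType.CANCEL
    assert q.responses[2] is ResponseType.CANCEL
    assert q.responses[3] is ResponseType.SUCCESS
```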

A real Qwen3-8B corner-case smoke test was also run and passed. It covers:

  • Baseline request before sleep.
  • Sleep while a streaming request is active.
  • New request while sleeping.
  • Direct PyTorch EngineInstance request while sleeping.
  • Weights-only partial wakeup.
  • Full wakeup recovery.
  • A second sleep/wakeup cycle.

Notes / Scope

  • This PR intentionally does not change lmdeploy/serve, AsyncEngine, Turbomind, public response enums, or HTTP middleware behavior.
  • empty_init is not treated as PyTorch engine sleep. The sleep guard is driven only by explicit Engine.sleep() / Engine.wakeup() calls.
  • Migration/disaggregated serving sleep behavior is not covered by the smoke test in this PR. The tested runtime path is PyTorch Hybrid role with MP-engine wrapper.
