Skip to content

Add per-request prompt cache files to server#1283

Open
Quiet-Node-io wants to merge 1 commit into
ml-explore:mainfrom
Quiet-Node-io:ran418-server-prompt-cache-file
Open

Add per-request prompt cache files to server#1283
Quiet-Node-io wants to merge 1 commit into
ml-explore:mainfrom
Quiet-Node-io:ran418-server-prompt-cache-file

Conversation

@Quiet-Node-io
Copy link
Copy Markdown

Summary

Adds optional per-request disk prompt-cache support to the OpenAI-compatible server by accepting prompt_cache_file / prompt-cache-file in chat and text completion requests.

The server already supports an in-memory prompt cache and the CLI path already supports --prompt-cache-file; this change lets server callers use the same disk persistence primitive without switching away from /v1/chat/completions.

Behavior

  • prompt_cache_file loads a matching disk prompt cache into an isolated request-local LRUPromptCache before prefill.
  • The server saves a prompt cache file after the first generated token so future matching-prefix requests can reuse the prefix.
  • Saved disk caches are trimmed by one prompt token to avoid exact-cache-hit requests entering generation with an empty remaining prompt.
  • Cache-file requests are routed through the single-request path, not batching, until batch semantics are defined for per-request disk files.
  • disable_prompt_cache forces an isolated request-local cache and suppresses disk load/save even if a caller sends a cache file path.
  • Requests without these fields keep the existing non-cache server path.

Validation

  • python -m py_compile mlx_lm/server.py

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant