Add per-request prompt cache files to server by Quiet-Node-io · Pull Request #1283 · ml-explore/mlx-lm

Quiet-Node-io · 2026-05-18T17:12:10Z

Summary

Adds optional per-request disk prompt-cache support to the OpenAI-compatible server by accepting prompt_cache_file / prompt-cache-file in chat and text completion requests.

The server already supports an in-memory prompt cache and the CLI path already supports --prompt-cache-file; this change lets server callers use the same disk persistence primitive without switching away from /v1/chat/completions.

Behavior

prompt_cache_file loads a matching disk prompt cache into an isolated request-local LRUPromptCache before prefill.
The server saves a prompt cache file after the first generated token so future matching-prefix requests can reuse the prefix.
Saved disk caches are trimmed by one prompt token to avoid exact-cache-hit requests entering generation with an empty remaining prompt.
Cache-file requests are routed through the single-request path, not batching, until batch semantics are defined for per-request disk files.
disable_prompt_cache forces an isolated request-local cache and suppresses disk load/save even if a caller sends a cache file path.
Requests without these fields keep the existing non-cache server path.

Validation

python -m py_compile mlx_lm/server.py

Add per-request prompt cache files to server

5532f39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add per-request prompt cache files to server#1283

Add per-request prompt cache files to server#1283
Quiet-Node-io wants to merge 1 commit into
ml-explore:mainfrom
Quiet-Node-io:ran418-server-prompt-cache-file

Quiet-Node-io commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Quiet-Node-io commented May 18, 2026

Summary

Behavior

Validation

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant