Add TurboQuant KV cache compression (arXiv 2504.19874) by joelnishanth · Pull Request #405 · ml-explore/mlx-swift

joelnishanth · 2026-05-14T03:53:57Z

Summary

Adds opt-in TurboQuant KV cache compression for long-context LLM inference on Apple Silicon. TurboQuant (Zandieh et al., arXiv 2504.19874) applies 3-bit PolarQuant + 1-bit QJL residual correction to compress the KV cache, achieving ~5x memory reduction for cached tokens with minimal quality impact.
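To make the 1-bit QJL residual step concrete, here is a minimal sketch, not the code in this PR. It assumes the common QJL formulation: project the residual left by the coarse quantizer with a Gaussian matrix, keep only the signs plus the residual norm, and estimate inner products against an unquantized query. All names (`gaussian_matrix`, `QjlCode`, `qjl_encode`, `qjl_inner_product`) and the free projection count `m` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// m x d Gaussian projection, row-major. Seeded only so the sketch is reproducible.
inline std::vector<float> gaussian_matrix(std::size_t m, std::size_t d, unsigned seed) {
    std::mt19937 rng(seed);
    std::normal_distribution<float> normal(0.0f, 1.0f);
    std::vector<float> s(m * d);
    for (float& v : s) v = normal(rng);
    return s;
}

inline float dot(const float* a, const float* b, std::size_t d) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < d; ++i) acc += a[i] * b[i];
    return acc;
}

struct QjlCode {
    std::vector<bool> signs;  // one bit per projection row
    float norm;               // L2 norm of the residual
};

// Encode residual r (length d): keep only the sign of each projection.
inline QjlCode qjl_encode(const std::vector<float>& s, const std::vector<float>& r,
                          std::size_t m) {
    const std::size_t d = r.size();
    assert(s.size() == m * d);
    QjlCode code{std::vector<bool>(m), std::sqrt(dot(r.data(), r.data(), d))};
    for (std::size_t i = 0; i < m; ++i)
        code.signs[i] = dot(&s[i * d], r.data(), d) >= 0.0f;
    return code;
}

// Estimate <q, r> as sqrt(pi/2) * ||r|| / m * sum_i (s_i . q) * sign_i.
inline float qjl_inner_product(const std::vector<float>& s, const std::vector<float>& q,
                               const QjlCode& code) {
    const std::size_t d = q.size();
    const std::size_t m = code.signs.size();
    float acc = 0.0f;
    for (std::size_t i = 0; i < m; ++i) {
        const float proj = dot(&s[i * d], q.data(), d);
        acc += code.signs[i] ? proj : -proj;
    }
    return std::sqrt(3.14159265f / 2.0f) * code.norm * acc / static_cast<float>(m);
}
```

In a two-stage scheme, the attention score would be approximated as the inner product with the coarsely quantized key plus this residual estimate. The estimate gets noisier as `m` shrinks.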

- New C++ primitives in Source/Cmlx/turbo-quant/ (outside mlx/mlx-c submodules):
  - turbo_quant.h: Header-only Lloyd-Max codebook, Walsh-Hadamard rotation, QJL projection
  - turbo_quant_ops.cpp: Encode/decode in mlx::core::fast namespace
  - turbo_quant_bridge.cpp: C bridge for Swift FFI
- Swift bindings in MLXFast.swift:
  - turboQuantEncode(keys:values:bits:): compress K/V tensors
  - turboDecodeK(packed:) / turboDecodeV(packed:): decompress
- Package.swift updated with header search path for new directory
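For readers unfamiliar with the Lloyd-Max codebook named in the list above, the following is a rough sketch of the classic iteration on empirical samples. It is not the contents of turbo_quant.h, and `nearest_level` and `lloyd_max` are hypothetical names. A real implementation may instead precompute the codebook for a known distribution.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Index of the codebook level closest to v (codebook must be non-empty).
inline std::size_t nearest_level(const std::vector<float>& codebook, float v) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < codebook.size(); ++i)
        if (std::fabs(v - codebook[i]) < std::fabs(v - codebook[best])) best = i;
    return best;
}

// Lloyd-Max iteration: alternately assign each sample to its nearest level,
// then move each level to the mean of the samples assigned to it.
inline std::vector<float> lloyd_max(const std::vector<float>& samples,
                                    int bits, int iterations) {
    assert(!samples.empty() && bits > 0);
    const std::size_t levels = std::size_t{1} << bits;
    const auto range = std::minmax_element(samples.begin(), samples.end());
    const float lo = *range.first;
    const float hi = *range.second;

    // Start from levels spread uniformly over the sample range.
    std::vector<float> codebook(levels);
    for (std::size_t i = 0; i < levels; ++i)
        codebook[i] = lo + (hi - lo) * (static_cast<float>(i) + 0.5f)
                               / static_cast<float>(levels);

    for (int it = 0; it < iterations; ++it) {
        std::vector<double> sum(levels, 0.0);
        std::vector<std::size_t> count(levels, 0);
        for (float v : samples) {
            const std::size_t k = nearest_level(codebook, v);
            sum[k] += v;
            ++count[k];
        }
        // Levels that received no samples keep their previous value.
        for (std::size_t k = 0; k < levels; ++k)
            if (count[k] > 0) codebook[k] = static_cast<float>(sum[k] / count[k]);
    }
    return codebook;
}
```

With `bits = 3` this yields 8 levels. Encoding a value then amounts to storing the 3-bit index returned by `nearest_level`.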

All TurboQuant code lives in a dedicated turbo-quant/ directory outside the mlx and mlx-c submodules, keeping the dependency boundary clean.
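The Walsh-Hadamard rotation listed among the header-only primitives can be illustrated with a standard in-place fast transform. This is a generic sketch under the assumption that the vector length is a power of two, not the PR's implementation, and `walsh_hadamard` is a hypothetical name. Randomized variants typically multiply by random signs first, which is omitted here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// In-place fast Walsh-Hadamard transform in O(n log n), scaled by 1/sqrt(n)
// so the transform is orthonormal and therefore its own inverse.
inline void walsh_hadamard(std::vector<float>& x) {
    const std::size_t n = x.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // length must be a power of two
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j] = a + b;      // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) v *= scale;
}
```

Because the scaled transform is orthonormal, it preserves norms and inner products, and applying it a second time recovers the original vector. Spreading energy evenly across coordinates is what makes a single scalar codebook a reasonable fit for every coordinate.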

Expected Performance

Based on the TurboQuant paper (Table 1):

| Metric | fp16 | TurboQuant (3-bit) | Delta |
| --- | --- | --- | --- |
| KV Cache Size (8K ctx) | 100% | ~19% | -81% |
| Throughput (tok/s) | baseline | ~0.95-1.0x | minimal |
| Quality (perplexity) | baseline | +0.1-0.3 ppl | negligible |
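The cache-size row can be sanity-checked with simple arithmetic. The sketch below uses a hypothetical model shape (32 layers, 8 KV heads, head dimension 128) that does not come from the PR or the paper, and it ignores per-vector metadata such as norms or scales.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to cache keys and values for `tokens` positions.
// bits_per_element is the average storage cost per scalar, ignoring
// per-vector metadata such as norms or scales.
inline std::uint64_t kv_cache_bytes(std::uint64_t layers, std::uint64_t kv_heads,
                                    std::uint64_t head_dim, std::uint64_t tokens,
                                    std::uint64_t bits_per_element) {
    assert(bits_per_element > 0);
    // Factor of 2 accounts for storing both K and V.
    const std::uint64_t elements = 2 * layers * kv_heads * head_dim * tokens;
    return elements * bits_per_element / 8;
}

// Compressed size as a fraction of the baseline size.
inline double compression_ratio(std::uint64_t compressed_bits,
                                std::uint64_t baseline_bits) {
    return static_cast<double>(compressed_bits) / static_cast<double>(baseline_bits);
}
```

For the assumed shape at 8192 tokens, fp16 needs 1 GiB, while 3 bits per element needs 192 MiB, a ratio of 3/16 = 18.75%. That matches the ~19% figure. A layout that spends 4 bits per element would land at 25%, and metadata adds a little on top.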

Companion PRs

mlx-swift-lm: KVCache integration with hot-window eviction + ChatSession toggle (PR forthcoming)
Benchmarks: joelnishanth/mlx-swift-turboquant

Test plan

Build succeeds on macOS 14+ (Xcode)
No changes to existing public APIs
All new code is behind opt-in API calls
Verified in downstream project (Clipper) with model inference

References

Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", arXiv 2504.19874 (2025)

— Joel Nishanth · offlyn.AI

Made with Cursor

Adds opt-in TurboQuant KV cache compression for long-context LLM inference. 3-bit PolarQuant + 1-bit QJL residual correction for keys.

New files in Source/Cmlx/turbo-quant/:
- turbo_quant.h: Header-only C++ Lloyd-Max codebook, WHT, QJL
- turbo_quant_ops.cpp: Encode/decode in mlx::core::fast
- turbo_quant_bridge.cpp: C bridge for Swift FFI
- turbo_quant_decl.h: Forward declarations

Modified:
- Package.swift: turbo-quant header search path
- include/mlx/c/fast.h: C bridge declarations
- MLXFast.swift: Swift bindings

Reference: Zandieh et al., arXiv 2504.19874 (2025)
Benchmarks: https://github.com/joelnishanth/mlx-swift-turboquant

Co-authored-by: Cursor <cursoragent@cursor.com>

davidkoski · 2026-05-18T15:53:10Z

You shouldn't do it this way -- the modification is to the mlx layer. This PR should go there.

davidkoski · 2026-05-18T19:31:04Z

See also:

davidkoski · 2026-05-18T20:32:57Z

FWIW I think so far people have been adding turboquant in the lm layer. Here is a comment on adding it to mlx:

Add TurboQuant KV cache compression with native Metal SDPA kernel mlx#3328 (comment)

joelnishanth · 2026-05-19T18:15:20Z

Ah, makes sense. I do have some benchmarking numbers, but I don't want to duplicate effort if significant progress has already been made.

joelnishanth mentioned this pull request May 14, 2026

Add TurboQuant KV cache compression on KVCacheSimple ml-explore/mlx-swift-lm#287

Open

5 tasks

davidkoski closed this May 18, 2026

davidkoski mentioned this pull request May 18, 2026

Interest in TurboQuant / rotating quantized KV cache support? ml-explore/mlx-swift-lm#294

Open

davidkoski mentioned this pull request May 18, 2026

Add TurboQuant Swift support primitives #412

Closed

joelnishanth wants to merge 1 commit into ml-explore:main from joelnishanth:feature/turboquant
