Add TurboQuant KV cache compression (arXiv 2504.19874) #405

Closed
joelnishanth wants to merge 1 commit into ml-explore:main from joelnishanth:feature/turboquant

@joelnishanth

Summary

Adds opt-in TurboQuant KV cache compression for long-context LLM inference on Apple Silicon. TurboQuant (Zandieh et al., arXiv 2504.19874) applies 3-bit PolarQuant + 1-bit QJL residual correction to compress the KV cache, achieving ~5x memory reduction for cached tokens with minimal quality impact.
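For readers unfamiliar with the 1-bit QJL step mentioned above, here is a minimal sketch of the idea, not the code in this PR: a vector is projected with a random Gaussian matrix, only the signs and the vector's norm are kept, and inner products with a full-precision query are estimated from those signs. In TurboQuant this step is described as acting on the residual left by the main quantizer; the sketch applies it to a plain vector for brevity. `GaussianSketch`, `QjlCode`, `qjl_encode`, and `qjl_inner_product` are hypothetical names chosen for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not the PR's API): dense Gaussian sketch matrix S, m x d.
struct GaussianSketch {
    std::size_t m, d;
    std::vector<float> s;  // row-major, m * d entries drawn from N(0, 1)

    GaussianSketch(std::size_t rows, std::size_t cols, unsigned seed)
        : m(rows), d(cols), s(rows * cols) {
        std::mt19937 rng(seed);
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        for (float& x : s) x = gauss(rng);
    }

    // Returns S * v (length m).
    std::vector<float> project(const std::vector<float>& v) const {
        std::vector<float> out(m, 0.0f);
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t j = 0; j < d; ++j)
                out[i] += s[i * d + j] * v[j];
        return out;
    }
};

// One sign bit per projected coordinate, plus the L2 norm of the input.
struct QjlCode {
    std::vector<bool> signs;
    float norm;
};

QjlCode qjl_encode(const GaussianSketch& sk, const std::vector<float>& k) {
    const std::vector<float> p = sk.project(k);
    QjlCode code;
    code.signs.resize(p.size());
    for (std::size_t i = 0; i < p.size(); ++i) code.signs[i] = p[i] >= 0.0f;
    double sq = 0.0;
    for (float x : k) sq += static_cast<double>(x) * x;
    code.norm = static_cast<float>(std::sqrt(sq));
    return code;
}

// Estimates <q, k> as sqrt(pi/2) * ||k|| / m * sum_i (S q)_i * sign((S k)_i).
float qjl_inner_product(const GaussianSketch& sk,
                        const std::vector<float>& q,
                        const QjlCode& code) {
    const std::vector<float> p = sk.project(q);
    double acc = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        acc += code.signs[i] ? p[i] : -p[i];
    const double sqrt_half_pi = 1.2533141373155003;  // sqrt(pi / 2)
    return static_cast<float>(sqrt_half_pi * code.norm * acc /
                              static_cast<double>(sk.m));
}
```

The query stays in full precision and only the cached vector is reduced to sign bits, so the estimate is unbiased and its error shrinks as the number of projection rows `m` grows.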

  • New C++ primitives in Source/Cmlx/turbo-quant/ (outside mlx/mlx-c submodules)
    • turbo_quant.h: Header-only Lloyd-Max codebook, Walsh-Hadamard rotation, QJL projection
    • turbo_quant_ops.cpp: Encode/decode in mlx::core::fast namespace
    • turbo_quant_bridge.cpp: C bridge for Swift FFI
  • Swift bindings in MLXFast.swift:
    • turboQuantEncode(keys:values:bits:) — compress K/V tensors
    • turboDecodeK(packed:) / turboDecodeV(packed:) — decompress
  • Package.swift updated with header search path for new directory

All TurboQuant code lives in a dedicated turbo-quant/ directory outside the mlx and mlx-c submodules, keeping the dependency boundary clean.
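To make the primitives named in the summary more concrete, the following is a rough sketch of a Walsh-Hadamard rotation followed by nearest-centroid scalar quantization. It is an illustration under stated assumptions and not the contents of turbo_quant.h: the function names are hypothetical, the codebook is passed in by the caller (a real Lloyd-Max codebook would be fitted to the distribution of the rotated coordinates), and any randomization of the rotation is left out.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// In-place orthonormal fast Walsh-Hadamard transform.
// Assumes x.size() is a power of two. With the 1/sqrt(n) scaling the
// transform is its own inverse, so calling it twice restores the input.
void fwht(std::vector<float>& x) {
    const std::size_t n = x.size();
    for (std::size_t h = 1; h < n; h *= 2) {
        for (std::size_t i = 0; i < n; i += 2 * h) {
            for (std::size_t j = i; j < i + h; ++j) {
                const float a = x[j];
                const float b = x[j + h];
                x[j] = a + b;      // butterfly: sum
                x[j + h] = a - b;  // butterfly: difference
            }
        }
    }
    const float scale = 1.0f / std::sqrt(static_cast<float>(n));
    for (float& v : x) v *= scale;
}

// Index of the codebook entry closest to v (codebook must be non-empty
// and hold at most 256 entries so the index fits in one byte).
std::uint8_t nearest_centroid(float v, const std::vector<float>& codebook) {
    std::size_t best = 0;
    float best_dist = std::fabs(v - codebook[0]);
    for (std::size_t i = 1; i < codebook.size(); ++i) {
        const float dist = std::fabs(v - codebook[i]);
        if (dist < best_dist) {
            best_dist = dist;
            best = i;
        }
    }
    return static_cast<std::uint8_t>(best);
}

// Maps each coordinate to its nearest codebook index.
std::vector<std::uint8_t> quantize(const std::vector<float>& x,
                                   const std::vector<float>& codebook) {
    std::vector<std::uint8_t> idx(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        idx[i] = nearest_centroid(x[i], codebook);
    return idx;
}

// Replaces each index with its codebook value.
std::vector<float> dequantize(const std::vector<std::uint8_t>& idx,
                              const std::vector<float>& codebook) {
    std::vector<float> x(idx.size());
    for (std::size_t i = 0; i < idx.size(); ++i) x[i] = codebook[idx[i]];
    return x;
}
```

In a pipeline of this shape, encoding would call `fwht` and then `quantize`, and decoding would call `dequantize` and then `fwht` again. The indices are stored one per byte here for clarity; packing them into 3-bit fields is a separate step this sketch does not show.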

Expected Performance

Based on the TurboQuant paper (Table 1):

| Metric | fp16 | TurboQuant (3-bit) | Delta |
| --- | --- | --- | --- |
| KV cache size (8K context) | 100% | ~19% | -81% |
| Throughput (tok/s) | baseline | ~0.95-1.0x | minimal |
| Quality (perplexity) | baseline | +0.1-0.3 ppl | negligible |
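As a sanity check on the cache-size row, the arithmetic can be sketched as follows. This assumes a total payload of 3 bits per cached value against a 16-bit baseline and ignores per-vector metadata such as norms, which is what the ~19% figure implies (3/16 = 18.75%); a 4-bit total would give 25% instead. The model shape in the usage note is hypothetical, and `KvShape`, `kv_cache_bytes`, and `compressed_fraction` are illustrative names.

```cpp
#include <cstddef>

// Hypothetical description of a model's KV cache layout.
struct KvShape {
    std::size_t layers;
    std::size_t kv_heads;
    std::size_t head_dim;
    std::size_t tokens;
};

// Payload bytes for keys and values together at the given bit width.
// Metadata such as per-vector norms or scales is not counted.
double kv_cache_bytes(const KvShape& s, double bits_per_value) {
    const double values = 2.0 * static_cast<double>(s.layers) *
                          static_cast<double>(s.kv_heads) *
                          static_cast<double>(s.head_dim) *
                          static_cast<double>(s.tokens);
    return values * bits_per_value / 8.0;
}

// Compressed size as a fraction of the baseline size.
double compressed_fraction(double bits_per_value, double baseline_bits = 16.0) {
    return bits_per_value / baseline_bits;
}
```

For a hypothetical shape of 32 layers, 8 KV heads, head dimension 128, and 8192 tokens, `kv_cache_bytes` gives 1 GiB at 16 bits and 192 MiB at 3 bits, a ratio of 16/3, or about 5.3x.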

Companion PRs

Test plan

  • Build succeeds on macOS 14+ (Xcode)
  • No changes to existing public APIs
  • All new code is behind opt-in API calls
  • Verified in downstream project (Clipper) with model inference

References

  • Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", arXiv 2504.19874 (2025)

Joel Nishanth · offlyn.AI

Made with Cursor

Adds opt-in TurboQuant KV cache compression for long-context LLM inference.
3-bit PolarQuant + 1-bit QJL residual correction for keys.

New files in Source/Cmlx/turbo-quant/:
- turbo_quant.h: Header-only C++ Lloyd-Max codebook, WHT, QJL
- turbo_quant_ops.cpp: Encode/decode in mlx::core::fast
- turbo_quant_bridge.cpp: C bridge for Swift FFI
- turbo_quant_decl.h: Forward declarations

Modified:
- Package.swift: turbo-quant header search path
- include/mlx/c/fast.h: C bridge declarations
- MLXFast.swift: Swift bindings

Reference: Zandieh et al., arXiv 2504.19874 (2025)
Benchmarks: https://github.com/joelnishanth/mlx-swift-turboquant
Co-authored-by: Cursor <cursoragent@cursor.com>
@davidkoski
Collaborator

You shouldn't do it this way -- the modification is to the mlx layer. This PR should go there.

@davidkoski closed this on May 18, 2026

@davidkoski
Collaborator

FWIW I think so far people have been adding turboquant in the lm layer. Here is a comment on adding it to mlx:

@joelnishanth
Author

Ah, makes sense. I do have some benchmarking numbers, but I don't want to duplicate effort if significant progress has already been made.
