Add TurboQuant KV cache compression on KVCacheSimple#287

Open
joelnishanth wants to merge 2 commits into
ml-explore:mainfrom
joelnishanth:feature/turboquant

Summary

Integrates TurboQuant KV cache compression (Zandieh et al., arXiv 2504.19874) into MLXLMCommon as a fully opt-in feature. When enabled, it compresses cold KV cache tokens to ~3 bits using PolarQuant + QJL while keeping the most recent tokens (the hot window) in full fp16 precision.

  • KVCache.swift: adds a turboQuantEnabled property on KVCacheSimple, hot-window eviction logic that compresses cold tokens via MLXFast.turboQuantEncode, and head dimension handling (128/256 direct, 512 split-and-merge)
  • AttentionUtils.swift: adds a decode path in attentionWithCacheUpdate() that decompresses the polar arrays and prepends them to the hot window before SDPA, plus dynamic mask slicing for the extended context
  • ChatSession.swift: adds a turboQuantEnabled property that propagates to all KVCacheSimple layers after cache creation
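
To make the hot-window and head-dimension handling above concrete, here is a minimal sketch in plain Swift. It is not the code from KVCache.swift: the type and function names, the window size, and the choice to split a 512-wide head into two 256-wide chunks are all assumptions made for illustration, and the actual compression call (MLXFast.turboQuantEncode) is left out.

```swift
// Illustrative sketch only. All names here are hypothetical and are
// not taken from the PR's KVCache.swift.

struct EvictionPlan: Equatable {
    let coldCount: Int  // oldest tokens, to be handed to the compressor
    let hotCount: Int   // newest tokens, kept in full precision
}

/// Decide how many uncompressed tokens fall outside the hot window.
func planEviction(uncompressedTokens: Int, hotWindow: Int) -> EvictionPlan {
    precondition(uncompressedTokens >= 0 && hotWindow >= 0)
    let cold = max(0, uncompressedTokens - hotWindow)
    return EvictionPlan(coldCount: cold, hotCount: uncompressedTokens - cold)
}

/// Split a token buffer (oldest first) into cold and hot slices.
func splitTokens<T>(_ tokens: [T], hotWindow: Int) -> (cold: ArraySlice<T>, hot: ArraySlice<T>) {
    let plan = planEviction(uncompressedTokens: tokens.count, hotWindow: hotWindow)
    return (tokens.prefix(plan.coldCount), tokens.suffix(plan.hotCount))
}

/// Chunk widths to encode for a given head dimension.
/// Assumption: 512 is split into two 256-wide halves; the PR only says
/// "split-and-merge". Other sizes are treated as unsupported (nil).
func headDimChunks(_ headDim: Int) -> [Int]? {
    switch headDim {
    case 128, 256: return [headDim]
    case 512: return [256, 256]
    default: return nil
    }
}
```

With a hypothetical hot window of 4 tokens, a 10-token buffer would yield 6 cold tokens for compression and 4 hot tokens left untouched.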

What changes for existing users: Nothing

turboQuantEnabled defaults to false. All existing code paths are unchanged. The TurboQuant code only activates when explicitly opted in.
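
As a rough illustration of the opt-in behavior, the sketch below shows one way a session-level flag could be copied onto only those layer caches that support it. The protocol and classes are hypothetical stand-ins, not the real KVCache types; only the property name turboQuantEnabled comes from this PR.

```swift
// Hypothetical stand-ins for the real cache types.
protocol LayerCache: AnyObject {}

final class SimpleCache: LayerCache {
    var turboQuantEnabled = false  // opt-in: off unless set explicitly
}

final class OtherCache: LayerCache {}  // a cache type with no TurboQuant support

/// Copy the session-level flag to every layer cache that supports it,
/// leaving other cache types untouched.
func propagate(turboQuantEnabled: Bool, to caches: [any LayerCache]) {
    for case let simple as SimpleCache in caches {
        simple.turboQuantEnabled = turboQuantEnabled
    }
}
```

Because the flag starts as false and is only written by an explicit call, code that never sets it sees no behavioral change in this sketch.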

Expected Performance

| Context | KV Memory (fp16) | KV Memory (TurboKV) | Savings |
|---|---|---|---|
| 2K tokens | baseline | baseline (below threshold) | 0% |
| 4K tokens | baseline | ~60% of baseline | ~40% |
| 8K tokens | baseline | ~25% of baseline | ~75% |

Quality impact: expected perplexity increase of 0.1 to 0.3 (negligible).
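
For intuition about where savings of this kind come from, here is a back-of-the-envelope model. It assumes hot tokens cost 16 bits per value, cold tokens cost a flat 3 bits per value, and there is no metadata overhead. The function name and the 2048-token hot window are hypothetical (the PR does not state the window size), so its outputs are not expected to reproduce the figures above exactly.

```swift
/// Fraction of the fp16 KV memory still in use when tokens outside the
/// hot window are stored at `coldBits` bits per value.
/// Simplified model: ignores scales, residuals, and other overhead.
func kvMemoryFraction(contextTokens: Int, hotWindow: Int, coldBits: Double = 3.0) -> Double {
    precondition(contextTokens > 0 && hotWindow >= 0)
    let fullBits = 16.0
    let hot = Double(min(contextTokens, hotWindow))
    let cold = Double(contextTokens) - hot
    return (hot * fullBits + cold * coldBits) / (Double(contextTokens) * fullBits)
}

// Hypothetical hot window of 2048 tokens.
for tokens in [2048, 4096, 8192] {
    let fraction = kvMemoryFraction(contextTokens: tokens, hotWindow: 2048)
    print(tokens, fraction)
}
```

Under this model the fraction equals 1.0 while the context fits inside the hot window and approaches 3/16 as the context grows much larger than the window.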

Usage

let session = ChatSession(modelContainer)
session.turboQuantEnabled = true  // Opt-in

for try await chunk in session.streamResponse(to: prompt) {
    print(chunk, terminator: "")
}

Dependencies

This PR currently points Package.swift at joelnishanth/mlx-swift (feature/turboquant) for the TurboQuant C++ primitives. See companion PR: ml-explore/mlx-swift#405

Once that PR merges, this dependency can switch back to upstream ml-explore/mlx-swift.
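
For a downstream project that wants to try the branch, the dependency switch might look roughly like the manifest below. This is a sketch, not the Package.swift from this PR: the package and target names are hypothetical, and the fork URL is assumed from the repository name given above.

```swift
// swift-tools-version:5.9
import PackageDescription

let package = Package(
    name: "ExampleApp",  // hypothetical package name
    platforms: [.macOS(.v14)],
    dependencies: [
        // Temporary: fork branch carrying the TurboQuant primitives.
        // URL assumed from "joelnishanth/mlx-swift (feature/turboquant)".
        .package(url: "https://github.com/joelnishanth/mlx-swift", branch: "feature/turboquant")
        // After the companion PR merges, point this back at
        // https://github.com/ml-explore/mlx-swift instead.
    ],
    targets: [
        .target(
            name: "ExampleApp",
            dependencies: [.product(name: "MLX", package: "mlx-swift")]
        )
    ]
)
```

Pinning to a branch keeps the fork dependency easy to spot and to revert once upstream carries the primitives.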

Test plan

  • Build succeeds on macOS 14+ (Xcode)
  • Default behavior (turboQuantEnabled=false) unchanged
  • App launch smoke test with dependency switch (no crashes)
  • All existing public APIs preserved
  • Verified in downstream project with model inference

Benchmarks

joelnishanth/mlx-swift-turboquant — Benchmark CLI comparing fp16 vs TurboQuant across short/medium/long contexts.

References

  • Zandieh et al., "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate", arXiv 2504.19874 (2025)

Joel Nishanth · offlyn.AI

Made with Cursor

joelnishanth and others added 2 commits May 13, 2026 20:53
Integrates TurboQuant (arXiv 2504.19874) into MLXLMCommon as an opt-in
feature for KV cache compression during long-context inference.

KVCache.swift:
- turboQuantEnabled, polarKeys/Values, residualKeys/Values on KVCacheSimple
- Hot-window eviction: compresses cold tokens via MLXFast.turboQuantEncode
- Head dim handling: 128/256 direct, 512 split-and-merge

AttentionUtils.swift:
- Decode path: MLXFast.turboDecodeK/V prepended to hot window
- Dynamic mask slicing for extended compressed context
- TurboKVTelemetry for compression stats logging

ChatSession.swift:
- turboQuantEnabled property propagated to KVCacheSimple layers

All changes are opt-in (turboQuantEnabled defaults to false).
Existing fp16 path is completely unchanged.

Package.swift points to joelnishanth/mlx-swift feature/turboquant
for TurboQuant C++ primitives. Intended to switch back to upstream
once the mlx-swift PR is merged.

Benchmarks: https://github.com/joelnishanth/mlx-swift-turboquant
Co-authored-by: Cursor <cursoragent@cursor.com>
…hatSession

Co-authored-by: Cursor <cursoragent@cursor.com>