Add TurboQuant KV cache compression on KVCacheSimple by joelnishanth · Pull Request #287 · ml-explore/mlx-swift-lm

joelnishanth · 2026-05-14T03:54:18Z

Summary

Integrates TurboQuant KV cache compression (Zandieh et al., arXiv 2504.19874) into MLXLMCommon as a fully opt-in feature. When enabled, it compresses cold KV cache tokens to ~3 bits using PolarQuant + QJL while keeping the most recent tokens (the hot window) in full fp16 precision.

- KVCache.swift: turboQuantEnabled property on KVCacheSimple; hot-window eviction logic that compresses cold tokens via MLXFast.turboQuantEncode; head dimension handling (128/256 direct, 512 split-and-merge)
- AttentionUtils.swift: decode path in attentionWithCacheUpdate() that decompresses polar arrays and prepends them to the hot window before SDPA; dynamic mask slicing for extended context
- ChatSession.swift: turboQuantEnabled property that propagates to all KVCacheSimple layers after cache creation
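The hot-window split and head-dimension handling listed above can be pictured with a small pure-Swift sketch. It is not the code in this PR: the type and function names are hypothetical, the threshold and window values are placeholders, and splitting a 512-wide head into two 256-wide halves is only one assumed reading of "split-and-merge".

```swift
// Hypothetical sketch, not the PR's implementation.
struct HotWindowPolicy {
    var minActivationTokens: Int  // assumed: no compression below this total
    var hotWindowSize: Int        // assumed: newest tokens kept in fp16

    /// Number of oldest tokens eligible for compression.
    func coldTokenCount(totalTokens: Int) -> Int {
        guard totalTokens >= minActivationTokens else { return 0 }
        return max(0, totalTokens - hotWindowSize)
    }
}

/// Chunk widths for a head dimension: 128 and 256 pass through unchanged,
/// 512 is split into two 256-wide halves (assumption); anything else is
/// treated as unsupported and returns nil.
func headDimChunks(_ headDim: Int) -> [Int]? {
    switch headDim {
    case 128, 256: return [headDim]
    case 512: return [256, 256]
    default: return nil
    }
}

// Placeholder values, chosen only for illustration.
let policy = HotWindowPolicy(minActivationTokens: 2048, hotWindowSize: 1024)
print(policy.coldTokenCount(totalTokens: 1500))  // below the activation threshold
print(policy.coldTokenCount(totalTokens: 4096))  // everything older than the hot window
```

Returning nil for unsupported head dimensions lets a caller fall back to the plain fp16 path rather than fail.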

What changes for existing users: Nothing

turboQuantEnabled defaults to false. All existing code paths are unchanged. The TurboQuant code only activates when explicitly opted in.
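As a rough illustration of the opt-in behavior, the sketch below uses stand-in types (not the MLXLMCommon classes) to show a flag that defaults to false and is set only on the cache layers that support it. All names other than turboQuantEnabled are hypothetical.

```swift
// Stand-in types for illustration only; they are not the MLXLMCommon classes.
protocol SketchCache: AnyObject {}

final class SketchSimpleCache: SketchCache {
    var turboQuantEnabled = false  // mirrors the default described above
}

final class SketchOtherCache: SketchCache {}  // a cache type without the flag

/// Sets the flag on every cache that supports it; returns how many were updated.
@discardableResult
func propagate(turboQuantEnabled: Bool, to caches: [SketchCache]) -> Int {
    var updated = 0
    for case let simple as SketchSimpleCache in caches {
        simple.turboQuantEnabled = turboQuantEnabled
        updated += 1
    }
    return updated
}

let caches: [SketchCache] = [SketchSimpleCache(), SketchOtherCache(), SketchSimpleCache()]
let updated = propagate(turboQuantEnabled: true, to: caches)
print("updated \(updated) of \(caches.count) caches")
```

Caches that do not carry the flag are left untouched.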

Expected Performance

Context	KV Memory (fp16)	KV Memory (TurboKV)	Savings
2K tokens	baseline	baseline (below threshold)	0%
4K tokens	baseline	~60% baseline	~40%
8K tokens	baseline	~25% baseline	~75%
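A back-of-the-envelope model helps show why savings grow with context length. If N tokens are cached, H of them stay in the hot window at 16 bits per value, and the rest are stored at roughly b bits per value, the memory fraction relative to fp16 is (16·H + b·(N − H)) / (16·N). The sketch below evaluates that formula. The hot-window size is a placeholder and per-block metadata is ignored, so it is an idealized estimate and is not expected to reproduce the table's figures.

```swift
/// Estimated KV memory as a fraction of the all-fp16 baseline, assuming
/// hot tokens cost 16 bits per value and cold tokens cost `coldBits`.
/// Ignores metadata such as scales or norms, so it is an idealized figure.
func kvMemoryFraction(totalTokens: Int, hotTokens: Int, coldBits: Double = 3.0) -> Double {
    precondition(totalTokens > 0, "totalTokens must be positive")
    let hot = Double(min(max(hotTokens, 0), totalTokens))
    let cold = Double(totalTokens) - hot
    return (hot * 16.0 + cold * coldBits) / (Double(totalTokens) * 16.0)
}

// Placeholder scenarios: everything hot, then a 1024-token hot window at 8K.
for (total, hot) in [(2048, 2048), (8192, 1024)] {
    let fraction = kvMemoryFraction(totalTokens: total, hotTokens: hot)
    print("\(total) tokens, \(hot) hot: \(fraction) of fp16 baseline")
}
```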

Quality impact: perplexity increases by roughly 0.1-0.3 (negligible).

Usage

let session = ChatSession(modelContainer)
session.turboQuantEnabled = true  // Opt-in

for try await chunk in session.streamResponse(to: prompt) {
    print(chunk, terminator: "")
}

Dependencies

This PR currently points Package.swift at joelnishanth/mlx-swift (feature/turboquant) for the TurboQuant C++ primitives. See companion PR: ml-explore/mlx-swift#405

Once that PR merges, this dependency can switch back to upstream ml-explore/mlx-swift.
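For readers unfamiliar with pointing a Swift package at a fork branch, a manifest might look roughly like the following. This is a sketch rather than the Package.swift in this PR: the package and target names are placeholders, and the fork URL is assumed from the repository name given above.

```swift
// swift-tools-version:5.9
// Hypothetical manifest: package and target names are placeholders.
import PackageDescription

let package = Package(
    name: "ExampleApp",
    platforms: [.macOS(.v14)],
    dependencies: [
        // Temporary: fork branch carrying the TurboQuant primitives (URL assumed).
        .package(url: "https://github.com/joelnishanth/mlx-swift", branch: "feature/turboquant"),
        // If the companion PR merges, the upstream repository would be used instead:
        // .package(url: "https://github.com/ml-explore/mlx-swift", branch: "main"),
    ],
    targets: [
        .executableTarget(
            name: "ExampleApp",
            dependencies: [.product(name: "MLX", package: "mlx-swift")]
        )
    ]
)
```

Switching back to upstream then only requires changing the single `.package` entry.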

Test plan

Build succeeds on macOS 14+ (Xcode)
Default behavior (turboQuantEnabled=false) unchanged
App launch smoke test with dependency switch (no crashes)
All existing public APIs preserved
Verified in downstream project with model inference

Benchmarks

joelnishanth/mlx-swift-turboquant — Benchmark CLI comparing fp16 vs TurboQuant across short/medium/long contexts.

References

Zandieh et al., "TurboQuant: Online KV Cache Quantization with Provably Minimal Error", arXiv 2504.19874 (2025)

— Joel Nishanth · offlyn.AI

Made with Cursor

Integrates TurboQuant (arXiv 2504.19874) into MLXLMCommon as an opt-in feature for KV cache compression during long-context inference.

KVCache.swift:
- turboQuantEnabled, polarKeys/Values, residualKeys/Values on KVCacheSimple
- Hot-window eviction: compresses cold tokens via MLXFast.turboQuantEncode
- Head dim handling: 128/256 direct, 512 split-and-merge

AttentionUtils.swift:
- Decode path: MLXFast.turboDecodeK/V prepended to hot window
- Dynamic mask slicing for extended compressed context
- TurboKVTelemetry for compression stats logging

ChatSession.swift:
- turboQuantEnabled property propagated to KVCacheSimple layers

All changes are opt-in (turboQuantEnabled defaults to false). Existing fp16 path is completely unchanged.

Package.swift points to joelnishanth/mlx-swift feature/turboquant for TurboQuant C++ primitives. Intended to switch back to upstream once the mlx-swift PR is merged.

Benchmarks: https://github.com/joelnishanth/mlx-swift-turboquant

Co-authored-by: Cursor <cursoragent@cursor.com>

joelnishanth and others added 2 commits May 13, 2026 20:53

Add configurable turboMinActivationTokens and turboHotWindowSize on C…

1a7cf0d

…hatSession Co-authored-by: Cursor <cursoragent@cursor.com>

This was referenced May 18, 2026

Add TurboQuant KV cache compression (arXiv 2504.19874) ml-explore/mlx-swift#405

Closed

Interest in TurboQuant / rotating quantized KV cache support? #294

Open

joelnishanth wants to merge 2 commits into ml-explore:main from joelnishanth:feature/turboquant
