Add TurboQuant KV cache compression on KVCacheSimple #287
Open
joelnishanth wants to merge 2 commits into
Integrates TurboQuant (arXiv 2504.19874) into MLXLMCommon as an opt-in feature for KV cache compression during long-context inference.

KVCache.swift:
- turboQuantEnabled, polarKeys/Values, residualKeys/Values on KVCacheSimple
- Hot-window eviction: compresses cold tokens via MLXFast.turboQuantEncode
- Head dim handling: 128/256 direct, 512 split-and-merge

AttentionUtils.swift:
- Decode path: MLXFast.turboDecodeK/V prepended to hot window
- Dynamic mask slicing for extended compressed context
- TurboKVTelemetry for compression stats logging

ChatSession.swift:
- turboQuantEnabled property propagated to KVCacheSimple layers

All changes are opt-in (turboQuantEnabled defaults to false). Existing fp16 path is completely unchanged.

Package.swift points to joelnishanth/mlx-swift feature/turboquant for TurboQuant C++ primitives. Intended to switch back to upstream once the mlx-swift PR is merged.

Benchmarks: https://github.com/joelnishanth/mlx-swift-turboquant

Co-authored-by: Cursor <cursoragent@cursor.com>
…hatSession Co-authored-by: Cursor <cursoragent@cursor.com>
This was referenced May 18, 2026
Summary
Integrates TurboQuant KV cache compression (Zandieh et al., arXiv 2504.19874) into MLXLMCommon as a fully opt-in feature. When enabled, compresses cold KV cache tokens to ~3 bits using PolarQuant + QJL while keeping the most recent tokens in full fp16 precision (hot window).
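The hot-window idea can be sketched in plain Swift. This is only an illustration under stated assumptions: `TokenKV`, `HotWindowCache`, `hotWindowSize`, and the `compress` closure are hypothetical stand-ins, not the types or encoder used in this PR, and the real cache operates on MLX arrays rather than per-token values.

```swift
/// Stand-in for one cached token's key/value data (hypothetical type).
struct TokenKV {
    let position: Int
}

/// Minimal model of a KV cache with a full-precision hot window and a
/// compressed cold region. `compress` stands in for the real encoder.
struct HotWindowCache {
    let hotWindowSize: Int                      // assumed parameter name
    private(set) var hot: [TokenKV] = []        // most recent tokens, uncompressed
    private(set) var coldCompressed: [String] = []  // older tokens, compressed

    init(hotWindowSize: Int) {
        precondition(hotWindowSize > 0, "hot window must hold at least one token")
        self.hotWindowSize = hotWindowSize
    }

    /// Append new tokens, then evict the oldest tokens that no longer fit.
    mutating func append(_ tokens: [TokenKV], compress: (TokenKV) -> String) {
        hot.append(contentsOf: tokens)
        let overflow = hot.count - hotWindowSize
        guard overflow > 0 else { return }
        // The oldest tokens sit at the front; compress them in order.
        coldCompressed.append(contentsOf: hot.prefix(overflow).map(compress))
        hot.removeFirst(overflow)
    }

    /// Total context length visible to attention: cold plus hot tokens.
    var totalTokens: Int { coldCompressed.count + hot.count }
}
```

At decode time the compressed region would be decompressed and placed in front of the hot window, so attention still sees `totalTokens` positions.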
- KVCache.swift: turboQuantEnabled property on KVCacheSimple, hot-window eviction logic that compresses cold tokens via MLXFast.turboQuantEncode, head dimension handling (128/256 direct, 512 split-and-merge)
- AttentionUtils.swift: decode path in attentionWithCacheUpdate() that decompresses polar arrays and prepends them to the hot window before SDPA, dynamic mask slicing for extended context
- ChatSession.swift: turboQuantEnabled property that propagates to all KVCacheSimple layers after cache creation

What changes for existing users: Nothing
turboQuantEnabled defaults to false. All existing code paths are unchanged. The TurboQuant code only activates when explicitly opted in.

Expected Performance
Quality impact: +0.1-0.3 perplexity increase (negligible).
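For a rough sense of the memory side, the ~3-bit figure from the Summary can be turned into a back-of-envelope estimate. The function below is a sketch, not a measurement: `kvCacheBytes` and its parameters are hypothetical, and it ignores any per-block scales or other metadata the compressed format may carry.

```swift
/// Rough KV cache size estimate in bytes.
/// Assumes fp16 (2 bytes per element) for hot tokens and `coldBits` bits per
/// element for cold tokens. The default of 3 follows the "~3 bits" figure in
/// the Summary; metadata overhead is not modeled.
func kvCacheBytes(
    tokens: Int,
    hotWindow: Int,
    layers: Int,
    kvHeads: Int,
    headDim: Int,
    coldBits: Double = 3.0
) -> (fp16: Double, mixed: Double) {
    // Keys and values each store kvHeads * headDim elements per token per layer.
    let elementsPerToken = Double(2 * layers * kvHeads * headDim)
    let hotTokens = Double(min(tokens, hotWindow))
    let coldTokens = Double(max(tokens - hotWindow, 0))

    let fp16 = Double(tokens) * elementsPerToken * 2.0
    let mixed = hotTokens * elementsPerToken * 2.0
        + coldTokens * elementsPerToken * coldBits / 8.0
    return (fp16, mixed)
}
```

With these assumptions, cold tokens shrink by a factor of 16 / 3, about 5.3x, while the overall saving depends on how much of the context falls outside the hot window.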
Usage
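Per the Summary, opting in amounts to setting turboQuantEnabled on the session, which forwards the flag to every cache layer after the cache is created. A minimal sketch of that flow follows, using stand-in types (`Session`, `CacheLayer`, `makeCacheIfNeeded`) rather than the real ChatSession and KVCacheSimple, whose exact initializers are not shown here.

```swift
/// Stand-in for KVCacheSimple; only the opt-in flag is modeled.
final class CacheLayer {
    var turboQuantEnabled = false
}

/// Stand-in for ChatSession: owns one cache per layer and forwards the flag.
final class Session {
    private(set) var layers: [CacheLayer] = []

    /// Defaults to false, so nothing changes unless the caller opts in.
    var turboQuantEnabled = false {
        didSet { propagate() }
    }

    /// Caches are created lazily, so the flag is re-applied after creation.
    func makeCacheIfNeeded(layerCount: Int) {
        guard layers.isEmpty else { return }
        layers = (0..<layerCount).map { _ in CacheLayer() }
        propagate()
    }

    private func propagate() {
        for layer in layers {
            layer.turboQuantEnabled = turboQuantEnabled
        }
    }
}

// Opt in before the first generation call; the cache picks the flag up on creation.
let session = Session()
session.turboQuantEnabled = true
session.makeCacheIfNeeded(layerCount: 4)
```

Re-applying the flag after cache creation covers the case where the property is set before any cache exists.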
Dependencies
This PR currently points Package.swift at joelnishanth/mlx-swift (feature/turboquant) for the TurboQuant C++ primitives. See companion PR: ml-explore/mlx-swift#405

Once that PR merges, this dependency can switch back to upstream ml-explore/mlx-swift.

Test plan
Benchmarks
joelnishanth/mlx-swift-turboquant — Benchmark CLI comparing fp16 vs TurboQuant across short/medium/long contexts.
References
— Joel Nishanth · offlyn.AI
Made with Cursor