add TurboQuant KV cache compression by TheTom · Pull Request #232 · ml-explore/mlx-swift-lm

TheTom · 2026-04-22T14:40:49Z

Summary

Implements TurboQuant KV cache compression -- WHT rotation + Lloyd-Max codebook quantization with support for asymmetric K/V bit-widths. Compresses KV cache 3-7x with minimal quality loss.

New: TurboQuantKVCache -- two-phase cache (raw prefill, compressed decode)
New: TurboQuantKernels -- Metal kernels for fused encode and compressed-domain attention
Modified: KVCache.swift, AttentionUtils.swift, Evaluate.swift -- scheme routing and parameter threading

Dependencies

Requires ml-explore/mlx-c#113 -- guards mlx_array_dim against 0-dim arrays. Without this, Swift metadata init on 0-dim MLXArray crashes during TurboQuant codec setup.

Schemes

Scheme	Description	KV Compression
`turbo4`	4-bit K + 4-bit V	3.8x
`turbo3`	3-bit symmetric	4.9x
`turbo2`	2-bit symmetric	7.1x
`turbo4v2`	4-bit K + 2-bit V (asymmetric)	4.9x
`turbo4v3`	4-bit K + 3-bit V	4.3x
`turbo0v4`	FP16 K + 4-bit V (raw keys)	1.6x
`turbo0v2`	FP16 K + 2-bit V	1.8x

PPL data (Genesis text, 512 tokens)

Multi-model (turbo4 -- recommended default)

Model	Architecture	Baseline	turbo4	turbo4v2	turbo0v4
Llama-3.2-3B-4bit	Dense, 28L	1.03	1.03	1.10	1.03
Qwen2.5-3B-4bit	Dense, 36L	1.00	1.97	4.29	1.00
Mistral-7B-4bit	Dense, 32L	1.00	1.00	--	1.00
Qwen3.5-35B-A3B-4bit	MoE, 64L	1.00	1.00	1.00	1.00

All schemes (Llama-3.2-3B)

Scheme	PPL	Delta
none (FP16)	1.47	--
turbo0v4	1.48	+0.01
turbo0v2	1.57	+0.10
turbo4	1.63	+0.16
turbo4v2	2.04	+0.57
turbo3	2.11	+0.64
turbo4v3	2.32	+0.85
turbo2	14.31	+12.84

Context length stability (Llama-3.2-3B)

Context	Baseline	turbo4
512	1.03	1.03
1024	1.15	1.14
2048	1.02	1.03

Design

Two-phase: raw FP16 during prefill (zero overhead), batch-compress on first decode token
WHT rotation: O(d log d) butterfly transform in Metal for power-of-2 head dims, dense matmul fallback otherwise
Shared codecs: all layers with same (dim, bits, seed) share rotation matrix and codebook
Batch recompression: pending tokens accumulated and encoded in batches (default 64) to reduce kernel launches
Raw-K mode: turbo0v* schemes keep keys at FP16 while compressing only values -- best quality/compression tradeoff when memory allows

Known limitations

turbo8 (8-bit) encode is prohibitively slow due to 256-centroid codebook search -- practical limit is 4-bit
WHT requires power-of-2 head dimensions (64, 128, 256); non-power-of-2 falls back to dense rotation
Flash attention kernels for L=1 decode are present but disabled pending validation -- using separate score + value path

Feedback welcome

This is a draft -- would love feedback on:

API surface and integration points
Metal kernel approach (JIT via metalKernel vs framework-level dispatch)
Whether this should ship as opt-in only or with recommended defaults
Any concerns about the two-phase architecture

🤖 Generated with Claude Code

Implements TurboQuant (arXiv:2504.19874) for KV cache compression: WHT rotation + Lloyd-Max codebook quantization with asymmetric K/V bit-width support. Compresses KV cache 3-7x with minimal quality loss. New files: - TurboQuantKVCache.swift: two-phase cache (raw prefill, compressed decode) - TurboQuantKernels.swift: Metal kernels for fused encode and attention Modified: - KVCache.swift: kvScheme routing for turbo schemes - AttentionUtils.swift: TurboQuantKVCache dispatch - Evaluate.swift: kvScheme parameter threading

aleroot · 2026-04-22T18:33:16Z

    ],
    dependencies: [
-        .package(url: "https://github.com/ml-explore/mlx-swift", .upToNextMinor(from: "0.31.3")),
+        .package(url: "https://github.com/ekryski/mlx-swift", branch: "alpha"),


You cannot change this library, I mean this needs to reference the official ml-explore mlx-swift, if you need to add there, you need to first add a pull request there and get it merged, this is no sense.

This PR is still in draft, pointing to the wrong library. I would hold off on review. Defiant pointing at the wrong library

This will be fixed in the next push. No intention of switching the dependencies.

aleroot · 2026-04-22T18:35:37Z

Out-of-bounds threadgroup memory in fused encode kernels (can corrupt results/crash for head_dim > 128).
In both dense and WHT encode kernels, shared_norm is hardcoded to 4 entries, but indexed by sg_id = d/32 and read up to num_groups = (Dim+31)/32. For Dim=256, this accesses indices 0...7.

Good catch, fixed in next push. shared_norm is now sized [(Dim + 31) / 32] instead of hardcoded 4. Current models all use head_dim=128 so this wasn't hit in practice but would break on 256.

aleroot · 2026-04-22T18:38:08Z

+        }
+
+        // KV head mapping (use first query's head — same assumption as non-causal NR0)
+        uint q_head_idx_0 = (query_group * NR0) / L;


NR0 path derives kv_idx from only the first query row in the group. If a group spans multiple query heads (possible when grouping is not aligned to L), some rows will read/write against the wrong KV head. Guarding only on totalQ % nr0 == 0 is insufficient for correctness.

Added a queryChunkLength % nr0 == 0 in, next push, guard so the NR0 path only activates when L aligns with the group size. Falls back to per-row dispatch otherwise.

aleroot · 2026-04-22T18:39:37Z

TurboQuant cache cannot be restored correctly from prompt cache. TurboQuantKVCache is not represented in cacheClassName, so it is serialized as KVCache. On load it is restored as KVCacheSimple, but TurboQuant compressed state carries 3/4 arrays rather than KVCacheSimple’s required 2 arrays, leading to invalid restoration behavior. Add explicit TurboQuant class name mapping and restore path.

Fixed in nex push. Added TurboQuantKVCache to cacheClassName and restoreCacheFromMetaState. metaState now carries bits, keyBits, valueBits, and seed so the cache can be reconstructed correctly on load.

TheTom · 2026-04-22T18:41:28Z

@aleroot I appreciate the early reviews but this is still in draft.

aleroot · 2026-04-22T18:42:23Z

        self.maxTokens = parameters.maxTokens
        self.numDraftTokens = numDraftTokens

        self.quantizeKVCache = { cache in


kvScheme dropped in speculative decoding quantization.TurboQuant scheme selection is ignored for speculative generation despite being part of GenerateParameters...

Yep, missed that. Threaded kvScheme through the speculative quantization closure now.

aleroot

The implementation is ambitious and has strong ideas, and honestly is something I wanted to work on as well... At the moment this pull request is just a prototype as it does not look like a full inner-product TurboQuant formulation ...
There are no tests as well that are really proving is working.

TheTom · 2026-04-22T18:50:32Z

The implementation is ambitious and has strong ideas, and honestly is something I wanted to work on as well... At the moment this pull request is just a prototype as it does not look like a full inner-product TurboQuant formulation ...
There are no tests as well that are really proving is working.

Appreciate your early feedback. Still a work in progress and know that this is an ambitious PR as is. Thank you for the draft comments!

aleroot · 2026-04-22T18:57:37Z

The implementation is ambitious and has strong ideas, and honestly is something I wanted to work on as well... At the moment this pull request is just a prototype as it does not look like a full inner-product TurboQuant formulation ...
There are no tests as well that are really proving is working.

Appreciate your early feedback. Still a work in progress and know that this is an ambitious PR as is. Thank you for the draft comments!

Sorry, I did not notice it was a draft .

TheTom · 2026-04-22T19:13:32Z

The implementation is ambitious and has strong ideas, and honestly is something I wanted to work on as well... At the moment this pull request is just a prototype as it does not look like a full inner-product TurboQuant formulation ...

There are no tests as well that are really proving is working.

Appreciate your early feedback. Still a work in progress and know that this is an ambitious PR as is. Thank you for the draft comments!

Sorry, I did not notice it was a draft .

All good let's work together on this. It's a big lift and it's hard to separate into individual PRs. Lots of work ahead.

TheTom · 2026-04-22T20:04:22Z

The implementation is ambitious and has strong ideas, and honestly is something I wanted to work on as well... At the moment this pull request is just a prototype as it does not look like a full inner-product TurboQuant formulation ... There are no tests as well that are really proving is working.

Follow up as i was on the road for my last comment:
Appreciate the thorough review, genuinely helpful. All the bugs you flagged are fixed locally and will be in the next push.

Still polishing this, the PPL data in the description shows it working across Llama, Qwen, Mistral, and Qwen3.5 MoE. Happy to collaborate on getting this to a state that works for the project. If you have ideas on the formulation or testing approach I'm all ears. (we can also do email or discord or whatever works)

…kvScheme - size shared_norm dynamically in Metal encode kernels for head_dim > 128 - guard NR0 flash path on query chunk alignment to prevent KV head mismatch - add TurboQuantKVCache to prompt cache save/restore with metaState - thread kvScheme through speculative decoding quantization closure - point Package.swift at ml-explore/mlx-swift 0.31.3

aleroot reviewed Apr 22, 2026

View reviewed changes

aleroot suggested changes Apr 22, 2026

View reviewed changes

TheTom force-pushed the pr/upstream-turboquant branch from 4921784 to 2864b83 Compare April 22, 2026 20:08

aleroot mentioned this pull request May 8, 2026

support gpt-oss-20b-tq3 #270

Open

This was referenced May 18, 2026

Add TurboQuant KV cache compression (arXiv 2504.19874) ml-explore/mlx-swift#405

Closed

Interest in TurboQuant / rotating quantized KV cache support? #294

Open

Conversation

TheTom commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Dependencies

Schemes

PPL data (Genesis text, 512 tokens)

Multi-model (turbo4 -- recommended default)

All schemes (Llama-3.2-3B)

Context length stability (Llama-3.2-3B)

Design

Known limitations

Feedback welcome

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

TheTom commented Apr 22, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aleroot left a comment

Choose a reason for hiding this comment

Uh oh!

TheTom commented Apr 22, 2026

Uh oh!

aleroot commented Apr 22, 2026

Uh oh!

TheTom commented Apr 22, 2026

Uh oh!

TheTom commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TheTom commented Apr 22, 2026 •

edited

Loading