Share the fused gated-delta kernel between MLXLLM and MLXVLM by john-rocky · Pull Request #257 · ml-explore/mlx-swift-lm

john-rocky · 2026-05-01T00:29:01Z

Summary

Libraries/MLXLLM/Models/GatedDelta.swift defines a fused gatedDeltaKernel (custom Metal kernel, single threadgroup walking the time loop) and a gatedDeltaOps expression-graph fallback. gatedDeltaUpdate prefers the kernel when MLX Metal is available, which is the only path the LLM-side Qwen 3.5 / Qwen 3 Next models take on Apple Silicon.

MLXVLM/Models/Qwen35.swift was a copy/paste of the same code with the kernel branch removed:

private func gatedDeltaUpdate(...) {
    ...
    return gatedDeltaOps(...)   // ops fallback only
}

So every VLM Qwen 3.5 checkpoint with linear_attn was running the unfused per-step expression-graph version, generating far more intermediates per token. This is a stand-alone fix for the kernel branch that was missing on the VLM side. For context on the broader #124 perf discussion, see my profiling comment there: the 35B-A3B-4bit text-only MoE no longer reproduces a gap on current mlx-swift 0.31.3 (mlx-c 0.31.1) builds. The VLM-side branch removal addressed in this PR is independent of that finding.
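To make "per-step" concrete, here is a minimal, dependency-free sketch of one gated-delta step for a single head, plus the time loop around it. It assumes the usual gated delta rule (decay the state, read the prediction for the key, write back the gated error, read out with the query). The function names, argument layout, and the plain `[[Float]]` state are illustrative assumptions, not the contents of GatedDelta.swift.

```swift
// Illustrative single-head gated-delta step in plain Swift (no MLX).
// Assumption: the usual gated delta rule; all names here are hypothetical.
// `state` is a Dv x Dk matrix, `g` the decay gate, `beta` the write strength.
func gatedDeltaStep(
    q: [Float], k: [Float], v: [Float], g: Float, beta: Float,
    state: inout [[Float]]
) -> [Float] {
    var y = [Float](repeating: 0, count: v.count)
    for i in v.indices {
        // 1. Decay the previous state row.
        for j in k.indices { state[i][j] *= g }
        // 2. Read what the state currently predicts for this key.
        var predicted: Float = 0
        for j in k.indices { predicted += state[i][j] * k[j] }
        // 3. Write back the gated prediction error (the "delta").
        let delta = (v[i] - predicted) * beta
        for j in k.indices { state[i][j] += k[j] * delta }
        // 4. Read out with the query.
        for j in k.indices { y[i] += state[i][j] * q[j] }
    }
    return y
}

// The time loop. An ops-style path creates graph nodes for every step,
// while a fused kernel runs this whole loop inside one GPU dispatch.
func gatedDeltaSequence(
    q: [[Float]], k: [[Float]], v: [[Float]], g: [Float], beta: [Float],
    state: inout [[Float]]
) -> [[Float]] {
    var outputs: [[Float]] = []
    for t in q.indices {
        outputs.append(gatedDeltaStep(
            q: q[t], k: k[t], v: v[t], g: g[t], beta: beta[t], state: &state))
    }
    return outputs
}
```

Each numbered step inside the loop corresponds to several array operations in an expression-graph version, repeated for every time step, which is where the extra intermediates come from.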

Change

Move Libraries/MLXLLM/Models/GatedDelta.swift → Libraries/MLXLMCommon/GatedDelta.swift.
Make computeGatedDeltaG, gatedDeltaKernel, gatedDeltaOps, gatedDeltaUpdate public so both LLM and VLM call through MLXLMCommon.
Delete the duplicate computeGatedDeltaG / gatedDeltaStepOps / gatedDeltaOps / gatedDeltaUpdate helpers from MLXVLM/Models/Qwen35.swift. The single VLM call site now resolves to the shared public function with the kernel-preferred dispatch.

Net (excluding the bench test): +10 / −132. No behavior change for MLXLLM (same code, new home); MLXVLM Qwen 3.5 with linear_attn now hits the fused Metal kernel.
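The dispatch difference between the shared helper and the old VLM copy can be sketched as below. This is a hypothetical illustration only: `GatedDeltaPath`, `sharedPath`, and `oldVLMPath` are made-up names, and the `metalAvailable` flag stands in for whatever availability check the real `gatedDeltaUpdate` performs.

```swift
// Hypothetical sketch of kernel-preferred dispatch; not the library's code.
enum GatedDeltaPath { case fusedKernel, opsFallback }

// Shared helper: prefer the fused kernel whenever Metal is usable,
// otherwise fall back to the expression-graph ops.
func sharedPath(metalAvailable: Bool) -> GatedDeltaPath {
    metalAvailable ? .fusedKernel : .opsFallback
}

// Shape of the old VLM copy: the kernel branch was missing, so the
// availability flag had no effect on the chosen path.
func oldVLMPath(metalAvailable: Bool) -> GatedDeltaPath {
    .opsFallback
}

print(sharedPath(metalAvailable: true), oldVLMPath(metalAvailable: true))
```

With a single shared function, both call sites get the same branch selection, so the two modules can no longer drift apart.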

Microbench (M4 Max, 128 GB, bfloat16)

Tests/MLXLMTests/GatedDeltaBenchTests.swift constructs realistic Qwen 3.5 linear-attn input shapes (B=1, Hk=16, Hv=64, Dk=192, Dv=128) at four representative T values and times gatedDeltaKernel vs gatedDeltaOps over 20 iterations after warmup.

| shape | T | `gatedDeltaOps` median (ms) | `gatedDeltaKernel` median (ms) | speedup |
|---|---|---|---|---|
| decode_T1 | 1 | 0.605 | 0.315 | 1.92× |
| decode_T8 | 8 | 2.909 | 0.369 | 7.88× |
| prefill_T64 | 64 | 20.255 | 0.799 | 25.35× |
| prefill_T256 | 256 | 87.577 | 1.467 | 59.69× |
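For readers who want to reproduce the "median over 20 iterations after warmup" methodology without the dropped test file, a rough timing harness could look like the following. It is a sketch under assumptions: the helper names and the warmup count of 3 are hypothetical (the description only says "after warmup"), and with MLX the timed closure would also have to force evaluation of its result, since MLX arrays are computed lazily.

```swift
import Foundation

// Median of a list of timings; averages the two middle values for even counts.
func median(_ xs: [Double]) -> Double {
    precondition(!xs.isEmpty, "need at least one sample")
    let s = xs.sorted()
    let mid = s.count / 2
    return s.count % 2 == 1 ? s[mid] : (s[mid - 1] + s[mid]) / 2
}

// Runs `body` `warmup` times untimed, then `iterations` times timed,
// and returns the median wall-clock duration in milliseconds.
func medianMilliseconds(
    warmup: Int = 3, iterations: Int = 20, _ body: () -> Void
) -> Double {
    for _ in 0..<warmup { body() }
    var samples: [Double] = []
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        body()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)
    }
    return median(samples)
}
```

The speedup column is then the ratio of the two medians for the same shape, for example 0.605 / 0.315 for decode_T1.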

Reproduce:

xcodebuild test -scheme mlx-swift-lm-Package \
  -destination 'platform=macOS,arch=arm64' \
  -only-testing:MLXLMTests/GatedDeltaBenchTests \
  -skipMacroValidation

(Running the benchmark through SPM's swift test is unsupported because the Metal kernel needs the .metallib bundle that only xcodebuild produces.)

End-to-end (M4 Max, 128 GB, `mlx-community/Qwen3.5-0.8B-MLX-4bit`)

The 0.8B unified VLM checkpoint has 24 transformer blocks; 18 of them are linear_attention (gated-delta) per its text_config.layer_types and 6 are full_attention. With max-tokens 64, temperature 0, seed 42, prompt "What is the capital of Japan?" (24 tokens), 6 trials per branch with the first dropped as warmup:

| build | prompt median (tok/s) | generation median (tok/s) |
|---|---|---|
| Swift VLM, baseline (origin/main + #143 sanitize fix only, ops fallback) | 1377.42 | 164.67 |
| Swift VLM, this PR (sanitize fix + kernel fix) | 2125.60 | 355.95 |
| Python `mlx_lm` 0.31.3 (kernel path) | 669.80 | 213.59 |
| Swift speedup over baseline | 1.54× | 2.16× |

The end-to-end improvement (1.54× prompt, 2.16× generation) is far smaller than the larger microbench speedups (up to 59.69×) because the 0.8B model still spends a meaningful fraction of each forward pass in non-linear_attn work (full-attention blocks, MoE / MLP, RMS norm, embed). The kernel fuses only the per-token cost on linear_attn blocks, so the end-to-end gain scales with the share of time spent in those blocks. Models with a higher linear_attn ratio (the larger 35B-A3B / 122B-A10B Qwen 3.5 unified VLMs) should see proportionally larger wins.
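The scaling argument can be sketched with an Amdahl-style estimate. The `endToEndSpeedup` helper and the example fraction and speedup passed to it are illustrative assumptions, not measured values; only the two ratios at the bottom come from the end-to-end table.

```swift
// Amdahl-style estimate: `fraction` of the baseline time is accelerated
// by a factor of `speedup`; the remaining time is unchanged.
func endToEndSpeedup(fraction: Double, speedup: Double) -> Double {
    1.0 / ((1.0 - fraction) + fraction / speedup)
}

// Hypothetical example: half of the forward pass sped up 8x.
let illustrative = endToEndSpeedup(fraction: 0.5, speedup: 8)

// Ratios taken from the end-to-end table.
let promptGain = 2125.60 / 1377.42      // about 1.54
let generationGain = 355.95 / 164.67    // about 2.16

print(illustrative, promptGain, generationGain)
```

As the accelerated fraction grows, the estimate approaches the kernel-only speedup, which is the reasoning behind expecting larger wins on models where linear_attn work dominates.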

Greedy output is identical between the two Swift builds:

The capital of Japan is Tokyo (also written as 東京 in Japanese).

Tokyo is the largest city in the country, located in the central part of the country, …

(Python emits the same text minus the bold markers because it doesn't go through Swift's markdown wrapper.)

davidkoski · 2026-05-14T18:00:17Z

+// Gated-delta helpers (`gatedDeltaUpdate`, `gatedDeltaOps`, `gatedDeltaKernel`,
+// `computeGatedDeltaG`) are shared with MLXLLM via `MLXLMCommon/GatedDelta.swift`,
+// which selects the fused Metal kernel when available and falls back to the ops
+// path otherwise. Keeping a single implementation also keeps both paths in sync
+// when the upstream Python kernel evolves.


We don't need the comment for removed code.

Done — the duplicate helpers are just deleted now, nothing left in their place.

davidkoski · 2026-05-14T18:02:19Z

+import MLXLMCommon
+import XCTest
+
+final class GatedDeltaBenchTests: XCTestCase {


I don't think we need benchmark tests, though if you did want to add it, look in IntegrationTesting -- that is a more appropriate place.

Dropped the benchmark test file. The microbench and end-to-end numbers stay in the PR description as the supporting evidence.

davidkoski

See comments. This needs a rebase to pull the updated GatedDelta code.

Thanks!

…ore#124)

`GatedDelta.swift` defines a fused `gatedDeltaKernel` (custom Metal kernel, single threadgroup over the time loop) plus a `gatedDeltaOps` fallback. `gatedDeltaUpdate` prefers the kernel when MLX Metal is available, which is the only path the LLM-side Qwen 3.5 / Qwen 3 Next models take on Apple Silicon.

The VLM-side `MLXVLM/Models/Qwen35.swift` was a copy/paste of the same code with the kernel branch omitted, so every VLM Qwen 3.5 checkpoint with `linear_attn` ran the unfused per-step expression-graph version, generating far more intermediates per token.

Move the shared file to `MLXLMCommon` and mark `gatedDeltaUpdate` public; its helpers stay internal since nothing outside the file calls them. Delete the duplicate helpers from `MLXVLM/Models/Qwen35.swift` so the VLM call site goes through the same kernel-preferred dispatch as the LLM side. `Tests/MLXLMTests/GatedDeltaTests.swift` imports `MLXLMCommon` instead of `@testable import MLXLLM`, following the move.

No behavior change for MLXLLM (same code in a new module); MLXVLM Qwen 3.5 with `linear_attn` now uses the fused kernel.

john-rocky · 2026-05-14T19:11:08Z

Thanks for the review! Addressed all three points:

Rebased onto current main (5b7e543). This picks up #224 (fix gated delta state precision -- fp32 state to match Python), so the moved GatedDelta.swift carries the fp32-state code rather than the pre-#224 version.
Removed the comment for the deleted code in MLXVLM/Models/Qwen35.swift — the duplicate helpers are just gone now.
Dropped the benchmark test (GatedDeltaBenchTests.swift). The microbench / end-to-end numbers stay in the PR description as the supporting evidence.

Two things worth flagging since they're new in this diff:

Only gatedDeltaUpdate is public now — its helpers (computeGatedDeltaG, gatedDeltaKernel, gatedDeltaOps) stay internal, since nothing outside GatedDelta.swift calls them. The earlier revision over-exposed all four.
Tests/MLXLMTests/GatedDeltaTests.swift (added in #224) switches from @testable import MLXLLM to import MLXLMCommon, following the file's move. It only references the now-public gatedDeltaUpdate.

Net is +2 / −132 across 3 files (GatedDelta.swift is a 99%-similarity rename). swift build and swift build --build-tests both pass clean on main@5b7e543.

davidkoski reviewed May 14, 2026

davidkoski requested changes May 14, 2026

john-rocky force-pushed the perf/qwen35-vlm-gated-delta-kernel branch from 90ba4ce to 46f46f8 Compare May 14, 2026 19:10

john-rocky wants to merge 1 commit into ml-explore:main from john-rocky:perf/qwen35-vlm-gated-delta-kernel
