Share the fused gated-delta kernel between MLXLLM and MLXVLM#257

Open
john-rocky wants to merge 1 commit into
ml-explore:mainfrom
john-rocky:perf/qwen35-vlm-gated-delta-kernel

@john-rocky john-rocky commented May 1, 2026

Summary

Libraries/MLXLLM/Models/GatedDelta.swift defines a fused gatedDeltaKernel (custom Metal kernel, single threadgroup walking the time loop) and a gatedDeltaOps expression-graph fallback. gatedDeltaUpdate prefers the kernel when MLX Metal is available, which is the only path the LLM-side Qwen 3.5 / Qwen 3 Next models take on Apple Silicon.

MLXVLM/Models/Qwen35.swift was a copy/paste of the same code with the kernel branch removed:

private func gatedDeltaUpdate(...) {
    ...
    return gatedDeltaOps(...)   // ops fallback only
}

So every VLM Qwen 3.5 checkpoint with linear_attn was running the unfused per-step expression-graph version, generating far more intermediates per token. This PR is a standalone fix for that missing branch on the VLM side. For context on the broader #124 perf discussion, see my profiling comment there: the 35B-A3B-4bit text-only MoE no longer reproduces a gap on current mlx-swift 0.31.3 (mlx-c 0.31.1) builds. The missing VLM-side kernel branch fixed here is independent of that finding.
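To make "per-step" concrete, here is a minimal scalar sketch of one commonly used form of the gated-delta recurrence for a single head. It is an assumption for illustration, not the code in GatedDelta.swift: the type `GatedDeltaStep` and the function `gatedDeltaReference` are hypothetical names, and the real implementation works on batched MLX arrays. Each numbered comment roughly corresponds to a separate intermediate that an unfused graph would materialize per token, whereas a fused kernel can keep the state local while walking the time loop.

```swift
// Hypothetical single-head reference; not the repository's implementation.
struct GatedDeltaStep {
    var q: [Float]   // query, length Dk
    var k: [Float]   // key, length Dk
    var v: [Float]   // value, length Dv
    var g: Float     // decay gate for this step
    var beta: Float  // write strength for this step
}

/// Walks the time loop once, updating `state` (Dv rows of Dk entries) in place
/// and returning one output vector of length Dv per step.
func gatedDeltaReference(steps: [GatedDeltaStep], state: inout [[Float]]) -> [[Float]] {
    var outputs: [[Float]] = []
    for step in steps {
        var y = [Float](repeating: 0, count: state.count)
        for i in state.indices {
            // 1. Decay the previous state.
            for j in state[i].indices { state[i][j] *= step.g }
            // 2. Read what the memory currently predicts for this key.
            var kvMem: Float = 0
            for j in state[i].indices { kvMem += state[i][j] * step.k[j] }
            // 3. Error between the new value and the prediction, scaled by beta.
            let delta = (step.v[i] - kvMem) * step.beta
            // 4. Rank-1 update of the state, then 5. read out with the query.
            for j in state[i].indices {
                state[i][j] += step.k[j] * delta
                y[i] += state[i][j] * step.q[j]
            }
        }
        outputs.append(y)
    }
    return outputs
}
```

With a 1×1 zero state, a single step with q = k = [1], v = [2], g = 1, beta = 1 writes 2 into the state and reads 2 back out.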

Change

  • Move Libraries/MLXLLM/Models/GatedDelta.swift → Libraries/MLXLMCommon/GatedDelta.swift.
  • Make computeGatedDeltaG, gatedDeltaKernel, gatedDeltaOps, gatedDeltaUpdate public so both LLM and VLM call through MLXLMCommon.
  • Delete the duplicate computeGatedDeltaG / gatedDeltaStepOps / gatedDeltaOps / gatedDeltaUpdate helpers from MLXVLM/Models/Qwen35.swift. The single VLM call site now resolves to the shared public function with the kernel-preferred dispatch.
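The kernel-preferred dispatch that both modules now share can be sketched in a few lines. This is a simplified model under stated assumptions: `GatedDeltaDispatch`, `GatedDeltaPath`, and `StepFunction` are hypothetical names, the inputs are plain `[Float]` rather than MLX arrays, and the `metalAvailable` flag stands in for whatever availability check the real `gatedDeltaUpdate` performs.

```swift
// Hypothetical model of a single entry point with a preferred fast path.
enum GatedDeltaPath { case fusedKernel, opsFallback }

typealias StepFunction = ([Float]) -> [Float]

struct GatedDeltaDispatch {
    let kernel: StepFunction  // stand-in for the fused Metal kernel
    let ops: StepFunction     // stand-in for the expression-graph fallback

    /// Prefers the kernel when Metal is available, otherwise falls back to ops.
    /// Both paths are expected to produce the same result.
    func update(_ input: [Float], metalAvailable: Bool) -> (path: GatedDeltaPath, output: [Float]) {
        if metalAvailable {
            return (.fusedKernel, kernel(input))
        }
        return (.opsFallback, ops(input))
    }
}
```

The duplicated VLM helper behaved like `update` with the first branch deleted, so it always returned the fallback result. Keeping one entry point means a caller cannot silently lose the fast path again.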

Net (excluding the bench test): +10 / −132. No behavior change for MLXLLM (same code, new home); MLXVLM Qwen 3.5 with linear_attn now hits the fused Metal kernel.

Microbench (M4 Max, 128 GB, bfloat16)

Tests/MLXLMTests/GatedDeltaBenchTests.swift constructs realistic Qwen 3.5 linear-attn input shapes (B=1, Hk=16, Hv=64, Dk=192, Dv=128) at four representative T values and times gatedDeltaKernel vs gatedDeltaOps over 20 iterations after warmup.

| shape | T | gatedDeltaOps median (ms) | gatedDeltaKernel median (ms) | speedup |
| --- | --- | --- | --- | --- |
| decode_T1 | 1 | 0.605 | 0.315 | 1.92× |
| decode_T8 | 8 | 2.909 | 0.369 | 7.88× |
| prefill_T64 | 64 | 20.255 | 0.799 | 25.35× |
| prefill_T256 | 256 | 87.577 | 1.467 | 59.69× |
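The timing protocol described above (warmup, then a median over a fixed number of iterations) can be sketched as follows. This is not the dropped GatedDeltaBenchTests.swift: `median` and `medianMilliseconds` are hypothetical helpers, and the default warmup count of 3 is an assumption, since the description only says "after warmup". Because MLX evaluates lazily, the timed body has to force evaluation of its result, otherwise only graph construction is measured.

```swift
import Dispatch

/// Median of a non-empty sample; averages the two middle values for even counts.
func median(_ values: [Double]) -> Double {
    precondition(!values.isEmpty, "median of an empty sample is undefined")
    let sorted = values.sorted()
    let mid = sorted.count / 2
    return sorted.count % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2
}

/// Runs `body` `warmup` times untimed, then `iterations` times timed,
/// and returns the median wall-clock time in milliseconds.
func medianMilliseconds(warmup: Int = 3, iterations: Int = 20, _ body: () -> Void) -> Double {
    for _ in 0..<warmup { body() }
    var samples: [Double] = []
    for _ in 0..<iterations {
        let start = DispatchTime.now().uptimeNanoseconds
        body()  // with a lazy framework, the body must force evaluation here
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)
    }
    return median(samples)
}
```

The speedup column is then the ops median divided by the kernel median for the same shape. The median is used rather than the mean so that a single slow iteration does not skew the result.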

Reproduce:

xcodebuild test -scheme mlx-swift-lm-Package \
  -destination 'platform=macOS,arch=arm64' \
  -only-testing:MLXLMTests/GatedDeltaBenchTests \
  -skipMacroValidation

(SPM swift test is unsupported because the Metal kernel needs the .metallib bundle that only xcodebuild produces.)

End-to-end (M4 Max, 128 GB, mlx-community/Qwen3.5-0.8B-MLX-4bit)

The 0.8B unified VLM checkpoint has 24 transformer blocks; 18 of them are linear_attention (gated-delta) per its text_config.layer_types and 6 are full_attention. With max-tokens 64, temperature 0, seed 42, prompt "What is the capital of Japan?" (24 tokens), 6 trials per branch with the first dropped as warmup:

| build | prompt median (tok/s) | generation median (tok/s) |
| --- | --- | --- |
| Swift VLM, baseline (origin/main + #143 sanitize fix only, ops fallback) | 1377.42 | 164.67 |
| Swift VLM, this PR (sanitize fix + kernel fix) | 2125.60 | 355.95 |
| Python mlx_lm 0.31.3 (kernel path) | 669.80 | 213.59 |
| Swift speedup over baseline | 1.54× | 2.16× |

The end-to-end improvement (1.54× prompt / 2.16× generation) is smaller than the microbench range (1.92× to 59.69×) because the 0.8B model still spends a meaningful fraction of each forward pass in non-linear_attn work (full-attention blocks, MoE / MLP, RMS norm, embedding). The kernel fuses only the per-token cost of the linear_attn blocks, so the end-to-end gain scales with the share of time spent in those blocks. Models with a higher linear_attn ratio (the larger 35B-A3B / 122B-A10B Qwen 3.5 unified VLMs) should see proportionally larger wins.
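The scaling argument is an instance of Amdahl's law: if a fraction f of the runtime is accelerated by a factor s, the overall speedup is 1 / ((1 − f) + f / s). The sketch below only encodes that formula. The function name `endToEndSpeedup` is hypothetical and the inputs in the usage note are illustrative values, not measurements from this PR.

```swift
/// Amdahl's law: overall speedup when a fraction `f` of the runtime
/// (0...1) is made `s` times faster and the rest is unchanged.
func endToEndSpeedup(acceleratedFraction f: Double, localSpeedup s: Double) -> Double {
    precondition((0...1).contains(f), "fraction must be in 0...1")
    precondition(s > 0, "speedup must be positive")
    return 1 / ((1 - f) + f / s)
}
```

For example, accelerating half of the runtime by 2× gives an overall speedup of about 1.33×, and accelerating it by an arbitrarily large factor can never exceed 2×. A larger accelerated fraction raises that ceiling.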

Greedy output is identical between the two Swift builds:

The capital of Japan is Tokyo (also written as 東京 in Japanese).

Tokyo is the largest city in the country, located in the central part of the country, …

(Python emits the same text minus the bold markers because it doesn't go through Swift's markdown wrapper.)

Comment thread Libraries/MLXVLM/Models/Qwen35.swift Outdated
Comment on lines +19 to +23
// Gated-delta helpers (`gatedDeltaUpdate`, `gatedDeltaOps`, `gatedDeltaKernel`,
// `computeGatedDeltaG`) are shared with MLXLLM via `MLXLMCommon/GatedDelta.swift`,
// which selects the fused Metal kernel when available and falls back to the ops
// path otherwise. Keeping a single implementation also keeps both paths in sync
// when the upstream Python kernel evolves.
Collaborator

We don't need the comment for removed code.

Contributor Author

Done — the duplicate helpers are just deleted now, nothing left in their place.

import MLXLMCommon
import XCTest

final class GatedDeltaBenchTests: XCTestCase {
Collaborator

I don't think we need benchmark tests, though if you did want to add it, look in IntegrationTesting -- that is a more appropriate place.

Contributor Author

Dropped the benchmark test file. The microbench and end-to-end numbers stay in the PR description as the supporting evidence.

Collaborator

@davidkoski davidkoski left a comment

See comments. This needs a rebase to pull the updated GatedDelta code.

Thanks!

…ore#124)

`GatedDelta.swift` defines a fused `gatedDeltaKernel` (custom Metal
kernel, single threadgroup over the time loop) plus a `gatedDeltaOps`
fallback. `gatedDeltaUpdate` prefers the kernel when MLX Metal is
available, which is the only path the LLM-side Qwen 3.5 / Qwen 3 Next
models take on Apple Silicon.

The VLM-side `MLXVLM/Models/Qwen35.swift` was a copy/paste of the same
code with the kernel branch omitted, so every VLM Qwen 3.5 checkpoint
with `linear_attn` ran the unfused per-step expression-graph version,
generating far more intermediates per token.

Move the shared file to `MLXLMCommon` and mark `gatedDeltaUpdate`
public; its helpers stay internal since nothing outside the file
calls them. Delete the duplicate helpers from
`MLXVLM/Models/Qwen35.swift` so the VLM call site goes through the
same kernel-preferred dispatch as the LLM side.
`Tests/MLXLMTests/GatedDeltaTests.swift` imports `MLXLMCommon` instead
of `@testable import MLXLLM`, following the move.

No behavior change for MLXLLM (same code in a new module); MLXVLM
Qwen 3.5 with `linear_attn` now uses the fused kernel.
@john-rocky john-rocky force-pushed the perf/qwen35-vlm-gated-delta-kernel branch from 90ba4ce to 46f46f8 Compare May 14, 2026 19:10
@john-rocky
Contributor Author

Thanks for the review! Addressed all three points:

Two things worth flagging since they're new in this diff:

  • Only gatedDeltaUpdate is public now — its helpers (computeGatedDeltaG, gatedDeltaKernel, gatedDeltaOps) stay internal, since nothing outside GatedDelta.swift calls them. The earlier revision over-exposed all four.
  • Tests/MLXLMTests/GatedDeltaTests.swift (added in #224, "fix gated delta state precision -- fp32 state to match Python") switches from @testable import MLXLLM to import MLXLMCommon, following the file's move; it only references the now-public gatedDeltaUpdate.

Net is +2 / −132 across 3 files (GatedDelta.swift is a 99%-similarity rename). swift build and swift build --build-tests both pass clean on main@5b7e543.
