ggml-webgpu: Add fused RMS_NORM + MUL #21983
Conversation
@reeselevine This PR may conflict with #21873 (especially in
8fd976c to 5fef017
reeselevine left a comment:
Thanks for starting to work on fusion, this is a big step!
The performance not changing much is a little disappointing, but it is not a blocker either. Once we get the fusion format working we can optimize. I wonder if the reason for the lack of improvement is simply that the current RMS_NORM is not very well optimized, so the reduction in bandwidth ends up being hidden because RMS_NORM itself is too slow.
Do you know if this fusion path leads to significant performance gains in other backends?
```cpp
size_t memset_bytes_per_thread;

bool disable_fusion;
uint32_t num_additional_fused_ops;
```
This general structure comes from the Vulkan backend, right? I haven't looked into it too closely, but my first thought is that it seems too general, at least based on this PR: you end up having to check which ops you are actually fusing, and this structure doesn't encode that at all.
```cpp
static bool ggml_webgpu_can_fuse_check(webgpu_context & ctx, const struct ggml_cgraph * cgraph, int node_idx) {
    // RMS_NORM + MUL
    if (ggml_webgpu_can_fuse(cgraph, node_idx, { GGML_OP_RMS_NORM, GGML_OP_MUL })) {
```
Under the hood, can_fuse ends up repeating the check on RMS_NORM + MUL, so shouldn't we have a separate function for each set of potential fusions?
```cpp
if (!ctx->disable_fusion) {
    ggml_webgpu_can_fuse_check(ctx, cgraph, i);
}
if (auto cmd = ggml_webgpu_encode_node(ctx, cgraph->nodes, i)) {
```
Right now, whether encode_node actually encodes is hidden behind whether the next node will be fused.
Instead, what do you think of just calling encode_node (maybe the function name should change to encode, to reflect that it might encode multiple nodes) and updating i based on the number of fused operations? For the new RMS_NORM + MUL, we would end up updating i by 2. That avoids hiding the fusion in the num_additional_fused_ops variable. That seems cleaner to me for now, but maybe I'm missing something that doesn't translate well to future fusions?
```cpp
(uint32_t) dst->ne[1],
(uint32_t) dst->ne[2],
(uint32_t) dst->ne[3],
*(uint32_t *) rn_dst->op_params // epsilon, treated as f32 in the shader
```
This leads to compiler warnings and will fail when the new ggml-webgpu-nvidia-ci is enabled; I moved to a new format.
Overview
This PR adds initial kernel fusion to the WebGPU backend, starting with RMS_NORM + MUL (similar to #14800).
The performance on major models on my device (M2, Metal 4) is as follows; unfortunately, it is almost unchanged with this implementation.
The command is like this:

```
llama-bench -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -fa 1 -p 512 -n 0
```