
ggml-webgpu: Add fused RMS_NORM + MUL#21983

Open
yomaytk wants to merge 2 commits into ggml-org:master from yomaytk:fused-rms-norm-mul

Conversation

@yomaytk (Contributor) commented Apr 16, 2026

Overview

This PR adds initial kernel fusion to the WebGPU backend, fusing RMS_NORM + MUL (similar to #14800).
Performance on the major models on my device (M2, Metal 4) is shown below; unfortunately, performance is almost unchanged with this implementation.

The command is like this:

```shell
llama-bench -m Llama-3.2-3B-Instruct-Q4_K_M.gguf -fa 1 -p 512 -n 0
```

| Model | Test | t/s master (a620695) | t/s yomaytk/fused-rms-norm-mul | Speedup |
| --- | --- | --- | --- | --- |
| gemma4 E4B Q4_K_M | pp1 | 32.09 | 31.94 | 0.995 |
| gemma4 E4B Q4_K_M | pp512 | 464.23 | 467.15 | 1.006 |
| gemma4 E4B Q4_K_M | tg1 | 31.99 | 32.46 | 1.015 |
| gemma4 E4B Q4_K_M | tg128 | 32.50 | 32.85 | 1.011 |
| qwen35 4B Q4_K_M | pp1 | 30.88 | 30.84 | 0.999 |
| qwen35 4B Q4_K_M | pp512 | 479.07 | 480.75 | 1.004 |
| qwen35 4B Q4_K_M | tg1 | 30.99 | 30.90 | 0.997 |
| qwen35 4B Q4_K_M | tg128 | 31.12 | 31.35 | 1.007 |
| llama3.2 3B Q4_K_M | pp1 | 52.51 | 53.57 | 1.020 |
| llama3.2 3B Q4_K_M | pp512 | 670.86 | 676.70 | 1.009 |
| llama3.2 3B Q4_K_M | tg1 | 53.40 | 53.83 | 1.008 |
| llama3.2 3B Q4_K_M | tg128 | 54.03 | 54.71 | 1.013 |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES, I used AI to investigate the kernel fusion of Vulkan and CUDA backend

@yomaytk yomaytk requested a review from a team as a code owner April 16, 2026 04:53
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning WebGPU labels Apr 16, 2026
@yomaytk (Contributor, Author) commented Apr 16, 2026

@reeselevine This PR will conflict with #21873 (especially in `ggml_backend_webgpu_graph_compute`), so I will update accordingly after that PR is merged.

@yomaytk yomaytk changed the title ggml-webgpu: Add the support of fused RMS_NORM + MUL ggml-webgpu: Add fused RMS_NORM + MUL Apr 16, 2026
@yomaytk yomaytk force-pushed the fused-rms-norm-mul branch from 8fd976c to 5fef017 Compare April 17, 2026 02:22
@reeselevine (Contributor) left a comment
Thanks for starting to work on fusion, this is a big step!

The performance not changing much is a little disappointing, but also not a blocker. Once we get the fusion format working we can optimize. I wonder if the reason for the lack of improvement is simply that the current RMS_NORM is not very well optimized, so the reduction in bandwidth ends up being hidden because RMS_NORM is too slow.

Do you know if this fusion path leads to significant performance gains in other backends?

```cpp
size_t memset_bytes_per_thread;

bool disable_fusion;
uint32_t num_additional_fused_ops;
```
This general structure comes from the Vulkan backend, right? I haven't looked into it too closely, but my first thought is that it seems too general, at least based on this PR: you end up having to check which ops you are actually fusing, and this counter doesn't encode that at all.


```cpp
static bool ggml_webgpu_can_fuse_check(webgpu_context & ctx, const struct ggml_cgraph * cgraph, int node_idx) {
    // RMS_NORM + MUL
    if (ggml_webgpu_can_fuse(cgraph, node_idx, { GGML_OP_RMS_NORM, GGML_OP_MUL })) {
```

Under the hood, can_fuse ends up repeating the if condition on RMS_NORM + MUL, so shouldn't we really have separate functions for each set of potential fusions?

```cpp
if (!ctx->disable_fusion) {
    ggml_webgpu_can_fuse_check(ctx, cgraph, i);
}
if (auto cmd = ggml_webgpu_encode_node(ctx, cgraph->nodes, i)) {
```

Right now we're hiding whether encode_node actually encodes based on whether the next node will be fused.

Instead, what do you think of just calling encode node (maybe the function name should change to encode, to reflect that it might encode multiple nodes) and updating i based on the number of fused operations? So for the new RMS_NORM + MUL, we end up updating i by 2. That avoids hiding the fusion in the additional_fused_ops variable. That seems cleaner to me for now, but maybe I'm missing something that doesn't translate well to future fusions?

```cpp
(uint32_t) dst->ne[1],
(uint32_t) dst->ne[2],
(uint32_t) dst->ne[3],
*(uint32_t *) rn_dst->op_params  // epsilon, treated as f32 in the shader
```
this leads to compiler warnings and will fail when the new ggml-webgpu-nvidia-ci is enabled; I moved to a new format.
