
hexagon: add SOLVE_TRI op #21974

Draft
mengshengwu wants to merge 3 commits into ggml-org:master from qualcomm:hexagon-solve-tri-op

Conversation

@mengshengwu
Contributor

Overview

This PR adds SOLVE_TRI op support for Hexagon, using HVX to accelerate the calculation.

  • All tests pass with test-backend-ops.
  • Profiled the kernel on both the CPU and HTP0 backends:
| Shape | CPU runs | CPU time / run | Work / run | CPU throughput | HTP0 runs | HTP0 time / run | Work / run | HTP0 throughput |
|---|---|---|---|---|---|---|---|---|
| ne_lhs=[64,64,4,4], ne_rhs=[32,64,4,4] | 11271 | 128.73 us | 2.13 MFLOP | 16.55 GFLOPS | 16380 | 81.78 us | 2.13 MFLOP | 26.04 GFLOPS |
| ne_lhs=[128,128,4,2], ne_rhs=[32,128,4,2] | 3786 | 275.89 us | 4.23 MFLOP | 15.32 GFLOPS | 8190 | 144.31 us | 4.23 MFLOP | 29.29 GFLOPS |
| ne_lhs=[64,64,8,32], ne_rhs=[64,64,8,32] | 354 | 4008.46 us | 68.16 MFLOP | 17.00 GFLOPS | 1468 | 8059.77 us | 68.16 MFLOP | 8.46 GFLOPS |
| ne_lhs=[128,128,4,32], ne_rhs=[128,128,4,32] | 60 | 17616.20 us | 270.53 MFLOP | 15.36 GFLOPS | 370 | 17405.15 us | 270.53 MFLOP | 15.54 GFLOPS |
| ne_lhs=[256,256,4,2], ne_rhs=[128,256,4,2] | 238 | 7489.87 us | 67.37 MFLOP | 8.99 GFLOPS | 1485 | 2717.44 us | 67.37 MFLOP | 24.79 GFLOPS |

Perf summary

  • HTP0 is faster on 64x64, rhs=32: 26.04 vs 16.55 GFLOPS.
  • HTP0 is faster on 128x128, rhs=32: 29.29 vs 15.32 GFLOPS.
  • CPU is faster on 64x64, rhs=64, batch=8x32: 17.00 vs 8.46 GFLOPS.
  • HTP0 is slightly faster on 128x128, rhs=128, batch=4x32: 15.54 vs 15.36 GFLOPS.
  • HTP0 is faster on 256x256, rhs=128: 24.79 vs 8.99 GFLOPS.

I also tested this with a full model on device (Qwen3.5-0.8B-Q4_K_M.gguf). Here are representative tensor-shape samples from the delta-net chunked path around SOLVE_TRI:

common_debug_cb_eval:        attn_pre_solve-21 = (f32) NEG(HTP0#attn-21#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-21 = (f32) ADD(node_2574{64, 64, 1, 16}, HTP0#node_2571#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

common_debug_cb_eval:       dnet_add_ch_lhs-22 = (f32) ADD(HTP0#attn-22#0{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}
common_debug_cb_eval:        attn_pre_solve-22 = (f32) NEG(HTP0#attn-22#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-22 = (f32) ADD(node_2717{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, I used AI to learn the HVX API, to study algorithms for solving triangular matrix equations, and to review my code.

@mengshengwu mengshengwu requested review from a team and ggerganov as code owners April 16, 2026 01:05
@mengshengwu mengshengwu changed the title Hexagon solve tri op hexagon: add SOLVE_TRI op Apr 16, 2026
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Hexagon labels Apr 16, 2026
@mengshengwu mengshengwu marked this pull request as draft April 16, 2026 03:45
@mengshengwu
Contributor Author

Waiting for DMA optimization.
