
hexagon: add SOLVE_TRI op #21974

Draft
mengshengwu wants to merge 3 commits into ggml-org:master from qualcomm:hexagon-solve-tri-op

Conversation

@mengshengwu
Contributor

Overview

This PR adds SOLVE_TRI op support for Hexagon, using HVX to accelerate the calculation.

  • All tests pass with test-backend-ops.
  • Profiled the kernel on both the CPU and HTP0 backends:
| Shape | CPU runs | CPU time / run | Work / run | CPU throughput | HTP0 runs | HTP0 time / run | Work / run | HTP0 throughput |
|---|---|---|---|---|---|---|---|---|
| ne_lhs=[64,64,4,4], ne_rhs=[32,64,4,4] | 11271 | 128.73 us | 2.13 MFLOP | 16.55 GFLOPS | 16380 | 81.78 us | 2.13 MFLOP | 26.04 GFLOPS |
| ne_lhs=[128,128,4,2], ne_rhs=[32,128,4,2] | 3786 | 275.89 us | 4.23 MFLOP | 15.32 GFLOPS | 8190 | 144.31 us | 4.23 MFLOP | 29.29 GFLOPS |
| ne_lhs=[64,64,8,32], ne_rhs=[64,64,8,32] | 354 | 4008.46 us | 68.16 MFLOP | 17.00 GFLOPS | 1468 | 8059.77 us | 68.16 MFLOP | 8.46 GFLOPS |
| ne_lhs=[128,128,4,32], ne_rhs=[128,128,4,32] | 60 | 17616.20 us | 270.53 MFLOP | 15.36 GFLOPS | 370 | 17405.15 us | 270.53 MFLOP | 15.54 GFLOPS |
| ne_lhs=[256,256,4,2], ne_rhs=[128,256,4,2] | 238 | 7489.87 us | 67.37 MFLOP | 8.99 GFLOPS | 1485 | 2717.44 us | 67.37 MFLOP | 24.79 GFLOPS |

Perf summary

  • HTP0 is faster on 64x64, rhs=32: 26.04 vs 16.55 GFLOPS.
  • HTP0 is faster on 128x128, rhs=32: 29.29 vs 15.32 GFLOPS.
  • CPU is faster on 64x64, rhs=64, batch=8x32: 17.00 vs 8.46 GFLOPS.
  • HTP0 is slightly faster on 128x128, rhs=128, batch=4x32: 15.54 vs 15.36 GFLOPS.
  • HTP0 is faster on 256x256, rhs=128: 24.79 vs 8.99 GFLOPS.

I also tested this with a full model on device (Qwen3.5-0.8B-Q4_K_M.gguf). Here are representative tensor-shape samples from the delta-net chunked path around SOLVE_TRI:

common_debug_cb_eval:        attn_pre_solve-21 = (f32) NEG(HTP0#attn-21#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-21 = (f32) ADD(node_2574{64, 64, 1, 16}, HTP0#node_2571#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

common_debug_cb_eval:       dnet_add_ch_lhs-22 = (f32) ADD(HTP0#attn-22#0{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}
common_debug_cb_eval:        attn_pre_solve-22 = (f32) NEG(HTP0#attn-22#0{64, 64, 1, 16}, }) = {64, 64, 1, 16}
common_debug_cb_eval: dnet_add_ch_attn_solved-22 = (f32) ADD(node_2717{64, 64, 1, 16}, HTP0#node_2714#0{64, 64, 1, 1}}) = {64, 64, 1, 16}

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure:
    YES, I used AI to learn the HVX API, to study algorithms for solving triangular matrix equations, and to review my code.

@mengshengwu mengshengwu requested review from a team and ggerganov as code owners April 16, 2026 01:05
@mengshengwu mengshengwu changed the title Hexagon solve tri op hexagon: add SOLVE_TRI op Apr 16, 2026
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Hexagon labels Apr 16, 2026
@mengshengwu mengshengwu marked this pull request as draft April 16, 2026 03:45
@mengshengwu
Contributor Author

Waiting for DMA optimization.
