Skip to content

Add chipStar (SPIR-V) support for HIP backends#1942

Open
pvelesko wants to merge 5 commits intoCEED:mainfrom
CHIP-SPV:chipStar
Open

Add chipStar (SPIR-V) support for HIP backends#1942
pvelesko wants to merge 5 commits intoCEED:mainfrom
CHIP-SPV:chipStar

Conversation

@pvelesko
Copy link
Copy Markdown

@pvelesko pvelesko commented Mar 17, 2026

Purpose:

Add support for chipStar, enabling libCEED HIP backends (/gpu/hip/ref, /gpu/hip/shared, /gpu/hip/gen) to run on any OpenCL or Level Zero GPU (Intel, etc.) via SPIR-V.

Changes

  • Makefile: Detect chipStar HIP platform (__HIP_PLATFORM_SPIRV__), handle dynamic lib name (libCHIP.so vs libamdhip64.so), filter clang-only flags from HIPCONFIG_CPPFLAGS when using gcc as CC, use SYCLCXX as linker when SYCL backend is also enabled
  • HIP kernel syncthreads fix (#ifdef __HIP_PLATFORM_SPIRV__): chipStar SPIR-V target requires all threads in a workgroup to reach __syncthreads() barriers. Kernel element loops are refactored to a single-assignment + guard pattern on chipStar, while preserving the original for-loop structure on AMD ROCm/NVIDIA
    • hip-ref-basis-nontensor.h: 5 kernel functions
    • hip-shared-basis-tensor.h: 13 kernel functions
    • ceed-hip-gen-operator-build.cpp: code-generator emits conditional kernel source
  • Test infrastructure: Handle chipStar runtime stderr warnings in JUnit test runner; use errors='replace' for binary data in subprocess output

Benchmarks: HIP (chipStar) vs SYCL (oneAPI) on Intel Arc A770

chipStar main + hiprtc output caching. oneAPI 2024.2.2 + Level Zero. 5M DOFs, polynomial degree p=4, warm JIT cache.

Testing was done with SYCL Caching PR #1943 otherwise chipStar outperforms SYCL much more due to JIT costs.

ex1-volume (mass operator)

Backend Dim HIP (s) SYCL (s) SYCL/HIP
ref 2D 0.77 1.17 1.52x
shared 2D 0.76 0.99 1.30x
gen 2D 0.61 0.80 1.31x
ref 3D 0.97 1.30 1.34x
shared 3D 0.98 1.24 1.27x
gen 3D 0.81 1.06 1.31x

ex2-surface (surface integral)

Backend Dim HIP (s) SYCL (s) SYCL/HIP
ref 2D 0.88 1.22 1.39x
shared 2D 0.86 1.12 1.30x
gen 2D 0.70 0.91 1.30x
ref 3D 1.20 1.38 1.15x
shared 3D 1.19 1.49 1.25x
gen 3D 1.02 1.29 1.26x

chipStar HIP is 15-52% faster than native SYCL across all backends.

LLM/GenAI Disclosure:

Claude Code was used to rebase on latest main (bulk of this work was done in 2024/2025), write the #ifdef __HIP_PLATFORM_SPIRV__ preprocessor guards across HIP kernel files and the code-generator, and to draft this PR description.

By submitting this PR, the author certifies to its contents as described by the Developer's Certificate of Origin.
Please follow the Contributing Guidelines for all PRs.

@jeremylt
Copy link
Copy Markdown
Member

jeremylt commented Mar 17, 2026

Note: Please use the libCEED PR template, including the LLM usage declaration. Thanks (requirement in CONTRIBUTING.md)

@pvelesko pvelesko marked this pull request as ready for review April 8, 2026 05:40
@pvelesko
Copy link
Copy Markdown
Author

pvelesko commented Apr 8, 2026

@jeremylt When will the CI run for this PR?

@jeremylt
Copy link
Copy Markdown
Member

jeremylt commented Apr 8, 2026

We have to manually trigger any jobs from forks, so just let us know when you want to run CI. I just approved CI running for the current push.

pvelesko added 2 commits April 8, 2026 17:32
Detect HIP platform at build time via hipconfig output:
- __HIP_PLATFORM_SPIRV__ → HIP_LIB_NAME=CHIP (chipStar)
- __HIP_PLATFORM_HCC__/__HIP_PLATFORM_AMD__ → HIP_LIB_NAME=amdhip64

Move ROCM_DIR and HIP_ARCH defaults to the top of the file where other
tool-path defaults live. Use HIP_LIB_NAME in the library detection glob
and in PKG_LIBS. Remove the subst=,, stripping from HIPCONFIG_CPPFLAGS
so flags are passed through unmodified.
When SYCL backends are built, libceed.so must be linked with icpx
(SYCLCXX) rather than g++, and -fsycl must appear in CEED_LDFLAGS
(before object files) so icpx can merge the SYCL fat binary device
sections. Without this, libceed.so lacks NEEDED: libsycl.so.7 and
SYCL kernels fail to load at runtime.
@pvelesko
Copy link
Copy Markdown
Author

pvelesko commented Apr 8, 2026

@jeremylt I've addressed the review feedback in the latest push:

  1. Makefile (db0d71ce → eb092232): hipconfig invocation is now guarded by ROCM_DIR non-empty AND hipconfig binary present, so non-HIP machines build cleanly. Default HIP_LIB_NAME=amdhip64.

  2. Style fixes (blank lines, brace-around-if) + duplication reduction: introduced include/ceed/jit-source/hip/hip-element-loop.h with three macros:

    • CEED_HIP_ELEM_LOOP_BEGIN(num_elem) / _END — for kernels with no shared-memory ops
    • CEED_HIP_ELEM_LOOP_BEGIN_SHARED(num_elem) / _END_SHARED — for kernels that need to reach __syncthreads() unconditionally
    • CEED_HIP_IF_ELEM_VALID — guards individual loads/stores; expands to if (elem < num_elem) on chipStar and to nothing on AMD/NVIDIA (where the for-loop already enforces validity)

    Both hip-ref-basis-nontensor.h and hip-shared-basis-tensor.h now use the macros — no more duplicated #ifdef __HIP_PLATFORM_SPIRV__ blocks. The code-generator (ceed-hip-gen-operator-build.cpp) emits the macro names too, so the JIT'd source benefits from the same single source of truth.

    Net result: hip-shared-basis-tensor.h shrank from ~760 to ~600 lines (and the diff vs upstream is much smaller).

  3. Verified: clean build with chipStar, t320-basis and ex1-volume (5M DOFs, 3D) pass on all 3 HIP backends (ref/shared/gen). Both AMD and SPIRV macro paths preprocess and compile cleanly.

Could you re-trigger CI when convenient? Thanks!

@pvelesko
Copy link
Copy Markdown
Author

pvelesko commented Apr 8, 2026

Follow-up: replaced the stderr line filter with CHIP_LOGLEVEL=crit set in the test process environment (2b8414e8). Cleaner — chipStar suppresses the noise at source instead of filtering after the fact. Could you re-trigger CI?

@jrwrigh
Copy link
Copy Markdown
Collaborator

jrwrigh commented Apr 8, 2026

@jeremylt We disabled the HIP backend checks when we removed the AMD card from Noether, right? Did we ever try and get it back going with the NVIDIA card? I'm afraid the CI may not be actually testing the code here (assuming I understand how chip star interacts with the backends).

@jeremylt
Copy link
Copy Markdown
Member

jeremylt commented Apr 8, 2026

@jeremylt We disabled the HIP backend checks when we removed the AMD card from Noether, right? Did we ever try and get it back going with the NVIDIA card? I'm afraid the CI may not be actually testing the code here (assuming I understand how chip star interacts with the backends).

Correct - CI is only testing that the build system is correct and that the HIP backend compiles correctly.

@jeremylt
Copy link
Copy Markdown
Member

jeremylt commented Apr 8, 2026

Overall this is looking pretty great. I need to do manual testing of the changes since ROCm testing isn't part of CI (as discussed above). Can you post the output of make prove -j for the full test suite so we have documented in the PR the test suite passing on a machine using chipStar?

Copy link
Copy Markdown
Collaborator

@zatkins-dev zatkins-dev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that hijacking the HIP backend for this support is interesting, but creates more brittle and confusing code. I think that this ought to be its own backend, perhaps utilizing the same inner kernels, to clarify the differences with the existing HIP backend

@jrwrigh
Copy link
Copy Markdown
Collaborator

jrwrigh commented Apr 8, 2026

I think that hijacking the HIP backend for this support is interesting, but creates more brittle and confusing code. I think that this ought to be its own backend, perhaps utilizing the same inner kernels, to clarify the differences with the existing HIP backend

I obviously don't work on the backends that much, so take my opinion with a grain of salt. But the point of chipStar is to just operate straight on HIP code and the changes made, while a bit confusing, are pretty minimal. Perhaps maybe there's some cleaner way to implement the changes.

@zatkins-dev
Copy link
Copy Markdown
Collaborator

the changes made, while a bit confusing, are pretty minima

I think my main issue is why is this necessary and in what situations. I don't understand it, which means I can't maintain it.

@jeremylt
Copy link
Copy Markdown
Member

jeremylt commented Apr 8, 2026

The GPU backends are in a tricky spot with maintainability today, let alone 2 months from now, so I think its a good concern to voice. I agree that the why about not looping over elements is not clear to me either.

@pvelesko
Copy link
Copy Markdown
Author

pvelesko commented Apr 9, 2026

@zatkins-dev I wouldn't call this hijacking since chipStar is a full HIP implementation+runtime.

The only real change here is the hoisting of syncthreads call outside of the inner loop. SPIR-V semantics dictate that all threads must reach syncthreads and this case we have cases where some threads are not participating in the calculation thus never reach syncthreads and thus hang.

These changes are benign but I can run some performance benchmarks on MI50 before and after this refactor and report performance.

Having two HIP backends doesn't make sense to me even in the case where there is a performance impact on the AMD side.

If I observe a performance degradation on the AMD side, we can discuss how to best set this up.

@jeremylt
Copy link
Copy Markdown
Member

jeremylt commented Apr 9, 2026

If the goal is to prevent some threads from not hitting sync points, I would tackle this differently. I have done that before with other kernels. Let me take a look at the code again.

@zatkins-dev
Copy link
Copy Markdown
Collaborator

Hijacking may have been a strong word, I meant no offense.

I am not concerned about performance, since the macro restores the existing code. I am concerned about how this can be doing the same computation when it removes a loop over elements. I am also confused as to why only the two files that you changed need these changes. As one of the people who will have to maintain this code in the future, I need to know when I should or should not be using these annotations, and I currently have no mental model for that.

@jeremylt
Copy link
Copy Markdown
Member

jeremylt commented Apr 9, 2026

I would use the approach in #1950 instead of the macro magic.

I think it should do what you're trying to do, but clearer so it can be maintained.

I do think that for a big change like this, it is better to first create an issue so design decisions can be discussed first.

…r gcc

chipStar's hipconfig -C outputs --offload=spirv64, -nohipwrapperinc, --hip-path=,
and --target= which are clang-only flags.  When CC=gcc is used for .c files (or
CXX != HIPCC for .cpp files), these flags cause build failures.

Add HIPCONFIG_CPPFLAGS_C that filters the clang-only flags and adds an explicit
-I$(ROCM_DIR)/include (since -nohipwrapperinc was suppressing the wrapper that
would have pulled in hip_runtime.h).
Add a generic --env KEY=VAL CLI option (repeatable) to tests/junit.py
that injects environment variables into the test subprocess environment.
The flag mutates os.environ in the parent before mp.Pool is created, so
worker processes inherit the values via the existing init_process()
copy.

Useful for backend-specific runtime knobs that affect test behavior, for
example silencing chipStar runtime informational/warning lines that
JUnit otherwise classifies as test failures:

    make junit BACKENDS='/gpu/hip/ref ...' \
        JUNIT_ARGS='--env CHIP_LOGLEVEL=crit'

No changes to tests/junit_common.py.
chipStar HIP headers do not transitively include <cstring>, causing
build failures for the four pre-existing strcmp() calls in this file
(resource_root comparisons). AMD ROCm headers include it transitively,
so this is a no-op on non-chipStar builds.
@pvelesko
Copy link
Copy Markdown
Author

pvelesko commented Apr 13, 2026

After discussion with @jeremylt, I've pulled all kernel changes out of this PR. @jeremylt's PR #1950 handles the syncthreads fix with a cleaner loop-padding approach that works across all platforms without preprocessor guards. This PR now focuses only on the build-system and test-infrastructure changes needed for chipStar support.

What's in this PR now

Commit Description
eb092232 Makefile: chipStar HIP platform detection and dynamic lib name
5b80b42c Makefile: use SYCLCXX as linker for libceed.so when SYCL is enabled
f10a5a85 Makefile: filter chipStar clang-only flags from HIPCONFIG_CPPFLAGS for gcc
ad0af6e3 hip-gen: add explicit #include <cstring> for chipStar compatibility
8d62fa67 tests/junit: add --env KEY=VAL flag for subprocess environment

Answers to inline review comments

@zatkins-dev on hip-ref-basis-nontensor.h line 27 ("why does this file need different behavior?"):
That change has been reverted to upstream. It was unnecessarily included in the earlier version of the PR.

@zatkins-dev on hip-element-loop.h line 51 (correctness concerns):
hip-element-loop.h has been deleted — the entire macro-based approach is gone. The kernel fix is now entirely in PR #1950.

@zatkins-dev on ceed-hip-gen-operator-build.cpp line 15 (#include <cstring>):
The include stays. chipStar HIP headers do not transitively include <cstring>, which causes build failures for the four pre-existing strcmp() calls in this file (lines 1191, 1215, 2164, 2188 — resource_root comparisons). AMD ROCm headers include it transitively, so this is a no-op on non-chipStar builds.

Note on PR #1950

I verified the loop-padding approach empirically on chipStar and found a formula bug: elem_loop_bound = num_elem * ceil(num_elem/stride) deadlocks when stride > num_elem. I posted the fix with a reproducer on PR #1950.

chipStar test invocation (for reference)

make junit BACKENDS='/gpu/hip/shared /gpu/hip/gen /gpu/hip/ref' JUNIT_ARGS='--env CHIP_LOGLEVEL=crit'

Test results

With the loop-padding approach (corrected formula equivalent to what I posted on #1950), the following pass on chipStar (Intel Arc A770, Level Zero):

  • t310-basis, t314-basis, t316-basis on /gpu/hip/shared — all pass
  • Basis tests previously deadlocked with the wrong formula (hipErrorOutOfMemory is chipStar's symptom for a workgroup barrier deadlock)

Full make prove results will be posted after #1950 merges and the kernel fix is in main.

@jeremylt
Copy link
Copy Markdown
Member

It looks like you heavily use Claude or something similar in your workflow - don't forget to update the LLM usage the disclosure so we know how you do so.

Note - you are replying to human generated review comments with machine generated messages. I think it's worth keeping in mind that is a potential frustration point

from junit_common import *


def parse_env_assignment(arg: str) -> Tuple[str, str]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zatkins-dev is this how you want this to work and should this be in the core? JUNIT has become mainly yours, so I don't want to make changes without your OK.

subsearch ?= .*
JUNIT_BATCH ?= ''
# Extra arguments forwarded to tests/junit.py (e.g. --env CHIP_LOGLEVEL=crit)
JUNIT_ARGS ?=
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we should automatically suppress the Chipstar warnings when we build with Chipstar so the test suite passes - so JUNIT_ARGS should be set appropriately by default if Chipstar is enabled

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants