
Add amrex::LaunchRaw #4926

Open
AlexanderSinn wants to merge 56 commits into AMReX-Codes:development from AlexanderSinn:Add_amrex__Launch

Conversation

@AlexanderSinn
Member

@AlexanderSinn AlexanderSinn commented Jan 28, 2026

Summary

This PR provides a unified interface for writing kernels that use shared memory and __syncthreads on CUDA, HIP, and SYCL without resorting to ifdefs.

The number of threads per block is always a 1D value known at compile time, while the number of blocks can be 1D, 2D, or 3D, using the built-in platform indexes such as blockIdx.y.

  • Perf testing for some existing kernels
  • Port/simplify existing kernels to use this
  • Write documentation
  • Add tests

Additional background

Example of an amrex::LaunchRaw kernel, which fuses a transpose operation in shared memory with data preprocessing and postprocessing stencils. (Only works on GPUs because threads_per_block > 1.)

constexpr int tile_dim_x = 16;
constexpr int tile_dim_x_ex = 34;
constexpr int tile_dim_y = 32;
constexpr int block_rows_x = 8;
constexpr int block_rows_y = 16;

const int nx = n_half + 1;
const int ny = n_data;

const int nx_sin = n_data;
const int ny_sin = n_batch;

const int num_blocks_x = (nx + tile_dim_x - 1)/tile_dim_x;
const int num_blocks_y = (ny + tile_dim_y - 1)/tile_dim_y;

amrex::LaunchRaw<tile_dim_x*block_rows_y, amrex::Real>(
    amrex::IntVectND<2>{num_blocks_x, num_blocks_y}, tile_dim_x_ex * tile_dim_y,
    [=] AMREX_GPU_DEVICE(auto lh) noexcept
    {
        const auto [block_x, block_y] = lh.blockIdxND();

        const int tile_begin_x = 2 * block_x * tile_dim_x - 2;
        const int tile_begin_y = block_y * tile_dim_y;

        const int tile_end_x = tile_begin_x + tile_dim_x_ex;
        const int tile_end_y = tile_begin_y + tile_dim_y;

        Array2<amrex::Real> shared{{lh.shared_memory(),
                                    {tile_begin_x, tile_begin_y, 0},
                                    {tile_end_x, tile_end_y, 1}, 1}};

        {
            const auto [thread_y, thread_x] =
                lh.template threadIdxND<tile_dim_y, block_rows_x>();

            for (int tx = thread_x; tx < tile_dim_x_ex; tx += block_rows_x) {
                const int i = tile_begin_x + tx;
                const int j = tile_begin_y + thread_y;

                if (j < nx_sin && i < ny_sin && i >= 0 ) {
                    shared(i, j) = transpose_to_sine(i, j);
                }
            }
        }

        lh.syncthreads();

        {
            const auto [thread_x, thread_y] =
                lh.template threadIdxND<tile_dim_x, block_rows_y>();

            for (int ty = thread_y; ty < tile_dim_y; ty += block_rows_y) {
                const int i = block_x * tile_dim_x + thread_x;
                const int j = tile_begin_y + ty;

                if (i < nx && j < ny) {
                    out(i, j) = to_complex(shared, i, j, n_half, n_batch);
                }
            }
        }
    });

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@ax3l ax3l added the GPU label Jan 29, 2026
@AlexanderSinn
Member Author

AlexanderSinn commented Mar 10, 2026

I don't understand why the HYPRE and SUNDIALS tests keep failing. Maybe it is a CUDA compiler bug?

[ 12%] Building CUDA object Src/CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o
cd /home/runner/work/amrex/amrex/build/Src && ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAMREX_SPACEDIM=3 -Damrex_3d_EXPORTS --options-file CMakeFiles/amrex_3d.dir/includes_CUDA.rsp -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fPIC --expt-relaxed-constexpr --expt-extended-lambda -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag_suppress=implicit_return_from_non_void_function -maxrregcount=255 -Xcudafe --display_error_number --Wext-lambda-captures-this --use_fast_math --generate-line-info -MD -MT Src/CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o -MF CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o.d -x cu -rdc=true -c /home/runner/work/amrex/amrex/Src/Base/AMReX_ParallelContext.cpp -o CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o
/home/runner/work/amrex/amrex/Src/Base/AMReX_Reduce.H(679): error: no instance of function template "amrex::Reduce::detail::call_f_intvect_n" matches the argument list
            argument types are: (lambda [](int, int, int)->ReduceTuple, amrex::IntVectND<3>, int)
          detected during:
            instantiation of "void amrex::ReduceOps<Ps...>::eval_box(I, const amrex::BoxND<dim> &, int, D &, const F &) [with Ps=<amrex::ReduceOpSum>, I=amrex::Reduce::detail::iterate_box, dim=3, D=amrex::ReduceData<amrex::Real>, F=lambda [](int, int, int)->ReduceTuple]" 
(753): here
            instantiation of "void amrex::ReduceOps<Ps...>::eval(const amrex::BoxND<dim> &, D &, const F &) [with Ps=<amrex::ReduceOpSum>, D=amrex::ReduceData<amrex::Real>, F=lambda [](int, int, int)->ReduceTuple, dim=3]" 
/home/runner/work/amrex/amrex/Src/Base/AMReX_BaseFab.H(3749): here
            instantiation of "T amrex::BaseFab<T>::sum<run_on>(const amrex::Box &, amrex::DestComp, amrex::NumComps) const noexcept [with T=amrex::Real, run_on=amrex::RunOn::Device]" 
/home/runner/work/amrex/amrex/Src/Base/AMReX_BaseFab.H(2735): here
            instantiation of "T amrex::BaseFab<T>::sum<run_on>(const amrex::Box &, int, int) const noexcept [with T=amrex::Real, run_on=amrex::RunOn::Device]" 
/home/runner/work/amrex/amrex/Src/Base/AMReX_DistributionMapping.cpp(1708): here

1 error detected in the compilation of "/home/runner/work/amrex/amrex/Src/Base/AMReX_DistributionMapping.cpp".

Edit: Fixed by changing the input type of the device lambda from auto to the concrete type.

@WeiqunZhang
Member

WeiqunZhang commented Mar 14, 2026

diff --git a/Tests/LaunchRaw/GNUmakefile b/Tests/LaunchRaw/GNUmakefile
index 173eef4c67..b23330bb62 100644
--- a/Tests/LaunchRaw/GNUmakefile
+++ b/Tests/LaunchRaw/GNUmakefile
@@ -1,4 +1,4 @@
-AMREX_HOME = ../../../
+AMREX_HOME = ../../
 
 DEBUG  = FALSE

@WeiqunZhang
Member

Codex:

• The new LaunchRaw API is not fully usable on SYCL for its advertised 2D/3D cases, and the newly added GNUmake test cannot be built from its checked-in path. Those issues make the patch incorrect as submitted.

Full review comments:

  • [P2] Preserve 2D/3D SYCL support in LaunchHandler::handler() — /home/wqzhang/mygitrepo/amrex/Src/Base/AMReX_GpuTypes.H:290-292
    Under SYCL, handler() unconditionally forwards m_item into Gpu::Handler, but Gpu::Handler only has a constructor taking sycl::nd_item<1> const*. That means any new 2D or 3D LaunchRaw kernel that calls lh.handler()—for example to use Gpu::blockReduce* or another helper that still expects a Gpu::Handler—will fail to compile, even though LaunchRaw advertises 1D/2D/3D block support.
  • [P2] Point the new GNUmake test at the repository root — /home/wqzhang/mygitrepo/amrex/Tests/LaunchRaw/GNUmakefile:1-1
    From Tests/LaunchRaw, AMREX_HOME = ../../../ resolves one level above the checkout, so make -C Tests/LaunchRaw cannot find Tools/GNUMake/Make.rules and the new test does not build with the GNUmake-based test flow. The other tests use ../.. here.

@WeiqunZhang
Member

For the SYCL issue handled above, Codex suggests:

launchraw-current.patch

@WeiqunZhang
Member

The work-group-size runtime issue still exists. Codex suggests the following. Note that it's the diff against your branch with all the previous Codex changes.

launchraw-workgroups-fix.patch

@WeiqunZhang
Member

Re: MT > 1 on CPU, could you add a message to the static_assert?

@AlexanderSinn
Member Author

Added. For SYCL it is getting a bit more complicated than I expected. Maybe we could just use a 1D range and split the block index manually using FastDivmodU64?

@WeiqunZhang
Member

Okay

@WeiqunZhang
Member

Codex:

• The patch introduces backend regressions: CPU-only builds now use an incomplete IntVectND type in LaunchHandler.

Full review comments:

  • [P1] Include the full IntVectND definition before storing it in LaunchHandler — /home/wqzhang/mygitrepo/amrex/Src/Base/AMReX_GpuTypes.H:401-403
    In non-GPU builds this class now stores IntVectND members by value, but AMReX_GpuTypes.H only pulls in AMReX_BaseFwd.H, which forward-declares IntVectND. AMReX_GpuLaunch.H includes this header before AMReX_Box.H/AMReX_IntVect.H, so CPU-only configurations now see an incomplete type here and fail to compile as soon as they include the launch headers.

@AlexanderSinn
Member Author

Can you start GPU CI again?

@WeiqunZhang
Member

/run-hpsf-gitlab-ci

@github-actions

GitLab CI has started at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476241.

@amrex-gitlab-ci-reporter

GitLab CI 1476241 finished with status: failed. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476241.

@WeiqunZhang
Member

WeiqunZhang commented Mar 16, 2026

It's one of the known minor issues: we define gpuStream_t for CUDA/HIP in AMReX_GpuControl.H and for SYCL in AMReX_GpuTypes.H. We should probably move both to AMReX_GpuTypes.H (and add the appropriate CUDA/HIP headers).

Initially, Arena only included GpuControl.H, which in turn included GpuTypes.H, but GpuTypes.H has now been removed from GpuControl.H. If we move gpuStream_t to GpuTypes.H, Arena.H will no longer need to include GpuControl.H, since you just added GpuTypes.H. There is another minor issue (found recently by AI) that can be fixed: GpuTypes.H uses macros defined in AMReX_Qualifiers.H. I was planning to fix it, but maybe you can just fix it in this PR.

@WeiqunZhang
Member

/run-hpsf-gitlab-ci

@github-actions

GitLab CI has started at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476793.

@amrex-gitlab-ci-reporter

GitLab CI 1476793 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476793.
