
Add amrex::LaunchRaw #4926

Open
AlexanderSinn wants to merge 56 commits into AMReX-Codes:development from AlexanderSinn:Add_amrex__Launch

Conversation

@AlexanderSinn
Member

@AlexanderSinn AlexanderSinn commented Jan 28, 2026

Summary

This PR provides a unified interface for writing kernels that use shared memory and __syncthreads on CUDA, HIP, and SYCL without resorting to ifdefs.

The number of threads per block is always a 1D value known at compile time, while the number of blocks can be 1D, 2D, or 3D, using the built-in platform indexes such as blockIdx.y.

  • Perf testing for some existing kernels
  • Port/simplify existing kernels to use this
  • Write documentation
  • Add tests

Additional background

Example of an amrex::LaunchRaw kernel, which fuses a transpose operation in shared memory with data preprocessing and postprocessing stencils. (Only works on GPUs because threads_per_block > 1.)

constexpr int tile_dim_x = 16;
constexpr int tile_dim_x_ex = 34;
constexpr int tile_dim_y = 32;
constexpr int block_rows_x = 8;
constexpr int block_rows_y = 16;

const int nx = n_half + 1;
const int ny = n_data;

const int nx_sin = n_data;
const int ny_sin = n_batch;

const int num_blocks_x = (nx + tile_dim_x - 1)/tile_dim_x;
const int num_blocks_y = (ny + tile_dim_y - 1)/tile_dim_y;

amrex::LaunchRaw<tile_dim_x*block_rows_y, amrex::Real>(
    amrex::IntVectND<2>{num_blocks_x, num_blocks_y}, tile_dim_x_ex * tile_dim_y,
    [=] AMREX_GPU_DEVICE(auto lh) noexcept
    {
        const auto [block_x, block_y] = lh.blockIdxND();

        const int tile_begin_x = 2 * block_x * tile_dim_x - 2;
        const int tile_begin_y = block_y * tile_dim_y;

        const int tile_end_x = tile_begin_x + tile_dim_x_ex;
        const int tile_end_y = tile_begin_y + tile_dim_y;

        Array2<amrex::Real> shared{{lh.shared_memory(),
                                    {tile_begin_x, tile_begin_y, 0},
                                    {tile_end_x, tile_end_y, 1}, 1}};

        {
            const auto [thread_y, thread_x] =
                lh.template threadIdxND<tile_dim_y, block_rows_x>();

            for (int tx = thread_x; tx < tile_dim_x_ex; tx += block_rows_x) {
                const int i = tile_begin_x + tx;
                const int j = tile_begin_y + thread_y;

                if (j < nx_sin && i < ny_sin && i >= 0 ) {
                    shared(i, j) = transpose_to_sine(i, j);
                }
            }
        }

        lh.syncthreads();

        {
            const auto [thread_x, thread_y] =
                lh.template threadIdxND<tile_dim_x, block_rows_y>();

            for (int ty = thread_y; ty < tile_dim_y; ty += block_rows_y) {
                const int i = block_x * tile_dim_x + thread_x;
                const int j = tile_begin_y + ty;

                if (i < nx && j < ny) {
                    out(i, j) = to_complex(shared, i, j, n_half, n_batch);
                }
            }
        }
    });

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • changes answers in the test suite to more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@ax3l ax3l added the GPU label Jan 29, 2026
@AlexanderSinn
Member Author

AlexanderSinn commented Mar 10, 2026

I don't understand why the HYPRE and SUNDIALS tests keep failing. Maybe it is a CUDA compiler bug?

[ 12%] Building CUDA object Src/CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o
cd /home/runner/work/amrex/amrex/build/Src && ccache /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAMREX_SPACEDIM=3 -Damrex_3d_EXPORTS --options-file CMakeFiles/amrex_3d.dir/includes_CUDA.rsp -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_80,code=[compute_80,sm_80]" -Xcompiler=-fPIC --expt-relaxed-constexpr --expt-extended-lambda -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored -Xcudafe --diag_suppress=implicit_return_from_non_void_function -maxrregcount=255 -Xcudafe --display_error_number --Wext-lambda-captures-this --use_fast_math --generate-line-info -MD -MT Src/CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o -MF CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o.d -x cu -rdc=true -c /home/runner/work/amrex/amrex/Src/Base/AMReX_ParallelContext.cpp -o CMakeFiles/amrex_3d.dir/Base/AMReX_ParallelContext.cpp.o
/home/runner/work/amrex/amrex/Src/Base/AMReX_Reduce.H(679): error: no instance of function template "amrex::Reduce::detail::call_f_intvect_n" matches the argument list
            argument types are: (lambda [](int, int, int)->ReduceTuple, amrex::IntVectND<3>, int)
          detected during:
            instantiation of "void amrex::ReduceOps<Ps...>::eval_box(I, const amrex::BoxND<dim> &, int, D &, const F &) [with Ps=<amrex::ReduceOpSum>, I=amrex::Reduce::detail::iterate_box, dim=3, D=amrex::ReduceData<amrex::Real>, F=lambda [](int, int, int)->ReduceTuple]" 
(753): here
            instantiation of "void amrex::ReduceOps<Ps...>::eval(const amrex::BoxND<dim> &, D &, const F &) [with Ps=<amrex::ReduceOpSum>, D=amrex::ReduceData<amrex::Real>, F=lambda [](int, int, int)->ReduceTuple, dim=3]" 
/home/runner/work/amrex/amrex/Src/Base/AMReX_BaseFab.H(3749): here
            instantiation of "T amrex::BaseFab<T>::sum<run_on>(const amrex::Box &, amrex::DestComp, amrex::NumComps) const noexcept [with T=amrex::Real, run_on=amrex::RunOn::Device]" 
/home/runner/work/amrex/amrex/Src/Base/AMReX_BaseFab.H(2735): here
            instantiation of "T amrex::BaseFab<T>::sum<run_on>(const amrex::Box &, int, int) const noexcept [with T=amrex::Real, run_on=amrex::RunOn::Device]" 
/home/runner/work/amrex/amrex/Src/Base/AMReX_DistributionMapping.cpp(1708): here

1 error detected in the compilation of "/home/runner/work/amrex/amrex/Src/Base/AMReX_DistributionMapping.cpp".

Edit: Fixed by changing the input type of the device lambda from auto to the concrete type.

@WeiqunZhang
Member

WeiqunZhang commented Mar 14, 2026

diff --git a/Tests/LaunchRaw/GNUmakefile b/Tests/LaunchRaw/GNUmakefile
index 173eef4c67..b23330bb62 100644
--- a/Tests/LaunchRaw/GNUmakefile
+++ b/Tests/LaunchRaw/GNUmakefile
@@ -1,4 +1,4 @@
-AMREX_HOME = ../../../
+AMREX_HOME = ../../
 
 DEBUG  = FALSE

@WeiqunZhang
Member

Codex:

• The new LaunchRaw API is not fully usable on SYCL for its advertised 2D/3D cases, and the newly added GNUmake test cannot be built from its checked-in path. Those issues make the patch incorrect as submitted.

Full review comments:

  • [P2] Preserve 2D/3D SYCL support in LaunchHandler::handler() — /home/wqzhang/mygitrepo/amrex/Src/Base/AMReX_GpuTypes.H:290-292
    Under SYCL, handler() unconditionally forwards m_item into Gpu::Handler, but Gpu::Handler only has a constructor taking sycl::nd_item<1> const*. That means any new 2D or 3D LaunchRaw kernel that calls lh.handler()—for example to use Gpu::blockReduce* or another helper that still expects a Gpu::Handler—will fail to compile, even though LaunchRaw advertises 1D/2D/3D block support.
  • [P2] Point the new GNUmake test at the repository root — /home/wqzhang/mygitrepo/amrex/Tests/LaunchRaw/GNUmakefile:1-1
    From Tests/LaunchRaw, AMREX_HOME = ../../../ resolves one level above the checkout, so make -C Tests/LaunchRaw cannot find Tools/GNUMake/Make.rules and the new test does not build with the GNUmake-based test flow. The other tests use ../.. here.

@WeiqunZhang
Member

For the SYCL issue handled above, Codex suggests:

launchraw-current.patch

@WeiqunZhang
Member

The work-group-size runtime issue still exists. Codex suggests the following. Note that it's the diff against your branch with all the previous Codex changes.

launchraw-workgroups-fix.patch

@WeiqunZhang
Member

Re: MT > 1 on CPU, could you add a message to the static_assert?

@AlexanderSinn
Member Author

Added. For SYCL it is getting a bit more complicated than I expected. Maybe we could just use a 1D range and split the block index manually using FastDivmodU64?

@WeiqunZhang
Member

Okay

@WeiqunZhang
Member

Codex:

• The patch introduces backend regressions: CPU-only builds now use an incomplete IntVectND type in LaunchHandler.

Full review comments:

  • [P1] Include the full IntVectND definition before storing it in LaunchHandler — /home/wqzhang/mygitrepo/amrex/Src/Base/AMReX_GpuTypes.H:401-403
    In non-GPU builds this class now stores IntVectND members by value, but AMReX_GpuTypes.H only pulls in AMReX_BaseFwd.H, which forward-declares IntVectND. AMReX_GpuLaunch.H includes this header before AMReX_Box.H/AMReX_IntVect.H, so CPU-only configurations now see an incomplete type here and fail to compile as soon as they include the launch headers.

@AlexanderSinn
Member Author

Can you start GPU CI again?

@WeiqunZhang
Member

/run-hpsf-gitlab-ci

@github-actions

GitLab CI has started at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476241.

@amrex-gitlab-ci-reporter

GitLab CI 1476241 finished with status: failed. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476241.

@WeiqunZhang
Member

WeiqunZhang commented Mar 16, 2026

It's one of the known minor issues: we define gpuStream_t for CUDA/HIP in AMReX_GpuControl.H and for SYCL in AMReX_GpuTypes.H. We should probably move both to AMReX_GpuTypes.H (and add the appropriate CUDA/HIP headers).

Initially, Arena only included GpuControl.H, which in turn included GpuTypes.H, but GpuTypes.H has now been removed from GpuControl.H. If we move gpuStream_t to GpuTypes.H, Arena.H will no longer need to include GpuControl.H, since you just added GpuTypes.H. There is another minor issue (found recently by AI) that can be fixed: GpuTypes.H uses macros defined in AMReX_Qualifiers.H. I was planning to fix it, but maybe you can just fix it in this PR.

@WeiqunZhang
Member

/run-hpsf-gitlab-ci

@github-actions

GitLab CI has started at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476793.

@amrex-gitlab-ci-reporter

GitLab CI 1476793 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1476793.
