
Align Arena allocations to 256 bytes#5278

Merged
WeiqunZhang merged 5 commits into AMReX-Codes:development from AlexanderSinn:Align_Arena_allocations_to_256_bytes
Apr 15, 2026

Conversation

@AlexanderSinn
Member

Summary

This PR increases the alignment of Arena allocations from 16 to 256 bytes, which gives a small speedup when running on GPU.
In this test the alignment affects only the particle data; the field elements remain only 8-byte aligned because the 2D arrays carry no padding.

alignment: runtime in seconds of 3 runs
512: 8.443, 8.461, 8.444
256: 8.449, 8.459, 8.446
128: 8.473, 8.466, 8.491
64:  8.506, 8.508, 8.511
32:  8.49,  8.491, 8.483
16:  8.491, 8.489, 8.486
8:   misaligned address !!!

This could also help with SIMD on CPU.
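To illustrate the mechanism, here is a minimal sketch (hypothetical code, not the PR's actual Arena implementation) of how an arena typically guarantees such an alignment: every allocation size is rounded up to a multiple of the alignment, so that consecutive allocations from an aligned base all start on aligned boundaries.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical sketch, not the AMReX Arena code: round an allocation
// size up to the next multiple of the arena alignment so the following
// allocation also starts on a 256-byte boundary.
constexpr std::size_t arena_align = 256;

constexpr std::size_t align_up (std::size_t nbytes)
{
    return ((nbytes + arena_align - 1) / arena_align) * arena_align;
}
```

With this rounding, an 8-byte request still consumes a full 256-byte slot, which is the space/alignment trade-off the benchmark above explores.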

Additional background

It was not clear from the documentation what alignment hipMalloc guarantees. I assume it is 256 bytes, the same as cudaMalloc.

Checklist

The proposed changes:

  • fix a bug or incorrect behavior in AMReX
  • add new capabilities to AMReX
  • change answers in the test suite by more than roundoff level
  • are likely to significantly affect the results of downstream AMReX users
  • include documentation in the code and/or rst files, if appropriate

@WeiqunZhang WeiqunZhang self-requested a review April 8, 2026 16:50
@ax3l
Member

ax3l commented Apr 8, 2026

Oh nice, maybe I can vector-align the CPU SIMD loads and stores then, too.

I would otherwise have added alignas to the SoA arrays, but I guess this ends up being the same.

@ax3l ax3l requested a review from atmyers April 8, 2026 21:01
@ax3l
Member

ax3l commented Apr 8, 2026

With AMReX_SIMD enabled, you can even do a compile-time check for future architectures, e.g.,

  // (assuming std::experimental::simd; adjust the namespace if
  // AMReX_SIMD.H aliases it differently)
  template <class Simd>
  constexpr bool needs_more_than_256B =
      (std::experimental::memory_alignment_v<Simd> > 256);

  static_assert(!needs_more_than_256B<SIMDReal>);
  static_assert(!needs_more_than_256B<SIMDParticleReal>);
  static_assert(!needs_more_than_256B<SIMDInt>);
  static_assert(!needs_more_than_256B<SIMDIdCpu>);

Member

@ax3l ax3l left a comment


This is great, thank you!

If you like, you can update AMReX_SIMD.H as well for load_1d/store_1d.

@AlexanderSinn
Member Author

Since that function receives a general pointer, it is not necessarily aligned, e.g., for field data or particle data that was allocated through std::allocator instead of the Arena. There would need to be additional load_1d_aligned etc. functions, which I won't add in this PR.
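For illustration, such an aligned variant would differ from the generic load only in the alignment tag passed to the underlying SIMD copy. This sketch uses std::experimental::simd; the names load_1d and load_1d_aligned here are assumptions for illustration, not the actual AMReX_SIMD.H API:

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;
using SimdReal = stdx::native_simd<double>;

// Generic load: no alignment assumption on p (works for memory from
// std::allocator as well as from the Arena).
inline SimdReal load_1d (double const* p)
{
    SimdReal v;
    v.copy_from(p, stdx::element_aligned);
    return v;
}

// Hypothetical aligned variant: the caller guarantees that p is
// suitably aligned, e.g. because it is the start of an Arena
// allocation.
inline SimdReal load_1d_aligned (double const* p)
{
    SimdReal v;
    v.copy_from(p, stdx::vector_aligned);
    return v;
}
```

The aligned tag lets the compiler emit aligned vector loads, but passing an unaligned pointer to it is undefined behavior, which is exactly why a separate entry point would be needed.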

@WeiqunZhang
Member

To be safe, we should add an assertion on the alignment after hipMalloc.
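Such a check could look like the following sketch (a hypothetical helper; the assert the PR actually adds may differ):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if p is aligned to `align` bytes, where
// `align` is a power of two. After hipMalloc one could then assert,
// e.g., AMREX_ALWAYS_ASSERT(is_aligned(p, 256));
inline bool is_aligned (void const* p, std::size_t align)
{
    return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}
```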

@AlexanderSinn
Member Author

Even if hipMalloc does not align to 256 bytes, things should still work with this PR (aside from the assert).
Unlike CUDA GPUs, gfx90a is happy to read fully unaligned memory. For example, to read 31 bytes, the compiler generates two 16-byte read instructions where the second one has an offset of 15 bytes (CUDA instead does 31 1-byte reads). There could still be a performance difference, however.
With this PR, memory allocated through AMReX has the same alignment guarantee as memory from the vendor API, up to 256 bytes, so the vendors determine the alignment that is good for their specific GPU.

@WeiqunZhang
Member

This is what I found in ROCm's GitHub repos. There are many layers; here is a line that calls malloc with the required alignment:
https://github.com/ROCm/rocm-systems/blob/c7d0197a5118a6006c2cb72366edeb2fc0ed8c77/projects/clr/hipamd/src/hip_vm.cpp#L114
dev_info.memBaseAddrAlign_ is the number of bits defined here:
https://github.com/ROCm/rocm-systems/blob/c7d0197a5118a6006c2cb72366edeb2fc0ed8c77/projects/clr/rocclr/device/rocm/rocdevice.cpp#L1333
It is at least 256 bytes, and even more for debug builds:
https://github.com/ROCm/clr/blob/41709ee843e3715ebec72db854e0be1119ef620a/rocclr/utils/flags.hpp#L37

@AlexanderSinn
Member Author

Maybe I will add an assert for all backends, then we can run the HPSF CI, and finally I will take out both asserts so they don't crash for users.

@WeiqunZhang
Member

Thinking more about this: you are absolutely right that even if hipMalloc does not align to 256 bytes, things should work with this PR (besides the assert). So maybe we should remove the assert and instead add a comment in Arena.H explaining that 256 bytes is only enforced in certain situations. For cudaMalloc, it is guaranteed in the documentation to be 256. For hipMalloc, it is 256 in practice, etc. I am not sure about CUDA/HIP host memory.

@WeiqunZhang
Member

/run-hpsf-gitlab-ci

@github-actions

GitLab CI has started at https://gitlab.spack.io/amrex/amrex/-/pipelines/1527178.

@amrex-gitlab-ci-reporter

GitLab CI 1527178 finished with status: success. See details at https://gitlab.spack.io/amrex/amrex/-/pipelines/1527178.

Comment thread Src/Base/AMReX_Arena.cpp Outdated
Comment thread Src/Base/AMReX_Arena.H Outdated
@AlexanderSinn
Member Author

Managed memory should be aligned to the 4k page size; otherwise, an explicit prefetch to host or device would move unrelated allocations. CUDA/HIP host memory is probably also aligned to 4k pages, though again this is poorly documented.

@WeiqunZhang WeiqunZhang merged commit 3c7601f into AMReX-Codes:development Apr 15, 2026
74 checks passed