
perf: improve fsum speed via multiple accumulators (issue #824) #826

Open
SebKrantz wants to merge 1 commit into master from claude/issue-824-20260504-0332

Conversation

@SebKrantz
Member

Incorporates the loop unrolling optimization proposed in #824 by @TylerSagendorf.

Adds #define FSUM_N_ACC 4 and uses 4 independent accumulators in the na.rm = FALSE path of fsum_double_impl and fsum_double_omp_impl.

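Why it helps: With a single accumulator, every addition depends on the result of the previous one, forming a serial dependency chain. Because floating-point addition is not associative, the compiler may not reorder these additions on its own, so the loop stays scalar unless it is explicitly told that reordering is acceptable (which is what #pragma omp simd does). Multiple independent accumulators break the chain in the source code itself, allowing the compiler's auto-vectorizer to issue SIMD instructions. On macOS without OpenMP, #pragma omp simd is silently ignored, so this is the only way to get SIMD there.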
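
For illustration, the multiple-accumulator pattern can be sketched roughly as follows. This is a minimal sketch, not the code from this PR: only `FSUM_N_ACC` is taken from the description above, while the helper name `sum_multi_acc` and its signature are hypothetical.

```c
#include <stddef.h>

#define FSUM_N_ACC 4  /* number of independent accumulators, as named in the PR */

/* Hypothetical helper (not the collapse source): sums x[0..n-1]
   using FSUM_N_ACC independent partial sums. */
static double sum_multi_acc(const double *x, size_t n)
{
    double acc[FSUM_N_ACC] = {0.0};
    size_t i = 0;

    /* Main loop: acc[j] depends only on its own previous value,
       so the FSUM_N_ACC additions per iteration are independent. */
    for (; i + FSUM_N_ACC <= n; i += FSUM_N_ACC) {
        for (int j = 0; j < FSUM_N_ACC; ++j)
            acc[j] += x[i + j];
    }

    /* Combine the partial sums, then add the leftover tail elements. */
    double sum = 0.0;
    for (int j = 0; j < FSUM_N_ACC; ++j)
        sum += acc[j];
    for (; i < n; ++i)
        sum += x[i];
    return sum;
}
```

Because the additions are performed in a different order than in a single-accumulator loop, the result can differ from the sequential sum in the last few bits for general floating-point input.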

Closes #824

Generated with Claude Code

Use FSUM_N_ACC=4 independent accumulators in fsum_double_impl and
fsum_double_omp_impl (no-NA path). This breaks the serial dependency chain
on a single sum variable, allowing the compiler's auto-vectorizer to emit
SIMD instructions. On macOS without OpenMP configured, #pragma omp simd is
silently ignored leaving no SIMD; the multiple accumulator approach provides
~7x speedup in that case. On Linux with OpenMP+AVX the same approach gives
~2x speedup even over the existing omp simd path.
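
For contrast, a pragma-based reduction of the kind the message refers to as the "omp simd path" might look roughly like this. This is an assumption for illustration, not the existing collapse code, and the helper name `sum_omp_simd` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper (not the collapse source): single accumulator,
   with vectorization requested through an OpenMP SIMD reduction. */
static double sum_omp_simd(const double *x, size_t n)
{
    double sum = 0.0;
    /* Honored only when the compiler is invoked with OpenMP support
       (e.g. -fopenmp or -fopenmp-simd on GCC and Clang). Without it,
       the pragma is ignored and this is a plain scalar loop. */
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

The function returns the same value either way (up to floating-point reordering); only the generated machine code changes, which is why the missing pragma support goes unnoticed apart from the speed.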

Closes #824

Co-authored-by: Sebastian Krantz <SebKrantz@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

[FEAT] Improve speed of fsum when OpenMP support is unavailable (macOS)

1 participant