perf: improve fsum speed via multiple accumulators (issue #824) by SebKrantz · Pull Request #826 · fastverse/collapse

SebKrantz · 2026-05-04T04:35:26Z

Incorporates the loop unrolling optimization proposed in #824 by @TylerSagendorf.

Adds #define FSUM_N_ACC 4 and uses 4 independent accumulators in the na.rm = FALSE path of fsum_double_impl and fsum_double_omp_impl.

Why it helps: A single accumulator creates a serial data dependency chain that prevents SIMD vectorization. Multiple independent accumulators break this chain, allowing the compiler's auto-vectorizer to issue SIMD instructions. On macOS without OpenMP, #pragma omp simd is silently ignored so this is the only way to get SIMD.

Closes #824

Generated with Claude Code

Use FSUM_N_ACC=4 independent accumulators in fsum_double_impl and fsum_double_omp_impl (no-NA path). This breaks the serial dependency chain on a single sum variable, allowing the compiler's auto-vectorizer to emit SIMD instructions. On macOS without OpenMP configured, #pragma omp simd is silently ignored leaving no SIMD; the multiple accumulator approach provides ~7x speedup in that case. On Linux with OpenMP+AVX the same approach gives ~2x speedup even over the existing omp simd path. Closes #824 Co-authored-by: Sebastian Krantz <SebKrantz@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: improve fsum speed via multiple accumulators (issue #824)#826

perf: improve fsum speed via multiple accumulators (issue #824)#826
SebKrantz wants to merge 1 commit intomasterfrom
claude/issue-824-20260504-0332

SebKrantz commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

SebKrantz commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant