Skip to content

perf: extend multiple accumulator optimization to all fsum paths (#824)#827

Open
SebKrantz wants to merge 1 commit intomasterfrom
claude/issue-824-20260504-0437
Open

perf: extend multiple accumulator optimization to all fsum paths (#824)#827
SebKrantz wants to merge 1 commit intomasterfrom
claude/issue-824-20260504-0437

Conversation

@SebKrantz
Copy link
Copy Markdown
Member

Extends the loop unrolling optimization proposed in #824 by @TylerSagendorf to cover all scalar sum paths, not just na.rm = FALSE.

Functions updated

  • fsum_double_impl: all paths (na.rm=TRUE default, fill, na.rm=FALSE)
  • fsum_double_omp_impl: all paths using OpenMP array reduction
  • fsum_weights_impl: all paths
  • fsum_weights_omp_impl: all paths using OpenMP array reduction
  • fsum_int_impl: na.rm=TRUE path with long long accumulators
  • fsum_int_omp_impl: both paths using OpenMP array reduction

Grouped functions are unchanged — the scatter pattern (pout[pg[i]] += ...) prevents vectorization regardless.

Closes #824

Generated with Claude Code

#824)

Apply loop unrolling with FSUM_N_ACC=4 independent accumulators to all
scalar (non-grouped) sum functions, breaking the serial dependency chain
and enabling SIMD vectorization across all na.rm and type combinations.

Functions updated:
- fsum_double_impl: na.rm=TRUE (narm==1 and narm==2) and na.rm=FALSE paths
- fsum_double_omp_impl: all paths, using OpenMP array reduction
- fsum_weights_impl: all paths (narm==1, narm==2, narm==0)
- fsum_weights_omp_impl: all paths, using OpenMP array reduction
- fsum_int_impl: na.rm=TRUE path (na.rm=FALSE has early-return on NA, not vectorizable)
- fsum_int_omp_impl: both paths, using OpenMP array reduction on long long array

The OMP versions switch from reduction(+:sum) on a single scalar to
reduction(+:partial_sums[:FSUM_N_ACC]) on an array of accumulators, which
combines OpenMP parallelism with SIMD within each thread's chunk.

Co-authored-by: Sebastian Krantz <SebKrantz@users.noreply.github.com>
@TylerSagendorf
Copy link
Copy Markdown

TylerSagendorf commented May 4, 2026

@SebKrantz The code generated by claude is a bit inconsistent. I think this is because it tries to incorporate the code from all of the iterations that I posted: sometimes it uses acc for the accumulators, other times it uses partial_sums. Additionally, some of the code it generated may be slower because it uses a chunk approach where it has to perform an additional multiplication every FSUM_N_ACC (4) elements:

code_screenshot
Could I implement the changes to the sum functions myself and submit a PR?

@SebKrantz
Copy link
Copy Markdown
Member Author

@TylerSagendorf Thanks! And you are certainly very welcome to implement it!

@TylerSagendorf
Copy link
Copy Markdown

@SebKrantz no problem! I will work on that PR tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEAT] Improve speed of fsum when OpenMP support is unavailable (macOS)

2 participants