
Optimize binary search methods #3110

Merged
IvoDD merged 1 commit into master from binary-search-utils-optimization
May 21, 2026
Conversation

Collaborator

@IvoDD IvoDD commented May 14, 2026

Reference Issues/PRs

Optimizations on top of #3091
Used in #3062

What does this implement or fix?

Some micro optimizations on binary search methods:

  • Don't keep TypedBlockData in ColumnDataIterator. Instead only keep block_data_ and block_size_
  • Don't recalculate block pointer and size when we already know them during gallop
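
To illustrate the first bullet, here is a minimal sketch of an iterator over a blocked column that caches only a raw pointer and a row count for the current block. It is not ArcticDB's `ColumnDataIterator`: `Block`, `BlockCachingIterator` and `sum_all` are hypothetical names, and the column is assumed to be a plain vector of `int` blocks. As in the PR, a null data pointer serves as the end sentinel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column stored as separately allocated blocks.
struct Block {
    std::vector<int> values;
};

// Caches a raw pointer and a row count for the current block instead of
// holding a richer per-block wrapper object.
class BlockCachingIterator {
public:
    BlockCachingIterator(const std::vector<Block>& blocks, std::size_t block_idx)
        : blocks_(&blocks), block_idx_(block_idx) {
        load_current_block();
    }

    // A null data pointer is the end sentinel.
    bool at_end() const { return block_data_ == nullptr; }

    int operator*() const {
        assert(!at_end());
        return block_data_[offset_];
    }

    BlockCachingIterator& operator++() {
        assert(!at_end());
        if (++offset_ == block_size_) {
            // Only a block boundary touches the block container again.
            ++block_idx_;
            offset_ = 0;
            load_current_block();
        }
        return *this;
    }

private:
    // Refreshes the cached pointer and size, skipping empty blocks.
    void load_current_block() {
        while (block_idx_ < blocks_->size() && (*blocks_)[block_idx_].values.empty()) {
            ++block_idx_;
        }
        if (block_idx_ < blocks_->size()) {
            const std::vector<int>& values = (*blocks_)[block_idx_].values;
            block_data_ = values.data();
            block_size_ = values.size();
        } else {
            block_data_ = nullptr;
            block_size_ = 0;
        }
    }

    const std::vector<Block>* blocks_;
    std::size_t block_idx_;
    std::size_t offset_ = 0;
    const int* block_data_ = nullptr;
    std::size_t block_size_ = 0;
};

// Example use: sum every value across all blocks.
inline long long sum_all(const std::vector<Block>& blocks) {
    long long total = 0;
    for (BlockCachingIterator it(blocks, 0); !it.at_end(); ++it) {
        total += *it;
    }
    return total;
}
```

In this sketch the per-element path only compares an offset against a cached size, which is the kind of saving that would show up most when blocks are tiny, as in the one-row-per-block benchmark.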

Any other comments?

Benchmarks for all search and iteration methods:

| Benchmark | Before (ns) | After (ns) | Delta |
|---|---|---|---|
| iterate_irregular_blocks_1 (one row per block) | 478,496 | 311,163 | −35.0% |
| iterate_with_iterator (100 rows) | 798 | 719 | −9.9% |
| exponential_lb_single_block (in first 100) | 356 | 323 | −9.2% |
| exponential_lb_single_block (full gallop) | 458 | 424 | −7.4% |
| exponential_lb_regular (in first 100) | 364 | 339 | −6.7% |
| exponential_lb_irregular_1000 (in first 100) | 360 | 335 | −6.7% |
| exponential_lb_irregular_1000 (full gallop) | 496 | 476 | −3.9% |
| exponential_lb_regular (full gallop) | 504 | 489 | −2.9% |
| exponential_lb_irregular_1 (in first 100) | 464 | 455 | −2.0% |
| exponential_lb_irregular_1 (full gallop) | 687 | 679 | −1.3% |
| lower_bound_single_block | 411 | 394 | −4.1% |
| lower_bound_irregular_1000 | 444 | 431 | −3.0% |
| lower_bound_irregular_1 | 595 | 579 | −2.8% |
| lower_bound_regular_blocks | 443 | 436 | −1.4% |
| iterate_single_block | 27,305 | 27,247 | −0.2% |
| iterate_regular_blocks | 29,051 | 28,734 | −1.1% |
| iterate_irregular_blocks_1000 | 28,136 | 27,893 | −0.9% |
| iterate_with_scalar_at (100 rows) | 182,183,122 | 182,088,026 | −0.1% |

Checklist

Checklist for code changes...
  • Have you updated the relevant docstrings, documentation and copyright notice?
  • Is this contribution tested against all ArcticDB's features?
  • Do all exceptions introduced raise appropriate error messages?
  • Are API changes highlighted in the PR description?
  • Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?

@IvoDD IvoDD added the no-release-notes (This PR shouldn't be added to release notes.) and patch (Small change, should increase patch version) labels May 14, 2026
Contributor

claude Bot commented May 14, 2026

ArcticDB Code Review Summary

No items requiring attention. The optimization is correct, well-scoped, and the benchmark deltas in the PR description validate it.

Verified:

  • gallop_bracket first-block lambdas are safe: prev_block/cur_block remain first_block_idx throughout the first-block probing phase, so the optimized variants do not need to track the block field.
  • The raw-pointer block_begin_ replacing std::optional<TypedBlockData<TDT>> is consistently propagated (copy constructor, dereference, end-sentinel). All callers in column_algorithms.hpp and test_column.cpp updated to current_block_data() == nullptr.
  • New load_current_block computes block->logical_size() / sizeof(RawType), equivalent to the previous TypedBlockData::row_count() for Dim0 (which is static_assert-enforced by the search code paths).

Comment on lines +468 to +472
    auto record_probe_in_first_block = [&](size_t next_offset, RawType probe_value) {
        prev_offset = cur_offset;
        cur_offset = next_offset;
        return is_before(probe_value, value);
    };
Collaborator


Do these two extra assignments that are omitted really make a difference?

Collaborator Author


Most probably they do not.

Most of the benefit is from reusing the already calculated first_block_row_count and first_block_data in make_iter_in_first_block.

It also made sense to add a first-block variant of record_probe, to make the invariant clearer.
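
As a rough illustration of probing inside a single block whose data pointer and row count are looked up once and then reused, here is a minimal galloping lower bound over one contiguous sorted block. This is a sketch under assumptions, not the PR's `gallop_bracket`: `gallop_lower_bound` is a hypothetical helper, the element type is fixed to `int`, and the ordering is plain `<`.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: exponential (galloping) lower bound within one sorted block.
// `data` and `size` are obtained once by the caller and reused for every probe.
// Returns the index of the first element not less than `value`, or `size` if none.
inline std::size_t gallop_lower_bound(const int* data, std::size_t size, int value) {
    if (size == 0 || !(data[0] < value)) {
        return 0;
    }
    std::size_t prev = 0;  // invariant: data[prev] < value
    std::size_t cur = 1;
    // Double the probe offset until it overshoots the answer or the block.
    while (cur < size && data[cur] < value) {
        prev = cur;
        cur *= 2;
    }
    const std::size_t hi = std::min(cur, size);
    // The answer lies in (prev, hi]; finish with a binary search on that bracket.
    const int* found = std::lower_bound(data + prev + 1, data + hi, value);
    return static_cast<std::size_t>(found - data);
}
```

Because every probe stays inside the same block, only the two offsets need updating between probes, which is the invariant a first-block variant of the probe lambda makes explicit.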

Base automatically changed from binary-search-utils to master May 21, 2026 09:15
@IvoDD IvoDD force-pushed the binary-search-utils-optimization branch from 0c2d98c to 6120021 Compare May 21, 2026 09:18
@IvoDD IvoDD merged commit f7767c2 into master May 21, 2026
226 checks passed
@IvoDD IvoDD deleted the binary-search-utils-optimization branch May 21, 2026 12:06
IvoDD added a commit that referenced this pull request Jun 4, 2026
…port (#3062)

#### Reference Issues/PRs
Monday ref: 11679866800

Depends on PRs #3091 and #3110 

### Issues

- There is complicated bucket hopping logic in three places:
`generate_output_index_column`, `generate_resampling_output_column`,
`SortedAggregator::aggregate`
- The bucket hopping logic involves many branches with loads of checks

### Changes (split per commit for easier review)

0. Adds C++ benchmarks which measure the CPU-intensive part of resampling.
1. Pure move of `generate_output_index_column` to `sorted_aggregation.cpp`.
   - This way all bucket hopping logic is in one place.
2. Construct a `ResampleMapping` in `generate_output_index_column` and use it directly in other methods.
   - `ResampleMapping` just has a mapping from `output_row` to `(start_column_index, start_column_offset), (end_column_index, end_column_offset)`.
   - Resolves the 3 places with similar logic.
   - Makes the implementation of sparse aggregation easier.
3. Use [galloping search](https://en.wikipedia.org/wiki/Exponential_search) in `generate_output_index_column` to skip past all rows in a single bucket at once.
   - Index column construction was the bottleneck: aggregation vectorises well but index iteration does not.
   - Changes complexity from `O(num_input_rows + num_buckets)` to `O(num_buckets × log(rows_per_bucket))`.
   - Always ≤ `O(num_input_rows + num_buckets)` even when `num_buckets ≥ num_input_rows`.
4. Preallocate the output index column to `min(num_buckets, num_input_rows)` instead of `num_buckets`.
   - Galloping search has a higher constant than linear scan and regresses at low rows per bucket.
   - Slightly improves the case where most buckets are empty due to smaller allocation.
5. Use a runtime heuristic to choose between linear scan and galloping search.
   - Linear scan is faster below ~32 rows/bucket (because of a smaller constant and better branch prediction); galloping search is faster above.
   - Threshold determined empirically from benchmarks at intermediate bucket counts. Extra benchmarking was done with more parametrization of the existing benchmark. It was not kept in the PR to avoid a huge amount of benchmarking code.
   - Recovers the Dense-10 (100k buckets) and Empty regressions from commit 3 while retaining all gains elsewhere.
6. Implement sparse resampling.
   - Small change made straightforward by the `ResampleMapping` from commit 2.
   - Minimal overhead for the dense case.
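
The idea behind commits 3 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the code in `sorted_aggregation.cpp`: `linear_end`, `gallop_end`, `rows_per_bucket` and `kGallopThreshold` are hypothetical names, buckets are assumed left-closed (`[edges[i], edges[i+1])`), no rows are assumed to precede `edges[0]`, and the heuristic is assumed to use the average number of rows per bucket with the ~32 figure quoted above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed threshold, taken from the ~32 rows/bucket figure in the description.
constexpr std::size_t kGallopThreshold = 32;

// First index in [from, ts.size()) whose timestamp is >= bound, by linear scan.
inline std::size_t linear_end(const std::vector<std::int64_t>& ts, std::size_t from,
                              std::int64_t bound) {
    while (from < ts.size() && ts[from] < bound) {
        ++from;
    }
    return from;
}

// Same result via galloping: probe at doubling strides, then binary search
// inside the bracket that the last two probes established.
inline std::size_t gallop_end(const std::vector<std::int64_t>& ts, std::size_t from,
                              std::int64_t bound) {
    std::size_t lo = from;  // every index before lo holds a timestamp < bound
    std::size_t hi = from;
    std::size_t step = 1;
    while (hi < ts.size() && ts[hi] < bound) {
        lo = hi + 1;
        hi += step;
        step *= 2;
    }
    hi = std::min(hi, ts.size());
    const auto first = ts.begin() + static_cast<std::ptrdiff_t>(lo);
    const auto last = ts.begin() + static_cast<std::ptrdiff_t>(hi);
    return static_cast<std::size_t>(std::lower_bound(first, last, bound) - ts.begin());
}

// Number of input rows that fall into each bucket [edges[i], edges[i + 1]).
inline std::vector<std::size_t> rows_per_bucket(const std::vector<std::int64_t>& ts,
                                                const std::vector<std::int64_t>& edges) {
    std::vector<std::size_t> counts;
    if (edges.size() < 2) {
        return counts;
    }
    const std::size_t num_buckets = edges.size() - 1;
    counts.reserve(num_buckets);
    // Runtime heuristic: gallop only when buckets are expected to hold many rows.
    const bool use_gallop = ts.size() / num_buckets >= kGallopThreshold;
    std::size_t row = 0;
    for (std::size_t i = 0; i < num_buckets; ++i) {
        const std::size_t end = use_gallop ? gallop_end(ts, row, edges[i + 1])
                                           : linear_end(ts, row, edges[i + 1]);
        counts.push_back(end - row);
        row = end;
    }
    return counts;
}
```

Both search functions return the same boundary for sorted input, so the heuristic only changes how much work is done per bucket, not the result.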

### Resample benchmark timings
                                               
`BM_resample/<rows_per_seg>/<num_segs>/<num_buckets>/<num_cols>`. Total
rows ~1M.
Source: `cpp/arcticdb/processing/test/benchmark_resample.cpp`. Times in
**ms**, `--benchmark_min_time=2s`.

| Regime | Args | rows/bucket | Description |
|---|---|---|---|
| Dense-1k | `100k × 10, 1k buckets` | ~1000 | Many rows/bucket, single row-slice |
| Dense-100 | `100k × 10, 10k buckets` | ~100 | Medium rows/bucket, single row-slice |
| Dense-10 | `100k × 10, 100k buckets` | ~10 | Few rows/bucket, single row-slice |
| Spanning | `2k × 500, 100 buckets` | ~10k | Buckets span multiple row-slices |
| Empty | `100k × 10, 10M buckets` | <1 | Bucket smaller than row spacing; most empty |

  **1 aggregation column**

| # | Change | D-1k | D-100 | D-10 | Spanning | Empty |
|---|---|---|---|---|---|---|
| 0 | Baseline | 1.27 | 1.34 | 1.47 | 1.65 | 11.1 |
| 1 | Code move | 1.02 (−20%) | 1.12 (−16%) | 1.27 (−14%) | 1.40 (−15%) | 11.1 (0%) |
| 2 | ResampleMapping | 1.02 (−20%) | 1.12 (−16%) | 1.32 (−10%) | 1.40 (−15%) | 11.8 (+6%) |
| 3 | Galloping search | 0.059 (−95%) | 0.385 (−71%) | 2.94 (+100%) | 0.285 (−83%) | 21.9 (+97%) |
| 4 | Bounded allocation | 0.058 (−95%) | 0.396 (−70%) | 2.91 (+98%) | 0.291 (−82%) | 21.5 (+94%) |
| 5 | Heuristic (lin/EUB) | 0.059 (−95%) | 0.383 (−71%) | 1.27 (−14%) | 0.293 (−82%) | 11.5 (+4%) |
| 6 | Sparse-input support | 0.068 (−95%) | 0.449 (−66%) | 1.28 (−13%) | 0.296 (−82%) | 11.5 (+4%) |

  **100 aggregation columns**

| # | Change | D-1k | D-100 | D-10 | Spanning | Empty |
|---|---|---|---|---|---|---|
| 0 | Baseline | 1.37 | 1.43 | 1.56 | 6.22 | 48.0 |
| 1 | Code move | 1.11 (−19%) | 1.18 (−17%) | 1.34 (−14%) | 5.92 (−5%) | 46.2 (−4%) |
| 2 | ResampleMapping | 1.11 (−19%) | 1.19 (−17%) | 1.39 (−11%) | 5.87 (−6%) | 50.4 (+5%) |
| 3 | Galloping search | 0.148 (−89%) | 0.471 (−67%) | 2.96 (+90%) | 4.65 (−25%) | 63.1 (+31%) |
| 4 | Bounded allocation | 0.148 (−89%) | 0.480 (−66%) | 2.95 (+89%) | 4.67 (−25%) | 44.1 (−8%) |
| 5 | Heuristic (lin/EUB) | 0.149 (−89%) | 0.477 (−67%) | 1.33 (−15%) | 4.70 (−24%) | 35.9 (−25%) |
| 6 | Sparse-input support | 0.158 (−88%) | 0.537 (−62%) | 1.35 (−13%) | 4.94 (−21%) | 36.0 (−25%) |

  Deltas vs baseline (row 0).

  #### Notes on benchmark results

- Load average varied across runs, so the results contain some artifacts, such as the apparent improvement from the pure "Code move" commit.
- Galloping search gives a significant speedup when there are many rows in a single bucket. Thorough benchmarking showed that exponential upper bound (EUB) becomes faster than linear search at ~32 rows per bucket. Below that it is slower, hence the performance regressions in the 10 rows per bucket and mostly empty bucket cases.
- Bounded allocation mostly helps the empty case, as expected.
- Using the heuristic to choose between EUB and linear search helps when rows_per_bucket < 32. It is even more efficient than the baseline due to slightly better branch prediction (improved use of `ARCTICDB_LIKELY` and `ARCTICDB_UNLIKELY`).
- Final state: every regime is faster than or close to baseline. Dense with 1000 rows per bucket is the biggest winner, with a roughly 20x improvement for 1 aggregation column. The mostly empty bucket regime is the only use case with no improvement for 1 aggregation column, where it remains around baseline (+4%).

---------

Co-authored-by: Ivo <ivo.dilov@man.com>

Labels

no-release-notes: This PR shouldn't be added to release notes.
patch: Small change, should increase patch version


3 participants