Optimize binary search methods#3110
Merged
Merged
Conversation
Contributor
ArcticDB Code Review SummaryNo items requiring attention. The optimization is correct, well-scoped, and the benchmark deltas in the PR description validate it. Verified:
|
alexowens90
approved these changes
May 18, 2026
vasil-pashov
approved these changes
May 18, 2026
Comment on lines
+468
to
+472
| auto record_probe_in_first_block = [&](size_t next_offset, RawType probe_value) { | ||
| prev_offset = cur_offset; | ||
| cur_offset = next_offset; | ||
| return is_before(probe_value, value); | ||
| }; |
Collaborator
There was a problem hiding this comment.
Do these two extra assignments that are omitted really make a difference?
Collaborator
Author
There was a problem hiding this comment.
Most probably they do not.
Most of the benefit is from reusing the already calculated first_block_row_count and first_block_data in make_iter_in_first_block.
It made sense to also add a first block variant of record_probe as well, to make the invariant clearer
Additional micro optimizations on binary search methods: - Don't keep `TypedBlockData` in `ColumnDataIterator` - Don't recalculate block pointer and size when we already know them during gallop
0c2d98c to
6120021
Compare
IvoDD
added a commit
that referenced
this pull request
Jun 4, 2026
…port (#3062) #### Reference Issues/PRs Monday ref: 11679866800 Depends on PRs #3091 and #3110 ### Issues - There is complicated bucket hopping logic in three places: `generate_output_index_column`, `generate_resampling_output_column`, `SortedAggregator::aggregate` - The bucket hopping logic involves many branches with loads of checks ### Changes (split per commit for easier review) 0. Adds C++ benchmarks which measures the CPU intensive part of resampling 1. Pure move of the `generate_output_index_column` to `sorted_aggregation.cpp`. - This way all bucket hopping logic is in one place. 2. Construct a `ResampleMapping` in `generate_output_index_column` and use it directly in other methods. - `ResampleMapping` just has a mapping from `output_row` to `(start_column_index, start_column_offset), (end_column_index, end_column_offset)`. - Resolves the 3 places with similar logic. - Makes the implementation of sparse aggregation easier. 3. Use [galloping search](https://en.wikipedia.org/wiki/Exponential_search) in `generate_output_index_column` to skip past all rows in a single bucket at once. - Index column construction was the bottleneck: aggregation vectorises well but index iteration does not. - Changes complexity from `O(num_input_rows + num_buckets)` to `O(num_buckets × log(rows_per_bucket))`. - Always ≤ `O(num_input_rows + num_buckets)` even when `num_buckets ≥ num_input_rows`. 4. Preallocate the output index column to `min(num_buckets, num_input_rows)` instead of `num_buckets`. - Galloping search has a higher constant than linear scan and regresses at low rows per bucket. - Slightly improves the case where most buckets are empty due to smaller allocation. 5. Use a runtime heuristic to choose between linear scan and galloping search. - Linear scan is faster below ~32 rows/bucket (because of smaller constant and better branch prediction); galloping search is faster above. - Threshold determined empirically from benchmarks at intermediate bucket counts. Extra benchmarking was done with more parametrization of the existing benchmark. Not kept in PR to avoid a huge amount of benchmarking code. - Recovers the Dense-100k and Empty regressions from commit 3 while retaining all gains elsewhere. 6. Implement sparse resampling. - Small change made straightforward by the `ResampleMapping` from commit 2. - Minimal overhead for the dense case. ### Resample benchmark timings `BM_resample/<rows_per_seg>/<num_segs>/<num_buckets>/<num_cols>`. Total rows ~1M. Source: `cpp/arcticdb/processing/test/benchmark_resample.cpp`. Times in **ms**, `--benchmark_min_time=2s`. | Regime | Args | rows/bucket | Description | |---|---|---|---| | Dense-1k | `100k × 10, 1k buckets` | ~1000 | Many rows/bucket, single row-slice | | Dense-100 | `100k × 10, 10k buckets` | ~100 | Medium rows/bucket, single row-slice | | Dense-10 | `100k × 10, 100k buckets` | ~10 | Few rows/bucket, single row-slice | | Spanning | `2k × 500, 100 buckets` | ~10k | Buckets span multiple row-slices | | Empty | `100k × 10, 10M buckets` | <1 | Bucket smaller than row spacing; most empty | **1 aggregation column** | # | Change | D-1k | D-100 | D-10 | Spanning | Empty | |---|---|---|---|---|---|---| | 0 | Baseline | 1.27 | 1.34 | 1.47 | 1.65 | 11.1 | | 1 | Code move | 1.02 (−20%) | 1.12 (−16%) | 1.27 (−14%) | 1.40 (−15%) | 11.1 (0%) | | 2 | ResampleMapping | 1.02 (−20%) | 1.12 (−16%) | 1.32 (−10%) | 1.40 (−15%) | 11.8 (+6%) | | 3 | Galloping search | 0.059 (−95%) | 0.385 (−71%) | 2.94 (+100%) | 0.285 (−83%) | 21.9 (+97%) | | 4 | Bounded allocation | 0.058 (−95%) | 0.396 (−70%) | 2.91 (+98%) | 0.291 (−82%) | 21.5 (+94%) | | 5 | Heuristic (lin/EUB) | 0.059 (−95%) | 0.383 (−71%) | 1.27 (−14%) | 0.293 (−82%) | 11.5 (+4%) | | 6 | Sparse-input support | 0.068 (−95%) | 0.449 (−66%) | 1.28 (−13%) | 0.296 (−82%) | 11.5 (+4%) | **100 aggregation columns** | # | Change | D-1k | D-100 | D-10 | Spanning | Empty | |---|---|---|---|---|---|---| | 0 | Baseline | 1.37 | 1.43 | 1.56 | 6.22 | 48.0 | | 1 | Code move | 1.11 (−19%) | 1.18 (−17%) | 1.34 (−14%) | 5.92 (−5%) | 46.2 (−4%) | | 2 | ResampleMapping | 1.11 (−19%) | 1.19 (−17%) | 1.39 (−11%) | 5.87 (−6%) | 50.4 (+5%) | | 3 | Galloping search | 0.148 (−89%) | 0.471 (−67%) | 2.96 (+90%) | 4.65 (−25%) | 63.1 (+31%) | | 4 | Bounded allocation | 0.148 (−89%) | 0.480 (−66%) | 2.95 (+89%) | 4.67 (−25%) | 44.1 (−8%) | | 5 | Heuristic (lin/EUB) | 0.149 (−89%) | 0.477 (−67%) | 1.33 (−15%) | 4.70 (−24%) | 35.9 (−25%) | | 6 | Sparse-input support | 0.158 (−88%) | 0.537 (−62%) | 1.35 (−13%) | 4.94 (−21%) | 36.0 (−25%) | Deltas vs baseline (row 0). #### Notes on benchmark results - Load average varied across runs so there are some artifacts in results like "Code move" improvements. - Galloping search improves the speed when there are more rows in a single bucket significantly. Thorough benchmarking showed exponential upper bound (EUB) becomes faster than linear search at ~32 rows per bucket. Hence we see some performance regressions in the 10 rows per bucket and in the mostly empty bucket cases. - Bounded allocation mostly helps the empty case as expected - Using the heuristic to choose between EUB and linear search helps when rows_per_bucket < 32. It is even more efficient than the baseline due to slightly better branch prediction (improved use of `ARCTICDB_LIKELY` and `ARCTICDB_UNLIKELY`). - Final state: every regime at or faster than baseline; Dense 1000 rows per bucket is the biggest winner with 20x improvement; Mostly empty bucket is the only usecase with no improvement and remains around baseline (+4%) --------- Co-authored-by: Ivo <ivo.dilov@man.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reference Issues/PRs
Optimizations on top of #3091
Used in #3062
What does this implement or fix?
Some micro optimizations on binary search methods:
TypedBlockDatainColumnDataIterator. Instead only keepblock_data_andblock_size_Any other comments?
Benchmarks for all search and iteration methods:
Checklist
Checklist for code changes...