Add configurable UTF8 payload profiles to string_agg benchmarks and parameterize benchmark matrix #21437
Open
kosiew wants to merge 4 commits into apache:main from
Introduce string_agg_payloads Criterion group for benchmarking against small_3b, medium_64b, and large_1024b datasets. Update comments for clarity on dataset sizes and comparisons. Implement Utf8PayloadProfile and create_table_provider_with_payload to generate UTF-8 values with varying payload sizes, enhancing the assessment of CPU and memory costs during string aggregation.
Eliminate redundancy in create_context_with_payload by always using create_table_provider_with_payload. Remove the Small special case and extra imports. Replace standalone payload-label helpers with inline tuples, and extract a string_agg_sql helper for cleaner SQL text in benchmark setups. Optimize payload string computation in create_record_batches to run once per dataset build, rather than for each batch. Unify Utf8PayloadProfile::payloads with a single from_fn construction, and inline filler-character logic into payload_string for improved clarity and efficiency.
This commit introduces a `#[allow(dead_code)]` annotation to the `Utf8PayloadProfile` enum in the data_utils module to suppress warnings for unused code. This change aids in maintaining cleaner code while allowing for future extensions without triggering compiler alerts.
Replace the lint-attribute workaround in mod.rs at line 77 with a structural fix. Remove #[allow] and #[expect] from Utf8PayloadProfile. Update create_table_provider_with_payload(...) to reference Utf8PayloadProfile::all(), ensuring all three variants are constructed while keeping the shared bench module free of dead-code warnings across targets.
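The payload generation these commits describe might look roughly like the sketch below. It is not the PR's code: the enum body, the `target_len` method, the `"hi"` prefix argument, and the `'x'` filler character are assumptions made for illustration. Only the names `Utf8PayloadProfile`, `payload_string`, `payloads`, and `all`, and the approximate sizes, come from the text above.

```rust
use std::array;

/// Hypothetical stand-in for the PR's `Utf8PayloadProfile`; the real
/// definition may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Utf8PayloadProfile {
    Small,
    Medium,
    Large,
}

impl Utf8PayloadProfile {
    /// Lists every variant, so all three are constructed in one place.
    fn all() -> [Self; 3] {
        [Self::Small, Self::Medium, Self::Large]
    }

    /// Target payload length in bytes (assumed helper, not named in the PR).
    fn target_len(self) -> usize {
        match self {
            Self::Small => 3,
            Self::Medium => 64,
            Self::Large => 1024,
        }
    }

    /// Four distinct values of equal length, built with a single `from_fn`.
    fn payloads(self) -> [String; 4] {
        array::from_fn(|i| payload_string("hi", i, self.target_len()))
    }
}

/// Builds `prefix` + `index`, then pads with an ASCII filler up to
/// `target_len` bytes. The filler character is an assumption.
fn payload_string(prefix: &str, index: usize, target_len: usize) -> String {
    let mut s = format!("{prefix}{index}");
    // ASCII filler keeps the byte length equal to the character count.
    while s.len() < target_len {
        s.push('x');
    }
    s
}

fn main() {
    for profile in Utf8PayloadProfile::all() {
        let payloads = profile.payloads();
        println!("{:?}: {} values of {} bytes", profile, payloads.len(), payloads[0].len());
    }
}
```

Computing the four strings once per dataset build and cloning them into each batch avoids a `format!` call per row, which is the optimization the second commit message describes.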
Which issue does this PR close?
`string_agg` GroupsAccumulator #21156
Rationale for this change
The current `string_agg` benchmarks use very small (≈3-byte) string payloads (e.g., `hi0`..`hi3`). This makes it difficult to observe the real cost of string aggregation when payloads are larger, particularly the cost of copying and memory pressure during grouped aggregation.
This PR introduces configurable UTF-8 payload sizes so benchmarks can better reflect realistic workloads and expose CPU and memory behavior differences across payload sizes.
What changes are included in this PR?
- Introduced a new `Utf8PayloadProfile` enum with three profiles:
  - `Small` (≈3 bytes, existing baseline)
  - `Medium` (≈64 bytes)
  - `Large` (≈1024 bytes)
- Added `create_table_provider_with_payload` to allow generating tables with configurable string payload sizes.
- Refactored record batch generation to use precomputed payload arrays instead of formatting strings per row.
- Added helpers:
  - `payload_string` for generating fixed-size string payloads
  - `Utf8PayloadProfile::payloads()` for producing 4-value low-cardinality payload sets
- Updated benchmark setup:
  - Introduced `create_context_with_payload`
  - Parameterized `string_agg` queries across:
    - group cardinalities (`few`, `mid`, `many`)
    - payload sizes (`small_3b`, `medium_64b`, `large_1024b`)
- Replaced individual benchmark functions with a `criterion` benchmark group using `BenchmarkId` to produce a matrix of results.
Are these changes tested?
No new unit tests were added.
Reason:
Small) payload profile.
Are there any user-facing changes?
No.
This PR only impacts internal benchmarking infrastructure and does not modify public APIs or query behavior.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.
Additional Notes
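As an illustration of the benchmark matrix described above, the following standard-library-only sketch enumerates one label and one SQL string per cell. The table name `t`, the column names, the `benchmark_matrix` function, and the `group/payload` label format are hypothetical. Only the `few`/`mid`/`many` and `small_3b`/`medium_64b`/`large_1024b` labels and the idea of a `string_agg_sql` helper come from this PR. In the actual benchmark each cell would presumably be registered in a Criterion group through `BenchmarkId::new(function_name, parameter)`.

```rust
/// Hypothetical stand-in for the `string_agg_sql` helper named in the commit
/// messages. The table name `t` and the column names are made up.
fn string_agg_sql(group_column: &str) -> String {
    format!("SELECT {group_column}, string_agg(payload, ',') FROM t GROUP BY {group_column}")
}

/// Builds one (label, SQL) pair per cell of the 3 x 3 benchmark matrix.
fn benchmark_matrix() -> Vec<(String, String)> {
    // Inline tuples of (group label, hypothetical grouping column).
    let groups = [("few", "g_few"), ("mid", "g_mid"), ("many", "g_many")];
    // Payload labels as listed in the PR description.
    let payloads = ["small_3b", "medium_64b", "large_1024b"];

    let mut cells = Vec::with_capacity(groups.len() * payloads.len());
    for (group_label, group_column) in groups {
        for payload_label in payloads {
            cells.push((
                format!("{group_label}/{payload_label}"),
                string_agg_sql(group_column),
            ));
        }
    }
    cells
}

fn main() {
    for (label, sql) in benchmark_matrix() {
        println!("{label}: {sql}");
    }
}
```

In this sketch the SQL text depends only on the grouping column. The payload label would select which pre-built context (for example, one returned by `create_context_with_payload`) the query runs against, so the same query is timed against each payload size.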