
Add configurable UTF8 payload profiles to string_agg benchmarks and parameterize benchmark matrix#21437

Open
kosiew wants to merge 4 commits into apache:main from kosiew:deferredcopying-01-21156


@kosiew kosiew commented Apr 7, 2026

Which issue does this PR close?


Rationale for this change

The current string_agg benchmarks use very small (≈3-byte) string payloads (e.g., hi0..hi3). This makes it difficult to observe the real cost of string aggregation in scenarios where payload sizes are larger, particularly the cost of copying and memory pressure during grouped aggregation.

This PR introduces configurable UTF-8 payload sizes so benchmarks can better reflect realistic workloads and expose CPU and memory behavior differences across payload sizes.


What changes are included in this PR?

  • Introduced a new Utf8PayloadProfile enum with three profiles:

    • Small (≈3 bytes, existing baseline)
    • Medium (≈64 bytes)
    • Large (≈1024 bytes)
  • Added create_table_provider_with_payload to allow generating tables with configurable string payload sizes.

  • Refactored record batch generation to use precomputed payload arrays instead of formatting strings per row.

  • Added helpers:

    • payload_string for generating fixed-size string payloads
    • Utf8PayloadProfile::payloads() for producing 4-value low-cardinality payload sets
  • Updated benchmark setup:

    • Introduced create_context_with_payload

    • Parameterized string_agg queries across:

      • group cardinality (few, mid, many)
      • payload size (small_3b, medium_64b, large_1024b)
  • Replaced individual benchmark functions with a criterion benchmark group using BenchmarkId to produce a matrix of results.
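The payload helpers listed above could look roughly like the sketch below. It is an illustration written from the description, not the code in this PR: the `target_len`, `label`, and `all` method bodies, the `x` filler character, and the exact byte lengths (the description only says "≈") are assumptions. The `hiN` prefix follows the existing `hi0..hi3` baseline values.

```rust
/// Payload size profiles named in the PR description.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8PayloadProfile {
    Small,
    Medium,
    Large,
}

impl Utf8PayloadProfile {
    /// Every profile, in ascending payload size.
    pub fn all() -> [Utf8PayloadProfile; 3] {
        [Self::Small, Self::Medium, Self::Large]
    }

    /// Target byte length per value (assumed exact for this sketch).
    pub fn target_len(self) -> usize {
        match self {
            Self::Small => 3,
            Self::Medium => 64,
            Self::Large => 1024,
        }
    }

    /// Label used to identify the profile in benchmark output.
    pub fn label(self) -> &'static str {
        match self {
            Self::Small => "small_3b",
            Self::Medium => "medium_64b",
            Self::Large => "large_1024b",
        }
    }

    /// Four distinct values of the target length (low cardinality).
    pub fn payloads(self) -> [String; 4] {
        std::array::from_fn(|i| payload_string(i, self.target_len()))
    }
}

/// Builds an ASCII string of exactly `len` bytes: an `hi{index}` prefix
/// padded with a filler character (hypothetical choice: 'x').
pub fn payload_string(index: usize, len: usize) -> String {
    let mut s = format!("hi{index}");
    while s.len() < len {
        s.push('x');
    }
    // The string is pure ASCII, so truncating at a byte offset is safe.
    s.truncate(len);
    s
}
```

Because the four values are computed once per profile, record batch generation can index into the array by row instead of calling `format!` for every row.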


Are these changes tested?

No new unit tests were added.

Reason:

  • This change only affects benchmark utilities and benchmark definitions.
  • Existing functionality remains unchanged for the default (Small) payload profile.
  • Running the benchmarks exercises the new code paths, which serves as validation of correctness and performance behavior.

Are there any user-facing changes?

No.

This PR only impacts internal benchmarking infrastructure and does not modify public APIs or query behavior.


LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.


Additional Notes

  • The benchmark matrix now isolates the impact of payload size while preserving low cardinality (4 distinct values), ensuring that observed differences are primarily due to string size rather than grouping distribution.
  • Medium payloads aim to expose copy costs without excessive allocator overhead, while large payloads stress both CPU and memory behavior.
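To make the matrix and the low-cardinality point concrete, here is a small std-only sketch. It only enumerates the nine group/payload cells and fills a column by cycling through precomputed payloads. The actual benchmarks register these cells through Criterion's `BenchmarkId`, which is not reproduced here, and the function names `matrix_ids`, `fill_column`, and `column_bytes` are hypothetical.

```rust
/// Group-cardinality labels from the PR description.
static GROUP_LABELS: [&str; 3] = ["few", "mid", "many"];
/// Payload-size labels from the PR description.
static PAYLOAD_LABELS: [&str; 3] = ["small_3b", "medium_64b", "large_1024b"];

/// Enumerates every cell of the matrix as "group/payload".
pub fn matrix_ids() -> Vec<String> {
    GROUP_LABELS
        .iter()
        .flat_map(|g| PAYLOAD_LABELS.iter().map(move |p| format!("{g}/{p}")))
        .collect()
}

/// Fills a column of `rows` values by cycling through the precomputed
/// payloads, so the number of distinct values never exceeds
/// `payloads.len()`. `payloads` must be non-empty when `rows > 0`.
pub fn fill_column(payloads: &[String], rows: usize) -> Vec<&str> {
    (0..rows)
        .map(|row| payloads[row % payloads.len()].as_str())
        .collect()
}

/// Total number of string bytes held by a column.
pub fn column_bytes(column: &[&str]) -> usize {
    column.iter().map(|s| s.len()).sum()
}
```

With the distinct-value count pinned at four, the string bytes in a column grow linearly with payload length for a fixed row count, which is the variable the matrix is meant to isolate.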

kosiew added 2 commits April 7, 2026 16:12
Introduce string_agg_payloads Criterion group for benchmarking
against small_3b, medium_64b, and large_1024b datasets.
Update comments for clarity on dataset sizes and comparisons.

Implement Utf8PayloadProfile and create_table_provider_with_payload
to generate UTF-8 values with varying payload sizes, enhancing
the assessment of CPU and memory costs during string aggregation.
Eliminate redundancy in create_context_with_payload by always
using create_table_provider_with_payload. Remove the Small
special case and extra imports. Replace standalone payload-label
helpers with inline tuples, and extract a string_agg_sql
helper for cleaner SQL text in benchmark setups.

Optimize payload string computation in create_record_batches
to run once per dataset build, rather than for each batch.
Unify Utf8PayloadProfile::payloads with a single from_fn
construction, and inline filler-character logic into
payload_string for improved clarity and efficiency.
github-actions bot added the core (Core DataFusion crate) label on Apr 7, 2026
kosiew added 2 commits April 7, 2026 20:16
This commit introduces a `#[allow(dead_code)]` annotation to the `Utf8PayloadProfile` enum in the data_utils module to suppress warnings for unused code. This change aids in maintaining cleaner code while allowing for future extensions without triggering compiler alerts.
Replace the lint-attribute workaround in mod.rs at line 77
with a structural fix. Remove #[allow] and #[expect] from
Utf8PayloadProfile. Update create_table_provider_with_payload(...)
to reference Utf8PayloadProfile::all(), ensuring all three
variants are constructed while keeping the shared bench module
free of dead-code warnings across targets.
