Add configurable UTF8 payload profiles to string_agg benchmarks and parameterize benchmark matrix by kosiew · Pull Request #21437 · apache/datafusion

kosiew · 2026-04-07T12:04:21Z

Which issue does this PR close?

Part of #21156 (Consider deferred copying in string_agg GroupsAccumulator)

Rationale for this change

The current string_agg benchmarks use very small (≈3-byte) string payloads (e.g., hi0..hi3). This makes it difficult to observe the real cost of string aggregation in scenarios where payload sizes are larger, particularly the cost of copying and memory pressure during grouped aggregation.

This PR introduces configurable UTF-8 payload sizes so benchmarks can better reflect realistic workloads and expose CPU and memory behavior differences across payload sizes.

What changes are included in this PR?

- Introduced a new Utf8PayloadProfile enum with three profiles:
  - Small (≈3 bytes, existing baseline)
  - Medium (≈64 bytes)
  - Large (≈1024 bytes)
- Added create_table_provider_with_payload to allow generating tables with configurable string payload sizes.
- Refactored record batch generation to use precomputed payload arrays instead of formatting strings per row.
- Added helpers:
  - payload_string for generating fixed-size string payloads
  - Utf8PayloadProfile::payloads() for producing 4-value low-cardinality payload sets
- Updated benchmark setup:
  - Introduced create_context_with_payload
  - Parameterized string_agg queries across:
    - group cardinality (few, mid, many)
    - payload size (small_3b, medium_64b, large_1024b)
- Replaced individual benchmark functions with a Criterion benchmark group using BenchmarkId to produce a matrix of results.
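For readers who want a feel for the shape of these helpers, here is a minimal sketch using only the standard library. The names Utf8PayloadProfile, payloads(), all(), and payload_string come from this PR's description and commit messages; target_len, the exact padding scheme, and the filler characters are assumptions made for illustration and may differ from the code in the diff.

```rust
/// Illustrative reconstruction of the payload profile; not the PR's exact code.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf8PayloadProfile {
    Small,
    Medium,
    Large,
}

impl Utf8PayloadProfile {
    /// All variants, in increasing payload size.
    pub fn all() -> [Utf8PayloadProfile; 3] {
        [Self::Small, Self::Medium, Self::Large]
    }

    /// Target payload length in bytes (hypothetical helper).
    /// Payloads are ASCII, so bytes and characters coincide.
    pub fn target_len(self) -> usize {
        match self {
            Self::Small => 3,
            Self::Medium => 64,
            Self::Large => 1024,
        }
    }

    /// Four distinct values of equal length, keeping cardinality low.
    pub fn payloads(self) -> [String; 4] {
        std::array::from_fn(|i| payload_string(i, self.target_len()))
    }
}

/// Builds an ASCII string of exactly `len` bytes: a short prefix that
/// differs per `index`, padded with a per-index filler character.
/// The padding scheme is an assumption for this sketch.
pub fn payload_string(index: usize, len: usize) -> String {
    let mut s = format!("hi{index}");
    s.truncate(len);
    let filler = (b'a' + (index % 26) as u8) as char;
    while s.len() < len {
        s.push(filler);
    }
    s
}
```

Under these assumptions the Small profile reproduces the existing hi0..hi3 baseline, while Medium and Large yield four distinct strings of 64 and 1024 bytes respectively.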

Are these changes tested?

No new unit tests were added.

Reason:

- This change only affects benchmark utilities and benchmark definitions.
- Existing functionality remains unchanged for the default (Small) payload profile.
- The benchmarks themselves act as validation for correctness and performance behavior.

Are there any user-facing changes?

No.

This PR only impacts internal benchmarking infrastructure and does not modify public APIs or query behavior.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Additional Notes

The benchmark matrix now isolates the impact of payload size while preserving low cardinality (4 distinct values), ensuring that observed differences are primarily due to string size rather than grouping distribution.
Medium payloads aim to expose copy costs without excessive allocator overhead, while large payloads stress both CPU and memory behavior.
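To make the isolation argument concrete, the sketch below cycles rows through four precomputed payloads and computes how many bytes a string_agg result for one group would contain. Both function names and the cycling rule (row index modulo 4) are hypothetical; the point is only that, with the row-to-value mapping held fixed, the output size scales with payload length and nothing else.

```rust
/// Picks the payload for a row by cycling through the four precomputed values.
/// Hypothetical helper; the PR may map rows to payloads differently.
pub fn payload_for_row(payloads: &[String; 4], row: usize) -> &str {
    &payloads[row % payloads.len()]
}

/// Number of bytes in the concatenated result for one group of `rows` rows,
/// joined with `sep` (one separator between each adjacent pair of values).
pub fn aggregated_len(payloads: &[String; 4], rows: usize, sep: &str) -> usize {
    let data: usize = (0..rows)
        .map(|r| payload_for_row(payloads, r).len())
        .sum();
    data + sep.len() * rows.saturating_sub(1)
}
```

For example, a group of 8 rows joined with "," produces 8 × 3 + 7 = 31 bytes with 3-byte payloads and 8 × 1024 + 7 = 8199 bytes with 1024-byte payloads, even though the grouping and the number of distinct values are identical.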

Introduce string_agg_payloads Criterion group for benchmarking against small_3b, medium_64b, and large_1024b datasets. Update comments for clarity on dataset sizes and comparisons. Implement Utf8PayloadProfile and create_table_provider_with_payload to generate UTF-8 values with varying payload sizes, enhancing the assessment of CPU and memory costs during string aggregation.

Eliminate redundancy in create_context_with_payload by always using create_table_provider_with_payload. Remove the Small special case and extra imports. Replace standalone payload-label helpers with inline tuples, and extract a string_agg_sql helper for cleaner SQL text in benchmark setups. Optimize payload string computation in create_record_batches to run once per dataset build, rather than for each batch. Unify Utf8PayloadProfile::payloads with a single from_fn construction, and inline filler-character logic into payload_string for improved clarity and efficiency.
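As a rough illustration of the inline label tuples and the string_agg_sql helper mentioned in this commit, the following standard-library sketch enumerates the 3 × 3 benchmark matrix. The labels few/mid/many and small_3b/medium_64b/large_1024b are taken from the PR description; the table name, the column names, the SQL text, and the benchmark_matrix function are placeholders invented for this example. In the PR itself the ids are built with Criterion's BenchmarkId rather than plain strings.

```rust
/// Hypothetical SQL builder; table `t` and column `payload` are placeholders.
pub fn string_agg_sql(group_col: &str) -> String {
    format!("SELECT {group_col}, string_agg(payload, ',') FROM t GROUP BY {group_col}")
}

/// (label, grouping column) pairs; the column names are placeholders.
pub const GROUPS: [(&str, &str); 3] = [
    ("few", "group_few"),
    ("mid", "group_mid"),
    ("many", "group_many"),
];

/// Payload labels as listed in the PR description.
pub const PAYLOADS: [&str; 3] = ["small_3b", "medium_64b", "large_1024b"];

/// Produces one (benchmark id, SQL text) pair per matrix cell,
/// iterating group cardinality in the outer loop and payload size in the inner.
pub fn benchmark_matrix() -> Vec<(String, String)> {
    let mut cases = Vec::new();
    for (group_label, group_col) in GROUPS {
        for payload_label in PAYLOADS {
            cases.push((
                format!("{group_label}/{payload_label}"),
                string_agg_sql(group_col),
            ));
        }
    }
    cases
}
```

Keeping the labels as inline tuples means that adding a cardinality or payload size is a one-line change, and every cell of the matrix shares the same SQL template.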

This commit adds a `#[allow(dead_code)]` annotation to the `Utf8PayloadProfile` enum in the data_utils module to suppress unused-code warnings.

Replace the lint-attribute workaround in mod.rs at line 77 with a structural fix. Remove #[allow] and #[expect] from Utf8PayloadProfile. Update create_table_provider_with_payload(...) to reference Utf8PayloadProfile::all(), ensuring all three variants are constructed while keeping the shared bench module free of dead-code warnings across targets.

kosiew added 2 commits April 7, 2026 16:12

github-actions bot added the core (Core DataFusion crate) label Apr 7, 2026

kosiew added 2 commits April 7, 2026 20:16

kosiew marked this pull request as ready for review April 8, 2026 13:55

kosiew mentioned this pull request Apr 8, 2026

Hybrid eager/deferred accumulation for string_agg GroupsAccumulator to reduce copying and memory usage #21469

Draft

kosiew wants to merge 4 commits into apache:main from kosiew:deferredcopying-01-21156
