[ntuple] Support multiple column representations in the merger #22017

Draft
silverweed wants to merge 8 commits into root-project:master from silverweed:ntuple_merge_colrep2

@silverweed silverweed commented Apr 22, 2026

This Pull request:

Significantly reworks the innards of the RNTupleMerger to support fast merging of fields with different but compatible column representations.
Basically it does two things:

  • turns all L3 merging cases into L2/L1.
  • no longer rejects merging fields with different column representations (previously this was only supported for representations that were the split/unsplit version of each other, and only via L3 merging).

A potential downside, which we might want to revisit, is that the merger no longer adapts the columns' splitness to the output compression (e.g. if merging changes the compression from 0 to 505, the columns are still encoded as unsplit, and vice versa). This will probably be re-added in a future PR.
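To make the splitness trade-off concrete, here is a minimal sketch in plain C++. It does not use the ROOT API: `Encoding`, `PreferredEncodingFor` and `MergedEncoding` are hypothetical names, and the policy "split whenever the output is compressed" is an assumption inferred from the example in the paragraph above.

```cpp
#include <cstdint>

// Hypothetical model of a column's on-disk encoding; not a ROOT type.
enum class Encoding { kUnsplit, kSplit };

// Assumed adaptive policy: uncompressed data (setting 0) stays unsplit,
// any non-zero compression setting prefers the split encoding.
constexpr Encoding PreferredEncodingFor(std::uint32_t compression)
{
   return compression == 0 ? Encoding::kUnsplit : Encoding::kSplit;
}

// Behaviour described in this PR: the merged column keeps the encoding
// of its source, whatever the output compression is.
constexpr Encoding MergedEncoding(Encoding source, std::uint32_t /*outputCompression*/)
{
   return source;
}
```

With this model, a source written with compression 0 and merged into an output with compression 505 stays `kUnsplit`, although the adaptive policy would have picked `kSplit` for that output.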

In order to achieve this, some new internal functionality had to be added, most notably RPagePersistentSink::AddColumnRepresentation.

TODO

  • check if we need a feature flag for the changes in AddExtendedColumnRanges
  • add a test for merging of Real32Trunc/Quant columns with different bit width/value range
  • properly split the big merger commit
  • update Merging.md

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

Instead of calling continue multiple times in the AddColumnFromField
loop, just early return in case of projected fields.

We are currently serializing columns per-field, but in case of late
column extension this might result in inconsistent sorting of the columns
in the serialized footer.

e.g. assume you have fields "A" and "B", both late model extended, both
with a single column:
    - col 0 -> field A, repr 0
    - col 1 -> field B, repr 0

Now you add a new column representation to field "A"; this new column
has id 2:
    - col 2 -> field A, repr 1

When serializing this RNTuple, all columns are written in the footer by
RNTupleSerialize::SerializeColumnsForFields(). Before this change, they
would end up on disk in order: [0, 2, 1].
This would corrupt the data by swapping the pages for columns 2 and 1.

After this change, they get written as [0, 1, 2] which is the correct
order.
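
The ordering problem can be reproduced with a small standalone sketch. `ColumnInfo`, `OrderPerField`, `OrderById` and `ExampleColumns` are hypothetical helpers written for illustration, not the serializer's code, and sorting by id is just one way to obtain the [0, 1, 2] order; the actual change may achieve it differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a column descriptor.
struct ColumnInfo {
   std::uint64_t id;      // column id, assigned in creation order
   std::string field;     // name of the owning field
   std::uint32_t reprIdx; // representation index within the field
};

// Per-field traversal: all columns of one field, then the next field.
std::vector<std::uint64_t> OrderPerField(const std::vector<ColumnInfo> &cols,
                                         const std::vector<std::string> &fields)
{
   std::vector<std::uint64_t> out;
   for (const auto &f : fields)
      for (const auto &c : cols)
         if (c.field == f)
            out.push_back(c.id);
   // Every column must belong to one of the listed fields.
   assert(out.size() == cols.size());
   return out;
}

// Id-ordered traversal: columns sorted by ascending id.
std::vector<std::uint64_t> OrderById(std::vector<ColumnInfo> cols)
{
   std::sort(cols.begin(), cols.end(),
             [](const ColumnInfo &a, const ColumnInfo &b) { return a.id < b.id; });
   std::vector<std::uint64_t> out;
   for (const auto &c : cols)
      out.push_back(c.id);
   return out;
}

// The three columns from the example above.
std::vector<ColumnInfo> ExampleColumns()
{
   return {{0, "A", 0}, {1, "B", 0}, {2, "A", 1}};
}
```

Calling `OrderPerField(ExampleColumns(), {"A", "B"})` yields the ids 0, 2, 1, while `OrderById(ExampleColumns())` yields 0, 1, 2.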

Note that this exact case is tested in ntuple_merger in the unit test
MergeDeferredAdvanced.

Internal functionality to be used by the Merger

silverweed requested a review from jblomer as a code owner April 22, 2026 15:07
silverweed marked this pull request as draft April 22, 2026 15:07
silverweed changed the title from "Ntuple merge colrep2" to "[ntuple] Support multiple column representations in the merger" April 22, 2026
silverweed self-assigned this April 22, 2026
silverweed force-pushed the ntuple_merge_colrep2 branch 2 times, most recently from 2accf31 to b2ae5fc April 22, 2026 15:22
silverweed force-pushed the ntuple_merge_colrep2 branch from b2ae5fc to 1db6b5e April 22, 2026 15:22
@github-actions

Test Results

    16 files      16 suites   2d 10h 53m 8s ⏱️
 3 843 tests  3 841 ✅ 0 💤  2 ❌
54 767 runs  54 751 ✅ 0 💤 16 ❌

For more details on these failures, see this check.

Results for commit 1db6b5e.
