[ntuple] Support multiple column representations in the merger#22017
Draft
silverweed wants to merge 8 commits intoroot-project:masterfrom
Draft
[ntuple] Support multiple column representations in the merger#22017silverweed wants to merge 8 commits intoroot-project:masterfrom
silverweed wants to merge 8 commits intoroot-project:masterfrom
Conversation
Instead of calling continue multiple times in the AddColumnFromField loop, just early return in case of projected fields.
We are currently serializing columns per-field, but in case of late
column extension this might result in inconsistent sorting of the columns
in the serialized footer.
e.g. assume you have fields "A" and "B", both late model extended, both
with a single column:
- col 0 -> field A, repr 0
- col 1 -> field B, repr 0
Now you add a new column representation to field "A"; this new column
has id 2:
- col 2 -> field A, repr 1
When serializing this RNTuple, all columns are written in the footer by
RNTupleSerialize::SerializeColumnsForFields(). Before this change, they
would end up on disk in order: [0, 2, 1].
This would corrupt the data by swapping the pages for columns 2 and 1.
After this change, they get written as [0, 1, 2] which is the correct
order.
Note that this exact case is tested in ntuple_merger in the unit test
MergeDeferredAdvanced.
Also fix the type of result
Internal functionality to be used by the Merger
2accf31 to
b2ae5fc
Compare
b2ae5fc to
1db6b5e
Compare
Test Results 16 files 16 suites 2d 10h 53m 8s ⏱️ For more details on these failures, see this check. Results for commit 1db6b5e. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This Pull request:
Significantly reworks the innards of the RNTupleMerger to support fast merging of fields with different but compatible column representations.
Basically it does two things:
A potentially negative consequence that we might want to revisit is that now the merger won't ever adapt the columns' splitness to the output compression (e.g. if merging changes the source compression from 0 to 505 it will still encode the columns as unsplit, and vice-versa). This will probably be readded in a future PR.
In order to achieve this, some new internal functionality had to be added, most notably
RPagePersistentSink::AddColumnRepresentation.TODO
AddExtendedColumnRangesChecklist: