[ntuple] set NEntries in RPagePersistentSink::InitFromDescriptor #17306
LGTM
Test Results: 17 files, 17 suites, 4d 4h 27m 18s. Results for commit 525ef54.
(also for the record, I'm getting less sure about all those reserve() calls sprinkled in; at least for me, they really decrease code readability for oftentimes questionable benefit...)
tree/ntuple/v7/src/RPageStorage.cxx
// Even though the number of entries are set when we call AddClusterGroup, we want them to be correct already
// at the end of this function.
fDescriptorBuilder.AddNEntries(nEntries);
Can you comment on why we want this? This now makes InitFromDescriptor different from normal descriptor building, where NEntries is only updated when adding a cluster group. This is also documented in the comment of RNTupleDescriptor::fNEntries, which would need at least some updating with this change. More generally, for knowing the number of "entries in flight", we can just look at fPrevClusterNEntries, no?
The reason has to do with support for incremental merging with late model extension: currently, if you open an existing RNTuple with the RNTupleMerger and try to late-model extend it, it will call RPageSink::UpdateSchema passing the number of entries processed so far, which doesn't include the number of previously-existing entries in the destination.
In theory this information is already stored in fPrevClusterNEntries when initializing the Sink from the descriptor, but it's not accessible. I figured it would make more sense from a user standpoint to have GetNEntries() return the correct number of already-existing entries immediately after initializing from the descriptor, rather than creating a separate getter specifically for the Sink (which would only make sense for the PersistentSink at that).
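The offset problem described above can be sketched with a toy model. This is only an illustration under stated assumptions: ToyDescriptor, ToySink, the setNEntries flag and FirstEntryForExtension are hypothetical stand-ins invented here, not the real ROOT classes or signatures; only the member name fPrevClusterNEntries and the function names InitFromDescriptor and UpdateSchema are borrowed from the discussion.

```cpp
#include <cstdint>

// Hypothetical stand-in for the descriptor; NOT the real RNTupleDescriptor.
struct ToyDescriptor {
   std::uint64_t fNEntries = 0; // normally advanced only when a cluster group is committed
};

// Hypothetical stand-in for a persistent sink; NOT the real RPagePersistentSink.
struct ToySink {
   ToyDescriptor fDescriptor;
   std::uint64_t fPrevClusterNEntries = 0;     // entries already in the destination
   std::uint64_t fLastExtensionFirstEntry = 0; // where the last schema extension started

   // Mimics initializing the sink from an existing descriptor.
   // setNEntries toggles the behavior proposed in this PR (assumption for illustration).
   void InitFromDescriptor(std::uint64_t nExistingEntries, bool setNEntries)
   {
      fPrevClusterNEntries = nExistingEntries;
      if (setNEntries)
         fDescriptor.fNEntries = nExistingEntries;
   }

   // Mimics a late model extension: the new columns start at firstEntry.
   void UpdateSchema(std::uint64_t firstEntry) { fLastExtensionFirstEntry = firstEntry; }
};

// The merger only knows how many entries it processed itself, so it has to
// add whatever the descriptor reports as already existing.
inline std::uint64_t FirstEntryForExtension(const ToySink &sink, std::uint64_t nProcessedByMerger)
{
   return sink.fDescriptor.fNEntries + nProcessedByMerger;
}
```

In this toy model, a destination with 100 existing entries and 5 entries processed by the merger yields a first entry of 5 when the descriptor count is left untouched, and 105 when it is set during initialization.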
As discussed offline, changing fNEntries in normal descriptor building earlier than the cluster group may give us problems later on with sharded clusters. We could also ignore them for now and worry later.
Looking at this again, I think we have two other options:
- We could create a separate getter in the sink. After all, incremental merging was the use case I added RPageSink::GetDescriptor() for, which returns an empty descriptor for example for RPageNullSink. I think adding another one would be acceptable.
- In InitFromDescriptor, we could in principle commit a cluster group at the end, which would update fNEntries in the descriptor. When I added support for incremental merging, we decided to fully flatten out all cluster groups in the input - we could potentially revisit this.
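A minimal sketch of the first option, a separate getter on the sink, might look as follows. All class names and the getter GetNExistingEntries are hypothetical and chosen only for illustration; the real RPageSink interface may differ.

```cpp
#include <cstdint>

// Hypothetical sink interface; the getter name is invented for this sketch.
class ToyPageSink {
public:
   virtual ~ToyPageSink() = default;
   // Sinks that hold no pre-existing data simply report zero.
   virtual std::uint64_t GetNExistingEntries() const { return 0; }
};

// Stand-in for a null sink: inherits the default of zero entries.
class ToyNullSink final : public ToyPageSink {};

// Stand-in for a persistent sink that remembers what was already stored.
class ToyPersistentSink final : public ToyPageSink {
   std::uint64_t fPrevClusterNEntries = 0;

public:
   void InitFromDescriptor(std::uint64_t nExistingEntries) { fPrevClusterNEntries = nExistingEntries; }
   std::uint64_t GetNExistingEntries() const override { return fPrevClusterNEntries; }
};
```

With this shape the descriptor's entry count is left alone, and a caller such as a merger would query the sink directly for the offset.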
287563b to 3002d75
fOpenColumnRanges.reserve(fOpenColumnRanges.size() + (nColumns - nColumnsBeforeUpdate));
fOpenPageRanges.reserve(fOpenPageRanges.size() + (nColumns - nColumnsBeforeUpdate));
Do these really provide any measurable improvement? IMHO it should at least be a separate commit, maybe even a separate PR
Reserving an array when the size is known a priori is just the correct thing to do and there should be a very good reason not to do so, rather than having to justify doing it. I don't agree with the fact that these hinder readability. Otherwise, for the same reasoning, we should provide measurements every time we choose to pass a const std::string & rather than a std::string and similar.
We can agree to disagree. It's an extra line with an argument that you need to use brain power to understand and hopefully get right, where in many cases it doesn't buy you anything measurable. I think the comparison to passing by reference doesn't make sense because that one a) is the same line, so it's not extra, and b) there is a semantic difference between passing a const-ref and creating a copy.
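For readers following the reserve() discussion, the pattern under debate can be sketched in isolation. AppendDefault is a hypothetical helper written for this illustration and is not part of the PR; it only shows that the reserve argument must be the final size (current size plus the number of new elements), which is the part one has to get right.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: append nNew default-constructed elements to `ranges`,
// reserving the final size up front so the loop does not reallocate.
template <typename T>
void AppendDefault(std::vector<T> &ranges, std::size_t nNew)
{
   // reserve() takes the total capacity wanted, not the increment;
   // it changes capacity() only and leaves size() untouched.
   ranges.reserve(ranges.size() + nNew);
   for (std::size_t i = 0; i < nNew; ++i)
      ranges.emplace_back();
}
```

Whether the saved reallocations are measurable depends on how many elements are added and how often; the sketch takes no position on that.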
3002d75 to 9835693
This is needed by the RNTupleMerger to properly know the initial number of entries in the destination sink in case of incremental merging. Since the Descriptor's NEntries is not updated until the first cluster group is committed - and since we don't commit a cluster group in InitFromDescriptor() - it cannot be used for that purpose.
9835693 to 525ef54
After root-project/root#17306, the sinks do not have access to the descriptor, nor to the total number of entries.
Also add a bunch of reserve() where appropriate.