Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ntuple] set NEntries in RPagePersistentSink::InitFromDescriptor #17306

Merged
merged 3 commits into from
Jan 14, 2025

Conversation

silverweed
Copy link
Contributor

Also add a bunch of reserve() where appropriate

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

Copy link
Contributor

@jblomer jblomer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link

github-actions bot commented Dec 19, 2024

Test Results

    17 files      17 suites   4d 4h 27m 18s ⏱️
 2 694 tests  2 694 ✅ 0 💤 0 ❌
44 129 runs  44 129 ✅ 0 💤 0 ❌

Results for commit 525ef54.

♻️ This comment has been updated with latest results.

Copy link
Member

@hahnjo hahnjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also for the record, I'm getting less sure about all those reserves sprinkled in; at least for me, they really decrease code readability for oftentimes questionable benefit...)

Comment on lines 951 to 953
// Even though the number of entries are set when we call AddClusterGroup, we want them to be correct already
// at the end of this function.
fDescriptorBuilder.AddNEntries(nEntries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you comment on why we want this? This now makes InitFromDescriptor different from normal descriptor building where NEntries is only updated when adding a cluster group. This is also documented in the comment of RNTupleDescriptor::fNEntries, which would need at least some updating with this change. More generally, for knowing the number of "entries in flight", we can just look at fPrevClusterNEntries, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason has to do with support for incremental merging with late model extension: currently if you open an existing RNTuple with the RNTupleMerger and you try to late-model extend it, it will call RPageSink::UpdateSchema passing the number of entries processed so far, which doesn't include the number of previously-existing entries in the destination.

In theory this information is already stored in fPrevClusterNEntries when initting the Sink from the descriptor, but it's not accessible. I figured it would make more sense from a user standpoint to have GetNEntries() return the correct number of already-existing entries immediately after initting the descriptor, rather than creating a separate getter specifically for the Sink (which would only make sense for the PersistentSink at that).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, changing fNEntries in normal descriptor building earlier than the cluster group may give us problems later on with sharded clusters. We could also ignore them for now and worry later.

Looking at this again, I think we have two other options:

  1. We could create a separate getter in the sink. After all, incremental merging was the use case I added RPageSink::GetDescriptor() for, which returns an empty descriptor for example for RPageNullSink. I think adding another one would be acceptable.
  2. In InitFromDescriptor, we could in principle commit a cluster group at the end, which would update fNEntries in the descriptor. When I added support for incremental merging, we decided to fully flatten out all cluster groups in the input - we could potentially revisit this.

@silverweed silverweed force-pushed the ntuple_pagestorage_fixes branch 4 times, most recently from 287563b to 3002d75 Compare January 9, 2025 12:05
@silverweed silverweed requested a review from hahnjo January 9, 2025 13:23
hahnjo
hahnjo previously requested changes Jan 13, 2025
Comment on lines +861 to +862
fOpenColumnRanges.reserve(fOpenColumnRanges.size() + (nColumns - nColumnsBeforeUpdate));
fOpenPageRanges.reserve(fOpenPageRanges.size() + (nColumns - nColumnsBeforeUpdate));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these really provide any measurable improvement? IMHO it should at least be a separate commit, maybe even a separate PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reserving an array when the size is known a priori is just the correct thing to do and there should be a very good reason not to do so, rather than having to justify doing it. I don't agree with the fact that these hinder readability. Otherwise, for the same reasoning, we should provide measurements every time we choose to pass a const std::string & rather than a std::string and similar.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can agree to disagree. It's an extra line with an argument that you need to use brain power to understand and hopefully get right, where in many cases it doesn't buy you anything measurable. I think the comparison to passing by reference doesn't make sense because that one a) is the same line, so it's not extra, and b) there is a semantic difference between passing a const-ref and creating a copy.

@hahnjo hahnjo requested a review from jblomer January 13, 2025 07:54
@silverweed silverweed force-pushed the ntuple_pagestorage_fixes branch from 3002d75 to 9835693 Compare January 13, 2025 12:29
silverweed and others added 2 commits January 13, 2025 13:32
This is needed by the RNTupleMerger to properly know the initial
number of entries in the destination sink in case of incremental
merging. Since the Descriptor's NEntries is not updated until the
first cluster group is committed - and since we don't commit a cluster
group in InitFromDescriptor() - it cannot be used for that purpose.
@silverweed silverweed force-pushed the ntuple_pagestorage_fixes branch from 9835693 to 525ef54 Compare January 13, 2025 12:32
@hahnjo hahnjo dismissed their stale review January 14, 2025 07:57

dismiss

@silverweed silverweed merged commit f801abd into root-project:master Jan 14, 2025
20 checks passed
@silverweed silverweed deleted the ntuple_pagestorage_fixes branch January 14, 2025 08:32
hahnjo added a commit to hahnjo/rntuple-apps that referenced this pull request Jan 22, 2025
After root-project/root#17306; the sinks
do not have access to the descriptor, nor the total number of entries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants