[ENH] Only load blocks for updated records #6007
Warning: This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Optimize selective record block loading during materialization

The patch moves block prefetching out of

Key Changes
• Invoke

Affected Areas
• rust/worker/src/execution/operators/materialize_logs.rs

This summary was automatically generated by @propel-code-bot
    matches!(
        log.get_operation(),
        MaterializedLogOperation::UpdateExisting
    )
    .then_some(log.get_offset_id())
[Performance] The current filtering for pre-loading data is too restrictive. It only considers `UpdateExisting` operations. However, records with `DeleteExisting` and `OverwriteExisting` operations also require their original data from the record segment to correctly calculate the logical size delta and for segment writers to process them.
Without pre-loading, the data for these records will be fetched individually within the `hydrate` call inside the loop, which negates some of the performance benefit of this optimization.
To fix this, we should pre-load data for all operations on existing records. Since `Initial` operations are already filtered out by `materialize_logs`, this means we should load for all operations except `AddNew`.
File: rust/worker/src/execution/operators/materialize_logs.rs
Line: 87
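The filter the review asks for could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in `materialize_logs.rs`: the enum below is a local stand-in that only reuses the variant names mentioned in the review, and `LogRecord`, its `u32` offset id, and `offsets_to_preload` are hypothetical names introduced for the example. It also assumes, as the review states, that `Initial` records never reach this point.

```rust
/// Local stand-in for the operation enum named in the review.
/// The variant names come from the review text; the real type may differ.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MaterializedLogOperation {
    Initial,
    AddNew,
    OverwriteExisting,
    UpdateExisting,
    DeleteExisting,
}

/// Hypothetical minimal log record, used only for this illustration.
struct LogRecord {
    offset_id: u32,
    operation: MaterializedLogOperation,
}

impl LogRecord {
    fn get_operation(&self) -> MaterializedLogOperation {
        self.operation
    }

    fn get_offset_id(&self) -> u32 {
        self.offset_id
    }
}

/// Collect the offset ids whose existing data should be pre-loaded.
/// Everything except `AddNew` is kept, on the assumption that `Initial`
/// records were already filtered out upstream.
fn offsets_to_preload(logs: &[LogRecord]) -> Vec<u32> {
    logs.iter()
        .filter_map(|log| {
            (!matches!(log.get_operation(), MaterializedLogOperation::AddNew))
                .then_some(log.get_offset_id())
        })
        .collect()
}
```

Given one record per operation type, `offsets_to_preload` would skip only the `AddNew` record, so updates, deletes, and overwrites are all fetched in one batch instead of individually inside the loop.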
Inspected the downstream changes; this may need more optimization to work. Will close for now.

Description of changes
Summarize the changes made by this PR.
Test plan
How are these changes tested?
`pytest` for python, `yarn test` for js, `cargo test` for rust
Migration plan
Are there any migrations, or any forwards/backwards compatibility changes needed in order to make sure this change deploys reliably?
Observability plan
What is the plan to instrument and monitor this change?
Documentation Changes
Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?