-
Notifications
You must be signed in to change notification settings - Fork 651
Data migrations worker rework #27014
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Data migrations worker rework #27014
Conversation
CI test resultstest results on build#69755
test results on build#69827
test results on build#69949
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this fixing a bug we can reference to justify the backport or is ther more context around the motivation for backproting this?
This reverts commit 48f2cce.
03bd809
to
ee1c027
Compare
@dotnwat just the bug. It invokes an UB, and we're lucky (or ignorant) it didn't result in anything serious. Should I add a release note line about it? |
@bashtanov can you add a bit more detailed commit message for the second commit in this PR ? Please provide the motivation for changes and describe the idea behind the new work tracking logic |
Previously work tracking logic was flawed. When for an NTP a work was already running and backend requested another work (e.g. for a different stage or for a new migration), it was handled inappropriately: 1) existing running work was only aborted by triggering abort source, but not waited to actually complete; 2) work info, which is supplementary data a work uses, was overwritten by the new one; the old work which was still running might access deallocated or reused memory where the old work info was; 3) per-work abort sources were not in use, only main one was Reorganised logic: 1) allow no more than one running work per NTP; 2) store its belongigns separately from the one requested if they are different; 3) use both main and individual abort sources
ee1c027
to
3acb6f0
Compare
confluentinc/librdkafka#4963 Failed to fetch committed offsets for 0 partition(s)
3acb6f0
to
e272f5f
Compare
Can you expand on this and why a refactor is needed as opposed to backporting a fix for UB and refactoring in upstream? |
@dotnwat I've updated the main commit message and the PR description accordingly. I cannot think of a way to eliminate the UB without major chages in the logic. We do need to store up to two "work info" objects per NTP, and the rest of the change is about juggling them correctly. |
I'm probably vastly simplifying things, but
Sounds like it's just a matter of protecting a shared data structure? |
Invalid memory access is not the only problem that needs to be fixed. Allowing concurrent work on the same NTP is also wrong because they can conflict or run in the wrong order. E.g. imagine a migration progressing up to some point and then cancelled. Quite possibly operations on affected partitions will be something opposite there, e.g. block writes when preparing to migrate away and then unblock to cancel the migration.
Well, it is not shared, we need to store both separately, as one is needed for the running task and the other one for the requested one. We need them both physically on the shard, so we either need to alter To protect from concurrent execution we would need a mutex. It would introduce an implicit queue of those waiting for it, while in reality we need a simpler logic, as only the last one in the queue is needed. All in all, my attempts to make less changes resulted in logic still quite broken or at least in much less confidence in its correctness. |
Previously work tracking logic was flawed. When for an NTP a work was
already running and backend requested another work (e.g. for a different
stage or for a new migration), it was handled inappropriately:
but not waited to actually complete;
by the new one; the old work which was still running might access
deallocated or reused memory where the old work info was;
Reorganised logic:
different;
Backports Required
Release Notes