Data migrations worker rework #27014


Open
bashtanov wants to merge 3 commits into dev from data-migrations-worker-rework

Conversation


@bashtanov bashtanov commented Jul 28, 2025

Previously the work tracking logic was flawed. When a work was already running
for an NTP and the backend requested another work (e.g. for a different stage
or for a new migration), the request was handled inappropriately:

  1. the existing running work was only aborted by triggering the abort source,
    but was not waited on to actually complete;
  2. the work info, which is supplementary data a work uses, was overwritten
    by the new one; the old work, which was still running, might access
    deallocated or reused memory where the old work info used to be;
  3. per-work abort sources were not in use, only the main one was

Reorganised logic (a minimal sketch follows the list):

  1. allow no more than one running work per NTP;
  2. store its belongings separately from those of the requested work, if they
    are different;
  3. use both the main and the individual abort sources
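
To make the reorganised tracking concrete, here is a minimal, self-contained C++ sketch of the idea. All names in it (work_info, ntp_state, worker) are illustrative rather than the actual Redpanda identifiers, and std::stop_source stands in for Seastar's abort_source; treat it as a sketch of the intended behaviour, not the implementation.

```cpp
// Minimal, self-contained sketch of per-NTP work tracking.
// Names are illustrative, not the real Redpanda identifiers;
// std::stop_source stands in for an abort source.
#include <cstdint>
#include <optional>
#include <stop_token>
#include <utility>

struct work_info {
    uint64_t migration_id;
    int stage;
    std::stop_source abort; // individual (per-work) abort source
};

struct ntp_state {
    // At most one work runs per NTP; its supplementary data lives here
    // until the work has actually completed, so it can never be freed
    // from under a still-running work.
    std::optional<work_info> running;
    // A different work requested while one is still running is parked
    // here instead of overwriting `running`.
    std::optional<work_info> requested;
};

class worker {
public:
    // The backend requests (possibly new) work for this NTP.
    void request_work(ntp_state& st, work_info info) {
        if (!st.running) {
            st.running = std::move(info);
            start(*st.running);
            return;
        }
        // Work already in flight: abort it via its own abort source and
        // remember only the latest request; any earlier pending request
        // is superseded.
        st.running->abort.request_stop();
        st.requested = std::move(info);
    }

    // Called once the running work has really finished (not merely been
    // asked to abort).
    void on_work_done(ntp_state& st) {
        st.running.reset();
        if (st.requested) {
            st.running = std::move(*st.requested);
            st.requested.reset();
            start(*st.running);
        }
    }

    // Shutdown path: trigger the main abort source in addition to the
    // individual one of whatever is currently running.
    void stop(ntp_state& st) {
        main_abort.request_stop();
        if (st.running) {
            st.running->abort.request_stop();
        }
    }

private:
    void start(work_info& info) {
        // Launch the asynchronous work; it is expected to observe both
        // main_abort.get_token() and info.abort.get_token().
        (void)info;
    }

    std::stop_source main_abort; // main abort source shared by all works
};
```

In this shape, a requested work either starts immediately (nothing running) or is parked in `requested` and started only after the running one has genuinely finished, so at most one work per NTP is ever in flight and its work info stays alive for its whole lifetime.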

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none


vbotbuildovich commented Jul 28, 2025

CI test results

test results on build#69755
  • EndToEndCloudTopicsTxTest test_write (integration) FLAKY, 20/21 passed: upstream reliability is 94.84066767830045, current run reliability is 95.23809523809523, drift is -0.39743 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69755#0198516c-59d7-4bbc-9397-756af75d70eb

test results on build#69827
  • FeaturesMultiNodeTest test_license_upload_and_query (integration) FLAKY, 16/21 passed: upstream reliability is 95.67567567567568, current run reliability is 76.19047619047619, drift is 19.4852 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8aec-4d81-98fe-560b0d76c8ef
  • RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} (integration) FLAKY, 20/21 passed: upstream reliability is 100.0, current run reliability is 95.23809523809523, drift is 4.7619 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8ae8-4feb-9bba-6ad374e9610a

test results on build#69949
  • DataMigrationsApiTest test_creating_and_listing_migrations (integration) FLAKY, 19/21 passed: upstream reliability is 100.0, current run reliability is 90.47619047619048, drift is 9.52381 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69949#01985bb5-af59-49c4-97ad-54e561968789
  • DataMigrationsApiTest test_higher_level_migration_api (integration) FLAKY, 16/21 passed: upstream reliability is 100.0, current run reliability is 76.19047619047619, drift is 23.80952 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad2f-44f5-a757-8a7637903b2d
  • RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": false, "with_iceberg": false} (integration) FLAKY, 19/21 passed: upstream reliability is 98.7551867219917, current run reliability is 90.47619047619048, drift is 8.279 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad36-4f21-975e-d01b47933cf8

@dotnwat dotnwat left a comment
Is this fixing a bug we can reference to justify the backport, or is there more context around the motivation for backporting this?

@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from 03bd809 to ee1c027 on July 29, 2025 08:41
@bashtanov

@dotnwat just the bug. It invokes UB, and we're lucky (or ignorant) that it didn't result in anything serious. Should I add a release note line about it?

@mmaslankaprv

@bashtanov can you add a bit more detailed commit message for the second commit in this PR? Please provide the motivation for the changes and describe the idea behind the new work tracking logic.

Previously the work tracking logic was flawed. When a work was already running
for an NTP and the backend requested another work (e.g. for a different stage
or for a new migration), the request was handled inappropriately:
1) the existing running work was only aborted by triggering the abort source,
but was not waited on to actually complete;
2) the work info, which is supplementary data a work uses, was overwritten
by the new one; the old work, which was still running, might access
deallocated or reused memory where the old work info used to be;
3) per-work abort sources were not in use, only the main one was

Reorganised logic:
1) allow no more than one running work per NTP;
2) store its belongings separately from those of the requested work, if they
are different;
3) use both the main and the individual abort sources
@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from ee1c027 to 3acb6f0 on July 30, 2025 13:22
@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from 3acb6f0 to e272f5f on July 30, 2025 13:54

dotnwat commented Jul 30, 2025

It invokes UB,

Can you expand on this and why a refactor is needed as opposed to backporting a fix for UB and refactoring in upstream?

@bashtanov

@dotnwat I've updated the main commit message and the PR description accordingly. I cannot think of a way to eliminate the UB without major changes to the logic. We do need to store up to two "work info" objects per NTP, and the rest of the change is about juggling them correctly.


dotnwat commented Jul 31, 2025

I'm probably vastly simplifying things, but

  2. the work info, which is supplementary data a work uses, was overwritten
    by the new one; the old work, which was still running, might access
    deallocated or reused memory where the old work info used to be;

Sounds like it's just a matter of protecting a shared data structure?

@bashtanov

Invalid memory access is not the only problem that needs fixing. Allowing concurrent works on the same NTP is also wrong, because they can conflict or run in the wrong order. E.g. imagine a migration progressing up to some point and then being cancelled. Quite possibly the operations on the affected partitions will be opposites, e.g. blocking writes when preparing to migrate away and then unblocking them to cancel the migration.

Sounds like it's just a matter of protecting a shared data structure?

Well, it is not shared; we need to store both separately, as one is needed for the running task and the other for the requested one. We need them both physically on the shard, so we either need to alter the ntp_state structure to accommodate them (which is what I did) or pass a shared pointer or a value copy to do_work (which would be a major change too).

To protect against concurrent execution we would need a mutex. It would introduce an implicit queue of works waiting for it, while in reality we need simpler logic, as only the last one in the queue matters.

All in all, my attempts to make fewer changes resulted in logic that was still quite broken, or at least in much less confidence in its correctness.
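
To illustrate that last point (a purely hypothetical sketch with simplified names, not the real code): with a mutex, every superseded request would still wait its turn and then run, whereas a single "requested" slot keeps only the newest one.

```cpp
// Illustrative only: a later request replaces an earlier one that has not
// started yet, instead of queueing behind it as it would with a mutex.
#include <optional>
#include <string>
#include <utility>

struct pending_work {
    std::string action; // e.g. "block_writes", "unblock_writes"
};

struct ntp_slot {
    std::optional<pending_work> running;   // what is executing right now
    std::optional<pending_work> requested; // only the newest superseding request
};

// With a mutex-based queue both an older "block_writes" and a newer
// "unblock_writes" would eventually run in order; here the earlier,
// not-yet-started request is simply dropped, because only the last one matters.
inline void supersede(ntp_slot& st, pending_work next) {
    st.requested = std::move(next);
}
```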
