Data migrations worker rework #27014


Open
bashtanov wants to merge 3 commits into dev from data-migrations-worker-rework

Conversation


@bashtanov bashtanov commented Jul 28, 2025

Previously the work tracking logic was flawed. When a work was already running
for an NTP and the backend requested another work (e.g. for a different stage
or for a new migration), the request was handled inappropriately:

  1. the existing running work was only aborted by triggering the abort source,
    but was not waited on to actually complete;
  2. the work info, which is supplementary data a work uses, was overwritten
    by the new one; the old work, which was still running, might access
    deallocated or reused memory where the old work info used to be;
  3. per-work abort sources were not in use, only the main one was

Reorganised logic (a minimal sketch follows the list):

  1. allow no more than one running work per NTP;
  2. store its belongings separately from those of the requested work, if they
    are different;
  3. use both the main and the individual abort sources
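
To make the reorganised tracking concrete, here is a minimal, self-contained C++ sketch of the idea. All names in it (work_info, ntp_state, worker) are illustrative rather than the actual Redpanda identifiers, and std::stop_source stands in for Seastar's abort_source; treat it as a sketch of the intended behaviour, not the implementation.

```cpp
// Minimal, self-contained sketch of per-NTP work tracking.
// Names are illustrative, not the real Redpanda identifiers;
// std::stop_source stands in for an abort source.
#include <cstdint>
#include <optional>
#include <stop_token>
#include <utility>

struct work_info {
    uint64_t migration_id;
    int stage;
    std::stop_source abort; // individual (per-work) abort source
};

struct ntp_state {
    // At most one work runs per NTP; its supplementary data lives here
    // until the work has actually completed, so it can never be freed
    // from under a still-running work.
    std::optional<work_info> running;
    // A different work requested while one is still running is parked
    // here instead of overwriting `running`.
    std::optional<work_info> requested;
};

class worker {
public:
    // The backend requests (possibly new) work for this NTP.
    void request_work(ntp_state& st, work_info info) {
        if (!st.running) {
            st.running = std::move(info);
            start(*st.running);
            return;
        }
        // Work already in flight: abort it via its own abort source and
        // remember only the latest request; any earlier pending request
        // is superseded.
        st.running->abort.request_stop();
        st.requested = std::move(info);
    }

    // Called once the running work has really finished (not merely been
    // asked to abort).
    void on_work_done(ntp_state& st) {
        st.running.reset();
        if (st.requested) {
            st.running = std::move(*st.requested);
            st.requested.reset();
            start(*st.running);
        }
    }

    // Shutdown path: trigger the main abort source in addition to the
    // individual one of whatever is currently running.
    void stop(ntp_state& st) {
        main_abort.request_stop();
        if (st.running) {
            st.running->abort.request_stop();
        }
    }

private:
    void start(work_info& info) {
        // Launch the asynchronous work; it is expected to observe both
        // main_abort.get_token() and info.abort.get_token().
        (void)info;
    }

    std::stop_source main_abort; // main abort source shared by all works
};
```

In this shape, a requested work either starts immediately (nothing running) or is parked in `requested` and started only after the running one has genuinely finished, so at most one work per NTP is ever in flight and its work info stays alive for its whole lifetime.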

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.2.x
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

  • none


vbotbuildovich commented Jul 28, 2025

CI test results

test results on build#69755
  • EndToEndCloudTopicsTxTest test_write (integration) FLAKY, 20/21 passed: upstream reliability is 94.84066767830045, current run reliability is 95.23809523809523, drift is -0.39743 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69755#0198516c-59d7-4bbc-9397-756af75d70eb

test results on build#69827
  • FeaturesMultiNodeTest test_license_upload_and_query (integration) FLAKY, 16/21 passed: upstream reliability is 95.67567567567568, current run reliability is 76.19047619047619, drift is 19.4852 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8aec-4d81-98fe-560b0d76c8ef
  • RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": true, "with_iceberg": false} (integration) FLAKY, 20/21 passed: upstream reliability is 100.0, current run reliability is 95.23809523809523, drift is 4.7619 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69827#01985586-8ae8-4feb-9bba-6ad374e9610a

test results on build#69949
  • DataMigrationsApiTest test_creating_and_listing_migrations (integration) FLAKY, 19/21 passed: upstream reliability is 100.0, current run reliability is 90.47619047619048, drift is 9.52381 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69949#01985bb5-af59-49c4-97ad-54e561968789
  • DataMigrationsApiTest test_higher_level_migration_api (integration) FLAKY, 16/21 passed: upstream reliability is 100.0, current run reliability is 76.19047619047619, drift is 23.80952 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad2f-44f5-a757-8a7637903b2d
  • RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 2, "compaction_mode": "adjacent_merge", "enable_failures": false, "mixed_versions": false, "with_iceberg": false} (integration) FLAKY, 19/21 passed: upstream reliability is 98.7551867219917, current run reliability is 90.47619047619048, drift is 8.279 and the allowed drift is set to 50. The test should PASS. https://buildkite.com/redpanda/redpanda/builds/69949#01985bb3-ad36-4f21-975e-d01b47933cf8

@dotnwat dotnwat left a comment
Is this fixing a bug we can reference to justify the backport, or is there more context around the motivation for backporting this?

@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from 03bd809 to ee1c027 on July 29, 2025 08:41
@bashtanov

@dotnwat just the bug. It invokes UB, and we're lucky (or ignorant) that it didn't result in anything serious. Should I add a release note line about it?

@mmaslankaprv

@bashtanov can you add a bit more detailed commit message for the second commit in this PR? Please provide the motivation for the changes and describe the idea behind the new work tracking logic.

Previously the work tracking logic was flawed. When a work was already running
for an NTP and the backend requested another work (e.g. for a different stage
or for a new migration), the request was handled inappropriately:
1) the existing running work was only aborted by triggering the abort source,
but was not waited on to actually complete;
2) the work info, which is supplementary data a work uses, was overwritten
by the new one; the old work, which was still running, might access
deallocated or reused memory where the old work info used to be;
3) per-work abort sources were not in use, only the main one was

Reorganised logic:
1) allow no more than one running work per NTP;
2) store its belongings separately from those of the requested work, if they
are different;
3) use both the main and the individual abort sources
@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from ee1c027 to 3acb6f0 on July 30, 2025 13:22
@bashtanov bashtanov force-pushed the data-migrations-worker-rework branch from 3acb6f0 to e272f5f on July 30, 2025 13:54

dotnwat commented Jul 30, 2025

It invokes UB,

Can you expand on this and why a refactor is needed as opposed to backporting a fix for UB and refactoring in upstream?

@bashtanov

@dotnwat I've updated the main commit message and the PR description accordingly. I cannot think of a way to eliminate the UB without major changes to the logic. We do need to store up to two "work info" objects per NTP, and the rest of the change is about juggling them correctly.


dotnwat commented Jul 31, 2025

I'm probably vastly simplifying things, but

  2. the work info, which is supplementary data a work uses, was overwritten
    by the new one; the old work, which was still running, might access
    deallocated or reused memory where the old work info used to be;

Sounds like it's just a matter of protecting a shared data structure?

@bashtanov

Invalid memory access is not the only problem that needs fixing. Allowing concurrent works on the same NTP is also wrong, because they can conflict or run in the wrong order. E.g. imagine a migration progressing up to some point and then being cancelled. Quite possibly the operations on the affected partitions will be opposites, e.g. blocking writes when preparing to migrate away and then unblocking them to cancel the migration.

Sounds like it's just a matter of protecting a shared data structure?

Well, it is not shared; we need to store both separately, as one is needed for the running task and the other for the requested one. We need them both physically on the shard, so we either need to alter the ntp_state structure to accommodate them (which is what I did) or pass a shared pointer or a value copy to do_work (which would be a major change too).

To protect against concurrent execution we would need a mutex. It would introduce an implicit queue of works waiting for it, while in reality we need simpler logic, as only the last one in the queue matters.

All in all, my attempts to make fewer changes resulted in logic that was still quite broken, or at least in much less confidence in its correctness.
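
To illustrate that last point (a purely hypothetical sketch with simplified names, not the real code): with a mutex, every superseded request would still wait its turn and then run, whereas a single "requested" slot keeps only the newest one.

```cpp
// Illustrative only: a later request replaces an earlier one that has not
// started yet, instead of queueing behind it as it would with a mutex.
#include <optional>
#include <string>
#include <utility>

struct pending_work {
    std::string action; // e.g. "block_writes", "unblock_writes"
};

struct ntp_slot {
    std::optional<pending_work> running;   // what is executing right now
    std::optional<pending_work> requested; // only the newest superseding request
};

// With a mutex-based queue both an older "block_writes" and a newer
// "unblock_writes" would eventually run in order; here the earlier,
// not-yet-started request is simply dropped, because only the last one matters.
inline void supersede(ntp_slot& st, pending_work next) {
    st.requested = std::move(next);
}
```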
