Skip to content

[v25.2.x] storage: pass abort_source to log::start() #27069

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

vbotbuildovich
Copy link
Collaborator

Backport of PR #27050

`_log->start()` is called from `consensus::do_start()`. Within `log::start()`
is a call to `offset_translator().sync_with_log()`, which also passes an
`abort_source`. This synchronization function may be a somewhat heavy
operation, as a `log_reader` over the range `(_highest_known_offset, dirty_offset]`
is used to read the `log`.

Previously, the `abort_source` passed to `sync_with_log()` was the
`disk_log_impl::_compaction_as`. However, this isn't very helpful because
this `abort_source` only has `request_abort()` called upon it within
`disk_log_impl::remove()` and `disk_log_impl::close()`- both of which
are called at the _end_ of the typical high-level `partition` removal process
starting from the `partition_manager`.

What this can lead to is the following:

1. `redpanda` is launched. A relatively heavy `partition` with a lagging
   `_highest_known_offset` enters the `sync_with_log()` operation.
2. The user hits ^C to kill `redpanda`.
3. The high level shutdown process is entered from `application.cc`.
4. Within `partition_manager::do_shutdown(partition)`, we first wait
   for the `_raft_manager` to shutdown the `partition->raft()` pointer
   before proceeding to stop the `partition` and then its underlying `log`.
5. In reality, the `raft` pointer hasn't even finished its starting process
   yet, because the `sync_with_log()` process is taking so long!
6. The user waits for `sync_with_log()` to complete, and only then does
   shutdown of the `raft` object finish (as the `_compaction_as` only requests
   abort when the `log` is closed).
7. `log` is closed, `_compaction_as.request_abort()` is called.

Clearly, passing `_compaction_as` to `sync_with_log()` is essentially a no-op,
since it is guaranteed that `sync_with_log()` _has_ to complete before an abort
of `_compaction_as` can be requested.

To make the shutdown process smoother and avoid this sub-optimal behavior, pass
the `consensus::_as` as an argument to `_log->start()` instead. Then, when
an abort is requested of `redpanda` early on, the `sync_with_log()` process
will be correctly terminated early instead of potentially appearing as a long
winded hang.

(cherry picked from commit 9a5c819)
@vbotbuildovich vbotbuildovich added this to the v25.2.x-next milestone Jul 30, 2025
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Jul 30, 2025
@WillemKauf WillemKauf enabled auto-merge July 30, 2025 21:01
@vbotbuildovich
Copy link
Collaborator Author

CI test results

test results on build#69971
test_class test_method test_arguments test_kind job_url test_status passed reason
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} integration https://buildkite.com/redpanda/redpanda/builds/69971#01985d56-c219-450b-86bf-d102901d5a4b FLAKY 19/21 upstream reliability is '100.0'. current run reliability is '90.47619047619048'. drift is 9.52381 and the allowed drift is set to 50. The test should PASS
RandomNodeOperationsTest test_node_operations {"cloud_storage_type": 1, "compaction_mode": "chunked_sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": true} integration https://buildkite.com/redpanda/redpanda/builds/69971#01985d56-c216-4474-b296-5e1bb643b8b7 FLAKY 17/21 upstream reliability is '95.74468085106383'. current run reliability is '80.95238095238095'. drift is 14.7923 and the allowed drift is set to 50. The test should PASS

@WillemKauf WillemKauf merged commit bcd1b83 into redpanda-data:v25.2.x Jul 31, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/redpanda kind/backport PRs targeting a stable branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants