[AHM/Staking/VMP] Paginated Offence Reports + Retries for Validator Set #9619
Conversation
log::error!(target: "runtime::staking-async::rc-client", "📨 Failed to split message {}: {:?}", message_type_name, e);
})?;

match with_transaction_opaque_err(|| {
@franciscoaguirre @bkontur I don't expect us to need this mechanism anymore, but as far as I can tell this code should still be correct. Can you confirm?
Haven't gone through the whole thing yet.
/// * too low a value is assigned to [`Config::MaximumValidatorsWithPoints`]
/// * Those who are calling into our `RewardsReporter` likely have a bad view of the
///   validator set, and are spamming us.
ValidatorPointDropped,
Might be better to send the message in chunks instead of dropping points, or? If this issue can really happen (points being set on non-validator accounts), then in theory all real validator points could be dropped while only the spammed ones are sent.
About splitting, I learned a few new things:
- All of our messages are small enough to fit within the single-message limit.
- If they don't fit within the limits of the entire queue, I decided not to send them and to wait until the queue can accept them. Otherwise, we would have to buffer parts of a message that was not sent and deal with that, which I suspect opens more cans of worms than it helps (a minimal sketch of this behaviour follows below).
- Note that we still have machinery to both split and combine messages, but for now I have removed it from the main code path.

(PTAL at this as well for more context)
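To make the "send the whole message or wait" behaviour above concrete, here is a minimal, self-contained Rust sketch. It is not the pallet's code; `Queue`, `send_or_buffer`, and the byte-based capacity check are hypothetical stand-ins for the real queue limits.

```rust
// Hypothetical sketch (not the pallet code): a sender that only dispatches a
// message when the outbound queue has room for the whole payload; otherwise it
// keeps the message buffered so it can be retried on a later block.
struct Queue {
    capacity_bytes: usize,
    used_bytes: usize,
}

impl Queue {
    fn can_accept(&self, len: usize) -> bool {
        self.used_bytes + len <= self.capacity_bytes
    }
    fn push(&mut self, len: usize) {
        self.used_bytes += len;
    }
}

/// Either the message was sent, or it is handed back to be retried later.
fn send_or_buffer(queue: &mut Queue, message: Vec<u8>) -> Result<(), Vec<u8>> {
    if queue.can_accept(message.len()) {
        queue.push(message.len());
        Ok(())
    } else {
        // Not enough room for the *whole* message: do not split it, just retry later.
        Err(message)
    }
}

fn main() {
    let mut queue = Queue { capacity_bytes: 16, used_bytes: 12 };
    let msg = vec![0u8; 8];
    match send_or_buffer(&mut queue, msg) {
        Ok(()) => println!("sent"),
        Err(m) => println!("queue full, buffered {} bytes for next block", m.len()),
    }
}
```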
Co-authored-by: Ankan <[email protected]>
Still have to review the whole PR, but since I noticed it... nit and out of scope: @kianenigma, since you are changing papi-test's cmd.ts, please replace
OutgoingValidatorSet::<T>::put((report, new_retries_left))
} else {
    Self::deposit_event(Event::<T>::Unexpected(
        UnexpectedKind::ValidatorSetDropped,
If this happens, we need to either kick off a new election (or decrement current era), right? Otherwise we will not do a re-election again?
Correct, we will get stuck, and for now, in favor of time, I assume we will do the recovery manually with openGov.
- What would be the long-term solution? Chunking validator sets and sending a chunk only if it fits in the message limits?
- Assuming we recover manually with openGov for the time being, would it still make sense to save to storage the validator set we failed to send, instead of completely dropping it, in something like `DroppedValidatorSet` or similar, and add a recovery extrinsic for openGov?
OK, dropping 1.: as per #9619 (comment), we will eventually adopt chunking-over-blocks in the future.
Keeping 2. around to 👂 what you think about it (rough sketch below).
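A rough sketch of suggestion 2 above, under the assumption that we only need to stash the payload and let governance re-queue it; the `DroppedValidatorSet`-style storage, `Report`, and `governance_resend` are hypothetical names, not the pallet's API.

```rust
// Illustrative sketch of suggestion 2 (not code from this PR): instead of
// dropping the validator set when retries run out, stash it so a governance
// call can re-queue it later.
#[derive(Clone, Debug)]
struct Report {
    validators: Vec<u64>,
}

#[derive(Default)]
struct Pallet {
    outgoing: Option<(Report, u32)>, // (report, retries_left)
    dropped: Option<Report>,         // would-be `DroppedValidatorSet` storage
}

impl Pallet {
    fn on_retries_exhausted(&mut self, report: Report) {
        // Keep the payload around instead of discarding it.
        self.dropped = Some(report);
    }

    /// Hypothetical governance-only extrinsic: put the stashed set back into
    /// the outgoing slot with a fresh retry budget.
    fn governance_resend(&mut self, retries: u32) -> Result<(), &'static str> {
        let report = self.dropped.take().ok_or("nothing to recover")?;
        self.outgoing = Some((report, retries));
        Ok(())
    }
}

fn main() {
    let mut pallet = Pallet::default();
    pallet.on_retries_exhausted(Report { validators: vec![1, 2, 3] });
    pallet.governance_resend(8).unwrap();
    assert!(pallet.outgoing.is_some());
    println!("validator set re-queued: {:?}", pallet.outgoing);
}
```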
Self::deposit_event(Event::<T>::Unexpected(
    UnexpectedKind::ValidatorSetSendFailed,
));
if let Some(new_retries_left) = retries_left.checked_sub(One::one()) {
When would this fail? Do we need some backoff strategy as well?
The size of a single message can never cause this to fail; only an attacker filling the UMP queue can. The retry mechanism, combined with the exponential fee increase in the UMP, is meant to be the defence.
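For reference, a standalone sketch of the retry control flow being discussed here; the real implementation lives in storage items and events, this only models the decrement-or-drop decision.

```rust
// Standalone model of the retry pattern: the send only fails when the outbound
// queue is saturated, so each failure burns one retry.
fn try_send(queue_full: bool) -> Result<(), ()> {
    if queue_full { Err(()) } else { Ok(()) }
}

/// Returns the state to persist for the next block: `Some((payload, retries))`
/// if we should try again, `None` if the payload was sent or finally dropped.
fn handle_send(payload: Vec<u8>, retries_left: u32, queue_full: bool) -> Option<(Vec<u8>, u32)> {
    match try_send(queue_full) {
        Ok(()) => None, // sent, nothing to keep
        Err(()) => match retries_left.checked_sub(1) {
            Some(new_retries) => Some((payload, new_retries)), // keep and retry later
            None => {
                // retries exhausted: emit an `Unexpected`-style event and drop
                println!("dropping payload after exhausting retries");
                None
            }
        },
    }
}

fn main() {
    assert_eq!(handle_send(vec![1], 3, true), Some((vec![1], 2))); // retry
    assert_eq!(handle_send(vec![1], 0, true), None);               // dropped
    assert_eq!(handle_send(vec![1], 3, false), None);              // sent
}
```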
Symbolic approve. Critical things that I think still need to be addressed:
- Retry mechanism for the session report (RC to AH).
- Ensuring validator election can resume normally if the validator set is dropped on AH (while sending to RC).
weight = weight.saturating_add(processing_weight);
// then, take a page from our send queue, and if present, send it.
weight.saturating_accrue(T::DbWeight::get().reads(2));
OffenceSendQueue::<T>::get_and_maybe_delete(|page| {
Since `OffenceSendQueue` is unbounded (only the page size is bounded, via `MaxOffenceBatchSize`), it can grow infinitely. Aren't we at risk of increasing storage indefinitely in an attack scenario and/or under network congestion? I understand we can't drop offences, so we would need a mechanism (out of scope of this PR) to somehow slow down offence reporting in the consensus code to be on the safe side, or did I get it wrong?
Maybe on the staking side, we should add some kind of monitoring/event when some threshold is exceeded.
Your intuition is right: this is a band-aid and a useful mechanism to begin with, but eventually we need some way to deduplicate offences, or lower their quantity somehow, before this code path is reached.
This band-aid just removes the risk of an attacker being able to cause offences to drop for cheap.
For some mad scenario where a bug is causing perpetual offences for everyone forever, it won't help, of course.
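A minimal sketch of the monitoring idea suggested above, assuming a simple page-count threshold; `OffenceQueue`, `warn_threshold`, and the event name are hypothetical, not part of this PR.

```rust
// Emit a warning-style event once the offence send queue grows past a
// threshold, so operators notice congestion before storage grows too large.
struct OffenceQueue {
    pages: Vec<Vec<u8>>,
    warn_threshold: usize,
}

enum Event {
    OffenceQueueAboveThreshold { pages: usize },
}

impl OffenceQueue {
    fn enqueue(&mut self, page: Vec<u8>) -> Option<Event> {
        self.pages.push(page);
        // Only signal once the backlog exceeds the configured threshold.
        (self.pages.len() > self.warn_threshold)
            .then(|| Event::OffenceQueueAboveThreshold { pages: self.pages.len() })
    }
}

fn main() {
    let mut queue = OffenceQueue { pages: Vec::new(), warn_threshold: 2 };
    for i in 0..4u8 {
        if let Some(Event::OffenceQueueAboveThreshold { pages }) = queue.enqueue(vec![i]) {
            println!("warning: offence queue has {pages} pages pending");
        }
    }
}
```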
return weight;
// if we have any pending session reports, send it.
weight.saturating_accrue(T::DbWeight::get().reads(1));
if let Some((session_report, retries_left)) = OutgoingSessionReport::<T>::take() {
We are always prioritizing the session report over the offence report in terms of block weight, which is probably fine since it happens much more rarely. Ideally we should support chunking for session reports and somehow split the weight budget between the session report (if present) and offences (if present) here, to prevent offences from starving.
The weight is not divided. Everything in `on_initialize` is mandatory¹ and will execute. It is just that since we enqueue the session report first, we are giving it more priority in consuming any of the DMP resource limits (more info in the doc).
- But, since the session report gets a finite number of retries
- and offences get basically infinite retries
- and, as you said, the session report is systematically a rare event

I think it is the better choice to prioritize it.
Footnotes
1. A bit of an archaic topic in Substrate, and probably not well understood or documented except in PBA. Check out `DispatchClass`, and all the places where we use `DispatchClass::Mandatory`. These are all mandatory hooks that MUST always happen, and are therefore also kinda dangerous. Another improvement I want to do for Polkadot AHM is to move as much as we can to `on_poll`, which is exactly the same, but is weight-aware and skipped if the block is full. ↩
Definitely a strong +1 for migrating to `on_poll()` when the time is right (and thanks for the extra explanation around `DispatchClass::Mandatory` and `on_initialize` vs `on_poll`, TIL 🙇).
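A toy model of the weight-aware hook idea, using a hand-rolled meter rather than FRAME's actual types, just to show how the "skip if the block is full" behaviour differs from a mandatory hook.

```rust
// Minimal model of a weight-aware poll: check remaining block weight before
// doing the work, and simply skip when the block is already full.
struct WeightMeter {
    remaining: u64,
}

impl WeightMeter {
    fn try_consume(&mut self, w: u64) -> bool {
        if self.remaining >= w {
            self.remaining -= w;
            true
        } else {
            false
        }
    }
}

fn poll(meter: &mut WeightMeter) {
    const SEND_REPORT_WEIGHT: u64 = 50; // illustrative cost of sending a report
    if meter.try_consume(SEND_REPORT_WEIGHT) {
        println!("block has room: sending queued report");
    } else {
        println!("block full: skipping, will try again next block");
    }
}

fn main() {
    poll(&mut WeightMeter { remaining: 200 }); // does the work
    poll(&mut WeightMeter { remaining: 10 });  // skips
}
```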
// if we have any pending session reports, send it.
weight.saturating_accrue(T::DbWeight::get().reads(1));
if let Some((session_report, retries_left)) = OutgoingSessionReport::<T>::take() {
    match T::SendToAssetHub::relay_session_report(session_report.clone()) {
Would it make sense to implement, both here and on rc-client, a sort of size check against the max UMP/DMP message size? E.g. from the risk document we know that for Kusama/Polkadot we have a UMP message limit of 64kb and DMP of 50kb. We could check before sending and avoid sending if we are close to the risk threshold. It would play well in the future once/if we introduce chunking.
Oh great point, let me explain:
I envisioned the need for this actually quite early, and this is why `SessionReport` and `ValidatorSet` both had a `leftover: bool` field from the get-go. There is also an `XCMSender::split_then_send` that mostly does what you suggest. Moreover, #8409 came to make sure such errors are reported upwards.
But, now that I have revisited the topic, I have more info:
- We know that the size of all of our messages is small enough to fit the single-message size limit. Only a mistaken change in the configuration involved might break this. See `mod message_queue_sizes`.
- So the real limit we have is the whole-queue limits. For that, I realized that splitting a message and sending it all at once won't actually help. We have to chunk the message and send it over blocks.
- This is why I no longer use `XCMSender::split_then_send`, and instead only use a singleton `XCMSender::send`.
- I have also changed `split_then_send` such that it makes sure all chunks can be sent, and if not, reverts all, in case it is used.

TLDR: single message size is not a problem. The whole queue is the issue, and chunk-and-send-all won't help with it. A chunk-and-send-over-many-blocks is better, but also more complicated than the approach here, which retries the whole message. This can be a future improvement.
Great explanation, thank you!
Nitpick: why not remove `split_then_send` instead of keeping it as deprecated?
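For illustration, a hedged sketch of the chunk-and-send-over-many-blocks direction mentioned in the TLDR above (a possible future improvement, not what this PR implements); the names and the `MAX_CHUNK_BYTES` constant are made up.

```rust
// Keep the encoded payload in storage and drain at most one size-bounded chunk
// per block, so a single oversized report never has to fit the queue at once.
const MAX_CHUNK_BYTES: usize = 8; // stand-in for the per-message size limit

struct PendingPayload {
    bytes: Vec<u8>,
}

impl PendingPayload {
    /// Called once per block: returns the next chunk to send, if any remains.
    fn next_chunk(&mut self) -> Option<Vec<u8>> {
        if self.bytes.is_empty() {
            return None;
        }
        let take = self.bytes.len().min(MAX_CHUNK_BYTES);
        Some(self.bytes.drain(..take).collect())
    }
}

fn main() {
    let mut pending = PendingPayload { bytes: (0u8..20).collect() };
    let mut block = 0;
    while let Some(chunk) = pending.next_chunk() {
        block += 1;
        println!("block {block}: sent chunk of {} bytes", chunk.len());
    }
}
```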
Great stuff! I would be in favor of having a better recovery mechanism for `ValidatorSet`. One relatively low-hanging fruit is suggested in the review comments (saving to storage something like `DeprecatedValidatorSet` + an ad-hoc extrinsic for governance). Maybe we could do even better, automatically.
Not a blocker though.
All GitHub workflows were cancelled due to the failure of one of the required jobs.
Created backport PR for
Please cherry-pick the changes locally and resolve any conflicts.
git fetch origin backport-9619-to-unstable2507
git worktree add --checkout .worktree/backport-9619-to-unstable2507 backport-9619-to-unstable2507
cd .worktree/backport-9619-to-unstable2507
git reset --hard HEAD^
git cherry-pick -x f7b0396e3b7f826166cb5acc4a4248307af6d708
git push --force-with-lease
…et (#9619)
* Please see the full design doc [here](https://docs.google.com/document/d/1l2COWct1f-gC8nM0tq7Xs8pBWeAP6pX0LociWC6enUg/edit?tab=t.0)
* closes https://github.com/paritytech-secops/srlabs_findings/issues/520

This PR makes the following changes:

#### Common
* `SendToRelayChain` and `SendToAssetHub` traits now return a result, allowing the caller to know if the underlying XCM was sent or not.
* Adds a number of testing facilities to `pallet-root-offences` and `staking-async/papi-tests`, both of which can be ignored in the review.

#### Offences
* `SendToAssetHub::relay_new_offence` is removed. Instead, we use the new `relay_new_offence_paged`, which is a vector of self-contained offences, not requiring us to group offences per session in each message.
* Offences are not sent immediately anymore.
  * Instead, they are stored in a paginated `OffenceSendQueue`.
  * `on-init`, we grab one page of this storage map and send it.

#### Session Report
* Session reports now also have a retry mechanism.
* Upon each failure, we emit an `UnexpectedEvent`.
* If our retries run out and we still can't send the session report, we emit a different `UnexpectedEvent`. We also restore the validator points that we meant to send and merge them back, so that they are sent in the next session report.

#### Validator Set
* Similar to offences, they are not sent immediately anymore.
* Instead, they are stored in a storage item, and are sent on subsequent on-inits.
* A maximum retry count is added.

### Review notes
As noted above, ignore all changes in
* `staking-async/runtimes`
* `staking-async/runtimes/papi-tests`
* `root-offences`
as they are only related to testing.

---------
Co-authored-by: Ankan <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
(cherry picked from commit f7b0396)
Successfully created backport PR for
Backport #9619 into `stable2509` from kianenigma. See the [documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md) on how to use this bot.
NOTE: this PR introduces **major** changes in the staking-async pallet, needed to address critical issues related to Staking vs VMP. They are needed to improve the robustness and resilience of the staking machinery (paginated offences, retry mechanism for session report and validator set); this is why we are backporting it.
Co-authored-by: Kian Paimani <[email protected]>
Co-authored-by: Ankan <[email protected]>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This PR makes the following changes:

Common
* `SendToRelayChain` and `SendToAssetHub` traits now return a result, allowing the caller to know if the underlying XCM was sent or not.
* Adds a number of testing facilities to `pallet-root-offences` and `staking-async/papi-tests`, both of which can be ignored in the review.

Offences
* `SendToAssetHub::relay_new_offence` is removed. Instead, we use the new `relay_new_offence_paged`, which is a vector of self-contained offences, not requiring us to group offences per session in each message.
* Offences are not sent immediately anymore.
* Instead, they are stored in a paginated `OffenceSendQueue`.
* `on-init`, we grab one page of this storage map and send it.

Session Report
* Session reports now also have a retry mechanism.
* Upon each failure, we emit an `UnexpectedEvent`.
* If our retries run out and we still can't send the session report, we emit a different `UnexpectedEvent`. We also restore the validator points that we meant to send and merge them back, so that they are sent in the next session report.

Validator Set
* Similar to offences, they are not sent immediately anymore.
* Instead, they are stored in a storage item, and are sent on subsequent on-inits.
* A maximum retry count is added.

Review notes
As noted above, ignore all changes in
* `staking-async/runtimes`
* `staking-async/runtimes/papi-tests`
* `root-offences`
as they are only related to testing.
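To illustrate the shape of the change described above (fallible send traits, so the caller can react to a failed XCM send), here is a small self-contained Rust model; `SendReport`, `FlakySender`, and `dispatch` are illustrative stand-ins, not the real `SendToAssetHub`/`SendToRelayChain` traits.

```rust
// Model of a fallible send interface: a failed send hands the payload back so
// the caller can queue it for a later block instead of silently losing it.
trait SendReport {
    fn send(&mut self, payload: Vec<u8>) -> Result<(), Vec<u8>>;
}

struct FlakySender {
    accept: bool,
}

impl SendReport for FlakySender {
    fn send(&mut self, payload: Vec<u8>) -> Result<(), Vec<u8>> {
        if self.accept { Ok(()) } else { Err(payload) }
    }
}

fn dispatch<S: SendReport>(sender: &mut S, payload: Vec<u8>, retry_queue: &mut Vec<Vec<u8>>) {
    if let Err(unsent) = sender.send(payload) {
        // The caller now *knows* the message did not go out and can retry later.
        retry_queue.push(unsent);
    }
}

fn main() {
    let mut queue = Vec::new();
    dispatch(&mut FlakySender { accept: false }, vec![1, 2, 3], &mut queue);
    assert_eq!(queue.len(), 1); // send failed, payload queued for retry
    dispatch(&mut FlakySender { accept: true }, queue.pop().unwrap(), &mut queue);
    assert!(queue.is_empty()); // retry succeeded
    println!("all reports delivered");
}
```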