Explainer: Why do nodes running 2.4 and 2.5 have such different initial proposals? #5612
ximinez announced in Announcements
Background
rippled version 2.5 contains a change titled "Improve transaction relay logic", which came about after analysis and consideration of issues found on Testnet in December 2023. At that time, testing identified many dropped / lost transactions, as well as consecutive empty ledgers when transactions were expected. Most of the problems were due to poor network connectivity causing bottlenecks that exposed flaws and deficiencies in some of the existing transaction relay logic. These flaws were largely masked in a well-connected network like Mainnet, but were flaws nonetheless. The "Improve transaction relay logic" change was therefore focused on poorly-connected environments, while also providing benefits during normal operations.

The flaws that were discovered can cause inconsistent transaction relaying, and could result in one, or possibly many, nodes wasting significant time processing transactions that just get dropped on the floor downstream by other nodes.
We generally don’t want to treat transactions as precious - for instance, we won’t hold on to any transaction indefinitely - but we still try to make sure that any valid transaction able to claim a fee does indeed claim a fee. The "Improve transaction relay logic" change was an attempt to better balance these two competing priorities by ensuring that any valid transaction that could claim a fee was more likely to be processed and relayed to peers correctly.
What changed?
The changes were divided into four mostly independent logical sets of modifications:
Ngets "stuck" on a version 2.4 node, which has also has processed and relayed transactionN+1, then nodes that are holding on toN+1waiting forNwill have long since droppedN+1by the timeNis relayed.Nand broadcast transactionN+1, this significantly increases the chances that peers will still haveN+1available whenNis broadcast.ter) result from the transaction engine when executing against the open ledger is added to a list to be held and retried later. 1ter,tel, ortefresult. This gives valid transactions more chances to succeed when they fail due to temporary server conditions. Additionally, to prevent a transaction from being repeatedly held and retried again indefinitely, it must meet at least one of these extra conditions:LastLedgerSequence, and theLastLedgerSequenceis fewer than 5 ledgers into the future.SF_HELD3 flag is not set on the transaction. It will be set after checking this condition, creating a hard limit to the number of times a transaction can be reattempted.These changes did not require an amendment, because they did not affect the consensus set or transaction processing. Their effects were limited to the server's decision of whether and when to relay / broadcast a transaction to peers, which is a decision that a server is free to make as it wishes.
How did versions 2.4 and 2.5 interact?
As UNL validators upgraded from version 2.4 to 2.5, operators observed that the different versions were producing significantly different initial proposals at the beginning of each consensus round, causing consensus to take longer. There may have been additional issues, such as some of them losing sync with the network, although we're not sure about those at this point.
Differing initial proposals are a normal property of consensus, but the effect was magnified by a much larger number of discrepancies than typically seen. This sometimes required validators to request large numbers of transactions from their peers, causing additional network load and processing load, and generally just slowing things down.
While we anticipated the differing transaction relaying behavior, we did not anticipate how the different versions would interact as the nodes on the network upgraded over time. In particular, we did not anticipate the disproportionate request load that would be imposed on nodes that were in a tiny minority. More on that later.
Example of 2.4 vs 2.5 behavior
Consider an example scenario from the perspectives of nodes running each rippled version: a version 2.4 node drops or delays relaying transactions that a version 2.5 node holds, retries, and relays, so by the start of a consensus round the two can have noticeably different transactions in their open ledgers.
How that causes a problem
Validators build their initial proposals from their open ledger. As described above, validators running version 2.5 could have many more transactions in their open ledger than those running version 2.4. Validators running version 2.4 will potentially discover many "missing" transactions from the proposals coming from validators running version 2.5.
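As a toy illustration (not rippled code), the initial proposal can be thought of as the set of transaction IDs in the validator's open ledger, so the transactions a 2.5 validator held and applied but a 2.4 validator dropped show up directly as differences between the two proposals. The names below (TxId, TxSet, missingFrom) are made up for this sketch:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Toy model: an initial proposal is effectively the set of transaction IDs
// in a validator's open ledger at the start of the consensus round.
using TxId = std::string;
using TxSet = std::set<TxId>;

// Transactions that appear in a peer validator's proposal but not in our own
// open ledger; these are the "missing" transactions we must fetch from peers.
TxSet missingFrom(TxSet const& theirProposal, TxSet const& ourOpenLedger)
{
    TxSet missing;
    std::set_difference(
        theirProposal.begin(), theirProposal.end(),
        ourOpenLedger.begin(), ourOpenLedger.end(),
        std::inserter(missing, missing.begin()));
    return missing;
}
```

In this toy model, a 2.4 validator comparing its open ledger against a 2.5 validator's proposal would see a large missingFrom result, while the comparison in the other direction would be small.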
When a UNL validator publicizes a transaction set in a proposal, every node on the network 7 that has not seen or built that transaction set is going to attempt to obtain it by requesting it from its peers. Then each of those nodes will request every transaction in the set that it doesn't already have. When the node obtains those transactions, it will cache, process, and relay them as usual. If a small minority of validators builds a transaction set containing transactions that have not widely propagated across the network, that minority is going to be busier than usual answering requests from peers for those transactions until they have propagated more widely. Remember that nodes running version 2.4 have not relayed transactions they see as faulty, so many other nodes will be seeing these transactions for the first time.
The problem is that an unpopular transaction set is hard to get, precisely because it is unpopular. With a popular proposal, a large portion of the validators generate that set, and many non-validators will have generated it, too. Thus fewer nodes ask for it, and the nodes that do ask for it have a good chance of getting it quickly from their peers. On the other hand, with an unpopular proposal, many more nodes will need to ask for it, and far fewer nodes will be able to provide it. That problem is compounded when the transactions in the transaction set are also unpopular. Additionally, all nodes on the network will try to get every transaction set that they see proposed. Those that don't have a given transaction set will repeatedly try to get the set until they succeed or eventually give up 8, because it is possible for an unpopular position to become popular as validators change their votes.
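A rough sketch of the acquisition behavior described in the last two paragraphs, with made-up callback names standing in for peer I/O (this is not the real rippled acquisition code): a node keeps asking peers for a proposed set and for the transactions in it that it is missing, and only stops once it has everything or the set is no longer being proposed. With an unpopular set, the fetches fail more often, so more rounds of requests, and more load on the few nodes that have the data, are needed.

```cpp
#include <functional>
#include <set>
#include <string>

using TxId = std::string;
using TxSetId = std::string;

// Hypothetical callbacks standing in for peer I/O; not real rippled APIs.
struct Peers
{
    // Ask peers for the list of transaction IDs in a proposed set.
    std::function<bool(TxSetId const&, std::set<TxId>&)> fetchSetContents;
    // Ask peers for a single transaction we do not have locally.
    std::function<bool(TxId const&)> fetchTransaction;
};

// Keep trying to acquire a proposed transaction set until every transaction
// in it is held locally, or until validators stop proposing the set.
bool acquireProposedSet(
    TxSetId const& setId,
    std::set<TxId>& localTxs,
    Peers& peers,
    std::function<bool()> stillProposed)
{
    while (stillProposed())
    {
        std::set<TxId> contents;
        if (!peers.fetchSetContents(setId, contents))
            continue;  // no peer had the set yet; try again next round

        bool complete = true;
        for (auto const& id : contents)
        {
            if (localTxs.count(id))
                continue;  // already cached or relayed earlier
            if (peers.fetchTransaction(id))
                localTxs.insert(id);  // cache it; it would also be processed and relayed
            else
                complete = false;     // retry the missing transaction next round
        }
        if (complete)
            return true;
    }
    return false;  // validators stopped proposing the set; give up
}
```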
This can, under sub-optimal circumstances, lead to a cascade effect where a large majority of nodes are continuously trying to get information that only a small minority of nodes has available.
Thus we end up with a transitory problem that dissipates as more nodes upgrade. In retrospect, this change could have been gated by an amendment or other mechanism, not because of the usual ledger consistency reasons, but to allow the network to synchronize when to switch from the old behavior to the new.
Footnotes
1. Transactions are not held if the fail_hard parameter is provided, but for simplicity I will ignore that option in this document. ↩
2. A locally submitted transaction is only treated as local on the initial submission. Subsequent retries treat it like any other transaction. ↩
3. SF_HELD is a new flag used by HashRouter, which globally tracks the state of transactions and other objects. If a transaction does not make it into the open ledger, but is retryable, doesn't meet any of the other criteria to be held, and does not have SF_HELD set, the transaction will be held, and SF_HELD will be set on it. Once this flag is set for a transaction, it won't be unset, and the next attempt to put it into the open ledger will likely be its last chance. ↩
4. A transaction can see a node as a bottleneck due to poor network configuration, such as if the node is the sole connection between two "islands" of other nodes. A transaction will also see the node that it was submitted to as a bottleneck, since the only way for other nodes to see that transaction is for it to be relayed by the submission node or submitted to other nodes. ↩
5. It's also possible for the user to submit the transactions in order, but due to network delays and other factors, they arrive at another node out of order. ↩
6. Or finally arrives at the receiving node via the peer network. ↩
7. Every node on the network attempts to keep up with consensus, not just validators. This includes processing proposals, requesting transaction sets, etc. This saves time and resources in the long run because it increases the number of nodes that will have already built the validated ledger when the validations come in. Nodes that have not built the validated ledger need to request the missing ledger data from peers, which is a more expensive process. ↩
8. However, nodes only give up if UNL validators stop proposing the set. A bug related to this was fixed in version 2.4, and was the ultimate root cause of the network halt in February 2025. The bug was that nodes which had given up would never try to get the transaction set again, even as validators continued to propose it. ↩