Explainer: Why do nodes running 2.4 and 2.5 have such different initial proposals? #5612
ximinez announced in Announcements
Background
rippled version 2.5 contains a change titled "Improve transaction relay logic", which came about after analysis and consideration of issues found on Testnet in December 2023. At that time, testing identified many dropped / lost transactions, as well as consecutive empty ledgers when transactions were expected. Most of the problems were due to poor network connectivity causing bottlenecks that exposed flaws and deficiencies in some of the existing transaction relay logic. These flaws were largely masked in a well-connected network like Mainnet, but were flaws nonetheless. The "Improve transaction relay logic" change was therefore focused on poorly-connected environments, while also providing benefits during normal operations.

The flaws that were discovered can cause inconsistent transaction relaying, and could result in one, or possibly many, nodes wasting significant time processing transactions that just get dropped on the floor downstream by other nodes.
We generally don’t want to treat transactions as precious - for instance, we won’t hold on to any transaction indefinitely - but we still try to make sure that any valid transaction able to claim a fee does indeed claim a fee. The "Improve transaction relay logic" change was an attempt to better balance these two competing priorities by ensuring that any valid transaction that could claim a fee was more likely to be processed and relayed to peers correctly.
What changed?
The changes were divided into four mostly independent logical sets of modifications:
Ngets "stuck" on a version 2.4 node, which has also has processed and relayed transactionN+1, then nodes that are holding on toN+1waiting forNwill have long since droppedN+1by the timeNis relayed.Nand broadcast transactionN+1, this significantly increases the chances that peers will still haveN+1available whenNis broadcast.ter) result from the transaction engine when executing against the open ledger is added to a list to be held and retried later. 1ter,tel, ortefresult. This gives valid transactions more chances to succeed when they fail due to temporary server conditions. Additionally, to prevent a transaction from being repeatedly held and retried again indefinitely, it must meet at least one of these extra conditions:LastLedgerSequence, and theLastLedgerSequenceis fewer than 5 ledgers into the future.SF_HELD3 flag is not set on the transaction. It will be set after checking this condition, creating a hard limit to the number of times a transaction can be reattempted.These changes did not require an amendment, because they did not affect the consensus set or transaction processing. Their effects were limited to the server's decision of whether and when to relay / broadcast a transaction to peers, which is a decision that a server is free to make as it wishes.
How did versions 2.4 and 2.5 interact?
As UNL validators upgraded from version 2.4 to 2.5, operators observed that the different versions were producing significantly different initial proposals at the beginning of each consensus round, causing consensus to take longer. There may have been additional issues, such as some of them losing sync with the network, although we're not sure about those at this point.
Differing initial proposals are a normal property of consensus, but the effect was magnified by a much larger number of discrepancies than typically seen. This sometimes required validators to request large numbers of transactions from their peers, causing additional network load and processing load, and generally just slowing things down.
While we anticipated the differing transaction relaying behavior, we did not anticipate how the different versions would interact as the nodes on the network upgraded over time. In particular, we did not anticipate the disproportionate request load that would be imposed on nodes that were in a tiny minority. More on that later.
Example of 2.4 vs 2.5 behavior
Consider an example scenario from the perspectives of nodes running each rippled version: a version 2.4 node drops or delays relaying transactions that a version 2.5 node holds, retries, and relays, so by the start of a consensus round the two can have noticeably different transactions in their open ledgers.
How that causes a problem
Validators build their initial proposals from their open ledger. As described above, validators running version 2.5 could have many more transactions in their open ledger than those running version 2.4. Validators running version 2.4 will potentially discover many "missing" transactions from the proposals coming from validators running version 2.5.
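As a toy illustration (not rippled code), the initial proposal can be thought of as the set of transaction IDs in the validator's open ledger, so the transactions a 2.5 validator held and applied but a 2.4 validator dropped show up directly as differences between the two proposals. The names below (TxId, TxSet, missingFrom) are made up for this sketch:

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Toy model: an initial proposal is effectively the set of transaction IDs
// in a validator's open ledger at the start of the consensus round.
using TxId = std::string;
using TxSet = std::set<TxId>;

// Transactions that appear in a peer validator's proposal but not in our own
// open ledger; these are the "missing" transactions we must fetch from peers.
TxSet missingFrom(TxSet const& theirProposal, TxSet const& ourOpenLedger)
{
    TxSet missing;
    std::set_difference(
        theirProposal.begin(), theirProposal.end(),
        ourOpenLedger.begin(), ourOpenLedger.end(),
        std::inserter(missing, missing.begin()));
    return missing;
}
```

In this toy model, a 2.4 validator comparing its open ledger against a 2.5 validator's proposal would see a large missingFrom result, while the comparison in the other direction would be small.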
When a UNL validator publicizes a transaction set in a proposal, every node on the network 7 that has not seen or built that transaction set is going to attempt to obtain it by requesting it from its peers. Then each of those nodes will request every transaction in the set that it doesn't already have. When the node obtains those transactions, it will cache, process, and relay them as usual. If a small minority of validators builds a transaction set containing transactions that have not widely propagated across the network, that minority is going to be busier than usual answering requests from peers for those transactions until they have propagated more widely. Remember that nodes running version 2.4 have not relayed transactions they see as faulty, so many other nodes will be seeing these transactions for the first time.
The problem is that an unpopular transaction set is hard to get, precisely because it is unpopular. With a popular proposal, a large portion of the validators generate that set, and many non-validators will have generated it, too. Thus fewer nodes ask for it, and the nodes that do ask for it have a good chance of getting it quickly from their peers. On the other hand, with an unpopular proposal, many more nodes will need to ask for it, and far fewer nodes will be able to provide it. That problem is compounded when the transactions in the transaction set are also unpopular. Additionally, all nodes on the network will try to get every transaction set that they see proposed. Those that don't have a given transaction set will repeatedly try to get the set until they succeed or eventually give up 8, because it is possible for an unpopular position to become popular as validators change their votes.
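A rough sketch of the acquisition behavior described in the last two paragraphs, with made-up callback names standing in for peer I/O (this is not the real rippled acquisition code): a node keeps asking peers for a proposed set and for the transactions in it that it is missing, and only stops once it has everything or the set is no longer being proposed. With an unpopular set, the fetches fail more often, so more rounds of requests, and more load on the few nodes that have the data, are needed.

```cpp
#include <functional>
#include <set>
#include <string>

using TxId = std::string;
using TxSetId = std::string;

// Hypothetical callbacks standing in for peer I/O; not real rippled APIs.
struct Peers
{
    // Ask peers for the list of transaction IDs in a proposed set.
    std::function<bool(TxSetId const&, std::set<TxId>&)> fetchSetContents;
    // Ask peers for a single transaction we do not have locally.
    std::function<bool(TxId const&)> fetchTransaction;
};

// Keep trying to acquire a proposed transaction set until every transaction
// in it is held locally, or until validators stop proposing the set.
bool acquireProposedSet(
    TxSetId const& setId,
    std::set<TxId>& localTxs,
    Peers& peers,
    std::function<bool()> stillProposed)
{
    while (stillProposed())
    {
        std::set<TxId> contents;
        if (!peers.fetchSetContents(setId, contents))
            continue;  // no peer had the set yet; try again next round

        bool complete = true;
        for (auto const& id : contents)
        {
            if (localTxs.count(id))
                continue;  // already cached or relayed earlier
            if (peers.fetchTransaction(id))
                localTxs.insert(id);  // cache it; it would also be processed and relayed
            else
                complete = false;     // retry the missing transaction next round
        }
        if (complete)
            return true;
    }
    return false;  // validators stopped proposing the set; give up
}
```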
This can, under sub-optimal circumstances, lead to a cascade effect where a large majority of nodes are continuously trying to get information that only a small minority of nodes has available.
Thus we end up with a transitory problem that dissipates as more nodes upgrade. In retrospect, this change could have been gated by an amendment or other mechanism, not because of the usual ledger consistency reasons, but to allow the network to synchronize when to switch from the old behavior to the new.
Footnotes
1. Transactions are not held if the fail_hard parameter is provided, but for simplicity I will ignore that option in this document. ↩
2. A locally submitted transaction is only treated as local on the initial submission. Subsequent retries treat it like any other transaction. ↩
3. SF_HELD is a new flag used by HashRouter, which globally tracks the state of transactions and other objects. If a transaction does not make it into the open ledger, but is retryable, doesn't meet any of the other criteria to be held, and does not have SF_HELD set, the transaction will be held, and SF_HELD will be set on it. Once this flag is set for a transaction, it won't be unset, and the next attempt to put it into the open ledger will likely be its last chance. ↩
4. A transaction can see a node as a bottleneck due to poor network configuration, such as if the node is the sole connection between two "islands" of other nodes. A transaction will also see the node that it was submitted to as a bottleneck, since the only way for other nodes to see that transaction is for it to be relayed by the submission node or submitted to other nodes. ↩
5. It's also possible for the user to submit the transactions in order, but due to network delays and other factors, they arrive at another node out of order. ↩
6. Or finally arrives at the receiving node via the peer network. ↩
7. Every node on the network attempts to keep up with consensus, not just validators. This includes processing proposals, requesting transaction sets, etc. This saves time and resources in the long run because it increases the number of nodes that will have already built the validated ledger when the validations come in. Nodes that have not built the validated ledger need to request the missing ledger data from peers, which is a more expensive process. ↩
8. However, nodes only give up if UNL validators stop proposing the set. A bug related to this was fixed in version 2.4, and was the ultimate root cause of the network halt in February 2025. The bug was that nodes which had given up would never try to get the transaction set again, even as validators continued to propose it. ↩