
Conversation

@rach-id (Member) commented Sep 11, 2025

As part of #2237, this PR removes the jitter from CAT and shares the seenTx with only up to 10 peers, so as not to overwhelm nodes.


// ShufflePeers shuffles the peers map from GetAll() and returns a new shuffled map.
// Uses Fisher-Yates shuffle algorithm for maximum speed.
func ShufflePeers(peers map[uint16]p2p.Peer) map[uint16]p2p.Peer {
@rach-id (Member, Author):

Tested this shuffle with 100 peers and it took <1ms to execute, so it's not impacting performance.
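For context, a rough reconstruction of what the full function presumably looks like; only the signature and the map-building lines are quoted in this thread, so the Fisher-Yates part, the package name, and the import path are assumptions:

package cat

import (
    "math/rand"

    "github.com/tendermint/tendermint/p2p" // import path assumed to match celestia-core's module path
)

// ShufflePeers shuffles the peers map from GetAll() and returns a new map
// built by inserting the entries in shuffled key order. Note that Go does not
// preserve map insertion order, which is what the review comment below is about.
func ShufflePeers(peers map[uint16]p2p.Peer) map[uint16]p2p.Peer {
    // Collect the keys so they can be shuffled in place.
    keys := make([]uint16, 0, len(peers))
    for id := range peers {
        keys = append(keys, id)
    }
    // Fisher-Yates shuffle over the key slice.
    for i := len(keys) - 1; i > 0; i-- {
        j := rand.Intn(i + 1)
        keys[i], keys[j] = keys[j], keys[i]
    }
    // Build a new map with shuffled insertion order (the lines questioned below).
    result := make(map[uint16]p2p.Peer, len(peers))
    for _, id := range keys {
        result[id] = peers[id]
    }
    return result
}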

Comment on lines +406 to +410
// Build a new map with shuffled order
result := make(map[uint16]p2p.Peer, len(peers))
for _, id := range keys {
result[id] = peers[id]
}
@evan-forbes (Member):

[hate to block on this but I have no chill]

wait why do we create a new map here?

I get that this is hopefully making the map iteration more random in an extremely opaque-hacky way, but we already have a shuffled set above so why not just return 10 peers from that?

@rach-id (Member, Author):

which shuffled set do we have?
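For reference, a minimal sketch of what the reviewer's suggestion would presumably look like, in the same package as ShufflePeers (randomPeerSubset is a hypothetical helper, not part of the PR): return at most n peers straight from the shuffled key slice instead of rebuilding a map.

// randomPeerSubset picks at most n peers directly from the shuffled keys
// rather than rebuilding a map whose iteration order Go randomizes anyway.
func randomPeerSubset(peers map[uint16]p2p.Peer, n int) []p2p.Peer {
    keys := make([]uint16, 0, len(peers))
    for id := range peers {
        keys = append(keys, id)
    }
    rand.Shuffle(len(keys), func(i, j int) { keys[i], keys[j] = keys[j], keys[i] })

    if n > len(keys) {
        n = len(keys)
    }
    subset := make([]p2p.Peer, 0, n)
    for _, id := range keys[:n] {
        subset = append(subset, peers[id])
    }
    return subset
}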

@@ -33,6 +33,9 @@ const (

// ReactorIncomingMessageQueueSize the size of the reactor's message queue.
ReactorIncomingMessageQueueSize = 5000

// maxSeenTxBroadcast defines the maximum number of peers to which a SeenTx message should be broadcasted.
maxSeenTxBroadcast = 10
@evan-forbes (Member):

not blocking

I'd be curious to see a measurement on throughput and latency of just CAT with this change

in theory, while the queues certainly get fuller, sending SeenTxs could improve throughput and latency by quite a bit due to congestion control.

if we are able to process an irrelevant seenTx in microseconds and the SeenTxs are small, then it just seems weird to optimize for that

@rach-id (Member, Author):

it's mostly to reduce the burden on nodes of uploading transactions to a high number of peers at the same time. In fact, for a 100-validator network, each peer is connected to 50 peers. So when a node receives, for example, a 4MB transaction, it sends 50 seenTx messages, immediately receives 50 wantTx in return, and then needs to upload that transaction to 50 peers: 50 * 4 = 200MB. On top of that, it will also be downloading transactions it received seenTx for, which increases bandwidth usage even further.

The idea is to upload the transaction to only 10 or 20 or some limited number of peers. Then, once those peers receive the transaction, they will also send a seenTx to their connected peers. This slows down transaction propagation for each peer, but it ensures peers still have bandwidth to download other transactions, plus remaining bandwidth for block propagation messages, consensus, etc.

I tested this and didn't see a change in block time. However, the queue sizes were definitely bigger when uploading to all connected peers, and the more load there is on the network, the fuller the queues get. Once the CAT queue is full, it starts discarding messages and therefore wasting bandwidth (we don't want to block on adding new messages because that might end up disconnecting peers). So to avoid this, we reduce the number of uploads, which reduces the load and leaves room for the other reactors to send messages freely, while still fully using the mempool.

I didn't run a CAT-only experiment, but even if we knew how much bandwidth it's using, what would we get from that? I can run an experiment with very long block times, spam the network with transactions, and see how many transactions are downloaded by CAT etc., if you want.
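A minimal sketch of how this limit could be wired up, reusing the PR's ShufflePeers helper and the maxSeenTxBroadcast constant from the diff above; sendSeenTx is a hypothetical stand-in for however the reactor actually enqueues the SeenTx message on the CAT channel:

// broadcastSeenTxLimited sends the SeenTx to at most maxSeenTxBroadcast
// randomly chosen peers instead of every connected peer.
func broadcastSeenTxLimited(peers map[uint16]p2p.Peer, sendSeenTx func(p2p.Peer)) {
    sent := 0
    for _, peer := range ShufflePeers(peers) {
        if sent >= maxSeenTxBroadcast {
            break
        }
        sendSeenTx(peer)
        sent++
    }
}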

@rach-id (Member, Author), Sep 12, 2025:

I ran experiments on a 95-validator Talis network, with jitter disabled, to compare sending to all peers vs. sending to only 10 peers:

(Raw and organised diagrams omitted.) Each run includes plots of block size, block time, seenTxs received per peer per second, total received tx size per second, average bandwidth used per validator, and average data received per channel (channel 49 is CAT), comparing the two strategies:

  • Sending seenTx to only 10 peers
  • Sending seenTx to all connected peers

under two loads, both with a 16GB mempool size:

  • 95 instances, each sending 20 blobs of 1MB (1.9GB total)
  • 95 instances, each sending 30 blobs of 4MB (11GB total)

Conclusion

We see that the difference between limiting to 10 peers and sending to all connected peers (~50) is that the number of seenTxs received at peak goes from 2000 to 8000 per second, whereas the downloaded blobs remain the same. So I guess we can keep the limitation. What do you think?

Also, do you think we should run an experiment with the mempool TTL set to 1 block, so that we always have the same amount of transactions circulating in the network? This would stress test the mempool even more.

@evan-forbes (Member):

They will receive immediately 50 wantTx, and they will need to upload that transaction to 50 peers, 50 * 4 = 200mb

this is fine as we have congestion control and pull-based gossip. the only time they receive 50 wantTx is when the channels are completely clear and they are the first ones with the tx. Under congestion, the seenTx will take different paths and nodes will download the data from different peers.

you are definitely correct that the number of seenTx is very large, so the remaining problematic thing would be measuring overhead there.

In the worst-case scenario we receive P SeenTxs for each of the T txs in the network, where P is the number of peers we have.

If we assume an average tx size of 64KiB (most PFBs are probably bigger; I made this number conservatively small), then we can fit 2000 txs in each block. Similarly, if U is the average time to process a SeenTx, then:

P * T * 36 = overhead from SeenTx in bytes
P * T * U = serial processing time spent on SeenTx

50 * 2000 * 36 = 3.6 MB per block of overhead in the worst case
50 * 2000 * 10us = 1s serial processing time in the worst case (ofc we are not blocking for 1s)

the 3.6MB is not a problem. the processing time might be a problem if it's not on the order of microseconds! however, provided we're not blocking other things, it shouldn't be either
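A quick back-of-the-envelope check of the worst-case numbers above (the 36-byte SeenTx size and the 10µs processing cost are the assumptions stated in this comment):

package main

import (
    "fmt"
    "time"
)

func main() {
    const (
        peers         = 50                    // P: connected peers
        txsPerBlock   = 2000                  // T: txs per block at 64KiB average
        seenTxBytes   = 36                    // size of one SeenTx message
        perSeenTxCost = 10 * time.Microsecond // U: assumed processing time
    )

    overhead := peers * txsPerBlock * seenTxBytes
    serial := time.Duration(peers*txsPerBlock) * perSeenTxCost

    fmt.Printf("worst-case SeenTx overhead: %.1f MB per block\n", float64(overhead)/1e6)
    fmt.Printf("worst-case serial processing time: %s\n", serial) // not actually blocked for this long
}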

@rach-id (Member, Author), Sep 12, 2025:

makes sense.

order of microseconds

Actually, some seenTxs take up to 160 microseconds... The average is below 1 microsecond, but some of them take time to process (this is a local benchmark, not from a real network).

however provided we're not blocking other things then it shouldn't either

we're not blocking; however, once the queue is full, CAT messages are discarded. So if you fill the 5000-message queue with seenTxs and then receive the actual transactions, they will be discarded. That's why it's good to keep seenTxs under control.

Also, if you check the experiments above, we only saw an increase in seenTxs; the mempool performance remained almost the same. So I guess it's good to limit it, and we can remove this limit anytime we make processing seenTx and sending wantTx faster.
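A minimal sketch of the drop-when-full behaviour being described, assuming the incoming CAT messages go through a bounded channel of ReactorIncomingMessageQueueSize entries (the type and function names here are illustrative, not the reactor's actual API):

// incomingMessage is an illustrative stand-in for a decoded CAT envelope.
type incomingMessage struct {
    peerID uint16
    msg    []byte
}

// enqueueOrDrop tries a non-blocking send: blocking here could stall the p2p
// layer and eventually disconnect the peer, so a full queue means the message
// (seenTx, wantTx, or even a full tx) is simply discarded.
func enqueueOrDrop(queue chan incomingMessage, m incomingMessage) bool {
    select {
    case queue <- m:
        return true // queued for processing
    default:
        return false // queue full: message dropped, bandwidth wasted
    }
}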

@evan-forbes (Member), Sep 12, 2025:

if you check the above experiments, we only saw an increase in seenTxs, the performance remained almost the same for the mempool.

I'm fine with limiting it for now if you think we should, but imo this is not a good decision-making process long term. Many of comet's performance issues (i.e. the jitter that this PR removes) come from changes that were made prematurely, without actually measuring the important things.

The data we have isn't showing a difference, but we also don't have all the data, such as latency or CAT throughput. Therefore we can't actually reach a conclusion yet beyond "the queues fill up sometimes", especially at the beginning of experiments with massive throughput (11GB of mempool txs total afaict, and the experiment only ran a few blocks).

@rach-id (Member, Author):

I can run experiments to get more data. What do you mean by CAT throughput and latency? How can we measure those?

My reasoning is: since we didn't see any significant difference between propagating to only 10 peers and propagating to 50, it's better to limit it to 10 so as not to overwhelm the network with messages. But if that's not the case, we should definitely change it.

@evan-forbes (Member):

another approach would be to try to decrease the processing time of the queues for example.

if there are instances of 160us, then is that due to a mutex issue, or are those seenTx where we send a wantTx? do we need to block there etc.? or were these experiments run with the abci mutex, so ProcessProposal is taking up a meaningful portion of the block time and causing the queues to spike?

to me, adjusting the number of seenTxs sent feels like a hack and not a solution. broadcasting the seenTxs with the congestion control mech has a purpose and we're breaking that when we might not actually need to

@rach-id (Member, Author):

Makes sense, I'll measure those thoroughly in a realistic network and come back with numbers.

@evan-forbes (Member):

What do you mean by CAT throughput and latency? How can we measure those?

latency is easy! we can just measure how long, on average, it takes for a tx to be included in a block. The slower the distribution in the mempool, the longer this should take (on average ofc, since there are round-robin proposers)

throughput is much more difficult, as we're trying to push CAT to its limits. that means removing existing limits. txsim will eventually have the same throughput as blocks since it waits to submit txs. to remove that limit we will likely need to run CAT w/o consensus or something and remove the sequence check, not sure entirely
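A hypothetical sketch of the latency measurement described here (submitTimes, onTxSubmitted, and onBlockCommitted are illustrative names, not an existing API): record the submission time per tx hash, then take the delta when the tx appears in a committed block.

import "time"

// submitTimes records when each tx (keyed by hash) was submitted.
var submitTimes = map[string]time.Time{}

// onTxSubmitted is called when txsim (or any client) submits a tx.
func onTxSubmitted(txHash string) {
    submitTimes[txHash] = time.Now()
}

// onBlockCommitted returns the inclusion latency of every tx in the block
// that we previously submitted; averaging these gives the metric discussed above.
func onBlockCommitted(txHashes []string) []time.Duration {
    var latencies []time.Duration
    for _, h := range txHashes {
        if submitted, ok := submitTimes[h]; ok {
            latencies = append(latencies, time.Since(submitted))
            delete(submitTimes, h)
        }
    }
    return latencies
}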
