feat: remove CAT jitter and shuffle before sending seenTx #2437
base: main
Conversation
This reverts commit 2b2666e.
```go
// ShufflePeers shuffles the peers map from GetAll() and returns a new shuffled map.
// Uses Fisher-Yates shuffle algorithm for maximum speed.
func ShufflePeers(peers map[uint16]p2p.Peer) map[uint16]p2p.Peer {
```
Tested this shuffle with 100 peers and it took <1ms to execute, so it's not impacting performance.
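For reference, a measurement like that could be reproduced with a small benchmark along these lines (a hypothetical sketch, not part of the PR; it assumes it lives in the same package as ShufflePeers, that the import path for the fork's p2p package is as shown, and that p2p.Peer is an interface so nil placeholder values are enough to exercise the shuffle):

```go
package cat // assumed package name; placed next to ShufflePeers

import (
	"testing"

	"github.com/tendermint/tendermint/p2p" // assumed import path
)

// BenchmarkShufflePeers measures shuffling a 100-entry peer map,
// mirroring the "<1ms for 100 peers" check mentioned above.
func BenchmarkShufflePeers(b *testing.B) {
	peers := make(map[uint16]p2p.Peer, 100)
	for i := uint16(0); i < 100; i++ {
		peers[i] = nil // placeholder peer; only the map structure matters here
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = ShufflePeers(peers)
	}
}
```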
```go
	// Build a new map with shuffled order
	result := make(map[uint16]p2p.Peer, len(peers))
	for _, id := range keys {
		result[id] = peers[id]
	}
```
[hate to block on this but I have no chill]
wait why do we create a new map here?
I get that this is hopefully making the map iteration more random in an extremely opaque-hacky way, but we already have a shuffled set above so why not just return 10 peers from that?
which shuffled set do we have?
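For illustration, the alternative being suggested could look roughly like this (a sketch with assumed names and import paths, not the PR's code): shuffle the key slice once with Fisher-Yates and return the first n peers directly, skipping the map rebuild.

```go
package cat // assumed package name

import (
	"math/rand"

	"github.com/tendermint/tendermint/p2p" // assumed import path
)

// randomPeerSubset is a hypothetical helper: it shuffles the peer IDs and
// returns up to n peers as a slice, without rebuilding a map.
func randomPeerSubset(peers map[uint16]p2p.Peer, n int) []p2p.Peer {
	keys := make([]uint16, 0, len(peers))
	for id := range peers {
		keys = append(keys, id)
	}
	// Fisher-Yates shuffle of the key slice.
	for i := len(keys) - 1; i > 0; i-- {
		j := rand.Intn(i + 1)
		keys[i], keys[j] = keys[j], keys[i]
	}
	if n > len(keys) {
		n = len(keys)
	}
	subset := make([]p2p.Peer, 0, n)
	for _, id := range keys[:n] {
		subset = append(subset, peers[id])
	}
	return subset
}
```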
```diff
@@ -33,6 +33,9 @@ const (

 	// ReactorIncomingMessageQueueSize the size of the reactor's message queue.
 	ReactorIncomingMessageQueueSize = 5000
+
+	// maxSeenTxBroadcast defines the maximum number of peers to which a SeenTx message should be broadcasted.
+	maxSeenTxBroadcast = 10
```
not blocking
I'd be curious to see a measurement on throughput and latency of just CAT with this change
in theory, while the queues certainly get fuller, sending SeenTxs could improve throughput and latency by quite a bit due to congestion control.
if we are able to process an irrelevant seenTx in microseconds and the SeenTxs are small, then it just seems weird to optimize for that
it's mostly to reduce the burden on nodes having to upload transactions to a high number of peers at the same time. In fact, in a 100-validator network, each peer is connected to ~50 peers. So when a node receives, for example, a 4MB transaction and sends 50 seenTx messages, it will immediately receive 50 wantTx messages and will need to upload that transaction to 50 peers: 50 * 4MB = 200MB. On top of that, it will also be downloading the transactions it received seenTx messages for, which further increases bandwidth usage.
The idea is to upload the transaction to at most 10, 20, or some other number of peers. Then, once those peers receive the transaction, they will also send a seenTx to their connected peers.
This slows down transaction propagation for each peer, but ensures peers also have bandwidth to download other transactions, as well as remaining bandwidth for block prop messages, consensus, etc.
I tested this and didn't see a change in block time. However, the queue sizes were definitely bigger when uploading to all connected peers, and the more load there is on the network, the fuller the queues get. Once the CAT queue gets full, it starts discarding messages and therefore wasting bandwidth (we don't want to block on adding new messages because that might end up disconnecting peers). So to avoid this, we reduce the number of uploads, which reduces the load and leaves room for the other reactors to send messages freely, while still fully using the mempool.
I didn't run a CAT-only experiment, but even if we knew how much bandwidth it's using, what would we get from that? I can run an experiment with very long block times, spam the network with transactions, and see how many transactions are downloaded by CAT, etc., if you want.
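To make the mechanism concrete, the capped broadcast described here could look roughly like this (an illustrative sketch only: maxSeenTxBroadcast and ShufflePeers come from the diff, while the Reactor method, getPeers, sendSeenTxTo, and the use of types.TxKey are hypothetical stand-ins for the reactor's real broadcast path):

```go
// broadcastSeenTx sends a SeenTx for txKey to at most maxSeenTxBroadcast
// peers, picked from a shuffled view of the connected peers. Downstream
// peers relay their own SeenTx, so the message still reaches the network.
func (memR *Reactor) broadcastSeenTx(txKey types.TxKey) {
	sent := 0
	for _, peer := range ShufflePeers(memR.getPeers()) { // hypothetical accessor
		if sent >= maxSeenTxBroadcast {
			break // cap the fan-out to bound upload bandwidth
		}
		if memR.sendSeenTxTo(peer, txKey) { // hypothetical send helper
			sent++
		}
	}
}
```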
I ran experiments on a 95-validator Talis network; jitter is disabled and we just want to compare sending to all peers vs. 10 peers:
Raw diagrams
(Raw charts for both experiments — 20 blobs of 1MB and 30 blobs of 4MB per instance, 16GB mempool size — comparing "seenTx to only 10 peers" vs. "seenTx to all connected peers": SeenTxs received in a peer per second, total received txs size per second, average bandwidth used per validator, and average data received per channel, where channel 49 is CAT. The same charts are organised in the section below.)
Organised diagrams
Experiments: 95 Validators on Talis Network (Jitter Disabled)
We compare two strategies:
- Sending seenTx to only 10 peers
- Sending seenTx to all connected peers
Spamming the network with 95 instances, each sending 20 blobs of 1MB (total 1.9GB), 16GB mempool size
Charts comparing 10 peers vs. all peers for: block size, block time, SeenTxs received in a peer per second, total received txs size per second, average bandwidth used per validator, and average data received per channel.
Spamming the network with 95 instances, each sending 30 blobs of 4MB (total 11GB), 16GB mempool size
Charts comparing 10 peers vs. all peers for: block size, block time, SeenTxs received in a peer per second, total received txs size per second, average bandwidth used per validator, and average data received per channel.
Conclusion
We see that the difference between limiting to 10 peers and sending to all connected peers (~50) is that the number of seenTxs received goes from 2,000 to 8,000 per second at peak, whereas the downloaded blobs remain the same. So I guess we can keep the limitation. What do you think?
Also, do you think we should run an experiment with the mempool TTL set to 1 block, so that we always have the same amount of transactions circulating in the network? This would stress-test the mempool even more.
it will immediately receive 50 wantTx messages and will need to upload that transaction to 50 peers: 50 * 4MB = 200MB
this is fine as we have congestion control and pull based gossip. the only time they receive 50 wantTx is when channels are completely clear and they are the first ones with the tx. Under congestion, the seenTx will take different paths and nodes will download the data from different peers.
you are definitely correct that the number of seenTx is very large, so the remaining problematic thing would be measuring overhead there.
In the worst-case scenario we receive P SeenTxs for each tx in the network, where P is the number of peers we have and T is the number of txs per block.
If we assume an average of 64KiB txs (most PFBs are probably bigger; I made this number conservatively small), then we can fit 2000 txs in each block. Similarly, if U is the average time to process a SeenTx, then:
P * T * 36 = overhead from SeenTx in bytes
P * T * U = serial processing time from SeenTx
50 * 2000 * 36 bytes = 3.6 MB per block of overhead in the worst case
50 * 2000 * 10us = 1s of serial processing time in the worst case (of course we are not blocking for 1s)
The 3.6MB is not a problem. The processing time might be a problem if it's not on the order of microseconds! However, provided we're not blocking other things, it shouldn't be either.
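Plugging those numbers into a quick back-of-the-envelope check (P, T, the 36-byte SeenTx size, and U are the assumptions from the comment above, not measurements):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		P           = 50                    // connected peers per node
		T           = 2000                  // ~64KiB txs per block
		seenTxBytes = 36                    // assumed SeenTx wire size
		U           = 10 * time.Microsecond // assumed processing time per SeenTx
	)
	// Worst case: every peer sends us a SeenTx for every tx in the block.
	fmt.Printf("SeenTx overhead: %.1f MB per block\n", float64(P*T*seenTxBytes)/1e6) // 3.6 MB
	fmt.Printf("serial processing time: %s per block\n", U*time.Duration(P*T))       // 1s
}
```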
makes sense.
order of microseconds
Actually, some seenTxs take up to 160 microseconds... The average is below 1 microsecond, but some of them take time to process (this is a local benchmark, not from a real network).
however provided we're not blocking other things then it shouldn't either
we're not blocking; however, once the queue is full, CAT messages are discarded. So if you fill the 5000-message queue with seenTxs and then receive the actual transactions, those will be discarded. That's why it's good to keep seenTxs under control.
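That drop-instead-of-block behaviour can be illustrated with a plain bounded channel (a generic sketch, not the reactor's actual queue code):

```go
// tryEnqueue pushes msg into a bounded queue without blocking; when the
// queue is full the message is dropped, which is why filling the queue
// with SeenTxs can cause the actual txs that follow to be discarded.
func tryEnqueue(queue chan<- []byte, msg []byte) bool {
	select {
	case queue <- msg:
		return true
	default:
		return false // queue full: message dropped
	}
}
```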
Also, if you check the above experiments, we only saw an increase in seenTxs; the mempool performance remained almost the same. So I guess it's good to limit it, and we can remove this limit any time we make processing seenTx and sending wantTx faster.
if you check the above experiments, we only saw an increase in seenTxs; the mempool performance remained almost the same.
I'm fine with limiting it for now if you think we should, but imo this is not a good decision-making process long term. Many of comet's performance issues (i.e. the jitter that this PR fixes) came from changes that were made prematurely without actually measuring the important things.
The data we have isn't showing a difference, but we also don't have all the data, such as latency or CAT throughput. Therefore we can't actually reach a conclusion yet beyond "the queues fill up sometimes", especially at the beginning of experiments with massive throughput (11GB of mempool txs total afaict, and the experiment only ran a few blocks).
I can run experiments to get more data. What do you mean by CAT throughput and latency? How can we measure those?
My reasoning is: since we didn't see any significant difference between propagating to only 10 peers vs. 50, it's better to limit to 10 so as not to overwhelm the network with messages. But if that's not the case, we should definitely change it.
another approach would be to try to decrease the processing time of the queues, for example.
if there are instances of 160us, is that due to a mutex issue, or are those the seenTx for which we send a wantTx? do we need to block there, etc.? or were these experiments run with the abci mutex, so processproposal is taking up a meaningful portion of the block time and causing the queues to spike?
to me, adjusting the number of seenTxs sent feels like a hack and not a solution. broadcasting the seenTxs with the congestion control mechanism has a purpose, and we're breaking that when we might not actually need to.
Makes sense, I'll measure those thoroughly in a realistic network and come back with numbers.
What do you mean by CAT throughput and latency? How can we measure those?
latency is easy! we can just measure how long, on average, it takes for a tx to be included in a block. The slower the distribution in the mempool, the longer this should take (on average, ofc, since there are round-robin proposers).
throughput is much more difficult, as we're trying to push CAT to its limits. that means removing existing limits. txsim will eventually have the same throughput as blocks since it waits to submit txs. to remove that limit we will likely need to run CAT w/o consensus or something and remove the sequence check, not sure entirely.
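For the latency side, a sketch of that measurement could be as simple as tracking submit times and comparing them with block timestamps (all names here are illustrative, not celestia-core or txsim APIs):

```go
package latency

import (
	"crypto/sha256"
	"time"
)

// tracker records when txs were submitted and how long they took to land in a block.
type tracker struct {
	submitted map[[sha256.Size]byte]time.Time
	total     time.Duration
	included  int
}

func newTracker() *tracker {
	return &tracker{submitted: make(map[[sha256.Size]byte]time.Time)}
}

// OnSubmit is called whenever the spammer broadcasts a tx.
func (t *tracker) OnSubmit(tx []byte) {
	t.submitted[sha256.Sum256(tx)] = time.Now()
}

// OnBlock is called for every committed block with its timestamp and txs.
func (t *tracker) OnBlock(blockTime time.Time, txs [][]byte) {
	for _, tx := range txs {
		key := sha256.Sum256(tx)
		if submitTime, ok := t.submitted[key]; ok {
			t.total += blockTime.Sub(submitTime)
			t.included++
			delete(t.submitted, key)
		}
	}
}

// Average returns the mean submit-to-inclusion latency observed so far.
func (t *tracker) Average() time.Duration {
	if t.included == 0 {
		return 0
	}
	return t.total / time.Duration(t.included)
}
```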
As part of #2237, this PR removes the jitter from CAT and shares the seenTx with only up to 10 peers so as not to overwhelm nodes.