netsync: Convert to synchronous model. #3496
Conversation
I should also note that I have been running this for several days with the race detector to ensure there are no reported races and everything has been working great.
This is the first in a series of commits that will ultimately convert all of the code related to the sync manager handling events from remote peers to synchronous code that makes use of separate mutexes and atomics to protect concurrent access. The motivation for the change in architecture follows.

Currently, all operations performed by the sync manager are implemented via a single event handler goroutine and a single message channel to protect concurrent access. The existing implementation has worked well for many years since syncing currently only happens via a single sync peer. However, with the goal of moving to a syncing paradigm where there is no sync peer and data is intelligently downloaded from multiple peers in parallel, the current implementation is not ideal in a few ways:

- It is difficult to improve parallel processing since everything is forced into a single goroutine
- It requires additional plumbing to provide access to any related information
- The use of channels significantly hinders dynamically modifying requests based on state, which is necessary for efficiently syncing from multiple peers

Converting all the relevant code to synchronous code addresses the aforementioned concerns and has some additional benefits. For example, the new model:

- Ultimately allows more code to run in parallel in the individual peer goroutines
- Requires less plumbing for handling events
- Makes the state available to calling code so it can make better decisions
- Blocks further reads until the messages are processed by default
- Provides better overall throughput

One notable aspect of the overall change to the architecture is that the peer read goroutines themselves will now block while processing the messages by default, as opposed to the current model where the messages are queued up to be handled asynchronously, which then allows the peers to immediately read more data. This existing async queuing approach requires care to prevent malicious behavior, as evidenced by several of the handlers (for example, blocks, transactions, and mix messages) currently manually implementing blocking behavior by using channels that are notified once the events are eventually processed by the async handler. Moreover, since all of the events from all peers currently go through the single sync manager goroutine, the aforementioned messages often end up blocking not only until the message itself is processed, but also until all of the other messages from all of the other peers that arrived first are processed. The existing model also results in further induced delays because most of the messages result in requests to the peer in response, which obviously can't be sent until the message is processed.

So, while at first glance the lack of read pipelining by default might raise some potential questions about extra latency, in practice the new model will typically result in faster overall processing despite that. Further, it permits much more efficient per-message targeted pipelining if any major bottlenecks were to arise in new messages.

With all of that in mind, this commit starts the conversion by introducing a separate mutex to protect the peers map and updates the code to protect all references to the map with that mutex. This will allow future commits to methodically refactor the various operations without introducing races.
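As a rough sketch of the pattern this commit introduces (the identifiers below are illustrative, not the actual dcrd names), the map and its mutex live together on the manager, and every access acquires the lock:

```go
package netsync

import "sync"

// Peer is a placeholder for the per-peer state tracked by the manager.
type Peer struct{}

// SyncManager illustrates the pattern: the peers map gets its own
// mutex instead of being confined to a single event handler goroutine.
type SyncManager struct {
	// peersMtx protects peers.  Every access to the map must hold it
	// since peers connect and disconnect from separate goroutines.
	peersMtx sync.Mutex
	peers    map[*Peer]struct{}
}

// newSyncManager returns a sync manager with an initialized peers map.
func newSyncManager() *SyncManager {
	return &SyncManager{peers: make(map[*Peer]struct{})}
}

func (m *SyncManager) addPeer(p *Peer) {
	m.peersMtx.Lock()
	m.peers[p] = struct{}{}
	m.peersMtx.Unlock()
}

func (m *SyncManager) removePeer(p *Peer) {
	m.peersMtx.Lock()
	delete(m.peers, p)
	m.peersMtx.Unlock()
}
```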
Currently, the logic that transitions from the initial header sync to the initial chain sync is quite buried in a conditional in the header handling routine, which makes it somewhat challenging to find and follow for those not familiar with the code. It also makes it more difficult to update, particularly if more than a single sync peer for the initial header sync is implemented, since it isn't concurrent safe and other code around it relies on local state it updates.

This aims to improve the aforementioned by separating the handling for when the manager determines the initial header sync process completes into a separate method named onInitialHeaderSyncDone. The main benefits are:

- It is much easier to find and follow the overall sequence of events that take place during the initial sync process
- It is more consistent with the existing onInitialChainSyncDone method that is invoked when the initial chain sync process completes
- It will allow future commits to more easily add additional handling and make it concurrent safe

This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
A peer that serves data is not necessarily a sync peer candidate, but serving data is a necessary condition for being a sync peer candidate. This helps make that distinction clear by adding a new flag to the Peer struct that specifies whether or not the peer serves data and updating a few places in the code that only rely on serving data to use the new flag instead of overloading the meaning of the sync candidate flag. It also has the benefit of being concurrent safe, unlike the sync candidate flag, since whether or not a peer serves data can't change, while sync candidacy can. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
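For illustration, a minimal sketch of what an immutable serves-data flag looks like (hypothetical field and method names, not the actual dcrd code):

```go
package netsync

// Peer sketches the relevant part of the peer state: servesData is
// assigned once at creation based on the peer's advertised services
// and never mutated, so reading it from any goroutine is safe without
// synchronization.
type Peer struct {
	servesData bool
}

// ServesData returns whether the peer serves data.  It is concurrent
// safe because the underlying flag is immutable.
func (p *Peer) ServesData() bool {
	return p.servesData
}
```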
Force-pushed a2bef0b to 7fadaf3.
Force-pushed 7fadaf3 to 84aec9f.
Force-pushed 84aec9f to da352a1.
Added.
This simplifies sync peer candidacy determination and makes it concurrent safe by removing the sync candidate flag from the Peer struct and instead basing candidacy on the new immutable serves data flag and imposing the other necessary conditions when choosing the best sync peer. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This modifies the sync height to use an atomic.Int64 instead of a mutex to protect concurrent access and introduces a new method that atomically and conditionally sets it to a higher value. This approach makes it both harder to misuse and quite a bit faster due to using non-blocking optimistic locking. It also adds a test to ensure the new method works as expected including under concurrent access. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
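The conditional update described here is the classic lock-free maximum: a compare-and-swap loop that only stores the new value when it is still greater than the current one. A minimal sketch, with a hypothetical function name:

```go
package netsync

import "sync/atomic"

// syncHeight tracks the best known sync height across all peers.
var syncHeight atomic.Int64

// maybeUpdateSyncHeight atomically sets the sync height to the given
// value when, and only when, it is greater than the current value.
// The compare-and-swap loop retries if another goroutine changed the
// value in the meantime, so no mutex is required.
func maybeUpdateSyncHeight(newHeight int64) {
	for {
		cur := syncHeight.Load()
		if newHeight <= cur {
			return
		}
		if syncHeight.CompareAndSwap(cur, newHeight) {
			return
		}
	}
}
```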
This modifies the isCurrent flag to use an atomic.Bool instead of a mutex to protect concurrent access. This approach makes it both harder to misuse and quite a bit faster due to using non-blocking optimistic locking. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This changes the headersSynced flag to an atomic.Bool so it is safe to access concurrently. It also introduces a new method on the header sync state named InitialHeaderSyncDone to simplify and improve the readability of the various places that check its state. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This makes access to the sync peer independently concurrent safe by protecting it with a mutex and updating all of the call sites appropriately. It also ensures the methods related to starting the various chain sync steps from the sync peer are protected by the mutex since their overall logic relies on its value to determine when they need to be invoked. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This makes all accesses to the request maps independently concurrent safe by protecting them with a mutex and updating all of the call sites accordingly. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This makes all logic related to requesting the next needed blocks from the sync peer independently concurrent safe by protecting it with a mutex and updating all of the call sites accordingly. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This modifies the stall timeout logic to make it concurrent safe by slightly loosening the ordering requirements as it relates to stopping and resetting the stall timeout. The current implementation was careful to prevent stale time values from being delivered on the timer's channel; however, there really is no reason to worry about that possibility because:

1) The stall timeout logic only depends on the event as opposed to the time values
2) The timeout is sufficiently long that it is almost never hit in practice
3) It is even more rare that the timer would just happen to be in the process of firing (sending to the channel) when data simultaneously arrives and resets the timer just before it actually fires
4) Even if all of that were to happen, the only effect is the standard stall timeout handling logic, which means there are no adverse effects

This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
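In code, the loosened approach amounts to simply resetting the timer on activity without the usual stop-and-drain dance. A rough sketch under assumed names, signature, and timeout value:

```go
package netsync

import "time"

// stallHandler sketches the loosened timer handling.  The timer is
// reset on activity without stopping and draining it first; per the
// reasoning above, a rare stale fire is harmless because only the
// event matters, not the time value it carries.
func stallHandler(activity <-chan struct{}, quit <-chan struct{}) {
	const stallTimeout = 30 * time.Second // illustrative value
	stallTimer := time.NewTimer(stallTimeout)
	defer stallTimer.Stop()
	for {
		select {
		case <-activity:
			// Data arrived, so push the deadline out.  If the timer
			// happened to be firing concurrently, the worst case is
			// one extra pass through the stall handling below.
			stallTimer.Reset(stallTimeout)

		case <-stallTimer.C:
			// Standard stall timeout handling would run here.
			stallTimer.Reset(stallTimeout)

		case <-quit:
			return
		}
	}
}
```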
This refactors the logic related to adding a new valid peer to the sync manager out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to handling valid peers disconnecting from the sync manager out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This changes the numConsecutiveOrphanHeaders counter to an atomic.Int64 so it is independently safe for concurrent access. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This makes all logic related to tracking the best announced block by each peer independently concurrent safe by protecting the related fields with a mutex and updating all of the call sites accordingly. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
I should note that the sections around logging during the initial header sync and transitioning from the initial headers sync mode to the initial chain sync mode would need to have the same treatment that was done for the transition out of the initial chain sync mode applied to them to truly avoid any potential logic races if parallel header downloading is ever implemented. I didn't spend any time updating that aspect since there is only a single header sync peer at the moment, and thus the header sync is all running in that same peer's goroutine, which prevents any potential logic races that could otherwise occur. There are also no immediate plans to implement parallel header downloading, in part because doing so would require an entirely new and more capable wire message to be implemented first, but also because it's already such a tiny fraction of the overall time and headers don't grow at a very fast rate, so it will be a very long time before any bottleneck there would become an issue. I am considering addressing it now even though it's not technically required yet since it might not be super obvious later. Thoughts?
Seems reasonable to fix that here now as well. Generally, it seems reasonable to fix any issue as you find it when you can quickly knock it out.
This splits the logic for starting the initial header sync and initial chain sync processes into separate methods and also makes the warning issued when there are no sync peers concurrent safe as well as prevents multiple unnecessary warnings. The primary motivation for this change is to make it possible to independently make the state transitions between the modes concurrent safe in upcoming future commits. Looking even further ahead, it partially paves the way to eventually removing the entire notion of a chain sync peer (but not a header sync peer). However, it also improves readability and has the nice minor benefit of slightly improved efficiency when peers connect and disconnect since it allows the logic that does not apply during the initial headers sync to more easily be skipped without needing to check the flag multiple times.
Force-pushed baac638 to 3123792.
I updated this to add a couple of commits that address this. The header sync transition and related logging are now concurrent safe in case parallel header downloading is ever implemented during the headers sync process. The updated code also gates the header sync progress logging behind the initial header sync mode flag since it is technically more accurate, even though it does not really matter so long as there is only a single header sync peer. The two commits are:
The current approach to logging during the initial headers sync and transitioning from the initial headers sync mode to the initial chain sync mode relies on the fact that the entire sequence is mutually exclusive, as is currently the case, but it would result in logic races if parallel header downloading is ever implemented during the headers sync process.

There are no immediate plans to implement parallel header downloading, in part because doing so would require an entirely new and more capable wire message to be implemented first, but also because it's already such a tiny fraction of the overall sync time and headers don't grow at a very fast rate, so it will be a very long time before any bottleneck there would become an issue. However, it is worthwhile to address it now, even though it's not technically required yet, since it might not be super obvious later.

Consequently, this reworks the logic related to transitioning to the initial headers sync done state slightly to be independently concurrent safe by adding a mutex to the header sync state and ensuring the flag that tracks its completion, which is also separately an atomic for performance, is re-checked and updated under that mutex to ensure the transition is idempotent. It also gates the header sync progress logging behind the initial header sync mode flag since it is technically more accurate, even though it does not really matter so long as there is only a single header sync peer.

This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
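The idempotent transition described here is essentially double-checked locking: the atomic flag keeps reads cheap on hot paths, while the re-check under the mutex guarantees the one-time transition work runs exactly once. A sketch with assumed field names and method bodies:

```go
package netsync

import (
	"sync"
	"sync/atomic"
)

// headerSyncState sketches the completion transition.
type headerSyncState struct {
	// headersSynced provides cheap lock-free reads on hot paths.
	headersSynced atomic.Bool

	// mtx serializes the transition itself so the one-time work only
	// happens once even when multiple goroutines race to complete it.
	mtx sync.Mutex
}

// InitialHeaderSyncDone returns whether the initial header sync
// process has completed.
func (s *headerSyncState) InitialHeaderSyncDone() bool {
	return s.headersSynced.Load()
}

// onInitialHeaderSyncDone performs the one-time transition out of the
// initial header sync mode.  The flag is re-checked under the mutex,
// making the transition idempotent for concurrent callers.
func (s *headerSyncState) onInitialHeaderSyncDone() {
	s.mtx.Lock()
	defer s.mtx.Unlock()
	if s.headersSynced.Load() {
		return
	}
	// ... one-time transition work (logging, mode switch) goes here ...
	s.headersSynced.Store(true)
}
```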
This refactors the logic related to handling header messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
The current approach to detecting when the initial chain sync process is done is based on determining if the chain wasn't current prior to processing the block and becomes current after. That logic is sound so long as the entire sequence is mutually exclusive, as is currently the case, but it would result in logic races when processing multiple blocks concurrently.

Consequently, this reworks the logic related to transitioning to the initial chain sync done state slightly to be independently concurrent safe by introducing a new flag and mutex to track its state and then transitioning based on the flag and the status of whether the chain believes it is current. It also squashes a very minor long-standing logging bug where the transition out of the initial chain sync mode would sometimes either not log the last batch of processed blocks for an extended period or not log it at all.

This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to handling block messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to handling inv messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to handling notfound messages from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to handling transaction messages from peers out of the event handler since it is now independently concurrent safe. It also slightly modifies the behavior to return the slice of accepted transactions to the caller so it can announce them as opposed to relying on a callback to do the announcements. The announcement callback is removed from the peer notifier interface since it is no longer needed. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
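In effect, the handler's shape changes from pushing announcements through a callback to returning the accepted transactions to the caller. A rough sketch (the type and function names below are stand-ins, not the actual dcrd signatures):

```go
package netsync

// Tx is a stand-in for dcrd's dcrutil.Tx.
type Tx struct{}

// onTx sketches the reworked handler shape: rather than announcing
// accepted transactions through a peer notifier callback, the handler
// returns them so the calling peer read goroutine can perform the
// announcements itself.
func onTx(tx *Tx) []*Tx {
	// Mempool processing elided; assume the transaction and any
	// orphans it unblocked were accepted.
	accepted := []*Tx{tx}
	return accepted
}
```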
This refactors the logic related to handling mix messages from peers out of the event handler since it is now independently concurrent safe. It also slightly modifies the behavior to return the slice of accepted messages to the caller so it can announce them as opposed to relying on a callback to do the announcements. The entire peer notifier interface and containing file is removed since it is no longer needed. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to requesting data from peers out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to processing blocks from other sources than the network out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This refactors the logic related to returning the current sync peer ID out of the event handler since it is now independently concurrent safe. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This removes the event handler message channel since it is no longer used. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
This renames the event handler goroutine to stallHandler to more accurately reflect its only remaining purpose after all of the recent changes to move everything else out of the event handler. This is a part of the overall effort to convert the code related to handling the various sync manager events to synchronous code that runs in the various caller goroutines.
Force-pushed 3123792 to 4819a6e.
This converts all of the code related to the sync manager handling events from remote peers to synchronous code that makes use of separate mutexes and atomics to protect concurrent access and includes some other miscellaneous improvements related to readability, performance, and maintainability.
It also squashes a very minor long-standing logging bug where the transition out of the initial chain sync mode would sometimes either not log the last batch of processed blocks for an extended period or not log it at all.
The motivation for the change in architecture follows.
Currently, all operations performed by the sync manager are implemented via a single event handler goroutine and a single message channel to protect concurrent access.
The existing implementation has worked well for many years since syncing currently only happens via a single sync peer.
However, with the goal of moving to a syncing paradigm where there is no sync peer and data is intelligently downloaded from multiple peers in parallel, the current implementation is not ideal in a few ways:

- It is difficult to improve parallel processing since everything is forced into a single goroutine
- It requires additional plumbing to provide access to any related information
- The use of channels significantly hinders dynamically modifying requests based on state, which is necessary for efficiently syncing from multiple peers
Converting all the relevant code to synchronous code addresses the aforementioned concerns and has some additional benefits.
For example, the new model:

- Ultimately allows more code to run in parallel in the individual peer goroutines
- Requires less plumbing for handling events
- Makes the state available to calling code so it can make better decisions
- Blocks further reads until the messages are processed by default
- Provides better overall throughput
One notable aspect of the overall change to the architecture is that the peer read goroutines themselves will now block while processing the messages by default as opposed to the current model where the messages are queued up to be handled asynchronously which then allows the peers to immediately read more data.
This existing async queuing approach requires care to prevent malicious behavior, as evidenced by several of the handlers (for example, blocks, transactions, and mix messages) currently manually implementing blocking behavior by using channels that are notified once the events are eventually processed by the async handler.
Moreover, since all of the events from all peers currently go through the single sync manager goroutine, the aforementioned messages often end up blocking not only until the message itself is processed, but also until all of the other messages from all of the other peers that arrived first are processed.
The existing model also results in further induced delays because most of the messages result in requests to the peer in response which obviously can't be sent until the message is processed.
So, while at first glance the lack of read pipelining by default might raise some potential questions about extra latency, in practice, the new model will typically result in faster overall processing despite that. Further, it permits much more efficient per-message targeted pipelining if any major bottlenecks were to arise in new messages.
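Concretely, the synchronous model means the per-peer read loop looks roughly like the following sketch (names are illustrative): the handler is invoked inline, so the next read naturally waits until the current message has been fully processed.

```go
package netsync

import "net"

// Msg is a stand-in for a decoded wire message.
type Msg struct{}

// readMessage is a stand-in for the peer's wire deserialization.
func readMessage(conn net.Conn) (*Msg, error) {
	return &Msg{}, nil
}

// inHandler sketches the per-peer read loop under the synchronous
// model: the handler runs inline, so the next read does not start
// until the current message has been fully processed.
func inHandler(conn net.Conn, handle func(*Msg)) {
	for {
		msg, err := readMessage(conn)
		if err != nil {
			return
		}
		// Previously the message would be queued to the single sync
		// manager goroutine via a channel, allowing the loop to read
		// more data immediately.  Now it is processed synchronously.
		handle(msg)
	}
}
```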
This consists of a series of commits to help ease the review process. Each commit is intended to be a self-contained and logically easy to follow change such that the code continues to compile and works properly at each step.
See the description of each commit for further details.
This is work towards #1145.