netsync: Convert to synchronous model. #3496


Open
davecgh wants to merge 29 commits into master from netsync_concurrent

Conversation

@davecgh (Member) commented Jul 2, 2025

This converts all of the code related to the sync manager handling events from remote peers to synchronous code that makes use of separate mutexes and atomics to protect concurrent access. The change also includes some other miscellaneous improvements related to readability, performance, and maintainability.

It also squashes a very minor long-standing logging bug where the transition out of the initial chain sync mode would sometimes either delay logging the last batch of processed blocks for an extended period or not log it at all.

The motivation for the change in architecture follows.

Currently, all operations performed by the sync manager are implemented via a single event handler goroutine and a single message channel to protect concurrent access.

The existing implementation has worked well for many years since syncing currently only happens via a single sync peer.

However, with the goal of moving to a syncing paradigm where there is no sync peer and data is intelligently downloaded from multiple peers in parallel, the current implementation is not ideal in a few ways:

  • It is difficult to improve parallel processing since everything is forced into a single goroutine
  • It requires additional plumbing to provide access to any related information
  • The use of channels significantly hinders dynamically modifying requests based on state, which is necessary for efficiently syncing from multiple peers

Converting all the relevant code to synchronous code addresses the aforementioned concerns and has some additional benefits.

For example, the new model:

  • Ultimately allows more code to run in parallel in the individual peer goroutines
  • Requires less plumbing for handling events
  • Makes the state available to calling code so it can make better decisions
  • Blocks further reads until the messages are processed by default
  • Provides better overall throughput

One notable aspect of the overall change to the architecture is that the peer read goroutines themselves will now block while processing the messages by default as opposed to the current model where the messages are queued up to be handled asynchronously which then allows the peers to immediately read more data.

The existing async queuing approach requires care to prevent malicious behavior, as evidenced by several of the handlers (for example, blocks, transactions, and mix messages) currently implementing blocking behavior manually by using channels that are notified once the events are eventually processed by the async handler.

Moreover, since all of the events from all peers currently go through the single sync manager goroutine, the aforementioned messages often end up blocking not only until the message itself is processed, but also until all of the other messages from all of the other peers that arrived first are processed.

The existing model also results in further induced delays because most of the messages result in requests to the peer in response, which obviously can't be sent until the message is processed.

So, while at first glance the lack of read pipelining by default might raise some potential questions about extra latency, in practice, the new model will typically result in faster overall processing despite that. Further, it permits much more efficient per-message targeted pipelining if any major bottlenecks were to arise in new messages.
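
To make the contrast concrete, here is a minimal sketch (hypothetical names, not the actual netsync types) of the queued event-handler model versus direct synchronous calls guarded by a mutex:

```go
package sketch

import "sync"

// asyncManager shows the old shape: peer read goroutines queue events on a
// channel and a single event handler goroutine processes them serially.
type asyncManager struct {
	msgChan chan interface{}
}

func (m *asyncManager) eventHandler() {
	for msg := range m.msgChan {
		_ = msg // every peer's events funnel through this one goroutine
	}
}

// syncManager shows the new shape: peer read goroutines call directly into
// the manager and shared state is guarded by finer-grained mutexes, so each
// caller only blocks on its own message instead of waiting behind every
// other peer's queued messages.
type syncManager struct {
	mtx   sync.Mutex
	peers map[int32]struct{}
}

func (m *syncManager) OnSomeMessage(peerID int32) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	_ = m.peers[peerID] // handle the message in the calling peer goroutine
}
```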

This consists of a series of commits to help ease the review process. Each commit is intended to be a self-contained, logically easy-to-follow change such that the code continues to compile and work properly at each step.

See the description of each commit for further details.

This is work towards #1145.

@davecgh davecgh added this to the 2.1.0 milestone Jul 2, 2025
@davecgh (Member, Author) commented Jul 2, 2025

I should also note that I have been running this for several days with the race detector to ensure there are no reported races and everything has been working great.

davecgh added 3 commits July 2, 2025 05:07
This is the first in a series of commits that will ultimately convert
all of the code related to the sync manager handling events from remote
peers to synchronous code that makes use of separate mutexes and atomics
to protect concurrent access.

The motivation for the change in architecture follows.

Currently, all operations performed by the sync manager are implemented
via a single event handler goroutine and a single message channel to
protect concurrent access.

The existing implementation has worked well for many years since syncing
currently only happens via a single sync peer.

However, with the goal of moving to a syncing paradigm where there is no
sync peer and data is intelligently downloaded from multiple peers in
parallel, the current implementation is not ideal in a few ways:

- It is difficult to improve parallel processing since everything is
  forced into a single goroutine
- It requires additional plumbing to provide access to any related
  information
- The use of channels significantly hinders dynamically modifying
  requests based on state, which is necessary for efficiently syncing
  from multiple peers

Converting all the relevant code to synchronous code addresses the
aforementioned concerns and has some additional benefits.

For example, the new model:

- Ultimately allows more code to run in parallel in the individual peer
  goroutines
- Requires less plumbing for handling events
- Makes the state available to calling code so it can make better
  decisions
- Blocks further reads until the messages are processed by default
- Provides better overall throughput

One notable aspect of the overall change to the architecture is that the
peer read goroutines themselves will now block while processing the
messages by default as opposed to the current model where the messages
are queued up to be handled asynchronously which then allows the peers
to immediately read more data.

The existing async queuing approach requires care to prevent malicious
behavior, as evidenced by several of the handlers (for example, blocks,
transactions, and mix messages) currently implementing blocking behavior
manually by using channels that are notified once the events are
eventually processed by the async handler.

Moreover, since all of the events from all peers currently go through
the single sync manager goroutine, the aforementioned messages often end
up blocking not only until the message itself is processed, but also
until all of the other messages from all of the other peers that arrived
first are processed.

The existing model also results in further induced delays because most
of the messages result in requests to the peer in response, which
obviously can't be sent until the message is processed.

So, while at first glance the lack of read pipelining by default might
raise some potential questions about extra latency, in practice, the new
model will typically result in faster overall processing despite that.
Further, it permits much more efficient per-message targeted pipelining
if any major bottlenecks were to arise in new messages.

With all of that in mind, this commit starts the conversion by
introducing a separate mutex to protect the peers map and updates the
code to protect all references to the map with that mutex.

This will allow future commits to methodically refactor the various
operations without introducing races.
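
A minimal sketch of that pattern, using illustrative names rather than the actual dcrd types:

```go
package sketch

import "sync"

// Peer is a stand-in for the real peer type.
type Peer struct {
	ID int32
}

// SyncManager sketches a dedicated mutex that guards only the peers map so
// the map can be safely read and written from any peer goroutine.
type SyncManager struct {
	peersMtx sync.Mutex
	peers    map[int32]*Peer
}

func (m *SyncManager) addPeer(p *Peer) {
	m.peersMtx.Lock()
	m.peers[p.ID] = p
	m.peersMtx.Unlock()
}

func (m *SyncManager) removePeer(id int32) {
	m.peersMtx.Lock()
	delete(m.peers, id)
	m.peersMtx.Unlock()
}
```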
Currently, the logic that transitions from the initial header sync to
the initial chain sync is quite buried in a conditional in the header
handling routine, which makes it somewhat challenging to find and follow
for those not familiar with the code.  It also makes it more difficult
to update, particularly if more than a single sync peer for the initial
header sync is implemented, since it isn't concurrent safe and other
code around it relies on local state it updates.

This aims to improve the aforementioned by separating the handling for
when the manager determines the initial header sync process completes
into a separate method named onInitialHeaderSyncDone.

The main benefits are:

- It is much easier to find and follow the overall sequence of
  events that take place during the initial sync process
- It is more consistent with the existing onInitialChainSyncDone method
  that
  is invoked when the initial chain sync process completes
- It will allow future commits to more easily add additional handling
  and make it concurrent safe

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
A peer that serves data is not necessarily a sync peer candidate, but
serving data is a necessary condition for being a sync peer candidate.

This helps make that distinction clear by adding a new flag to the Peer
struct that specifies whether or not the peer serves data and updates a
few places in the code that only rely on serving data to use the new
flag instead of overloading the meaning of the sync candidate flag.

It also has the benefit of being concurrent safe, unlike the sync
candidate flag, since whether or not a peer serves data can't change
while sync candidacy can.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
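
For illustration only (field and method names assumed, not the real code), the immutable flag looks roughly like:

```go
package sketch

// Peer sketches the distinction: servesData is fixed when the peer is set up
// and never changes, so it needs no synchronization, while sync candidacy
// can still change over the life of the connection.
type Peer struct {
	// servesData is immutable after creation and therefore safe to read
	// from any goroutine without a lock.
	servesData bool
}

func (p *Peer) ServesData() bool { return p.servesData }
```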
@davecgh force-pushed the netsync_concurrent branch from a2bef0b to 7fadaf3 on July 2, 2025 10:09
@davecgh force-pushed the netsync_concurrent branch from 7fadaf3 to 84aec9f on July 2, 2025 10:47
@davecgh (Member, Author) commented Jul 2, 2025

Commit "netsync: Use atomic for num consec orphan hdrs." has a typo: it is missing "to" in the message.

@davecgh force-pushed the netsync_concurrent branch from 84aec9f to da352a1 on July 2, 2025 17:26
@davecgh (Member, Author) commented Jul 2, 2025

> Commit "netsync: Use atomic for num consec orphan hdrs." has a typo: it is missing "to" in the message.

Added.

davecgh added 12 commits July 3, 2025 00:13
This simplifies sync peer candidacy determination and makes it
concurrent safe by removing the sync candidate flag from the Peer struct
and instead basing candidacy on the new immutable serves data flag and
imposing the other necessary conditions when choosing the best sync
peer.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This modifies the sync height to use an atomic.Int64 instead of a mutex
to protect concurrent access and introduces a new method that atomically
and conditionally sets it to a higher value.

This approach makes it both harder to misuse and quite a bit faster due
to using non-blocking optimistic locking.

It also adds a test to ensure the new method works as expected including
under concurrent access.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
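
A rough sketch of such a conditional atomic update (the method name is assumed, not taken from the actual code):

```go
package sketch

import "sync/atomic"

type SyncManager struct {
	syncHeight atomic.Int64
}

// maybeUpdateSyncHeight atomically raises the sync height to newHeight when
// it is higher than the current value, using a non-blocking CAS retry loop.
func (m *SyncManager) maybeUpdateSyncHeight(newHeight int64) {
	for {
		cur := m.syncHeight.Load()
		if newHeight <= cur {
			return
		}
		if m.syncHeight.CompareAndSwap(cur, newHeight) {
			return
		}
		// Another goroutine changed the value first; reload and retry.
	}
}
```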
This modifies the isCurrent flag to use an atomic.Bool instead of a
mutex to protect concurrent access.

This approach makes it both harder to misuse and quite a bit faster due
to using non-blocking optimistic locking.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This changes the headersSynced flag to an atomic.Bool so it is safe to
access concurrently.

It also introduces a new method on the header sync state named
InitialHeaderSyncDone to simplify and improve the readability of the
various places that check its state.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes access to the sync peer independently concurrent safe by
protecting it with a mutex and updating all of the call sites
appropriately.

It also ensures the methods related to starting the various chain sync
steps from the sync peer are protected by the mutex since their overall
logic relies on its value to determine when they need to be invoked.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all accesses to the request maps independently concurrent
safe by protecting them with a mutex and updating all of the call sites
accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all logic related to requesting the next needed blocks from
the sync peer independently concurrent safe by protecting it with a
mutex and updating all of the call sites accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This modifies the stall timeout logic to make it concurrent safe by
slightly loosening the ordering requirements as it relates to stopping
and resetting the stall timeout.

The current implementation was careful to prevent stale time values from
being delivered on the timer's channel, however, there really is no
reason to worry about that possibility because:

1) The stall timeout logic only depends on the event as opposed to the
   time values
2) The timeout is sufficiently long that it is almost never hit in
   practice
3) It is even more rare that the timer would just happen to be in the
   process of firing (sending to the channel) when data simultaneously
   arrives and resets the timer just before it actually fires
4) Even if all of that were to happen, the only effect is the standard
   stall timeout handling logic which means there are no adverse effects

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
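
As an illustration of the loosened timer handling (names are assumptions, not the actual stall handler):

```go
package sketch

import "time"

// stallHandler resets the timer when data arrives without stopping and
// draining it first.  That is safe here because the logic only depends on
// the timeout event itself, and a rare stale fire merely triggers the
// normal, harmless stall handling once.
func stallHandler(dataArrived, quit <-chan struct{}, stallTimeout time.Duration, onStall func()) {
	stallTimer := time.NewTimer(stallTimeout)
	defer stallTimer.Stop()
	for {
		select {
		case <-dataArrived:
			stallTimer.Reset(stallTimeout)
		case <-stallTimer.C:
			onStall()
			stallTimer.Reset(stallTimeout)
		case <-quit:
			return
		}
	}
}
```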
This refactors the logic related to adding a new valid peer to the sync
manager out of the event handler since it is now independently
concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling valid peers disconnecting
from the sync manager out of the event handler since it is now
independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This changes the numConsecutiveOrphanHeaders counter to an atomic.Int64
so it is independently safe for concurrent access.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all logic related to tracking the best announced block by
each peer independently concurrent safe by protecting the related fields
with a mutex and updating all of the call sites accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
@davecgh (Member, Author) commented Jul 3, 2025

I should note that the sections around logging during the initial header sync and transitioning from the initial headers sync mode to the initial chain sync mode would need to have the same treatment that was done for the transition out of the initial chain sync mode applied to them to truly avoid any potential logic races if parallel header downloading is ever implemented.

I didn't spend any time updating that aspect since there is only a single header sync peer at the moment and thus the header sync is all running in that same peer's goroutine which prevents any potential logic races that could otherwise occur.

There are also no immediate plans to implement parallel header downloading, in part because doing so would require an entirely new and more capable wire message to be implemented first, but also because it's already such a tiny fraction of the overall time and headers don't grow at a very fast rate, so it will be a very long time before any bottleneck there would become an issue.

I am considering addressing it now even though it's not technically required yet since it might not be super obvious later. Thoughts?

@jrick (Member) commented Jul 3, 2025

seems reasonable to fix that here now as well

seems reasonable to fix any issue as you find it and can quickly knock it out

This splits the logic for starting the initial header sync and initial
chain sync processes into separate methods and also makes the warning
issued when there are no sync peers concurrent safe while preventing
multiple unnecessary warnings.

The primary motivation for this change is to make it possible to
independently make the state transitions between the modes concurrent
safe in upcoming future commits.

Looking even further ahead, it partially paves the way to eventually
removing the notion of a chain sync peer (though not the header sync
peer) entirely.

However, it also improves readability and has the nice minor benefit of
slightly improved efficiency when peers connect and disconnect since it
allows the logic that does not apply during the initial headers sync to
more easily be skipped without needing to check the flag multiple times.
@davecgh force-pushed the netsync_concurrent branch 2 times, most recently from baac638 to 3123792 on July 6, 2025 09:35
@davecgh (Member, Author) commented Jul 6, 2025

> I should note that the sections around logging during the initial header sync and transitioning from the initial headers sync mode to the initial chain sync mode would need to have the same treatment that was done for the transition out of the initial chain sync mode applied to them to truly avoid any potential logic races if parallel header downloading is ever implemented.

I updated this to add a couple of commits to address this. The header sync transition and related logging is now concurrent safe in case parallel header downloading is ever implemented during the headers sync process.

The updated code also gates the header sync progress logging behind the initial header sync mode flag since it is technically more accurate even though it does not really matter so long as there is only a single header sync peer.

The two commits are:

  • netsync: Split initial header and chain sync.
  • netsync: Make hdr sync transition concurrent safe.
2025-07-06 04:39:33.949 [INF] SYNC: Syncing headers to block height 989740 from peer 207.148.12.196:9108 (outbound)
...
2025-07-06 04:39:43.822 [INF] SYNC: Processed 238000 headers in the last 10.08s (progress 24.03%)
2025-07-06 04:39:58.562 [INF] SYNC: Processed 274000 headers in the last 10.49s (progress 51.70%)
2025-07-06 04:40:08.859 [INF] SYNC: Processed 264000 headers in the last 10.27s (progress 78.43%)
2025-07-06 04:40:19.028 [INF] SYNC: Processed 213741 headers in the last 10.04s (progress 100.00%)
2025-07-06 04:40:19.017 [INF] SYNC: Initial headers sync complete (best header hash 987a2ea277e26d62dcd7d5ac11e83d4a4ae816b3bc68e9d2c271edbeae67066e, height 989741)
2025-07-06 04:40:19.018 [INF] SYNC: Performing initial chain sync...
2025-07-06 04:40:29.018 [INF] SYNC: Processed 2941 blocks in the last 10.00s (10085 transactions, 25853 tickets, 0 votes, 0 revocations, height 2941, progress 0.30%)
...

davecgh added 13 commits July 7, 2025 16:18
The current approach to logging during the initial headers sync and
transitioning from the initial headers sync mode to the initial chain
sync mode relies on the fact that the entire sequence is mutually exclusive,
as is currently the case, but it would result in logic races if parallel
header downloading is ever implemented during the headers sync process.

There are also no immediate plans to implement parallel header
downloading, in part because doing so would require an entirely new and
more capable wire message to be implemented first, but also because it's
already such a tiny fraction of the overall sync time and headers don't
grow at a very fast rate, so it will be a very long time before any
bottleneck there would become an issue.

However, it is worthwhile to address it now even though it's not
technically required yet since it might not be super obvious later.

Consequently, this reworks the logic related to transitioning to the
initial headers sync done state slightly to be independently concurrent
safe by adding a mutex to the header sync state and ensuring the flag
that tracks its completion, which is also separately an atomic for
performance, is re-checked and updated under that mutex to ensure the
transition is idempotent.

It also gates the header sync progress logging behind the initial header
sync mode flag since it is technically more accurate even though it does
not really matter so long as there is only a single header sync peer.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
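
A minimal sketch of that idempotent transition pattern (the method name is assumed for illustration):

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

// headerSyncState sketches the approach: the atomic flag provides a cheap
// fast path while the mutex re-check guarantees only one goroutine ever
// performs the transition work.
type headerSyncState struct {
	mtx           sync.Mutex
	headersSynced atomic.Bool
}

func (s *headerSyncState) markInitialHeaderSyncDone(onTransition func()) {
	// Fast path once the transition has already happened.
	if s.headersSynced.Load() {
		return
	}

	s.mtx.Lock()
	defer s.mtx.Unlock()

	// Re-check under the mutex so concurrent callers can't both transition.
	if s.headersSynced.Load() {
		return
	}
	s.headersSynced.Store(true)

	// Log completion, kick off the initial chain sync, etc., exactly once.
	onTransition()
}
```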
This refactors the logic related to handling header messages from peers
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
The current approach to detecting when the initial chain sync process is
done is based on determining if the chain wasn't current prior to
processing the block and becomes current after.  That logic is sound so
long as the entire sequence is mutually exclusive, as is currently the
case, but it would result in logic races when processing multiple blocks
concurrently.

Consequently, this reworks the logic related to transitioning to the
initial chain sync done state slightly to be independently concurrent
safe by introducing a new flag and mutex to track its state and then
transitioning based on the flag and status of whether the chain believes
it is current.

It also squashes a very minor long-standing logging bug where the
transition out of the initial chain sync mode would sometimes either
delay logging the last batch of processed blocks for an extended period
or not log it at all.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling block messages from peers
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling inv messages from peers out
of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling notfound messages from
peers out of the event handler since it is now independently concurrent
safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling transaction messages from
peers out of the event handler since it is now independently concurrent
safe.

It also slightly modifies the behavior to return the slice of accepted
transactions to the caller so it can announce them as opposed to relying
on a callback to do the announcements.

The announcement callback is removed from the peer notifier interface
since it is no longer needed.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling mix messages from peers out
of the event handler since it is now independently concurrent safe.

It also slightly modifies the behavior to return the slice of accepted
messages to the caller so it can announce them as opposed to relying on
a callback to do the announcements.

The entire peer notifier interface and containing file is removed since
it is no longer needed.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to requesting data from peers out of
the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to processing blocks from other sources
than the network out of the event handler since it is now independently
concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to returning the current sync peer ID
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This removes the event handler message channel since it is no longer
used.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This renames the event handler goroutine to stallHandler to more
accurately reflect its only remaining purpose after all of the recent
changes to move everything else out of the event handler.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
@davecgh force-pushed the netsync_concurrent branch from 3123792 to 4819a6e on July 7, 2025 21:19