netsync: Convert to synchronous model. #3496


Open
davecgh wants to merge 29 commits into master from netsync_concurrent

Conversation

@davecgh (Member) commented Jul 2, 2025

This converts all of the code related to the sync manager handling events from remote peers to synchronous code that makes use of separate mutexes and atomics to protect concurrent access. The change also includes some other miscellaneous improvements related to readability, performance, and maintainability.

It also squashes a very minor long-standing logging bug where the transition out of the initial chain sync mode would sometimes either delay logging the last batch of processed blocks for an extended period or not log it at all.

The motivation for the change in architecture follows.

Currently, all operations performed by the sync manager are implemented via a single event handler goroutine and a single message channel to protect concurrent access.

The existing implementation has worked well for many years since syncing currently only happens via a single sync peer.

However, with the goal of moving to a syncing paradigm where there is no sync peer and data is intelligently downloaded from multiple peers in parallel, the current implementation is not ideal in a few ways:

  • It is difficult to improve parallel processing since everything is forced into a single goroutine
  • It requires additional plumbing to provide access to any related information
  • The use of channels significantly hinders dynamically modifying requests based on state, which is necessary for efficiently syncing from multiple peers

Converting all the relevant code to synchronous code addresses the aforementioned concerns and has some additional benefits.

For example, the new model:

  • Ultimately allows more code to run in parallel in the individual peer goroutines
  • Requires less plumbing for handling events
  • Makes the state available to calling code so it can make better decisions
  • Blocks further reads until the messages are processed by default
  • Provides better overall throughput

One notable aspect of the overall change to the architecture is that the peer read goroutines themselves will now block while processing the messages by default as opposed to the current model where the messages are queued up to be handled asynchronously which then allows the peers to immediately read more data.

The existing async queuing approach requires care to prevent malicious behavior, as evidenced by several of the handlers (for example, blocks, transactions, and mix messages) currently implementing blocking behavior manually by using channels that are notified once the events are eventually processed by the async handler.

Moreover, since all of the events from all peers currently go through the single sync manager goroutine, the aforementioned messages often end up blocking not only until the message itself is processed, but also until all of the other messages from all of the other peers that arrived first are processed.

The existing model also results in further induced delays because most of the messages result in requests to the peer in response, which obviously can't be sent until the message is processed.

So, while at first glance the lack of read pipelining by default might raise some potential questions about extra latency, in practice, the new model will typically result in faster overall processing despite that. Further, it permits much more efficient per-message targeted pipelining if any major bottlenecks were to arise in new messages.
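
To make the contrast concrete, here is a minimal sketch (hypothetical names, not the actual netsync types) of the queued event-handler model versus direct synchronous calls guarded by a mutex:

```go
package sketch

import "sync"

// asyncManager shows the old shape: peer read goroutines queue events on a
// channel and a single event handler goroutine processes them serially.
type asyncManager struct {
	msgChan chan interface{}
}

func (m *asyncManager) eventHandler() {
	for msg := range m.msgChan {
		_ = msg // every peer's events funnel through this one goroutine
	}
}

// syncManager shows the new shape: peer read goroutines call directly into
// the manager and shared state is guarded by finer-grained mutexes, so each
// caller only blocks on its own message instead of waiting behind every
// other peer's queued messages.
type syncManager struct {
	mtx   sync.Mutex
	peers map[int32]struct{}
}

func (m *syncManager) OnSomeMessage(peerID int32) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
	_ = m.peers[peerID] // handle the message in the calling peer goroutine
}
```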

This consists of a series of commits to help ease the review process. Each commit is intended to be a self-contained, logically easy-to-follow change such that the code continues to compile and work properly at each step.

See the description of each commit for further details.

This is work towards #1145.

@davecgh davecgh added this to the 2.1.0 milestone Jul 2, 2025
@davecgh (Member, Author) commented Jul 2, 2025

I should also note that I have been running this for several days with the race detector to ensure there are no reported races and everything has been working great.

davecgh added 3 commits July 2, 2025 05:07
This is the first in a series of commits that will ultimately convert
all of the code related to the sync manager handling events from remote
peers to synchronous code that makes use of separate mutexes and atomics
to protect concurrent access.

The motivation for the change in architecture follows.

Currently, all operations performed by the sync manager are implemented
via a single event handler goroutine and a single message channel to
protect concurrent access.

The existing implementation has worked well for many years since syncing
currently only happens via a single sync peer.

However, with the goal of moving to a syncing paradigm where there is no
sync peer and data is intelligently downloaded from multiple peers in
parallel, the current implementation is not ideal in a few ways:

- It is difficult to improve parallel processing since everything is
  forced into a single goroutine
- It requires additional plumbing to provide access to any related
  information
- The use of channels significantly hinders dynamically modifying
  requests based on state, which is necessary for efficiently syncing
  from multiple peers

Converting all the relevant code to synchronous code addresses the
aforementioned concerns and has some additional benefits.

For example, the new model:

- Ultimately allows more code to run in parallel in the individual peer
  goroutines
- Requires less plumbing for handling events
- Makes the state available to calling code so it can make better
  decisions
- Blocks further reads until the messages are processed by default
- Provides better overall throughput

One notable aspect of the overall change to the architecture is that the
peer read goroutines themselves will now block while processing the
messages by default as opposed to the current model where the messages
are queued up to be handled asynchronously which then allows the peers
to immediately read more data.

The existing async queuing approach requires care to prevent malicious
behavior, as evidenced by several of the handlers (for example, blocks,
transactions, and mix messages) currently implementing blocking behavior
manually by using channels that are notified once the events are
eventually processed by the async handler.

Moreover, since all of the events from all peers currently go through
the single sync manager goroutine, the aforementioned messages often end
up blocking not only until the message itself is processed, but also
until all of the other messages from all of the other peers that arrived
first are processed.

The existing model also results in further induced delays because most
of the messages result in requests to the peer in response, which
obviously can't be sent until the message is processed.

So, while at first glance the lack of read pipelining by default might
raise some potential questions about extra latency, in practice, the new
model will typically result in faster overall processing despite that.
Further, it permits much more efficient per-message targeted pipelining
if any major bottlenecks were to arise in new messages.

With all of that in mind, this commit starts the conversion by
introducing a separate mutex to protect the peers map and updates the
code to protect all references to the map with that mutex.

This will allow future commits to methodically refactor the various
operations without introducing races.
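
A minimal sketch of that pattern, using illustrative names rather than the actual dcrd types:

```go
package sketch

import "sync"

// Peer is a stand-in for the real peer type.
type Peer struct {
	ID int32
}

// SyncManager sketches a dedicated mutex that guards only the peers map so
// the map can be safely read and written from any peer goroutine.
type SyncManager struct {
	peersMtx sync.Mutex
	peers    map[int32]*Peer
}

func (m *SyncManager) addPeer(p *Peer) {
	m.peersMtx.Lock()
	m.peers[p.ID] = p
	m.peersMtx.Unlock()
}

func (m *SyncManager) removePeer(id int32) {
	m.peersMtx.Lock()
	delete(m.peers, id)
	m.peersMtx.Unlock()
}
```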
Currently, the logic that transitions from the initial header sync to
the initial chain sync is quite buried in a conditional in the header
handling routine, which makes it somewhat challenging to find and follow
for those not familiar with the code.  It also makes it more difficult
to update, particularly if more than a single sync peer for the initial
header sync is implemented, since it isn't concurrent safe and other
code around it relies on local state it updates.

This aims to improve the aforementioned by separating the handling for
when the manager determines the initial header sync process completes
into a separate method named onInitialHeaderSyncDone.

The main benefits are:

- It is much easier to find and follow the overall sequence of
  events that take place during the initial sync process
- It is more consistent with the existing onInitialChainSyncDone method
  that
  is invoked when the initial chain sync process completes
- It will allow future commits to more easily add additional handling
  and make it concurrent safe

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
A peer that serves data is not necessarily a sync peer candidate, but
serving data is a necessary condition for being a sync peer candidate.

This helps make that distinction clear by adding a new flag to the Peer
struct that specifies whether or not the peer serves data and updates a
few places in the code that only rely on serving data to use the new
flag instead of overloading the meaning of the sync candidate flag.

It also has the benefit of being concurrent safe, unlike the sync
candidate flag, since whether or not a peer serves data can't change
while sync candidacy can.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
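
For illustration only (field and method names assumed, not the real code), the immutable flag looks roughly like:

```go
package sketch

// Peer sketches the distinction: servesData is fixed when the peer is set up
// and never changes, so it needs no synchronization, while sync candidacy
// can still change over the life of the connection.
type Peer struct {
	// servesData is immutable after creation and therefore safe to read
	// from any goroutine without a lock.
	servesData bool
}

func (p *Peer) ServesData() bool { return p.servesData }
```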
@davecgh force-pushed the netsync_concurrent branch from a2bef0b to 7fadaf3 on July 2, 2025 10:09
@davecgh force-pushed the netsync_concurrent branch from 7fadaf3 to 84aec9f on July 2, 2025 10:47
@davecgh (Member, Author) commented Jul 2, 2025

Commit "netsync: Use atomic for num consec orphan hdrs." has a typo: it is missing "to" in the message.

@davecgh force-pushed the netsync_concurrent branch from 84aec9f to da352a1 on July 2, 2025 17:26
@davecgh (Member, Author) commented Jul 2, 2025

> Commit "netsync: Use atomic for num consec orphan hdrs." has a typo: it is missing "to" in the message.

Added.

davecgh added 12 commits July 3, 2025 00:13
This simplifies sync peer candidacy determination and makes it
concurrent safe by removing the sync candidate flag from the Peer struct
and instead basing candidacy on the new immutable serves data flag and
imposing the other necessary conditions when choosing the best sync
peer.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This modifies the sync height to use an atomic.Int64 instead of a mutex
to protect concurrent access and introduces a new method that atomically
and conditionally sets it to a higher value.

This approach makes it both harder to misuse and quite a bit faster due
to using non-blocking optimistic locking.

It also adds a test to ensure the new method works as expected including
under concurrent access.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
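
A rough sketch of such a conditional atomic update (the method name is assumed, not taken from the actual code):

```go
package sketch

import "sync/atomic"

type SyncManager struct {
	syncHeight atomic.Int64
}

// maybeUpdateSyncHeight atomically raises the sync height to newHeight when
// it is higher than the current value, using a non-blocking CAS retry loop.
func (m *SyncManager) maybeUpdateSyncHeight(newHeight int64) {
	for {
		cur := m.syncHeight.Load()
		if newHeight <= cur {
			return
		}
		if m.syncHeight.CompareAndSwap(cur, newHeight) {
			return
		}
		// Another goroutine changed the value first; reload and retry.
	}
}
```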
This modifies the isCurrent flag to use an atomic.Bool instead of a
mutex to protect concurrent access.

This approach makes it both harder to misuse and quite a bit faster due
to using non-blocking optimistic locking.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This changes the headersSynced flag to an atomic.Bool so it is safe to
access concurrently.

It also introduces a new method on the header sync state named
InitialHeaderSyncDone to simplify and improve the readability of the
various places that check its state.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes access to the sync peer independently concurrent safe by
protecting it with a mutex and updating all of the call sites
appropriately.

It also ensures the methods related to starting the various chain sync
steps from the sync peer are protected by the mutex since their overall
logic relies on its value to determine when they need to be invoked.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all accesses to the request maps independently concurrent
safe by protecting them with a mutex and updating all of the call sites
accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all logic related to requesting the next needed blocks from
the sync peer independently concurrent safe by protecting it with a
mutex and updating all of the call sites accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This modifies the stall timeout logic to make it concurrent safe by
slightly loosening the ordering requirements as it relates to stopping
and resetting the stall timeout.

The current implementation was careful to prevent stale time values from
being delivered on the timer's channel, however, there really is no
reason to worry about that possibility because:

1) The stall timeout logic only depends on the event as opposed to the
   time values
2) The timeout is sufficiently long that it is almost never hit in
   practice
3) It is even more rare that the timer would just happen to be in the
   process of firing (sending to the channel) when data simultaneously
   arrives and resets the timer just before it actually fires
4) Even if all of that were to happen, the only effect is the standard
   stall timeout handling logic which means there are no adverse effects

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
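
As an illustration of the loosened timer handling (names are assumptions, not the actual stall handler):

```go
package sketch

import "time"

// stallHandler resets the timer when data arrives without stopping and
// draining it first.  That is safe here because the logic only depends on
// the timeout event itself, and a rare stale fire merely triggers the
// normal, harmless stall handling once.
func stallHandler(dataArrived, quit <-chan struct{}, stallTimeout time.Duration, onStall func()) {
	stallTimer := time.NewTimer(stallTimeout)
	defer stallTimer.Stop()
	for {
		select {
		case <-dataArrived:
			stallTimer.Reset(stallTimeout)
		case <-stallTimer.C:
			onStall()
			stallTimer.Reset(stallTimeout)
		case <-quit:
			return
		}
	}
}
```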
This refactors the logic related to adding a new valid peer to the sync
manager out of the event handler since it is now independently
concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling valid peers disconnecting
from the sync manager out of the event handler since it is now
independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This changes the numConsecutiveOrphanHeaders counter to an atomic.Int64
so it is independently safe for concurrent access.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This makes all logic related to tracking the best announced block by
each peer independently concurrent safe by protecting the related fields
with a mutex and updating all of the call sites accordingly.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
@davecgh (Member, Author) commented Jul 3, 2025

I should note that the sections around logging during the initial header sync and transitioning from the initial headers sync mode to the initial chain sync mode would need to have the same treatment that was done for the transition out of the initial chain sync mode applied to them to truly avoid any potential logic races if parallel header downloading is ever implemented.

I didn't spend any time updating that aspect since there is only a single header sync peer at the moment and thus the header sync is all running in that same peer's goroutine which prevents any potential logic races that could otherwise occur.

There are also no immediate plans to implement parallel header downloading, in part because doing so would require an entirely new and more capable wire message to be implemented first, but also because it's already such a tiny fraction of the overall time and headers don't grow at a very fast rate, so it will be a very long time before any bottleneck there would become an issue.

I am considering addressing it now even though it's not technically required yet since it might not be super obvious later. Thoughts?

@jrick (Member) commented Jul 3, 2025

seems reasonable to fix that here now as well

seems reasonable to fix any issue as you find it and can quickly knock it out

This splits the logic for starting the initial header sync and initial
chain sync processes into separate methods and also makes the warning
issued when there are no sync peers concurrent safe while preventing
multiple unnecessary warnings.

The primary motivation for this change is to make it possible to
independently make the state transitions between the modes concurrent
safe in upcoming future commits.

Looking even further ahead, it partially paves the way to eventually
removing the notion of a chain sync peer (though not the header sync
peer) entirely.

However, it also improves readability and has the nice minor benefit of
slightly improved efficiency when peers connect and disconnect since it
allows the logic that does not apply during the initial headers sync to
more easily be skipped without needing to check the flag multiple times.
@davecgh force-pushed the netsync_concurrent branch 2 times, most recently from baac638 to 3123792 on July 6, 2025 09:35
@davecgh (Member, Author) commented Jul 6, 2025

> I should note that the sections around logging during the initial header sync and transitioning from the initial headers sync mode to the initial chain sync mode would need to have the same treatment that was done for the transition out of the initial chain sync mode applied to them to truly avoid any potential logic races if parallel header downloading is ever implemented.

I updated this to add a couple of commits to address this. The header sync transition and related logging is now concurrent safe in case parallel header downloading is ever implemented during the headers sync process.

The updated code also gates the header sync progress logging behind the initial header sync mode flag since it is technically more accurate even though it does not really matter so long as there is only a single header sync peer.

The two commits are:

  • netsync: Split initial header and chain sync.
  • netsync: Make hdr sync transition concurrent safe.
2025-07-06 04:39:33.949 [INF] SYNC: Syncing headers to block height 989740 from peer 207.148.12.196:9108 (outbound)
...
2025-07-06 04:39:43.822 [INF] SYNC: Processed 238000 headers in the last 10.08s (progress 24.03%)
2025-07-06 04:39:58.562 [INF] SYNC: Processed 274000 headers in the last 10.49s (progress 51.70%)
2025-07-06 04:40:08.859 [INF] SYNC: Processed 264000 headers in the last 10.27s (progress 78.43%)
2025-07-06 04:40:19.028 [INF] SYNC: Processed 213741 headers in the last 10.04s (progress 100.00%)
2025-07-06 04:40:19.017 [INF] SYNC: Initial headers sync complete (best header hash 987a2ea277e26d62dcd7d5ac11e83d4a4ae816b3bc68e9d2c271edbeae67066e, height 989741)
2025-07-06 04:40:19.018 [INF] SYNC: Performing initial chain sync...
2025-07-06 04:40:29.018 [INF] SYNC: Processed 2941 blocks in the last 10.00s (10085 transactions, 25853 tickets, 0 votes, 0 revocations, height 2941, progress 0.30%)
...

davecgh added 13 commits July 7, 2025 16:18
The current approach to logging during the initial headers sync and
transitioning from the initial headers sync mode to the initial chain
sync mode relies on the fact that the entire sequence is mutually exclusive,
as is currently the case, but it would result in logic races if parallel
header downloading is ever implemented during the headers sync process.

There are also no immediate plans to implement parallel header
downloading, in part because doing so would require an entirely new and
more capable wire message to be implemented first, but also because it's
already such a tiny fraction of the overall sync time and headers don't
grow at a very fast rate, so it will be a very long time before any
bottleneck there would become an issue.

However, it is worthwhile to address it now even though it's not
technically required yet since it might not be super obvious later.

Consequently, this reworks the logic related to transitioning to the
initial headers sync done state slightly to be independently concurrent
safe by adding a mutex to the header sync state and ensuring the flag
that tracks its completion, which is also separately an atomic for
performance, is re-checked and updated under that mutex to ensure the
transition is idempotent.

It also gates the header sync progress logging behind the initial header
sync mode flag since it is technically more accurate even though it does
not really matter so long as there is only a single header sync peer.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
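
A minimal sketch of that idempotent transition pattern (the method name is assumed for illustration):

```go
package sketch

import (
	"sync"
	"sync/atomic"
)

// headerSyncState sketches the approach: the atomic flag provides a cheap
// fast path while the mutex re-check guarantees only one goroutine ever
// performs the transition work.
type headerSyncState struct {
	mtx           sync.Mutex
	headersSynced atomic.Bool
}

func (s *headerSyncState) markInitialHeaderSyncDone(onTransition func()) {
	// Fast path once the transition has already happened.
	if s.headersSynced.Load() {
		return
	}

	s.mtx.Lock()
	defer s.mtx.Unlock()

	// Re-check under the mutex so concurrent callers can't both transition.
	if s.headersSynced.Load() {
		return
	}
	s.headersSynced.Store(true)

	// Log completion, kick off the initial chain sync, etc., exactly once.
	onTransition()
}
```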
This refactors the logic related to handling header messages from peers
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
The current approach to detecting when the initial chain sync process is
done is based on determining if the chain wasn't current prior to
processing the block and becomes current after.  That logic is sound so
long as the entire sequence is mutually exclusive, as is currently the
case, but it would result in logic races when processing multiple blocks
concurrently.

Consequently, this reworks the logic related to transitioning to the
initial chain sync done state slightly to be independently concurrent
safe by introducing a new flag and mutex to track its state and then
transitioning based on the flag and status of whether the chain believes
it is current.

It also squashes a very minor long-standing logging bug where the
transition out of the initial chain sync mode would sometimes either
delay logging the last batch of processed blocks for an extended period
or not log it at all.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling block messages from peers
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling inv messages from peers out
of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling notfound messages from
peers out of the event handler since it is now independently concurrent
safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling transaction messages from
peers out of the event handler since it is now independently concurrent
safe.

It also slightly modifies the behavior to return the slice of accepted
transactions to the caller so it can announce them as opposed to relying
on a callback to do the announcements.

The announcement callback is removed from the peer notifier interface
since it is no longer needed.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to handling mix messages from peers out
of the event handler since it is now independently concurrent safe.

It also slightly modifies the behavior to return the slice of accepted
messages to the caller so it can announce them as opposed to relying on
a callback to do the announcements.

The entire peer notifier interface and containing file is removed since
it is no longer needed.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to requesting data from peers out of
the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to processing blocks from other sources
than the network out of the event handler since it is now independently
concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This refactors the logic related to returning the current sync peer ID
out of the event handler since it is now independently concurrent safe.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This removes the event handler message channel since it is no longer
used.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
This renames the event handler goroutine to stallHandler to more
accurately reflect its only remaining purpose after all of the recent
changes to move everything else out of the event handler.

This is a part of the overall effort to convert the code related to
handling the various sync manager events to synchronous code that runs
in the various caller goroutines.
@davecgh force-pushed the netsync_concurrent branch from 3123792 to 4819a6e on July 7, 2025 21:19