@elfedy elfedy commented Jan 15, 2026

  • Added a download endpoint for the trusted file transfer server that gets a file from storage and returns the (ChunkId, Chunk) pair.
  • Follower nodes now keep track of the files they need to get from the leader. Every block, they request those files from the leader via the trusted file transfer server.
  • MSP nodes now configure the endpoints they need to "advertise" in order to be reachable. In a leader/follower setup, the leader that acquires the advisory lock posts these endpoints so that other services (followers/backend) can reach them.
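
As a rough illustration of the follower-side bookkeeping described in the second bullet, here is a minimal sketch and not the code in this PR. `FileKey`, `Mutation` and `PendingFiles` are hypothetical names invented for the example. The idea is that the follower keeps a set of files it still needs, updates it as additions and removals arrive, and reads the set once per block to decide what to request from the leader.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a file identifier; the real key type is not
/// shown in this description.
pub type FileKey = u64;

/// Simplified mutation (assumption): only the two cases the follower
/// bookkeeping cares about.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mutation {
    Add(FileKey),
    Remove(FileKey),
}

/// Files the follower still has to fetch from the leader.
#[derive(Debug, Default)]
pub struct PendingFiles {
    pending: BTreeSet<FileKey>,
}

impl PendingFiles {
    /// Apply a mutation: an added file becomes pending, a removed file
    /// no longer needs to be fetched.
    pub fn apply(&mut self, mutation: Mutation) {
        match mutation {
            Mutation::Add(key) => {
                self.pending.insert(key);
            }
            Mutation::Remove(key) => {
                self.pending.remove(&key);
            }
        }
    }

    /// Record a completed download. Returns false if the file was no
    /// longer pending (for example, it was removed in the meantime).
    pub fn mark_downloaded(&mut self, key: FileKey) -> bool {
        self.pending.remove(&key)
    }

    /// Keys to request from the leader on this block, in sorted order.
    pub fn to_request(&self) -> Vec<FileKey> {
        self.pending.iter().copied().collect()
    }
}
```

With this shape, a file that is added, removed and added again ends up pending exactly once, and a file that is added and then removed is never requested.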

TODOS

  • This is a skeleton of the implementation. It has not been tested yet. Tests need to be written with a leader making changes and a follower tracking them correctly, including some edge cases (leader deletes files; leader adds a file, deletes it, then adds it back).
  • Evaluate wrapping the acquisition of the advisory lock and the posting of leader info inside a transaction; this guarantees that the leader info is there. For a stronger guarantee, you can post the pid of the session holding the lock using pg_backend_pid() and verify that it is the pid holding the lock (via a query to the pg_locks view).
  • There is some logic to fall back to the local rpc/trusted_file_transfer URLs when no advertised endpoints are set. That logic needs to be improved. It may be worth evaluating whether it makes sense at all; I added it mostly so that test setups don't need to pass the advertised URLs when nodes run on the same host.
  • There are some race conditions that have not been handled yet (and there are probably more that I did not detect):
  1. Analyze how to handle the case of a finalized TrieMutation::Remove where the file has not been downloaded yet. Not handling this might result in the follower keeping a file it should have deleted (if the follower requests the file before it was deleted and receives it after the finality event was processed).
  2. Right now, in ProcessFollowerDownloads we take a "snapshot" of the files to download and iterate over it. This is not ideal because it does not account for the list changing while the files are being downloaded (for example, as in 1, you can be downloading a file that was later deleted). The only solution I could think of is a more atomic state machine that keeps in state exactly which file is being downloaded and handles the case where TrieMutation::Remove arrives for that file. There would also be no snapshot: you would poll the remaining list one file at a time and take a sort of "download_lock" on a single file.
  • Naming for the task methods needs to be improved.
  • Make backend aware of the leader endpoints (maybe for another PR).
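
The "download_lock" idea from race condition 2 could look roughly like the sketch below. It is a sketch under stated assumptions, not a proposal for the final design: `Downloader`, `InFlight`, `Outcome` and the `u64` file key are all hypothetical names. It keeps at most one file in flight, records a removal that arrives while that file is downloading, and discards the result on completion instead of storing it. A file that is re-added while its cancelled download is still in flight is simply queued again.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a file identifier.
pub type FileKey = u64;

/// What the follower is doing right now (at most one download at a time).
#[derive(Debug, PartialEq, Eq)]
pub enum InFlight {
    Idle,
    Downloading { key: FileKey, cancelled: bool },
}

/// What to do with a download once it completes.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Store(FileKey),
    Discard(FileKey),
}

pub struct Downloader {
    queue: VecDeque<FileKey>,
    state: InFlight,
}

impl Downloader {
    pub fn new() -> Self {
        Downloader { queue: VecDeque::new(), state: InFlight::Idle }
    }

    /// A file was added: queue it unless it is already queued or is being
    /// downloaded without having been cancelled.
    pub fn on_add(&mut self, key: FileKey) {
        let in_flight_live = matches!(
            self.state,
            InFlight::Downloading { key: k, cancelled: false } if k == key
        );
        if !in_flight_live && !self.queue.contains(&key) {
            self.queue.push_back(key);
        }
    }

    /// A removal arrived: drop the file from the queue and, if it is the
    /// one in flight, mark the download as cancelled.
    pub fn on_remove(&mut self, key: FileKey) {
        self.queue.retain(|k| *k != key);
        if let InFlight::Downloading { key: k, cancelled } = &mut self.state {
            if *k == key {
                *cancelled = true;
            }
        }
    }

    /// Take the "download lock" on the next file, if nothing is in flight.
    pub fn start_next(&mut self) -> Option<FileKey> {
        if self.state != InFlight::Idle {
            return None;
        }
        let key = self.queue.pop_front()?;
        self.state = InFlight::Downloading { key, cancelled: false };
        Some(key)
    }

    /// The in-flight download completed: decide whether to keep the data.
    pub fn finish(&mut self) -> Option<Outcome> {
        match std::mem::replace(&mut self.state, InFlight::Idle) {
            InFlight::Idle => None,
            InFlight::Downloading { key, cancelled: true } => Some(Outcome::Discard(key)),
            InFlight::Downloading { key, cancelled: false } => Some(Outcome::Store(key)),
        }
    }
}
```

Because the queue is consulted one file at a time rather than snapshotted, a removal that arrives mid-download affects both the queue and the in-flight file, which is the situation described in race condition 1.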
