Some new backfill package noodling for efficient network crawling #1111
Just playing around for now with a more efficient package for crawling the whole network and a little tool to dump the network to a JSONL file.
The basic premise of this crawling strategy is to initialize a crawler per PDS and walk their `listRepos` responses concurrently, enqueueing jobs. Each crawler can then apply its own per-PDS concurrency limits on `getRepo`, letting you horizontally scale the network crawl without putting outsized load on any one node in the network. Ideally, with enough compute and bandwidth, you should be able to crawl the whole network in under 16 hours while maxing out at 10 `getRepo` calls per second per PDS.
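
For illustration, here is a minimal, self-contained sketch of that strategy, not the actual backfill package API: the helper names (`crawlPDS`, `fetchRepos`) and hostnames are hypothetical, while the XRPC endpoints (`com.atproto.sync.listRepos`, `com.atproto.sync.getRepo`) are the standard AT Protocol ones. One goroutine per PDS pages through `listRepos` and enqueues repo DIDs, and a per-PDS rate limiter caps `getRepo` fetches at ~10/sec.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"sync"

	"golang.org/x/time/rate"
)

// listReposPage mirrors the com.atproto.sync.listRepos response shape.
type listReposPage struct {
	Cursor string `json:"cursor"`
	Repos  []struct {
		Did string `json:"did"`
	} `json:"repos"`
}

// crawlPDS pages through listRepos on a single PDS and enqueues each repo DID.
// The caller runs one of these per PDS, so hosts are walked concurrently.
func crawlPDS(ctx context.Context, host string, jobs chan<- string) error {
	cursor := ""
	for {
		u := fmt.Sprintf("https://%s/xrpc/com.atproto.sync.listRepos?limit=1000&cursor=%s",
			host, url.QueryEscape(cursor))
		resp, err := http.Get(u)
		if err != nil {
			return err
		}
		var page listReposPage
		err = json.NewDecoder(resp.Body).Decode(&page)
		resp.Body.Close()
		if err != nil {
			return err
		}
		for _, r := range page.Repos {
			select {
			case jobs <- r.Did:
			case <-ctx.Done():
				return ctx.Err()
			}
		}
		if page.Cursor == "" {
			return nil // walked every repo on this PDS
		}
		cursor = page.Cursor
	}
}

// fetchRepos drains jobs for one PDS, throttled to perPDSRate getRepo calls
// per second so no single host sees outsized load.
func fetchRepos(ctx context.Context, host string, jobs <-chan string, perPDSRate float64) {
	lim := rate.NewLimiter(rate.Limit(perPDSRate), 1)
	for did := range jobs {
		if err := lim.Wait(ctx); err != nil {
			return
		}
		u := fmt.Sprintf("https://%s/xrpc/com.atproto.sync.getRepo?did=%s", host, url.QueryEscape(did))
		resp, err := http.Get(u)
		if err != nil {
			continue // a real crawler would retry and record the failure
		}
		resp.Body.Close() // CAR bytes would be processed / dumped here
	}
}

func main() {
	hosts := []string{"pds1.example.com", "pds2.example.com"} // discovered PDS hosts
	ctx := context.Background()
	var wg sync.WaitGroup
	for _, host := range hosts {
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			jobs := make(chan string, 1000)
			go func() {
				defer close(jobs)
				_ = crawlPDS(ctx, h, jobs)
			}()
			fetchRepos(ctx, h, jobs, 10) // 10 getRepo/sec per PDS
		}(host)
	}
	wg.Wait()
}
```

Since each PDS gets its own queue and limiter, adding more hosts only adds more independent crawl loops, which is what makes the crawl horizontally scalable without hammering any single node.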