Some new backfill package noodling for efficient network crawling #1111

Draft
wants to merge 8 commits into main

Conversation

ericvolp12
Contributor

@ericvolp12 ericvolp12 commented Jun 29, 2025

Just playing around for now with a more efficient package for crawling the whole network and a little tool to dump the network to a JSONL file.
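For the dump side, the idea is just one JSON object per line as repos come back. A minimal sketch of that writer, with a made-up repoRecord shape (the real tool's schema may differ):

```go
// Minimal JSONL writer sketch. The repoRecord fields are illustrative,
// not necessarily the schema the dump tool actually emits.
package main

import (
	"bufio"
	"encoding/json"
	"os"
)

type repoRecord struct {
	DID string `json:"did"`
	PDS string `json:"pds"`
	Rev string `json:"rev,omitempty"`
}

func writeJSONL(path string, records <-chan repoRecord) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	defer w.Flush()

	enc := json.NewEncoder(w) // Encode appends a newline after each value
	for rec := range records {
		if err := enc.Encode(&rec); err != nil {
			return err
		}
	}
	return nil
}
```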

The basic premise of this crawling strategy is to initialize one crawler per PDS and walk each PDS's listRepos responses concurrently, enqueueing jobs. Each crawler then gets its own per-PDS concurrency limit for getRepo, letting you horizontally scale your network crawling without putting outsized load on any one node in the network.
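Roughly the shape of a single per-PDS crawler (sketch only; listReposPage is a placeholder for paging com.atproto.sync.listRepos, and the handle callback is where the getRepo fetch and persistence would go):

```go
// Sketch of one per-PDS crawler: a goroutine pages through the PDS's repo
// listing and enqueues DIDs, while a bounded worker pool drains the queue.
// The worker count is the per-PDS concurrency limit for getRepo.
package crawler

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// listReposPage is a placeholder for a com.atproto.sync.listRepos call:
// it returns one page of DIDs plus the next cursor ("" when exhausted).
var listReposPage func(ctx context.Context, host, cursor string) (dids []string, next string, err error)

type repoJob struct {
	PDSHost string
	DID     string
}

// crawlPDS walks one PDS's listRepos pages and fans the DIDs out to at most
// `workers` concurrent handlers.
func crawlPDS(ctx context.Context, host string, workers int, handle func(context.Context, repoJob) error) error {
	jobs := make(chan repoJob, 100)
	eg, ctx := errgroup.WithContext(ctx)

	// Producer: walk the paginated listRepos responses for this PDS.
	eg.Go(func() error {
		defer close(jobs)
		cursor := ""
		for {
			dids, next, err := listReposPage(ctx, host, cursor)
			if err != nil {
				return err
			}
			for _, did := range dids {
				select {
				case jobs <- repoJob{PDSHost: host, DID: did}:
				case <-ctx.Done():
					return ctx.Err()
				}
			}
			if next == "" {
				return nil // no more pages
			}
			cursor = next
		}
	})

	// Consumers: per-PDS getRepo concurrency is capped by the worker count.
	for i := 0; i < workers; i++ {
		eg.Go(func() error {
			for job := range jobs {
				if err := handle(ctx, job); err != nil {
					return err
				}
			}
			return nil
		})
	}

	return eg.Wait()
}
```

Launching one crawlPDS per host is what gives the horizontal scaling: total throughput grows with the number of PDSes, while any single host only ever sees `workers` requests in flight.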

Ideally you should be able to crawl the whole network in under 16 hours, given enough compute and bandwidth, with each crawler maxing out at 10 getRepo calls per second per PDS.
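For the 10 getRepos/sec cap, I'm picturing a per-host token bucket (golang.org/x/time/rate) shared by all of a PDS's workers, something like:

```go
// Per-host rate limiter sketch: caps getRepo calls at ~10/s for each PDS.
// The numbers just mirror the estimate above; real values would be config.
package crawler

import (
	"context"
	"sync"

	"golang.org/x/time/rate"
)

type hostLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit
	burst    int
}

func newHostLimiters(perSec float64, burst int) *hostLimiters {
	return &hostLimiters{
		limiters: make(map[string]*rate.Limiter),
		perSec:   rate.Limit(perSec),
		burst:    burst,
	}
}

// wait blocks until another getRepo may be issued against host.
func (h *hostLimiters) wait(ctx context.Context, host string) error {
	h.mu.Lock()
	lim, ok := h.limiters[host]
	if !ok {
		lim = rate.NewLimiter(h.perSec, h.burst)
		h.limiters[host] = lim
	}
	h.mu.Unlock()
	return lim.Wait(ctx)
}
```

Each worker would call wait(ctx, job.PDSHost) right before its getRepo, with something like limiters := newHostLimiters(10, 1) at startup.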
