Some new backfill package noodling for efficient network crawling #1111

Draft
wants to merge 8 commits into main

Conversation

ericvolp12
Contributor

@ericvolp12 ericvolp12 commented Jun 29, 2025

Just playing around for now with a more efficient package for crawling the whole network and a little tool to dump the network to a JSONL file.
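For the dump side, the idea is just one JSON object per line as repos come back. A minimal sketch of that writer, with a made-up repoRecord shape (the real tool's schema may differ):

```go
// Minimal JSONL writer sketch. The repoRecord fields are illustrative,
// not necessarily the schema the dump tool actually emits.
package main

import (
	"bufio"
	"encoding/json"
	"os"
)

type repoRecord struct {
	DID string `json:"did"`
	PDS string `json:"pds"`
	Rev string `json:"rev,omitempty"`
}

func writeJSONL(path string, records <-chan repoRecord) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	w := bufio.NewWriter(f)
	defer w.Flush()

	enc := json.NewEncoder(w) // Encode appends a newline after each value
	for rec := range records {
		if err := enc.Encode(&rec); err != nil {
			return err
		}
	}
	return nil
}
```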

The basic premise of this crawling strategy is to initialize one crawler per PDS and walk each PDS's listRepos responses concurrently, enqueueing jobs. Each crawler then gets its own per-PDS concurrency limit for getRepo, letting you horizontally scale your network crawling without putting outsized load on any one node in the network.
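Roughly the shape of a single per-PDS crawler (sketch only; listReposPage is a placeholder for paging com.atproto.sync.listRepos, and the handle callback is where the getRepo fetch and persistence would go):

```go
// Sketch of one per-PDS crawler: a goroutine pages through the PDS's repo
// listing and enqueues DIDs, while a bounded worker pool drains the queue.
// The worker count is the per-PDS concurrency limit for getRepo.
package crawler

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// listReposPage is a placeholder for a com.atproto.sync.listRepos call:
// it returns one page of DIDs plus the next cursor ("" when exhausted).
var listReposPage func(ctx context.Context, host, cursor string) (dids []string, next string, err error)

type repoJob struct {
	PDSHost string
	DID     string
}

// crawlPDS walks one PDS's listRepos pages and fans the DIDs out to at most
// `workers` concurrent handlers.
func crawlPDS(ctx context.Context, host string, workers int, handle func(context.Context, repoJob) error) error {
	jobs := make(chan repoJob, 100)
	eg, ctx := errgroup.WithContext(ctx)

	// Producer: walk the paginated listRepos responses for this PDS.
	eg.Go(func() error {
		defer close(jobs)
		cursor := ""
		for {
			dids, next, err := listReposPage(ctx, host, cursor)
			if err != nil {
				return err
			}
			for _, did := range dids {
				select {
				case jobs <- repoJob{PDSHost: host, DID: did}:
				case <-ctx.Done():
					return ctx.Err()
				}
			}
			if next == "" {
				return nil // no more pages
			}
			cursor = next
		}
	})

	// Consumers: per-PDS getRepo concurrency is capped by the worker count.
	for i := 0; i < workers; i++ {
		eg.Go(func() error {
			for job := range jobs {
				if err := handle(ctx, job); err != nil {
					return err
				}
			}
			return nil
		})
	}

	return eg.Wait()
}
```

Launching one crawlPDS per host is what gives the horizontal scaling: total throughput grows with the number of PDSes, while any single host only ever sees `workers` requests in flight.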

Ideally you should be able to crawl the whole network in under 16 hours, given enough compute and bandwidth, with each crawler maxing out at 10 getRepo calls per second per PDS.
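For the 10 getRepos/sec cap, I'm picturing a per-host token bucket (golang.org/x/time/rate) shared by all of a PDS's workers, something like:

```go
// Per-host rate limiter sketch: caps getRepo calls at ~10/s for each PDS.
// The numbers just mirror the estimate above; real values would be config.
package crawler

import (
	"context"
	"sync"

	"golang.org/x/time/rate"
)

type hostLimiters struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit
	burst    int
}

func newHostLimiters(perSec float64, burst int) *hostLimiters {
	return &hostLimiters{
		limiters: make(map[string]*rate.Limiter),
		perSec:   rate.Limit(perSec),
		burst:    burst,
	}
}

// wait blocks until another getRepo may be issued against host.
func (h *hostLimiters) wait(ctx context.Context, host string) error {
	h.mu.Lock()
	lim, ok := h.limiters[host]
	if !ok {
		lim = rate.NewLimiter(h.perSec, h.burst)
		h.limiters[host] = lim
	}
	h.mu.Unlock()
	return lim.Wait(ctx)
}
```

Each worker would call wait(ctx, job.PDSHost) right before its getRepo, with something like limiters := newHostLimiters(10, 1) at startup.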
