When many workers concurrently call find()/glob() on a prefix containing a large number of objects, some workers intermittently get incomplete listings (e.g. 89k+ keys instead of the full 90k), and the results vary from run to run. This is hard to reproduce deterministically, but it happened every so often on a workload with a few thousand simultaneous workers all globbing s3://commoncrawl/crawl-data/CC-MAIN-2024-30/segments/ with "*/warc/*" (which should yield 90k files, see https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-30/index.html).
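
For reference, a rough sketch of the kind of workload that triggers it, not an exact reproduction; the worker count (64 processes) and anonymous access here are placeholders, the real run used a few thousand workers spread across machines:

```python
# Rough sketch of the workload: each process globs the same prefix
# independently and reports how many keys it saw. The worker count and
# anon access are assumptions for the sketch only.
from concurrent.futures import ProcessPoolExecutor

import s3fs

PATTERN = "commoncrawl/crawl-data/CC-MAIN-2024-30/segments/*/warc/*"
EXPECTED = 90_000  # per the CC-MAIN-2024-30 index page


def count_keys(worker_id: int) -> int:
    # Fresh filesystem per process so workers don't share a client/session.
    fs = s3fs.S3FileSystem(anon=True)
    return len(fs.glob(PATTERN))


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=64) as pool:
        counts = list(pool.map(count_keys, range(64)))
    short = [c for c in counts if c < EXPECTED]
    print(f"{len(short)}/{len(counts)} workers returned fewer than {EXPECTED} keys")
```
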
I suspect this is due to transient issues during pagination, where some requests fail silently.
_iterdir uses the aiobotocore paginator directly, which bypasses the _call_s3/_error_wrapper retry handling that s3fs uses elsewhere. Timeouts, IncompleteRead errors, throttling, and invalid-XML parse errors can therefore terminate pagination early or drop a page without a consistent exception being surfaced or retried.
Ideally the pagination requests would be wrapped in the same error handling/retry logic; at the very least, an exception should be raised when one of the paginated requests fails, as sketched below.
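
For comparison, here is roughly the behaviour I'd expect, sketched with plain botocore rather than as a patch to _iterdir (the bucket/prefix are the Common Crawl ones from above, and the retry settings are just an example): every page request goes through the client's retry handling, and a page that still fails raises instead of silently truncating the listing.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config


def list_prefix(bucket: str, prefix: str) -> list[str]:
    # Each page request goes through botocore's retry handling; if a page
    # still fails after retries, the exception propagates instead of the
    # listing quietly ending early.
    s3 = boto3.client(
        "s3",
        config=Config(
            signature_version=UNSIGNED,  # commoncrawl is a public bucket
            retries={"max_attempts": 10, "mode": "adaptive"},
        ),
    )
    keys: list[str] = []
    token = None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token:
            kwargs["ContinuationToken"] = token
        page = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]


keys = list_prefix("commoncrawl", "crawl-data/CC-MAIN-2024-30/segments/")
print(len([k for k in keys if "/warc/" in k]))  # expect ~90k
```
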