When many workers concurrently call find()/glob() on a prefix containing a large number of objects, some workers intermittently get incomplete listings (e.g. 89k+ keys instead of the full 90k), and the results vary from run to run. This is hard to reproduce deterministically, but it happened every so often on a workload with a few thousand simultaneous workers all globbing s3://commoncrawl/crawl-data/CC-MAIN-2024-30/segments/ with "*/warc/*" (which should yield 90k files, see https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-30/index.html).
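
For reference, a rough sketch of the kind of workload that triggers it, not an exact reproduction; the worker count (64 processes) and anonymous access here are placeholders, the real run used a few thousand workers spread across machines:

```python
# Rough sketch of the workload: each process globs the same prefix
# independently and reports how many keys it saw. The worker count and
# anon access are assumptions for the sketch only.
from concurrent.futures import ProcessPoolExecutor

import s3fs

PATTERN = "commoncrawl/crawl-data/CC-MAIN-2024-30/segments/*/warc/*"
EXPECTED = 90_000  # per the CC-MAIN-2024-30 index page


def count_keys(worker_id: int) -> int:
    # Fresh filesystem per process so workers don't share a client/session.
    fs = s3fs.S3FileSystem(anon=True)
    return len(fs.glob(PATTERN))


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=64) as pool:
        counts = list(pool.map(count_keys, range(64)))
    short = [c for c in counts if c < EXPECTED]
    print(f"{len(short)}/{len(counts)} workers returned fewer than {EXPECTED} keys")
```
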
I suspect this is due to transient issues during pagination, where some requests fail silently.
_iterdir uses the aiobotocore paginator directly, which bypasses the _call_s3/_error_wrapper retry handling that s3fs uses elsewhere. Timeouts, IncompleteRead errors, throttling, and invalid-XML parse errors can therefore terminate pagination early or drop a page without a consistent exception being surfaced or retried.
Ideally the pagination requests would be wrapped in the same error handling/retry logic; at the very least, an exception should be raised when one of the paginated requests fails, as sketched below.
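
For comparison, here is roughly the behaviour I'd expect, sketched with plain botocore rather than as a patch to _iterdir (the bucket/prefix are the Common Crawl ones from above, and the retry settings are just an example): every page request goes through the client's retry handling, and a page that still fails raises instead of silently truncating the listing.

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config


def list_prefix(bucket: str, prefix: str) -> list[str]:
    # Each page request goes through botocore's retry handling; if a page
    # still fails after retries, the exception propagates instead of the
    # listing quietly ending early.
    s3 = boto3.client(
        "s3",
        config=Config(
            signature_version=UNSIGNED,  # commoncrawl is a public bucket
            retries={"max_attempts": 10, "mode": "adaptive"},
        ),
    )
    keys: list[str] = []
    token = None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if token:
            kwargs["ContinuationToken"] = token
        page = s3.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        if not page.get("IsTruncated"):
            return keys
        token = page["NextContinuationToken"]


keys = list_prefix("commoncrawl", "crawl-data/CC-MAIN-2024-30/segments/")
print(len([k for k in keys if "/warc/" in k]))  # expect ~90k
```
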