Inconsistent results from find()/glob() under concurrency due to paginator bypassing retry logic #982

@guipenedo

Description

When many workers concurrently call find()/glob() on a prefix with a large number of objects, some workers intermittently get incomplete listings (e.g., 89k+ instead of 90k), and results vary from run to run. This is hard to reproduce and non-deterministic, but it happened every so often on a workload with a few thousand simultaneous workers that were all globbing s3://commoncrawl/crawl-data/CC-MAIN-2024-30/segments/ with "*/warc/*" (which should yield 90k files; see https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-30/index.html).
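For context, a minimal sketch of the kind of workload that surfaces this. The worker count, executor choice, and anonymous access are illustrative assumptions, not the exact production setup:

```python
# Sketch only: many processes globbing the same large prefix in parallel.
from concurrent.futures import ProcessPoolExecutor

import s3fs


def list_segment_warcs(_: int) -> int:
    # anon=True assumes anonymous access to the bucket.
    fs = s3fs.S3FileSystem(anon=True)
    # Expected to return ~90k keys for CC-MAIN-2024-30.
    paths = fs.glob("commoncrawl/crawl-data/CC-MAIN-2024-30/segments/*/warc/*")
    return len(paths)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=32) as pool:
        counts = list(pool.map(list_segment_warcs, range(32)))
    # Under heavy concurrency, some workers intermittently report fewer keys
    # (e.g. 89k+) than the expected 90k.
    print(sorted(set(counts)))
```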

I suspect this is due to transient failures during pagination, where some page requests fail silently.

_iterdir uses the aiobotocore paginator directly, which bypasses the _call_s3/_error_wrapper retry handling that s3fs uses elsewhere. Timeouts, IncompleteRead errors, throttling, invalid-XML parse errors, etc. can terminate pagination early or drop a page without a consistent exception being surfaced or retried.
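To illustrate the pattern being described (a simplified sketch, not the actual s3fs source):

```python
# Simplified sketch of listing via the aiobotocore paginator directly:
# individual page requests are not routed through _call_s3/_error_wrapper,
# so they do not get s3fs' retry logic.
async def _iterdir_sketch(s3_client, bucket: str, prefix: str):
    paginator = s3_client.get_paginator("list_objects_v2")
    async for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # A timeout, IncompleteRead, or throttling error while fetching a page
        # can end this loop early, silently truncating the listing.
        for obj in page.get("Contents", []):
            yield obj
```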

Ideally the pagination requests should be wrapped in the same error handling/retry logic (or, at the very least, an exception should be raised when some of the paginated requests fail, imo).
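A rough sketch of what that could look like: drive the pagination manually so each ListObjectsV2 call goes through a retrying wrapper. Here retry_call is a hypothetical stand-in for _call_s3/_error_wrapper; the retryable exceptions, backoff policy, and names are all assumptions:

```python
import asyncio

# Assumed set of transient errors; the real wrapper would use s3fs'/botocore's
# own retryable-exception list.
RETRYABLE = (asyncio.TimeoutError, ConnectionError)


async def retry_call(fn, *args, retries: int = 5, **kwargs):
    for attempt in range(retries):
        try:
            return await fn(*args, **kwargs)
        except RETRYABLE:
            if attempt == retries - 1:
                raise  # surface the failure instead of silently dropping the page
            await asyncio.sleep(2 ** attempt)


async def _iterdir_retrying(s3_client, bucket: str, prefix: str):
    token = None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix, "Delimiter": "/"}
        if token:
            kwargs["ContinuationToken"] = token
        # Each page request is retried; exhausting retries raises instead of
        # returning a truncated listing.
        page = await retry_call(s3_client.list_objects_v2, **kwargs)
        for obj in page.get("Contents", []):
            yield obj
        if not page.get("IsTruncated"):
            break
        token = page["NextContinuationToken"]
```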
