Add Performance Benchmarks for PartitionedDataset.load() #1074

@SajidAlamQB

Description

Following recent changes in #1070, we modified PartitionedDataset.load() to always invalidate its internal partition list cache before scanning the filesystem.

This was necessary to fix bugs (related to Issues #4164 and #623) where stale caches caused load() to fail, particularly when used with ParallelRunner.

While the fix works, always performing a filesystem scan may add overhead compared with reusing a cached list (even though that cached list could be stale). We expect the impact to be negligible for small datasets or fast filesystems, but it could be noticeable for datasets with a very large number of partitions or those on slow, high-latency storage (e.g., S3).

Possible Implementation

Implement performance benchmarks specifically targeting PartitionedDataset.load() to:

  • Measure the performance difference between the old caching behaviour and the new behaviour (always re-scanning).

  • Measure the load() time under various conditions:

    • Local filesystem vs. Remote filesystem (e.g., mocked S3).
    • Small number of partitions vs. Very large number of partitions.
    • Repeated load() calls on the same dataset instance.

The goal is a set of benchmark scenarios that covers each of the conditions above.
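As a starting point for discussion, here is a minimal, stdlib-only sketch of how the cached vs. always-rescan comparison could be timed. It does not call PartitionedDataset at all: `ToyPartitionLister`, `make_partitions`, `run_benchmark`, and the `scan_latency` parameter are hypothetical names introduced purely for illustration, and the `time.sleep` delay is only a crude stand-in for a remote round trip, not a real S3 mock. A real benchmark would swap the toy lister for an actual PartitionedDataset instance and call load() instead.

```python
import os
import tempfile
import time
from pathlib import Path


def make_partitions(root: Path, n: int) -> None:
    """Create n tiny CSV files that stand in for dataset partitions."""
    for i in range(n):
        (root / f"part_{i:06d}.csv").write_text("a,b\n1,2\n")


class ToyPartitionLister:
    """Hypothetical stand-in for the partition-listing step.

    This is NOT kedro's implementation; it only mimics the two behaviours
    being compared: reuse a cached listing, or rescan on every call.
    """

    def __init__(self, root: Path, use_cache: bool, scan_latency: float = 0.0) -> None:
        self.root = root
        self.use_cache = use_cache
        self.scan_latency = scan_latency  # assumed per-scan delay, in seconds
        self._cached = None

    def list_partitions(self) -> list:
        # Old behaviour: return the cached listing if one exists (may be stale).
        if self.use_cache and self._cached is not None:
            return self._cached
        # New behaviour: always hit the filesystem.
        time.sleep(self.scan_latency)
        found = sorted(e.name for e in os.scandir(self.root) if e.is_file())
        self._cached = found
        return found


def time_repeated_listing(root: Path, use_cache: bool, repeats: int,
                          scan_latency: float = 0.0) -> float:
    """Time `repeats` listing calls on the same lister instance."""
    lister = ToyPartitionLister(root, use_cache, scan_latency)
    start = time.perf_counter()
    for _ in range(repeats):
        lister.list_partitions()
    return time.perf_counter() - start


def run_benchmark(partition_counts=(10, 1000), repeats=20, scan_latency=0.0) -> dict:
    """Return {partition_count: {"cached": seconds, "rescan": seconds}}."""
    results = {}
    for n in partition_counts:
        with tempfile.TemporaryDirectory() as tmp:
            root = Path(tmp)
            make_partitions(root, n)
            results[n] = {
                "cached": time_repeated_listing(root, True, repeats, scan_latency),
                "rescan": time_repeated_listing(root, False, repeats, scan_latency),
            }
    return results


if __name__ == "__main__":
    for n, timings in run_benchmark().items():
        print(f"{n:>6} partitions: cached={timings['cached']:.4f}s "
              f"rescan={timings['rescan']:.4f}s")
```

The sketch covers the three axes listed above in simplified form: partition count (`partition_counts`), repeated calls on one instance (`repeats`), and a simulated slow backend (`scan_latency`). Absolute timings will vary by machine, so the useful signal is the ratio between the cached and rescan numbers as the partition count grows.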

Metadata

Labels: enhancement (New feature or request)
