Description
Following recent changes in #1070, we modified PartitionedDataset.load() to always invalidate its internal partition list cache before scanning the filesystem. This was necessary to fix bugs (related to Issue #4164 and #623) where stale caches caused load() to fail, particularly when used with ParallelRunner.
While the fix works, always performing a filesystem scan might introduce performance overhead compared to potentially reusing a cached list (even though the cached list could previously be stale). This impact is expected to be negligible for small datasets or fast filesystems but could be noticeable for datasets with a very large number of partitions or those residing on slow/high-latency storage (e.g., S3).
Possible Implementation
Implement performance benchmarks specifically targeting PartitionedDataset.load() to:
- Measure performance difference between the old caching behaviour and the new behaviour (always re-scanning).
- Measure the load() time under various conditions:
  - Local filesystem vs. remote filesystem (e.g., mocked S3).
  - Small number of partitions vs. very large number of partitions.
  - Repeated load() calls on the same dataset instance.
We essentially want to design benchmark scenarios covering the conditions above.
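As a starting point for discussion, here is a minimal sketch of what such a scenario matrix could look like. It deliberately does not import Kedro: ToyPartitionLister is a hypothetical stand-in that only mimics the "reuse cached listing" vs. "always re-scan" choice, and simulated_latency is an assumed, crude substitute for a slow remote store, not a mocked S3. All names here are illustrative; a real benchmark would call PartitionedDataset.load() instead.

```python
from __future__ import annotations

import os
import tempfile
import time
from pathlib import Path


class ToyPartitionLister:
    """Hypothetical stand-in for the partition listing step (not Kedro code)."""

    def __init__(self, path: str, invalidate_cache: bool, simulated_latency: float = 0.0) -> None:
        self._path = path
        self._invalidate_cache = invalidate_cache
        self._simulated_latency = simulated_latency  # seconds added per scan
        self._cache: list[str] | None = None
        self.scan_count = 0

    def _scan(self) -> list[str]:
        # One real directory scan, optionally slowed to imitate remote storage.
        self.scan_count += 1
        if self._simulated_latency:
            time.sleep(self._simulated_latency)
        return sorted(entry.path for entry in os.scandir(self._path) if entry.is_file())

    def list_partitions(self) -> list[str]:
        # New behaviour: always re-scan. Old behaviour: reuse the cached list.
        if self._invalidate_cache or self._cache is None:
            self._cache = self._scan()
        return self._cache


def make_partitions(root: str, count: int) -> None:
    """Create `count` tiny CSV files to act as partitions."""
    for i in range(count):
        Path(root, f"part_{i:06d}.csv").write_text("a,b\n1,2\n")


def run_benchmark(partition_counts=(10, 1000), repeats=20, simulated_latency=0.0) -> list[dict]:
    """Time repeated listings on one instance, with and without cache invalidation."""
    results = []
    for count in partition_counts:
        with tempfile.TemporaryDirectory() as root:
            make_partitions(root, count)
            for invalidate in (False, True):
                lister = ToyPartitionLister(root, invalidate, simulated_latency)
                start = time.perf_counter()
                for _ in range(repeats):
                    found = lister.list_partitions()
                elapsed = time.perf_counter() - start
                results.append(
                    {
                        "partitions": len(found),
                        "always_rescan": invalidate,
                        "scans": lister.scan_count,
                        "seconds": elapsed,
                    }
                )
    return results


if __name__ == "__main__":
    for row in run_benchmark():
        print(row)
```

The sketch covers the partition-count and repeated-call dimensions on a local temporary directory. The remote dimension would need a mocked object store in place of the sleep-based latency used here.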