Investigate keeping the content of the downloaded chunks in RAM instead of writing it to file. #291

Open
tchaton opened this issue Aug 1, 2024 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@tchaton
Collaborator

tchaton commented Aug 1, 2024

🚀 Feature

Motivation

Pitch

Alternatives

Additional context

@tchaton tchaton added enhancement New feature or request help wanted Extra attention is needed labels Aug 1, 2024

stale bot commented Apr 16, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the won't fix label Apr 16, 2025
@bhimrazy
Collaborator

Let's keep this open, as we've been experimenting around this issue. We'll continue exploring and will add our findings here from the last few experiments.

@stale stale bot removed the won't fix label Apr 17, 2025
@bhimrazy
Collaborator

bhimrazy commented Apr 17, 2025

Idea:
Another approach worth exploring: create a multiprocessing dictionary shared between the workers (or a plain dict kept within a single worker process). The downloader workers/threads would write each chunk's data into a buffer under that chunk's key in the shared dictionary, and the reader would check for the existence of the key, along with specific byte ranges or the full chunk size, before starting to read.

Some potential downsides of this approach might include:

  • Performance bottlenecks due to the overhead of multiprocessing.Manager().dict(), especially under heavy concurrent access.
  • Synchronization complexity, as ensuring thread/process safety for concurrent reads and writes to the buffer may require locks or queues.
  • Memory management issues, particularly if chunks are large or not cleared after use.
  • Limited scalability, since Python multiprocessing may not efficiently handle shared state across many processes compared to more optimized shared-memory structures.
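The shared-dictionary idea above can be sketched with a `multiprocessing.Manager` dict (a minimal sketch with hypothetical function names, not litdata's actual API): downloader processes publish chunk bytes under a chunk key, and the reader polls for the key before consuming and evicting the entry to bound RAM usage.

```python
# Sketch of the shared-dict buffer: downloaders publish chunk bytes under a
# chunk key; readers wait for the key, consume the data, and evict the entry.
import multiprocessing as mp
import time


def downloader(shared, chunk_key, payload):
    # Simulate downloading a chunk and publishing it under its key.
    shared[chunk_key] = payload


def reader(shared, chunk_key, timeout=5.0):
    # Poll until the full chunk is present, then consume and evict it.
    deadline = time.monotonic() + timeout
    while chunk_key not in shared:
        if time.monotonic() > deadline:
            raise TimeoutError(f"chunk {chunk_key!r} never arrived")
        time.sleep(0.01)
    data = shared[chunk_key]
    del shared[chunk_key]  # evict to avoid unbounded RAM growth
    return data


if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.dict()
        p = mp.Process(target=downloader, args=(shared, "chunk-0.bin", b"\x00" * 1024))
        p.start()
        data = reader(shared, "chunk-0.bin")
        p.join()
        print(len(data))  # 1024
```

Note that every `shared[...]` access here is a round-trip to the manager process, which is exactly the performance bottleneck listed above; a real implementation would likely need byte-range bookkeeping per chunk rather than whole-chunk values.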

@deependujha
Copy link
Collaborator

I tried something similar in my Rust PR, where the downloaded chunk was deserialized and unflattened into a dict {index: item}, but RAM usage increased exponentially, close to 50-60 GB.

I haven't tried using a multiprocessing dict for sharing, but I think that would be much more complicated.

The DataLoader instantiates multiple dataset workers, and multiple dataloaders are instantiated by Lightning.

Sharing a dict across the dataloaders won't be trivial.
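One reason the sharing is non-trivial, sketched below with plain multiprocessing as a stand-in for DataLoader workers (this is an illustration of the constraint, not litdata's design): a Manager dict is only visible to workers that receive a handle from the parent before they are spawned, so independently created dataloaders cannot discover each other's dicts after the fact.

```python
# A Manager dict created in the parent and passed to children at spawn time is
# shared; processes started without the handle have no way to reach it.
import multiprocessing as mp


def worker(shared, worker_id):
    # Each "dataset worker" publishes under its own key.
    shared[worker_id] = f"chunk-from-{worker_id}"


if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.dict()  # must exist in the parent before forking
        procs = [mp.Process(target=worker, args=(shared, i)) for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(sorted(shared.keys()))  # [0, 1, 2]
```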
