Investigate keeping the content of the downloaded chunks in RAM instead of writing it to file. #291

Open
tchaton opened this issue Aug 1, 2024 · 4 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@tchaton
Collaborator

tchaton commented Aug 1, 2024

🚀 Feature

Motivation

Pitch

Alternatives

Additional context

@tchaton tchaton added enhancement New feature or request help wanted Extra attention is needed labels Aug 1, 2024

stale bot commented Apr 16, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the won't fix label Apr 16, 2025
@bhimrazy
Collaborator

Let's keep this open, as we've been experimenting around this issue. We'll continue exploring and will add our findings here from the last few experiments.

@stale stale bot removed the won't fix label Apr 17, 2025
@bhimrazy
Collaborator

bhimrazy commented Apr 17, 2025

Idea:
Another approach worth exploring: create a multiprocessing dictionary shared between the workers (or a plain dict kept within a single worker process). The downloader workers/threads would write each chunk's data into a buffer under that chunk's key in the shared dictionary, and the reader would check for the existence of the key, along with specific byte ranges or the full chunk size, before starting to read.

Some potential downsides of this approach might include:

  • Performance bottlenecks due to the overhead of multiprocessing.Manager().dict(), especially under heavy concurrent access.
  • Synchronization complexity, as ensuring thread/process safety for concurrent reads and writes to the buffer may require locks or queues.
  • Memory management issues, particularly if chunks are large or not cleared after use.
  • Limited scalability, since Python multiprocessing may not efficiently handle shared state across many processes compared to more optimized shared-memory structures.
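The shared-dictionary idea above can be sketched with a `multiprocessing.Manager` dict (a minimal sketch with hypothetical function names, not litdata's actual API): downloader processes publish chunk bytes under a chunk key, and the reader polls for the key before consuming and evicting the entry to bound RAM usage.

```python
# Sketch of the shared-dict buffer: downloaders publish chunk bytes under a
# chunk key; readers wait for the key, consume the data, and evict the entry.
import multiprocessing as mp
import time


def downloader(shared, chunk_key, payload):
    # Simulate downloading a chunk and publishing it under its key.
    shared[chunk_key] = payload


def reader(shared, chunk_key, timeout=5.0):
    # Poll until the full chunk is present, then consume and evict it.
    deadline = time.monotonic() + timeout
    while chunk_key not in shared:
        if time.monotonic() > deadline:
            raise TimeoutError(f"chunk {chunk_key!r} never arrived")
        time.sleep(0.01)
    data = shared[chunk_key]
    del shared[chunk_key]  # evict to avoid unbounded RAM growth
    return data


if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.dict()
        p = mp.Process(target=downloader, args=(shared, "chunk-0.bin", b"\x00" * 1024))
        p.start()
        data = reader(shared, "chunk-0.bin")
        p.join()
        print(len(data))  # 1024
```

Note that every `shared[...]` access here is a round-trip to the manager process, which is exactly the performance bottleneck listed above; a real implementation would likely need byte-range bookkeeping per chunk rather than whole-chunk values.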

@deependujha
Copy link
Collaborator

I tried something similar in my Rust PR, where the downloaded chunk was deserialized and unflattened into a dict {index: item}, but RAM usage increased exponentially, close to 50-60 GB.

I haven't tried using a multiprocessing dict for sharing, but I think that would be much more complicated.

The DataLoader instantiates multiple dataset workers, and multiple dataloaders are instantiated by Lightning.

Sharing a dict across the dataloaders won't be trivial.
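One reason the sharing is non-trivial, sketched below with plain multiprocessing as a stand-in for DataLoader workers (this is an illustration of the constraint, not litdata's design): a Manager dict is only visible to workers that receive a handle from the parent before they are spawned, so independently created dataloaders cannot discover each other's dicts after the fact.

```python
# A Manager dict created in the parent and passed to children at spawn time is
# shared; processes started without the handle have no way to reach it.
import multiprocessing as mp


def worker(shared, worker_id):
    # Each "dataset worker" publishes under its own key.
    shared[worker_id] = f"chunk-from-{worker_id}"


if __name__ == "__main__":
    with mp.Manager() as manager:
        shared = manager.dict()  # must exist in the parent before forking
        procs = [mp.Process(target=worker, args=(shared, i)) for i in range(3)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(sorted(shared.keys()))  # [0, 1, 2]
```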
