Question: Is litdata faster when loading a local dataset or a network-storage S3 dataset? #428

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
2catycm opened this issue Nov 30, 2024 · 7 comments
Labels
question (Further information is requested), won't fix

Comments

@2catycm

2catycm commented Nov 30, 2024

When my local storage is large enough to download the dataset, should I still use litdata's streaming API?

2catycm added the enhancement (New feature or request) label on Nov 30, 2024

Hi! Thanks for your contribution, great first issue!

@2catycm
Author

2catycm commented Nov 30, 2024

Another question: can I use sshfs instead of S3? I don't have an S3 account, but I have multiple machines. To save storage, I am thinking of storing datasets on machine D and accessing them from machines A, B, and C. Can I use litdata to optimize this workflow?
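
For what it's worth, one rough way to approximate this with litdata's current API would be to mount machine D's dataset directory on machines A, B, and C with sshfs and point litdata at the mount as if it were a local path. A sketch only; the paths below are hypothetical and this is untested:

# on machines A/B/C, mount machine D's optimized dataset directory (hypothetical paths):
#   sshfs user@machineD:/data/vtab1k_optimized /mnt/d-data

import litdata as ld

# treat the sshfs mount like any other local directory of optimized chunks
dataset = ld.StreamingDataset("/mnt/d-data")
for sample in dataset:
    pass  # samples are read over the network through the mount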

@tchaton
Collaborator

tchaton commented Dec 1, 2024

Hey @2catycm,

Yes, some users have reported increased speed even when running locally.

We don't support sshfs, but it shouldn't be hard to add if you want to. Feel free to make a PR.

Best,
T.C
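
For reference, the local workflow under discussion is to convert the raw dataset once with ld.optimize and then read it back with StreamingDataset pointed at the local output directory. A minimal sketch, where the to_sample function and the ./vtab-1k-optimized path are only illustrative:

import litdata as ld

def to_sample(index):
    # illustrative conversion function: return whatever should be stored per sample
    return {"index": index}

if __name__ == "__main__":
    # one-time conversion into litdata's chunked format, written to a local directory
    ld.optimize(
        fn=to_sample,
        inputs=list(range(800)),
        output_dir="./vtab-1k-optimized",
        chunk_bytes="64MB",
        num_workers=4,
    )

    # read it back from local disk; no S3 involved
    dataset = ld.StreamingDataset("./vtab-1k-optimized")
    for sample in dataset:
        pass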

@2catycm
Author

2catycm commented Dec 2, 2024

Thanks for your reply. I am trying to use the vtab-1k dataset locally and used litdata to optimize it. On a subset of 800 samples, iterating with litdata is about 1.41x faster than a plain PyTorch Dataset (147 ms -> 104 ms).

I am not sure whether my benchmark is appropriate, since I just iterate over the dataset trivially and haven't used it for training yet:

%%timeit
# assumes `from tqdm import tqdm` ran in an earlier cell;
# train_dataset is the litdata StreamingDataset (or plain PyTorch Dataset) under test
bar = tqdm(train_dataset)
for i, data in enumerate(bar):
    pass
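
For a measurement closer to training, one could also iterate through a StreamingDataLoader with batching and workers instead of the bare dataset. A sketch, again assuming the optimized copy lives in the hypothetical ./vtab-1k-optimized directory:

import time

import litdata as ld

dataset = ld.StreamingDataset("./vtab-1k-optimized", shuffle=True)
loader = ld.StreamingDataLoader(dataset, batch_size=32, num_workers=4)

start = time.perf_counter()
for batch in loader:
    pass  # replace with a forward/backward pass to measure end-to-end throughput
print(f"one epoch took {time.perf_counter() - start:.2f}s")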

@tchaton
Collaborator

tchaton commented Dec 2, 2024

Hey @2catycm. Yes, this is appropriate. We benchmark by iterating over the dataset for two epochs in the cloud and one epoch locally.

@tchaton
Collaborator

tchaton commented Dec 2, 2024

Hey @2catycm. We could probably make it slightly faster too.

bhimrazy added the question (Further information is requested) label and removed the enhancement (New feature or request) label on Feb 9, 2025

stale bot commented Apr 16, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Apr 16, 2025