storage: implement dataflux fast listing #10731

Closed
akansha1812 opened this issue Aug 21, 2024 · 2 comments · Fixed by #10748, #10899 or #10913 · May be fixed by #10966 or #11093
Labels: api: storage, type: feature request

Comments

@akansha1812
Contributor

Listing a large dataset in a GCS bucket sequentially takes a long time. If we can list objects in parallel, listing will complete much faster.

Dataflux fast listing will list objects in a bucket in parallel using a work-stealing algorithm. It supports storage.Query for filtering objects in a bucket and returns objects in batches. The user can provide the bucket, a storage.Query, the number of parallel workers, and the batch size.

Several implementations of the work-stealing algorithm exist. After benchmarking them, the Dataflux implementation came out faster.
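For readers unfamiliar with the approach, here is a minimal sketch of work-stealing listing in Go. It is not the Dataflux implementation. `fakeBucket`, `span`, `listPage`, `fastList`, and `collect` are hypothetical names, the bucket is an in-memory sorted slice, and ranges are integer positions rather than object-name strings, which a real lister would have to split on. The idea shown: a busy worker hands the upper half of its remaining range to any worker that is idle, and results come back in batches.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// span is a half-open range [lo, hi) of positions in the sorted keyspace.
// Assumption: a real lister would split on object-name strings instead.
type span struct{ lo, hi int }

// fakeBucket stands in for a GCS bucket; names are kept sorted.
type fakeBucket struct{ names []string }

func newFakeBucket(n int) *fakeBucket {
	b := &fakeBucket{}
	for i := 0; i < n; i++ {
		b.names = append(b.names, fmt.Sprintf("obj-%06d", i))
	}
	return b
}

// listPage simulates one list call: up to pageSize names from the start
// of s, plus the part of s that is still unlisted.
func (b *fakeBucket) listPage(s span, pageSize int) ([]string, span) {
	end := s.lo + pageSize
	if end > s.hi {
		end = s.hi
	}
	return b.names[s.lo:end], span{end, s.hi}
}

// fastList lists the whole bucket with the given number of workers and
// returns the batches in whatever order the workers produced them.
func fastList(b *fakeBucket, workers, batchSize int) [][]string {
	if workers < 1 {
		workers = 1
	}
	if batchSize < 1 {
		batchSize = 1
	}
	var (
		mu      sync.Mutex
		batches [][]string
		pending sync.WaitGroup // spans not yet fully listed
		exited  sync.WaitGroup // worker goroutines still running
	)
	// Unbuffered: a non-blocking send succeeds only if a worker is idle.
	work := make(chan span)

	worker := func() {
		defer exited.Done()
		for s := range work {
			for s.lo < s.hi {
				// Offer the upper half of the remaining range to an idle worker.
				if mid := s.lo + (s.hi-s.lo)/2; s.hi-s.lo > batchSize {
					pending.Add(1)
					select {
					case work <- span{mid, s.hi}:
						s.hi = mid
					default:
						pending.Done() // nobody idle; keep the whole range
					}
				}
				page, rest := b.listPage(s, batchSize)
				mu.Lock()
				batches = append(batches, page)
				mu.Unlock()
				s = rest
			}
			pending.Done()
		}
	}

	pending.Add(1)
	exited.Add(workers)
	for i := 0; i < workers; i++ {
		go worker()
	}
	work <- span{0, len(b.names)}
	pending.Wait() // every span has been listed
	close(work)
	exited.Wait()
	return batches
}

// collect flattens the batches and sorts the names for easy comparison.
func collect(batches [][]string) []string {
	var all []string
	for _, batch := range batches {
		all = append(all, batch...)
	}
	sort.Strings(all)
	return all
}

func main() {
	b := newFakeBucket(1000)
	names := collect(fastList(b, 8, 25))
	fmt.Println("listed", len(names), "objects") // prints "listed 1000 objects"
}
```

The number and order of batches vary from run to run, because which worker receives which half depends on scheduling. Only the combined set of names is deterministic.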

@akansha1812 akansha1812 added the triage me I really want to be triaged. label Aug 21, 2024
@codyoss
Member

codyoss commented Aug 22, 2024

I am not sure what Dataflux is. Is it related to storage? All List RPCs today are compliant with https://google.aip.dev/158 and https://google.aip.dev/client-libraries/4233. They are all based on page tokens: you need to do a fetch to get the next result.
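To illustrate the page-token constraint described above, here is a rough Go sketch of a token-based list loop. The field names `PageSize`, `PageToken`, and `NextPageToken` follow AIP-158. Everything else (`listRequest`, `listResponse`, `fetchPage`, `listAll`, and using a numeric offset as the token) is a hypothetical in-memory stand-in, not a real client API. It shows that each call needs the token returned by the previous one, so a single listing is inherently sequential.

```go
package main

import (
	"fmt"
	"strconv"
)

// listRequest and listResponse mirror the AIP-158 pagination fields.
type listRequest struct {
	PageSize  int
	PageToken string
}

type listResponse struct {
	Names         []string
	NextPageToken string // empty when there are no more pages
}

// fetchPage is a hypothetical stand-in for one List RPC. The token here
// is just an offset; real tokens are opaque to the caller.
func fetchPage(all []string, req listRequest) listResponse {
	start := 0
	if req.PageToken != "" {
		start, _ = strconv.Atoi(req.PageToken)
	}
	end := start + req.PageSize
	if end >= len(all) {
		return listResponse{Names: all[start:]}
	}
	return listResponse{Names: all[start:end], NextPageToken: strconv.Itoa(end)}
}

// listAll walks every page in order and reports how many calls it took.
func listAll(all []string, pageSize int) (names []string, calls int) {
	if pageSize < 1 {
		pageSize = 1
	}
	req := listRequest{PageSize: pageSize}
	for {
		resp := fetchPage(all, req)
		calls++
		names = append(names, resp.Names...)
		if resp.NextPageToken == "" {
			return names, calls
		}
		// The next call cannot be issued until this token is known.
		req.PageToken = resp.NextPageToken
	}
}

func main() {
	all := make([]string, 10)
	for i := range all {
		all[i] = fmt.Sprintf("obj-%02d", i)
	}
	names, calls := listAll(all, 4)
	fmt.Println(len(names), "names in", calls, "sequential calls") // prints "10 names in 3 sequential calls"
}
```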

@tritone tritone changed the title from "dataflux: The purpose of this feature is to quickly list and download data stored in GCS" to "storage: implement dataflux fast listing" Aug 22, 2024
@tritone tritone added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. api: storage Issues related to the Cloud Storage API. and removed triage me I really want to be triaged. labels Aug 22, 2024
@tritone
Contributor

tritone commented Aug 22, 2024

@codyoss this will be a sub-package for storage, similar to transfer manager, but focused on a few new features for AI/ML workloads.
