Bulk Data Access Performance for S3 #13

Closed
samueljackson92 opened this issue Apr 23, 2024 · 1 comment
Labels
question Further information is requested

Comments

Collaborator

samueljackson92 commented Apr 23, 2024

Bulk downloading the data can be quite slow, especially when we only want a selection of the data stored in S3.

This is because we have many, many small files. To download everything, we have to page through all of the keys in the S3 bucket, which can take some time.
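To put a rough number on the paging cost, here is a back-of-the-envelope sketch. It assumes the endpoint follows the standard S3 `ListObjectsV2` behaviour of returning at most 1000 keys per response, and that pages are fetched one after another. The key count and per-request latency in the example are made-up illustrative values, not measurements from our bucket.

```python
import math

# Standard S3 ListObjectsV2 limit; assumed to apply to our endpoint as well.
MAX_KEYS_PER_PAGE = 1000


def listing_requests(n_keys, page_size=MAX_KEYS_PER_PAGE):
    """Number of sequential list requests needed to enumerate n_keys keys."""
    return math.ceil(n_keys / page_size)


def listing_seconds(n_keys, seconds_per_request, page_size=MAX_KEYS_PER_PAGE):
    """Rough time spent listing keys, before any object is downloaded."""
    return listing_requests(n_keys, page_size) * seconds_per_request


# Hypothetical example: 5 million small objects at 0.1 s per list request.
print(listing_requests(5_000_000))      # 5000 requests
print(listing_seconds(5_000_000, 0.1))  # roughly 500 seconds of listing alone
```

The listing time grows with the number of keys rather than with the volume of data, which is why many small files hurt even when each one is quick to fetch.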

I am wondering how we can make this process more performant.

  • One solution is to have a file index and query it, but that would mean introducing an access layer.
  • Another solution is to push users towards local caching so that they load only what they need:
import xarray as xr

# Open one group of a Zarr store anonymously; fetched files are cached under /tmp/files
endpoint_url = "https://s3.echo.stfc.ac.uk"
url = 's3://mast/test/shots/tiny/30390.zarr'
ds = xr.open_dataset("filecache::" + url, engine='zarr', group='rba',
                     storage_options={'s3': {'anon': True, 'endpoint_url': endpoint_url},
                                      'filecache': {'cache_storage': '/tmp/files'}})
ds

The second solution has implications for how you access the data on HPC systems. For CSD3 we should have an internal-facing S3 storage. Users at other sites will need an internet connection.

samueljackson92 added the "enhancement" (New feature or request) label on Apr 23, 2024
samueljackson92 changed the title from "Write download script for CSD3" to "Bulk Downloads for S3" on May 1, 2024
samueljackson92 added the "question" (Further information is requested) label and removed the "enhancement" (New feature or request) label on May 1, 2024
samueljackson92 changed the title from "Bulk Downloads for S3" to "Bulk Data Access Performance for S3" on May 1, 2024
Collaborator Author

samueljackson92 commented May 1, 2024

I think we should update the intake catalog to use filecache and give the user an option to specify where the cache is.
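As a rough illustration of what a user-configurable cache location could look like, here is a small sketch built on the same `filecache::` pattern as the snippet in the first comment. It is not the intake catalog change itself. The helper names, the `MAST_CACHE_DIR` environment variable and the default cache directory are all hypothetical, chosen only for this example.

```python
import os
import tempfile

ENDPOINT_URL = "https://s3.echo.stfc.ac.uk"
CACHE_ENV_VAR = "MAST_CACHE_DIR"  # hypothetical environment variable name


def cached_open_args(url, cache_dir=None, endpoint_url=ENDPOINT_URL):
    """Build the chained URL and storage options for a filecache-backed read.

    The cache directory is taken from the argument, then from the
    (hypothetical) environment variable, then from a temp-dir default.
    """
    if cache_dir is None:
        default_dir = os.path.join(tempfile.gettempdir(), "mast-cache")
        cache_dir = os.environ.get(CACHE_ENV_VAR, default_dir)
    storage_options = {
        "s3": {"anon": True, "endpoint_url": endpoint_url},
        "filecache": {"cache_storage": cache_dir},
    }
    return "filecache::" + url, storage_options


def open_cached(url, group, cache_dir=None):
    """Open one group of a Zarr store through the local file cache."""
    import xarray as xr  # imported lazily so the helper above has no heavy dependencies

    chained_url, storage_options = cached_open_args(url, cache_dir)
    return xr.open_dataset(chained_url, engine="zarr", group=group,
                           storage_options=storage_options)
```

A user on a system with fast scratch storage could then call something like `open_cached('s3://mast/test/shots/tiny/30390.zarr', group='rba', cache_dir='/scratch/me/cache')`, where the scratch path is again just an example.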

Using filecache would also be a nice solution for #12
