Bulk Data Access Performance for S3 #13

samueljackson92 · 2024-04-23T15:25:07Z

Bulk downloading the data can be quite slow, especially when we are being selective about which data we pull from S3.

This is because we have a very large number of small files. To download everything we have to page through all of the keys in the S3 bucket, which can take some time.
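To get a rough feel for the listing cost: the S3 `ListObjectsV2` call returns at most 1,000 keys per response (I am assuming Echo behaves the same way), so a full listing needs at least one request per 1,000 keys, issued one after another. A minimal sketch of that arithmetic, where the key count and per-request latency are made-up illustrative numbers, not measurements of our bucket:

```python
import math

# S3 ListObjectsV2 returns at most 1,000 keys per response.
MAX_KEYS_PER_PAGE = 1000


def listing_requests(n_keys: int, page_size: int = MAX_KEYS_PER_PAGE) -> int:
    """Minimum number of sequential list requests needed to see n_keys keys."""
    return math.ceil(n_keys / page_size)


def listing_seconds(n_keys: int, seconds_per_request: float) -> float:
    """Rough lower bound on listing time when pages are fetched one by one."""
    return listing_requests(n_keys) * seconds_per_request


# Illustrative numbers only (assumptions, not measured values).
n_keys = 5_000_000
print(listing_requests(n_keys))        # 5000
print(listing_seconds(n_keys, 0.25))   # 1250.0
```

The point is only that the cost grows with the number of keys rather than with the amount of data, which is why many small files hurt.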

I am wondering how we can make this process more performant.

One solution is to have a file index and query that, but this would mean introducing an access layer.
Another solution is to push users towards local caching, so that they only load what they need:

import xarray as xr

endpoint_url = "https://s3.echo.stfc.ac.uk"
url = "s3://mast/test/shots/tiny/30390.zarr"
# "filecache::" chains fsspec's local file cache in front of the S3 store
ds = xr.open_dataset("filecache::" + url, engine="zarr", group="rba",
                     storage_options={"s3": {"anon": True, "endpoint_url": endpoint_url},
                                      "filecache": {"cache_storage": "/tmp/files"}})
ds

The second solution has implications for how the data is accessed on HPC systems. For CSD3 we should have an internally facing S3 store. Other sites will need an internet connection.
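For the first option (a file index), here is a rough sketch of what I have in mind, assuming the index is a single JSON object written at ingestion time and fetched in one request instead of paging through the bucket. The function names, the key layout under each `.zarr` store, and shot 30391 are hypothetical and only there for illustration:

```python
import json


def build_index(keys):
    """Group object keys by the Zarr store they belong to."""
    index = {}
    for key in keys:
        marker = key.find(".zarr")
        if marker == -1:
            continue  # skip keys that are not part of a Zarr store
        store = key[: marker + len(".zarr")]
        index.setdefault(store, []).append(key)
    return index


def keys_for_shot(index, shot_id):
    """Return all keys for one shot without listing the bucket."""
    suffix = f"/{shot_id}.zarr"
    return [key
            for store, store_keys in index.items()
            if store.endswith(suffix)
            for key in store_keys]


# Hypothetical keys, standing in for a real bucket listing.
keys = [
    "test/shots/tiny/30390.zarr/rba/.zgroup",
    "test/shots/tiny/30390.zarr/rba/data/0",
    "test/shots/tiny/30391.zarr/rba/.zgroup",
]

# The index round-trips through JSON, so it could live as one object in S3.
serialised = json.dumps(build_index(keys))
index = json.loads(serialised)
print(keys_for_shot(index, 30390))
```

Something still has to keep the index in sync with the bucket, which is the access layer cost mentioned above.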

samueljackson92 · 2024-05-01T08:39:03Z

I think we should update the intake catalog to use filecache and give the user an option to specify where the cache is.

Using filecache would also be a nice solution for #12
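As a rough sketch of the user-selectable cache location, the helper below builds the same `xr.open_dataset` arguments as the snippet in my first comment, but takes the cache directory as a parameter. The helper names and the `MAST_CACHE_DIR` environment variable are hypothetical, and this shows the plain xarray call rather than the actual intake catalog entry:

```python
import os

ENDPOINT_URL = "https://s3.echo.stfc.ac.uk"


def cached_open_kwargs(url, group, cache_dir=None):
    """Build keyword arguments for xr.open_dataset with a local file cache."""
    if cache_dir is None:
        # MAST_CACHE_DIR is a hypothetical variable name, not an existing setting.
        cache_dir = os.environ.get("MAST_CACHE_DIR", "/tmp/files")
    return {
        "filename_or_obj": "filecache::" + url,
        "engine": "zarr",
        "group": group,
        "storage_options": {
            "s3": {"anon": True, "endpoint_url": ENDPOINT_URL},
            "filecache": {"cache_storage": cache_dir},
        },
    }


def open_cached(url, group, cache_dir=None):
    """Open one group of a Zarr store through the local file cache."""
    import xarray as xr  # imported here so the helper above has no dependencies
    return xr.open_dataset(**cached_open_kwargs(url, group, cache_dir))
```

On an HPC system the user would then point `cache_dir` at scratch storage, for example `open_cached("s3://mast/test/shots/tiny/30390.zarr", "rba", cache_dir="/path/to/scratch")`.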

samueljackson92 added the enhancement label Apr 23, 2024

samueljackson92 changed the title ~~Write download script for CSD3~~ Bulk Downloads for S3 May 1, 2024

samueljackson92 added the question label and removed the enhancement label May 1, 2024

samueljackson92 changed the title ~~Bulk Downloads for S3~~ Bulk Data Access Performance for S3 May 1, 2024

samueljackson92 closed this as completed Jul 2, 2024
