
concurrent reads with dask and xarray #2

Open
scottyhq opened this issue Nov 5, 2020 · 7 comments

Comments


scottyhq commented Nov 5, 2020

I've never been clear on whether or not xarray+dask computations using threads are hampered by file locks. It's complicated because the whole GDAL → rasterio → xarray → dask stack is involved.

Rasterio recently updated their example of concurrent processing. This example specifically reads files from local disk rather than over a network.
https://rasterio.readthedocs.io/en/latest/topics/concurrency.html

There is some good discussion about multiple threads reading the same file concurrently here:
pangeo-data/pangeo-example-notebooks#21 (comment)

And finally the rasterio mailing list has some relevant discussion
https://rasterio.groups.io/g/main/topic/72528118#468

A simple example that shows timing improvement would go a long way in helping to clarify this for people. This might require a PR to xarray to deal with locks...
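A minimal sketch of the kind of timing comparison meant here, assuming the blocking read releases the GIL (as GDAL/rasterio window reads do during I/O). `read_window` is a hypothetical stand-in; real code would call something like `src.read(1, window=window)` on an open rasterio dataset, with appropriate locking.

```python
# Sketch: threaded vs. sequential reads when the "read" releases the GIL.
# time.sleep stands in for a blocking, GIL-releasing I/O call.
import time
from concurrent.futures import ThreadPoolExecutor

def read_window(window):
    """Hypothetical stand-in for e.g. src.read(1, window=window)."""
    time.sleep(0.05)  # simulated blocking I/O; releases the GIL like GDAL reads
    return window

windows = list(range(8))

t0 = time.perf_counter()
for w in windows:
    read_window(w)
sequential = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(read_window, windows))  # preserves input order
threaded = time.perf_counter() - t0

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

If file locks force the reads to serialize, the threaded timing collapses back toward the sequential one, which is exactly what a real-data version of this benchmark would reveal.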

scottyhq commented Dec 9, 2020

Discussed this a bit with @TomAugspurger today, and just added a notebook with some concrete comparisons:
https://github.com/pangeo-data/cog-best-practices/blob/main/rasterio-concurrency.ipynb

scottyhq commented Feb 11, 2021

@TomAugspurger thanks for the PR to rioxarray. It inspired me to put together another notebook with a few more details on aligning dask chunks with the data-store layout for the case of multiband COGs (your example NAIP image in Azure):
https://github.com/pangeo-data/cog-best-practices/blob/main/4-threads-vs-async.ipynb

Basically I'm trying to document how chunking decisions might depend on both (1) the data format / how bytes are organized in storage and (2) the hardware you're running on (how many CPUs).
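One way to act on point (1) is to snap the requested dask chunk size to a multiple of the COG's internal tile size, so each dask task reads whole tiles rather than partial ones. `aligned_chunk` below is a hypothetical helper (not part of rioxarray or dask), just to illustrate the idea:

```python
# Hypothetical helper: align a requested dask chunk size with the COG's
# internal block (tile) size, so each task reads whole tiles from storage.
def aligned_chunk(requested, block_size):
    """Largest multiple of block_size <= requested (minimum one block)."""
    return max(block_size, (requested // block_size) * block_size)

# e.g. a COG with 512x512 internal tiles and a requested ~2000-pixel chunk
print(aligned_chunk(2000, 512))  # -> 1536 (3 * 512)
print(aligned_chunk(100, 512))   # -> 512 (never smaller than one tile)
```

Point (2) then governs how many of these aligned chunks you want in flight at once, e.g. matching the thread count to the CPU count.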

In exploring the case of running some notebooks on mybinder.org, which has just 1 vCPU and 2 GB RAM, I became interested in the ability to still achieve high throughput with asyncio instead of multithreading. This is easy to achieve if bypassing GDAL entirely with a COG-specific library like https://github.com/geospatial-jeff/aiocogeo . It's unclear to me how to integrate that sort of access pattern with xarray / rioxarray, perhaps as a separate 'engine' with the ongoing xarray backend refactor?
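The reason asyncio helps on a single vCPU is that the workload is network-bound, not CPU-bound: many awaiting range requests can be overlapped in one thread. A minimal sketch, with `asyncio.sleep` standing in for an awaitable HTTP range request (the kind aiocogeo issues per COG tile; `get_tile` is hypothetical):

```python
# Sketch: overlapping many network-bound "tile fetches" on one event loop.
# asyncio.sleep stands in for an awaitable HTTP range request.
import asyncio
import time

async def get_tile(i):
    """Hypothetical stand-in for an async HTTP range request for tile i."""
    await asyncio.sleep(0.05)  # simulated network latency
    return i

async def fetch_all(n):
    # All n requests are in flight concurrently on a single thread.
    return await asyncio.gather(*(get_tile(i) for i in range(n)))

t0 = time.perf_counter()
tiles = asyncio.run(fetch_all(16))
elapsed = time.perf_counter() - t0
print(f"fetched {len(tiles)} tiles in {elapsed:.2f}s")  # ~0.05s, not 16 * 0.05s
```

Sixteen 50 ms requests complete in roughly the latency of one, without spawning any threads, which is why this pattern is attractive on a 1 vCPU binder instance.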

Also, bigger picture, I'm trying to tie this all together in a way that is straightforward for a user to run computations on a collection of COGs as an Xarray dataset with max efficiency and minimal custom code (#4).

@TomAugspurger

Thanks Scott, that's very useful. I wasn't aware of the different interleaving options. This could inform what a good default chunking should be for rioxarray.open_rasterio when the user specifies chunks=True (cc @snowman2).

still achieve high throughput with asyncio instead of multithreading.

Yeah, that's worth exploring more. A separate engine seems like a decent amount of work, but feels like the only way to really do it. I imagine that it could share a lot of the implementation of rioxarray, though I haven't looked closely. I wonder if anyone has experimented with an async wrapper for GDAL / rasterio.

@snowman2

Currently rioxarray.open_rasterio is slower at opening files due to using pyproj under the hood for CRS management:
corteva/rioxarray#234

So, the speed gains for reading in parallel will likely only be realized for very large rasters.

@scottyhq

A separate engine seems like a decent amount of work, but feels like the only way to really do it.

Right. I've had a hard time wrapping my head around this to be honest. It feels like there could be a separation of concerns between the xarray/rioxarray API and the filesystem I/O such as what @martindurant demoed with xarray+fsspec+zarr async access (https://www.youtube.com/watch?v=7XDBM3pW2ls).

But that leads to the question: can rasterio handle file-like objects from fsspec? I think the answer is no, based on rasterio/rasterio#977 (comment).

@martindurant

No, I don't think it can, and neither can some other drivers such as cfgrib. To deal with those via fsspec, we would need to leverage temporary local storage of the files. That gets tricky to organise with async, where we don't know beforehand which parts of the global dataset are to be accessed (unless xarray itself were to speak async, which is asking a lot).

@kylebarron

in exploring the case of running some notebooks on mybinder.org which has just 1vCPU and 2GB RAM I became interested in the ability to still achieve high throughput with asyncio instead of multithreading. this is easy to achieve if bypassing GDAL entirely with a COG-specific library like geospatial-jeff/aiocogeo . It's unclear to me how to integrate that sort of access pattern with xarray / rioxarray, perhaps as a separate 'engine' with the ongoing xarray backend refactor?

I was going to specifically mention aiocogeo if it hadn't already been mentioned. The upside is that it could potentially integrate well with Dask: from a cursory read of the Dask docs, it seems able to integrate with async code. The downside is that I don't believe aiocogeo handles reprojection, just loading and parsing data, so you'd need to implement that manually or still bring in pyproj/GDAL for that, I think.
