-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
concurrent reads with dask and xarray #2
Comments
discussed this a bit with @TomAugspurger today, and just added a notebook with some concrete comparisons |
@TomAugspurger thanks for the PR to rioxarray, it inspired me to put together another notebook with a few more details on aligning dask chunks with data store for the case of multiband COGs (your example NAIP image in Azure) basically I'm trying to document how chunking decisions might depend on both 1. the data format / how bytes are organized in storage and 2. the hardware you're running on (how many CPUs). in exploring the case of running some notebooks on mybinder.org which has just 1vCPU and 2GB RAM I became interested in the ability to still achieve high throughput with asyncio instead of multithreading. this is easy to achieve if bypassing GDAL entirely with a COG-specific library like https://github.com/geospatial-jeff/aiocogeo . It's unclear to me how to integrate that sort of access pattern with xarray / rioxarray, perhaps as a sepearate 'engine' with the ongoing xarray backend refactor? also bigger picture trying to tie this all together in way that is straight-forward for a user to run computations on a collection of cogs as an Xarray dataset with max efficiency and minimal custom code (#4) |
Thanks Scott, that's very useful. I wasn't aware of the different interleaving options. This could inform what a good default chunking should be for
Yeah, that's worth exploring more. A separate engine seems like a decent amount of work, but feels like the only way to really do it. I imagine that it could share a lot of the implementation of rioxarray, though I haven't looked closely. I wonder if anyone has experimented with an async wrapper for GDAL / rasterio. |
Currently So, the speed gains for reading in parallel will likely only be realized for very large rasters. |
Right. I've had a hard time wrapping my head around this to be honest. It feels like there could be a separation of concerns between the xarray/rioxarray API and the filesystem I/O such as what @martindurant demoed with xarray+fsspec+zarr async access (https://www.youtube.com/watch?v=7XDBM3pW2ls). But that leads to the question can rasterio handle file-like objects from fsspec? and I think the answer is No based on rasterio/rasterio#977 (comment) |
No, I don't think it can, and neither can some other drivers such as cfgrib. To deal with those via fsspec, we would need to leverage temporary local storage of the files. That gets tricky to organise with async, where we don't know beforehand which parts of the global dataset are to be accessed (unless xarray itself were to speak async, which is asking a lot). |
I was going to specifically mention aiocogeo if it hadn't already been mentioned. The upsides are that it could potentially integrate well with Dask? From a cursory read of the Dask docs, it seems to be able to integrate with async code? Downsides are that I don't believe aiocogeo handles reprojection: just loading and parsing data, so you'd need to implement that manually or still bring in pyproj/GDAL for that I think. |
I've never been clear on whether or not xarray+dask computations using threads are hampered by file locks. It's complicated because GDAL-->rasterio-->xarray-->dask are all involved.
Rasterio recently updated their example of concurrent processing. This example specifically reads files from local disk rather than over a network.
https://rasterio.readthedocs.io/en/latest/topics/concurrency.html
There is some good discussion here about multiple threads reading the same file concurrently here
pangeo-data/pangeo-example-notebooks#21 (comment)
And finally the rasterio mailing list has some relevant discussion
https://rasterio.groups.io/g/main/topic/72528118#468
A simple example that shows timing improvement would go a long way in helping to clarify this for people. This might require a PR to xarray to deal with locks...
The text was updated successfully, but these errors were encountered: