Description
Currently, `rioxarray.open_rasterio` builds a `DataArray` whose `_file_obj` is an xarray `CachingFileManager` that eventually calls `rasterio.open` on the user-provided input (typically a filename). When using rioxarray with `chunks=True` to do parallel reads from multiple threads, this can cause workers to wait on a lock, since GDAL file objects are not safe to read from multiple threads.
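For reference, here's a minimal sketch of that pattern, assuming xarray's `CachingFileManager` from `xarray.backends.file_manager` (the details of rioxarray's wiring differ, but the key point is that one manager means one shared file handle for all dask tasks):

```python
import rasterio
from xarray.backends.file_manager import CachingFileManager

# One manager caches a single file handle and hands that same handle
# to every caller, so concurrent reads have to serialize on a lock.
manager = CachingFileManager(rasterio.open, "example.tif")
with manager.acquire_context() as src:
    src.read(1)
```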
The snippet below demonstrates the issue by creating and then reading a `(4, 2560, 2560)` TIFF. It uses a `SerializableLock` subclass that adds a bit of logging when locks are requested and acquired.
```python
# file: rioxarray-locks.py
import logging
import threading
import time

import dask
import numpy as np
import rasterio
import rioxarray
from dask.utils import SerializableLock

logger = logging.getLogger("lock")
logger.setLevel(logging.INFO)
handler = logging.FileHandler("lock.log")
handler.setLevel(logging.INFO)
logger.addHandler(handler)


class InstrumentedLock(SerializableLock):
    """A lock that logs acquisition attempts and completions."""

    def __enter__(self, *args, **kwargs):
        ident = threading.get_ident()
        t0 = time.perf_counter()
        logger.info("GET lock %d from %d", id(self.lock), ident)
        result = super().__enter__()
        logger.info("GOT lock %d from %d waited %f",
                    id(self.lock), ident, time.perf_counter() - t0)
        return result


if __name__ == "__main__":
    # Generate a TIFF with 4 bands that could be read in parallel.
    a = np.random.randint(0, 255, size=(4, 10 * 256, 10 * 256), dtype="uint8")
    transform = rasterio.Affine(30.0, 0.0, 358485.0, 0.0, -30.0, 4265115.0)
    count, height, width = a.shape
    with rasterio.open('example.tif', 'w',
                       width=width, height=height, count=count,
                       dtype=rasterio.uint8,
                       crs='+proj=latlong', transform=transform) as dst:
        for band in range(count):
            dst.write(a[band], band + 1)

    lock = InstrumentedLock()
    ds = rioxarray.open_rasterio("example.tif", chunks=True, lock=lock)
    logger.info("-------------- read metadata ------------------")
    with dask.config.set(scheduler="threads"):
        ds.mean().compute()
```
Here's the output in `lock.log`:
```
GET lock 139663598009152 from 139664510379840
GOT lock 139663598009152 from 139664510379840 waited 0.000443
GET lock 139663598009152 from 139664510379840
GOT lock 139663598009152 from 139664510379840 waited 0.000405
-------------- read metadata ------------------
GET lock 139663598009152 from 139663571498752
GET lock 139663598009152 from 139663563106048
GET lock 139663598009152 from 139663554713344
GET lock 139663598009152 from 139663343875840
GOT lock 139663598009152 from 139663571498752 waited 0.001341
GOT lock 139663598009152 from 139663563106048 waited 0.703704
GOT lock 139663598009152 from 139663554713344 waited 0.711287
GOT lock 139663598009152 from 139663343875840 waited 0.714647
```
The section after `--- read metadata ---` is what matters: all four threads try to take the same lock. The first succeeds quickly, but the rest all have to wait. Interestingly, it seems that someone (probably GDAL) has cached the array, so the actual reads of the second through fourth bands are fast: the waits are all around 0.7 s, rather than stacking up to roughly 0.7, 1.4, and 2.1 s as they would if each band actually took 0.7 s to read. I don't know enough about GDAL / rasterio to say when pieces of data are read and cached versus not read at all, but I suspect it depends on the internal chunking of the TIFF file and the chunking pattern of the request.
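One quick way to poke at that suspicion (an inspection sketch, assuming the `example.tif` written above) is to ask rasterio for the file's internal block layout:

```python
import rasterio

# GDAL reads and caches whole internal blocks, so how these blocks
# line up with the requested dask chunks determines how much of the
# file a single locked read pulls into the cache.
with rasterio.open("example.tif") as src:
    print(src.block_shapes)  # per-band (rows, cols) of the internal blocks
    print(src.is_tiled)      # False here: the file above is written in strips
```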
Another note: this only affects multi-threaded reads. If we change to `dask.config.set(scheduler="processes")`, we don't see any waiting for locks. That's because the lock objects themselves are different after they've been serialized (see the sketch after the log below).
```
GET lock 140529993845664 from 140530906224448
GOT lock 140529993845664 from 140530906224448 waited 0.000643
GET lock 140529993845664 from 140530906224448
GOT lock 140529993845664 from 140530906224448 waited 0.000487
-------------- read metadata ------------------
GET lock 140451225032720 from 140452137330496
GOT lock 140451225032720 from 140452137330496 waited 0.000512
GET lock 140156806203456 from 140157718505280
GOT lock 140156806203456 from 140157718505280 waited 0.001993
GET lock 140098670495760 from 140099582793536
GOT lock 140098670495760 from 140099582793536 waited 0.003202
GET lock 140257691564096 from 140258603865920
GOT lock 140257691564096 from 140258603865920 waited 0.000851
```
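To make the serialization point concrete, here's a small sketch of how I understand `SerializableLock`: pickling preserves only a token, which resolves to the same underlying lock within a process but to a brand-new lock in a worker process that has never seen that token:

```python
import pickle
from dask.utils import SerializableLock

lock = SerializableLock()
clone = pickle.loads(pickle.dumps(lock))

# Within one process the token maps back to the same underlying lock,
# which is why the threaded scheduler contends on it. A fresh worker
# process has no entry for the token, so it builds a new, uncontended lock.
assert clone.token == lock.token
assert clone.lock is lock.lock
```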
As for potential solutions: one candidate is to pass around URIs like `example.tif` rather than cached file objects, so that each thread can open its own file object. I think this would have to be an option controlled by a parameter, since the tradeoff between the cost of opening a new file and the benefit of reusing an open file object can't be known ahead of time.
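To illustrate, here's a hypothetical sketch of that option (`ThreadLocalOpener` is a made-up name, not an existing rioxarray or rasterio API): keep the URI, and let each thread lazily open, then reuse, its own dataset handle:

```python
import threading

import rasterio


class ThreadLocalOpener:
    """Hypothetical helper: one rasterio dataset handle per thread."""

    def __init__(self, uri):
        self.uri = uri
        self._local = threading.local()

    @property
    def dataset(self):
        ds = getattr(self._local, "ds", None)
        if ds is None:
            # Each thread pays the open cost once, then reads without
            # contending on a shared lock.
            ds = self._local.ds = rasterio.open(self.uri)
        return ds


opener = ThreadLocalOpener("example.tif")
# e.g. inside a dask task: opener.dataset.read(1, window=...)
```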
I haven't looked closely at whether an implementation like this is easily doable, but if there's interest I can take a look.
cc @scottyhq as well, in case this interests you :) xref to pangeo-data/cog-best-practices#2 and https://github.com/pangeo-data/cog-best-practices/blob/main/rasterio-concurrency.ipynb.