-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Possible approach for stacking items #75
base: main
Are you sure you want to change the base?
Conversation
# if isinstance(self.urlpath, list): | ||
# self._can_be_local = fsspec.utils.can_be_local(self.urlpath[0]) | ||
# else: | ||
# self._can_be_local = fsspec.utils.can_be_local(self.urlpath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martindurant following up on your comment on the last PR about intake-xarray and it's design and purpose. It seems to me that some of the complication with intake-xarray is the integration with fsspec and ultimately whether the library dealing with I/O can handle fsspec filelike objects.
In the case of rioxarray / xr.open_rasterio. rasterio.open() does not know what to do with fsspec objects so we just pass urls around. and unless i'm missing something there is no need for fsspec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fsspec might be useful for expanding glob-strings; also, in some cases the best route is to have fsspec make local copies and pass the paths of those. It's the case where the xarray backend has it's own internal URL handling that's thorny. fsspec might still be useful for expanding globs, but neither file-likes or local copies are appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the feedback @martindurant !
It's the case where the xarray backend has it's own internal URL handling that's thorny.
yes, and with the backend refactor ongoing it's especially confusing for me to understand. there is a proposal in rioxarray to handle fsspec objects corteva/rioxarray#246 , so maybe eventually more xarray backends will support this...
fsspec might still be useful for expanding globs, but neither file-likes or local copies are appropriate.
i think this is another case where intake-stac is a bit weird compared to other intake plugins. since STAC is already a catalog and we're basically translating to a new catalog type, i don't think there is a case where someone would manually write or save an intake yml version. And consequently there won't be any glob patterns in the URLs...
in the absence of fsspec integrations the key goals of intake-stac in my mind are:
- provide a common interface to explore catalogs and load data (.to_dask()) regardless of file type (.jp2, .tif, .nc)...
- provide convenience functions for complicated loading of many assets (arguably this could also be implemented as an open_mfrasterio() in rioxarray, but having the STAC metadata allows us to take shortcuts or do things otherwise impossible with just filenames
- tap into the holoviews system for visualization
# NOTE: don't know what's going on here... | ||
# https://github.com/intake/intake-xarray/issues/20#issuecomment-432782846 | ||
def _get_schema(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@martindurant I had a hard time following what is going on here (copied from https://github.com/intake/intake-xarray/blob/master/intake_xarray/raster.py). It seems like this could be greatly simplified or deleted if we assume we are just referencing URLs or local filepaths and not using fsspec ?
# if self._can_be_local: | ||
# files = fsspec.open_local(self.urlpath, **self.storage_options) | ||
# else: | ||
# pass URLs to delegate remote opening to rasterio library | ||
# files = self.urlpath | ||
# files = fsspec.open(self.urlpath, **self.storage_options).open() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again i feel like none of this is needed unless the backend library knows what to do with fsspec things?:
with fsspec.open(url) as f:
rioxarray.open_rasterio(f)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, if we know for certain that we want original URLs passed through to the xarray backend, then fsspec has much less/nothing to do. It means there needs to be an alternate way to pass IO parameters (e.g., credentials) to that backend if it needs them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It means there needs to be an alternate way to pass IO parameters (e.g., credentials) to that backend
I assume this means setting GDAL configs @scottyhq
metadatas = {'items': {}} | ||
hrefs = [] | ||
for item in items: | ||
metadatas['items'][item] = {'STAC': self[item].metadata, 'assets': {}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the way i ended up structuring this it ended up being basically a loop around StacItem.stack_bands()
so could probably just call that method instead of a bunch of duplicate code?
def stack_bands(self, bands, path_as_pattern=None, concat_dim='band'): | ||
def stack_assets(self, assets, path_as_pattern=None, concat_dim='band', override_coords=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed again to be less EO-specific and more in line with the STAC assets. i don't think too many people are using this library yet, so i'm not concerned with disruptive changes.
Coordinate alignment is performed automatically upon calling the | ||
`to_dask()` method to load into an Xarray DataArray if bands have diffent | ||
ground sample distance (gsd) or array shapes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i learned the hard way that this isn't exactly true. it works for landsat bands that were aligned based on GSD, but here are two with the same CRS and transform, different GSD that xarray fails to do what we expect... i'll open an issue to illustrate this (https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B04.tif, https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B05.tif)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it might be best to raise a warning if CRS, Affine, and GSD do not match for requested stacked assets
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @scottyhq,
i stumbled upon this, do you have a suggestion how to go about stacking (sen2) bands with different GSD to be later used for lazy-loading in a dask array?
Thanks,
Vitus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vitusbenson i'd recommend explicitly interpolating, or using a gdal vrt
url4 = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B04.tif'
url5 = 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/36/M/YB/2020/8/S2A_36MYB_20200814_0_L2A/B05.tif'
da4 = rioxarray.open_rasterio(url4, chunks='auto').assign_coords(band=['B4']) #gsd=10
da5 = rioxarray.open_rasterio(url5, chunks='auto').assign_coords(band=['B5']) #gsd=20
da5_interp = da5.interp(y=da4['y'], x=da4['x'], method='nearest')
da = xr.concat([da5_interp, da4], dim='band') #gsd=10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - I'll have to give that a shot to compare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I wasn't aware of rioxarray's interpolation capabilities - really useful!
__version__ = get_distribution('intake_stac').version | ||
|
||
|
||
class RioxarraySource(DataSource, PatternMixin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intended to be a simplified version of https://github.com/intake/intake-xarray/blob/master/intake_xarray/raster.py seeing what is the minimal driver if intake-xarray were to be separated out per xarray backends (netcdf, zarr, raster, etc)
Your issues are similar to the general conversation in intake-xarray and my fsspec/zarr PR to xarray: that the many backends need different inputs, so it's difficult to know when to invoke fsspec. I don't know how to make a comprehensive solution. Maybe keeping things specific, like here, is the way forward. |
Hi @scottyhq - thanks for putting this together. Still working through the code and getting familiar with how STAC collections get mapped into intake-stac - so apologies if I am missing some key points in the suggestions below! I think adding these types of features (stacking/mosaicing) into intake-stac would really improve a lot of workflows.
If dropping fsspec use and going straight to urls, we could also leverage the STAC metadata to create xarray objects with dask.delayed and dask.array.from_delayed that would be fast. Here is an example opening 500 assets in ~1 sec: https://gist.github.com/rmg55/ae5bc19b94345cc2033ef9261a229e24
I am also having trouble wrapping my head around the best approach for this. I am conceptualizing this into two categories spatial mosaic and temporal stacking. Here are a few thoughts about what might be great to try and support long-term (realizing we probably want to start small and build features in). Mosaic: Images collected at the same time and need to be stitched together. This is probably better accomplished with the stac_vrt approach. However, I think an interesting intake-stac + stac_vrt workflow could accomplish something like:
Stacking: Temporally stacking assets to create a time series (x,y,band(s),time). I think this could be accomplished in either intake-stac (perhaps with the approach mentioned above) or in stac_vrt. Thoughts below on using both approaches:
I'd be happy to contribute to this effort once we can settle on a best approach. |
Thanks @rmg55 this is awesome! This approach of getting the dimension and coordinate info from STAC JSON to delay file I/O is really slick! The disadvantage I see is the possibility for discrepancies when data is actually read. For example, note that you have an error with coordinate parsing in your function, which I came across when I tried to plot the resulting dataarray: def _delayed_open(stac_item, band):
transform = stac_item['assets'][band]['proj:transform'][0:6]
nx,ny = stac_item['assets'][band]['proj:shape']
# NOTE: xarray and rioxarray docs are wrong here, no need for from_gdal...
# see https://github.com/pydata/xarray/issues/5005
#x, y = Affine.from_gdal(*transform) * (np.arange(nx)+0.5, np.arange(ny)+0.5)
x, y = Affine(*transform) * (np.arange(nx)+0.5, np.arange(ny)+0.5)
href = stac_item['assets'][band]['href']
date = datetime.datetime.fromisoformat(stac_item['properties']['datetime'][0:10]).date()
# changed to darray because i always do `da=xr.open_rasterio()` ;)
data = darray.from_delayed(dask.delayed(xr.open_rasterio)(href).data, (1,nx,ny), dtype=float)[np.newaxis,:,:,:]
# swap order of x and y to be consistent with ordering returned from xr.open_rasterio
da = xr.DataArray(data=data, coords=([date],[band],y,x),dims=('time','band','y','x'))
return da you bring up good points and great ideas for stacking and mosaicing options, continuing to prototype some examples here or via gists is definitely helping to figure out an approach. |
@scottyhq - thanks for pointing out those bugs in the gist (updated recently to incorporate your edits). Agreed on your assessment that errors are not surfaced until the data is read - which could make tracing errors more confusing for users. Also, I spent some time exploring the stackstac project by @gjoseph92 - which is really awesome. Here is a gist of using it with NASA HLS STAC: https://gist.github.com/rmg55/b144cb273d9ccfdf979e9843fdf5e651 One thing to note is that stackstac is able to work with STACs that are missing the projection extension by using the bbox, user input on resolution+crs:epsg, and warped_vrt. Note this can lead to slightly different shape/transform than the original file (see the above gist). Perhaps something to consider for stac_vrt as well? |
Hey @scottyhq! It does seem like we're working on similar things. Do you think stackstac could be helpful here? I could imagine intake-stac using it under the hood to generate the xarray, but I'm not sure if there are any intake-specific details that it can't handle right now. |
Definitely! I abandoned this PR for a bit but would like to come back to it soon with STAC 1.0 coming out in the near future. I think I'll probably delay integrating stackstac and do that in a follow up. I'm also watching corteva/rioxarray#246 and the linked PR in rasterio rasterio/rasterio#2141 , which will have some bearing on how fsspec and intake-xarray are currently used (as discussed in above comments) |
This approach is much less-sophisticated than stackstac, so I think I might abandon this PR, or need to re-do the .stack() method adding an optional dependency on stackstac. That said, I think it would be nice to have consistency in what |
👍 great point about the consistent CRS handling. I'm planning to turn to that after getting pystac 1.0 support and collection-level assets handled in intake-stac. FWIW, my hope is that intake-stac can be a pretty simple wrapper around pystac, pystac-client, and stackstac. Ideally we limit the complexity here. |
In full agreement :) I just added you as an admin for this repo @TomAugspurger, so that you have the ability to merge things |
Been working on this PR for a couple days to try and understand nuances of intake by splitting off from the intake-xarray plugin and thinking about how stacking items might work in practice.
I'm also trying to sort out the advantages of intake-stac versus working directly with STAC catalogs and stac-utils/stac-vrt#11. So I'm going to add some comments in the code and try to get feedback from @jhamman @martindurant @TomAugspurger and @rmg55 .
Currently this is a functional implementation that opens files in sequence, which is fine for small catalogs, but scaling up it would probably be smart to add an optional dependency on
stac-vrt
and inject the build_vrt() function as a catalog method.addresses #65