Clarify media-type and format negotiation of Part 3 collections #427

Open · fmigneault opened this issue Jul 16, 2024 · 12 comments

@fmigneault (Contributor)

When a process is executed with a Part 3 collection (rather than value or href), the implication seems to be that the data passed as input or produced as output of the process will be a GeoJSON FeatureCollection, or an equivalent representation such as a STAC Collection or GML.
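
For reference, a minimal sketch of the two execution request styles being contrasted (input names and URLs are hypothetical, not from any actual server); the questions below are about what a process description must declare for the second style to be valid:

{
  "inputs": {
    "data-by-reference": {
      "href": "https://example.com/files/scene.tif",
      "type": "image/tiff; application=geotiff"
    },
    "data-by-collection": {
      "collection": "https://example.com/ogcapi/collections/scenes"
    }
  }
}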

Given this implication, can it be assumed that only process descriptions that explicitly support format: geojson-feature-collection (or alternate "collection-like" formats) can safely employ the collection input/output? Would a process expecting, for example, a GeoTiff automatically be invalid if a collection was specified?

Given that a collection can also imply an array of "whatever item the collection contains", would a process expecting an input array of GeoTiffs be a valid candidate for a STAC Collection that happens to contain GeoJSON Items with STAC Assets employing GeoTiff media-types? Would this process still need to indicate format: geojson-feature-collection support explicitly (IMO, yes) to avoid ambiguity regarding how those GeoTiffs should be parsed?

Are there other use-cases not mentioned above to consider?

The Part 3 document needs to clarify these considerations so that the expected behaviors of implementations can somewhat "agree" and improve the chances of interoperability. The versatility of collection, although useful in some cases, acts as a double-edged sword when describing and executing processes, since processing intentions can become too abstract.

@jerstlouis (Member) commented Jul 17, 2024

Given this implication, can it be assumed that only process descriptions that explicitly support format: geojson-feature-collection (or alternate "collection-like" formats) can safely employ the collection input/output? A process expecting, for example, a GeoTiff would be automatically invalid if a collection was specified?

No, that is not a correct assumption.

The term "collection", as used in draft OGC API - Common - Part 2, and as used in Processes - Part 3: Workflows, does not imply that the data is a collection of vector features. This was discussed in details in opengeospatial/ogcapi-common#140.

An OGC API collection is a "collection of data", typically including 2 or 3 spatial dimensions.
This data could be anything: a feature collection, a 2D/3D/nD regular or irregular gridded coverage, maps, imagery, 3D meshes, point clouds, etc. The different OGC API data access standards (Features, EDR, Tiles, Maps, Coverages, DGGS...) define different mechanisms to access data from a collection. Some of these APIs, such as Tiles, support accessing several of these data types (for example, OGC API - Tiles can be used with all of the data types I just mentioned).

If you meant that it implies the existence of a /collections/{collectionId}/items/{itemId} end-point, this is specifically defined for the OGC API - Features access mechanism (and Records "Features API" aka Searchable Catalog deployment type -- but this is another story, since Records really is not a "data access mechanism", but a discovery API, which is why I am so strongly opposed philosophically to this "Features API" deployment type).

The "STAC collection", which is also modeled after OGC API - Features and also uses /items, is also particular, because again it somewhat mixes up data and metadata.
Conceptually, the actual data of interest for the processing for a STAC collection is not the STAC items, but the data inside of the STAC assets. We discussed this a little bit with @m-mohr in another issue that I cannot find right now.

So if we consider STAC + COG as a valid "data access mechanism" for Collection Input and/or Output, it needs to be clear that the "input" / "output" is not the items, but the actual COG data (all of the assets combined comprising a single coverage output in Part 1 terms, where assets of different bands would either be considered a dimension or different fields of the coverage, and different items would be at other spatial or temporal coordinates within the coverage -- you could combine the whole thing for a particular AoI/ToI/RoI within a single netCDF file for example).
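
To illustrate, a trimmed sketch of a hypothetical STAC Item (required fields like links, bbox and collection omitted; IDs and URLs invented): the three band assets together are what would form the single coverage input, each asset being one band/field of the coverage in Part 1 terms, rather than the Item itself being the input.

{
  "type": "Feature",
  "stac_version": "1.0.0",
  "id": "scene-2024-07-16",
  "geometry": { "type": "Point", "coordinates": [ -71.2, 46.8 ] },
  "properties": { "datetime": "2024-07-16T00:00:00Z" },
  "assets": {
    "B04": { "href": "https://example.com/scenes/scene-2024-07-16-B04.tif", "type": "image/tiff; application=geotiff; profile=cloud-optimized" },
    "B03": { "href": "https://example.com/scenes/scene-2024-07-16-B03.tif", "type": "image/tiff; application=geotiff; profile=cloud-optimized" },
    "B02": { "href": "https://example.com/scenes/scene-2024-07-16-B02.tif", "type": "image/tiff; application=geotiff; profile=cloud-optimized" }
  }
}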

It would also be possible to offer both OGC API - Coverages (/coverage) as well as a STAC (/items)+COG for the same collection, but not both Features and STAC because both use the same /items end-point.

would a process expecting an input array of GeoTiff be a valid candidate for a collection with a STAC Collection that happens to contain those GeoJSON Items with STAC Assets employing GeoTiff media-types?

With the first-class collection object, while such a STAC+COG collection could be a valid "Remote Collection Input" to put in "collection" input, the input conceptually would be the actual coverage data, so that you should also be able to actually pass to that process as an input an "href" with a GeoTIFF or a base64-encoded GeoTIFF when not using Collection Input. This does pose some challenges in terms of GeoTIFF being only 2D and not able to encode different bands at different resolution.

This goes back to some of our previous discussions, but in terms of the implementation, it does mean that you can implement all of the OGC API - Processes 1/3 and OGC API data access mechanisms client machinery separately from the processes, and feed the process the actual data directly (the subset of the coverage cells retrieved from the COGs as HTTP range requests, after having parsed the STAC items, in this case). After the process receives an input, it does not need to make any further network requests to retrieve data to do the actual processing. This makes the most sense to me thinking of the processes as "functions" from which all this API / remote access stuff should be abstracted away, and in terms of not having to duplicate the same functionality in the implementation of every process.

Preprocessing inputs / accessing remote collections is a machinery thing -- not a process thing.

Would this process still need to indicate format: geojson-feature-collection support explicitly (IMO, yes) to avoid ambiguity regarding how those GeoTiff should be parsed?

No, because the actual data (the coverage encoded in all the GeoTIFF assets) is not a GeoJSON feature collection, but a 2D gridded coverage (or 3D if the different GeoTIFFs form a temporal dimension).

As mentioned above, it would also be possible to offer that coverage data using OGC API - Coverages, OGC API - Tiles, OGC API - EDR, as well as STAC + COG all from within the same /collections/{collectionId}, so that the client can pick its preferred/most efficient access mechanism.
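
For example, the collection document could advertise several access mechanisms at once through its links, and a Part 3 client receiving only the collection URI would then pick whichever it supports. A hedged sketch (hypothetical URLs; link relation values approximate):

{
  "id": "sentinel2-l2a",
  "links": [
    { "rel": "http://www.opengis.net/def/rel/ogc/1.0/coverage", "type": "image/tiff; application=geotiff", "href": "https://example.com/ogcapi/collections/sentinel2-l2a/coverage" },
    { "rel": "items", "type": "application/geo+json", "href": "https://example.com/ogcapi/collections/sentinel2-l2a/items" },
    { "rel": "http://www.opengis.net/def/rel/ogc/1.0/tilesets-coverage", "type": "application/json", "href": "https://example.com/ogcapi/collections/sentinel2-l2a/tiles" }
  ]
}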

@fmigneault (Contributor, Author)

Given the shared use of collection across (most) OGC APIs (/collections/{}/{maps|tiles|coverages|...}), which can be mixed simultaneously to group "common data of different nature", I tend to agree that "Collection" should be generic, not just FeatureCollection. Thanks for confirming.

So if we consider STAC + COG as a valid "data access mechanism" for Collection Input and/or Output, it needs to be clear that the "input" / "output" is not the items, but the actual COG data

This is exactly my concern. It is not really "clear" unless you happen to know all the involved data access mechanisms, which is not the case for most users.

"How" do we indicate that clearly?

A STAC Item could contain Assets with respective single-band GeoTiffs, virtual Assets that pre-combine RGB into another GeoTiff, a Point Cloud LAZ, a DEM, netCDF weather data, and a GeoJSON FeatureCollection with class labels, all at the same time. It could also have many other media-type variants of all this data, simply to offer alternative/equivalent formats that can be retrieved as desired.

In a situation where only the RGB GeoTiffs are the relevant Assets for the Process execution, does that mean it is up to the execution body to pass the appropriate filter along with the collection, so that only the href matching the RGB Assets is returned? I think this can be possible with a sophisticated CQL2 definition, but it can become very convoluted depending on "what" needs to be obtained, and considering all the media-types/formats that could be present.
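
For instance, the best I can express today is an item-level filter (a hedged sketch; the collection URL and the eo:cloud_cover property are illustrative assumptions), which still says nothing about which Assets inside the matched Items should actually be used:

{
  "inputs": {
    "imagery": {
      "collection": "https://example.com/stac/collections/training-scenes",
      "filter-lang": "cql2-text",
      "filter": "eo:cloud_cover < 10"
    }
  }
}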

Note

I strongly believe providing explicit examples of these convoluted use cases in the specification could help readers understand the "best practices" and expectations.

This makes the most sense to me thinking of the processes as "functions" from which all this API / remote access stuff should be abstracted away, and in terms of not having to duplicate the same functionality in the implementation of every process.

I agree. This is how I'm implementing it as well. This is why I want to get as much detail as possible regarding how the "collection -> value/href" conversion must be done. The Process receiving the resolved collection data would deal with it as if it had been provided explicitly by value or href, and would essentially be unaware of "how" the collection was resolved beforehand.
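
In other words, an illustrative before/after sketch (not an actual document structure; URLs hypothetical) of the machinery turning the submitted collection into something the process already understands:

{
  "submitted-to-the-server": {
    "inputs": {
      "imagery": { "collection": "https://example.com/ogcapi/collections/scenes" }
    }
  },
  "resolved-for-the-process": {
    "inputs": {
      "imagery": {
        "href": "https://example.com/ogcapi/collections/scenes/coverage",
        "type": "image/tiff; application=geotiff"
      }
    }
  }
}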

@jerstlouis (Member) commented Jul 19, 2024

"How" do we indicate that clearly?

STAC+COG being a special case, we should have a conditional requirement that makes this clear.

A STAC Item could contain Assets with respective single-band GeoTiffs, virtual Assets that pre-combine RGB into another GeoTiff, a Point Cloud LAZ, a DEM, netCDF weather data, and a GeoJSON FeatureCollection with class labels, all at the same time.

That clearly does not fit that usage pattern :)
My opinion is that such a chaotic heterogeneous set of STAC items is not a good fit for the Collection Input / Output pattern.
Those STAC items could be used as inputs to a process in the regular Part 1 way, by providing one or more links to the STAC items or to the STAC collection using "href".
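
e.g., something along these lines (a sketch with hypothetical URLs), where the references are plain Part 1 href inputs rather than a Collection Input:

{
  "inputs": {
    "scenes": [
      { "href": "https://example.com/stac/collections/mixed/items/item-1", "type": "application/geo+json" },
      { "href": "https://example.com/stac/collections/mixed/items/item-2", "type": "application/geo+json" }
    ]
  }
}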

To fit the Collection Input/Output pattern, you would need to organize your STAC items as one collection per asset data type, where all assets together form a single Feature Collection or Coverage or Point Cloud. They could still all be on the same STAC API, just one collection per GeoDataClass data layer.

It could also have many other media-type variants of all this data only to offer alternative/equivalent formats that can be retrieved as desired.

That could be fine as long as they all describe the same data of the same type, and as long as the STAC specification / extensions somehow make it clear to the client that these are alternative formats that can be used and that represent the same data.

does that mean it is up to the execution body to pass the appropriate filter as well as the collection to return only the href matching the RBG Assets?

I'm not familiar with the STAC API, but the Features CQL2 filter would only allow you to filter at the items level, not the assets level. So while a filter could allow using heterogeneous STAC items by producing a consistent set of filtered items, I don't think the same would work if it is the assets that are heterogeneous?

@fmigneault (Contributor, Author)

That clearly does not fit that usage pattern :)
My opinion is that such a chaotic heterogeneous set of STAC items is not a good fit for the Collection Input / Output pattern.

Interestingly enough though, that is the case I'm seeing as the most relevant to support GDC for my server.
GeoTiff coverage + Climate data + GeoJSON annotations simultaneously for AI training... yes please!

Cross-collection use cases are also defined in TB20 to combine EO and Climate/Weather data.
Therefore, we must expect to have to deal with very heterogeneous data types and structures.

TBH, I think OGC APIs are just as chaotic. The only difference is that different "data access mechanisms" are used for these various formats instead of STAC extensions. The problem remains the same, it just moves: a client submitting the request must still somehow indicate which items are of interest to resolve, and this dictates which API endpoints/requests to use and which data type to extract from a "Collection" that can contain many things.

Many of the OGC API endpoints can return similar media-types. Therefore, it is not sufficient to use only this information to resolve which endpoints to request when only the collection reference is provided. For example, Coverages, Maps, and Tiles could all return GeoTiff. Which one is the most relevant could entirely depend on which process is being called. When combined with Part 2, where processes are dynamically deployed, this resolution cannot be defined in advance on the server. Either a hint like format: geojson-feature-collection must be provided, or an explicit parameter must be submitted in the execution request.

To fit the Collection Input/Output pattern, you would need to organize your STAC items as one collection per asset data type, where all assets together form a single Feature Collection or Coverage or Point Cloud. They could still all be on the same STAC API, just one collection per GeoDataClass data layer.

I strongly disagree with this. I do not see any difference between accessing /collections/x/items and /collections/x/coverage to obtain corresponding features and imagery over a certain AOI/TOI. Both of these data types could still live under the same x collection with OGC APIs as well. STAC only makes this relationship more obvious by grouping the corresponding elements by Asset under a common Item.

I don't think the same would work if it is the assets that are heterogeneous?

I think so too since Assets cannot be filtered currently by Features CQL2.

@jerstlouis (Member)

GeoTiff coverage + Climate data + GeoJSON annotations simultaneously for AI training... yes please!

Therefore, we must expect to have to deal with very heterogeneous data types and structures.

I am not saying that it is not a valid scenario to provide all those things together.

What I am saying is that if you have STAC items for all of these, each type of data to be used with "Collection Input / Output" should be its own collection: the GeoTIFF coverage, the individual climate dataset (though multiple variables can be different fields of the same coverage), and the GeoJSON annotations. All collections could still be provided from the same STAC API end-points, or be provided together as multiple inputs to a process.

I do not see any difference with accessing /collections/x/items and /collection/x/coverages to obtain corresponding features and imagery over a certain AOI/TOI. Both of these data types could still live under the same x collection with OGC APIs as well.

I am not sure what you are saying here. Are you talking about /items with annotations for the coverage, or /items as STAC items describing the assets of the coverage?

It is possible to do only ONE of these two things, because both STAC and Features use /items:

  • Have both a /coverage and a /items STAC end-point which links to equivalent coverage assets for the same collection
  • Have both a /coverage and a /items Features end-point where the features are an alternative representation of the coverage (e.g., Point Clouds could be thought of as both a Coverage, perhaps available as .laz files, and a multi-point feature collection, where the classification is a feature ID)

For example, Coverages, Maps, Tiles could all return GeoTiff. Which one is the most relevant could entirely depend on which process is being called.

I think a key principle here is that for a particular collection, all data access mechanisms (Coverages, Tiles, DGGS, Maps, EDR) represent the same data, and accessing the data using any of them should produce equivalent results. In the case of a STAC /items, those items are actually metadata as a stepping-stone to the actual data in the assets.

Maps / Map Tiles would normally not be used for processing workflows unless it is the only available access mechanism, because Maps are a visual representation of the data typically not intended for processing, except if the data is (A)RGB imagery.

So the choice of the API access mechanism should be agnostic of the process, and it is really up to the implementation of the server and client to decide whether to use Coverages or Coverage Tiles or DGGS or EDR or accessing STAC assets to retrieve the data.

For a STAC collection to fit the Collection Input / Output paradigm, the assets it describes must fit the concept that not only the metadata (the STAC items) have a consistent schema, but the actual data of the assets must also have a consistent schema, since these assets are what would actually get fed to a process receiving this collection as an input (not the STAC items).

@fmigneault (Contributor, Author)

[...] should each be their own collections

I do not see why that should be the case. The whole point of "Collection", as per its own definition, is that it can be an agglomeration of different data representations and formats. If one is forced to split a dataset that combines imagery, weather data and annotations representing some "event" into distinct collections because of their different types/formats/representations, it defeats the main purpose of that collection/dataset, which is to offer cross-domain annotations. Such a dataset could be the result of curated expert annotations that cannot "simply" be obtained by joining collections based on AOI/TOI/etc.

Splitting into distinct collections also causes another issue. It doesn't "naturally group" together the corresponding data that represents the whole dataset. Users have to "somehow" figure out that collection-imagery, collection-weather and collection-annotations need to be combined within some filter-x to obtain the full picture. Each of those collections could be meaningless on their own.

I am not sure what you are saying here [...]

What I meant is that, in OGC APIs, a /collections/x that happens to contain both imagery and features, for example, would be accessed by (at least) 2 requests, using /collections/x/coverage and /collections/x/items (or any other appropriate endpoints) to obtain the data in each representation. The results then have to be joined, either using some matching id, by corresponding AOI/TOI, or any other relevant filter for the use case that employs both simultaneously.

On the other hand, using a STAC /collections/x, both of the imagery and feature Assets are obtained with the same /collections/x/items endpoint. One does not need to worry if the "join" aligns, because data types are natively grouped properly. Instead of figuring out how to "join" the data from 2 responses, the same response can be reused for both data representations as is. The problem is that one must instead figure out which relevant data Assets to extract from the single response using distinct field modifiers.

In the end, with either approach, there is a need to filter/join/parse some specific data structure, which depends on the specific metadata properties, media-types, etc. returned by the API of choice, and this cannot be aligned with what the process expects unless it provides a format hint indicating which one is appropriate for its use case.

IMO, the fact that distinct OGC APIs return specific data types is not any different from parsing STAC Items that happen to have the same data distributed in their Assets. Resolving the OGC /coverage and /items sub-paths is the same as resolving $.assets["imagery"] and $.assets["annotations"] in STAC. When the client sends the execution request with a collection reference, there must be "something" that indicates which one to pick.

It is possible to do only ONE of these two things

STAC could also simply introduce a way to filter assets, and ignore the OGC API Coverages approach...
It is already possible as a "post-filter" step in the Python client. The same could be provided in the STAC API directly. I have seen some discussions about adding asset-filtering capabilities, similar to the $.assets["imagery"] JSON path above, or using additional query parameters such as ?assets=[CQL2]. The fact that STAC aligns with OGC API - Features does not mean it cannot add more capabilities on top of it, or that it needs to reuse all OGC APIs. And then again, a format would be needed to indicate whether filters should be resolved OGC- or STAC-style, given that both are Features-compliant and application/geo+json is not sufficient to hint which to pick.

I think a key principle here is that for a particular collection, all data access mechanisms (Coverages, Tiles, DGGS, Maps, EDR) represent the same data, and accessing the data using any of them should produce equivalent results.

The "same data" is really subjective depending on which user you talk to.

Those data access mechanisms all have some spatio-temporal structure that could be represented in some kind of alternate/extended GeoJSON with custom fields or generic properties encoding, which is why you can interrogate them with similar filter/bbox/crs/etc. query parameters. You can have virtually anything combined in the extra properties. This is essentially what is done with STAC extensions and assets. For each new STAC extension created for a new particular type of data, we can find some OGC API/standard that tries encoding the same.

Again, it only moves the problem elsewhere. Each approach is valid.

implementation of the server and client to decide whether to use Coverages or Coverage Tiles or DGGS or EDR or accessing STAC assets to retrieve the data

When you're dealing with your own processing server accessing your own data structure, sure, you can arrange things to "just work" together, because you know your own use cases. When you interrogate some remote server maintained by some other entity that has multiple use cases to handle, it couldn't care less about your requirements. Ambiguous combinations emerge to handle more use cases, and you need a way to indicate how to resolve things. In order for server and client to agree, the API must provide some means of resolution. This is the whole reason for the format we see everywhere (i.e., #395).

@fmigneault (Contributor, Author)

Following a few tests on my end while implementing Collection Input (crim-ca/weaver#685), I've found that I need additional parameters like format, type and schema to operate collections properly.

More specifically, the below combinations (not exhaustive) are possible:

{
  "inputs": {
    "static-collection": {
        "$comment": "Static document, resolves to 'test.geojson'.", 
        "collection": "https://mocked-file-server.com/test",
        "format": "geojson-feature-collection",
        "schema": "http://www.opengis.net/def/glossary/term/FeatureCollection"
    },
    "ogc-api-features":  {
        "collection": "https://mocked-file-server.com/collections/test",
        "format": "ogc-features-collection",
        "type": "application/geo+json",
        "filter-lang": "cql2-text",
        "filter": "property.name = 'test'"
    },
    "ogc-api-coverages":  {
        "collection": "https://mocked-file-server.com/collections/test",
        "format": "ogc-coverage-collection",
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "filter-lang": "cql2-text",
        "filter": "property.name = 'test'"
    },
    "ogc-api-maps":  {
        "collection": "https://mocked-file-server.com/collections/test",
        "format": "ogc-map-collection",
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "style": "dark",
        "filter-lang": "cql2-text",
        "filter": "property.name = 'test'"
    },
    "stac-api":  {
        "collection": "https://mocked-file-server.com/collections/test",
        "format": "stac-collection",
        "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        "filter-lang": "cql2-text",
        "filter": "property.name = 'test'"
    }
  }
}

Using format, I can disambiguate (explicitly) between accessing a collection's contents as a static document, STAC API, OGC API - Features, Coverages, Maps, etc., which could all be supported simultaneously by the same referenced collection endpoint.

Without format, I could not, for example, distinguish between retrieving the "full" Coverage (i.e., all bands as one file) or only specific STAC Asset bands (individually), since both are retrieved with the same GeoTiff media-type. Similarly, a style passed as input would not be properly applied unless the Maps interface was used, but the other parameters are mostly identical to Coverages. If I simply removed that style parameter for a given request, the resolved API could switch entirely between Coverages/Maps, which could yield different results if they imply different requirements/behaviors in the backend (e.g., auto-scaling). With format, I can ensure Maps are used, with or without style, regardless.

In some cases, it is possible to automatically guess a single API candidate based on the parameters, but often the parameters overlap too much to pick one, and depend too much on the various implementations to settle on a single candidate among the available options. Sometimes, combinations (e.g., style+filter) can imply hidden behaviors, such as processing the filter locally before dispatching the style remotely with a modified request, because the remote Maps API does not support filter directly. The format offers a way to reliably hint the selection process toward a specific API in case of ambiguity.

Taking inspiration from geojson-feature-collection, which is already defined, I came up with these variations:

  • stac-collection
  • ogc-coverage-collection
  • ogc-features-collection
  • ogc-map-collection
  • geojson-feature-collection

So far, I have only those defined, but there are most definitely other combinations to add to the format table. I am also open to renaming; these are just the names that felt appropriate.

Using type/schema combinations, I can request the specific contents to be retrieved. Those are similar to what is used with href instead of collection. However, they do not describe the content hosted at the supplied link as they do for href; instead, they describe the desired contents from the collection. Another distinction from the href case is that they can imply a different resolution depending on format (as in the previous STAC vs Coverages example, the same GeoTiff type does not yield the same results). Regarding schema, it was sometimes needed to disambiguate even further between different types of GeoJSON, since they are all represented by the same application/geo+json, and type/format alone were not enough.
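
For illustration (hypothetical collection URL; the schema URIs are just the ones I happen to use for this purpose, not something Part 3 defines), two inputs that are both application/geo+json but request different contents:

{
  "inputs": {
    "features": {
      "collection": "https://example.com/ogcapi/collections/test",
      "format": "ogc-features-collection",
      "type": "application/geo+json",
      "schema": "http://www.opengis.net/def/glossary/term/FeatureCollection"
    },
    "stac-items": {
      "collection": "https://example.com/ogcapi/collections/test",
      "format": "stac-collection",
      "type": "application/geo+json",
      "schema": "https://schemas.stacspec.org/v1.0.0/item-spec/json-schema/item.json"
    }
  }
}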

@jerstlouis
I believe those properties should be added explicitly to Part 3, with descriptions and expected behaviors when involving each of them. If that sounds right, how should we proceed?

@jerstlouis (Member) commented Aug 29, 2024

@fmigneault

I can disambiguate (explicitly) between accessing ...

This really goes against the idea that the workflow does not explicitly make this distinction, so that the exact same workflow can be re-used (workflow reusability) with other servers and/or clients supporting different access mechanisms.

Without format, I could not, for example, distinguish between retrieving the "full" Coverage (ie: all bands as one file) or only specific STAC Asset bands (individually), since both are retrieved with the same GeoTiff media-type.

Pointing to a collection, e.g., sentinel2 with all its bands, does not necessarily mean "all bands", but rather whichever bands are needed by the process to which it is given as input. This should apply the same whether using STAC or Coverages. The process can use field selection properties= for Coverages, and retrieve only the assets it needs with STAC.

If I simply removed that style parameter for a given request, the resolved API could switch entirely between Coverages/Maps, which could yield different results if they imply different requirements/behaviors in the backend (eg: auto-scaling).

The OGC API - Maps access mechanism (and related styles) doesn't really fit into typical workflows at all, since it doesn't normally provide access to the raw data of the collection, especially when an alternate access mechanism providing the raw data is available. Most workflows would work off the raw data, unless they are explicitly intended to work on pre-rendered maps. The one use case where Maps may be useful is for RGB imagery being used as part of a workflow, with no other access mechanism available, though that is probably not great for processing (e.g., 8-bit depth and scaled/clamped values instead of the 16-bit original values).

This being said, I'm not totally against the idea of adding parameters to force a particular access mechanism for specific scenarios / use cases, but this is absolutely not something that should be required, or even the common case. The whole point of the collection input was to leave it open for negotiation between the server / client hops in the workflow.

I believe those properties should be added explicitly to Part 3, with descriptions and expected behaviors when involving each of them. If that sounds right, how should we proceed?

Hoping we can first take some time to discuss this together live (probably a long discussion), and in the context of specific experiments.

@fmigneault (Contributor, Author)

This really goes against the idea that the workflow does not explicitly make this distinction, so that the exact same workflow can be re-used (workflow reusability) with other servers and/or clients supporting different access mechanisms.

And when the workflow is not able to resolve a combination because there are conflicting patterns, what do I do? Shrug and move to another server? The workflow is not reusable either. The idea of reusing format is that it would have the same connotation as in JSON schema. If you want to use it for extra validation/resolution you can, but you are allowed to ignore it, just like current inputs are allowed to ignore the format: geojson-feature-collection provided in the schema. It is only a hint.

The process can use field selection properties= for Coverages [...]

I agree, and never said otherwise. However, even then, the properties would depend on the API.
If Coverages uses properties=B04,B03,B02 that is fine.
Using STAC, it would need properties=assets.red,assets.green,assets.blue.

So, rather than trying to handle all possible combinations of property names/structures some user could throw at the process, I find it is much easier to add a format: stac-collection that lets the client indicate "I'm working with STAC-style collections". Then, the underlying content negotiation can still choose to attempt using the Coverages-style properties, but by doing any necessary translation under the hood. If anything, that can make the workflow even more reusable, since the server can understand what it is trying to parse, and can still interact with the collection even if it does not support STAC.
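
i.e., a sketch of what I have in mind (hypothetical URL; the format value and asset-prefixed properties follow my proposal above, not anything currently defined by Part 3):

{
  "inputs": {
    "imagery": {
      "collection": "https://example.com/stac/collections/scenes",
      "format": "stac-collection",
      "properties": [ "assets.red", "assets.green", "assets.blue" ]
    }
  }
}

The format tells the server which convention the properties follow; the server can still translate them into Coverages-style band selection internally.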

access mechanisms (and related styles) doesn't really fit into the typical workflows [...]

I've often seen replies of this sort. The fact that it is not "YOUR" typical workflow does not mean it can simply be ignored or that it doesn't exist. We either support powerful and flexible Collection Inputs, or we don't at all. We cannot pick and choose only the preferred and easy use cases...

Again, I reiterate, the type, schema and format I propose are NOT required. They are optional additional parameters that can be provided to address these quirky workflows that do not "just work" with the most basic use cases. What I want is for the standard to state explicitly that, given situations where such disambiguation is needed, those are the parameters to use, and what can be expected from them. Otherwise, we just keep doing wild-west hack workarounds.

Hoping we can first take some time to discuss this together live

Would love that. What date/time would work? There's never enough time in SWG or TB meetings for this.

@jerstlouis (Member) commented Aug 29, 2024

And when the workflow is not able to resolve a combination because there are conflicting patterns, what do I do? Shrug and move to another server?

Multiple available access patterns should never result in a failed workflow resolution -- on the contrary, they should make successful resolution more likely, because more options are available to intersect the client's capabilities. The client picks its preferred supported option, but any option that works for it should work almost just as well.

The idea with Part 3 nested processes is that each client-server hop is responsible for negotiating the immediate exchange. The client side of each hop sees what access mechanisms the server supports, and the client picks what makes the most sense from what is available in the context of the process. In theory, all access mechanisms should give access to the same raw data, whether it's Coverage Tiles, DGGS, Coverages, STAC items pointing to COGs... The data should be the same in all cases, so whichever access mechanism the client picks should yield very similar, if not identical, results. A hint that the client is free to ignore might make sense in situations where, for whatever reason, the client makes a bad decision by itself, but ideally the client should make the best decision without needing a hint.

Using STAC, it would need properties=assets.red,assets.green,assets.blue

If you mean that the workflow definition should have "properties": [ "assets.red", "assets.green", "assets.blue" ] that's not how I was expecting this to work. The "properties" should be exactly the same, regardless of the access mechanism, so it should still simply be "properties": ["red", "green", "blue"], and this is mapped to the separate assets, or to the separate bands of a unique COG referenced by STAC items, or to the "properties" parameter if using Coverages. The idea is that the workflow is defined one way, and this way works for all access mechanisms (Coverages, Tiles, DGGS, STAC+COG...).
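
i.e., a sketch of the workflow fragment I have in mind (hypothetical URL), which stays the same no matter which access mechanism the hop ends up negotiating:

{
  "inputs": {
    "imagery": {
      "collection": "https://example.com/ogcapi/collections/sentinel2-l2a",
      "properties": [ "red", "green", "blue" ]
    }
  }
}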

It is not because it is not "YOUR" typical workflow, that it can simply be ignored or that it doesn't exist.

I'm not saying it doesn't exist, just that in general (unless specifically doing post-processing on maps) processes take raw data as input, not rendered maps, so you don't have these two fundamentally different types of inputs, one being the raw data and the other rendered maps. In the case of STAC+COG, Coverages, Coverage Tiles, and DGGS data, you get the same data values regardless of the selected access mechanism, and we should focus the initial discussion / experiments on these use cases, rather than thinking about Maps and Styles at this stage.

They are optional additional parameters that can be provided to address these quirky workflows that do not "just work" with the most basic use cases. What I want is for the standard to state explicitly that, given situations where such disambiguation is needed, those are the parameters to use, and what can be expected from them.

They might come in handy in particular situations, but I'm really wary about working on these exception scenarios before having done technology integration experiments with the regular cases, because there's a possibility that the automatic resolution might actually just work fine without the hints. So I would really like us to experiment with specific workflow examples both where we agree they are not necessary, and where you think they are really necessary, to have some concrete context to discuss and define these hints.

What date/time would work? There's never enough time in SWG or TB meetings for this.

5:00 PM or later this evening would work for me, any time tomorrow Friday, any time after 10:00 AM Monday, any time Tuesday.

@fmigneault (Contributor, Author)

Multiple available access patterns should never result in a failed workflow resolution -- on the contrary, they should make successful resolution [...]

I disagree. If my auto-resolution implementation happened to prefer STAC-API, but you provided the properties assuming Coverages or Features would be picked, the workflow would not resolve as expected. My resolution strategy cannot magically figure out what you intended.

The process cannot inform about which one is better either, because it does not depend on a collection per se. A process accepting a GeoTiff, for example, can still receive it directly by href or value, without any knowledge of a specific collection. The collection is just a shortcut for me to avoid extracting that href/value myself from the collection.

Since many APIs can still return 200 OK even if a filter, property, etc. was not "effective", I cannot even rely on the response to know if the operation worked as intended. If a GeoTiff reference was found matching my criteria, how do I know it contains the selected properties? I would actually need to download the file, open it, and parse it (if I even can; sometimes that's the job of the process), and basically redo the operation on my end to validate everything. It defeats the whole purpose of limiting data transfer and dispatching the operation to a well-established API behavior.

In theory, all access mechanisms should give access to the same raw data

Well, they don't, hence the use of different APIs for different purposes.

If you mean that the workflow definition should have "properties": [ "assets.red", "assets.green", "assets.blue" ]

Correct, in the workflow definition.
I had in mind the Coverages/STAC API receiving it as a KVP URL, but it would be submitted in JSON array style next to the collection reference.

The "properties" should be exactly the same, regardless of the access mechanism, so it should still simply be "properties": ["red", "green", "blue"]

No, it doesn't!

For a Coverages API, it would indicate extracting the corresponding bands, which are defined in /collections/{}/schema.
In a Features API, the properties query refers directly to the nested properties field.
So, properties=red would look for the JSON path .properties.red in the features.
But in STAC Items, assets and properties live at the same level.
Therefore, if one wants to support some kind of "asset" selection, they must distinguish between properties.<x> and assets.<x>, meaning that property=assets.red, property=properties.red and property=red mean very different things.

Completely different access strategies and interpretations for all.
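
Concretely (hypothetical collection x; parameter support varies by server), the same intent ends up expressed as:

  • Coverages: /collections/x/coverage?properties=B04,B03,B02 selects bands/fields defined in /collections/x/schema
  • Features: /collections/x/items?properties=red selects .properties.red on each returned feature
  • STAC: the same /collections/x/items response, but the data of interest lives under .assets, so an unambiguous selector would have to be spelled assets.red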

regardless of the selected access mechanisms, and we should focus initial discussion / experiment on these use cases, rather than thinking about Maps and Styles at this stage.

It is this kind of focus that creates situations like the scale-factor=1 issue with Coverages.
By ignoring issues that we can already identify, we just create breaking/incompatible behaviors later.

working on these exception scenarios before having done technology integration experiments with the regular cases

That's the thing. I am implementing it now, and I'm already encountering issues!
So rather than waiting until the standard is published and inevitably requires a revision, I am highlighting the concerns and proposing solutions.

I also got time Friday 10AM-2PM. Let's try that.

@jerstlouis (Member) commented Aug 29, 2024

Completely different access strategies and interpretation for all.

I believe some (most?) of those issues you point out are non-issues when used as intended, which is why I want to get to the bottom of all these with you in practical scenarios like we're doing here. Let's try to discuss tomorrow around 10:00 AM.

In Features API, the properties query directly refers to the nested properties field.

The properties= parameter works exactly the same between OGC API - Coverages and OGC API - Features (we explicitly have issues to align them, e.g., opengeospatial/ogcapi-features#931). In both cases, the properties are exactly the names of the properties in the logical schema at /schema. The schema, like the properties, refers to the logical schema, which is independent of the particular encoding or access mechanism.
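
For example, a minimal sketch of a /collections/x/schema logical schema (hypothetical band names; any additional OGC-specific annotations omitted); properties=red,green,blue then refers to these same names whether the data is retrieved as a coverage or as features:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "red": { "type": "integer" },
    "green": { "type": "integer" },
    "blue": { "type": "integer" }
  }
}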

The STAC access mechanism is a bit of a special case, as we discussed in a previous issue, but how I would understand properties to work in that case, to align with the concept of a data cube available for example as both Coverages and STAC+COG, is that the property selection translates to the selection of an asset, or of a field inside an asset in the case of a single asset containing multiple bands (not to a property of the STAC item metadata).

If my auto-resolution implementation happened to prefer STAC-API, but you provided the properties assuming Coverages or Features would be picked, the workflow would not resolve as expected.

The properties interpretation absolutely needs to yield the same results regardless of whether STAC or Coverages is used, otherwise these are not two valid OGC API access mechanisms in the context of this Collection Input extension. This is possible with my suggested interpretation.

The process cannot inform about which one is better either, because it does not depend on a collection per se.

It would really be the processing engine (the Part 3 / Collection Input machinery) that accesses the collection, not the process (i.e., the execution unit of the application package) itself. But the processing engine could potentially use information about the process (like a format that it accepts, which is supported by a particular access mechanism) to make a decision about which access mechanism to pick.

Since many APIs can still return 200 OK even if a filter, property, etc. was not "effective", I cannot even rely on the response to know if the operation worked as intended. If a GeoTiff reference was found matching my criteria, how do I know it contains the selected properties?

If using a parameter supported as per the conformance declaration associated with the input collection, a 200 while using that parameter should definitely mean that it was applied successfully. Also, using an unsupported parameter with an OGC API should result in a 4xx error if the parameter/value is not supported.
