Missing scroll API docs? #225
Hello @kylebarron, the scroll API must be a reference to Elasticsearch, where the Scroll API is an alternate way to "page" through large responses. I've not seen this error, but it indicates that the paging mechanism in sat-api is not working as expected. Your AOI is pretty large; I would recommend that you divide your query into smaller queries, such as one year at a time for that AOI, or divide your AOI into smaller AOIs. Note also that the deployed DevSeed sat-api you are using is a little out of date. STAC is now on version 0.9 and there is a new forked and refactored version of sat-api called stac-api, along with a beta version of sat-search. However, as of right now there isn't a deployed version of stac-api containing the same public datasets the DevSeed API does. Within the next 2 months there will be one for Sentinel-2 in the new version. Are you interested in Sentinel-2 data, Landsat-8, or both? |
Thanks for your response. I'm guessing that there's a default Elasticsearch option that sets 10,000 as the max scroll and that wasn't modified... If it's not any worse for performance on the backend to retrieve items 29,500-30,000 than it is to retrieve items 0-500, it would be nice to restrict usage by rate limiting rather than a max number of results, so that a user could (slowly) page through as many results as they desired. I think the best workaround is to split it up by year as you mentioned. I'm only interested in Landsat 8, since Sentinel 2 isn't stored on AWS in COG. |
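The split-by-year workaround discussed above could be sketched roughly as follows. This is a minimal illustration, not part of sat-api or sat-search: the helper names `yearly_time_ranges` and `split_query_by_year` are hypothetical, and it assumes the `time` field takes a `start/end` interval string in the same format as the query in this thread. Each yearly query would still need to stay under the 10,000-result window on its own.

```python
from typing import Dict, List


def yearly_time_ranges(start_year: int, end_year: int) -> List[str]:
    """Return one 'start/end' interval string per calendar year (inclusive)."""
    return [
        f"{year}-01-01T00:00:00Z/{year}-12-31T23:59:59Z"
        for year in range(start_year, end_year + 1)
    ]


def split_query_by_year(query: Dict, start_year: int, end_year: int) -> List[Dict]:
    """Copy the query once per year, overriding only its 'time' field.

    The input query is not modified; every returned dict is a shallow copy.
    """
    return [
        {**query, "time": time_range}
        for time_range in yearly_time_ranges(start_year, end_year)
    ]
```

Each resulting dict could then be posted to the search endpoint and paged through separately, with the results concatenated on the client side.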
@kylebarron I'm a little rusty on how it works exactly, but I think you can combine the
Let me know if that helps |
@kylebarron you can check how we are dealing with paging in https://github.com/developmentseed/awspds-mosaic/blob/master/awspds_mosaic/landsat/stac.py#L69-L101 |
That's what I'm attempting to do in the original post. That is, if I first find the total number of results:

import json
import requests
query_str = '{"bbox": [-127.64, 23.92, -64.82, 52.72], "time": "2013-01-01T00:00:00Z/2020-04-01T23:59:59Z", "query": {"eo:sun_elevation": {"gt": 0}, "landsat:tier": {"eq": "T1"}, "collection": {"eq": "landsat-8-l1"}, "eo:cloud_cover": {"gte": 0, "lt": 10}, "eo:platform": {"eq": "landsat-8"}}, "sort": [{"field": "eo:cloud_cover", "direction": "asc"}]}'
query = json.loads(query_str)
url = 'https://sat-api.developmentseed.org/stac/search'
headers = {
    "Content-Type": "application/json",
    "Accept-Encoding": "gzip",
    "Accept": "application/geo+json",
}
data = requests.post(url, headers=headers, json={**query, **{'limit': 0}}).json()
data['meta']['found']
# 29773

But then if I try to retrieve a high enough page, it fails (i.e. page 58 with a limit of 500 should be 29000-29500, if page numbering starts at 1):

data = requests.post(url, headers=headers, json={**query, **{'limit': 500, 'page': 58}}).json()
data
# {'code': 500,
# 'description': '[illegal_argument_exception] Result window is too large, from + size must be less than or equal to: [10000] but was [29000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.'}

Yes, I'm using essentially the exact same code, ported to a CLI and Python package that I can run locally instead of on Lambda. Regardless, it never goes past 10,000 results returned from the API. You can test with:

git clone https://github.com/kylebarron/landsat-cogeo-mosaic
cd landsat-cogeo-mosaic
pip install -e .
landsat-cogeo-mosaic create \
    --bounds '-127.64,23.92,-64.82,52.72' \
    --max-cloud 10 \
    --stac-collection-limit 500 \
    --season summer > mosaic.json

It logs
But if you wait a few minutes you'll see it cuts off at page 20. |
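The cutoff at page 20 follows from the arithmetic in the error message. Here is a small sketch, assuming the API maps a 1-based `page` and `limit` onto Elasticsearch's `from = (page - 1) * limit` and `size = limit`. That mapping is an assumption, but it is consistent with the error above reporting `from + size` as 29000 for page 58 with a limit of 500. The function names are illustrative only.

```python
# Elasticsearch's default for the index.max_result_window setting,
# as also reported in the error message above.
ES_MAX_RESULT_WINDOW = 10_000


def result_window_end(page: int, limit: int) -> int:
    """Return from + size for a 1-based page (assumed mapping, see lead-in)."""
    offset = (page - 1) * limit  # Elasticsearch "from"
    return offset + limit        # "from" + "size"


def max_reachable_page(limit: int, window: int = ES_MAX_RESULT_WINDOW) -> int:
    """Largest 1-based page whose from + size still fits inside the window."""
    return window // limit


def check_page(page: int, limit: int, window: int = ES_MAX_RESULT_WINDOW) -> None:
    """Raise a descriptive error before sending a request that cannot succeed."""
    end = result_window_end(page, limit)
    if end > window:
        raise ValueError(
            f"page {page} with limit {limit} needs results up to {end}, "
            f"but only the first {window} are reachable; narrow the query instead"
        )
```

With a limit of 500, `max_reachable_page(500)` is 20, which matches the page at which the CLI run stops.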
@kylebarron Looks like 10000 is a limit within Elasticsearch. And while there is a way around it, it's not recommended. It says to use the scroll API to do paging (would have to be implemented in sat-api), although last I read the scroll API was not recommended for production. I think your best bet is to ensure that your queries don't have so many responses and just to divide up the queries. On the API side though, it should at least throw a more meaningful error if the number of responses exceeds 10K. Thanks for bringing this up. |
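One way to "divide up the queries" without guessing a fixed interval is to bisect the time range until every piece is small enough. The sketch below is only an illustration under stated assumptions: `count_fn` is a hypothetical callable that the caller would implement, for example by posting the query with `limit: 0` and reading `meta.found` as shown earlier in this thread, and `split_until_small` is not a function of sat-api or sat-search.

```python
from datetime import datetime, timedelta
from typing import Callable, List, Tuple

Interval = Tuple[datetime, datetime]


def split_until_small(
    start: datetime,
    end: datetime,
    count_fn: Callable[[datetime, datetime], int],
    max_results: int = 10_000,
    min_span: timedelta = timedelta(days=1),
) -> List[Interval]:
    """Recursively halve [start, end] until each piece has <= max_results items.

    count_fn(start, end) is supplied by the caller (hypothetical here) and
    returns how many items match within that interval. Recursion also stops
    once a piece is no longer than min_span, so it always terminates.
    Adjacent pieces share their boundary instant, so an item exactly on a
    boundary may be returned twice; deduplicate by item id afterwards.
    """
    if count_fn(start, end) <= max_results or (end - start) <= min_span:
        return [(start, end)]
    mid = start + (end - start) / 2
    return (
        split_until_small(start, mid, count_fn, max_results, min_span)
        + split_until_small(mid, end, count_fn, max_results, min_span)
    )
```

Each returned interval can be formatted into a `time` string and paged through on its own. The trade-off is one extra count request per split.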
Good to know. I don't want to suggest a change that makes backend performance worse. I think for my own use I'll first find the total number of results with |
My mistake @kylebarron, I should have read through the full error first 🤦. Do you think it would support your use case if we passed through the parameters necessary for the Elasticsearch search scroll? @matthewhanson do you know why this isn't recommended for production (maybe performance reasons)? |
I'm not really sure why it's not recommended for general paging, from the documentation: It's quite a bit more complicated: since it's not stateless, you'd either need to use session tokens or, as you suggest @drewbo, pass back the parameters, which means adding new query parameters to the API for users to hand back info about the scroll API. |
It's fine, I don't intend to ask for a ton of work when the workaround isn't that bad. You can close this if you want, or leave it open if you want to update the API to throw a more meaningful error |
I'm trying to create a seamless cloudless landsat basemap using MosaicJSON. So I'm trying to loop over all cloudless landsat imagery to record it in the MosaicJSON file. When I attempt to do that I get an error saying to use the "scroll api" instead.
I've searched the code, searched the API docs, searched issues, and I can't find any reference to a scroll API. Does it exist?
Separately, I tried to use sat-search, but it doesn't give the same number of results as the HTTP API for the same query: here it gives 3859 results from search.found() instead of the 29773 results that the meta key of the HTTP API says should exist. Repro code:
Am I missing something, or why do these identical queries return different numbers of results?