
Average over shape process #98

Closed
huard opened this issue Mar 9, 2020 · 44 comments · Fixed by #152

@huard
Collaborator

huard commented Mar 9, 2020

Description

Given one or multiple polygons and a netCDF file, compute the spatial average (area-weighted) over each region and store along a new "geometry" dimension.

Might require passing a file storing the cell areas for accurate computations.
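
For illustration only: a minimal xarray sketch of the kind of computation this process would perform for a single region, assuming hypothetical input files (`pr.nc`, `areacella.nc`) and a precomputed boolean mask for one polygon.

```python
import xarray as xr

ds = xr.open_dataset("pr.nc")                          # hypothetical gridded input
areas = xr.open_dataset("areacella.nc")["areacella"]   # hypothetical cell-area field
mask = xr.open_dataset("region_mask.nc")["mask"]       # hypothetical boolean mask for one polygon

# Area-weighted mean over the region: weight by cell area, with zero weight outside the mask.
pr_avg = ds["pr"].where(mask).weighted(areas.where(mask, 0.0)).mean(("lat", "lon"))
```

The actual process would additionally loop over the input polygons and stack the results along the new "geometry" dimension.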

huard added the Raven label Apr 14, 2020
@richardarsenault

@huard, this is in finch, but we now have a basic averaging method using rioxarray to do simple averaging at the catchment scale. Is this something that has advanced, and/or is it reasonable to think that we can get it rolled out in the next 3 months?

@huard
Collaborator Author

huard commented Dec 9, 2020

Yes, it's part of my master plan. I will probably use the new functionality just released in xESMF, however. Once the weights are computed, this is really fast. See https://pangeo-xesmf.readthedocs.io/en/latest/notebooks/Spatial_Averaging.html
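
For context, a rough sketch of the xESMF spatial-averaging workflow described in that notebook; the file names are placeholders:

```python
import geopandas as gpd
import xarray as xr
import xesmf as xe

ds = xr.open_dataset("pr.nc")                  # placeholder gridded dataset with lat/lon
regions = gpd.read_file("watersheds.geojson")  # placeholder polygons

# Building the SpatialAverager computes the weights; this is the expensive step.
savg = xe.SpatialAverager(ds, regions.geometry)

# Applying the weights is fast and stacks the averages along a new geometry dimension.
pr_avg = savg(ds["pr"])
```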

@rsignell-usgs

rsignell-usgs commented Mar 10, 2021

@huard is there a nice "real life use case" notebook that demonstrates this capability I could use for a demo?

@huard
Collaborator Author

huard commented Mar 10, 2021

@rsignell-usgs

@huard , Is there a python example that would make the polygon request and retrieve the output using this new Finch WPS service?

Or am I not understanding the capabilities?

@huard
Collaborator Author

huard commented Mar 26, 2021

@aulemahal I don't think we've gotten there yet, apart from the unit test, correct?

@rsignell-usgs Yes, this is correct. I've created an issue for this: #165. We probably won't be able to get this done until April, though. If you want to try your hand:

  1. Install the birdy client: https://github.com/bird-house/birdy
  2. Connect to the finch server and call the average_polygon method:

```python
from birdy import WPSClient

url = 'https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps'
wps = WPSClient(url)
wps.average_polygon?  # in IPython/Jupyter, shows the process inputs and outputs
```
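
A hypothetical follow-up call, continuing from the snippet above and assuming the process accepts the netCDF/OPeNDAP URL and a geojson file as inputs; the exact input names should be checked with `wps.average_polygon?`:

```python
# Input names (resource, shape) are assumptions; check them with `wps.average_polygon?`.
dap_url = (
    'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/'
    'birdhouse/testdata/flyingpigeon/cmip3/pr.sresa2.miub_echo_g.run1.atm.da.nc'
)
resp = wps.average_polygon(resource=dap_url, shape='/absolute/path/to/region.geojson')

# asobj=True downloads the outputs and opens them as Python objects.
out = resp.get(asobj=True)
```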

@aulemahal
Collaborator

True, there is no example of the use of this process.

@aulemahal
Collaborator

Oh haha, not a real example, but I had an issue with subset_polygon here: #153 (comment)
If I coded this correctly, change subset to average and you have a minimal example.

@rsignell-usgs

@huard and @aulemahal, I gave it a try and was able to submit and process a non-polygon request, but it fails with a polygon request. I'm sure it's user error:
https://nbviewer.jupyter.org/gist/rsignell-usgs/10059fde7b80d8c29f962505ab024b10

@huard
Collaborator Author

huard commented Mar 28, 2021

Try with an absolute file name for the geojson polygon. Otherwise, the client will assume you're passing the geojson object as a string representation. The client could probably be made smarter eventually...
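
To illustrate the two modes the client distinguishes (the variable names here are just placeholders):

```python
import json
from pathlib import Path

# 1) An absolute file name: treated as a file by the client, whose contents are sent to the server.
shape_as_path = str(Path("output.geojson").resolve())

# 2) The GeoJSON itself, passed as a string representation.
shape_as_string = json.dumps({
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {},
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-75.0, 45.0], [-74.0, 45.0],
                             [-74.0, 46.0], [-75.0, 46.0], [-75.0, 45.0]]],
        },
    }],
})
```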

@rsignell-usgs

@huard, I tried
https://nbviewer.jupyter.org/gist/rsignell-usgs/fa40eea834a48963b5fe2a02feda824b
using the full path name (cell [9]), and also passing the JSON as a string (cell [13]), but couldn't get either to work.

I'm not sure what else to try; help appreciated!

@huard
Collaborator Author

huard commented Mar 29, 2021

The error message suggests that the connection was dropped, so I'll offer a guess.

PyWPS supports two access mechanisms: sync and async. By default, the birdy client uses sync. Sync processes are time-limited: if the server doesn't respond fast enough, the request is simply dropped. My guess here is that the tolerance and the size of the polygon make the request fairly compute-intensive. You could thus try to either:

  1. limit the computation time by increasing the polygon simplification tolerance (e.g. to 0.1). On my laptop, this reduces the compute time from 4 minutes to a few seconds.
  2. launch the process asynchronously (instantiate the client with `wps = WPSClient(pavics_url, progress=True)`).

If this is indeed the problem, a more informative error message would definitely help.

@rsignell-usgs

rsignell-usgs commented Mar 29, 2021

I tried increasing the polygon tolerance to 0.1 but it still failed.
I think I am already running async, because I successfully used this notebook to access the dry_days service, which produces a netCDF file remotely that I could then access.

But could it be that the WPS service was having a temporary problem?

Today I tried running the same dry_days query that worked a few days ago, and it starts, but doesn't complete:
[screenshot: 2021-03-29_10-40-54]

And I just tried again and got a 500.

@rsignell-usgs

rsignell-usgs commented Mar 29, 2021

The above was with:

```python
from birdy import WPSClient

pavics_url = 'https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps'
wps = WPSClient(pavics_url, progress=True)
dap_url = 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/flyingpigeon/cmip3/pr.sresa2.miub_echo_g.run1.atm.da.nc'

resp = wps.dry_days(dap_url, thresh='0.1 mm/d', variable='pr')
```
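
For completeness, retrieving the result afterwards follows birdy's usual pattern, e.g.:

```python
# asobj=True downloads the remote outputs (e.g. the netCDF file) and opens them as Python objects.
out = resp.get(asobj=True)
```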

@huard
Collaborator Author

huard commented Mar 29, 2021

Thanks for your patience, will report the error to our sysadmin.

@rsignell-usgs

I tried a few minutes ago, and didn't get the 500.
But after the above failed, I'm now getting a 500 error again.

@rsignell-usgs

@huard , are you able to run the above 5 lines of code?

@Zeitsperre
Collaborator

I just tried it myself; a Code 500 server error for Finch is what I'm seeing.

@aulemahal
Collaborator

I can confirm: Error 500 when using PAVICS, but it passes with @tlvu's VM.
I'm on it, trying to figure out the issue, but it's hard since it doesn't happen in the VM and the logs don't say which request triggered the error.

@Zeitsperre
Collaborator

@aulemahal is the version of birdy different on the test VM? I have 0.7.0 for PAVICS.

@aulemahal
Collaborator

I'm testing on pavics, but calling finch on the VM, so birdy is the same. And finch is at 0.7.1 on both sides...

@tlvu
Collaborator

tlvu commented Mar 29, 2021

@Zeitsperre @aulemahal can you guys use the official Jupyter environment on PAVICS to reproduce the error? If we do not have a reproducible environment, it's hard to reproduce the problem.

@aulemahal
Collaborator

@rsignell-usgs It seems the error was on the server side: the finch server was getting too crowded because of the use of progress=True.

The main WPS server should never be used with progress=True.
But we also have an "async" finch here: https://pavics.ouranos.ca/twitcher/ows/proxy/finchasync/wps; this one can safely be used with that option.

I did not know this before! This should fix the dry_days problem. I'm going back to the average-over-shape one...
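
In code, that just means pointing the client at the async instance:

```python
from birdy import WPSClient

# The async Finch instance can safely be used with progress=True; the main instance cannot.
async_url = 'https://pavics.ouranos.ca/twitcher/ows/proxy/finchasync/wps'
wps = WPSClient(async_url, progress=True)
```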

@tlvu
Collaborator

tlvu commented Mar 29, 2021

> @huard, I tried
> https://nbviewer.jupyter.org/gist/rsignell-usgs/fa40eea834a48963b5fe2a02feda824b
> using the full path name (cell [9]), and also passing the JSON as a string (cell [13]), but couldn't get either to work.
>
> I'm not sure what else to try; help appreciated!

@rsignell-usgs

Could you provide /home/jovyan/shared/users/rsignell/notebooks/WPS/output.geojson used in your notebook https://nbviewer.jupyter.org/gist/rsignell-usgs/fa40eea834a48963b5fe2a02feda824b? I'm trying to reproduce the error in your notebook.

However, as @aulemahal rightfully points out, avoid using progress=True with the main WPS server.

For progress=True, please use https://pavics.ouranos.ca/twitcher/ows/proxy/finchasync/wps instead. The root cause is bug #121, which is really on the PyWPS library side.

@tlvu
Collaborator

tlvu commented Mar 29, 2021

> Could you provide /home/jovyan/shared/users/rsignell/notebooks/WPS/output.geojson used in your notebook https://nbviewer.jupyter.org/gist/rsignell-usgs/fa40eea834a48963b5fe2a02feda824b? I'm trying to reproduce the error in your notebook.

Sorry, read the notebook too fast. Didn't notice output.geojson was created in cell 7.

@rsignell-usgs

@tlvu, were you able to reproduce?

@rsignell-usgs

I tried with the async version of the server. The dry_days process worked, but I couldn't get far with average_polygon.

Here's my current notebook:
https://nbviewer.jupyter.org/gist/rsignell-usgs/109c47f1a792a658e9164c1240b6c49a

@tlvu
Collaborator

tlvu commented Mar 30, 2021

> @tlvu, were you able to reproduce?

@rsignell-usgs

Yes, I ran your notebook against my dev server: https://gist.github.com/tlvu/f6de3098cad70709fbed990f4e7d61d4#file-wps_polygon-devserver-ipynb (so we do not have to test and debug on the production server).

dry_days worked as well. Same error for the rest.

I gave ssh access to my dev server to @aulemahal; I think he's investigating.

@rsignell-usgs

@tlvu and @aulemahal, I'm so glad you could reproduce it, and that it's not just something dumb I was doing! Good luck!

@aulemahal
Collaborator

@rsignell-usgs I am indeed investigating the issue. It happens on both the prod and "dev" servers, but not locally, so it may have to do with the interaction between all the web server layers. I think there is also an error on the logging side that makes it even harder to debug. I'm trying to fix that first.

@aulemahal
Collaborator

aulemahal commented Mar 31, 2021

@rsignell-usgs @huard Ok, all this work for the conclusion we kinda already knew:
I was able to successfully average using a local shape!

The first reason the tasks were failing is indeed the size of the input. Our finch seems to accept only inputs below 3 MB; output.geojson as my example creates it is 13 MB. I simplified the polygon upstream with tolerances of 0.001 (yielding 2.2 MB) and 0.01 (yielding 800 KB). The error message ("BrokenPipe") is not clear at all, but the logs do say "File size for input exceeded. Maximum request size allowed: 3.0 megabytes."

The second reason is that the non-async finch will time out because xESMF is slow, especially with large polygons.
The 800 KB polygon worked successfully on both finches, and I seem to be able to use the async version for the 2.2 MB one, but it is very slow (on the dev server). xESMF's polygon parsing is single-threaded, so it doesn't scale easily. Again, this issue was compounded by #168: timed-out processes pollute the database and block further processes.

Also, the progress reporting is quite poor, as we delegated everything to clisops. Even if we implemented the averaging function ourselves, we would only be able to report one more step: when the weights are created, before computing the average itself.

I'll make sure to explain and show the simplification process in my upcoming notebook.

EDIT: The process with the 2.2 MB polygon never finished. I saw in the log that the SpatialAverager was created and that it had at least begun the computation, but the task is still "ongoing" while finch is idle.

@rsignell-usgs

rsignell-usgs commented Mar 31, 2021

@aulemahal, thanks for that detective work!
In addition to those issues, there are still some network/security issues/layers preventing me from doing any successful query, right?

Or do you believe you have a code block I should be able to execute successfully?

@aulemahal
Collaborator

I think not anymore! Both finches (async and normal) on PAVICS seem to be running correctly. I just tried:

```python
from birdy import WPSClient

pavics_url = 'https://pavics.ouranos.ca/twitcher/ows/proxy/finch/wps'
wps = WPSClient(pavics_url)
dap_url = 'https://pavics.ouranos.ca/twitcher/ows/proxy/thredds/dodsC/birdhouse/testdata/flyingpigeon/cmip3/pr.sresa2.miub_echo_g.run1.atm.da.nc'
resp = wps.dry_days(dap_url, thresh='0.1 mm/day', variable='pr')
```

and it works.

@rsignell-usgs

rsignell-usgs commented Mar 31, 2021

@aulemahal, do you have an average_polygon example that works? That's the one I was struggling with.

@aulemahal
Collaborator

Oh! Give me 30 minutes.

@rsignell-usgs

rsignell-usgs commented Mar 31, 2021

Hoping we can make something like cell [13] work in this polygon notebook

@aulemahal
Collaborator

Here we go:
https://gist.github.com/aulemahal/88343751b99ddc2f0d5a0f07d8f2fdf4

The main trick is to simplify the polygons before they are sent to finch. Here I am using geopandas, reading the WFS-retrieved polygons directly into it and simplifying them before writing to file. It seems that geojsons under 1 MB will not cause timeouts and will work with the normal finch instance.
Larger polygons (but under 3 MB) might work with the async version, but for now my only test hung indefinitely...
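
A minimal sketch of that client-side simplification step (the WFS URL is a placeholder and the tolerance is illustrative):

```python
import os
import geopandas as gpd

# Placeholder WFS GetFeature URL returning GeoJSON; substitute the real layer.
wfs_url = (
    "https://example.org/geoserver/wfs?service=WFS&version=2.0.0"
    "&request=GetFeature&typename=mylayer&outputFormat=json"
)

gdf = gpd.read_file(wfs_url)

# Simplify the geometries (tolerance in degrees for lat/lon data) before writing to file.
gdf["geometry"] = gdf.geometry.simplify(tolerance=0.01)
gdf.to_file("region_simplified.geojson", driver="GeoJSON")

# Files under roughly 1 MB seem to avoid timeouts on the regular finch instance.
print(os.path.getsize("region_simplified.geojson") / 1e6, "MB")
```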

@huard Sadly, I found yet another bug. The current finch behavior is to chunk datasets when they have over 1,000,000 points. However, it chunks all dimensions, while the averaging process doesn't support chunking along the spatial dimensions... The test data we are using here will work, but larger datasets won't.
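
Not finch's actual code, but to illustrate the constraint (assuming the usual time/lat/lon dimension names): the dataset can be chunked along time as long as the spatial dimensions stay whole, e.g. with xarray:

```python
import xarray as xr

ds = xr.open_dataset("large_dataset.nc")  # hypothetical large input

# Chunk along time only; -1 keeps lat and lon as single chunks, which is what the
# xESMF-based averaging needs (no chunking along the spatial dimensions).
ds = ds.chunk({"time": 365, "lat": -1, "lon": -1})
```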

@huard
Collaborator Author

huard commented Mar 31, 2021

Ok, please create issues and tag them with the DACCS label. We'll fix it in the next release. I guess this requires a release of xESMF, clisops and finch?

@aulemahal
Collaborator

Hum, I don't think it's possible to perform the average with chunks along the spatial dimensions, so I don't think we could change xESMF in any meaningful way. Same for clisops, except that clisops has easier access to the spatial dimension names. But rechunking on the fly without user input isn't very elegant. I really think this is a finch problem.

@huard
Collaborator Author

huard commented Mar 31, 2021

Ok, easier then!

@tlvu
Collaborator

tlvu commented Mar 31, 2021

> Our finch seems to accept only inputs below 3 MB.

Found out it's the default from PyWPS: https://github.com/geopython/pywps/blob/20e1e254a3f7914e555fa89f363d1f6eb5f3895c/pywps/configuration.py#L74

We can override it in our Finch; what would be a reasonable value? Was the default set at 3 MB for a reason?

I just set that limit to 20 MB on my dev server if someone wants to try.
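
For reference, the override would live in the PyWPS/Finch server configuration; a sketch assuming the standard PyWPS .cfg layout, with an illustrative value:

```ini
# Assumed PyWPS server configuration override (value is illustrative).
[server]
maxrequestsize = 20mb
```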

tlvu added a commit to Ouranosinc/PAVICS-e2e-workflow-tests that referenced this issue Apr 8, 2021
tlvu added a commit to bird-house/birdhouse-deploy that referenced this issue Apr 16, 2021
…nable in prod

Value 100mb is reasonable according to @huard.

Fixes Ouranosinc/raven#361.

Also fixes the issue described here for Finch: bird-house/finch#98 (comment).

Fix this kind of error found by Jenkins:
```
  _ raven-master/docs/source/notebooks/Multiple_watersheds_simulation.ipynb::Cell 1 _
  Notebook cell execution failed
  Cell 1: Cell execution caused an exception

  Input:
  # The model parameters for gr4jcn for both catchments. Can either be a string of comma separated values, a list, an array or a named tuple.
  gr4jcn1 = "0.529, -3.396, 407.29, 1.072, 16.9, 0.947"
  gr4jcn2 = "0.28, -3.6, 380.9, 1.092, 14.6, 0.831"

  params = [gr4jcn1, gr4jcn2]

  # Forcing files. Raven uses the same forcing files for all and extracts the information it requires for each model.
  ts = get_file("input2d/input2d.nc")

  # Model configuration parameters. In a real case, we'd set nc_index to two different values for two different watersheds.
  config = dict(
      start_date=[dt.datetime(2000, 1, 1), dt.datetime(2000, 1, 1)],
      end_date=[dt.datetime(2002, 1, 1), dt.datetime(2002, 1, 1)],
      area=[4250.6, 5000],
      elevation=[843.0, 780],
      latitude=[54.4848, 48.0],
      longitude=[-123.3659, -122.99],
      nc_index=[0, 0],
  )

  # Launch the WPS to get the multi-model results.  Note the "gr4jcn" and "hmets" keys.
  resp = wps.raven_gr4j_cemaneige(ts=str(ts), params=params, **config)

  # And get the response
  # With `asobj` set to False, only the reference to the output is returned in the response.
  # Setting `asobj` to True will retrieve the actual files and copy the locally.
  [hydrograph, storage, solution, diagnostics, rv] = resp.get(asobj=True)

  Traceback:

  ---------------------------------------------------------------------------
  ServiceException                          Traceback (most recent call last)
  <ipython-input-2-37168570fca3> in <module>
       20
       21 # Launch the WPS to get the multi-model results.  Note the "gr4jcn" and "hmets" keys.
  ---> 22 resp = wps.raven_gr4j_cemaneige(ts=str(ts), params=params, **config)
       23
       24 # And get the response

  </opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py-240> in raven_gr4j_cemaneige(self, ts, nc_spec, params, start_date, end_date, nc_index, duration, run_name, name, hrus, area, latitude, longitude, elevation, evaporation, rain_snow_fraction, rvc, output_formats)

  /opt/conda/envs/birdy/lib/python3.7/site-packages/birdy/client/base.py in _execute(self, pid, **kwargs)
      345         try:
      346             wps_response = self._wps.execute(
  --> 347                 pid, inputs=wps_inputs, output=wps_outputs, mode=mode
      348             )
      349

  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in execute(self, identifier, inputs, output, mode, lineage, request, response)
      357         # submit the request to the live server
      358         if response is None:
  --> 359             response = execution.submitRequest(request)
      360         else:
      361             response = etree.fromstring(response)

  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in submitRequest(self, request)
      910         reader = WPSExecuteReader(verbose=self.verbose, timeout=self.timeout, auth=self.auth)
      911         response = reader.readFromUrl(
  --> 912             self.url, request, method='Post', headers=self.headers)
      913         self.response = response
      914         return response

  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in readFromUrl(self, url, data, method, username, password, headers, verify, cert)
      601
      602         return self._readFromUrl(url, data, self.timeout, method, username=username, password=password,
  --> 603                                  headers=headers, verify=verify, cert=cert)
      604
      605

  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/wps.py in _readFromUrl(self, url, data, timeout, method, username, password, headers, verify, cert)
      513             u = openURL(url, data, method='Post',
      514                         username=self.auth.username, password=self.auth.password,
  --> 515                         headers=headers, verify=self.auth.verify, cert=self.auth.cert, timeout=timeout)
      516             return etree.fromstring(u.read())
      517

  /opt/conda/envs/birdy/lib/python3.7/site-packages/owslib/util.py in openURL(url_base, data, method, cookies, username, password, timeout, headers, verify, cert, auth)
      209
      210     if req.status_code in [400, 401]:
  --> 211         raise ServiceException(req.text)
      212
      213     if req.status_code in [404, 500, 502, 503, 504]:    # add more if needed

  ServiceException: <?xml version="1.0" encoding="utf-8"?>
  <ExceptionReport version="1.0.0"
      xmlns="http://www.opengis.net/ows/1.1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.opengis.net/ows/1.1 http://schemas.opengis.net/ows/1.1.0/owsExceptionReport.xsd">
      <Exception exceptionCode="NoApplicableCode" locator="NotAcceptable">
          <ExceptionText>Request failed: (&#x27;Connection aborted.&#x27;, BrokenPipeError(32, &#x27;Broken pipe&#x27;))</ExceptionText>
      </Exception>
  </ExceptionReport>
```
tlvu added a commit to bird-house/birdhouse-deploy that referenced this issue Apr 16, 2021
…for-raven-demo

Update Raven and Jupyter env for Raven demo

Raven release notes PR Ouranosinc/raven#374 + Ouranosinc/raven#382

Jupyter env update PR Ouranosinc/PAVICS-e2e-workflow-tests#71

Other fixes:
* Fix intermittent Jupyter spawning error by doubling various timeout configs (it's intermittent and hence hard to test, so we are not sure which of the timeouts fixed it)
* Fix Finch and Raven "Broken pipe" error when the request size is larger than the default 3 MB (bumped to 100 MB) (fixes Ouranosinc/raven#361 and the related Finch comment bird-house/finch#98 (comment))
* Lower the chance of "Max connection" errors for Finch and Raven (bump parallelprocesses from 2 to 10). In prod, the server has the CPU needed to run 10 concurrent requests, so this prevents users from having to "wait" for each other.
@rsignell-usgs

@aulemahal, is there an updated notebook that demonstrates the average-over-shape process?

@aulemahal
Collaborator

Yes! Here: https://pavics-sdi.readthedocs.io/projects/finch/en/latest/notebooks/subset.html#Averaging-over-polygons
There are still some limitations, and the first is non-negligible:
1 - No missing-value handling (one NaN grid cell means the average is NaN), but this is currently being fixed. I suspect it will be integrated in finch this summer.
2 - Often we need to simplify the polygons on the client side. This is partly due to a bug in OWSLib; I have no ETA for a fix...

@rsignell-usgs

Great! And thanks for the reminders about the limitations -- those are important to know!
