Fix NetCDF Authorization failure #89
This looks like our intermittent failure since we upgraded Magpie; see this very similar error Ouranosinc/Magpie#433 (comment). For the moment I would suggest a work-around: bypass all the front proxies (Nginx, Twitcher) and hit Thredds directly at http://host:8083/<SAME_PATH_AFTER> (port taken from https://github.com/bird-house/birdhouse-deploy/blob/c5f45c9d0c2f450379c874ad435f466284518819/birdhouse/docker-compose.yml#L256). Quick explanation: since this is intermittent, the bigger your date range, the more NetCDF files you will probably access, and the more chance that one of those accesses will blow up. The good news is a fix has been found in PR bird-house/birdhouse-deploy#182 and will probably be merged next week when Francis from CRIM is back from vacation.
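A minimal sketch of that proxy-bypass URL rewrite (the hostname, dataset path, and the exact proxy-prefix mapping below are assumptions for illustration; only the 8083 port comes from the linked docker-compose.yml):

```python
from urllib.parse import urlsplit, urlunsplit

def bypass_proxies(url: str, direct_port: int = 8083) -> str:
    """Rewrite a proxied Twitcher/Nginx URL to hit Thredds directly on
    its container port. The prefix mapping is an assumption."""
    parts = urlsplit(url)
    prefix = "/twitcher/ows/proxy/thredds"
    path = parts.path
    if path.startswith(prefix):
        # Keep the same path after the proxy prefix.
        path = "/thredds" + path[len(prefix):]
    host = parts.hostname or ""
    return urlunsplit(("http", f"{host}:{direct_port}", path, parts.query, parts.fragment))

# Hypothetical host and dataset path:
proxied = "https://example-host/twitcher/ows/proxy/thredds/dodsC/datasets/some/file.nc"
print(bypass_proxies(proxied))
# → http://example-host:8083/thredds/dodsC/datasets/some/file.nc
```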
The PR that should theoretically fix this error has been merged: bird-house/birdhouse-deploy#182. Note the new Magpie has a new unique-email constraint that is not backward compatible. Run the query in bird-house/birdhouse-deploy#182 (comment) to find all your users' emails and change any duplicated email before the upgrade. Always a good idea to back up first.
Thank you for the suggestion @tlvu! I have applied the changes made to
We have a password set for
@nikola-rados I have never seen this error in the many Magpie upgrades in the past. Is your
Here are a few things I would try:
I've tried to rollback several times but for whatever reason
And the logs coming from
After applying the update it looks like there may be a duplicated email that is causing a problem:
@nikola-rados before going any further, please immediately take a backup of the database. Then try the command in bird-house/birdhouse-deploy#182 (comment) to find all the duplicate emails, and manually change those duplicates via direct SQL update to remove the duplication. The Magpie UI does not work anymore at this point, so you'll have to use direct SQL updates. Then resume the upgrade. How many custom access control rules and how many users did you add? If not too many, you might as well recreate these rules and users from scratch, but let's keep this option only as a last resort. Pinging @fmigneault, the Magpie developer, in case there is another way out.
The easiest method is a manual SQL update of the email, as @tlvu suggested (or deleting the user if it is not needed anymore).
FYI @nikola-rados, SQL to manually update the email directly in the DB, so you do not have to search yourself: Ouranosinc/Magpie#443 (comment)
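The real queries are in the linked comments; as an illustrative sketch only (an in-memory SQLite table with a hypothetical `users` schema, not Magpie's actual schema), duplicate emails can be found with a GROUP BY / HAVING query and fixed with a direct UPDATE before the unique constraint is applied:

```python
import sqlite3

# Hypothetical users table standing in for Magpie's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, user_name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (user_name, email) VALUES (?, ?)",
    [("alice", "a@example.org"), ("bob", "dup@example.org"), ("carol", "dup@example.org")],
)

# Find emails used by more than one user -- these would violate the
# new unique-email constraint introduced by the Magpie upgrade.
dupes = conn.execute(
    "SELECT email, COUNT(*) FROM users GROUP BY email HAVING COUNT(*) > 1"
).fetchall()
print(dupes)  # → [('dup@example.org', 2)]

# De-duplicate with a direct UPDATE (the Magpie UI is unusable at this
# point, so it has to be done straight in the database).
conn.execute("UPDATE users SET email = 'bob+fixed@example.org' WHERE user_name = 'bob'")
remaining = conn.execute(
    "SELECT COUNT(*) FROM (SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1)"
).fetchone()[0]
print(remaining)  # → 0
```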
I'll be on vacation in a few hours. Francis (@fmigneault), please continue helping Nik (@nikola-rados) from PCIC with this Magpie upgrade, thanks.
Ok |
Good news is that the

```
HTTPError: 504 Server Error: Gateway Time-out for url: https://docker-dev03.pcic.uvic.ca/twitcher/ows/proxy/osprey/wps
```

Also, our docker logs show the following traceback:

```
Traceback (most recent call last):
  File "/tmp/osprey/processes/wps_full_rvic.py", line 199, in _handler
    convolution(convolve_config)
  File "/root/.local/lib/python3.8/site-packages/rvic/convolution.py", line 59, in convolution
    time_handle, hist_tapes = convolution_run(
  File "/root/.local/lib/python3.8/site-packages/rvic/convolution.py", line 337, in convolution_run
    runin = data_model.read(timestamp)
  File "/root/.local/lib/python3.8/site-packages/rvic/core/read_forcing.py", line 320, in read
    temp = self.current_fhdl.variables[fld][self.current_tind]
  File "src/netCDF4/_netCDF4.pyx", line 4406, in netCDF4._netCDF4.Variable.__getitem__
  File "src/netCDF4/_netCDF4.pyx", line 5350, in netCDF4._netCDF4.Variable._get
  File "src/netCDF4/_netCDF4.pyx", line 1927, in netCDF4._netCDF4._ensure_nc_success
RuntimeError: NetCDF: DAP server error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/pywps/app/Process.py", line 250, in _run_process
    self.handler(wps_request, wps_response)  # the user must update the wps_response.
  File "/tmp/osprey/processes/wps_full_rvic.py", line 201, in _handler
    raise ProcessError(f"{type(e).__name__}: {e}")
pywps.app.exceptions.ProcessError: RuntimeError: NetCDF: DAP server error
2021-08-06 18:55:01 ERROR: osprey: Process error: method=wps_full_rvic.py._handler, line=201, msg=RuntimeError: NetCDF: DAP server error
```
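Since the DAP failures are intermittent, one pragmatic mitigation (a sketch only, not part of osprey or RVIC; the function names, exception type, and retry counts are illustrative) is to wrap the flaky read in a bounded retry:

```python
import time

def read_with_retry(read_fn, retries=3, delay=1.0, exc=RuntimeError):
    """Call read_fn(), retrying on intermittent errors such as
    'RuntimeError: NetCDF: DAP server error'. Illustrative only."""
    for attempt in range(retries):
        try:
            return read_fn()
        except exc:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)

# Demo with a fake reader that fails twice before succeeding,
# standing in for something like data_model.read(timestamp).
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("NetCDF: DAP server error")
    return "forcing-data"

print(read_with_retry(flaky_read, delay=0.0))  # → forcing-data
```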
I did another run of

```
2021-08-18 18:13:01 DEBUG: osprey: Initializing database connection
2021-08-18 18:13:01,353 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2021-08-18 18:13:01,354 INFO sqlalchemy.engine.Engine SELECT count(*) AS count_1
FROM (SELECT pywps_requests.uuid AS pywps_requests_uuid, pywps_requests.pid AS pywps_requests_pid, pywps_requests.operation AS pywps_requests_operation, pywps_requests.version AS pywps_requests_version, pywps_requests.time_start AS pywps_requests_time_start, pywps_requests.time_end AS pywps_requests_time_end, pywps_requests.identifier AS pywps_requests_identifier, pywps_requests.message AS pywps_requests_message, pywps_requests.percent_done AS pywps_requests_percent_done, pywps_requests.status AS pywps_requests_status
FROM pywps_requests
WHERE pywps_requests.uuid = ?) AS anon_1
2021-08-18 18:13:01,354 INFO sqlalchemy.engine.Engine [cached since 8.944e+04s ago] ('fb9988c8-004e-11ec-b7fc-0242ac12000c',)
2021-08-18 18:13:01,355 INFO sqlalchemy.engine.Engine SELECT pywps_requests.uuid AS pywps_requests_uuid, pywps_requests.pid AS pywps_requests_pid, pywps_requests.operation AS pywps_requests_operation, pywps_requests.version AS pywps_requests_version, pywps_requests.time_start AS pywps_requests_time_start, pywps_requests.time_end AS pywps_requests_time_end, pywps_requests.identifier AS pywps_requests_identifier, pywps_requests.message AS pywps_requests_message, pywps_requests.percent_done AS pywps_requests_percent_done, pywps_requests.status AS pywps_requests_status
FROM pywps_requests
WHERE pywps_requests.uuid = ?
2021-08-18 18:13:01,355 INFO sqlalchemy.engine.Engine [cached since 8.944e+04s ago] ('fb9988c8-004e-11ec-b7fc-0242ac12000c',)
2021-08-18 18:13:01,357 INFO sqlalchemy.engine.Engine UPDATE pywps_requests SET time_end=?, message=?, percent_done=?, status=? WHERE pywps_requests.uuid = ?
2021-08-18 18:13:01,357 INFO sqlalchemy.engine.Engine [cached since 8.944e+04s ago] ('2021-08-18 18:13:01.356406', 'Process error: RuntimeError: NetCDF: DAP server error', 100.0, 5, 'fb9988c8-004e-11ec-b7fc-0242ac12000c')
2021-08-18 18:13:01,357 INFO sqlalchemy.engine.Engine COMMIT
2021-08-18 18:13:01 DEBUG: osprey: _update_status: status=5, clean=True
2021-08-18 18:13:01 DEBUG: osprey: clean workdir: status=5
2021-08-18 18:13:01 INFO: osprey: Removing temporary working directory: /tmp/pywps_process_ftlcy2n_
2021-08-18 18:13:01 DEBUG: osprey: Checking for stored requests
2021-08-18 18:13:01 DEBUG: osprey: Initializing database connection
2021-08-18 18:13:01,362 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2021-08-18 18:13:01,364 INFO sqlalchemy.engine.Engine SELECT pywps_stored_requests.uuid AS pywps_stored_requests_uuid, pywps_stored_requests.request AS pywps_stored_requests_request
FROM pywps_stored_requests
LIMIT ? OFFSET ?
2021-08-18 18:13:01,364 INFO sqlalchemy.engine.Engine [generated in 0.00019s] (1, 0)
2021-08-18 18:13:01 DEBUG: osprey: No stored request found
2021-08-18 18:13:01 INFO: osprey: Request: getcapabilities
```

The Twitcher logs around the same time show:

```
2021-08-18 18:09:01,242 INFO [TWITCHER:158][waitress-1] 'None' request 'read' permission on '/ows/proxy/thredds/dodsC/datasets/storage/data/projects/comp_support/climate_explorer_data_prep/hydro/sample_data/set4/columbia_vicset2.nc.dods'
2021-08-18 18:09:01,255 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
2021-08-18 18:09:01,274 INFO [TWITCHER:60][waitress-1] Using adapter: '<class 'magpie.adapter.MagpieAdapter'>'
```

Sometimes, the following shows up:

```
2021-08-18 18:09:01,348 WARNI [waitress.queue:117][MainThread] Task queue depth is 1
```

This also appears after the server error occurs:

```
2021-08-18 18:13:01,439 INFO [waitress:353][waitress-0] Client disconnected while serving /ows/proxy/thredds/dodsC/datasets/storage/data/projects/comp_support/climate_explorer_data_prep/hydro/sample_data/set4/columbia_vicset2.nc.dods
```

The following also shows up occasionally, but it shows something similar for our other web services, so it's probably normal:

```
2021-08-18 18:25:01,827 INFO [TWITCHER:158][waitress-0] 'None' request 'getcapabilities' permission on '/ow
```

@tlvu I was wondering if any of these logging messages could indicate why we're having a
If the server can handle more workers, the number could be increased, since multiple download / WPS requests are often executed in parallel by multiple users. If not, I would investigate further why
I remember getting many caching slowdowns with
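For reference, the waitress thread pool (what the "Task queue depth" warning relates to) defaults to 4 and is typically raised in the service's PasteDeploy ini. A hypothetical fragment; the actual file location and values in birdhouse-deploy may differ:

```ini
[server:main]
use = egg:waitress#main
listen = 0.0.0.0:8000
# Default is 4; raise this if multiple users often run
# downloads / WPS requests in parallel.
threads = 10
```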
This one only indicates that the "anonymous"/public user is used for the corresponding request.
Sorry for the late reply, I was on vacation.
That 4 min timeout seems to be related to this config: https://github.com/bird-house/birdhouse-deploy/blob/4e9a94dbcbf1ddd8bb275317206c3376fd7897d3/birdhouse/default.env#L77-L79. You can bump that value to something more appropriate. The side-effect of a too-long timeout is that if the server is indeed crashing or dead, you won't find out as soon. Yours to decide on the appropriate value.

The note on the config also mentions using async mode. Note there is currently a bug with queue handling in PyWPS such that you cannot mix sync and async calls within the same bird, see bird-house/finch#121. We got around this problem by running two instances of the same bird, one dedicated to sync and another to async (see bird-house/finch#98 (comment)).

General debugging note: the first step of debugging should be to remove the front Nginx and Twitcher proxies and hit the WPS and Thredds services directly. This can be achieved by hitting the port of the service directly instead of going through port 443 (HTTPS). For example, the Thredds port is 8083, per the docker-compose.yml file (https://github.com/bird-house/birdhouse-deploy/blob/4e9a94dbcbf1ddd8bb275317206c3376fd7897d3/birdhouse/docker-compose.yml#L256). Note it is also a security risk to expose direct access to all the services, so your firewall should only expose 443 to the internet; all those direct service ports should only be reachable inside your network. Hope it helps.
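The exact variable names live in the linked default.env; underneath, this kind of setting typically maps onto Nginx proxy timeout directives. An illustrative fragment only (location, upstream, and values are assumptions, not the real birdhouse-deploy config):

```nginx
# Illustrative Nginx proxy timeouts; in birdhouse-deploy the real
# values are driven by variables in birdhouse/default.env.
location /twitcher/ {
    proxy_pass http://twitcher:8000;
    # Nginx defaults to 60s; long-running WPS requests need more,
    # but a very long timeout also delays noticing a dead backend.
    proxy_read_timeout 240s;
    proxy_send_timeout 240s;
}
```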
This issue has been fully resolved by increasing the
When running the `wps_full_rvic_demo.ipynb` notebook with `localhost` as the target URL, having a long time range (longer than 2-3 years) causes the following issue at some point during the convolution part:

```
owslib.wps.WPSException : {'code': 'NoApplicableCode', 'locator': 'None', 'text': 'Process error: RuntimeError: NetCDF: Authorization failure'}
```

Steps to reproduce the behavior:

1. Run `osprey start`.
2. Run `jupyter lab`.
3. In the `wps_full_rvic_demo.ipynb` notebook, go to cell 2 and change `get_target_url("osprey")` to `"http://localhost:5000/wps"`.
4. Change `run_startdate` to `2010-01-01-00` (or a date before that).

Ideally, the process should be able to run for a long time range with no errors. Since this error does not occur when running `RVIC` normally, it appears to be an issue within `osprey`.