
bug: node won't start with RLN in on-chain dynamic mode #2662

Closed
romanzac opened this issue May 3, 2024 · 17 comments
Labels: bug (Something isn't working), effort/days (Estimated to be completed in a few days, less than a week)

Comments

romanzac (Contributor) commented May 3, 2024

Problem

During "test_publish_with_valid_payloads_dynamic_at_slow_rate" execution, first container running Node1 won't start even within the timeout of 10 minutes. Static mode is working fine. Please have a look.

I've built docker image from release: wakunode2-v0.27.0

Impact

High occurrence, medium severity. RLN in dynamic mode not functional.

To reproduce

  1. Please checkout waku-org/waku-interop-tests@73d0951
  2. cd waku-interop-tests
  3. python -m venv .venv
  4. source .venv/bin/activate
  5. pip install -r requirements.txt
  6. pre-commit install
  7. pytest tests/relay/test_rln.py -k 'test_publish_with_valid_payloads_dynamic_at_slow_rate'
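While the test is stuck waiting, it can help to watch the Node1 container directly. A minimal sketch is below; the name=node1 filter is only a guess at how the interop tests label containers.

```python
# Minimal sketch: tail the logs of the stuck node's container while the test waits.
# The "node1" name filter is a guess at how the interop tests label containers.
import subprocess

containers = subprocess.run(
    ["docker", "ps", "--filter", "name=node1", "--format", "{{.Names}}"],
    capture_output=True, text=True, check=True,
).stdout.split()

for name in containers:
    print(f"--- last 20 log lines of {name} ---")
    subprocess.run(["docker", "logs", "--tail", "20", name], check=False)
```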

Expected behavior

RLN Relay in on-chain mode working.

Screenshots/logs

node1_2024-05-03_12-45-38__5a5531e5-c304-462b-a7f3-f58ba92d0a0b__harbor.status.im_wakuorg_nwaku:latest.log
node1_2024-05-03_12-45-38__c9737de4-6df4-4f4c-b308-323968164308__harbor.status.im_wakuorg_nwaku:latest.log
node1_2024-05-03_12-46-01__96df9f86-46da-4444-8b3a-7b6d0e7e060a__harbor.status.im_wakuorg_nwaku:latest.log
node2_2024-05-03_12-45-38__c9737de4-6df4-4f4c-b308-323968164308__harbor.status.im_wakuorg_nwaku:latest.log
test_run.log

romanzac added the bug label May 3, 2024
rymnc (Contributor) commented May 3, 2024

the second log you shared seems to show the node has finished syncing - does the /health endpoint still not return true?
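(For reference, polling the endpoint from the host might look like the minimal sketch below; localhost:8645 is an assumption based on nwaku's default REST port, and the exact /health response body depends on the nwaku version.)

```python
# Minimal sketch, assuming the node's REST API is reachable on localhost:8645
# (nwaku's default REST port). The exact /health body differs between versions,
# so we only print whatever the node returns.
import requests

resp = requests.get("http://localhost:8645/health", timeout=5)
print(resp.status_code, resp.text)
```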

rymnc (Contributor) commented May 3, 2024

the fourth log shows the node has not finished syncing

rymnc (Contributor) commented May 3, 2024

On my own node (screenshot omitted), the health endpoint seems to return "not ready" and "healthy" in a flaky manner. cc: @NagyZoltanPeter

rymnc (Contributor) commented May 3, 2024

I see in your logs that we use debug/v1/info in the test - which results in

Response status code: 200. Response content: b'{"listenAddresses":["/ip4/172.18.205.233/tcp/6886/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip","/ip4/172.18.205.233/tcp/6887/ws/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip"],"enrUri":"enr:-LO4QGGlww8liwBmcHFHdLcXwt-Uq0c6iU6cdDJ6pWlh2avnWILMdWa9P_iCS0kiWhLuECjRTMvxoykPXyP5sKjcx88BgmlkgnY0gmlwhKwSzemKbXVsdGlhZGRyc4wACgSsEs3pBhrn3QOCcnOFAAABAACJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CCGuaDdWRwghrohXdha3UyAQ"}'
INFO     src.node.waku_node:waku_node.py:193 REST service is ready !!

does node1 not respond after this is available?
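(Side note: debug/v1/info only shows that the REST server is up, while /health is the endpoint expected to reflect RLN sync state, so the two can disagree while the node is still syncing. A rough polling sketch, with localhost:8645 as an assumed address:)

```python
# Rough sketch: compare debug/v1/info (REST server up) with /health (RLN synced).
# The base URL is an assumption; the interop tests map container ports dynamically.
import time
import requests

BASE = "http://localhost:8645"
deadline = time.time() + 600  # 10-minute budget, matching the test timeout

while time.time() < deadline:
    try:
        info = requests.get(f"{BASE}/debug/v1/info", timeout=5)
        health = requests.get(f"{BASE}/health", timeout=5)
        print("info:", info.status_code, "| health:", health.status_code, health.text[:80])
        if health.status_code == 200:
            break
    except requests.RequestException as exc:
        print("REST not reachable yet:", exc)
    time.sleep(10)
```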

rymnc (Contributor) commented May 3, 2024

testing with #2664

NagyZoltanPeter (Contributor):

> On my own node (screenshot omitted), the health endpoint seems to return "not ready" and "healthy" in a flaky manner. cc: @NagyZoltanPeter

It's a known issue, already solved here: #2612
It will be part of the next release!

rymnc (Contributor) commented May 3, 2024

Awesome! I might be missing something, but does the PR address the node healthy/not ready flakiness?

rymnc (Contributor) commented May 3, 2024

I suggest using this image: quay.io/wakuorg/nwaku-pr:2664-rln-v1

NagyZoltanPeter (Contributor):

> Awesome! I might be missing something, but does the PR address the node healthy/not ready flakiness?

Indeed! It separates the node's general readiness (initialization) from the status of rln_relay. There is an array of protocol statuses (to be improved in the future) in which you can now see the rln_relay status (it changes between ready and syncing according to the actual state).
Like this:

{
    "nodeHealth": "Ready",
    "protocolsHealth": [
        {
            "Rln Relay": "Ready"
        }
    ]
}
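For the interop tests, waiting on this richer response could look roughly like the sketch below (assuming the JSON shape above; the base URL is an assumption, since the tests map container ports dynamically).

```python
# Rough sketch: poll /health until the "Rln Relay" protocol reports "Ready",
# based on the response shape above (an assumption about the post-#2612 build).
import time
import requests

def wait_for_rln_ready(base_url="http://localhost:8645", timeout_s=600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            health = requests.get(f"{base_url}/health", timeout=5).json()
            protocols = {}
            for entry in health.get("protocolsHealth", []):
                protocols.update(entry)
            if health.get("nodeHealth") == "Ready" and protocols.get("Rln Relay") == "Ready":
                return True
        except (requests.RequestException, ValueError):
            pass  # REST not up yet, or body not JSON; keep polling
        time.sleep(5)
    return False
```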

rymnc (Contributor) commented May 3, 2024

> Indeed! It separates the node's general readiness (initialization) from the status of rln_relay. [...]

wow, nice 🔥

romanzac (Contributor, Author) commented May 3, 2024

Good progress on the health info indeed! I'll wait for the next release to enable on-chain tests, and I will try to play with the /health endpoint on Monday. Thanks for now.

NagyZoltanPeter (Contributor):

@rymnc: Hi, about the original issue: from the logs I don't see the actual problem. The first log tells me that the node started well; it took almost 7 minutes to sync the on-chain RLN blocks. The second node (or log) ends while still syncing, but only 3 minutes after startup.
Maybe the start sequence needs to be checked: it seems to me that the nodes started with a delay, while the 10-minute timeout applies from the first node's start.
Can you please check this scenario? You may need to extend the timeout to give the second node enough time to get in sync.
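Roughly what I mean (a sketch with hypothetical helper names; node.start() and node.is_ready() stand in for whatever the interop tests actually call):

```python
# Rough sketch: count each node's sync budget from that node's own container start,
# instead of sharing one 10-minute timeout across the whole startup sequence.
# node.start() and node.is_ready() are hypothetical stand-ins for the harness helpers.
import time

PER_NODE_TIMEOUT_S = 600  # e.g. up to 10 minutes of on-chain RLN sync per node

def start_and_wait(node):
    node.start()                 # hypothetical: starts the container
    started_at = time.time()     # timeout measured from *this* node's start
    while time.time() - started_at < PER_NODE_TIMEOUT_S:
        if node.is_ready():      # hypothetical: checks the /health endpoint
            return
        time.sleep(10)
    raise TimeoutError(f"node did not finish RLN sync within {PER_NODE_TIMEOUT_S}s")

# for node in (node1, node2):
#     start_and_wait(node)
```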

romanzac (Contributor, Author) commented May 8, 2024

gabrielmer added the effort/days label Jul 24, 2024
romanzac (Contributor, Author):

Hi @rymnc, I've retested on-chain mode and node startup with RLNv2.

I've tried two scenarios:

  1. No RLN credential files are present when the test starts - the node with RLN starts in 52 seconds on average, which is a solid improvement.

Logs:
test.log
node1_2024-07-29_17-52-46__4707c900-c283-4ee8-ba48-b42c351efbb4__harbor.status.im_wakuorg_nwaku:latest.log
node1_2024-07-29_17-52-46__d4c701df-cfb7-4281-9c67-0e297b744520__harbor.status.im_wakuorg_nwaku:latest.log
node1_2024-07-29_17-53-01__2fb144ad-1a02-4348-a6c2-0a11f3e76788__harbor.status.im_wakuorg_nwaku:latest.log
node2_2024-07-29_17-52-46__4707c900-c283-4ee8-ba48-b42c351efbb4__harbor.status.im_wakuorg_nwaku:latest.log

  2. Existing RLN credential files are present and new credential data is appended to the end of keystore.json - the node cannot start.
ERR 2024-07-29 09:57:17.614+00:00 Mounting protocols failed                  tid=1 file=node_factory.nim:456 error="failed to mount waku RLN relay protocol: failed to mount WakuRlnRelay: could not initialize the group manager: failed to get the keystore credentials: Error while searching credentials in keystore: credential not found: Credential not found in keystore"
ERR 2024-07-29 09:57:17.614+00:00 Failed setting up node                     topics="wakunode main" tid=1 file=waku.nim:144 error="Mounting protocols failed: failed to mount waku RLN relay protocol: failed to mount WakuRlnRelay: could not initialize the group manager: failed to get the keystore credentials: Error while searching credentials in keystore: credential not found: Credential not found in keystore"
ERR 2024-07-29 09:57:17.614+00:00 Waku initialization failed                 topics="wakunode main" tid=1 file=wakunode2.nim:68 error="Failed setting up node: Mounting protocols failed: failed to mount waku RLN relay protocol: failed to mount WakuRlnRelay: could not initialize the group manager: failed to get the keystore credentials: Error while searching credentials in keystore: credential not found: Credential not found in keystore"

Logs:
test.log
node1_2024-07-29_17-56-26__8b250a23-d437-4b17-8a74-ca245ae62527__harbor.status.im_wakuorg_nwaku:latest.log
node1_2024-07-29_17-56-26__c8445c39-15b7-40e7-9412-a670da0b4ed6__harbor.status.im_wakuorg_nwaku:latest.log
node1_2024-07-29_17-56-27__4e5e36bb-d051-4bb6-bbe6-26a69d306ce5__harbor.status.im_wakuorg_nwaku:latest.log
node2_2024-07-29_17-56-26__8b250a23-d437-4b17-8a74-ca245ae62527__harbor.status.im_wakuorg_nwaku:latest.log

Any hints?

rymnc (Contributor) commented Jul 29, 2024

Hmm, thanks for re-running, @romanzac!

May I quickly ask if scenario 1 is working?

As for scenario 2, can you send the keystore with which you have this problem, please?

romanzac (Contributor, Author):

> Hmm, thanks for re-running, @romanzac!
>
> May I quickly ask if scenario 1 is working?
>
> As for scenario 2, can you send the keystore with which you have this problem, please?

Yes, scenario 1 works. Keystore sent over.

romanzac (Contributor, Author) commented Jul 29, 2024

It looks like scenario 2 is somewhat unusual for standard operation. So I will change my tests to either reuse the existing registration or create unique keystores, to avoid parsing the log for the assigned index value (e.g. membershipIndex=4180).
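Roughly what I have in mind for the unique-keystore option (a sketch; the --rln-relay-cred-path flag name is my assumption about how the credentials directory gets passed to the node):

```python
# Rough sketch: give every test run its own credentials directory so a fresh
# registration is never appended to a keystore.json left over from an earlier run.
# The --rln-relay-cred-path flag name is an assumption; check the nwaku config docs.
import uuid
from pathlib import Path

def fresh_keystore_dir(base="./rln_creds"):
    run_dir = Path(base) / uuid.uuid4().hex   # unique per test run
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir

extra_node_args = [
    f"--rln-relay-cred-path={fresh_keystore_dir()}",
    # credential password etc. passed the same way the tests already do
]
```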
