Cannot retry to download S3 backup data when Agent-NGT data load timeouts #581

rinx · 2020-07-15T05:35:02Z

related to #503, #556

Describe the issue:

currently, vald-agent-ngt pods have these containers:

initContainers
- agent-sidecar (initcontainer mode: download S3 backup data to volume)
containers
- agent-ngt
- agent-sidecar (sidecar mode: upload S3 backup data)

in initContainer mode, agent-sidecar may fail to finish downloading the backup data and still exit with status code 0 (an RST stream from the remote host can cause this). when that happens, fragments of backup data may remain in the volume and block NGT startup (#503).
ideally, a pod in this state would retry the download. however, a failing status of a container doesn't trigger pod restarts.

if the pods had a liveness probe server, a failing probe could trigger pod restarts.
however, agent-NGT has a postStop phase that saves the index (it runs after the liveness probe kills the container), and agent-sidecar has a postStop phase that uploads the index.
so internal/servers/server needs to be improved to handle these problems.

issue-label-bot · 2020-07-15T05:35:04Z

Issue-Label Bot is automatically applying the label type/bug to this issue, with a confidence of 0.88. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

rinx added type/bug Something isn't working team/core Core team priority/medium labels Jul 15, 2020

rinx mentioned this issue Jul 15, 2020

[patch] [agent-NGT, sidecar] Improve S3 backup/recover behavior #556

Merged

18 tasks

kpango added team/sre SRE team priority/low status/need-fix status/help-wanted area/agent type/suggestion Suggestion area/agent/core area/agent/sidecar and removed type/bug Something isn't working priority/medium labels Jun 7, 2022
