Cannot retry to download S3 backup data when Agent-NGT data load timeouts #581
Labels
area/agent/core
area/agent/sidecar
area/agent
priority/low
status/help-wanted
status/need-fix
team/core
Core team
team/sre
SRE team
type/suggestion
Suggestion
related to #503, #556
Describe the issue:
currently, vald-agent-ngt pods have these containers:
agent-sidecar on initContainer mode may fail to complete to download backup data and it returns status code 0 (RST stream from remote host will cause this case). in this case, there may be fragments of backup data in the volume and they cause blocking of NGT startup (#503).
the ideal behavior of the pods on the status like this is retrying to download backup data. however, a failing status of a container doesn't trigger pod restarts.
if there's liveness probe server in the pods, it can trigger pod restarts.
however, agent-NGT has a postStop phase (it is executed after liveness probe killed) to save index. agent-sidecar has a postStop phase to upload index.
so, it is required to improve internal/servers/server to handle these problems.
The text was updated successfully, but these errors were encountered: