Nomad Podman driver v0.6.1
When a container takes a long time to stop, the allocation's reported exit code is sometimes 0 and sometimes 137.
[root@nomadtesting test-image]# nomad job status redis
ID = redis
Name = redis
Submit Date = 2024-11-29T12:56:38+02:00
Type = service
Priority = 50
Datacenters = dc1
Namespace = default
Node Pool = default
Status = running
Periodic = false
Parameterized = false
Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
redis       0       0         1        4       0         0     0
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created     Modified
68808dc9  253a8f99  redis       0        run      running  17m13s ago  16m57s ago
b897b38f  253a8f99  redis       0        stop     failed   23m1s ago   17m13s ago
cf225248  253a8f99  redis       0        stop     failed   32m42s ago  23m1s ago
e20f42db  253a8f99  redis       0        stop     failed   1h15m ago   32m42s ago
[root@nomadtesting test-image]# nomad alloc status e2
ID = e20f42db-31cd-f245-c548-6f9f5409bea2
Eval ID = 942ad545
Name = redis.redis[0]
Node ID = 253a8f99
Node Name = nomadtesting.novalocal
Job ID = redis
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = stop
Desired Description = alloc was rescheduled because it failed
Created = 1h16m ago
Modified = 33m38s ago
Replacement Alloc ID = cf225248
Task "redis" is "dead"
Task Resources:
CPU Memory Disk Addresses
0/500 MHz 692 KiB/256 MiB 300 MiB
Task Events:
Started At = 2024-11-29T10:58:53Z
Finished At = 2024-11-29T11:40:57Z
Total Restarts = 1
Last Restart = 2024-11-29T13:37:52+02:00
Recent Events:
Time Type Description
2024-11-29T13:40:57+02:00 Not Restarting Error was unrecoverable
2024-11-29T13:40:57+02:00 Driver Failure rpc error: code = FailedPrecondition desc = failed to remove dead container: cannot delete container, status code: 200
2024-11-29T13:37:52+02:00 Restarting Task restarting in 0s
2024-11-29T13:35:29+02:00 Terminated Exit Code: 137
2024-11-29T13:33:45+02:00 Restart Signaled Template with change_mode restart re-rendered
2024-11-29T12:58:53+02:00 Started Task started by client
2024-11-29T12:58:52+02:00 Task Setup Building Task Directory
2024-11-29T12:58:52+02:00 Received Task received by client
[root@nomadtesting test-image]# nomad alloc status cf
ID = cf225248-9e5c-0219-2624-9e6b6cb5010b
Eval ID = cdb4feb7
Name = redis.redis[0]
Node ID = 253a8f99
Node Name = nomadtesting.novalocal
Job ID = redis
Job Version = 0
Client Status = failed
Client Description = Failed tasks
Desired Status = stop
Desired Description = alloc was rescheduled because it failed
Created = 34m14s ago
Modified = 24m33s ago
Replacement Alloc ID = b897b38f
Task "redis" is "dead"
Task Resources:
CPU Memory Disk Addresses
0/500 MHz 688 KiB/256 MiB 300 MiB
Task Events:
Started At = 2024-11-29T11:42:11Z
Finished At = 2024-11-29T11:49:38Z
Total Restarts = 1
Last Restart = 2024-11-29T13:49:23+02:00
Recent Events:
Time Type Description
2024-11-29T13:49:38+02:00 Not Restarting Error was unrecoverable
2024-11-29T13:49:38+02:00 Driver Failure rpc error: code = FailedPrecondition desc = failed to remove dead container: cannot delete container, status code: 200
2024-11-29T13:49:23+02:00 Restarting Task restarting in 0s
2024-11-29T13:49:14+02:00 Terminated Exit Code: 0
2024-11-29T13:48:56+02:00 Restart Signaled Template with change_mode restart re-rendered
2024-11-29T13:42:11+02:00 Started Task started by client
2024-11-29T13:41:57+02:00 Task Setup Building Task Directory
2024-11-29T13:41:57+02:00 Received Task received by client
This happens because, after a stop command is sent to a running container, the stats request
curl -v -s --unix-socket /run/podman/podman.sock http://d/v1.0.0/libpod/containers/$container_id/stats?stream=false
returns 200 with an empty body, which causes runContainerMonitor to call ContainerInspect. If the container has not been killed yet at that point, the inspect result reports exit code 0:
[root@nomadtesting test-image]# curl -v -s --unix-socket /run/podman/podman.sock http://d/v1.0.0/libpod/containers/$container_id/json | jq | grep -i exitcode
* Trying /run/podman/podman.sock:0...
* Connected to d (/run/podman/podman.sock) port 80 (#0)
> GET /v1.0.0/libpod/containers/387834dddaae1e141763740b37b6bec33d39e6bba998cc333370197ee2cf12be/json HTTP/1.1
> Host: d
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Api-Version: 1.41
< Content-Type: application/json
< Libpod-Api-Version: 5.2.2
< Server: Libpod/5.2.2 (linux)
< X-Reference-Id: 0xc00070a000
< Date: Fri, 29 Nov 2024 12:09:30 GMT
< Transfer-Encoding: chunked
<
{ [6334 bytes data]
* Connection #0 to host d left intact
"ExitCode": 0,
"KubeExitCodePropagation": "invalid",
After the container has been killed, the same inspect request reports the correct exit code:
* Trying /run/podman/podman.sock:0...
* Connected to d (/run/podman/podman.sock) port 80 (#0)
> GET /v1.0.0/libpod/containers/387834dddaae1e141763740b37b6bec33d39e6bba998cc333370197ee2cf12be/json HTTP/1.1
> Host: d
> User-Agent: curl/7.76.1
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Api-Version: 1.41
< Content-Type: application/json
< Libpod-Api-Version: 5.2.2
< Server: Libpod/5.2.2 (linux)
< X-Reference-Id: 0xc00070a990
< Date: Fri, 29 Nov 2024 12:20:52 GMT
< Transfer-Encoding: chunked
<
{ [6070 bytes data]
* Connection #0 to host d left intact
"ExitCode": 137,
"KubeExitCodePropagation": "invalid",