
pod-network-loss: cleanup fails because target pod has been restarted #591

Open
@ganto

Description


BUG REPORT

What happened:
I ran the chaos experiment 'pod-network-loss' against a pod. The helper successfully injected the Linux traffic control rule to block the traffic. As a result, the pod failed its network-based liveness probe and Kubernetes killed and restarted it. When the experiment ended, the helper tried to revert the traffic control rule, but this failed because the original process in the pod was no longer running. This left the helper pod in a failed state and the application pod stuck in CrashLoopBackOff, because the network-based readiness probe could not succeed while the traffic was still blocked.

time="2022-10-20T14:20:33Z" level=info msg="Helper Name: network-chaos"
time="2022-10-20T14:20:33Z" level=info msg="[PreReq]: Getting the ENV variables"
time="2022-10-20T14:20:33Z" level=info msg="container ID of tls-terminator container, containerID: 3c762d3ba26f21bb7cd41d92bb5161793750e9f3db11ae317f72ddf8cdba5d44"
time="2022-10-20T14:20:33Z" level=info msg="Container ID: 3c762d3ba26f21bb7cd41d92bb5161793750e9f3db11ae317f72ddf8cdba5d44"
time="2022-10-20T14:20:33Z" level=info msg="[Info]: Container ID=3c762d3ba26f21bb7cd41d92bb5161793750e9f3db11ae317f72ddf8cdba5d44 has process PID=360376"
time="2022-10-20T14:20:33Z" level=info msg="/bin/bash -c sudo nsenter -t 360376 -n tc qdisc replace dev eth0 root netem loss 100"
time="2022-10-20T14:20:34Z" level=info msg="[Chaos]: Waiting for 300s"
time="2022-10-20T14:25:34Z" level=info msg="[Chaos]: Stopping the experiment"
time="2022-10-20T14:25:34Z" level=info msg="/bin/bash -c sudo nsenter -t 360376 -n tc qdisc delete dev eth0 root"
time="2022-10-20T14:25:34Z" level=error msg="nsenter: can't open '/proc/360376/ns/net': No such file or directory\n"
time="2022-10-20T14:25:34Z" level=fatal msg="helper pod failed, err: exit status 1"

What you expected to happen:
Once the experiment completes, the traffic control rule is removed so that the application pod can function properly again.

How to reproduce it (as minimally and precisely as possible):

  • Set up the experiment with all the necessary Kubernetes resources
  • Create a deployment with a network-based liveness probe. E.g.:
      livenessProbe:
        httpGet:
          path: /healthz
          port: http
          scheme: HTTP
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
  • Run the ChaosEngine with a TOTAL_CHAOS_DURATION long enough for the liveness probe to reach its failure threshold and for the pod to be killed (a sketch follows below)
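
For illustration, a ChaosEngine along the following lines triggers the scenario. This is only a sketch: the names, namespace, app label and chaos service account are placeholders and have to be adapted to the deployment created above.

# Placeholder names/namespace/labels/serviceaccount -- adjust to your setup.
kubectl apply -f - <<'EOF'
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: app-network-loss
  namespace: my-namespace
spec:
  engineState: active
  appinfo:
    appns: my-namespace
    applabel: app=my-app
    appkind: deployment
  chaosServiceAccount: pod-network-loss-sa
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            # Long enough for the liveness probe to reach failureThreshold
            # (3 x 10s with the probe above) and for the container to be killed.
            - name: TOTAL_CHAOS_DURATION
              value: "300"
            - name: NETWORK_INTERFACE
              value: eth0
EOF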

Anything else we need to know?:
