Description
BUG REPORT
What happened:
While running the node-cpu-hog chaos experiment, some helper pods sometimes fail to come up when I specify multiple TARGET_NODES in comma-separated format. In my case I have 4 nodes; if I specify all 4, it brings up 2 helper pods and then fails to bring up the other 2. I see the error below inside the node-cpu-xxxx-xxx pod:
CPU hog failed, err: unable to create the helper pod, err: Post "https://10.96.0.1:443/api/v1/namespaces/default/pods": read tcp 192.168.230.167:50174->10.96.0.1:443: read: connection reset by peer
time="2023-03-09T15:43:36Z" level=info msg="Experiment Name: node-cpu-hog"
time="2023-03-09T15:43:36Z" level=info msg="[PreReq]: Getting the ENV for the node-cpu-hog experiment"
time="2023-03-09T15:43:38Z" level=info msg="[PreReq]: Updating the chaos result of node-cpu-hog experiment (SOT)"
time="2023-03-09T15:43:42Z" level=info msg="The application information is as follows" Node Label= Chaos Duration=60 Target Nodes="node-10-120-127-170,node-10-120-127-171,node-10-120-127-172,node-10-120-127-173" Node CPU Cores=1
time="2023-03-09T15:43:42Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2023-03-09T15:43:42Z" level=info msg="[Status]: No appLabels provided, skipping the application status checks"
time="2023-03-09T15:43:42Z" level=info msg="[Status]: Getting the status of target nodes"
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Ready=true Node=node-10-120-127-170
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Node=node-10-120-127-171 Ready=true
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Node=node-10-120-127-172 Ready=true
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Ready=true Node=node-10-120-127-173
time="2023-03-09T15:43:44Z" level=info msg="[Info]: The chaos tunables are:" Sequence=parallel Node CPU Cores=1 CPU Load=0 Node Affce Perc=0
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Nodes under chaos injection" No. Of Nodes=4 Node Names="[node-10-120-127-170 node-10-120-127-171 node-10-120-127-172 node-10-120-127-173]"
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Node under chaos injection" NodeName=node-10-120-127-170 NodeCPUcores=1
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Node under chaos injection" NodeName=node-10-120-127-171 NodeCPUcores=1
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Node under chaos injection" NodeName=node-10-120-127-172 NodeCPUcores=1
time="2023-03-09T15:43:45Z" level=error msg="[Error]: CPU hog failed, err: unable to create the helper pod, err: Post \"https://10.96.0.1:443/api/v1/namespaces/default/pods\": read tcp 192.168.230.167:50174->10.96.0.1:443: read: connection reset by peer"
And this fails the experiment at the end:
kubectl describe chaosresults.litmuschaos.io nginx-chaos-node-cpu-hog
Name:         nginx-chaos-node-cpu-hog
Namespace:    default
Labels:       app.kubernetes.io/component=experiment-job
              app.kubernetes.io/part-of=litmus
              app.kubernetes.io/version=2.14.0
              chaosUID=9c104680-26c3-49a6-801c-2ee3f9f96505
              controller-uid=f20544d9-b90a-4f08-9438-fbfbdf3c74e5
              job-name=node-cpu-hog-i0wu3z
              name=node-cpu-hog
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2023-03-08T16:17:17Z
  Generation:          4
  Managed Fields:
    API Version:  litmuschaos.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:app.kubernetes.io/component:
          f:app.kubernetes.io/part-of:
          f:app.kubernetes.io/version:
          f:chaosUID:
          f:controller-uid:
          f:job-name:
          f:name:
      f:spec:
        .:
        f:engine:
        f:experiment:
      f:status:
        .:
        f:experimentStatus:
        f:history:
    Manager:         experiments
    Operation:       Update
    Time:            2023-03-08T16:17:17Z
  Resource Version:  5704800
  UID:               d53abe6b-e176-4769-9b72-4af35cd7d2ee
Spec:
  Engine:      nginx-chaos
  Experiment:  node-cpu-hog
Status:
  Experiment Status:
    Fail Step:                 [chaos]: Failed inside the chaoslib, err: unable to create the helper pod, err: Post "https://10.96.0.1:443/api/v1/namespaces/default/pods": read tcp 192.168.230.167:50174->10.96.0.1:443: read: connection reset by peer
    Phase:                     Completed
    Probe Success Percentage:  0
    Verdict:                   Fail
  History:
    Failed Runs:   1
    Passed Runs:   1
    Stopped Runs:  0
Events:
  Type     Reason   Age    From                       Message
  ----     ------   ----   ----                       -------
  Normal   Awaited  3m26s  node-cpu-hog-i0wu3z-h7q5j  experiment: node-cpu-hog, Result: Awaited
  Warning  Fail     3m19s  node-cpu-hog-i0wu3z-h7q5j  experiment: node-cpu-hog, Result: Fail
What you expected to happen:
I expect all helper pods to come up and reach the Running state, and the experiment to succeed.
How to reproduce it (as minimally and precisely as possible):
- Install Litmus Operator
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.8.yaml
- Install the experiment engine
kubectl apply -f https://github.com/litmuschaos/chaos-charts/raw/v2.14.x/experiments/generic/node-cpu-hog/experiment.yaml
- Install the rbac yaml file
kubectl apply -f https://github.com/litmuschaos/chaos-charts/raw/v2.14.x/experiments/generic/node-cpu-hog/rbac.yaml
- Apply the node-cpu-hog-engine.yaml file below
kubectl apply -f node-cpu-hog-engine.yaml
Anything else we need to know?:
Environment:
kubectl get nodes
NAME                  STATUS   ROLES       AGE   VERSION
node-10-120-127-170   Ready    edge,node   8d    v1.22.17
node-10-120-127-171   Ready    edge,node   8d    v1.22.17
node-10-120-127-172   Ready    node        8d    v1.22.17
node-10-120-127-173   Ready    node        8d    v1.22.17
node-cpu-hog-engine YAML File:
cat node-cpu-hog-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  chaosServiceAccount: node-cpu-hog-sa
  experiments:
    - name: node-cpu-hog
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'
            ## ENTER THE NUMBER OF CORES OF CPU FOR CPU HOGGING
            ## OPTIONAL VALUE IN CASE OF EMPTY VALUE IT WILL TAKE NODE CPU CAPACITY
            - name: NODE_CPU_CORE
              value: '1'
            ## LOAD CPU WITH GIVEN PERCENT LOADING FOR THE CPU STRESS WORKERS.
            ## 0 IS EFFECTIVELY A SLEEP (NO LOAD) AND 100 IS FULL LOADING
            - name: CPU_LOAD
              value: '0'
            ## percentage of total nodes to target
            - name: NODES_AFFECTED_PERC
              value: ''
            # provide the comma separated target node names
            - name: TARGET_NODES
              value: 'node-10-120-127-170,node-10-120-127-171,node-10-120-127-172,node-10-120-127-173'
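
For reference, here is a minimal sketch of how the comma-separated TARGET_NODES value above maps to the four node names reported in the experiment log ("No. Of Nodes=4"). It is only an illustration in Python, not the Litmus implementation; `parse_target_nodes` is a hypothetical helper name, and the whitespace and empty-entry handling is an assumption.

```python
from typing import List


def parse_target_nodes(value: str) -> List[str]:
    """Split a comma-separated TARGET_NODES value into node names.

    Hypothetical helper for illustration; not part of Litmus.
    """
    # Trim whitespace around each entry and drop empty entries
    # (for example, those produced by a trailing comma).
    return [name.strip() for name in value.split(",") if name.strip()]


# Value taken verbatim from the ChaosEngine manifest above.
target_nodes = (
    "node-10-120-127-170,node-10-120-127-171,"
    "node-10-120-127-172,node-10-120-127-173"
)
nodes = parse_target_nodes(target_nodes)
print(len(nodes), nodes)
```

With all four nodes parsed as targets, one helper pod per node would be expected, which is why the missing helper pods point to the pod creation step rather than to the TARGET_NODES value itself.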