Unable to create the helper pod when specifying multiple TARGET_NODES for experiment. #453

@jayzziebone

Description

BUG REPORT

What happened:
While running the node-cpu-hog chaos experiment, some helper pods sometimes fail to come up when I specify multiple TARGET_NODES in comma-separated format. In my case I have 4 nodes; if I specify all 4, it brings up 2 helper pods and then fails to bring up the other 2. I see the error below inside the node-cpu-xxxx-xxx pod:
CPU hog failed, err: unable to create the helper pod, err: Post "https://10.96.0.1:443/api/v1/namespaces/default/pods": read tcp 192.168.230.167:50174->10.96.0.1:443: read: connection reset by peer

time="2023-03-09T15:43:36Z" level=info msg="Experiment Name: node-cpu-hog"
time="2023-03-09T15:43:36Z" level=info msg="[PreReq]: Getting the ENV for the node-cpu-hog experiment"
time="2023-03-09T15:43:38Z" level=info msg="[PreReq]: Updating the chaos result of node-cpu-hog experiment (SOT)"
time="2023-03-09T15:43:42Z" level=info msg="The application information is as follows" Node Label= Chaos Duration=60 Target Nodes="node-10-120-127-170,node-10-120-127-171,node-10-120-127-172,node-10-120-127-173" Node CPU Cores=1
time="2023-03-09T15:43:42Z" level=info msg="[Status]: Verify that the AUT (Application Under Test) is running (pre-chaos)"
time="2023-03-09T15:43:42Z" level=info msg="[Status]: No appLabels provided, skipping the application status checks"
time="2023-03-09T15:43:42Z" level=info msg="[Status]: Getting the status of target nodes"
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Ready=true Node=node-10-120-127-170
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Node=node-10-120-127-171 Ready=true
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Node=node-10-120-127-172 Ready=true
time="2023-03-09T15:43:42Z" level=info msg="The Node status are as follows" Ready=true Node=node-10-120-127-173
time="2023-03-09T15:43:44Z" level=info msg="[Info]: The chaos tunables are:" Sequence=parallel Node CPU Cores=1 CPU Load=0 Node Affce Perc=0
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Nodes under chaos injection" No. Of Nodes=4 Node Names="[node-10-120-127-170 node-10-120-127-171 node-10-120-127-172 node-10-120-127-173]"
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Node under chaos injection" NodeName=node-10-120-127-170 NodeCPUcores=1
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Node under chaos injection" NodeName=node-10-120-127-171 NodeCPUcores=1
time="2023-03-09T15:43:44Z" level=info msg="[Info]: Details of Node under chaos injection" NodeName=node-10-120-127-172 NodeCPUcores=1
time="2023-03-09T15:43:45Z" level=error msg="[Error]: CPU hog failed, err: unable to create the helper pod, err: Post \"https://10.96.0.1:443/api/v1/namespaces/default/pods\": read tcp 192.168.230.167:50174->10.96.0.1:443: read: connection reset by peer"

This causes the experiment to fail in the end:

kubectl describe chaosresults.litmuschaos.io nginx-chaos-node-cpu-hog
Name:         nginx-chaos-node-cpu-hog
Namespace:    default
Labels:       app.kubernetes.io/component=experiment-job
              app.kubernetes.io/part-of=litmus
              app.kubernetes.io/version=2.14.0
              chaosUID=9c104680-26c3-49a6-801c-2ee3f9f96505
              controller-uid=f20544d9-b90a-4f08-9438-fbfbdf3c74e5
              job-name=node-cpu-hog-i0wu3z
              name=node-cpu-hog
Annotations:  <none>
API Version:  litmuschaos.io/v1alpha1
Kind:         ChaosResult
Metadata:
  Creation Timestamp:  2023-03-08T16:17:17Z
  Generation:          4
  Managed Fields:
    API Version:  litmuschaos.io/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:app.kubernetes.io/component:
          f:app.kubernetes.io/part-of:
          f:app.kubernetes.io/version:
          f:chaosUID:
          f:controller-uid:
          f:job-name:
          f:name:
      f:spec:
        .:
        f:engine:
        f:experiment:
      f:status:
        .:
        f:experimentStatus:
        f:history:
    Manager:         experiments
    Operation:       Update
    Time:            2023-03-08T16:17:17Z
  Resource Version:  5704800
  UID:               d53abe6b-e176-4769-9b72-4af35cd7d2ee
Spec:
  Engine:      nginx-chaos
  Experiment:  node-cpu-hog
Status:
  Experiment Status:
    Fail Step:                 [chaos]: Failed inside the chaoslib, err: unable to create the helper pod, err: Post "https://10.96.0.1:443/api/v1/namespaces/default/pods": read tcp 192.168.230.167:50174->10.96.0.1:443: read: connection reset by peer
    Phase:                     Completed
    Probe Success Percentage:  0
    Verdict:                   Fail
  History:
    Failed Runs:   1
    Passed Runs:   1
    Stopped Runs:  0
Events:
  Type     Reason   Age    From                       Message
  ----     ------   ----   ----                       -------
  Normal   Awaited  3m26s  node-cpu-hog-i0wu3z-h7q5j  experiment: node-cpu-hog, Result: Awaited
  Warning  Fail     3m19s  node-cpu-hog-i0wu3z-h7q5j  experiment: node-cpu-hog, Result: Fail
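
A minimal sketch for checking which target nodes never reached the injection step, assuming the experiment pod log has been saved to a file. The function name `missing_nodes` and the log file name are hypothetical; the script only compares the TARGET_NODES value against the "Details of Node under chaos injection" lines shown in the log above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Litmus): print every target node that has
# no "Details of Node under chaos injection" line in the experiment pod log.
#   $1 = comma-separated TARGET_NODES value
#   $2 = path to a file containing the experiment pod log
missing_nodes() {
  local targets="$1" logfile="$2" node
  local IFS=','   # split the target list on commas
  for node in $targets; do
    # Fixed-string match; the trailing space keeps one node name from
    # matching another node whose name merely starts with it.
    if ! grep -F "Details of Node under chaos injection" "$logfile" \
        | grep -qF "NodeName=${node} "; then
      echo "$node"
    fi
  done
}
```

For example, after saving the log with `kubectl logs node-cpu-hog-i0wu3z-h7q5j -n default > experiment.log`, calling `missing_nodes "$TARGET_NODES" experiment.log` with the TARGET_NODES value from the engine file lists the nodes without an injection line. For the log shown above, that should be node-10-120-127-173.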

What you expected to happen:
I expect all the helper pods to come up and reach the Running state, and the experiment to succeed.

How to reproduce it (as minimally and precisely as possible):

  1. Install the Litmus Operator

kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.8.yaml

  2. Install the experiment engine

kubectl apply -f https://github.com/litmuschaos/chaos-charts/raw/v2.14.x/experiments/generic/node-cpu-hog/experiment.yaml

  3. Install the RBAC YAML file

kubectl apply -f https://github.com/litmuschaos/chaos-charts/raw/v2.14.x/experiments/generic/node-cpu-hog/rbac.yaml

  4. Apply the node-cpu-hog-engine.yaml file below

kubectl apply -f node-cpu-hog-engine.yaml

Anything else we need to know?:

Environment:

kubectl get nodes
NAME                  STATUS   ROLES       AGE   VERSION
node-10-120-127-170   Ready    edge,node   8d    v1.22.17
node-10-120-127-171   Ready    edge,node   8d    v1.22.17
node-10-120-127-172   Ready    node        8d    v1.22.17
node-10-120-127-173   Ready    node        8d    v1.22.17

node-cpu-hog-engine YAML File:

cat node-cpu-hog-engine.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  # It can be active/stop
  engineState: 'active'
  #ex. values: ns1:name=percona,ns2:run=nginx
  auxiliaryAppInfo: ''
  chaosServiceAccount: node-cpu-hog-sa
  experiments:
    - name: node-cpu-hog
      spec:
        components:
          env:
            # set chaos duration (in sec) as desired
            - name: TOTAL_CHAOS_DURATION
              value: '60'

            ## ENTER THE NUMBER OF CORES OF CPU FOR CPU HOGGING
            ## OPTIONAL VALUE IN CASE OF EMPTY VALUE IT WILL TAKE NODE CPU CAPACITY
            - name: NODE_CPU_CORE
              value: '1'

            ## LOAD CPU WITH GIVEN PERCENT LOADING FOR THE CPU STRESS WORKERS.
            ## 0 IS EFFECTIVELY A SLEEP (NO LOAD) AND 100 IS FULL LOADING
            - name: CPU_LOAD
              value: '0'

            ## percentage of total nodes to target
            - name: NODES_AFFECTED_PERC
              value: ''

            # provide the comma separated target node names
            - name: TARGET_NODES
              value: 'node-10-120-127-170,node-10-120-127-171,node-10-120-127-172,node-10-120-127-173'
