Description
Chaos CRD Design document
1、Background
We need to introduce an automated chaos experiment flow into ShardingSphere (ss) to improve its resilience and failure recovery ability.
2、Problem description
Chaos experiments should be automated so that setting up the experimental environment, injecting faults, and verifying results do not have to be repeated by hand for every experiment.
2.1 Question 1: How to inject
How can specific failure scenarios be injected into ss?
2.2 Question 2: How to generate pressure
How can a large number of specified requests be sent to ss-proxy during a failure to simulate a real production environment?
2.3 Question 3: How to verify the Result
During the experiment, how do we collect the relevant information and define the steady state, so that we can prove whether the system remains in a steady state?
3、Technical research
Chaos Mesh and Litmus provide many kinds of chaos experiments, covering most usage scenarios. However, they only provide the ability to inject faults; preparing the experimental environment and verifying the influence of faults on the steady state still have to be repeated in every experiment. Therefore, we need to define our own CRD to implement an automated experiment flow for ss-proxy, and use kubebuilder to generate the CRD skeleton code.
Technology | Address |
---|---|
Chaos Mesh API definition | https://github.com/chaos-mesh/chaos-mesh |
kubebuilder | https://github.com/kubernetes-sigs/kubebuilder |
Litmus chaos | https://litmuschaos.io/ |
4、Scheme design
4.1 Solution overview
Injection:
To solve the problem of injecting faults into ss, the common solutions are PingCAP's open-source Chaos Mesh and Litmus Chaos, which provide a variety of common fault types. However, for building an automated ss chaos scenario flow, they cannot be used directly, because their configuration is complex and independent of ours.
Chaos Mesh provides APIs corresponding to all of its CRD resource definitions, which makes it possible to simplify the operation. We can abstract our own chaos scenarios and interact with Chaos Mesh to obtain experiment information. For the implementation of this interaction, we can refer to Chaos Mesh's official Chaos Dashboard.
Generating pressure:
For configuring the environment and generating pressure, we can use DistSQL to send requests to ss-proxy, inject data into the environment, and use that data as evidence when verifying the steady state.
Verification:
For steady-state verification, we can scrape the monitoring logs to observe whether CPU and network IO fluctuate within the steady-state range, and use DistSQL to verify the correctness of the requests made during the pressure phase.
4.2 Overall design
The chaos experiment for ss-proxy has the following parts:
- Use DistSQL to specify the configuration of the proxy-environment and create the specified experimental environment.
- Establish a steady-state hypothesis and declare a specific fault; ssChaos converts the fault into a Chaos Mesh fault, and Chaos Mesh injects it into the environment.
- During the experiment, ssChaos puts the declared traffic-generation fields (`.spec.accountReq`) into Jobs, and the Jobs send traffic requests to the experimental environment.
- After fault injection and the start of the experiment, ssChaos collects the data and metrics from the experimental environment as the criterion for judging whether the final result is in the steady state.
The components involved in the process are as follows:
Component | Description |
---|---|
ComputeNode | ss-proxy; the object that upstream services interact with, which in turn interacts with the downstream databases |
StorageNode | The databases that ss connects to; the nodes that actually store the data |
Governance node | Stores the status and configuration information used by the ComputeNode, such as logical databases and logical tables |
DistSQL | Apache ShardingSphere's own operating language; it is used in exactly the same way as standard SQL and provides SQL-level operational capabilities for incremental functionality |
proxy-environment | A fully functional ss-proxy environment |
Chaos APIs | Provide the different kinds of chaos experiments and are responsible for the actual injection and execution of faults |
ssChaos Controller | Responsible for managing the created ssChaos resources |
4.3 Function design
Functionally, it is divided into three parts: fault injection, pressure generation, and verification. Users access these functions by defining a CR declaration file.
4.3.1 Feature list
- Injecting chaos

  Convert the fault declared by the user into the corresponding Chaos Mesh fault type and inject it into the specified experimental environment.
- Generating pressure

  Inject traffic into the experimental environment.
- Verification

  Collect the CPU, network IO, and other important metrics and the program output of the experimental target, and compare them with the steady-state condition; also verify the correctness of the traffic sent during the pressure-generation stage.
4.4 CRD design
4.4.1 Spec
- Generating pressure

  Specifies the tools to be used and the configuration of the pressure requests.

  - jobSpec
    - `.injectJob.Experimental` (string): a pressure request that defines the steady state; it is executed before and after chaos injection.
    - `.injectJob.Pressure` (string): the data update request executed during chaos injection.
  - pressureCfg
    - `reqNum`: number of requests per `reqTime`
    - `concurrentNum`: number of concurrent requests per `reqTime`
    - `duration`: total duration of the requests
    - `zkHost`: ZooKeeper connection address
    - `ssHost`: ShardingSphere connection address
    - `script` (optional): custom command script passed in by the user
    - `distSQLs` (`[]distSQL`): the DistSQL statements to execute during the pressure phase
  - distSQL
    - `sql`: the SQL to execute; use `?` to represent an argument
    - `args` (`[]string`): the arguments substituted into `sql`
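For illustration, a minimal sketch of these fields in a CR might look as follows. Hosts, credentials, and statements are placeholders, and the field casing follows the example in section 4.5:

```yaml
# Sketch only: field names follow the definitions above; hosts, credentials,
# and SQL statements are placeholders, not real values.
spec:
  injectJob:
    experimental: "SELECT COUNT(*) FROM t_order;"   # steady-state request, run before and after injection
    pressure: "UPDATE t_order SET status = ? WHERE order_id = ?;"  # run during injection
  pressureCfg:
    zkHost: 127.0.0.1:2181
    ssHost: root:root@tcp(127.0.0.1:3307)/sharding_db
    duration: 1m
    reqTime: 5s
    reqNum: 10
    concurrentNum: 2
    distSQLs:
      - sql: SELECT * FROM t_order WHERE order_id = ?;
        args: [ "1" ]
```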
- Injecting faults

  `.spec.chaosKind`: specifies the type of fault to inject.

  The common fault fields are configured in the spec. When using a fault provided by a specific platform, the platform type must be written in the annotations, and any fields of that platform's fault spec not covered here are also written in the annotations.
- Common configuration fields
  - Selector

    The fault target selector.

Field | Description |
---|---|
namespaces | Specify the namespaces |
labelSelectors | Specify the selection labels |
annotationSelectors | Specify the selection annotations |
nodes | Specify the nodes |
pods | Specified as namespace-to-pod-name pairs |
nodeSelectors | Select nodes by label |
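For example, a selector matching the ss-proxy pods in a single namespace might look like this (a sketch; the namespace and label values are placeholders):

```yaml
selector:
  namespaces: [ "ss-chaos" ]               # restrict targets to one namespace
  labelSelectors:
    app.kubernetes.io/component: ss-proxy  # match pods by label
```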
- PodChaosSpec

  Declared under `.spec.podChaos`. Defines pod-type faults; the `action` field declares the type of fault injected into the pod.

Field | Description |
---|---|
action | Specifies the pod fault type; one of `podFailure`, `containerKill` |
podFailure.duration | Specifies the effective time of the PodFailure action |
containerKill.containerNames | Specifies the containers to be killed |
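A minimal sketch of a pod-type fault declaration, mirroring the `params` layout of the example in section 4.5; the selector values and container name are placeholders:

```yaml
podChaos:
  selector:
    namespaces: [ "ss-chaos" ]
    labelSelectors:
      app.kubernetes.io/component: ss-proxy
  action: containerKill
  params:
    containerKill:
      containerNames: [ "proxy" ]   # containers to kill
```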
- NetworkChaosSpec

  Declared under `.spec.networkChaos`. Defines network-type faults.

Field | Description |
---|---|
action | Defines the network chaos type; one of `delay`, `duplicate`, `corrupt`, `partition`, `loss` |
duration | Specifies the duration of the chaos |
direction | Specifies the direction of the network fault; defaults to `to` when not specified. One of `to` (-> target), `from` (target <-), `both` (<-> target) |
target | A selector used to select the target object |
source | A selector used to select the source object |
delay.latency, delay.correlation, delay.jitter | `latency`: the network latency; `correlation`: the correlation between the current latency and the previous one; `jitter`: the range of the network latency |
loss.loss, loss.correlation | `loss`: the probability of packet loss; `correlation`: the correlation between the current packet-loss probability and the previous one |
duplicate.duplicate, duplicate.correlation | `duplicate`: the probability of packet duplication; `correlation`: the correlation between the current duplication probability and the previous one |
corrupt.corrupt, corrupt.correlation | `corrupt`: the probability of packet corruption; `correlation`: the correlation between the current corruption probability and the previous one |
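A minimal sketch of a network-delay fault; the `params` nesting is assumed by analogy with `podChaos`, and all values are placeholders:

```yaml
networkChaos:
  selector:
    namespaces: [ "ss-chaos" ]
    labelSelectors:
      app.kubernetes.io/component: ss-proxy
  action: delay
  duration: 30s
  direction: to
  params:
    delay:
      latency: 90ms       # added network latency
      jitter: 10ms        # latency fluctuation range
      correlation: "25"   # correlation with the previous latency
```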
- Platform-specific configuration

  These fields are declared in annotations or env (an annotation example is sketched after the platform lists below).

  - Chaos Mesh

    Configuration fields of podChaos:
    - `spec/mode` <-----> `selector.mode`
    - `spec/value` <-----> `selector.value`
    - `spec/pod/action` <-----> specifies `.action`
    - `spec/pod/gracePeriod` <-----> specifies `.gracePeriod`

    Configuration fields of networkChaos:
    - `spec/device` <-----> `.device`
    - `spec/targetDevice` <-----> `.targetDevice`
    - `spec/target/mode` <-----> `.selector.mode`
    - `spec/target/value` <-----> `.value`
    - `spec/network/action` <-----> specifies `.action`
    - `spec/network/rate` <-----> `.bandwidth.rate`
    - `spec/network/limit` <-----> `.bandwidth.limit`
    - `spec/network/buffer` <-----> `.bandwidth.buffer`
    - `spec/network/peakrate` <-----> `.bandwidth.peakrate`
    - `spec/network/minburst` <-----> `.bandwidth.minburst`
  - Litmus Chaos

    Configuration fields of podChaos:
    - pod-delete: `spec/random` <-----> `RANDOMNESS`
    - container-kill: `spec/signal` <-----> `SIGNAL`, `spec/chaos_interval` <-----> `CHAOS_INTERVAL`

    Configuration fields of networkChaos:
    - (none defined yet)

    Common fields:
    - `spec/action` <-----> `.spec.experiments.name`
    - `spec/ramp_time` <-----> `RAMP_TIME`
    - `spec/duration` <-----> `TOTAL_CHAOS_DURATION`
    - `spec/sequence` <-----> `SEQUENCE`
    - `spec/lib_image` <-----> `LIB_IMAGE`
    - `spec/lib` <-----> `LIB`
    - `spec/force` <-----> `FORCE`
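For illustration, the platform-specific fields ride along in the CR's annotations. The `selector.chaos-mesh.org/mode` key appears in the example in section 4.5; any key beyond that is an assumption:

```yaml
metadata:
  annotations:
    # platform-specific fields carried in annotations (keys are illustrative)
    selector.chaos-mesh.org/mode: all     # maps to selector.mode in Chaos Mesh
    selector.chaos-mesh.org/value: "1"    # maps to selector.value (assumed key)
```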
- Verification

  Collect logs and metrics around the point in time when the fault is injected, and compare the metrics collected in the steady state with those collected during the fault to determine whether the test passes.

  Verification is implemented as a controlled experiment, divided into a steady-state experimental group and a fault experimental group. Ideally, the only variable between the two groups is whether a fault is present in the experimental environment. Whether the results meet expectations is judged by the steady-state fluctuation range we set and by the execution results of the pressure jobs.

As shown in the figure above, the specific process is as follows:
Steady state:
- Create a pressure job.
- Collect the metrics of interest from the metrics log, record them, and keep them for comparison with the fault metrics.

Fault:
- Create a chaos fault.
- Collect the metrics logs, compare them with the steady state, and record the results in the status.

One job is executed during the steady state and another during the fault. After the chaos recovers, verify the execution result of the pressure job that ran during the fault and record it in the status.
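The steady-state comparison itself can be as simple as checking that a fault-phase metric stays within the allowed fluctuation around the steady-state baseline. A minimal Go sketch with hypothetical names:

```go
package verify

import "math"

// withinSteadyState reports whether an observed fault-phase metric stays
// within the allowed relative fluctuation around the steady-state baseline.
// tolerance is the steady-state fluctuation range set by the user,
// e.g. 0.1 for +/-10%.
func withinSteadyState(baseline, observed, tolerance float64) bool {
	if baseline == 0 {
		return math.Abs(observed) <= tolerance
	}
	return math.Abs(observed-baseline) <= math.Abs(baseline)*tolerance
}
```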
4.4.2 Status
- ChaosCondition

  This field records the progress of chaos injection, which has the following five states:
Condition | Description |
---|---|
Creating | Chaos is in the creation stage; the injection has not yet completed |
AllRecovered | The environment has recovered from the fault |
Paused | The experiment is paused, possibly because a selected node does not exist; check whether the CRD definition is correct |
AllInjected | The fault has been successfully injected into the environment |
Unknown | Unknown status |
4.4.3 Controller design
- Overall logic of the controller
  - Convert the ssChaos into the corresponding chaos-mesh fault type and create it.
- Status

```go
// fields recorded in .status of the ssChaos resource
ChaosCondition ChaosCondition `json:"chaosCondition"`
Phase          Phase          `json:"phase"`
Result         []Result       `json:"result"`
```
- Update `.status.chaosCondition`

  Chaos Mesh reports the progress of the current experiment by updating four condition types on its resources; these conditions are used as the basis for changing `.status.chaosCondition`.

  The change logic is as follows:
  - Only after all the faults we currently care about in chaos-mesh have entered the AllInjected phase do we change our state from Creating to AllInjected.
  - When a fault is Paused, we check whether the pods and containers we selected are running properly.
  - When all faults are Recovered, we update our status to AllRecovered.
  - As mentioned in the chaos-mesh documentation, these conditions also serve as the evaluation basis for status updates.
- Pressure and verification at the different stages of `.status.chaosCondition`

  Data collection for the steady-state requests is performed before the fault is injected, and after injection (in the AllInjected state) the specified requests are sent to the environment to collect data again.
- Update phase

  - BeforeReq -> AfterReq: BeforeReq is the initial stage, in which the experiment job is created and pressure requests are injected into the environment; logs, metrics, and steady-state data are collected in this stage.
  - AfterReq -> Injected: entered after the log collection and the job in the previous stage have completed successfully; fault injection is carried out in this stage.
  - Injected -> Recovered: when the chaosCondition is AllInjected and the phase is AfterReq, the phase enters Injected; the pressure job and experiment job are executed, logs and metrics are collected and compared with the steady state, and the comparison result is written back into the result.
  - Recovered: when the chaosCondition is AllRecovered and the phase is Injected, the phase enters this stage, meaning the environment has recovered from the fault; verify the job execution and fetch the job's pod log to check whether the pressure job succeeded, then write the outcome back into the result.
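A minimal Go sketch of this phase state machine; the type, constant, and function names are illustrative rather than the actual implementation:

```go
package controller

// Phase and condition values assumed from the design above.
type Phase string

const (
	BeforeReq Phase = "BeforeReq"
	AfterReq  Phase = "AfterReq"
	Injected  Phase = "Injected"
	Recovered Phase = "Recovered"
)

type ChaosCondition string

const (
	Creating     ChaosCondition = "Creating"
	AllInjected  ChaosCondition = "AllInjected"
	AllRecovered ChaosCondition = "AllRecovered"
)

// nextPhase computes the next phase from the current phase and the observed
// chaos condition, following the transitions described in this section.
func nextPhase(cur Phase, cond ChaosCondition, steadyJobDone bool) Phase {
	switch cur {
	case BeforeReq:
		if steadyJobDone { // steady-state job and log collection finished
			return AfterReq
		}
	case AfterReq:
		if cond == AllInjected {
			return Injected
		}
	case Injected:
		if cond == AllRecovered {
			return Recovered
		}
	}
	return cur
}
```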
- Result

  Two experimental results are recorded:
  - Steady (`Msg`): the result of the steady-state phase
  - Chaos (`Msg`): the result of the chaos phase

  Msg fields:
  - `metrics` (TODO): the metrics of interest
  - `result` (string): the execution result of the pressure request
  - `duration` (string): the total execution time of the pressure request
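Put together, the status of a finished experiment might look like this (a sketch assuming the field names above):

```yaml
status:
  chaosCondition: AllRecovered
  phase: Recovered
  result:
    - steady:
        result: success     # execution result of the steady-phase pressure request
        duration: 1m2s      # total execution time of the pressure request
      chaos:
        result: success
        duration: 1m5s
```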
- Extending platforms

  When more chaos platform APIs need to be supported, the following interfaces must be implemented for the pod and network fault types.

  The get/set interfaces of chaos:
```go
type ChaosGetter interface {
	GetPodChaosByNamespacedName(context.Context, types.NamespacedName) (PodChaos, error)
	GetNetworkChaosByNamespacedName(context.Context, types.NamespacedName) (NetworkChaos, error)
}

type ChaosSetter interface {
}
```
  The new/update/create interfaces of chaos:
```go
type ChaosHandler interface {
	NewPodChaos(ssChaos *v1alpha1.ShardingSphereChaos) chaos.PodChaos
	NewNetworkPodChaos(ssChaos *v1alpha1.ShardingSphereChaos) chaos.NetworkChaos
	UpdateNetworkChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.NetworkChaos) error
	UpdatePodChaos(ctx context.Context, ssChaos *v1alpha1.ShardingSphereChaos, r client.Client, cur chaos.PodChaos) error
	CreatePodChaos(ctx context.Context, r client.Client, podChaos chaos.PodChaos) error
	CreateNetworkChaos(ctx context.Context, r client.Client, networkChaos chaos.NetworkChaos) error
}
```
4.5 Expected
4.5.1 Expected effect
Create a CR definition YAML file:
```yaml
apiVersion: shardingsphere.apache.org/v1alpha1
kind: ShardingSphereChaos
metadata:
  labels:
    app.kubernetes.io/name: shardingsphereChaos
  name: shardingspherechaos-lala
  namespace: verify-lit
  annotations:
    selector.chaos-mesh.org/mode: all
spec:
  podChaos:
    selector:
      labelSelectors:
        app.kubernetes.io/component: zookeeper
      namespaces: [ "verify-lit" ]
    action: PodFailure
    params:
      podFailure:
        duration: 10s
  pressureCfg:
    ssHost: root:14686Ban@tcp(127.0.0.1:3306)/ds_0
    duration: 10s
    reqTime: 5s
    distSQLs:
      - sql: select * from car;
    concurrentNum: 1
    reqNum: 2
```
After the file is applied, the chaos object is created successfully and its information can be viewed.
5、Demo
6、References
- Chaos Mesh 原理分析与控制面开发 (Chaos Mesh principle analysis and control-plane development)
- https://chaos-mesh.org
- Apache ShardingSphere: https://shardingsphere.apache.org