From 38b6fe4329d953bbf188a317e7f2f82fe0ffd3b9 Mon Sep 17 00:00:00 2001
From: bufferflies <1045931706@qq.com>
Date: Tue, 28 Dec 2021 19:56:07 +0800
Subject: [PATCH 1/6] improve the robust of scheduler

Signed-off-by: bufferflies <1045931706@qq.com>
---
 ...Improve-the-robust-of-balance-scheduler.md | 65 +++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 text/0083-Improve-the-robust-of-balance-scheduler.md

diff --git a/text/0083-Improve-the-robust-of-balance-scheduler.md b/text/0083-Improve-the-robust-of-balance-scheduler.md
new file mode 100644
index 00000000..b9ded003
--- /dev/null
+++ b/text/0083-Improve-the-robust-of-balance-scheduler.md
@@ -0,0 +1,65 @@
+# Improve the robust of balance scheduler
+
+- RFC PR: [https://github.com/tikv/rfcs/pull/83](https://github.com/tikv/rfcs/pull/83)
+- Tracking Issue: [https://github.com/tikv/pd/issues/4428](https://github.com/tikv/pd/issues/4428)
+
+## Summary
+
+Make scheduler more robust for dynamic region size.
+
+## Motivation
+
+We have observed many different situations when the region size is different. The major drawback coms from this aspects:
+
+1. Balance region scheduler pick source store in order of store's score, the second store will be picked after the first store has not met some filter or retry times exceed fixed value, this problem is also exist in target pick strategy.
+2. Operator has an import effect on region leader, and the leader is responsible in the operator life cycle.
+3. There are some factor that influence execution time of operator such as region size, IO limit, cpu load. PD needs to be more flexible to manage operator's life.
+4. PD should know some global config about TIKV like region-max-size, region report interval. This config should synchronize with PD.
+
+## Detailed design
+
+### store pick strategy
+
+It can arrange all the store based on label, like TiKV and TiFlash and allow low score group has more chance to scheduler. But the first score region should has highest priority to be selected.
+
+#### Consider Influence to leader
+
+Normally, one operator is made of region, source store and target store, the key works finished by region leader such as snapshot generate, snapshot send. It is not friendly to the leader if majority operator is add follow.
+
+It will add new store limit as new limit type to decrease leader loads of every store.
+
+### Operator control
+
+#### store limit cost
+
+Second, different size region occupy store limit should be different. Maybe can use this formula:
+
+![](https://latex.codecogs.com/gif.image?\dpi{200}&space;\bg_white&space;Influence=\sum_{i=0}^{j}step_{i}.Influence&space;\newline&space;Cost&space;=&space;200*ln{\frac{region_{size}}{100KiB}})
+
+Cost equals 200 if operator influence is 1Mb or equal 600 if operator influence is 1gb.
+
+#### operator life cycle
+
+The operator life cycle can divide into some stages: create, executing(started), complete. PD will check operator stage by region heart beats and cancel operator if one operator's running time exceed the fixed value(10m).
+
+It will be better if we can estimate every step's expected execution duration from major factors including region size, IO limit and operator concurrency, like this:
+
+![](https://latex.codecogs.com/gif.image?\dpi{200}&space;\bg_white&space;V=\frac{io_limit}{sending_{count}+receiving_{count}}=\frac{100Mb/s}{3+3}=16.7Mb/s\newline&space;T_{transfer}=\frac{10Gb}{16.7Mb/s}=598s\newline&space;T_{total}=T_{generator}+T_{transfer}+T_{apply})
+
+The snapshot generation duration can be ignored because it doesn't need to scan. The snapshot apply duration will finish at the minute level if it needs to load hot buckets.
+
+### sync global config
+
+There are some global configs that all components need to synchronize, like `region-max-size` and `io-limit`.
+Using the etcd API to implement global config may be a good idea, like [this](https://github.com/pingcap/tidb/pull/31010/files).
+
+## Drawbacks
+
+## Alternatives
+
+Removing peer may not influence the cluster performance, it can be replace by leader store limit.
+
+Canceling operator can depends on TiKV not by PD, but TiKV should notify PD after canceled one operator.
+
+## Questions
+
+## Unresolved questions

From e951aeb863a4acf5055d81c28b1b04e5259424c2 Mon Sep 17 00:00:00 2001
From: bufferflies <1045931706@qq.com>
Date: Wed, 29 Dec 2021 10:35:04 +0800
Subject: [PATCH 2/6] format

Signed-off-by: bufferflies <1045931706@qq.com>
---
 text/0083-Improve-the-robust-of-balance-scheduler.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/text/0083-Improve-the-robust-of-balance-scheduler.md b/text/0083-Improve-the-robust-of-balance-scheduler.md
index b9ded003..bbe7d8cc 100644
--- a/text/0083-Improve-the-robust-of-balance-scheduler.md
+++ b/text/0083-Improve-the-robust-of-balance-scheduler.md
@@ -18,7 +18,7 @@ We have observed many different situations when the region size is different. Th
 ## Detailed design
 
-### store pick strategy
+### Store pick strategy
 
 It can arrange all the store based on label, like TiKV and TiFlash and allow low score group has more chance to scheduler. But the first score region should has highest priority to be selected.
@@ -30,7 +30,7 @@ It will add new store limit as new limit type to decrease leader loads of every
 ### Operator control
 
-#### store limit cost
+#### Store limit cost
 
 Second, different size region occupy store limit should be different. Maybe can use this formula:
@@ -38,7 +38,7 @@ Second, different size region occupy store limit should be different. Maybe can
 Cost equals 200 if operator influence is 1Mb or equal 600 if operator influence is 1gb.
 
-#### operator life cycle
+#### Operator life cycle
 
 The operator life cycle can divide into some stages: create, executing(started), complete. PD will check operator stage by region heart beats and cancel operator if one operator's running time exceed the fixed value(10m).
@@ -48,7 +48,7 @@ The snapshot generator duration can ignore because it doesn't need to scan. The
 
-### sync global config
+### Sync global config
 
 There are some global configs that all components need to synchronize, like `region-max-size` and `io-limit`.
 Using the etcd API to implement global config may be a good idea, like [this](https://github.com/pingcap/tidb/pull/31010/files).

From 6da63b89146df22f7d84e704a8860f051942fb97 Mon Sep 17 00:00:00 2001
From: bufferflies <1045931706@qq.com>
Date: Wed, 29 Dec 2021 11:10:05 +0800
Subject: [PATCH 3/6] format

Signed-off-by: bufferflies <1045931706@qq.com>
---
 ...0083-Improve-the-robust-of-balance-scheduler.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/text/0083-Improve-the-robust-of-balance-scheduler.md b/text/0083-Improve-the-robust-of-balance-scheduler.md
index bbe7d8cc..59cee445 100644
--- a/text/0083-Improve-the-robust-of-balance-scheduler.md
+++ b/text/0083-Improve-the-robust-of-balance-scheduler.md
@@ -1,6 +1,6 @@
 # Improve the robust of balance scheduler
 
-- RFC PR: [https://github.com/tikv/rfcs/pull/83](https://github.com/tikv/rfcs/pull/83)
+- RFC PR: [https://github.com/tikv/rfcs/pull/85](https://github.com/tikv/rfcs/pull/85)
 - Tracking Issue: [https://github.com/tikv/pd/issues/4428](https://github.com/tikv/pd/issues/4428)
@@ -9,12 +9,12 @@ Make scheduler more robust for dynamic region size.
 ## Motivation
 
-We have observed many different situations when the region size is different. The major drawback coms from this aspects:
+We have observed many different situations when the region size is different. The major drawback comes from this aspects:
 
 1. Balance region scheduler pick source store in order of store's score, the second store will be picked after the first store has not met some filter or retry times exceed fixed value, this problem is also exist in target pick strategy.
-2. Operator has an import effect on region leader, and the leader is responsible in the operator life cycle.
-3. There are some factor that influence execution time of operator such as region size, IO limit, cpu load. PD needs to be more flexible to manage operator's life.
-4. PD should know some global config about TIKV like region-max-size, region report interval. This config should synchronize with PD.
+2. Operator has an import effect on region leader, and the leader is responsible in the operator life cycle. But the region leader will not be limited by any filter.
+3. There are some factor that influence execution time of operator such as region size, IO limit, cpu load. PD needs to be more flexible to manage operator's life not fixed config.
+4. PD should know some global configs about TiKV, like `region-max-size` and `region-report-interval`. These configs should be synchronized with PD.
@@ -32,11 +32,11 @@ It will add new store limit as new limit type to decrease leader loads of every
 #### Store limit cost
 
-Second, different size region occupy store limit should be different. Maybe can use this formula:
+Different size regions should occupy different numbers of tokens. Maybe we can use this formula:
 
 ![](https://latex.codecogs.com/gif.image?\dpi{200}&space;\bg_white&space;Influence=\sum_{i=0}^{j}step_{i}.Influence&space;\newline&space;Cost&space;=&space;200*ln{\frac{region_{size}}{100KiB}})
 
-Cost equals 200 if operator influence is 1Mb or equal 600 if operator influence is 1gb.
+Cost equals 200 if operator influence is 1Mb or equals 600 if operator influence is 1gb.

From 509e4802008eb77914198422eaea1c5e585d53dd Mon Sep 17 00:00:00 2001
From: bufferflies <1045931706@qq.com>
Date: Wed, 29 Dec 2021 15:30:11 +0800
Subject: [PATCH 4/6] gramma && rename title

Signed-off-by: bufferflies <1045931706@qq.com>
---
 ...Improve-the-robust-of-balance-scheduler.md | 28 +++++++++----------
 1 file changed, 13 insertions(+), 15 deletions(-)

diff --git a/text/0083-Improve-the-robust-of-balance-scheduler.md b/text/0083-Improve-the-robust-of-balance-scheduler.md
index 59cee445..9011ec66 100644
--- a/text/0083-Improve-the-robust-of-balance-scheduler.md
+++ b/text/0083-Improve-the-robust-of-balance-scheduler.md
@@ -1,46 +1,44 @@
-# Improve the robust of balance scheduler
+# Improve the robustness of the balance region scheduler
 
 - RFC PR: [https://github.com/tikv/rfcs/pull/85](https://github.com/tikv/rfcs/pull/85)
 - Tracking Issue: [https://github.com/tikv/pd/issues/4428](https://github.com/tikv/pd/issues/4428)
 
 ## Summary
 
-Make scheduler more robust for dynamic region size.
+Make schedulers more robust for dynamic region size.
 
 ## Motivation
 
-We have observed many different situations when the region size is different. The major drawback comes from this aspects:
+We have observed many different situations when the region size is different. The major drawback comes from these aspects:
 
-1. Balance region scheduler pick source store in order of store's score, the second store will be picked after the first store has not met some filter or retry times exceed fixed value, this problem is also exist in target pick strategy.
-2. Operator has an import effect on region leader, and the leader is responsible in the operator life cycle. But the region leader will not be limited by any filter.
-3. There are some factor that influence execution time of operator such as region size, IO limit, cpu load. PD needs to be more flexible to manage operator's life not fixed config.
-4. PD should know some global configs about TiKV, like `region-max-size` and `region-report-interval`. These configs should be synchronized with PD.
+1. Balance region scheduler picks the source store in order of store score; a lower store will be picked only after a higher store has not met some filter or its retry times exceed a fixed value. If the count of placement rules or TiKV stores is bigger, a low-score store, like TiFlash, has less chance to balance.
+2. Splitting RocksDB and sending snapshots by the region leader will occupy CPU and IO resources.
+3. There are some factors that influence the execution time of an operator, such as region size, IO limit, and cpu load. PD needs to be more flexible in managing an operator's timeout threshold rather than using a fixed value.
+4. PD should know some global configs about TiKV, like `region-max-size` and `region-report-interval`. These configs should be synchronized with PD.
 
 ## Detailed design
 
 ### Store pick strategy
 
-It can arrange all the store based on label, like TiKV and TiFlash and allow low score group has more chance to scheduler. But the first score region should has highest priority to be selected.
+It can arrange all the stores into groups based on label, like TiKV and TiFlash, and allow low-score groups more chances to schedule. But the candidate with the best score should still have the highest priority to be selected.
 
 #### Consider Influence to leader
 
-Normally, one operator is made of region, source store and target store, the key works finished by region leader such as snapshot generate, snapshot send. It is not friendly to the leader if majority operator is add follow.
-
-It will add new store limit as new limit type to decrease leader loads of every store.
+It will add a new store limit type to decrease the leader load of every store. Picking a region should check whether the leader token is available.
 
 ### Operator control
 
 #### Store limit cost
 
 Different size regions should occupy different numbers of tokens. Maybe we can use this formula:
 
 ![](https://latex.codecogs.com/gif.image?\dpi{200}&space;\bg_white&space;Influence=\sum_{i=0}^{j}step_{i}.Influence&space;\newline&space;Cost&space;=&space;200*ln{\frac{region_{size}}{100KiB}})
 
-Cost equals 200 if operator influence is 1Mb or equals 600 if operator influence is 1gb.
+Cost equals 200 if operator influence is 1MB or equals 600 if operator influence is 1GB.
 
 #### Operator life cycle
 
-The operator life cycle can divide into some stages: create, executing(started), complete. PD will check operator stage by region heart beats and cancel operator if one operator's running time exceed the fixed value(10m).
+The operator life cycle can be divided into some stages: create, executing (started), and complete. PD will check the operator stage by region heartbeat and cancel an operator if its running time exceeds the fixed value (10m).
 
 It will be better if we can estimate every step's expected execution duration from major factors including region size, IO limit and operator concurrency, like this:
@@ -56,9 +54,9 @@ There are some global config that all components need to synchronize like `regio
 ## Alternatives
 
-Removing peer may not influence the cluster performance, it can be replace by leader store limit.
+Removing a peer may not influence the cluster performance; it can be replaced by the leader store limit.
 
-Canceling operator can depends on TiKV not by PD, but TiKV should notify PD after canceled one operator.
+Canceling operators can be done by TiKV rather than PD, but TiKV should notify PD after canceling an operator.
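As an editor's aside on the store limit cost formula above, here is a minimal Go sketch (the function name and KiB units are assumptions, not PD's actual implementation). Note that with the natural logarithm as the formula is written, a 1 MiB region costs about 465 tokens and a 1 GiB region about 1852, so the RFC's quoted figures (200 for 1 MB, 600 for 1 GB) suggest a different log base may have been intended; treat the constants as open details.

```go
package main

import (
	"fmt"
	"math"
)

// storeLimitCost sketches the RFC's formula:
//   Cost = 200 * ln(regionSize / 100KiB)
// regionSizeKiB is the operator's total size influence, in KiB.
func storeLimitCost(regionSizeKiB float64) float64 {
	if regionSizeKiB <= 100 {
		return 0 // at or below the 100 KiB base, charge nothing
	}
	return 200 * math.Log(regionSizeKiB/100)
}

func main() {
	fmt.Printf("1 MiB region: cost ≈ %.0f\n", storeLimitCost(1024))
	fmt.Printf("1 GiB region: cost ≈ %.0f\n", storeLimitCost(1024*1024))
}
```

The shape is what matters: cost grows logarithmically with region size, so moving a large region consumes more store-limit tokens without completely starving small-region operators.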
 ## Questions
 
 ## Unresolved questions

From 7d1a4db2cb0ebc1ff1a7e2034698add1e922edb2 Mon Sep 17 00:00:00 2001
From: bufferflies <1045931706@qq.com>
Date: Wed, 29 Dec 2021 19:26:08 +0800
Subject: [PATCH 5/6] grama

Signed-off-by: bufferflies <1045931706@qq.com>
---
 text/0083-Improve-the-robust-of-balance-scheduler.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/text/0083-Improve-the-robust-of-balance-scheduler.md b/text/0083-Improve-the-robust-of-balance-scheduler.md
index 9011ec66..f019e834 100644
--- a/text/0083-Improve-the-robust-of-balance-scheduler.md
+++ b/text/0083-Improve-the-robust-of-balance-scheduler.md
@@ -13,7 +13,7 @@ We have observed many different situations when the region size is different. Th
 1. Balance region scheduler picks the source store in order of store score; a lower store will be picked only after a higher store has not met some filter or its retry times exceed a fixed value. If the count of placement rules or TiKV stores is bigger, a low-score store, like TiFlash, has less chance to balance.
 2. Splitting RocksDB and sending snapshots by the region leader will occupy CPU and IO resources.
-3. There are some factors that influence the execution time of an operator, such as region size, IO limit, and cpu load. PD needs to be more flexible in managing an operator's timeout threshold rather than using a fixed value.
+3. There are some factors that influence the execution time of an operator, such as region size, IO limit, and CPU load. PD needs to be more flexible in managing an operator's timeout threshold rather than using a fixed value.
 4. PD should know some global configs about TiKV, like `region-max-size` and `region-report-interval`. These configs should be synchronized with PD.
 ## Detailed design

From 47488d517a826a2dacc9e293faa5f5f461c37b2b Mon Sep 17 00:00:00 2001
From: bufferflies <1045931706@qq.com>
Date: Wed, 29 Dec 2021 19:26:08 +0800
Subject: [PATCH 6/6] grama && rename file

Signed-off-by: bufferflies <1045931706@qq.com>
---
 ...duler.md => 0085-Improve-the-robust-of-balance-scheduler.md} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 rename text/{0083-Improve-the-robust-of-balance-scheduler.md => 0085-Improve-the-robust-of-balance-scheduler.md} (98%)

diff --git a/text/0083-Improve-the-robust-of-balance-scheduler.md b/text/0085-Improve-the-robust-of-balance-scheduler.md
similarity index 98%
rename from text/0083-Improve-the-robust-of-balance-scheduler.md
rename to text/0085-Improve-the-robust-of-balance-scheduler.md
index 9011ec66..f019e834 100644
--- a/text/0083-Improve-the-robust-of-balance-scheduler.md
+++ b/text/0085-Improve-the-robust-of-balance-scheduler.md
@@ -13,7 +13,7 @@ We have observed many different situations when the region size is different. Th
 1. Balance region scheduler picks the source store in order of store score; a lower store will be picked only after a higher store has not met some filter or its retry times exceed a fixed value. If the count of placement rules or TiKV stores is bigger, a low-score store, like TiFlash, has less chance to balance.
 2. Splitting RocksDB and sending snapshots by the region leader will occupy CPU and IO resources.
-3. There are some factors that influence the execution time of an operator, such as region size, IO limit, and cpu load. PD needs to be more flexible in managing an operator's timeout threshold rather than using a fixed value.
+3. There are some factors that influence the execution time of an operator, such as region size, IO limit, and CPU load. PD needs to be more flexible in managing an operator's timeout threshold rather than using a fixed value.
 4. PD should know some global configs about TiKV, like `region-max-size` and `region-report-interval`. These configs should be synchronized with PD.
 
 ## Detailed design
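To make the store pick strategy concrete, here is a hedged Go sketch of the grouping idea: stores are grouped by label, each group is sorted by score, and the groups are interleaved round by round so that a small low-score group (e.g. TiFlash) is considered early, while the single best-scored candidate still comes first. All type and function names here are illustrative, not PD's actual API.

```go
package main

import (
	"fmt"
	"sort"
)

// Store is a simplified stand-in for PD's store metadata.
type Store struct {
	ID    uint64
	Label string  // engine label, e.g. "tikv" or "tiflash"
	Score float64 // balance score; higher means more need to schedule from it
}

// pickOrder groups stores by label, sorts each group by descending score, and
// interleaves the groups round by round. Each round is itself sorted by score,
// so the globally best candidate is still first, but a small low-score group
// surfaces after one round instead of waiting behind every store of a big group.
func pickOrder(stores []Store) []Store {
	groups := map[string][]Store{}
	var labels []string
	for _, s := range stores {
		if _, ok := groups[s.Label]; !ok {
			labels = append(labels, s.Label)
		}
		groups[s.Label] = append(groups[s.Label], s)
	}
	sort.Strings(labels)
	for _, l := range labels {
		g := groups[l]
		sort.Slice(g, func(i, j int) bool { return g[i].Score > g[j].Score })
	}
	var out []Store
	for i := 0; ; i++ {
		var round []Store
		for _, l := range labels {
			if i < len(groups[l]) {
				round = append(round, groups[l][i])
			}
		}
		if len(round) == 0 {
			return out
		}
		sort.Slice(round, func(a, b int) bool { return round[a].Score > round[b].Score })
		out = append(out, round...)
	}
}

func main() {
	stores := []Store{
		{1, "tikv", 90}, {2, "tikv", 80}, {3, "tikv", 70}, {4, "tiflash", 60},
	}
	for _, s := range pickOrder(stores) {
		fmt.Printf("store %d (%s, score %.0f)\n", s.ID, s.Label, s.Score)
	}
}
```

With the sample data above, the lone TiFlash store is visited second rather than last, which is the behavior the RFC asks for without giving up the "best score first" rule.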