Commit eef674b

update on basis
Signed-off-by: Siyuan Zhang <[email protected]>
1 parent e7303cd commit eef674b

1 file changed

+51
-7
lines changed
keps/sig-etcd/4647-cluster-feature-gate/README.md

@@ -17,6 +17,7 @@
1717
- [Data Compatibility Risks During Feature Value Change](#data-compatibility-risks-during-feature-value-change)
1818
- [Feature Implementation Change Risks](#feature-implementation-change-risks)
1919
- [Design Details](#design-details)
20+
- [Set the Basis](#set-the-basis)
2021
- [Register New Feature Gates](#register-new-feature-gates)
2122
- [Set the Feature Gates](#set-the-feature-gates)
2223
- [Consensus Algorithm](#consensus-algorithm)
@@ -165,6 +166,11 @@ if s.FeatureEnabled(FeatureA) && s.ClusterVersion() >= "3.7" {implementation 2}
165166
```
166167
This way the cluster would work consistently because there is a single ClusterVersion across the whole cluster.
167168

169+
Version-based switching may not be necessary for changes that do not affect user-facing APIs or data consistency.
170+
We would therefore make version-based switching optional and best-effort, at the discretion of the developers and reviewers.
171+
172+
However, we should make sure the feature is well tested for mixed-version scenarios in the robustness tests.
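
As an illustration of version-based switching, here is a minimal sketch that follows the `s.FeatureEnabled(FeatureA) && s.ClusterVersion() >= "3.7"` pattern above. The `Version` and `Server` types and the `pickImplementation` helper are hypothetical stand-ins, not etcd's actual types, and "implementation 1" is assumed to be the pre-3.7 behavior.

```go
package main

// Version is a simplified major.minor version (an assumption for this sketch;
// it is not the version type etcd uses).
type Version struct{ Major, Minor int }

// AtLeast reports whether v >= other.
func (v Version) AtLeast(other Version) bool {
	if v.Major != other.Major {
		return v.Major > other.Major
	}
	return v.Minor >= other.Minor
}

// Server is a hypothetical stand-in for the etcd server.
type Server struct {
	clusterVersion Version
	features       map[string]bool
}

// FeatureEnabled reports whether the named feature gate is on.
func (s *Server) FeatureEnabled(name string) bool { return s.features[name] }

// ClusterVersion returns the single version agreed on by the whole cluster.
func (s *Server) ClusterVersion() Version { return s.clusterVersion }

// pickImplementation chooses the code path for FeatureA. Because every member
// sees the same ClusterVersion, every member picks the same path.
func pickImplementation(s *Server) string {
	if !s.FeatureEnabled("FeatureA") {
		return "disabled"
	}
	if s.ClusterVersion().AtLeast(Version{3, 7}) {
		return "implementation 2"
	}
	return "implementation 1"
}
```

A member running the 3.7 binary under cluster version 3.6 would still pick implementation 1 in this sketch, which is what keeps mixed-version members consistent.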
173+
168174
## Design Details
169175

170176
At a high level, a cluster feature gate would need:
@@ -175,6 +181,51 @@ On high level, a cluster feature gate would need:
175181
1. [client APIs](#client-apis-changes) to query if a feature is enabled for the whole cluster.
176182
1. a way to [remove a feature gate](#feature-removal) when it is no longer useful or has graduated.
177183

184+
### Set the Basis
185+
186+
Before we proceed to the design details, we need to think through several questions.
187+
188+
1. Do the cluster features need to be tied to the cluster version?
189+
190+
Imagine the following scenario:
191+
* every server in the cluster is on 3.7 and has a new Alpha feature enabled, which would write a new field to the WAL.
192+
* downgrade is enabled, so cluster version is downgraded to 3.6.
193+
When the cluster version is downgraded to 3.6, the flags of each member are not changed. But we should still disable the Alpha feature, so that any new data written would be compatible after the downgrade.
194+
195+
So similar to how cluster version determines the capability, cluster features should also be tied to the cluster version.
196+
197+
In addition, a server running either the 3.N or the 3.N-1 binary can operate under cluster version 3.N-1 with the same cluster feature gate values.
198+
The feature implementation for Alpha and Beta features might change between different binary versions.
199+
See [Feature Implementation Change Risks](#feature-implementation-change-risks) about how we can mitigate the risks in this scenario.
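
The downgrade scenario in question 1 can be sketched as follows. This is only an illustration under stated assumptions: the `Feature.Since` field (the first cluster version that understands the feature's data), the `effectiveFeatures` helper, and the feature name `NewWALField` are hypothetical and not part of the proposal.

```go
package main

// Version is a simplified major.minor version (an assumption for this sketch).
type Version struct{ Major, Minor int }

// Less reports whether v < o.
func (v Version) Less(o Version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	return v.Minor < o.Minor
}

// Feature describes a cluster feature gate. Since is a hypothetical field:
// the first cluster version whose members understand the feature's data.
type Feature struct {
	Name  string
	Since Version
}

// effectiveFeatures combines the members' unchanged flags with the cluster
// version: a feature is on only if its flag is on AND the cluster version
// is at least the version that introduced it.
func effectiveFeatures(clusterVersion Version, flags map[string]bool, known []Feature) map[string]bool {
	out := make(map[string]bool, len(known))
	for _, f := range known {
		out[f.Name] = flags[f.Name] && !clusterVersion.Less(f.Since)
	}
	return out
}
```

With the flag for `NewWALField` left on, lowering the cluster version from 3.7 to 3.6 turns the feature off in this sketch, so no new WAL field is written after the downgrade.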
200+
201+
2. Do we need a leader to set the final values of features?
202+
203+
To decide the final values of features, we can do one of the following:
204+
205+
a. rely on the leader to send a raft request to set the cluster feature values. This is how most of the properties of the cluster are set.
206+
But in the past we have had some issues with a stale leader trying to compete with the current leader. This would add yet another decision for the leader to handle.
207+
208+
b. individual members decide the feature values by reconciling the proposed values locally after receiving them from each member.
209+
This approach has the benefit of skipping a raft step, but risks inconsistent behavior among mixed-version members if the reconciliation logic changes between versions.
210+
211+
Considering that setting the features for a cluster is most likely idempotent, the risk of stale-leader issues is much smaller than for other use cases of the leader, so we will pick the first approach.
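
To make the idempotency argument concrete, here is a rough sketch of the first approach. Everything in it is an assumption made for illustration: the `reconcile` rule (a feature is on only if every member proposes it on), the `ClusterState` type, and the `Apply` method are hypothetical and are not the design's actual reconciliation logic.

```go
package main

// reconcile is what a leader might run before sending the raft request.
// Assumed rule for this sketch: a feature is enabled only if every member
// proposes it as enabled.
func reconcile(proposals map[string]map[string]bool, features []string) map[string]bool {
	final := make(map[string]bool, len(features))
	for _, f := range features {
		enabled := len(proposals) > 0
		for _, p := range proposals {
			if !p[f] {
				enabled = false
				break
			}
		}
		final[f] = enabled
	}
	return final
}

// ClusterState is a hypothetical holder for the agreed feature values.
type ClusterState struct {
	features map[string]bool
}

// Apply stores the final values and reports whether anything changed.
// Applying the same request again is a no-op, so a duplicate request from
// a stale leader carrying the same values does not alter the state.
func (c *ClusterState) Apply(final map[string]bool) (changed bool) {
	if c.features == nil {
		c.features = make(map[string]bool)
	}
	for name, v := range final {
		if old, ok := c.features[name]; !ok || old != v {
			c.features[name] = v
			changed = true
		}
	}
	return changed
}
```

Because only the leader runs `reconcile` in this sketch, its logic could change between versions without members computing different results; members only run `Apply`.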
212+
213+
3. What happens before the cluster feature value setting raft request is sent?
214+
215+
When a new cluster starts, the cluster starts to accept requests once a leader is elected. At this point, the cluster version might be `nil` or set to the `MinClusterVersion = "3.0.0"`.
216+
The leader would not have decided the values of cluster feature gates yet. What should be the values for the cluster features during that time?
217+
* if we set the cluster feature gate to `nil` or tie it to the `MinClusterVersion`, every cluster feature would be off at that moment. This would apply to GA features as well.
218+
* but suppose `featureA` goes GA in version 3.7 and its gate is removed in 3.8. If we start a new cluster with 2 nodes at 3.7 and 1 node at 3.8,
219+
during the period when the cluster feature gate is `nil`, `featureA` would be disabled in the 3.7 members and enabled in the 3.8 member, because the feature gate code is already removed in 3.8 and the feature is permanently on.
220+
This would pose great challenges: either we could never clean up GA features, or we could not allow starting a cluster with mixed-version members. Neither is feasible.
221+
222+
* we also cannot set the cluster feature values according to the local server version, because a feature might then be enabled in some members and disabled in others in a mixed-version cluster.
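
The `featureA` divergence described above can be reproduced with a small sketch. The `featureAEnabled` helper and its parameters are hypothetical; it only encodes the assumptions of the example (the gate still exists in 3.7 and is removed, with the feature always on, in 3.8).

```go
package main

// featureAEnabled reports whether featureA is effectively on for a 3.x member
// with the given minor version, given the current cluster feature gate values.
// A nil map models the period before the leader has set the cluster features.
func featureAEnabled(binaryMinor int, clusterGate map[string]bool) bool {
	if binaryMinor >= 8 {
		// In 3.8 the gate code has been removed: the feature is always on.
		return true
	}
	// In 3.7 the gate still exists; reading a nil map yields false,
	// so the feature reads as off until the cluster values are set.
	return clusterGate["featureA"]
}
```

With a `nil` gate, `featureAEnabled(7, nil)` is false while `featureAEnabled(8, nil)` is true, which is the inconsistency that motivates restricting new clusters to members of the same major.minor version.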
223+
224+
225+
For simplicity of the design, we would only consider the following cluster configuration cases:
226+
* when a new cluster starts, all cluster members have the same major.minor version, and have the same feature configurations.
227+
* a cluster can be upgraded, downgraded, and updated in rolling sequence. For a limited time, the cluster can have mixed versions and mixed configurations. Eventually all cluster members will have the same major.minor version, and have the same feature configurations.
228+
178229
### Register New Feature Gates
179230

180231
A feature can be registered as server level feature or cluster level feature, but not both.
@@ -226,13 +277,6 @@ To guarantee consistent value of if a feature is enabled in the whole cluster, t
226277

227278
1. Each member applies the updates to their `ClusterParams`, and saves the results in the `cluster` bucket in the backend.
228279

229-
A few other alternatives we have evaluated:
230-
1. Is it better to initialize the `ClusterParams` with nil or cluster version defaults when we have not received the updates of `proposed_cluster_params` from all members?
231-
In either case, there would be a state change from the initial state to the final state. If we choose to use the version defaults, even though different members might have different default values of the cluster parameters, the change would still be smaller than if the initial state were nil, because defaults rarely change between patch versions, and most users would run with parameters close to the default values.
232-
233-
1. Should individual members decide the `ClusterParams` by reconciling `proposed_cluster_params` locally instead of relying on a leader to determine the final values and send a raft request to set the final `ClusterParams`?
234-
If we allow individual members to decide the final `ClusterParams`, the logic to reconcile the `proposed_cluster_params` of all members to a common cluster setting - `UpdateClusterParamsIfNeeded` - has to be the same across all patch versions. On the other hand, if we use a single leader to make the final decision, we would have the flexibility to change the implementation of `UpdateClusterParamsIfNeeded` in patch versions without risking split brains.
235-
236280
![cluster feature gate consensus algorithm](./cluster_feature_gate.png "cluster feature gate consensus algorithm")
237281

238282
#### New Raft Proto Changes
