06x-new-kafka-roller.md (19 additions & 20 deletions)
@@ -2,16 +2,14 @@
 
 ## Current situation
 
-The Kafka Roller is a Cluster Operator component that's responsible for coordinating the rolling restart or reconfiguration of Kafka pods when:
+The Kafka Roller is an internal Cluster Operator component that's responsible for coordinating the rolling restart or reconfiguration of Kafka pods when:
 - non-dynamic reconfigurations need to be applied
 - an update in the Kafka CRD is detected
 - a certificate is renewed
 - pods have been manually annotated by the user for controlled restarts
 - a pod is stuck and is out of date
 - a Kafka broker is unresponsive to Kafka Admin connections
 
-These are not the exhaustive list of possible triggers for rolling Kafka pods, but the main ones to highlight.
-
 A pod is considered stuck if it is in one of the following states:
 - `CrashLoopBackOff`
 - `ImagePullBackOff`
@@ -46,7 +44,7 @@ As you can see above, the current KafkaRoller still needs various changes and po
 
 ## Proposal
 
-The objective of this proposal is to introduce a new KafkaRoller with simplified logic having a structured design resembling a finite state machine. KafkaRoller desisions are informed by observations coming from different sources (e.g. Kubernetes API, KafkaAgent, Kafka Admin API). These sources will be abstracted so that KafkaRoller is not dependent on their specifics as long as it's getting the information it needs. The abstractions also enable much better unit testing.
+The objective of this proposal is to introduce a new KafkaRoller with simplified logic having a structured design resembling a finite state machine. KafkaRoller decisions are informed by observations coming from different sources (e.g. Kubernetes API, KafkaAgent, Kafka Admin API). These sources will be abstracted so that KafkaRoller is not dependent on their specifics as long as it's getting the information it needs. The abstractions also enable much better unit testing.
 
 Depending on the observed states, the roller will perform specific actions. Those actions should cause a subsequent observation to cause a state transition. This iterative process continues until each node's state aligns with the desired state.
 
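The observe/act loop and the source abstraction described in the hunk above could look roughly like the following. This is only a sketch under stated assumptions: `StateObserver`, `ScriptedObserver`, `RollLoop` and the `maxSteps` budget are hypothetical names that do not come from the proposal, and the enum lists only states that are visible in this diff.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Node states visible in this diff; the proposal's full state table may contain more. */
enum NodeState { NOT_RUNNING, RECOVERING, RESTARTED, SERVING, LEADING_ALL_PREFERRED }

/** Hypothetical abstraction over the Kubernetes API, KafkaAgent and Kafka Admin API. */
interface StateObserver {
    NodeState observe(int nodeId);
}

final class RollLoop {
    /**
     * Observes the node and performs the action until the desired state is
     * observed. Returns the number of actions performed, or throws once the
     * (assumed) step budget is exhausted.
     */
    static int driveTo(NodeState desired, int nodeId, StateObserver observer,
                       Runnable action, int maxSteps) {
        int actions = 0;
        while (observer.observe(nodeId) != desired) {
            if (actions >= maxSteps) {
                throw new IllegalStateException(
                        "node " + nodeId + " did not reach " + desired);
            }
            action.run();
            actions++;
        }
        return actions;
    }
}

/** Test double that replays a scripted, non-empty sequence of observations. */
final class ScriptedObserver implements StateObserver {
    private final Deque<NodeState> script;

    ScriptedObserver(List<NodeState> states) {
        this.script = new ArrayDeque<>(states);
    }

    @Override
    public NodeState observe(int nodeId) {
        // Once only one entry is left, keep returning it.
        return script.size() > 1 ? script.poll() : script.peek();
    }
}
```

Because the loop depends only on the `StateObserver` interface, a unit test can drive it with a scripted observer instead of a real cluster, which is the testing benefit the proposal points to.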
@@ -59,9 +57,9 @@ When a new reconciliation starts up, a context object is created for each node t
 - <i>currentNodeRole</i>: Currently assigned process roles for this node (e.g. controller, broker).
 - <i>state</i>: It contains the current state of the node based on information collected from the abstracted sources (Kubernetes API, KafkaAgent and Kafka Admin API). The table below describes the possible states.
 - <i>reason</i>: It is updated based on the current predicate logic from the Reconciler. For example, an update in the Kafka CR is detected.
-- <i>numRestarts</i>: The value is incremented each time the node has been attempted to restart.
-- <i>numReconfigs</i>: The value is incremented each time the node has been attempted to reconfigure.
-- <i>numAttempts</i>: The value is incremented each time the node cannot be restarted/reconfigured due to not meeting safety conditions (more on this later).
+- <i>numRestartAttempts</i>: The value is incremented each time a restart of the node is attempted.
+- <i>numReconfigAttempts</i>: The value is incremented each time a reconfiguration of the node is attempted.
+- <i>numRetries</i>: The value is incremented each time the node cannot be restarted/reconfigured due to not meeting safety conditions (more on this later).
 - <i>lastTransitionTime</i>: System.currentTimeMillis of last observed state transition.
 
 <b>States</b>
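As an illustration of the context fields listed in the hunk above, a per-node context might be modelled as below. This is a hedged sketch, not the proposal's implementation: the class name `NodeContext`, the `nodeId` field, the constructor shape and the `transitionTo` helper are assumptions, and the enum lists only the states that appear in this diff.

```java
import java.util.ArrayList;
import java.util.List;

/** States visible in this diff; the full state table is not shown here. */
enum NodeState { NOT_RUNNING, RECOVERING, RESTARTED, SERVING, LEADING_ALL_PREFERRED }

/** Hypothetical per-node context created at the start of a reconciliation. */
final class NodeContext {
    final int nodeId;                      // assumed identifier, not in the field list
    final List<String> currentNodeRole;    // e.g. "controller", "broker"
    NodeState state;
    final List<String> reason = new ArrayList<>();
    int numRestartAttempts;
    int numReconfigAttempts;
    int numRetries;
    long lastTransitionTime;               // epoch millis of the last state change

    NodeContext(int nodeId, List<String> roles, NodeState initialState, long nowMs) {
        this.nodeId = nodeId;
        this.currentNodeRole = List.copyOf(roles);
        this.state = initialState;
        this.lastTransitionTime = nowMs;
    }

    /** Records a transition only when the observed state actually changes. */
    void transitionTo(NodeState observed, long nowMs) {
        if (observed != state) {
            state = observed;
            lastTransitionTime = nowMs;
        }
    }
}
```

In production code `nowMs` would typically come from `System.currentTimeMillis()`; passing it in keeps the sketch deterministic for tests.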
@@ -76,16 +74,18 @@ When a new reconciliation starts up, a context object is created for each node t
 | SERVING | Node is in running state and ready to serve requests (broker state >= 3 AND != 127). |
 | LEADING_ALL_PREFERRED | Node is in running state and leading all preferred replicas. |
 
+The broker states are defined [here](https://github.com/apache/kafka/blob/58ddd693e69599b177d09c2e384f31e7f5e11171/metadata/src/main/java/org/apache/kafka/metadata/BrokerState.java#L46).
+
 ### Configurability
-The following can be the configured for the new KafkaRoller:
+The following configuration options are available for the new KafkaRoller:
 
 | Configuration | Default value | Exposed to user | Description |
-|maxRestarts| 3 | No | The maximum number of times a node can be restarted before failing the reconciliation. This is checked against the node's `numRestarts`.|
-|maxReconfigs| 3 | No | The maximum number of times a node can be reconfigured before restarting it. This is checked against the node's `numReconfigs`.|
-|maxAttempts| 10 | No | The maximum number of times a node can retried after not meeting the safety conditions. This is checked against the node's `numAttempts`.|
+| maxRestartAttempts | 3 | No | The maximum number of times a node can be restarted before failing the reconciliation. This is checked against the node's `numRestartAttempts`. |
+| maxReconfigAttempts | 3 | No | The maximum number of times a node can be reconfigured before restarting it. This is checked against the node's `numReconfigAttempts`. |
+| maxRetries | 10 | No | The maximum number of times a node can be retried after not meeting the safety conditions. This is checked against the node's `numRetries`. |
 | postOperationTimeoutMs | 60 seconds | Yes | The maximum amount of time we will wait for nodes to transition to `SERVING` state after an operation in each retry. This will be based on the operation timeout that is already exposed to the user via environment variable `STRIMZI_OPERATION_TIMEOUT_MS`. |
-|maxBatchSize| 1 | Yes | The maximum number of broker nodes that can be restarted in parallel. |
+| maxRestartParallelism | 1 | Yes | The maximum number of broker nodes that can be restarted in parallel. |
 
 
 ### Algorithm
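The `SERVING` condition and the configuration defaults from the hunk above can be captured in a few lines. The following is a minimal sketch, assuming a plain holder class: `RollerConfig`, `BrokerStates` and `defaults()` are hypothetical names, and only the option names, the default values and the `>= 3 AND != 127` rule come from the tables.

```java
/** Hypothetical holder for the options in the configuration table. */
final class RollerConfig {
    final int maxRestartAttempts;
    final int maxReconfigAttempts;
    final int maxRetries;
    final long postOperationTimeoutMs;
    final int maxRestartParallelism;

    RollerConfig(int maxRestartAttempts, int maxReconfigAttempts, int maxRetries,
                 long postOperationTimeoutMs, int maxRestartParallelism) {
        this.maxRestartAttempts = maxRestartAttempts;
        this.maxReconfigAttempts = maxReconfigAttempts;
        this.maxRetries = maxRetries;
        this.postOperationTimeoutMs = postOperationTimeoutMs;
        this.maxRestartParallelism = maxRestartParallelism;
    }

    /** Defaults as listed in the table: 3, 3, 10, 60 seconds, 1. */
    static RollerConfig defaults() {
        return new RollerConfig(3, 3, 10, 60_000L, 1);
    }
}

final class BrokerStates {
    /**
     * SERVING rule from the state table: broker state >= 3 and != 127.
     * In Kafka's BrokerState, 3 is RUNNING and 127 is UNKNOWN.
     */
    static boolean isServing(int brokerState) {
        return brokerState >= 3 && brokerState != 127;
    }
}
```

For example, `BrokerStates.isServing(2)` is false because a broker in state 2 is still recovering its logs, while `isServing(3)` is true.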
@@ -110,12 +110,12 @@ Context: {
 - `WAIT_FOR_LOG_RECOVERY` - Nodes that have `RECOVERING` state.
 - `RESTART` - Nodes that have a non-empty list of reasons from the predicate function and have not been restarted yet (Context.numRestartAttempts == 0).
 - `MAYBE_RECONFIGURE` - Broker nodes (including combined nodes) that have an empty list of reasons and have not been reconfigured yet (Context.numReconfigAttempts == 0).
-- `NOP` - Nodes that have been restarted or reconfigured at least once (Context.numRestarts > 0 || Context.numReconfigs > 0 ) and have either
+- `NOP` - Nodes that have at least one restart or reconfiguration attempt (Context.numRestartAttempts > 0 || Context.numReconfigAttempts > 0) and have either
 `LEADING_ALL_PREFERRED` or `SERVING` state.
 
 4. Wait for nodes in `WAIT_FOR_LOG_RECOVERY` group to finish performing log recovery.
 - Wait for each node to have `SERVING` within the `postOperationTimeoutMs`.
-- If the timeout is reached for a node and its `numAttempts` is greater than or equal to `maxAttempts`, throw `UnrestartableNodesException` with the log recovery progress (number of remaining logs and segments). Otherwise increment node's `numAttempts` and restart from step 3.
+- If the timeout is reached for a node and its `numRetries` is greater than or equal to `maxRetries`, throw `UnrestartableNodesException` with the log recovery progress (number of remaining logs and segments). Otherwise increment the node's `numRetries` and restart from step 3.
 
 5. Restart nodes in `RESTART_FIRST` category:
 - if one or more nodes have `NOT_RUNNING` state, we first need to check two special conditions:
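The four groups that are visible in the hunk above can be expressed as a small classification function. This is a sketch under stated assumptions: `NodeView`, `Grouping`, `Group.OTHER` and the order in which the rules are checked are hypothetical, and the `RESTART_FIRST` rule is left out because its definition is not part of this diff.

```java
import java.util.List;

/** Groups visible in the hunk above, plus OTHER for nodes this sketch does not cover. */
enum Group { WAIT_FOR_LOG_RECOVERY, RESTART, MAYBE_RECONFIGURE, NOP, OTHER }

/** Hypothetical, trimmed-down view of a node's context. */
final class NodeView {
    final String state;
    final boolean broker;              // true for broker-only and combined nodes
    final List<String> reasons;
    final int numRestartAttempts;
    final int numReconfigAttempts;

    NodeView(String state, boolean broker, List<String> reasons,
             int numRestartAttempts, int numReconfigAttempts) {
        this.state = state;
        this.broker = broker;
        this.reasons = reasons;
        this.numRestartAttempts = numRestartAttempts;
        this.numReconfigAttempts = numReconfigAttempts;
    }
}

final class Grouping {
    /** Applies the group rules; the precedence between rules is an assumption. */
    static Group classify(NodeView n) {
        if ("RECOVERING".equals(n.state)) {
            return Group.WAIT_FOR_LOG_RECOVERY;
        }
        boolean attempted = n.numRestartAttempts > 0 || n.numReconfigAttempts > 0;
        boolean serving = "SERVING".equals(n.state)
                || "LEADING_ALL_PREFERRED".equals(n.state);
        if (attempted && serving) {
            return Group.NOP;
        }
        if (!n.reasons.isEmpty() && n.numRestartAttempts == 0) {
            return Group.RESTART;
        }
        if (n.broker && n.reasons.isEmpty() && n.numReconfigAttempts == 0) {
            return Group.MAYBE_RECONFIGURE;
        }
        return Group.OTHER;
    }
}
```

Checking `NOP` before `RESTART` means a node that has already been acted on and is serving again is left alone even if its reason list is still populated; whether the real roller resolves that overlap the same way is not stated in the visible text.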
@@ -125,15 +125,15 @@ Context: {
 - If a node is in `NOT_RUNNING` state, then restart it only if it has the `POD_HAS_OLD_REVISION` reason. This is because, if the node is not running at all, then restarting it likely won't make any difference unless the node is out of date.
 > For example, if a pod is in pending state due to a misconfigured affinity rule, there is no point restarting this pod again or restarting other pods, because that would leave them in pending state as well. If the user then fixes the misconfigured affinity rule, we should detect that the pod has an old revision and therefore restart it so that the pod is scheduled correctly and runs.
 
-- At this point either we started all nodes or a node or decided not to because of node's reason not being `POD_HAS_OLD_REVISION`. Regardless, wait for nodes to have `SERVING` within `postOperationalTimeoutMs`. If the timeout is reached and the node's `numAttempts` is greater than or equal to `maxAttempts`, throw `TimeoutException`. Otherwise increment node's `numAttempts` and restart from step 3.
+- At this point we have either restarted all nodes or a single node, or decided not to because the node's reason is not `POD_HAS_OLD_REVISION`. Regardless, wait for nodes to have `SERVING` within `postOperationTimeoutMs`. If the timeout is reached and the node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment the node's `numRetries` and restart from step 3.
 
 
-- Otherwise the controllers will be attempted to restart one by one in the following order:
+- Otherwise, restarts will be attempted one node at a time in the following order:
 - Pure controller nodes
 - Combined nodes
 - Broker only nodes
 
-- Wait for the restarted node to have `SERVING` within `postOperationalTimeoutMs`. If the timeout is reached and the node's `numAttempts` is greater than or equal to `maxAttempts`, throw `TimeoutException`. Otherwise increment node's `numAttempts` and restart from step 3.
+- Wait for the restarted node to have `SERVING` within `postOperationTimeoutMs`. If the timeout is reached and the node's `numRetries` is greater than or equal to `maxRetries`, throw `TimeoutException`. Otherwise increment the node's `numRetries` and restart from step 3.
 
 6. Further refine the broker nodes in `MAYBE_RECONFIGURE` group:
 - Describe Kafka configurations for each node via Admin API and compare them against the desired configurations. This is essentially the same mechanism we use today for the current KafkaRoller.
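The restart ordering in step 5 (pure controllers, then combined nodes, then broker-only nodes) and the retry-or-fail rule applied after each timeout can be sketched as follows. `KafkaNode`, `RestartPlan`, `inRestartOrder` and `retryOrFail` are hypothetical names used only for illustration; the ordering and the `numRetries >= maxRetries` check are taken from the hunk above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.TimeoutException;

/** Hypothetical node descriptor carrying only its process roles. */
final class KafkaNode {
    final int id;
    final boolean controller;
    final boolean broker;

    KafkaNode(int id, boolean controller, boolean broker) {
        this.id = id;
        this.controller = controller;
        this.broker = broker;
    }

    /** 0 = pure controller, 1 = combined, 2 = broker only. */
    int rank() {
        if (controller && !broker) {
            return 0;
        }
        return controller ? 1 : 2;
    }
}

final class RestartPlan {
    /** Returns a copy sorted into restart order; the sort is stable within a rank. */
    static List<KafkaNode> inRestartOrder(List<KafkaNode> nodes) {
        List<KafkaNode> ordered = new ArrayList<>(nodes);
        ordered.sort(Comparator.comparingInt(KafkaNode::rank));
        return ordered;
    }

    /**
     * Called when a node did not reach SERVING in time. Returns the incremented
     * retry count, or throws once the retry budget is used up.
     */
    static int retryOrFail(int nodeId, int numRetries, int maxRetries)
            throws TimeoutException {
        if (numRetries >= maxRetries) {
            throw new TimeoutException(
                    "node " + nodeId + " did not become SERVING after " + numRetries + " retries");
        }
        return numRetries + 1;
    }
}
```

A caller would store the returned count back into the node's context and go back to step 3, as the text describes.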
@@ -165,18 +165,17 @@ Context: {
 - If `COMBINED_AND_NOT_ACTIVE_CONTROLLER` group is non-empty, return the first node that can be restarted without impacting the quorum health and the availability.
 - If `COMBINED_AND_ACTIVE_CONTROLLER` group is non-empty, return the node if it can be restarted without impacting the quorum health and the availability. Otherwise return an empty set.
 - If `BROKER` group is non-empty, batch the broker nodes:
-- remove the node from the list, if it is a combined node and cannot be restarted without impacting the quorum health so that it does get included in a batch
 - build a map of nodes and their replicating partitions by sending a describeTopics request to the Admin API
 - batch the nodes that do not have any partitions in common and can therefore be restarted together
 - remove nodes that have an impact on the availability from the batches (more on this later)
 - return the largest batch
-- If an empty batch is returned, that means none of the nodes met the safety conditions such as availability and qourum health impact. In this case, check their `numAttempts` and if any of them is equal to or greater than `maxAttempts`, throw `UnrestartableNodesException`. Otherwise increment their `numAttempts` and restart from step 3.
+- If an empty batch is returned, that means none of the nodes met the safety conditions such as availability and quorum health impact. In this case, check their `numRetries` and if any of them is equal to or greater than `maxRetries`, throw `UnrestartableNodesException`. Otherwise increment their `numRetries` and restart from step 3.
 
 8. Restart the nodes from the returned batch in parallel:
 - If `numRestartAttempts` of a node is larger than `maxRestartAttempts`, throw `MaxRestartsExceededException`.
 - Otherwise, restart each node, transition its state to `RESTARTED` and increment its `numRestartAttempts`.
 - After restarting all the nodes in the batch, wait for their states to become `SERVING` until the configured `postOperationTimeoutMs` is reached.
-- If the timeout is reached, throw `TimeoutException` if a node's `numAttempts` is greater than or equal to `maxAttempts`. Otherwise increment their `numAttempts` and restart from step 3.
+- If the timeout is reached, throw `TimeoutException` if a node's `numRetries` is greater than or equal to `maxRetries`. Otherwise increment their `numRetries` and restart from step 3.
 
 9. If there are no exceptions thrown at this point, the reconciliation completes successfully. If an `UnrestartableNodesException`, `TimeoutException`, `MaxRestartsExceededException` or any other unexpected exception is thrown, the reconciliation fails.
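The broker batching step (group nodes that share no partitions, then return the largest batch) could be approximated with a greedy pass like the one below. This is a sketch, not the proposal's algorithm: `BrokerBatcher` and its methods are hypothetical, the greedy first-fit placement is an assumption, the availability filter is omitted, and capping the result at `maxRestartParallelism` is one possible reading of that option.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class BrokerBatcher {
    /**
     * Greedy first-fit: each node joins the first batch whose members replicate
     * none of its partitions, otherwise it starts a new batch.
     */
    static List<Set<Integer>> batch(Map<Integer, Set<String>> partitionsByNode) {
        List<Set<Integer>> batches = new ArrayList<>();
        List<Set<String>> batchPartitions = new ArrayList<>();
        // TreeMap gives a deterministic node order for the sketch.
        for (Map.Entry<Integer, Set<String>> e : new TreeMap<>(partitionsByNode).entrySet()) {
            int target = -1;
            for (int i = 0; i < batches.size(); i++) {
                if (Collections.disjoint(batchPartitions.get(i), e.getValue())) {
                    target = i;
                    break;
                }
            }
            if (target < 0) {
                batches.add(new TreeSet<>());
                batchPartitions.add(new HashSet<>());
                target = batches.size() - 1;
            }
            batches.get(target).add(e.getKey());
            batchPartitions.get(target).addAll(e.getValue());
        }
        return batches;
    }

    /** Picks the largest batch and keeps at most maxRestartParallelism nodes. */
    static Set<Integer> largest(List<Set<Integer>> batches, int maxRestartParallelism) {
        Set<Integer> best = Collections.emptySet();
        for (Set<Integer> b : batches) {
            if (b.size() > best.size()) {
                best = b;
            }
        }
        Set<Integer> capped = new TreeSet<>();
        for (Integer id : best) {
            if (capped.size() == maxRestartParallelism) {
                break;
            }
            capped.add(id);
        }
        return capped;
    }
}
```

With nodes 1 and 2 both replicating partition `a-0` and node 3 replicating only `c-0`, the sketch places nodes 1 and 3 in one batch and node 2 in another, so nodes 1 and 3 could be restarted together when the parallelism limit allows it.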