feat(esClientDrain): enhance Drain ES Client function #168
base: master
Conversation
Add an ability to set parameters for the retry logic of the Resty lib. Add handling for unfinished shards migration. Signed-off-by: vkropotko <[email protected]>
If I understand the PR correctly, it should fix this bug: #128
#128 Yes, I think so. Sorry, I didn't check the list of existing issues 😃
👍
@mikkeloscar mind also taking a look?
@@ -59,7 +66,8 @@ func TestGetElasticsearchEndpoint(t *testing.T) {
	customEndpoint, err := url.Parse(customURL)
	assert.NoError(t, err)

-	esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint)
+	esOperator = NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint,
nit: let's pass the `RetryConfig` from here instead of individual arguments.
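For illustration, a minimal sketch of what such a `RetryConfig` could look like; the struct and field names below are assumptions for this sketch, not the PR's actual definition:

```go
package operator

import "time"

// RetryConfig groups the Resty retry parameters so they can be passed to the
// constructor as a single argument. Field names here are hypothetical.
type RetryConfig struct {
	Count       int           // maximum number of retry attempts
	WaitTime    time.Duration // initial wait between attempts
	MaxWaitTime time.Duration // cap for the increasing wait between attempts
}
```

The test call would then become something like `NewElasticsearchOperator(faker, nil, 1*time.Second, 1*time.Second, "", "", ".cluster.local.", customEndpoint, &RetryConfig{...})` instead of passing each value separately.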
if shard.IP == podIP {
	err = fmt.Errorf("Cannot migrate shards from pod '%s' with IP '%s' within provided intervals", pod.ObjectMeta.Name, pod.Status.PodIP)
	// if we cannot remove the node, then return it to the active nodes pool
	if errExclude := c.undoExcludePodIP(pod); errExclude != nil {
While I understand the motivation for this, I'm not sure this is the best place to handle the issue you describe.
- If we hit the drain timeout, the right thing would be to retry draining, which is handled elsewhere in `operator.go`. Otherwise we can end up draining a bit, re-enabling the pod so it gets shards back, and then attempting to drain it once more.
- If we hit the drain timeout and the operator decides it's better to scale out and not drain, which as I understand it is the issue this aims to solve, then I think it would be better to undo the draining in the `operator.go` code and not have it as a side effect of the ES client drain.
Sorry, I don't understand what you suggest.
The current implementation looks like this:
separate autoscaler -> `getScalingDirection` -> `updateEDS` (parameters)
ES cluster -> run operator -> run `operator.operate()` at some interval -> ... -> `operatePods` can run the `Drain` func directly or from the `rescaleStatefulSet` func -> control is transferred to `elasticsearch.go` -> `Drain` -> waiting for draining with some retry settings
The idea of the current code is to give users the ability to stop waiting indefinitely for draining (this happens mostly because of a configuration error between the EDS and ES indices, or because of problems with pods/the cluster).
My assumption was this: if users specify wrong settings and `Drain` stops working, they can check the logs and fix the settings for their case.
ES-Op only guarantees that `Drain` will be rolled back in case of failure.
> If we hit the drain timeout the right thing would be to retry draining which is handled elsewhere in `operator.go`. Otherwise, we can end up draining a bit, re-enabling the pod so it gets shards back, and then attempt to drain it once more.
Am I right that you suggest adding an additional setting for this retry? Or do you suggest getting rid of the retry with the Resty lib?
An additional setting at the operator level would produce nearly identical behavior to the already existing Resty settings, apart from checking the cluster status. But it also introduces complexity in calculating the final maximum waiting interval.
> If we hit the drain timeout and the operator decides it's better to scale out and not drain, as I understand the issue described that this aims to solve, then I think it would be better to undo the draining in the operator.go code and not have it as a side effect of the es client drain.
So do you suggest just wrapping the `undoExcludePodIP` func into a new func, e.g. `undoDrain`, and running it bound to `Drain` in `operator.go`?
Is there a case where ES-Op receives an error from the `Drain` func and doesn't need to run `undoDrain`, besides the case when an additional retry option for `Drain` exists?
Also, I can try to implement the following functionality (a rough sketch follows the list):
- if `Drain` is unsuccessful -> leave the pod in the excluded list and set an annotation/status on the EDS
- during the next operator run, check that the previous status exists and, if ES-Op doesn't decide to try to `Drain` it again -> run `undoDrain` for it
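A minimal sketch of that proposal, assuming hypothetical names (`drainFailedAnnotation`, `ShouldRetryDrain`, `UndoDrain`) that are not part of the current codebase:

```go
package operator

import (
	"context"

	v1 "k8s.io/api/core/v1"
)

// drainFailedAnnotation is a hypothetical annotation marking a pod whose
// drain did not finish within the retry budget.
const drainFailedAnnotation = "es-operator/drain-failed"

// drainer captures only the calls this sketch needs; the real operator has
// different types and signatures.
type drainer interface {
	ShouldRetryDrain(pod *v1.Pod) bool
	Drain(ctx context.Context, pod *v1.Pod) error
	UndoDrain(ctx context.Context, pod *v1.Pod) error
}

// reconcileFailedDrain decides, on the next operator run, whether a
// previously failed drain should be retried or rolled back.
func reconcileFailedDrain(ctx context.Context, d drainer, pod *v1.Pod) error {
	if _, failed := pod.Annotations[drainFailedAnnotation]; !failed {
		return nil // the last drain did not fail, nothing to do
	}
	if d.ShouldRetryDrain(pod) {
		return d.Drain(ctx, pod)
	}
	// The operator decided not to drain again (e.g. scaling direction is UP):
	// remove the pod IP from the ES exclude list so shards can come back.
	return d.UndoDrain(ctx, pod)
}
```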
@mikkeloscar hey, could you please check my comment?
@mikkeloscar I would be good with this behaviour because it unblocks other cluster operations, but the users should have alerting on "Cannot migrate shards from pod" in the logs, because apparently there's some wrong configuration in the cluster to be fixed manually.
Hey, I'm sorry for not following up here earlier.
My main point was that the code added here is, in my mind, operator code and should belong in the logic of `operator.go`, not in the client for calling ES.
Right now, what would happen on a drain timeout is that the drain would be undone by the new code added here, and the next time the operator loop runs it would select the exact same pod again, because it has the drain annotation, and it would start draining again, so you basically don't win anything by having it here; that is my point.
One clear problem that I see with the current behavior (before this PR) is that on scale-out we don't stop the draining of the pod and then undo the drain, which would make sense if we were only draining the pod for scale-down previously. However, if the pod is being drained because a node is being shut down, then draining is exactly what we want, and IMO it shouldn't be stopped.
Apart from the scale-out scenario, I don't completely follow the idea of stopping the "infinite" drain in case the cluster is misconfigured. What would be the alternative? To stop the draining and then do what? I agree with @otrosien that the operators/owners of the EDS should have some sort of alert on these kinds of misconfigurations, because it's not something the es-operator is able to handle by itself, so from my perspective it's best if it just tries to drain and a human gets alerted if the drain takes longer than expected (maybe we can create better logs or other metrics to make it easier to create such alerts?).
@otrosien you mentioned that it "unblocks other cluster operations". Are there other operations than the ones I mentioned which are blocked during the drain? Maybe I'm missing some knowledge about this and am judging it wrong?
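One possible shape for such a metric, as a hedged sketch (the metric name, label, and helper are made up for illustration; es-operator does not necessarily expose this today):

```go
package operator

import "github.com/prometheus/client_golang/prometheus"

// drainTimeouts is a hypothetical counter of drains that did not finish
// within the configured retry budget.
var drainTimeouts = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "es_operator_drain_timeouts_total",
		Help: "Times a pod drain did not finish within the configured retries.",
	},
	[]string{"pod"},
)

func init() {
	prometheus.MustRegister(drainTimeouts)
}

// recordDrainTimeout would be called where the "Cannot migrate shards from
// pod" error is produced, so operators can alert on the metric instead of
// scraping log lines.
func recordDrainTimeout(podName string) {
	drainTimeouts.WithLabelValues(podName).Inc()
}
```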
Hey, sorry for the delayed response.
Let's clarify several things:
- I think I was not so clear in describing the current behavior; it's not an infinite cycle waiting for the `Drain` condition.
The current behavior works the following way: the function `waitForEmptyEsNode` conducts up to 999 retry operations with an increasing interval (from the `SetRetryWaitTime` to the `SetRetryMaxWaitTime` value, or 10 -> 30 seconds in our case), so it takes hours until it finishes.
Then in `waitForEmptyEsNode` we ignore the response
https://github.com/zalando-incubator/es-operator/pull/168/files#diff-67b73dbcc9b2b625a0749cfc62f02ad480ccd9d40abd4664e1455f01ff446965L282
and check only `err`, which will be `nil` because unsuccessful retries are not an error condition
(instead, we need to check another parameter - https://github.com/zalando-incubator/es-operator/pull/168/files#diff-67b73dbcc9b2b625a0749cfc62f02ad480ccd9d40abd4664e1455f01ff446965R367).
After this we remove the target node (pod) from the K8s cluster even though the node was not correctly removed from the ES cluster.
- About unblocking other cluster operations, for example:
We have a cluster that is stuck in a scale-down operation, e.g. it cannot migrate shards to other data nodes.
If the user updated an EDS (increased minReplicas) and the current scaling direction would be `UP`, then in the current implementation (before this PR) ES-Op will continue retrying the scale-down.
If a user (or some system condition) restarts ES-Op in this case, then ES-Op will start doing the expected scale-up operation. But here we have another issue - the IP address of the node that ES-Op tried to remove will still be present in `exclude_list`.
In the PR implementation it will work as before (I didn't change the existing retry configuration), but we will have a graceful exit (ref 1) from a failed Drain. Also, users will have the ability to configure a custom timeout for Drain, which is useful for cases with regular scale in/down operations (ref 2), like scaling EDS via Cronjobs.
So according to your first comment, I think I can keep a fix for 1), and for 2) add a function `undoDrain`, which contains `undoExclude`, to `operator.go`.
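For reference, here is a minimal, self-contained sketch of the retry pattern described in point 1 above; the endpoint URL and the `shardsStillOnNode` helper are placeholders for illustration, not the operator's real code:

```go
package main

import (
	"fmt"
	"time"

	"github.com/go-resty/resty/v2"
)

// shardsStillOnNode stands in for the real "shards remaining on this node" check.
func shardsStillOnNode(r *resty.Response) bool {
	return len(r.Body()) > 0
}

func main() {
	client := resty.New().
		SetRetryCount(999).                    // up to 999 retries
		SetRetryWaitTime(10 * time.Second).    // initial wait between retries
		SetRetryMaxWaitTime(30 * time.Second). // the wait grows up to this cap
		AddRetryCondition(func(r *resty.Response, err error) bool {
			// keep retrying while the request failed or shards remain
			return err != nil || shardsStillOnNode(r)
		})

	// Worst case this can run for hours: roughly 999 * 30s ≈ 8.3h.
	resp, err := client.R().Get("http://elasticsearch:9200/_cat/shards")

	// Once the retries are exhausted, err stays nil as long as the last HTTP
	// call itself succeeded, even though the retry condition never passed,
	// so the caller has to inspect the response explicitly.
	if err == nil && shardsStillOnNode(resp) {
		fmt.Println("retries exhausted, shards are still on the node")
	}
}
```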
Also, we can add an additional parameter to control the behavior in the `undoDrain` case. @mikkeloscar
I would like to take another look at this.
One-line summary
Description
Add an ability to set parameters for the retry logic of the Resty lib.
Add handling for unfinished shards migration.
About shards migration and Drain behavior
We have something of a bug in the `waitForEmptyEsNode` function.
In this function, after the retry logic finishes we don't check the response and proceed with deleting the pod, but this can be wrong (shards can still be on the pod). So validation for this case is added.
Also, additional logic is added with `removeFromExcludeIPList`: if the check for remaining shards does not finish successfully, we now remove the pod from the exclude list in the ES cluster settings (before, it was left in the excluded list). The idea behind it is that ES-Op can change the scaling direction after waiting for the condition (migration of remaining shards from the pod), so it seems better to start from the state before `Drain` was started.
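A rough, self-contained sketch of the added validation, using hypothetical stand-in types (the real diff uses the operator's own client and shard types, and `undoExcludePodIP` has its real implementation there):

```go
package operator

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// ESShard and ESClient are minimal stand-ins for the operator's real types.
type ESShard struct{ IP string }

type ESClient struct{}

// undoExcludePodIP mirrors the real helper's purpose - remove the pod IP from
// the ES exclude list again. The body here is a placeholder.
func (c *ESClient) undoExcludePodIP(pod *v1.Pod) error { return nil }

// ensureDrained runs after the retry loop: if shards are still located on the
// pod, it rolls back the exclude-list change and returns an error instead of
// letting the pod be deleted.
func (c *ESClient) ensureDrained(pod *v1.Pod, shards []ESShard) error {
	for _, shard := range shards {
		if shard.IP != pod.Status.PodIP {
			continue
		}
		err := fmt.Errorf("cannot migrate shards from pod '%s' with IP '%s' within provided intervals",
			pod.Name, pod.Status.PodIP)
		// Return the node to the active pool so the cluster ends up in the
		// same state as before Drain started.
		if errExclude := c.undoExcludePodIP(pod); errExclude != nil {
			return fmt.Errorf("%v: %v", err, errExclude)
		}
		return err
	}
	return nil
}
```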
Types of Changes
What types of changes does your code introduce? Keep the ones that apply:
Review
List of tasks the reviewer must do to review the PR
Deployment Notes
To test this behavior, ES-Op should be deployed with small values for the retry logic options: