-
Notifications
You must be signed in to change notification settings - Fork 652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make manual failover reset the on-going election to promote failover #1274
base: unstable
Are you sure you want to change the base?
Make manual failover reset the on-going election to promote failover #1274
Conversation
If a manual failover got timed out, like the election don't get the enough votes, since we have a auth_timeout and a auth_retry_time, a new manual failover will not be able to proceed on the replica side. Like if we initiate a new manual failover after a election timed out, we will pause the primary, but on the replica side, due to retry_time, replica does not trigger the new election and the manual failover will eventually time out. In this case, if we initiate manual failover again and there is an ongoing election, we will reset it so that the replica can initiate a new election at the manual failover's request. Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## unstable #1274 +/- ##
============================================
+ Coverage 70.69% 70.70% +0.01%
============================================
Files 114 115 +1
Lines 63161 63160 -1
============================================
+ Hits 44650 44656 +6
+ Misses 18511 18504 -7
|
A log demo from the test case (before the fix). replica:
the primary:
|
server.cluster->mf_can_start = 1; | ||
manualFailoverCanStart(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rather invoke resetManualFailover
and clean up failover_auth_time
in that period?
We should anyway cleanup failover_auth_time in the resetManualFailover method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought about putting it in resetManualFailover, but I was worried about introducing other problems since resetManualFailover is called in many places.
…_reset Signed-off-by: Binbin <[email protected]>
Signed-off-by: Binbin <[email protected]>
R 3 cluster failover | ||
|
||
# Waiting for primary and replica to confirm manual failover timeout. | ||
wait_for_log_messages 0 {"*Manual failover timed out*"} 0 1000 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i forgot to mention in here, the wait will take at least 5 seconds since the manual failover timeout is 5s, i can use other fields maybe like epoch to replace it if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using log messages should be OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems reasonable to me.
R 3 cluster failover | ||
|
||
# Waiting for primary and replica to confirm manual failover timeout. | ||
wait_for_log_messages 0 {"*Manual failover timed out*"} 0 1000 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using log messages should be OK.
R 1 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_NONE | ||
R 2 debug drop-cluster-packet-filter $CLUSTER_PACKET_TYPE_NONE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the usage of these constants, we should probably do it more to improve readability in tests.
Signed-off-by: Binbin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a full review. The idea looks good.
If a manual failover got timed out, like the election don't get the
enough votes, since we have a auth_timeout and a auth_retry_time, a
new manual failover will not be able to proceed on the replica side.
Like if we initiate a new manual failover after a election timed out,
we will pause the primary, but on the replica side, due to retry_time,
replica does not trigger the new election and the manual failover will
eventually time out.
In this case, if we initiate manual failover again and there is an
ongoing election, we will reset it so that the replica can initiate
a new election at the manual failover's request.