Manual failover vote is not limited by two times the node timeout #1305
base: unstable
This limit should not restrict manual failover; otherwise, in some scenarios, manual failover will time out. For example, if some FAILOVER_AUTH_REQUESTs or FAILOVER_AUTH_ACKs are lost during a manual failover, a node that has already voted cannot vote again in a second manual failover. Likewise, in a mixed scenario of plain failover followed by manual failover, the node cannot vote in the subsequent manual failover. Signed-off-by: Binbin <[email protected]>
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## unstable #1305 +/- ##
============================================
- Coverage 70.69% 70.68% -0.01%
============================================
Files 115 115
Lines 63153 63158 +5
============================================
+ Hits 44643 44646 +3
- Misses 18510 18512 +2
I'm not sure I agree with this. I think there should be some timeout built into the system, and you should retry.
The problem with retrying a manual failover is that the manual failover (mf) pauses clients for 5s on the primary side. So retrying every time a manual failover times out is a bad move.
The problem with retrying a manual failover is that the manual failover (mf) pauses clients for 5s on the primary side. So retrying every time a manual failover times out is a bad move.
Yes, client pause for a long time is a bad move. ♟️ ❌
We already had this comment:
/* We did not voted for a replica about this primary for two
* times the node timeout. This is not strictly needed for correctness
* of the algorithm but makes the base case more linear. */
Hm, not strictly needed for correctness means that it's OK to change it. It doesn't affect correctness. You added this to the same comment:
* This limitation does not restrict manual failover. If a user initiates
* a manual failover, we need to allow it to vote, otherwise the manual
* failover may time out. */
I think it's safe. I like the fix. The test cases look good too. Just some nits.
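To make the discussed rule concrete, here is a minimal sketch of the vote check: a node refuses to vote again within two times the node timeout, unless the request belongs to a manual failover. This is a simplified model, not the project's source; the names `voter_state`, `may_vote`, and the `manual_failover` flag are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a voting node. Not the real data
 * structure used by the project. */
typedef struct {
    int64_t voted_time_ms;   /* when this node last voted for a replica of the primary */
    int64_t node_timeout_ms; /* configured cluster node timeout */
} voter_state;

/* Returns true if the node may grant a vote at time now_ms.
 *
 * A node that voted less than two node timeouts ago refuses to vote
 * again, except when the request is part of a manual failover
 * (assumption: the caller can tell this from the request). */
bool may_vote(const voter_state *v, int64_t now_ms, bool manual_failover) {
    bool voted_recently = (now_ms - v->voted_time_ms) < v->node_timeout_ms * 2;
    if (voted_recently && !manual_failover) {
        return false; /* rate limit applies to non-manual failover only */
    }
    return true;
}
```

With a node timeout of 15000 ms and a previous vote at time 0, `may_vote` refuses a non-manual request at 6000 ms but grants a manual one at the same time, which mirrors the lost-ACK retry scenario from the description. Once the full 30000 ms have elapsed, non-manual requests are granted again.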
Co-authored-by: Viktor Söderqvist <[email protected]> Signed-off-by: Binbin <[email protected]>
Looks good. I added a comment about a log message, because you touched it. :) We don't have to fix it though.
Signed-off-by: Binbin <[email protected]>