-
Notifications
You must be signed in to change notification settings - Fork 467
Moved balancing into three dedicated threads #5687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
RootRecoveryIT passes with this change. The test was failing because the balance thread tries to read user migrations while the root table is unassigned and blocks. The balance thread being blocked causes root tablet assignment to block. |
This looks ok I think, have you run all of the ITs (or the balancing ones) by chance? |
I only ran ComprehensiveIT and RootRecoveryIT. I will look for some more ITs to run that cover balancing. |
Tried running all of the ITs w/ Balance in their name and ran into a problem w/ TabletResourceGroupBalanceIT. Fixed this in e0fede1. The test was failing because one test methods was seeing a tserver in ZK that was killed in a previous test. I think waitForBalance used to avoid this by chance. Looking into waitForBalance made some changes to do one thing it used to do. But it does not do everything it used to do. Also made TabletResourceGroupBalanceIT wait for the ZK lock to go away at the end of a test. Seeing BalanceIT timeout. Made update c1812f3 based on looking into that, but still see it timeout sometimes. Need to look into that some more. |
With the changes in 7e428a8 BalanceIT is now running more reliably, it was flaky because balancing was not running frequently enough. SimpleBalancerFairnessIT was flaky, seems this was because the test was written w/ the assmption that tablets would all be hosted. In 7e428a8 changed the test host all tablets. In general balancing and on demand tablets seems like it needs some improvement, made a comment on #5667 related to this. |
If the ITs are passing, then I have no other comments / issues with the PR. |
fixes #5533