keith-turner
Contributor

fixes #5533

@keith-turner keith-turner added this to the 4.0.0 milestone Jun 27, 2025
@keith-turner
Contributor Author

RootRecoveryIT passes with this change. The test was failing because the balance thread tries to read user migrations while the root table is unassigned and blocks. The balance thread being blocked causes root tablet assignment to block.
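For illustration, a minimal Java sketch of the circular wait described above. It does not use any Accumulo code: the two latches, the thread bodies, and the names `rootHosted` and `balancerPassDone` are hypothetical stand-ins, and the exact way root assignment depends on the balance thread is an assumption. The second call only shows that dropping the read-before-hosted dependency breaks the cycle, which is not necessarily how this PR does it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class CircularWaitSketch {

  // Returns true if the simulated root tablet was hosted within timeoutMs.
  static boolean run(boolean readMigrationsBeforeRootHosted, long timeoutMs) {
    CountDownLatch rootHosted = new CountDownLatch(1);       // hypothetical: root tablet assigned
    CountDownLatch balancerPassDone = new CountDownLatch(1); // hypothetical: balance thread made progress

    // Stand-in for the balance thread: reading user migrations needs the root tablet.
    Thread balancer = new Thread(() -> {
      try {
        if (readMigrationsBeforeRootHosted) {
          rootHosted.await(); // blocks while the root tablet is unassigned
        }
        balancerPassDone.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    // Stand-in for root tablet assignment: assumed to wait on the balance thread.
    Thread assigner = new Thread(() -> {
      try {
        balancerPassDone.await();
        rootHosted.countDown();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });

    balancer.setDaemon(true);
    assigner.setDaemon(true);
    balancer.start();
    assigner.start();
    try {
      return rootHosted.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      // Release both stand-in threads so a stuck run does not linger.
      balancer.interrupt();
      assigner.interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println("read blocks on root: hosted=" + run(true, 200));
    System.out.println("read skipped:        hosted=" + run(false, 2000));
  }
}
```

In the first call each thread waits on the other, so the wait for `rootHosted` times out. In the second the balance stand-in finishes without touching the root tablet and assignment proceeds.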

@dlmarion
Contributor

This looks ok I think, have you run all of the ITs (or the balancing ones) by chance?

@keith-turner
Contributor Author

> This looks ok I think, have you run all of the ITs (or the balancing ones) by chance?

I only ran ComprehensiveIT and RootRecoveryIT. I will look for some more ITs to run that cover balancing.

@keith-turner
Contributor Author

keith-turner commented Jun 30, 2025

Tried running all of the ITs w/ Balance in their name and ran into a problem w/ TabletResourceGroupBalanceIT. Fixed this in e0fede1. The test was failing because one test method was seeing a tserver in ZK that had been killed in a previous test. I think waitForBalance used to avoid this by chance. After looking into waitForBalance, made some changes so that it again does one of the things it used to do, but it still does not do everything it used to do. Also made TabletResourceGroupBalanceIT wait for the ZK lock to go away at the end of a test.

Seeing BalanceIT time out. Made update c1812f3 based on looking into that, but still see it time out sometimes. Need to look into that some more.
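The wait at the end of a test for the killed tserver's ZK lock to go away can be sketched as a generic poll-with-timeout helper. This is only an illustration, not the code from e0fede1: `waitFor` and its parameters are hypothetical, and the lambda in `main` is a stand-in for an actual check of the lock in ZK.

```java
import java.util.function.BooleanSupplier;

final class WaitForSketch {

  // Polls condition every pollMs until it is true or timeoutMs elapses.
  // Returns true if the condition became true, false on timeout or interrupt.
  static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }

  public static void main(String[] args) {
    // Stand-in for "the killed tserver's lock is no longer present in ZK".
    long goneAt = System.currentTimeMillis() + 100;
    boolean lockGone = waitFor(() -> System.currentTimeMillis() >= goneAt, 5_000, 10);
    System.out.println("lock gone: " + lockGone);
  }
}
```

Waiting on the condition itself, rather than sleeping for a fixed time, keeps the next test method from starting while stale server state is still visible.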

@keith-turner
Contributor Author

With the changes in 7e428a8, BalanceIT is now running more reliably. It was flaky because balancing was not running frequently enough.

SimpleBalancerFairnessIT was flaky; it seems this was because the test was written w/ the assumption that tablets would all be hosted. In 7e428a8, changed the test to host all tablets. In general, the interaction between balancing and on-demand tablets seems like it needs some improvement; made a comment on #5667 related to this.

@dlmarion
Contributor

dlmarion commented Jul 1, 2025

If the ITs are passing, then I have no other comments / issues with the PR.

@keith-turner keith-turner merged commit cf0f1eb into apache:main Jul 1, 2025
8 checks passed
@keith-turner keith-turner deleted the accumulo-5533 branch July 1, 2025 19:11
Development

Successfully merging this pull request may close these issues.

Dependency between balancing and tablet group watcher can leave system in unworkable state.

2 participants