num_rebuild limit hit during partial rebuild results in Full rebuild #1806

Open
abhilashshetty04 opened this issue Jan 23, 2025 · 0 comments

abhilashshetty04 commented Jan 23, 2025

Describe the bug
Issues with the num_rebuild limit:

For both partial and full rebuilds, we check whether the cluster-wide num_rebuild limit has been reached. If it has, we return the MaxRebuild error.

During a partial rebuild, we call online_child when the child's node comes back within the timeout. The limit is checked here as well, while onlining the child. If online_child fails, it is possible that the child (node) came back only momentarily, and after the current outage the nexus would discard the old write log and start maintaining a new one, so the old log is lost. For that reason, we attempt a full rebuild of the child whenever the online_child call fails. However, if online_child fails because of the num_rebuild limit, we still start a full rebuild, which is not correct: we are discarding a child that could have been partially rebuilt.

To Reproduce
We could set the num_rebuild limit very low and shut down two nodes that are involved in many volumes. Bring one of them back within the timeout (to start partial rebuilds) and do not bring back the other (to start full rebuilds). At some point the limit should be reached and online_child will fail, resulting in a full rebuild.

Expected behavior
We should handle the MaxRebuild error and not start a full rebuild. We could retry online_child on the next attempt.
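
To make the expected behavior concrete, here is a minimal sketch of the decision I have in mind. It is written in Python only for illustration and is not the project's code: `RebuildLimiter`, `RebuildAction`, `reconcile_child` and the `log_valid` flag are hypothetical names, and `online_child` here is a toy stand-in for the real call. The only point is that the MaxRebuild error is handled separately from other online_child failures.

```python
from dataclasses import dataclass
from enum import Enum, auto


class RebuildAction(Enum):
    """What the reconciler decides to do with the child (hypothetical)."""
    PARTIAL_REBUILD_STARTED = auto()
    RETRY_ONLINE_CHILD = auto()
    FULL_REBUILD = auto()


class OnlineChildError(Exception):
    """Any failure of the toy online_child call below."""


class MaxRebuildError(OnlineChildError):
    """The cluster-wide num_rebuild limit has been reached."""


@dataclass
class RebuildLimiter:
    """Toy stand-in for the cluster-wide rebuild counter (assumption)."""
    max_rebuilds: int
    active: int = 0

    def acquire(self) -> None:
        if self.active >= self.max_rebuilds:
            raise MaxRebuildError("num_rebuild limit reached")
        self.active += 1


def online_child(limiter: RebuildLimiter, log_valid: bool = True) -> None:
    """Toy online_child: fails if the write log is gone or the limit is hit."""
    if not log_valid:
        raise OnlineChildError("write log no longer covers the outage")
    limiter.acquire()


def reconcile_child(limiter: RebuildLimiter, log_valid: bool = True) -> RebuildAction:
    try:
        online_child(limiter, log_valid)
    except MaxRebuildError:
        # Limit hit: keep the child and its log, try online_child again later.
        return RebuildAction.RETRY_ONLINE_CHILD
    except OnlineChildError:
        # Any other failure: the log may be lost, so fall back to a full rebuild.
        return RebuildAction.FULL_REBUILD
    return RebuildAction.PARTIAL_REBUILD_STARTED
```

With a limit of 1, the first child starts its partial rebuild, and the second is kept for a retry rather than being discarded for a full rebuild.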
