Describe the bug
Issues with the num_rebuild limit:
Before starting either a partial or a full rebuild, we check whether the cluster-wide num_rebuild limit has been reached. If it has, we return the MaxRebuild error.
During a partial rebuild we call online_child when the child's node comes back within the timeout, and the limit is checked there as well while onlining the child. If online_child fails, it is possible that the child (node) came back only momentarily; after the current outage the nexus would discard the old write log and start maintaining a new one, so the old log is lost. For that reason we attempt a full rebuild of the child whenever the online_child call fails. However, when online_child fails only because the num_rebuild limit was reached, we still start a full rebuild, which is not correct: we are discarding a child that could have been partially rebuilt.
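The failure path described above can be modelled roughly as follows. This is only an illustrative sketch, not the actual control-plane code: RebuildLimiter, acquire and handle_returning_child are hypothetical names, and the string results stand in for whatever the real code does.

```python
class MaxRebuild(Exception):
    """Raised when the cluster-wide rebuild limit is already reached."""


class RebuildLimiter:
    """Hypothetical stand-in for the cluster-wide num_rebuild check."""

    def __init__(self, num_rebuild):
        self.num_rebuild = num_rebuild  # maximum concurrent rebuilds
        self.running = 0                # rebuilds currently in progress

    def acquire(self):
        if self.running >= self.num_rebuild:
            raise MaxRebuild()
        self.running += 1


def online_child(limiter):
    """Partial-rebuild path: the limit is checked while onlining the child."""
    limiter.acquire()
    return "partial rebuild"


def handle_returning_child(limiter):
    """Behaviour as reported: any online_child failure leads to a full rebuild."""
    try:
        return online_child(limiter)
    except Exception:
        # MaxRebuild lands here too, so a child that still has a usable
        # write log is discarded and rebuilt from scratch.
        return "full rebuild"
```

With a limit of 1, the first returning child gets a partial rebuild and the second one falls through to a full rebuild, even though nothing is wrong with its log.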
To Reproduce
Set the num_rebuild limit very low and shut down 2 nodes that are involved in many volumes. Bring one of them back within the timeout (to start partial rebuilds) and do not bring the other one back (to start full rebuilds). At some point the limit should be reached and online_child will fail, resulting in a full rebuild.
Expected behavior
We should handle the MaxRebuild error and not start a full rebuild. We could retry online_child on the next attempt.
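A minimal sketch of the expected handling, assuming the limit failure can be told apart from other online_child failures. The function name, the callables passed in and the returned strings are all hypothetical and only illustrate the decision:

```python
class MaxRebuild(Exception):
    """Hypothetical error type for "num_rebuild limit reached"."""


def handle_returning_child(online_child, start_full_rebuild):
    """Return the action taken for a child whose node came back in time."""
    try:
        online_child()
        return "partial rebuild"
    except MaxRebuild:
        # Limit reached: keep the child and its write log, try again later.
        return "retry next attempt"
    except Exception:
        # Any other failure: the old log may be gone, so rebuild fully.
        start_full_rebuild()
        return "full rebuild"
```

The only change compared to the current behaviour is the separate MaxRebuild branch: the full rebuild is reserved for failures where the write log can no longer be trusted.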