num_rebuild limit hit during partial rebuild results in Full rebuild #1806

abhilashshetty04 · 2025-01-23T18:40:56Z

Describe the bug
Issues with the num_rebuild limit:

When attempting either a partial or a full rebuild, we check whether the cluster-wide num_rebuild limit has been reached. If it has, we return the MaxRebuild error.

During a partial rebuild we call online_child when the child's node comes back within the timeout, and the limit is checked here as well. If online_child fails, it is possible that the child (node) came back only momentarily; after the current outage the nexus would discard the old write log and start maintaining a new one, so the old log is lost. For that reason we attempt a full rebuild of the child whenever the online_child call fails. However, when online_child fails because of the num_rebuild limit, we still start a full rebuild, which is not correct: we are discarding a child that could have been partially rebuilt.

To Reproduce
Set the num_rebuild limit very low and shut down two nodes that are involved in many volumes. Bring one of them back within the timeout (to start partial rebuilds) and leave the other one down (to start full rebuilds). At some point the limit should be reached and online_child will fail, resulting in a full rebuild.

Expected behavior
We should handle the MaxRebuild error and not start a full rebuild. We could retry online_child on the next attempt.
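
A minimal sketch of the expected handling, not the actual control-plane code. The type and function names here (`OnlineChildError`, `NextStep`, `decide_next_step`) are hypothetical and only illustrate the idea: a MaxRebuild failure should lead to a retry of online_child, while any other failure keeps the existing fallback to a full rebuild.

```rust
/// Hypothetical error returned by an online_child attempt.
#[derive(Debug, Clone, PartialEq, Eq)]
enum OnlineChildError {
    /// The cluster-wide num_rebuild limit was reached (the MaxRebuild case).
    MaxRebuild,
    /// Any other failure, e.g. the nexus no longer holds the old write log.
    Other(String),
}

/// Hypothetical outcome the reconciler acts on after an online_child attempt.
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    /// online_child succeeded, so the partial rebuild is under way.
    PartialRebuildStarted,
    /// Keep the child and retry online_child on the next attempt.
    RetryOnlineChild,
    /// Give up on the partial rebuild and fall back to a full rebuild.
    FullRebuild,
}

/// Decide what to do after online_child returns.
/// Only the MaxRebuild error is treated as retryable; every other
/// error keeps the current behaviour of falling back to a full rebuild.
fn decide_next_step(result: Result<(), OnlineChildError>) -> NextStep {
    match result {
        Ok(()) => NextStep::PartialRebuildStarted,
        Err(OnlineChildError::MaxRebuild) => NextStep::RetryOnlineChild,
        Err(OnlineChildError::Other(_)) => NextStep::FullRebuild,
    }
}
```

With this split, a call such as `decide_next_step(Err(OnlineChildError::MaxRebuild))` yields `NextStep::RetryOnlineChild`, so the child is not discarded just because the limit was hit. Whether the retry should still be bounded by the partial rebuild timeout is left open here.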
