Fail health check if intents reconcile starts and doesn't finish within 30s #507
Prior to this PR, slow reconciles, whether caused by limited CPU or by slow IO (a slow network, slow responses from the control plane), could make the operator function slowly. If reconciliation is slow enough, the operator is effectively not functioning: a reconcile that does not complete in a timely manner is not succeeding in practice, even if it eventually finishes.
The operator now fails its health check if a reconcile starts and does not complete within 30 seconds. This causes the operator to restart, which may self-heal the issue, and it signals to cluster operators that something is wrong and needs investigation long before the problem surfaces some other way.