Description
Is your feature request related to a problem? Please describe.
At the moment, the "big red button" for single and multiple experiments relies on the operator having access to the cluster.
For example, the operator can run kubectl delete Disruption <name> for one or more disruptions.
It would be great if the controller had a dead man's switch: if the connection to the cluster is lost, the controller would automatically stop all running experiments.
Describe the solution you'd like
I think the implementation of a dead man's switch could use a heartbeat and a watchdog timer for remediation.
I'm still not sure what the heartbeat would look like. Can we check both that the controller is still up and running and whether the connection to the cluster has been lost?
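To make the heartbeat and watchdog idea more concrete, here is a minimal sketch in Python. It is only an illustration, not a proposal for the actual controller code. The `Watchdog` class, its `timeout` and `clock` parameters, and the `on_expire` callback are all hypothetical names. What counts as a heartbeat is still the open question above; the sketch simply assumes something calls `beat()` whenever connectivity has been confirmed.

```python
import time
from typing import Callable


class Watchdog:
    """Hypothetical dead man's switch: fires `on_expire` once when no
    heartbeat has been recorded for more than `timeout` seconds."""

    def __init__(
        self,
        timeout: float,
        on_expire: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout = timeout
        self.on_expire = on_expire  # e.g. "stop all running experiments"
        self.clock = clock          # injectable so the timer can be tested
        self.last_beat = clock()
        self.tripped = False

    def beat(self) -> None:
        """Record a successful heartbeat and re-arm the switch."""
        self.last_beat = self.clock()
        self.tripped = False

    def check(self) -> bool:
        """Trip the switch if the heartbeat is overdue.

        Returns True while the switch is tripped. The callback runs only
        on the transition, not on every later check.
        """
        overdue = self.clock() - self.last_beat > self.timeout
        if overdue and not self.tripped:
            self.tripped = True
            self.on_expire()
        return self.tripped
```

In this sketch, a periodic loop would call `check()`, and whatever probe we settle on for "connection to the cluster is alive" would call `beat()` on success. Remediation runs once per outage, and the next successful heartbeat re-arms the switch.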
Describe alternatives you've considered
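Introducing support for duration with a default expiry period is a good first step for mitigating the risks. However, it is not enough on its own: an operator who has lost access to the cluster still cannot stop an experiment before it expires.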