User Request: Dead man's switch #375

@nikos912000

Is your feature request related to a problem? Please describe.
At the moment, the "big red button" for single and multiple experiments relies on the operator having access to the cluster.
For example, the operator can run kubectl delete Disruption <name> for one or more disruptions.

It would be great if the controller had a dead man's switch. If the connection to the cluster is lost, the controller would automatically stop all running experiments.

Describe the solution you'd like

I think a dead man's switch could be implemented with a heartbeat and a watchdog timer that triggers remediation.
I'm still not sure what the heartbeat would look like. Can we check whether the controller is still up and running, and whether the connection to the cluster has been lost?
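
To make the heartbeat plus watchdog idea concrete, here is a minimal sketch in Python (chosen for brevity, not because the controller uses it). Everything in it is an assumption for illustration: the DeadMansSwitch class, the 30 second timeout, and the "stop-all" callback are hypothetical names, and where the watchdog should run (controller or per-node component) is still an open question.

```python
import time
from typing import Callable, List


class DeadMansSwitch:
    """Hypothetical watchdog: fires a remediation callback when heartbeats stop."""

    def __init__(
        self,
        timeout_seconds: float,
        on_expire: Callable[[], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout = timeout_seconds
        self._on_expire = on_expire
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()
        self._fired = False

    def heartbeat(self) -> None:
        """Record a successful contact with the cluster and re-arm the switch."""
        self._last_beat = self._clock()
        self._fired = False

    def poll(self) -> bool:
        """Return True if the timeout has elapsed; fire remediation only once."""
        expired = self._clock() - self._last_beat > self._timeout
        if expired and not self._fired:
            self._fired = True
            self._on_expire()
        return expired


def demo() -> List[str]:
    """Drive the switch with a fake clock instead of real time."""
    now = [0.0]
    actions: List[str] = []
    switch = DeadMansSwitch(
        timeout_seconds=30.0,
        on_expire=lambda: actions.append("stop-all"),  # placeholder remediation
        clock=lambda: now[0],
    )
    now[0] = 20.0
    switch.heartbeat()   # cluster reachable at t=20
    now[0] = 45.0
    switch.poll()        # 25s of silence: still within the timeout
    now[0] = 51.0
    switch.poll()        # 31s of silence: remediation fires
    switch.poll()        # already fired: no second call
    return actions
```

In a real deployment, heartbeat() would be called after each successful round trip to the cluster, and poll() would run on a timer. The watchdog only helps if it runs somewhere that can still stop the experiments once the connection is gone.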

Describe alternatives you've considered
Introducing support for a duration with a default expiry period is a good first step toward mitigating the risks. However, it is not enough on its own.