Is your feature request related to a problem? Please describe.
Pod state failures (e.g. graceful/non-graceful deletion) are a common disruption in the Chaos Engineering community.
The reasoning behind pod failures is that Kubernetes pods are ephemeral resources; they get destroyed, restarted, and recreated.
This happens in many cases:
- When deploying a new version of an application
- When the liveness probe of any container running inside the pod fails
- As a consequence of draining a node
- When the autoscaler updates the number of replicas of a deployment
Pod state disruptions can expose a number of reliability concerns including:
- Long-lived pods and all the issues that may arise from them
- Cold start issues
- Scalability issues (e.g. autoscaling misconfigurations)
- Inconsistent/unknown startup times
- Uneven traffic distribution across pods
- Non-graceful shutdown
- Issues related to Java's DNS cache TTL leading to terminated pods still receiving requests
- Cascading failures
- We also wrote a blog post on issues we found when using Kube Monkey
Describe the solution you'd like
Pod deletions can be executed in many different ways. The easiest is through the Kubernetes client, which supports graceful and non-graceful deletions through its gracePeriodSeconds parameter. This is how tools like Kube Monkey and our internal controller execute that disruption.
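For reference, a minimal sketch of that approach with client-go could look like the following; it assumes a recent client-go version, in-cluster configuration, and placeholder namespace/pod names, and is not the controller's actual implementation:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// deletePod deletes a pod either gracefully (default grace period) or
// non-gracefully (grace period of 0, removing the pod immediately).
func deletePod(ctx context.Context, clientset kubernetes.Interface, namespace, name string, graceful bool) error {
	opts := metav1.DeleteOptions{}
	if !graceful {
		// A grace period of 0 seconds triggers a non-graceful (force) deletion.
		zero := int64(0)
		opts.GracePeriodSeconds = &zero
	}
	return clientset.CoreV1().Pods(namespace).Delete(ctx, name, opts)
}

func main() {
	// In-cluster configuration; assumes the caller runs in a pod with RBAC
	// permissions to delete pods.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "default" and "my-app-pod" are placeholder values for illustration only.
	if err := deletePod(context.Background(), clientset, "default", "my-app-pod", false); err != nil {
		panic(err)
	}
}
```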
The other option would be to do this at the container level, which provides more granularity. This is how Pumba executes these disruptions.
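A rough, hypothetical sketch of the container-level approach using the Docker Go SDK (Pumba talks to the node's Docker daemon in a similar way); the target container name and the choice of SIGKILL are illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// Connect to the node's Docker daemon through the environment-configured
	// socket; requires access to the Docker socket on the node.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// List running containers and kill the one matching a placeholder name.
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{})
	if err != nil {
		panic(err)
	}

	for _, c := range containers {
		for _, name := range c.Names {
			// Docker prefixes container names with a leading slash.
			if name == "/my-target-container" {
				// SIGKILL simulates a non-graceful failure; SIGTERM would be
				// the graceful equivalent.
				if err := cli.ContainerKill(ctx, c.ID, "SIGKILL"); err != nil {
					panic(err)
				}
				fmt.Printf("killed container %s\n", c.ID)
			}
		}
	}
}
```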
A few more implementation details/ideas:
- The level will always be pod for these disruptions.
- In the CRD this might get a bit confusing, but one option is to introduce a podFailure, similar to nodeFailure, with options (e.g. graceful/non-graceful deletion); see the sketch after this list.
- There is already a containers field which would allow targeting containers.
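For illustration only, a hypothetical addition to the disruption CRD's Go types could look like the sketch below; the field and type names (PodFailureSpec, Graceful) are assumptions rather than the existing API:

```go
// PodFailureSpec is a hypothetical spec for the podFailure disruption.
type PodFailureSpec struct {
	// Graceful controls whether pods are deleted with the default grace
	// period (true) or force-deleted with a grace period of 0 (false).
	Graceful bool `json:"graceful,omitempty"`
}

// DisruptionSpec shows where the new field could sit next to existing ones.
type DisruptionSpec struct {
	// ... existing fields such as nodeFailure and containers ...

	// PodFailure, when set, deletes the targeted pods.
	PodFailure *PodFailureSpec `json:"podFailure,omitempty"`
}
```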