Is your feature request related to a problem? Please describe.
Pod state failures (e.g. graceful/non-graceful deletion) are a common disruption in the Chaos Engineering community.
The reasoning behind pod failures is that Kubernetes pods are ephemeral resources; they get destroyed, restarted, and recreated.
This happens in many cases:
- When deploying a new version of an application
- When the liveness probe of any container running inside the pod fails
- As a consequence of draining a node
- When the autoscaler updates the number of replicas of a deployment
Pod state disruptions can expose a number of reliability concerns including:
- Long-lived pods and all the issues that may arise from them
- Cold start issues
- Scalability issues (e.g. autoscaling misconfigurations)
- Inconsistent/unknown startup times
- Uneven traffic distribution across pods
- Non-graceful shutdown
- Issues related to Java's DNS cache TTL leading to terminated pods still receiving requests
- Cascading failures
- We also wrote a blog post on issues we found when using Kube Monkey
Describe the solution you'd like
Pod deletions can be executed in many different ways. The easiest is through the Kubernetes client, which supports graceful and non-graceful deletions through its gracePeriodSeconds parameter. This is how tools like Kube Monkey and our internal controller execute that disruption.
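For reference, a minimal sketch of that approach with client-go could look like the following; it assumes a recent client-go version, in-cluster configuration, and placeholder namespace/pod names, and is not the controller's actual implementation:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// deletePod deletes a pod either gracefully (default grace period) or
// non-gracefully (grace period of 0, removing the pod immediately).
func deletePod(ctx context.Context, clientset kubernetes.Interface, namespace, name string, graceful bool) error {
	opts := metav1.DeleteOptions{}
	if !graceful {
		// A grace period of 0 seconds triggers a non-graceful (force) deletion.
		zero := int64(0)
		opts.GracePeriodSeconds = &zero
	}
	return clientset.CoreV1().Pods(namespace).Delete(ctx, name, opts)
}

func main() {
	// In-cluster configuration; assumes the caller runs in a pod with RBAC
	// permissions to delete pods.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// "default" and "my-app-pod" are placeholder values for illustration only.
	if err := deletePod(context.Background(), clientset, "default", "my-app-pod", false); err != nil {
		panic(err)
	}
}
```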
The other option would be to do this at the container level, which provides more granularity. This is how Pumba executes these disruptions.
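A rough, hypothetical sketch of the container-level approach using the Docker Go SDK (Pumba talks to the node's Docker daemon in a similar way); the target container name and the choice of SIGKILL are illustrative assumptions:

```go
package main

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()

	// Connect to the node's Docker daemon through the environment-configured
	// socket; requires access to the Docker socket on the node.
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		panic(err)
	}

	// List running containers and kill the one matching a placeholder name.
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{})
	if err != nil {
		panic(err)
	}

	for _, c := range containers {
		for _, name := range c.Names {
			// Docker prefixes container names with a leading slash.
			if name == "/my-target-container" {
				// SIGKILL simulates a non-graceful failure; SIGTERM would be
				// the graceful equivalent.
				if err := cli.ContainerKill(ctx, c.ID, "SIGKILL"); err != nil {
					panic(err)
				}
				fmt.Printf("killed container %s\n", c.ID)
			}
		}
	}
}
```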
A few more implementation details/ideas:
- The level will always be pod for these disruptions.
- In the CRD this might get a bit confusing, but one option is to introduce a podFailure, similar to nodeFailure, with options (e.g. graceful/non-graceful deletion); see the sketch after this list.
- There is already a containers field which would allow targeting containers.
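For illustration only, a hypothetical addition to the disruption CRD's Go types could look like the sketch below; the field and type names (PodFailureSpec, Graceful) are assumptions rather than the existing API:

```go
// PodFailureSpec is a hypothetical spec for the podFailure disruption.
type PodFailureSpec struct {
	// Graceful controls whether pods are deleted with the default grace
	// period (true) or force-deleted with a grace period of 0 (false).
	Graceful bool `json:"graceful,omitempty"`
}

// DisruptionSpec shows where the new field could sit next to existing ones.
type DisruptionSpec struct {
	// ... existing fields such as nodeFailure and containers ...

	// PodFailure, when set, deletes the targeted pods.
	PodFailure *PodFailureSpec `json:"podFailure,omitempty"`
}
```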