Contributor

@preespp preespp commented Nov 13, 2025

Added a new fault for readiness probe flapping (incident 200), targeting the frontend Deployment in the default namespace. The mechanism modifies the Deployment’s readiness probe to alternate between healthy and unhealthy, simulating intermittent readiness failures that cause pods to repeatedly transition between Ready and NotReady.

Injection:

  • Retrieve the existing Deployment and back up its current manifest.
  • Patch the container’s readinessProbe with aggressive parameters:
    • periodSeconds set low to trigger frequent checks.
    • failureThreshold and successThreshold set to 1.
  • Pods may alternate between Ready and NotReady.
  • If restart_policy is set to force, pods are restarted to apply the patch immediately.
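
As a rough illustration of the patch step above, here is a minimal sketch that builds a strategic-merge patch body for the readiness probe. The function name `build_flapping_probe_patch`, the container name `frontend`, and the value `periodSeconds=1` are assumptions for illustration only; the actual values and helper names in this PR may differ. The sketch only tunes the probe timing and thresholds and does not show how the probe handler itself is made to alternate.

```python
import copy
import json


def build_flapping_probe_patch(container_name, period_seconds=1):
    """Build a strategic-merge patch that makes the readiness probe aggressive.

    container_name and period_seconds are hypothetical parameters; containers
    in a pod template are merged by their "name" key, so only the probe
    fields listed here are changed.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            "readinessProbe": {
                                "periodSeconds": period_seconds,  # frequent checks
                                "failureThreshold": 1,  # one failure -> NotReady
                                "successThreshold": 1,  # one success -> Ready
                            },
                        }
                    ]
                }
            }
        }
    }


def backup_manifest(manifest):
    """Return an independent copy of the manifest so it can be restored later."""
    return copy.deepcopy(manifest)


# Assumed container name; the PR only names the Deployment ("frontend").
patch = build_flapping_probe_patch("frontend")
print(json.dumps(patch, indent=2))
```

A body like this could be passed to the Kubernetes Python client's `AppsV1Api.patch_namespaced_deployment(name, namespace, body)`; whether the PR applies the patch that way is not stated here.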

Alerts:

  • RequestErrorRate (frontend-service-1): Triggered due to increased 5xx errors from flapping pods.
  • RequestLatency (frontend-service-1): Frontend latency spikes due to pod churn.
  • FailedPodsDetected (frontend-pod-1): Expected alert did not fire; the pod never fully failed despite readiness probe flapping.
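
One possible explanation for the missing FailedPodsDetected alert is that a failing readiness probe only sets the pod's `Ready` condition to `False`; the pod phase stays `Running` rather than becoming `Failed`. The sketch below illustrates that distinction on a status dictionary shaped like a Kubernetes pod status. The helper names `is_ready` and `is_failed` are hypothetical, and the alert's actual rule is not shown in this PR, so treat this as an assumption about why it stayed silent.

```python
def is_ready(pod_status):
    """Return True if the pod's Ready condition is "True"."""
    for cond in pod_status.get("conditions", []):
        if cond.get("type") == "Ready":
            return cond.get("status") == "True"
    return False


def is_failed(pod_status):
    """Return True only if the pod phase is "Failed"."""
    return pod_status.get("phase") == "Failed"


# A pod whose readiness probe is currently failing: NotReady, but still Running.
flapping_pod = {
    "phase": "Running",
    "conditions": [{"type": "Ready", "status": "False"}],
}

print(is_ready(flapping_pod), is_failed(flapping_pod))
```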
