Chaos engineering Environment for observing how the system responds, and implementing improvements #39123
-
Can you please give some more detail about how the design will look?
-
I think we need to focus on simulating single points of failure, and I need to identify the key areas that impact our services.
-
@dginther I think I can simulate failures for vets-api using just a K8s setup that simulates proxy, Sidekiq, and external service failures. So just Dockerfiles and a K8s cluster. Thoughts?
-
RFCs have been moved to: https://github.com/department-of-veterans-affairs/va.gov-platform-architecture
-
Summary
Create a chaos engineering setup for vets-api.
NOTE: This will not be a public-facing website, and it only needs IP access for the SRE team and Operations. The goal is to run tests, not to simulate user activity.
Chaos engineering is the process of stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling, observing how the system responds, and implementing improvements.
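To make the definition concrete, here is a minimal Python sketch of the inject-then-observe loop. It is only an illustration under assumptions: `inject_faults`, `run_experiment`, `ServiceOutage`, `Throttled`, and `healthy_dependency` are hypothetical names, the failure rates are arbitrary, and nothing here is taken from vets-api itself.

```python
import random
from collections import Counter


class ServiceOutage(Exception):
    """Simulated server outage (hypothetical name)."""


class Throttled(Exception):
    """Simulated API throttling, similar to an HTTP 429 (hypothetical name)."""


def inject_faults(call, outage_rate, throttle_rate, rng):
    """Wrap `call` so that a share of invocations fail with a simulated fault."""
    def wrapped(*args, **kwargs):
        roll = rng.random()  # uniform in [0.0, 1.0)
        if roll < outage_rate:
            raise ServiceOutage("simulated outage")
        if roll < outage_rate + throttle_rate:
            raise Throttled("simulated throttling")
        return call(*args, **kwargs)
    return wrapped


def run_experiment(call, attempts):
    """Call the dependency repeatedly and record how each attempt ended."""
    outcomes = Counter()
    for _ in range(attempts):
        try:
            call()
            outcomes["ok"] += 1
        except ServiceOutage:
            outcomes["outage"] += 1
        except Throttled:
            outcomes["throttled"] += 1
    return outcomes


def healthy_dependency():
    # Stand-in for a real external service call (assumption for this sketch).
    return "200 OK"


rng = random.Random(39123)  # fixed seed so the run is repeatable
chaotic = inject_faults(healthy_dependency, outage_rate=0.1, throttle_rate=0.2, rng=rng)
results = run_experiment(chaotic, attempts=1000)
print(dict(results))
```

The counts are the "observing how the system responds" step. In a real experiment they would be replaced by whatever the team actually monitors, and the findings would feed the improvements.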
Motivation
Why are we doing this? What use cases does it support? What is the expected outcome?
Chaos Engineering methods help improve the reliability and stability of our complex environment.
We have many external dependencies, which produce many issues and errors that are difficult to diagnose.
We want to create playbooks for various incidents by simulating failures at each of the potential points of failure.
The SRE Team is tasked with helping to improve the reliability and stability of our systems, and with reducing both the number of incidents and the response time to known incidents.
Chaos engineering helps improve incident response playbooks and system resiliency by creating the real-world conditions needed to uncover hidden issues and performance bottlenecks that are difficult to find in distributed systems.
Detailed design
Create a new chaos engineering environment, simulated with a K8s cluster, for each of the key network points.
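As a rough sketch of how one experiment per network point might be structured, the Python below breaks one point at a time, observes the effect, and rolls back. The in-memory `Environment` is a stand-in for the K8s cluster and is purely an assumption. The point names come from the discussion above, while `Component`, `Environment`, and `run_experiment` are hypothetical names for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One simulated network point (hypothetical model)."""
    name: str
    healthy: bool = True


@dataclass
class Environment:
    """In-memory stand-in for the simulated cluster (assumption)."""
    components: dict = field(default_factory=dict)

    def add(self, name):
        self.components[name] = Component(name)

    def request_succeeds(self):
        # Assumption for this sketch: a request needs every point to be up,
        # so each point is treated as a single point of failure.
        return all(c.healthy for c in self.components.values())


def run_experiment(env, target):
    """Break one component, observe the effect, then restore it."""
    assert env.request_succeeds(), "steady state must hold before injecting"
    env.components[target].healthy = False   # inject the failure
    impacted = not env.request_succeeds()    # observe the response
    env.components[target].healthy = True    # roll back
    return {
        "target": target,
        "request_failed": impacted,
        "recovered": env.request_succeeds(),
    }


env = Environment()
for point in ("proxy", "sidekiq", "external-service"):
    env.add(point)

findings = [run_experiment(env, name) for name in list(env.components)]
for finding in findings:
    print(finding)
```

Each finding is the kind of record that could seed a playbook entry: which point was broken, what the user-visible effect was, and whether the system returned to steady state afterwards.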
Drawbacks
Why should we not do this? Please consider: