Chaos engineering Environment for observing how the system responds, and implementing improvements #39123

zachclarity · 2022-03-25T14:35:53Z

Start Date: 2022-03-25
Gatsby Issue: Chaos engineering Environment for observing how the system responds, and implementing improvements #39121

Summary

Create a chaos engineering setup of vets-api.

NOTE: This will not be a public-facing website; it only needs IP access for the SRE team and Operations. The goal is to run tests, not to simulate user activity.

Chaos engineering is the process of stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling, observing how the system responds, and implementing improvements.

Motivation

Why are we doing this? What use cases does it support? What is the expected
outcome?

Chaos engineering methods help improve the reliability and stability of our complex environment.
We have many external dependencies, which cause many difficult-to-diagnose issues and errors.
Simulating failures at each potential point of failure lets us create playbooks for various incidents.
The SRE team is tasked with improving the reliability and stability of our systems, reducing the number of incidents, and speeding up the response to known incidents.
Chaos engineering enhances incident response playbooks and system resiliency by creating the real-world conditions needed to uncover hidden issues and performance bottlenecks that are difficult to find in distributed systems.

Detailed design

Create a new chaos engineering environment, simulated with a K8s cluster, that covers each of the key network points.

Create a new K8s cluster setup to simulate the following for vets-api:
- Sidekiq queue failure
- Reverse proxy ALB
- Forward proxy
- Two vets-api K8s instances for failover
- Network access to Sentry and Datadog
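
As a rough illustration of the kind of failure injection listed above, here is a minimal Python sketch. It is not part of vets-api or of any particular chaos tool: `FaultInjector`, `SimulatedOutage`, `SimulatedThrottle`, `run_experiment`, and the failure rates are hypothetical names and values chosen only for this example. It wraps a dependency call, randomly turns it into an outage or a throttling error, and counts how each call ended.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict


class SimulatedOutage(Exception):
    """Raised when the injector simulates a dependency being down."""


class SimulatedThrottle(Exception):
    """Raised when the injector simulates API throttling."""


@dataclass
class FaultInjector:
    # Probability (0.0 to 1.0) that a call fails with each fault type.
    outage_rate: float = 0.0
    throttle_rate: float = 0.0
    # Seeded RNG so an experiment can be replayed exactly.
    rng: random.Random = field(default_factory=lambda: random.Random(0))

    def call(self, dependency: Callable[[], str]) -> str:
        """Invoke the dependency, or raise a simulated fault instead."""
        roll = self.rng.random()
        if roll < self.outage_rate:
            raise SimulatedOutage("dependency unavailable")
        if roll < self.outage_rate + self.throttle_rate:
            raise SimulatedThrottle("too many requests")
        return dependency()


def run_experiment(
    injector: FaultInjector, dependency: Callable[[], str], calls: int
) -> Dict[str, int]:
    """Call the dependency repeatedly and count how each call ended."""
    counts = {"ok": 0, "outage": 0, "throttled": 0}
    for _ in range(calls):
        try:
            injector.call(dependency)
            counts["ok"] += 1
        except SimulatedOutage:
            counts["outage"] += 1
        except SimulatedThrottle:
            counts["throttled"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical stand-in for a request that goes through the forward proxy.
    def forward_proxy_call() -> str:
        return "200 OK"

    injector = FaultInjector(outage_rate=0.2, throttle_rate=0.1)
    print(run_experiment(injector, forward_proxy_call, calls=100))
```

In a real environment the faults would more likely be injected at the cluster or network level; the sketch only shows the inject-and-observe loop in its simplest form.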

Drawbacks

Why should we not do this? Please consider:

The initial setup will require some time from the Operations team.
It will also require AWS resources for the simple K8s cluster.

dginther · 2022-03-25T14:41:41Z

Can you please give some more detail about how the design will look?
Do you plan to put this in its own VPC? Are you planning to use staging services, or connect to mock upstream services?
What chaos engineering tools are you considering?

3 replies

zachclarity Mar 25, 2022
Author

@dginther Let me know if you need more details than what I gave. I used the existing config info and created a new VPC to host a smaller version of it for Chaos. So there will be four environment types: Dev, Chaos, Staging, and Prod.

dginther Mar 25, 2022

> VPC Peering connected to CSRs that connects to TIC for external and internal Dependencies like Staging setup

I think you will find that the VPCs are connected to the AWS Transit Gateway, and that is how they communicate with the rest of the services in the VA network. Peering is how it used to be done, but it probably shouldn't be used anymore.

Have you consulted the VA yet about getting IP addresses for a new VPC for this use, or do you plan to just use some private IP addresses not managed by the VA?

> So there will be four types, Dev, Chaos, Staging, Prod.

Sandbox is a thing as well, but probably not too relevant to this effort.

I am not sure that this would really be as useful as you expect, considering that all these services should run inside an EKS cluster to more closely simulate the eventual environment they will all be running in. You could certainly test what failures in the reverse/forward proxies would do, but not how the cluster would react (self-healing, restarting pods, pod evictions, etc.). It is also true that the reverse/forward proxies and vets-api are not yet in EKS, so for the time being this would be a fair representation. Do you have a plan for moving this setup to an EKS cluster in the near future, when vets-api and revproxy are put into EKS?

zachclarity Mar 25, 2022
Author

@dginther Thanks, I will update and add more info, since I only need private IPs and the environment only needs to be accessible to the SRE team and Ops. Also, I will need to get more info to see how we can support the new EKS setup as part of the Chaos environment.

zachclarity · 2022-04-04T15:23:25Z

Author

I think we need to focus on simulating single points of failure, and I need to identify the key areas that impact our services.
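
One way to make "single points of failure" concrete is to take each instance down in turn and check whether the service is still available. The Python sketch below does this over a hypothetical topology: the role names, instance names, and instance counts in `TOPOLOGY` are assumptions for illustration (only the two vets-api instances come from the design above), and the availability rule (every role needs at least one healthy instance) is a deliberate simplification.

```python
from typing import Dict, List, Set

# Hypothetical topology: each role maps to the instances that can serve it.
TOPOLOGY: Dict[str, List[str]] = {
    "reverse_proxy": ["revproxy-1"],
    "vets_api": ["vets-api-1", "vets-api-2"],
    "forward_proxy": ["fwdproxy-1"],
    "sidekiq": ["sidekiq-1"],
}


def service_available(topology: Dict[str, List[str]], down: Set[str]) -> bool:
    """The service is up if every role still has at least one healthy instance."""
    return all(
        any(instance not in down for instance in instances)
        for instances in topology.values()
    )


def single_points_of_failure(topology: Dict[str, List[str]]) -> List[str]:
    """Take each instance down on its own and record which outages break the service."""
    spofs = []
    for instances in topology.values():
        for instance in instances:
            if not service_available(topology, {instance}):
                spofs.append(instance)
    return spofs


if __name__ == "__main__":
    print(single_points_of_failure(TOPOLOGY))
    # prints ['revproxy-1', 'fwdproxy-1', 'sidekiq-1']
```

With this assumed topology, only the two vets-api instances survive a single failure, which is the kind of result that would tell us where to focus experiments first.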

zachclarity · 2022-04-11T14:01:32Z

Author

@dginther I think I can simulate failures for vets-api just using a K8s setup to simulate proxy, Sidekiq, and external service failures. So just Dockerfiles and a K8s cluster. Thoughts?

mchelen-gov · 2022-07-18T19:20:43Z

RFCs have been moved to: https://github.com/department-of-veterans-affairs/va.gov-platform-architecture
Please migrate content there.
