Frontend health should not depend on Database #529

Open · janboll wants to merge 3 commits into main

Conversation

janboll (Collaborator) commented Aug 21, 2024

What this PR does

  • Remove dependency on the database from health checks
  • Use the global Prometheus registry

Jira:
Link to demo recording:

Special notes for your reviewer

The health of a service should not depend on the upstream services it depends on. If CosmosDB is not reachable, the frontend would scale down, which could cause an even more catastrophic failure.
We are using the default metrics handler, which requires using the default registry everywhere. This was discovered because Gauge metrics were not registered/showing up in the metrics output.

Also adapt tests to use a global metrics emitter to avoid panics caused by re-registering metrics in the global registry.
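
For illustration, a minimal sketch of the registry point (not code from this PR; the metric name is invented): a gauge created through promauto lands in the default registry, which is the one promhttp.Handler() exposes, so it shows up in the metrics output without extra wiring. A gauge registered only in a custom registry would be invisible to that handler, and registering the same metric name twice in the default registry (for example across tests) panics, hence the shared emitter.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// frontendHealth is registered in the global (default) registry via promauto,
// so promhttp.Handler() exposes it without any extra wiring. Registering the
// same metric a second time would panic, which is why tests must share one emitter.
var frontendHealth = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "frontend_health",
	Help: "1.0 when the frontend is healthy, 0.0 otherwise.",
})

func main() {
	frontendHealth.Set(1.0)

	// The default handler serves the default registry; metrics kept in a
	// private registry would not appear here.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```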
Contributor

Could you help me understand the motivation behind the changes in this file? Not strongly opposed, just curious about the background.

Collaborator Author

Just wanted to learn more about how the frontend is implemented and stumbled over that part.

Not depending on databases and other external services comes from experience. It can cause cascading failures which can be even harder to recover from.

Contributor

I can see that argument in terms of multiple microservices, but I think a ReadinessProbe defined against a database connection test is normal usage: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

We don't want this pod to accept any requests if it cannot connect to the database.

Collaborator Author

It's personal preference then. I prefer pods to keep running and handle errors in the application rather than have them restart when the database connection has issues.

Feel free to discard and close this PR.

Collaborator

this health check is used for the readiness and liveness probes alike, right?

liveness: including a DB check in a liveness probe is not the right thing to do in my opinion. it will rarely solve the issue if the DB has a problem and can have a negative impact on the DB.

readiness: what is the expectation of ARM towards an RP? is it preferable to have the RP answer with an error or not answer at all? if the DB is not working it would affect all pods, and a readiness probe including a DB check would empty the endpoint list of the service. i prefer a service to remain accessible and answer with a proper error message. but i'm not going to die on that hill.

tldr: i think Jan's change is the right thing to do. we can revisit whether liveness / readiness probes can/should check different things.
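
As a rough sketch of what splitting the probes could look like (illustrative only, not code from this repo; the handler paths and the DBClient interface are assumptions): the liveness endpoint answers unconditionally, while a separate readiness endpoint may consult the database, so a DB outage drains the Service endpoints without restarting pods.

```go
package frontend

import (
	"context"
	"net/http"
)

// DBClient stands in for the frontend's database client; DBConnectionTest
// mirrors the call used in the health check discussed in this PR.
type DBClient interface {
	DBConnectionTest(ctx context.Context) error
}

// registerProbes wires up separate liveness and readiness endpoints.
func registerProbes(mux *http.ServeMux, db DBClient) {
	// Liveness: if this handler runs, the process is alive. No dependency
	// checks, so a DB outage never triggers pod restarts.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: a failure here only removes the pod from the Service's
	// endpoint list; whether the DB belongs in this check is the open question above.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if err := db.DBConnectionTest(r.Context()); err != nil {
			http.Error(w, "database not reachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```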

Collaborator

also: not sure how we are going to define scrape targets for prometheus. in general prometheus will not scrape non-ready pods when using a ServiceMonitor as they will not show up in the endpoint list of a service.

@tony-schndr what scrape-config approach are we going to leverage?

Collaborator

@geoberle I'm not certain at this time; I have only had success with Prometheus annotations using Istio's metric merging feature. A ServiceMonitor can't connect because Azure Managed Prometheus doesn't have the certificate to satisfy strict mTLS. I'm going to raise this with the team after recharge/f2f, maybe there is something I'm missing.

Contributor

I'm ok to move the DB check into a startup probe and definitely agree the liveness probe has no need to check the database.

I can see how the RP responding with a 500/Internal Server Error is preferable and we can alert on that.

Collaborator

Agree with the other reviewers here, we should never validate dependencies in order to serve the /healthz endpoint. Customers should get a 5xx, not a tcp i/o connection timeout, when our database is down, since the liveness will determine if we are an Endpoint in the Service (if I am understanding our architecture correctly?)

Comment on lines +100 to 111

	dbConErr := f.dbClient.DBConnectionTest(request.Context())
	if !f.CheckReady() {
		writer.WriteHeader(http.StatusInternalServerError)
		healthStatus = 0.0
	} else if dbConErr != nil {
		writer.WriteHeader(http.StatusOK)
		f.logger.Error(fmt.Sprintf("Database test failed: %v", dbConErr))
		healthStatus = 0.5
	} else {
		writer.WriteHeader(http.StatusOK)
		healthStatus = 1.0
	}
Collaborator

This actually doesn't solve the cascading error either.

We need to turn this problem on its head: we should adopt a pattern of alerting on the Frontend generally for high rates of 5xx responses on all endpoints. /healthz should be "dumb" IMO because it is structurally relevant for both kube (RestartPolicy) and TCP (Service Endpoint) configuration. If our golang server is online (that is, the PID is up and our service is capable of handling TCP traffic), we should return 200 here, no ifs or other logic. To clarify what I mean: Is it valuable for us to spam the kube-apiserver to kick our Frontend Pods when the DB is down? Is it valuable for us to return tcp: connection i/o timeout on Frontend when the DB is down? I think we can agree: no.

So what do we do instead? Using middleware to expose metrics on all requests and responses regardless of endpoint will enable us to determine "oh, half our endpoints are returning 5xx? Interesting - are those the endpoints that rely on OCM? DB? EventGrid? Entra? etc" and we can triage from there. I think that implementing https://github.com/slok/go-http-metrics (or similar middleware) is a great place for us to start tracking these requests/responses/error rates in a general way.

TL;DR - I'm +1 to remove all validation on /healthz - it should unconditionally return 200. It is a test of our service's PID/TCP configuration, not our dependencies and/or accuracy/correctness of our service.
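
For illustration, a hand-rolled version of that idea (a sketch of the "or similar middleware" option, not the slok/go-http-metrics API; the metric and label names are invented): wrap every handler so each response increments a counter labelled with method, path, and status code, which is enough to alert on 5xx rates per endpoint.

```go
package frontend

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// httpResponsesTotal counts every response by method, path and status code,
// registered in the default registry so the existing /metrics handler exposes it.
var httpResponsesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "frontend_http_responses_total",
	Help: "HTTP responses by method, path and status code.",
}, []string{"method", "path", "code"})

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// withResponseMetrics wraps any handler and records one increment per request,
// regardless of endpoint, so dashboards can break 5xx rates down by path.
func withResponseMetrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		httpResponsesTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	})
}
```

In practice the path label should be the route template rather than the raw URL to keep cardinality bounded, which is one reason a maintained middleware library is attractive here.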
