Frontend health should not depend on Database #529

Open · janboll wants to merge 3 commits into main

Conversation

janboll (Collaborator) commented Aug 21, 2024

What this PR does

  • Remove dependency on the database from health checks
  • Use the global Prometheus registry

Jira:
Link to demo recording:

Special notes for your reviewer

The health of a service should not depend on the upstream services it depends on. If CosmosDB is not reachable, the frontend would scale down, which could cause an even more catastrophic failure.
We are using the default metrics handler, which requires using the default registry everywhere. This was discovered because Gauge metrics were not registered/showing up in the metrics output.

Also adapt tests to use a global metrics emitter to avoid panics caused by re-registering metrics in the global registry.
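
For illustration, a minimal sketch of the registry point (not code from this PR; the metric name is invented): a gauge created through promauto lands in the default registry, which is the one promhttp.Handler() exposes, so it shows up in the metrics output without extra wiring. A gauge registered only in a custom registry would be invisible to that handler, and registering the same metric name twice in the default registry (for example across tests) panics, hence the shared emitter.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// frontendHealth is registered in the global (default) registry via promauto,
// so promhttp.Handler() exposes it without any extra wiring. Registering the
// same metric a second time would panic, which is why tests must share one emitter.
var frontendHealth = promauto.NewGauge(prometheus.GaugeOpts{
	Name: "frontend_health",
	Help: "1.0 when the frontend is healthy, 0.0 otherwise.",
})

func main() {
	frontendHealth.Set(1.0)

	// The default handler serves the default registry; metrics kept in a
	// private registry would not appear here.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":8080", nil)
}
```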
Contributor

Could you help me understand the motivation behind the changes in this file? Not strongly opposed, just curious about the background.

Collaborator Author

Just wanted to learn more about how the frontend is implemented and stumbled over that part.

Not depending on databases and other external services comes from experience. It can cause cascading failures which can be even harder to recover from.

Contributor

I can see that argument in terms of multiple microservices, but I think a ReadinessProbe defined against a database connection test is normal usage: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

We don't want this pod to accept any requests if it cannot connect to the database.

Collaborator Author

It's personal preference then. I prefer pods to keep running and handle errors in the application rather than have them restart when the database connection has issues.

Feel free to discard and close this PR.

Collaborator

this health check is used for the readiness and liveness probes alike, right?

liveness: including a DB check in a liveness probe is not the right thing to do in my opinion. it will rarely solve the issue if the DB has a problem and can have a negative impact on the DB.

readiness: what is the expectation of ARM towards an RP? is it preferable to have the RP answer with an error or not answer at all? if the DB is not working it would affect all pods, and a readiness probe including a DB check would empty the endpoint list of the service. i prefer a service to remain accessible and answer with a proper error message. but i'm not going to die on that hill.

tldr: i think Jan's change is the right thing to do. we can revisit whether liveness / readiness probes can/should check different things.
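
As a rough sketch of what splitting the probes could look like (illustrative only, not code from this repo; the handler paths and the DBClient interface are assumptions): the liveness endpoint answers unconditionally, while a separate readiness endpoint may consult the database, so a DB outage drains the Service endpoints without restarting pods.

```go
package frontend

import (
	"context"
	"net/http"
)

// DBClient stands in for the frontend's database client; DBConnectionTest
// mirrors the call used in the health check discussed in this PR.
type DBClient interface {
	DBConnectionTest(ctx context.Context) error
}

// registerProbes wires up separate liveness and readiness endpoints.
func registerProbes(mux *http.ServeMux, db DBClient) {
	// Liveness: if this handler runs, the process is alive. No dependency
	// checks, so a DB outage never triggers pod restarts.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: a failure here only removes the pod from the Service's
	// endpoint list; whether the DB belongs in this check is the open question above.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if err := db.DBConnectionTest(r.Context()); err != nil {
			http.Error(w, "database not reachable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
}
```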

Collaborator

also: not sure how we are going to define scrape targets for prometheus. in general prometheus will not scrape non-ready pods when using a ServiceMonitor as they will not show up in the endpoint list of a service.

@tony-schndr what scrape-config approach are we going to leverage?

Collaborator

@geoberle I'm not certain at this time; I have only had success with Prometheus annotations using Istio's metric merging feature. A ServiceMonitor can't connect because Azure Managed Prometheus doesn't have the certificate to satisfy strict mTLS. I'm going to raise this with the team after recharge/f2f, maybe there is something I'm missing.

Contributor

I'm ok to move the DB check into a startup probe and definitely agree the liveness probe has no need to check the database.

I can see how the RP responding with a 500/Internal Server Error is preferable and we can alert on that.

Collaborator

Agree with the other reviewers here, we should never validate dependencies in order to serve the /healthz endpoint. Customers should get a 5xx, not a tcp i/o connection timeout, when our database is down, since the liveness will determine if we are an Endpoint in the Service (if I am understanding our architecture correctly?)

Comment on lines +100 to 111

	dbConErr := f.dbClient.DBConnectionTest(request.Context())
	if !f.CheckReady() {
		writer.WriteHeader(http.StatusInternalServerError)
		healthStatus = 0.0
	} else if dbConErr != nil {
		writer.WriteHeader(http.StatusOK)
		f.logger.Error(fmt.Sprintf("Database test failed: %v", dbConErr))
		healthStatus = 0.5
	} else {
		writer.WriteHeader(http.StatusOK)
		healthStatus = 1.0
	}
Collaborator

This actually doesn't solve the cascading error either.

We need to turn this problem on its head: we should adopt a pattern of alerting on the Frontend generally for high rates of 5xx responses on all endpoints. /healthz should be "dumb" IMO because it is structurally relevant for both kube (RestartPolicy) and TCP (Service Endpoint) configuration. If our golang server is online (that is, the PID is up and our service is capable of handling TCP traffic), we should return 200 here, no ifs or other logic. To clarify what I mean: Is it valuable for us to spam the kube-apiserver to kick our Frontend Pods when the DB is down? Is it valuable for us to return tcp: connection i/o timeout on Frontend when the DB is down? I think we can agree: no.

So what do we do instead? Using middleware to expose metrics on all requests and responses regardless of endpoint will enable us to determine "oh, half our endpoints are returning 5xx? Interesting - are those the endpoints that rely on OCM? DB? EventGrid? Entra? etc" and we can triage from there. I think that implementing https://github.com/slok/go-http-metrics (or similar middleware) is a great place for us to start tracking these requests/responses/error rates in a general way.

TL;DR - I'm +1 to remove all validation on /healthz - it should unconditionally return 200. It is a test of our service's PID/TCP configuration, not our dependencies and/or accuracy/correctness of our service.
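
For illustration, a hand-rolled version of that idea (a sketch of the "or similar middleware" option, not the slok/go-http-metrics API; the metric and label names are invented): wrap every handler so each response increments a counter labelled with method, path, and status code, which is enough to alert on 5xx rates per endpoint.

```go
package frontend

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// httpResponsesTotal counts every response by method, path and status code,
// registered in the default registry so the existing /metrics handler exposes it.
var httpResponsesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "frontend_http_responses_total",
	Help: "HTTP responses by method, path and status code.",
}, []string{"method", "path", "code"})

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// withResponseMetrics wraps any handler and records one increment per request,
// regardless of endpoint, so dashboards can break 5xx rates down by path.
func withResponseMetrics(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)
		httpResponsesTotal.WithLabelValues(r.Method, r.URL.Path, strconv.Itoa(rec.status)).Inc()
	})
}
```

In practice the path label should be the route template rather than the raw URL to keep cardinality bounded, which is one reason a maintained middleware library is attractive here.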
