Frontend health should not depend on Database #529
base: main
@@ -83,14 +83,7 @@ func (f *Frontend) Join() {
 	<-f.done
 }
 
-func (f *Frontend) CheckReady(ctx context.Context) bool {
-	// Verify the DB is available and accessible
-	if err := f.dbClient.DBConnectionTest(ctx); err != nil {
-		f.logger.Error(fmt.Sprintf("Database test failed: %v", err))
-		return false
-	}
-	f.logger.Debug("Database check completed")
-
+func (f *Frontend) CheckReady() bool {
 	return f.ready.Load().(bool)
 }
 
@@ -104,12 +97,17 @@ func (f *Frontend) NotFound(writer http.ResponseWriter, request *http.Request) {
 func (f *Frontend) Healthz(writer http.ResponseWriter, request *http.Request) {
 	var healthStatus float64
 
-	if f.CheckReady(request.Context()) {
+	dbConErr := f.dbClient.DBConnectionTest(request.Context())
+	if !f.CheckReady() {
+		writer.WriteHeader(http.StatusInternalServerError)
+		healthStatus = 0.0
+	} else if dbConErr != nil {
 		writer.WriteHeader(http.StatusOK)
-		healthStatus = 1.0
+		f.logger.Error(fmt.Sprintf("Database test failed: %v", dbConErr))
+		healthStatus = 0.5
 	} else {
-		arm.WriteInternalServerError(writer)
-		healthStatus = 0.0
+		writer.WriteHeader(http.StatusOK)
+		healthStatus = 1.0
 	}
 
 	f.metrics.EmitGauge("frontend_health", healthStatus, map[string]string{
Comment on lines +100 to 111:

This actually doesn't solve the cascading error either. We need to turn this problem on its head: we should adopt a pattern of alerting on the Frontend generally for high rates of 5xx responses on all endpoints. So what do we do instead? Using middleware to expose metrics on all requests and responses regardless of endpoint will enable us to determine "oh, half our endpoints are returning 5xx? Interesting - are those the endpoints that rely on OCM? DB? EventGrid? Entra? etc." and we can triage from there. I think that implementing https://github.com/slok/go-http-metrics (or similar middleware) is a great place for us to start tracking these requests/responses/error rates in a general way. TL;DR - I'm +1 to remove all validation on
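[Editor's note] For reference, a minimal sketch of wiring go-http-metrics as middleware around a standard net/http mux; the port, routes, and handler IDs below are illustrative, not the frontend's actual setup:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
	metricsprom "github.com/slok/go-http-metrics/metrics/prometheus"
	"github.com/slok/go-http-metrics/middleware"
	middlewarestd "github.com/slok/go-http-metrics/middleware/std"
)

func main() {
	mux := http.NewServeMux()
	// Example endpoint; in the frontend this would be the real handlers.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Expose the recorded metrics for Prometheus to scrape.
	mux.Handle("/metrics", promhttp.Handler())

	// Wrap every handler so request/response metrics are recorded
	// regardless of endpoint.
	mdlw := middleware.New(middleware.Config{
		Recorder: metricsprom.NewRecorder(metricsprom.Config{}),
	})

	log.Fatal(http.ListenAndServe(":8080", middlewarestd.Handler("", mdlw, mux)))
}

With the default Prometheus recorder this records request duration, response size, and in-flight requests labeled by handler, method, and status code, which is enough to alert on elevated 5xx rates per endpoint and then triage which dependency (OCM, DB, etc.) is behind them.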
Could you help me understand the motivation behind the changes in this file? Not strongly opposed, just curious about the background.
Just wanted to learn more about how the frontend is implemented and stumbled over that part.

Not depending on databases and other external services comes from experience: it can cause cascading failures, which can be even harder to recover from.
I can see that in terms of multiple microservices, but I think a ReadinessProbe defined against a database connection test is normal usage: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

We don't want this pod to accept any requests if it cannot connect to the database.
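[Editor's note] For illustration only, the shape of such a readiness probe expressed with the Kubernetes Go types (in practice this would live in the deployment manifest; the package name, path, port, and thresholds are assumptions, not the repo's actual values):

package deploy

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// readinessProbe targets an HTTP endpoint that performs the DB
// connection test, so the pod is only added to the Service endpoints
// once the database is reachable.
func readinessProbe() *corev1.Probe {
	return &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",
				Port: intstr.FromInt(8443),
			},
		},
		InitialDelaySeconds: 5,
		PeriodSeconds:       10,
		FailureThreshold:    3,
	}
}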
It's personal preference then. I prefer pods to keep running and to handle errors in my application, rather than have them restart when the database connection has issues.

Feel free to discard and close this PR.
This health check is used for the readiness and liveness probes alike, right?

Liveness: including a DB check in a liveness probe is not the right thing to do in my opinion. It will rarely solve the issue if the DB has a problem, and it can have a negative impact on the DB.

Readiness: what is the expectation of ARM towards an RP? Is it preferable to have the RP answer with an error or not answer at all? If the DB is not working it would affect all pods, and a readiness probe including a DB check would empty the endpoint list of the service. I prefer a service to remain accessible and answer with a proper error message, but I'm not going to die on that hill.

TL;DR: I think Jan's change is the right thing to do. We can revisit whether liveness/readiness probes can/should check different things.
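[Editor's note] A minimal sketch of that split, reusing CheckReady and DBConnectionTest from this PR; the Livez/Readyz method names and any routing are hypothetical:

// Liveness: report only process health; never touch the database, so a
// DB outage cannot cause restarts. (Illustrative; would live next to
// Healthz in the same file.)
func (f *Frontend) Livez(writer http.ResponseWriter, request *http.Request) {
	if !f.CheckReady() {
		writer.WriteHeader(http.StatusInternalServerError)
		return
	}
	writer.WriteHeader(http.StatusOK)
}

// Readiness: may include the dependency check; failing here only removes
// the pod from the Service endpoints, it does not restart the container.
func (f *Frontend) Readyz(writer http.ResponseWriter, request *http.Request) {
	if !f.CheckReady() {
		writer.WriteHeader(http.StatusInternalServerError)
		return
	}
	if err := f.dbClient.DBConnectionTest(request.Context()); err != nil {
		f.logger.Error(fmt.Sprintf("Database test failed: %v", err))
		writer.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	writer.WriteHeader(http.StatusOK)
}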
Also: not sure how we are going to define scrape targets for Prometheus. In general, Prometheus will not scrape non-ready pods when using a ServiceMonitor, as they will not show up in the endpoint list of a service. @tony-schndr, what scrape-config approach are we going to leverage?
@geoberle I'm not certain at this time; I have only had success with Prometheus annotations using Istio's metric merging feature. A ServiceMonitor can't connect due to Azure Managed Prometheus not having the certificate to satisfy strict mTLS. I'm going to raise this with the team after recharge/f2f; maybe there is something I'm missing.
I'm ok to move the DB check into a startup probe and definitely agree the liveness probe has no need to check the database.
I can see how the RP responding with a 500/Internal Server Error is preferable and we can alert on that.
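[Editor's note] One possible shape for moving the DB check behind a startup probe, sketched against the existing Frontend type; the Startupz name and the dbChecked atomic.Bool field are assumptions for this sketch:

// Startup check: backed by a startupProbe only. It probes the database
// until the first success, then short-circuits, so the DB is not hit for
// the rest of the pod's lifetime.
func (f *Frontend) Startupz(writer http.ResponseWriter, request *http.Request) {
	if !f.dbChecked.Load() { // hypothetical atomic.Bool field on Frontend
		if err := f.dbClient.DBConnectionTest(request.Context()); err != nil {
			f.logger.Error(fmt.Sprintf("Startup database check failed: %v", err))
			writer.WriteHeader(http.StatusServiceUnavailable)
			return
		}
		f.dbChecked.Store(true)
	}
	writer.WriteHeader(http.StatusOK)
}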
Agree with other reviewers here, we should never validate dependencies to serve /healthz endpoints. Customers should get a 5xx, not a tcp i/o connection timeout, when our database is down, since the liveness will determine if we are an Endpoint in the Service (if I am understanding our architecture correctly?).