Frontend health should not depend on Database #529

Open · wants to merge 3 commits into main
22 changes: 10 additions & 12 deletions frontend/pkg/frontend/frontend.go
Contributor:
Could you help me understand the motivation behind the changes in this file? Not strongly opposed, just curious about the background.

Collaborator (author):

I just wanted to learn more about how the frontend is implemented and stumbled over this part.

Not letting health checks depend on databases and other external services comes from experience: such dependencies can cause cascading failures, which can be even harder to recover from.

Contributor:

I can see that argument when multiple microservices are involved, but I think defining a ReadinessProbe against a database connection test is normal usage: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes

We don't want this pod to accept any requests if it cannot connect to the database.

Collaborator (author):

It's personal preference, then. I prefer pods to keep running and to handle errors in the application rather than have them restart once the database connection has issues.

Feel free to discard and close this PR.

Collaborator:

This health check is used for the readiness and liveness probes alike, right?

Liveness: including a DB check in a liveness probe is not the right thing to do, in my opinion. It will rarely solve the issue if the DB has a problem, and it can have a negative impact on the DB.

Readiness: what is the expectation of ARM towards an RP? Is it preferable to have the RP answer with an error or not answer at all? If the DB is not working, it affects all pods, and a readiness probe that includes a DB check would empty the endpoint list of the service. I prefer a service to remain accessible and answer with a proper error message, but I'm not going to die on that hill.

TL;DR: I think Jan's change is the right thing to do. We can revisit whether liveness and readiness probes can/should check different things.
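To make that split concrete, here is a minimal sketch of what separate probe handlers could look like, assuming the flag-only `CheckReady()` from this PR. The `Livez`/`Readyz` names and routes are invented for illustration and are not part of this change:

```go
// Illustrative only: hypothetical probe handlers on the existing Frontend type.

// Livez answers the liveness probe: the process is up and serving HTTP,
// nothing else is checked.
func (f *Frontend) Livez(writer http.ResponseWriter, request *http.Request) {
	writer.WriteHeader(http.StatusOK)
}

// Readyz answers the readiness probe from the frontend's own ready flag,
// without touching the database or other external dependencies.
func (f *Frontend) Readyz(writer http.ResponseWriter, request *http.Request) {
	if f.CheckReady() {
		writer.WriteHeader(http.StatusOK)
		return
	}
	writer.WriteHeader(http.StatusServiceUnavailable)
}
```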

Collaborator:

Also: I'm not sure how we are going to define scrape targets for Prometheus. In general, Prometheus will not scrape non-ready pods when using a ServiceMonitor, as they will not show up in the endpoint list of the service.

@tony-schndr what scrape-config approach are we going to leverage?

Collaborator:

@geoberle I'm not certain at this time; I have only had success with Prometheus annotations using Istio's metric merging feature. A ServiceMonitor can't connect because Azure Managed Prometheus does not have the certificate to satisfy strict mTLS. I'm going to raise this with the team after recharge/f2f; maybe there is something I'm missing.

Contributor:

I'm OK with moving the DB check into a startup probe, and I definitely agree the liveness probe has no need to check the database.

I can see how the RP responding with a 500/Internal Server Error is preferable, and we can alert on that.
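If the DB check does move to a startup probe, it could reuse the existing `DBConnectionTest`. A rough sketch, with the `Startupz` name and route invented for illustration:

```go
// Illustrative only: a hypothetical startup-probe handler that verifies the
// database is reachable while the pod is starting, instead of on every
// liveness/readiness check.
func (f *Frontend) Startupz(writer http.ResponseWriter, request *http.Request) {
	if err := f.dbClient.DBConnectionTest(request.Context()); err != nil {
		f.logger.Error(fmt.Sprintf("Database test failed: %v", err))
		writer.WriteHeader(http.StatusServiceUnavailable)
		return
	}
	writer.WriteHeader(http.StatusOK)
}
```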

Collaborator:

I agree with the other reviewers here: we should never validate dependencies in order to serve /healthz. Customers should get a 5xx, not a TCP i/o connection timeout, when our database is down, since liveness determines whether we are an Endpoint in the Service (if I am understanding our architecture correctly?).

@@ -83,14 +83,7 @@ func (f *Frontend) Join() {
 	<-f.done
 }
 
-func (f *Frontend) CheckReady(ctx context.Context) bool {
-	// Verify the DB is available and accessible
-	if err := f.dbClient.DBConnectionTest(ctx); err != nil {
-		f.logger.Error(fmt.Sprintf("Database test failed: %v", err))
-		return false
-	}
-	f.logger.Debug("Database check completed")
-
+func (f *Frontend) CheckReady() bool {
 	return f.ready.Load().(bool)
 }

@@ -104,12 +97,17 @@ func (f *Frontend) NotFound(writer http.ResponseWriter, request *http.Request) {
 func (f *Frontend) Healthz(writer http.ResponseWriter, request *http.Request) {
 	var healthStatus float64
 
-	if f.CheckReady(request.Context()) {
+	dbConErr := f.dbClient.DBConnectionTest(request.Context())
+	if !f.CheckReady() {
+		writer.WriteHeader(http.StatusInternalServerError)
+		healthStatus = 0.0
+	} else if dbConErr != nil {
 		writer.WriteHeader(http.StatusOK)
-		healthStatus = 1.0
+		f.logger.Error(fmt.Sprintf("Database test failed: %v", dbConErr))
+		healthStatus = 0.5
 	} else {
-		arm.WriteInternalServerError(writer)
-		healthStatus = 0.0
+		writer.WriteHeader(http.StatusOK)
+		healthStatus = 1.0
 	}
Comment on lines +100 to 111
Collaborator:

This actually doesn't solve the cascading error either.

We need to turn this problem on its head: we should adopt a pattern of alerting on the Frontend generally for high rates of 5xx responses on all endpoints. /healthz should be "dumb", IMO, because it is structurally relevant for both kube (RestartPolicy) and TCP (Service Endpoint) configuration. If our golang server is online (that is, the PID is up and our service is capable of handling TCP traffic), we should return 200 here, no ifs or other logic. To clarify what I mean: is it valuable for us to spam the kube-apiserver to kick our Frontend pods when the DB is down? Is it valuable for us to return tcp: connection i/o timeout on the Frontend when the DB is down? I think we can agree: no.

So what do we do instead? Using middleware to expose metrics on all requests and responses, regardless of endpoint, will enable us to determine "oh, half our endpoints are returning 5xx? Interesting - are those the endpoints that rely on OCM? DB? EventGrid? Entra? etc." and we can triage from there. I think that implementing https://github.com/slok/go-http-metrics (or similar middleware) is a great place for us to start tracking these requests/responses/error rates in a general way.

TL;DR: I'm +1 on removing all validation from /healthz - it should unconditionally return 200. It is a test of our service's PID/TCP configuration, not of our dependencies or the accuracy/correctness of our service.
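For reference, wiring go-http-metrics into a plain net/http server looks roughly like the sketch below. It follows the library's documented API, but the mux, routes, and port are placeholders rather than this repo's actual setup:

```go
package main

import (
	"net/http"

	metricsprom "github.com/slok/go-http-metrics/metrics/prometheus"
	"github.com/slok/go-http-metrics/middleware"
	middlewarestd "github.com/slok/go-http-metrics/middleware/std"
)

func main() {
	// Placeholder mux and port; the real frontend wiring differs.
	mux := http.NewServeMux()

	// "Dumb" health check: only proves the process is up and serving HTTP.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Wrap every handler so request counts, durations, and response codes are
	// recorded per endpoint; 5xx rates can then be alerted on per handler.
	mdlw := middleware.New(middleware.Config{
		Recorder: metricsprom.NewRecorder(metricsprom.Config{}),
	})

	_ = http.ListenAndServe(":8080", middlewarestd.Handler("", mdlw, mux))
}
```

With an empty recorder config the metrics should land in the default Prometheus registry, so whatever /metrics scraping approach we settle on can expose them.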


f.metrics.EmitGauge("frontend_health", healthStatus, map[string]string{
5 changes: 3 additions & 2 deletions frontend/pkg/frontend/frontend_test.go
@@ -17,7 +17,7 @@ import (
 	"github.com/Azure/ARO-HCP/internal/database"
 )
 
-func TestReadiness(t *testing.T) {
+func TestLiveness(t *testing.T) {
 	tests := []struct {
 		name  string
 		ready bool

@@ -198,12 +198,13 @@ func TestSubscriptionsPUT(t *testing.T) {
 		},
 	}
 
+	pe := NewPrometheusEmitter()
 	for _, test := range tests {
 		t.Run(test.name, func(t *testing.T) {
 			f := &Frontend{
 				dbClient: database.NewCache(),
 				logger:   slog.New(slog.NewTextHandler(io.Discard, nil)),
-				metrics:  NewPrometheusEmitter(),
+				metrics:  pe,
 			}
 
 			body, err := json.Marshal(&test.subscription)
4 changes: 1 addition & 3 deletions frontend/pkg/frontend/metrics.go
@@ -24,14 +24,12 @@ type Emitter interface {
 type PrometheusEmitter struct {
 	gauges   map[string]*prometheus.GaugeVec
 	counters map[string]*prometheus.CounterVec
-	registry prometheus.Registerer
 }
 
 func NewPrometheusEmitter() *PrometheusEmitter {
 	return &PrometheusEmitter{
 		gauges:   make(map[string]*prometheus.GaugeVec),
 		counters: make(map[string]*prometheus.CounterVec),
-		registry: prometheus.NewRegistry(),
 	}
 }

@@ -40,7 +38,7 @@ func (pe *PrometheusEmitter) EmitGauge(name string, value float64, labels map[st
 	if !exists {
 		labelKeys := maps.Keys(labels)
 		vec = prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: name}, labelKeys)
-		pe.registry.MustRegister(vec)
+		prometheus.MustRegister(vec)
 		pe.gauges[name] = vec
 	}
 	vec.With(labels).Set(value)
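A side note on why the test now shares a single emitter across subtests (my reading, not stated in the PR): with the per-emitter registry gone, `EmitGauge` registers on the global default registry, and registering the same metric name twice panics. A minimal sketch of that failure mode, with an illustrative label name:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

func main() {
	// The "endpoint" label is only illustrative.
	newGauge := func() *prometheus.GaugeVec {
		return prometheus.NewGaugeVec(prometheus.GaugeOpts{Name: "frontend_health"}, []string{"endpoint"})
	}

	// First registration on the default (global) registry succeeds.
	prometheus.MustRegister(newGauge())

	// A second collector with the same name and labels collides, and
	// MustRegister panics with a duplicate-registration error.
	prometheus.MustRegister(newGauge())
}
```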