Added a section for bad monitoring examples #15

25 changes: 25 additions & 0 deletions README.md
@@ -42,6 +42,7 @@ Backend development best practices
- [HTTP status codes](#http-status-codes)
- [Load balancer health checks](#load-balancer-health-checks)
- [Access control](#access-control)
- [Bad examples of monitoring](#bad-examples-of-monitoring)
- [Release checklist](#release-checklist)
- [General questions to consider](#general-questions-to-consider)
- [Generally proven useful tools](#generally-proven-useful-tools)
@@ -420,6 +421,30 @@ The load balancer health check page should be placed at a `/status/health` URL.

The status pages may need proper authorization in place, especially in case they expose debugging information in status messages or application metrics. HTTP basic authentication or IP-based restrictions are usually good enough candidates to consider.
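
As an illustration (not part of the original text), here is a minimal Flask-style sketch of that idea: keep the plain health check open for the load balancer, but require HTTP basic authentication for a more detailed status page. The endpoint name `/status/info`, the credentials and the returned fields are hypothetical.

```python
from flask import Flask, Response, jsonify, request

app = Flask(__name__)

# Hypothetical credentials; in practice these would come from configuration.
STATUS_USER = "status"
STATUS_PASSWORD = "change-me"

@app.route("/status/health")
def health():
    # The plain health check stays open so the load balancer can reach it.
    return "OK"

@app.route("/status/info")
def info():
    # Detailed status pages may expose debugging information, so require
    # HTTP basic authentication before serving them.
    auth = request.authorization
    if not auth or auth.username != STATUS_USER or auth.password != STATUS_PASSWORD:
        return Response(status=401, headers={"WWW-Authenticate": 'Basic realm="status"'})
    return jsonify({"version": "1.2.3", "database": "reachable"})
```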

## Bad examples of monitoring

When crafting a new service, it’s tempting to create some basic monitoring, like automatically sent emails in case of errors.

It sounds simple: something goes wrong, an email is sent to the appropriate parties. In reality, this kind of approach may result in a spam-like flood of error messages that cannot be used to identify the actual problems.
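
As a rough sketch of that naive setup (not from the original text; `render_page`, `request.path` and the email addresses are hypothetical placeholders):

```python
import smtplib
from email.message import EmailMessage

def notify_by_email(subject: str, body: str) -> None:
    # Anti-pattern: every single failure triggers its own email, with no
    # deduplication or rate limiting, so a broken backend produces a flood.
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "noreply@example.com"   # hypothetical addresses
    msg["To"] = "ops@example.com"
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

def handle_request(request):
    try:
        return render_page(request)       # hypothetical page handler
    except Exception as exc:
        notify_by_email(f"Service error: {exc!r}", f"Failed to serve {request.path}")
        raise
```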

An example from a customer extranet portal:

The service sends an email when something goes wrong. It may be an HTTP 500 error caused by unresponsive or broken backend services (ERP, CRM), or a 404 caused by missing documents that were supposed to be available for download.

At one point, the API responsible for customer data stopped working on the CRM end. This made the extranet completely unusable, since it relies heavily on end-customer data and contract status. The problem occurred on a Sunday, and there is no 24/7 monitoring agreement with the CRM vendor to handle this kind of issue. When users tried to log in to the extranet, an HTTP 500 error was returned every time the page was loaded. Some 15 000 emails were sent during Sunday and early Monday before the issue was noticed and the extranet was taken down into maintenance mode.

In this example, the big problems started when one critical API in the customer service solution went down. A high-traffic website began generating a huge number of emails, all related to the same problem - not very effective. This started to block the mail server, slowed down Flowdock (which also receives these emails in the customer flow) and generated several hundred megabytes of log files on the extranet server.

Another example is from a customer’s public website:

This one also sends emails in case of errors. Here, such a large portion of the reported errors are not actually errors that the emails cannot be used to tell whether there is a major problem or not. People do not read them, since they are tired of wading through unimportant emails all the time.

One can see that even though there were good intentions behind the monitoring, it failed completely because the approach was generic rather than carefully thought out.

If possible, different kinds of errors should be treated differently. If an API call fails once, maybe no action is needed. If it fails often (and/or within a small time window), something might be going on. Controlling the logging based on the number of errors can also be a good idea: if an identical error is seen 5 000 times within a 10-minute window, it is probably not the best idea to log everything and send an email for each occurrence. There’s nothing wrong with emails, if things are set up properly :).
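
A minimal sketch of what such throttling could look like (assuming a 10-minute window; the addresses and thresholds are hypothetical). A real setup would more likely rely on an existing error-tracking or log-aggregation tool, but the idea is the same: count identical errors and alert once, not 5 000 times.

```python
import smtplib
import time
from collections import defaultdict
from email.message import EmailMessage

WINDOW_SECONDS = 600      # 10-minute window, as in the example above
ALERT_THRESHOLD = 5       # only alert once an identical error repeats this many times
_error_timestamps: dict[str, list[float]] = defaultdict(list)

def report_error(error_key: str, details: str) -> None:
    """Count identical errors and send at most one email per window."""
    now = time.time()
    timestamps = _error_timestamps[error_key]
    # Drop occurrences that fall outside the current window.
    timestamps[:] = [t for t in timestamps if now - t < WINDOW_SECONDS]
    timestamps.append(now)

    # Send a single aggregated email when the threshold is first crossed.
    if len(timestamps) == ALERT_THRESHOLD:
        msg = EmailMessage()
        msg["Subject"] = f"Recurring error: {error_key}"
        msg["From"] = "noreply@example.com"   # hypothetical addresses
        msg["To"] = "ops@example.com"
        msg.set_content(
            f"'{error_key}' occurred {ALERT_THRESHOLD} times within "
            f"{WINDOW_SECONDS // 60} minutes.\nLatest details:\n{details}"
        )
        with smtplib.SMTP("localhost") as smtp:
            smtp.send_message(msg)
```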

Getting monitoring right may take a while, and customers might not be eager to pay for it. They should be reminded that if there is no 24/7 monitoring agreement, automated error reporting still provides valuable details about their services and about how customers experience them.

# Release checklist

When you are ready to release, remember to check off everything on your release checklist! The resulting peace of mind, repeatability and dependability is a great boon.