Open
Description
We now have Monit monitoring crucial services on each node.
How about we have the services status page flip its switches automatically, driven by Monit?
Estimated work items
- Find out how to catch service recovery in Monit so that we can send a "recovered" call (e.g. through service tests)
- Adjust Cachet to have a 1:1 mapping between components and Monit checks
- Map Monit statuses 1:1 to Cachet statuses
- Find a way to make Monit pass variables to the update script
- Create the Cachet API update script that will be used by Monit
- Create an update-only API account
Proposal
Cachet's documentation is not very complete, but we could use a Monit event handler (see how they'd do it with a 3rd party provider)
Configure Monit to fire a trigger
# An example of a SaltStack-managed Monit template
# refer to salt-states/mysql/files/monit.conf.jinja
check process mysql
  matching "mysql"
  group database
  start = "/usr/sbin/service mysql start"
  stop = "/usr/sbin/service mysql stop"
  if failed host {{ ip4_interfaces[0]|default('127.0.0.1') }} port 3306
    protocol MYSQL then restart
  if not exist for 3 cycles then restart
  if 3 restarts within 5 cycles then exec /path/to/monit_update_cachet_db.sh
Set up an update script
#!/bin/sh
# /path/to/monit_update_cachet_db.sh
# Send an update to the Cachet API
# -u holds the credentials of the pre-created, update-only Cachet user
# components/2 is the component id
# We still have to figure out how Monit reports status, and make sure status=3 is the right value
# 10.10.10.2:8000 is the internal upstream service we send our update requests to
/usr/bin/curl -u user:pass -XPUT \
-d status=3 \
10.10.10.2:8000/api/components/2
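The hardcoded script above could be generalized so that one script serves every Monit check. The sketch below relies on the `MONIT_SERVICE` and `MONIT_EVENT` environment variables that Monit sets for `exec` actions. Everything else is an assumption made for illustration: the `CACHET_URL` and `CACHET_AUTH` variable names, the function names, the check-to-component mapping, and the event strings being matched, which should be verified against the Monit version in use.

```shell
#!/bin/sh
# Sketch only: maps Monit's exec environment onto a Cachet component update.
# MONIT_SERVICE and MONIT_EVENT are set by Monit for exec actions; the
# mappings, credentials and CACHET_* variable names below are assumptions.

CACHET_URL="${CACHET_URL:-10.10.10.2:8000/api/components}"
CACHET_AUTH="${CACHET_AUTH:-user:pass}"

# Map a Monit check name to a Cachet component id (hypothetical mapping).
component_id_for() {
    case "$1" in
        mysql) echo 2 ;;
        *)     return 1 ;;
    esac
}

# Map a Monit event description to a Cachet component status.
# 1 = Operational, 3 = Partial Outage. The event strings are assumed.
status_for_event() {
    case "$1" in
        *succeeded*|Exists) echo 1 ;;
        *)                  echo 3 ;;
    esac
}

update_cachet() {
    # Checks without a mapped component are silently skipped.
    id=$(component_id_for "$MONIT_SERVICE") || return 0
    status=$(status_for_event "$MONIT_EVENT")
    /usr/bin/curl -u "$CACHET_AUTH" -XPUT -d "status=$status" "$CACHET_URL/$id"
}

# Only act when Monit invoked us, i.e. MONIT_SERVICE is set.
if [ -n "${MONIT_SERVICE:-}" ]; then
    update_cachet
fi
```

With this shape, the same script path could be used in every check's `exec` action, and the 1:1 mapping between Monit checks and Cachet components lives in one place.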
Example of how to update a component status
Using curl
An update of the database component to partial outage would look like this:
API call
It's using component status 3, which means "Partial Outage". See also the post-parameters section.
curl -u user:pass -XPUT -d status=3 10.10.10.2:8000/api/components/2
{
  "data": {
    "created_at": 1427482793,
    "description": "MariaDB database cluster nodes",
    "id": 2,
    "incident_count": 0,
    "name": "db cluster",
    "status": "Partial Outage",
    "status_id": 3,
    "updated_at": 1430332325
  }
}
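Since the response echoes back the new `status_id`, the update script could check that the change was actually applied. A minimal sketch follows, assuming the response has the shape shown above. The function names are hypothetical, and a proper JSON parser such as `jq` would be more robust than `sed` if it is available on the nodes.

```shell
#!/bin/sh
# Sketch: confirm that Cachet applied the requested component status.

# Read a Cachet component response on stdin and print its status_id.
status_id_from_response() {
    sed -n 's/.*"status_id": *\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed only if the response reports the wanted status.
# $1 = expected status id, response JSON on stdin.
response_has_status() {
    [ "$(status_id_from_response)" = "$1" ]
}
```

It could be used by piping the curl output through the helper, for example `curl -s -u user:pass -XPUT -d status=3 10.10.10.2:8000/api/components/2 | response_has_status 3 || echo "Cachet update failed" >&2`.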