
Site Reliability

Henne Vogelsang edited this page May 28, 2020 · 19 revisions

Here is how we ensure that our reference server, https://build.opensuse.org, functions reliably.

Infrastructure Monitoring

On our servers, we use Icinga together with many monitoring-plugins checks. They send infrastructure performance and health data to an InfluxDB time series database, which we visualize on a Grafana dashboard. This dashboard is not public.

Logging

In our Ruby on Rails app, we make use of lograge to log to disk. System logs go to a central logging server via rsyslog.

Application Performance Monitoring

Inside our Ruby on Rails app, we make use of influxdb-rails which sends performance data to an InfluxDB time series database. We visualize this data on a Grafana dashboard reachable at https://obs-measure.opensuse.org
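To give a feel for the kind of data point that ends up in the time series database, here is a minimal sketch that times a block and formats the result in InfluxDB line protocol. In the real app influxdb-rails does the instrumentation and transport; the helper names (`measure`, `line_protocol`) and the measurement, tag, and field names are assumptions made up for this illustration.

```ruby
# Sketch only: influxdb-rails handles this for us in the real app.
# All helper, measurement, tag, and field names here are hypothetical.

# Escape characters that are special in line-protocol tag/field keys and tag values.
def escape_tag(value)
  value.to_s.gsub(/[,= ]/) { |m| "\\#{m}" }
end

# Format a field value: integers get an "i" suffix, floats are written as-is,
# everything else becomes a double-quoted string.
def format_field(value)
  case value
  when Integer then "#{value}i"
  when Float then value.to_s
  else "\"#{value.to_s.gsub(/["\\]/) { |m| "\\#{m}" }}\""
  end
end

# Build one point: measurement[,tag=value...] field=value[,field=value...] timestamp
def line_protocol(measurement, tags:, fields:, timestamp_ns:)
  name = measurement.to_s.gsub(/[, ]/) { |m| "\\#{m}" }
  tag_part = tags.map { |k, v| "#{escape_tag(k)}=#{escape_tag(v)}" }.join(',')
  field_part = fields.map { |k, v| "#{escape_tag(k)}=#{format_field(v)}" }.join(',')
  head = tag_part.empty? ? name : "#{name},#{tag_part}"
  "#{head} #{field_part} #{timestamp_ns}"
end

# Run the block and return its wall-clock duration in milliseconds.
def measure
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000.0
end

duration_ms = measure { sleep 0.01 }
puts line_protocol(
  'rails',
  tags: { hook: 'process_action', location: 'ProjectsController#show' },
  fields: { duration_ms: duration_ms, count: 1 },
  timestamp_ns: Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond)
)
```

Tags are indexed by InfluxDB and suit low-cardinality values such as a controller action, while fields carry the measured numbers that Grafana plots.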

Application Health Monitoring (Telemetry)

Inside our Ruby on Rails app, we make use of bunny, which sends telemetry to a RabbitMQ message broker. A telegraf server agent reads the telemetry from the broker and stores it in an InfluxDB time series database. We visualize this data on a Grafana dashboard reachable at https://obs-measure.opensuse.org
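The publishing side of this pipeline can be sketched as a small wrapper around an exchange object. This is an illustration under assumptions, not the app's actual code: `TelemetryPublisher`, `InMemoryExchange`, the routing key, and the payload format are all hypothetical. The wrapper relies only on the exchange responding to `publish(payload, routing_key: ...)`, which is how a Bunny exchange is called, so it can be exercised here without a broker.

```ruby
# Sketch only: class names, routing key, and payload format are hypothetical.
#
# With the bunny gem and a reachable broker, a real exchange could be obtained
# roughly like this (exchange name and environment variable are assumptions):
#   connection = Bunny.new(ENV['AMQP_URL'])
#   connection.start
#   exchange = connection.create_channel.topic('telemetry')

# Publishes telemetry payloads and never lets a broker failure reach the caller.
class TelemetryPublisher
  def initialize(exchange, routing_key:)
    @exchange = exchange
    @routing_key = routing_key
  end

  # Returns true if the payload was handed to the exchange, false otherwise.
  def publish(payload)
    @exchange.publish(payload, routing_key: @routing_key)
    true
  rescue StandardError => e
    warn "telemetry dropped: #{e.class}: #{e.message}"
    false
  end
end

# Stand-in for a broker-backed exchange so the sketch runs without RabbitMQ.
class InMemoryExchange
  attr_reader :messages

  def initialize
    @messages = []
  end

  def publish(payload, routing_key:)
    @messages << [routing_key, payload]
    self
  end
end

exchange = InMemoryExchange.new
publisher = TelemetryPublisher.new(exchange, routing_key: 'obs.metrics')
publisher.publish('user_logins,source=webui value=1i')
exchange.messages.each { |key, payload| puts "#{key}: #{payload}" }
```

Swallowing publish errors is a design choice made for this sketch: losing a telemetry point is usually preferable to failing a user request because the broker is unavailable.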

Exception Tracking

Inside our Ruby on Rails app, we make use of airbrake which sends application exceptions to an errbit error catcher service at https://errbit-opensuse.herokuapp.com

Web Analytics

Tracing

Incident Management

There is always at least one person "on-call". As soon as we are alerted, that person takes over incident command and holds every role they have not delegated (hacking on the problem, operating the server, communicating with users). They are free to pull in anyone they need and to hand out tasks and roles to resolve the incident.

After resolving the incident, we do a root cause analysis and publish a report, based on our Post-Mortem-Template, on https://openbuildservice.org/categories/deployments/
