
Site Reliability

Henne Vogelsang edited this page Aug 27, 2020 · 19 revisions

Here is how we ensure that our reference server, https://build.opensuse.org, functions reliably.

Infrastructure Monitoring

On our servers, we use Icinga and many monitoring-plugins, which send infrastructure performance and health monitoring data to an InfluxDB time series database. We visualize this data on a Grafana dashboard. This dashboard is not public.

Logging

In our Ruby on Rails app, we make use of lograge to log to disk. System logs go to a central logging server via rsyslog.

Application Performance Monitoring

Inside our Ruby on Rails app, we make use of influxdb-rails which sends performance data to an InfluxDB time series database. We visualize this data on a Grafana dashboard reachable at https://obs-measure.opensuse.org

More details in Application Performance Monitoring.

Application Health Monitoring (Telemetry)

Inside our Ruby on Rails app, we make use of bunny, which sends telemetry to a RabbitMQ message broker. A Telegraf server agent reads the telemetry from the broker and stores it in an InfluxDB time series database. We visualize this data on a Grafana dashboard reachable at https://obs-measure.opensuse.org

More details in Application Health Monitoring.

Exception Tracking

Inside our Ruby on Rails app, we make use of airbrake which sends application exceptions to an errbit error catcher service at https://errbit-opensuse.herokuapp.com

Web Analytics

Tracing

Incident Management

There is always at least one person "on-call". As soon as we are alerted, that person takes on incident command and holds every position (hacking on the problem, operating the server, communicating with users) that they have not delegated. They are free to pull in anyone they need and to hand out tasks and roles to resolve the incident.

After resolving the incident, we do a root cause analysis and publish a report, based on our Post-Mortem-Template, on https://openbuildservice.org/categories/deployments/

We are using priority labels for issues.

  • P1: Urgent - EVERYONE drop everything and fix this
  • P2: High - If at all possible, assign this to yourself and fix it ASAP

Development Environment

You can run OBS and all the tools we use in our SRE stack by combining the docker-compose.yml and docker-compose.sre.yml files. To set up the stack, run rake docker:ahm:prepare. This fetches all images and configures them.

Afterward, you can run any docker-compose command by passing both files: docker-compose -f docker-compose.sre.yml -f docker-compose.yml. For instance, docker-compose -f docker-compose.sre.yml -f docker-compose.yml up boots OBS including the SRE stack.
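Typing both -f flags for every command gets repetitive, so a small shell function can help. The following is only a sketch: the function name sre_compose and the COMPOSE_BIN override are hypothetical and not part of the OBS repository. It assumes you run it from the directory that contains both compose files.

```shell
# Hypothetical helper (not part of OBS): always passes both compose files.
# COMPOSE_BIN is an assumed override for illustration; it defaults to docker-compose.
sre_compose() {
  "${COMPOSE_BIN:-docker-compose}" \
    -f docker-compose.sre.yml \
    -f docker-compose.yml \
    "$@"
}

# Example usage:
#   rake docker:ahm:prepare   # one-time setup, as described above
#   sre_compose up            # boot OBS including the SRE stack
#   sre_compose ps            # list the containers of the combined stack
```

The file order matches the command given above. Any docker-compose subcommand and its arguments are forwarded unchanged.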

Configure Grafana

Go to the Grafana frontend at http://0.0.0.0:8000 and log in (admin/admin). Then import the 'influxdb-rails' sample dashboards (Overview, per Request, per Action), or export dashboards from obs-measure and import them here.
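Right after booting the stack, Grafana may need a moment before the frontend answers. The sketch below polls Grafana's /api/health endpoint until it responds. The function name wait_for_grafana and the GRAFANA_URL, CURL_BIN and POLL_INTERVAL variables are hypothetical names chosen for illustration. The default URL is the one given above.

```shell
# Hypothetical helper (not part of OBS): wait until Grafana answers.
# GRAFANA_URL, CURL_BIN and POLL_INTERVAL are assumed overrides for illustration.
wait_for_grafana() {
  url="${GRAFANA_URL:-http://0.0.0.0:8000}/api/health"
  attempts="${1:-30}"   # first argument: maximum number of attempts
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f: treat HTTP errors as failure, -sS: no progress meter, keep error messages
    if "${CURL_BIN:-curl}" -fsS "$url" >/dev/null 2>&1; then
      echo "Grafana is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-2}"
  done
  echo "Grafana did not respond after $attempts attempts" >&2
  return 1
}

# Example usage:
#   wait_for_grafana 10 && echo "open http://0.0.0.0:8000 and log in"
```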
