Site Reliability

Henne Vogelsang edited this page May 28, 2020 · 19 revisions

Here is how we ensure that our reference server https://build.opensuse.org functions reliably.

Infrastructure Monitoring

On our servers we use Icinga and many monitoring-plugins, which send infrastructure performance and health data to an InfluxDB time series database. We visualize that data on a Grafana dashboard. This dashboard is not public.

Logging

In our Ruby on Rails app we use lograge to log to disk. System logs go to a central logging server via rsyslog.
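To illustrate what single-line request logging looks like, here is a minimal sketch in plain Ruby. It is not lograge itself: `format_request_line` is a hypothetical helper, and the field names and values are made-up examples. It only mimics the key=value style that lograge is known for, where one request becomes one log line.

```ruby
# Hypothetical helper (not part of lograge): collapse the details of one
# request into a single key=value line, similar in spirit to lograge output.
def format_request_line(fields)
  fields.map do |key, value|
    # Floats (for example durations in milliseconds) are rounded to two decimals.
    formatted = value.is_a?(Float) ? format('%.2f', value) : value.to_s
    "#{key}=#{formatted}"
  end.join(' ')
end

# Example values are invented for illustration.
line = format_request_line(
  method: 'GET',
  path: '/project/show/openSUSE:Factory',
  status: 200,
  duration: 58.334
)
puts line
```

One line per request keeps the log easy to grep and easy to parse by whatever reads the files from disk.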

Application Performance Monitoring

Inside our Ruby on Rails app we use influxdb-rails, which sends performance data to an InfluxDB time series database. We visualize that data on a Grafana dashboard, reachable at https://obs-measure.opensuse.org

Application Health Monitoring (Telemetry)

Inside our Ruby on Rails app we use bunny to send telemetry to a RabbitMQ message broker. A telegraf server agent reads the telemetry from the broker and stores it in an InfluxDB time series database, which we then visualize on a Grafana dashboard, reachable at https://obs-measure.opensuse.org
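The telemetry path described above can be sketched roughly as follows. This is an assumption-laden illustration, not the app's actual code. It assumes the telemetry is serialized as InfluxDB line protocol (this page does not state the format), the measurement, tag, and routing key names are invented, and `RecordingExchange` is a hypothetical stand-in so the sketch runs without a broker. In a real setup the exchange would be a bunny exchange object, whose `publish(payload, routing_key: ...)` call has the same shape.

```ruby
# Format one field value: integers carry an "i" suffix in line protocol.
# Simplification: no escaping, so values must not contain spaces or commas.
def format_field(value)
  value.is_a?(Integer) ? "#{value}i" : value.to_s
end

# Build one line protocol record: measurement,tags fields timestamp
def to_line_protocol(measurement, tags, fields, timestamp_ns)
  tag_part = tags.sort.map { |key, value| ",#{key}=#{value}" }.join
  field_part = fields.map { |key, value| "#{key}=#{format_field(value)}" }.join(',')
  "#{measurement}#{tag_part} #{field_part} #{timestamp_ns}"
end

# Hand the record to anything that responds to publish(payload, opts),
# which is the shape of the publish call on a bunny exchange.
def publish_telemetry(exchange, routing_key, line)
  exchange.publish(line, routing_key: routing_key)
end

# Hypothetical stand-in for a broker exchange; it only records messages.
class RecordingExchange
  attr_reader :messages

  def initialize
    @messages = []
  end

  def publish(payload, opts = {})
    @messages << [opts[:routing_key], payload]
  end
end

exchange = RecordingExchange.new
line = to_line_protocol('build', { state: 'succeeded', arch: 'x86_64' },
                        { count: 1 }, 1_590_624_000_000_000_000)
publish_telemetry(exchange, 'opensuse.obs.metrics', line) # routing key is invented
puts exchange.messages.inspect
```

Putting a message broker between the app and the database means the app only has to hand off a small message; the agent on the other side does the writing.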

Exception Tracking

Inside our Ruby on Rails app we use airbrake, which sends application exceptions to an errbit error catcher service at https://errbit-opensuse.herokuapp.com
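The general pattern behind exception tracking is "report, then re-raise". A minimal sketch, assuming nothing about how the app is actually wired up: `with_exception_tracking` and `CollectingNotifier` are hypothetical names, and the notifier is any object that responds to `notify(exception)`. The airbrake gem offers `Airbrake.notify` for manual reporting, so `Airbrake` itself could be passed as the notifier where the gem is loaded and configured.

```ruby
# Run a block; if it fails, hand the exception to the notifier and
# re-raise it so normal error handling still applies.
def with_exception_tracking(notifier)
  yield
rescue StandardError => e
  notifier.notify(e)
  raise
end

# Hypothetical stand-in for an error reporting client; it only collects.
class CollectingNotifier
  attr_reader :errors

  def initialize
    @errors = []
  end

  def notify(exception)
    @errors << exception
  end
end

notifier = CollectingNotifier.new
begin
  with_exception_tracking(notifier) { Integer('not a number') }
rescue ArgumentError => e
  puts "reported and re-raised: #{e.class}"
end
puts "collected #{notifier.errors.size} exception(s)"
```

Re-raising matters: the tracker should observe failures, not swallow them.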

Web Analytics

Tracing

Incident Management
