
Site Reliability

Dany Marcoux edited this page May 28, 2020 · 19 revisions

This page describes how we keep our reference server, https://build.opensuse.org, running reliably.

Infrastructure Monitoring

On our servers, we use Icinga and a number of monitoring plugins. They send infrastructure performance and health data to an InfluxDB time series database, which we visualize on a Grafana dashboard. This dashboard is not public.
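Checks run by Icinga usually follow the Monitoring Plugins conventions: the exit code carries the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), and the text after `|` in the status line is performance data, which is the part that can end up as metrics in InfluxDB (Icinga 2, for example, can forward it through its InfluxdbWriter feature). The sketch below is a hypothetical disk-usage check; the `disk_check` name, the `/srv` mount, and the 80/90 thresholds are assumptions for illustration, not our actual checks.

```ruby
# Minimal sketch of a check plugin following the Monitoring Plugins
# conventions: the exit code carries the state, and the text after "|"
# is performance data. The disk-usage check and thresholds are hypothetical.

EXIT_CODES = { ok: 0, warning: 1, critical: 2, unknown: 3 }.freeze

# Returns [exit_code, status_line] for a filesystem usage percentage.
def disk_check(label, used_percent, warn: 80, crit: 90)
  if used_percent.nil?
    return [EXIT_CODES[:unknown], "DISK UNKNOWN - no data for #{label}"]
  end

  state =
    if used_percent >= crit
      :critical
    elsif used_percent >= warn
      :warning
    else
      :ok
    end

  # Performance data format: 'label'=value[UOM];warn;crit;min;max
  perfdata = "'#{label}'=#{used_percent}%;#{warn};#{crit};0;100"
  [EXIT_CODES[state], "DISK #{state.to_s.upcase} - #{label} is #{used_percent}% full | #{perfdata}"]
end

code, output = disk_check('/srv', 85)
puts output # DISK WARNING - /srv is 85% full | '/srv'=85%;80;90;0;100
# A real plugin would finish with `exit code` so that Icinga sees the state.
```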

Logging

In our Ruby on Rails app, we use lograge to write condensed, single-line request logs to disk. System logs are forwarded to a central logging server via rsyslog.
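In a Rails app, lograge is switched on with `config.lograge.enabled = true`, after which each request is logged as one line instead of Rails' multi-line default. The plain-Ruby sketch below only illustrates what such a key=value line looks like; it is not lograge's implementation, and the `key_value_line` helper and the request values (path, controller, timings) are made up.

```ruby
# Illustrative only: renders one request as a single key=value line, similar
# in spirit to lograge's default output. This is not lograge's implementation,
# and the request values below are made up.

def key_value_line(data)
  data.map do |key, value|
    # Timings (ms) are rounded to two decimals to keep the line short.
    value = value.round(2) if value.is_a?(Float)
    "#{key}=#{value}"
  end.join(' ')
end

request = {
  method: 'GET', path: '/project/show/home:example', format: 'html',
  controller: 'ProjectController', action: 'show',
  status: 200, duration: 58.333, view: 40.4321, db: 15.26
}
puts key_value_line(request)
# method=GET path=/project/show/home:example format=html controller=ProjectController action=show status=200 duration=58.33 view=40.43 db=15.26
```

One line per request keeps the logs easy to grep and to ship elsewhere.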

Application Performance Monitoring

Inside our Ruby on Rails app, we use influxdb-rails, which sends performance data to an InfluxDB time series database. This data is visualized on a Grafana dashboard reachable at https://obs-measure.opensuse.org.
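InfluxDB ingests points in its text-based line protocol: `measurement[,tag=value...] field=value[,...] [timestamp]`. The sketch below serializes one point by hand to show the format and its escaping rules (backslash-escape commas and spaces in measurement names, plus equals signs in tag keys, tag values, and field keys; quote string fields; suffix integers with `i`). The `line_protocol` helper and the `request_timing` measurement with its tags and fields are hypothetical; influxdb-rails builds and sends its own points.

```ruby
# Minimal sketch of InfluxDB line protocol serialization. Measurement, tag,
# and field names used below are hypothetical.

def escape_measurement(name)
  name.to_s.gsub(/[, ]/) { |c| "\\#{c}" }
end

# Tag keys, tag values, and field keys escape commas, equals signs, and spaces.
def escape_key(key)
  key.to_s.gsub(/[,= ]/) { |c| "\\#{c}" }
end

def field_value(value)
  case value
  when Integer then "#{value}i" # integers carry an "i" suffix
  when Float, true, false then value.to_s
  else
    # String fields are double-quoted, with quotes and backslashes escaped.
    escaped = value.to_s.gsub(/["\\]/) { |c| "\\#{c}" }
    "\"#{escaped}\""
  end
end

# fields must contain at least one entry; timestamp_ns is in nanoseconds.
def line_protocol(measurement, tags:, fields:, timestamp_ns: nil)
  line = escape_measurement(measurement)
  # InfluxDB recommends sorting tags by key for better write performance.
  tags.sort_by { |k, _| k.to_s }.each { |k, v| line << ",#{escape_key(k)}=#{escape_key(v)}" }
  line << ' ' << fields.map { |k, v| "#{escape_key(k)}=#{field_value(v)}" }.join(',')
  line << " #{timestamp_ns}" if timestamp_ns
  line
end

puts line_protocol('request_timing',
                   tags: { controller: 'ProjectController', action: 'show' },
                   fields: { duration_ms: 52.3, status: 200 },
                   timestamp_ns: 1_590_624_000_000_000_000)
# request_timing,action=show,controller=ProjectController duration_ms=52.3,status=200i 1590624000000000000
```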

Application Health Monitoring (Telemetry)

Inside our Ruby on Rails app, we use bunny to send telemetry to a RabbitMQ message broker. A Telegraf agent reads the telemetry from the broker and stores it in an InfluxDB time series database, which is visualized on a Grafana dashboard reachable at https://obs-measure.opensuse.org.
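A minimal sketch of the publishing side, assuming each message is a single InfluxDB line-protocol string. The `TelemetryPublisher` class, the `metrics` routing key, the `telemetry` exchange name, the `AMQP_URL` variable, and the measurement names are all hypothetical. The exchange is injected so the class can run against Bunny in production (via `Bunny::Exchange#publish(payload, routing_key: ...)`) or against an in-memory fake, as shown here.

```ruby
# Minimal sketch of a telemetry publisher. In production the exchange could be
# a Bunny exchange, for example:
#
#   require 'bunny'
#   connection = Bunny.new(ENV.fetch('AMQP_URL'))            # hypothetical env variable
#   connection.start
#   exchange = connection.create_channel.topic('telemetry')  # hypothetical name
#
# The routing key and the one-line-protocol-string-per-message format are
# assumptions, not taken from the OBS code base.

class TelemetryPublisher
  ROUTING_KEY = 'metrics' # hypothetical

  def initialize(exchange)
    @exchange = exchange
  end

  # Publishes a counter-style point such as "build.finished,arch=x86_64 value=1i".
  # Tag values are not escaped, so this assumes simple values without spaces or commas.
  def count(measurement, tags = {})
    tag_part = tags.sort_by { |k, _| k.to_s }.map { |k, v| ",#{k}=#{v}" }.join
    @exchange.publish("#{measurement}#{tag_part} value=1i", routing_key: ROUTING_KEY)
  end
end

# In-memory stand-in for a RabbitMQ exchange, useful in tests.
class FakeExchange
  attr_reader :messages

  def initialize
    @messages = []
  end

  def publish(payload, routing_key:)
    @messages << [routing_key, payload]
  end
end

exchange = FakeExchange.new
TelemetryPublisher.new(exchange).count('package.created', project: 'home:example')
p exchange.messages
# [["metrics", "package.created,project=home:example value=1i"]]
```

On the consuming side, Telegraf's `amqp_consumer` input plugin can read such messages when configured with `data_format = "influx"` and write them on to InfluxDB.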

Exception Tracking

Inside our Ruby on Rails app, we use airbrake, which sends application exceptions to an Errbit error catcher service at https://errbit-opensuse.herokuapp.com.
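Errbit accepts notices in the Airbrake API format, so the airbrake gem can target it by setting `host` (together with `project_id` and `project_key`) in `Airbrake.configure`. The sketch below only illustrates the kind of data that reaches Errbit. It turns an exception into a hash shaped loosely like an Airbrake v3 notice: an `errors` array with `type`, `message`, and `backtrace` frames, plus a `context` object. It is not the gem's code, and `build_notice`, `parse_frame`, and the `component` context key are hypothetical.

```ruby
require 'json'

# Minimal sketch: turns a Ruby exception into a hash shaped loosely like an
# Airbrake v3 notice. The airbrake gem builds and sends real notices itself.

# Matches "file.rb:12:in `method'" and the Ruby 3.4+ form "file.rb:12:in 'Class#method'".
FRAME = /\A(?<file>.+?):(?<line>\d+)(?::in [`'](?<function>.*)')?\z/

def parse_frame(frame)
  m = FRAME.match(frame)
  return { file: frame, line: nil, function: nil } unless m

  { file: m[:file], line: m[:line].to_i, function: m[:function] }
end

def build_notice(exception, context = {})
  {
    errors: [{
      type: exception.class.name,
      message: exception.message,
      backtrace: (exception.backtrace || []).map { |frame| parse_frame(frame) }
    }],
    context: context
  }
end

begin
  Integer('not a number')
rescue ArgumentError => e
  puts JSON.generate(build_notice(e, component: 'example'))
end
```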

Web Analytics

Tracing

Incident Management
