Application Health Monitoring

We are going to collect metrics about the usage of OBS, such as user logins and the creation of packages and projects. Since we are planning to refresh the OBS UI and may change and improve workflows in OBS, we want to be able to track whether those changes have any negative or positive effects.

Architectural overview

Our AHM stack consists of:

RabbitMQ

The metrics we collect are sent to the metrics queue in RabbitMQ.

Telegraf

Telegraf fetches these metrics and reports them to InfluxDB.

InfluxDB

InfluxDB stores the time series data we collect (in the telegraf database).

Grafana

Grafana is used to create graphs to visualize the collected data.
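The first step of this pipeline, turning an event into a metric that can be published to the metrics queue, can be sketched as follows. This is a minimal sketch that assumes the metrics are encoded in the InfluxDB line protocol (measurement, optional tags, fields, optional timestamp), which Telegraf can parse. The measurement, tag and field names used here are hypothetical and are not taken from OBS.

```python
def _escape(value, chars):
    """Backslash-escape the characters that are special in line protocol."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value):
    """Render a field value using line protocol typing rules."""
    if isinstance(value, bool):  # check bool first, it is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'  # strings are double-quoted


def to_line_protocol(measurement, fields, tags=None, timestamp_ns=None):
    """Build one line: measurement[,tag=value...] field=value[,...] [timestamp]."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags or {}):  # sorted tag keys are recommended by InfluxDB
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + format_field(value)
        for key, value in fields.items()
    )
    line = f"{head} {body}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical metric for a user login through the web UI.
print(to_line_protocol("login", {"value": 1}, {"interface": "webui"}))
# login,interface=webui value=1i
```

The resulting string is what would be published to the metrics queue. The publishing step itself is left out because it depends on the RabbitMQ client and the exchange configuration in use.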

Development setup

Instructions for setting up the development environment for AHM can be found in our Docker documentation.

Prepare and start the container

rake docker:ahm:prepare
docker-compose -f docker-compose.ahm.yml -f docker-compose.yml up

Configure Grafana

Go to the Grafana frontend at http://localhost:8000 and log in (admin/admin).

Add a new data source with the following settings:

Type: InfluxDB
URL: http://influx:8086
Database: telegraf
User: grafana
Password: grafana
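The same data source could also be created without the UI. The following is a hedged sketch that assumes Grafana's HTTP API endpoint POST /api/datasources and basic authentication with the development credentials above. The data source name is a hypothetical choice, and the accepted payload fields differ between Grafana versions, so check them against the version in the container before relying on this.

```python
import base64
import json
import urllib.request


def build_datasource_payload(name="telegraf"):
    """Return the data source settings listed above as an API payload.

    The name is a hypothetical label; the other values come from this page.
    """
    return {
        "name": name,
        "type": "influxdb",
        "access": "proxy",  # assumption: Grafana's backend talks to InfluxDB
        "url": "http://influx:8086",
        "database": "telegraf",
        "user": "grafana",
        "password": "grafana",
    }


def build_request(payload, base_url="http://localhost:8000",
                  login="admin", password="admin"):
    """Prepare (but do not send) the POST request that creates the data source."""
    token = base64.b64encode(f"{login}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

With the containers running, passing `build_request(build_datasource_payload())` to `urllib.request.urlopen` would send the request.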

Production setup

The Grafana dashboards are hosted at https://obs-measure.opensuse.org/. You can log in with your GitHub account and should get the Editor role. The openSUSE RabbitMQ is running at https://rabbit.opensuse.org/.

Health Dashboards

Overview

This dashboard gives a general overview of the health status of the application. You can tell whether the application is up by looking at the following panels:

Number of successful requests per minute. ⚠️ It sends an alert when the traffic is too low.
Error rates tracks requests that return an HTTP error status code.
Authentication Failures monitors bursts of authentication failures within 10 minutes.
Request State Change tracks request creation and request state changes.
Projects / min tracks projects created and destroyed within a minute.
Packages / min tracks packages created and destroyed within a minute.
Total project tracks the total number of projects that were created and destroyed.
Total package tracks the total number of packages that were created and destroyed.
User Creation tracks the total number of users that were created within an hour.
Beta Users tracks the total number of users who joined and left the beta program.
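The idea behind watching for a burst of authentication failures within 10 minutes can be illustrated with a sliding window counter. This is only a sketch of the technique, not the query the panel uses. The threshold is a hypothetical value, and timestamps are assumed to be in seconds and to arrive in non-decreasing order.

```python
from collections import deque


class BurstDetector:
    """Count events inside a sliding time window and flag bursts."""

    def __init__(self, window_seconds=600, threshold=50):
        self.window_seconds = window_seconds  # 10 minutes, as in the panel
        self.threshold = threshold            # hypothetical burst size
        self._events = deque()

    def record(self, timestamp):
        """Register one failure and return True if the window holds a burst."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold


detector = BurstDetector(window_seconds=600, threshold=3)
for second in (0, 100, 200):
    burst = detector.record(second)
print(burst)  # True: three failures within 10 minutes
```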

Detailed error panels

This dashboard gives a detailed picture of the errors happening in the application. Each type of error has its own panel:

500 (Internal Server Error): ⚠️ It sends an alert when there are more than 10 errors per minute for 2 minutes.
400 (Bad Request)
401 (Unauthorized)
403 (Forbidden)
404 (Not found) / min
408 (Request Timeout)
422 (Unprocessable Entity)
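The alert condition on the 500 panel (more than 10 errors per minute for 2 minutes) can be expressed as a small predicate over per-minute error counts. This is a sketch of the rule as described here, assuming the counts are already aggregated per minute and ordered from oldest to newest. The actual Grafana alert rule may aggregate differently.

```python
def should_alert(errors_per_minute, threshold=10, duration_minutes=2):
    """Return True if every one of the most recent minutes exceeds the threshold."""
    if len(errors_per_minute) < duration_minutes:
        return False  # not enough data to cover the required duration
    recent = errors_per_minute[-duration_minutes:]
    return all(count > threshold for count in recent)


print(should_alert([3, 12, 15]))  # True: the last two minutes are both above 10
print(should_alert([12, 9]))      # False: the latest minute is below the threshold
```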

Performance Dashboards

This dashboard has a selector to choose which interface to show data from: webui or api.

Then the first four panels show:

Response time: Mean request response time for the selected interface.
SQL Time: Mean query execution time for the selected interface.
View Time: Mean view rendering time for the selected interface.
Total requests: Total number of requests performed for the selected interface.

Below that:

Response time: Tracks the controller response time of every action/request performed in the selected interface, giving three values: max, min and mean.
Database: Tracks the database response time of every request performed in the selected interface, giving three values: max, min and mean.
View: Tracks the rendering time of every view rendered in the selected interface, giving three values: max, min and mean.
Requests: List of the 20 most time-consuming requests. Displays controller and action names, as well as the associated Request ID and its maximum response time.
Actions: List of all controller actions with their corresponding response time (mean, median, and max) and the number of times they were called.
SQL: List of all performed SQL queries with their corresponding response time (mean, median, and max) and the number of times they were called.
Templates: List of all rendered templates/views with their corresponding response time (mean, median, and max) and the number of times they were called.
Backend: List of all the backend calls with their corresponding response time (mean, median, and max) and the number of times they were called.
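The per-item statistics shown in these tables (mean, median, max and call count) and the list of slowest requests can be reproduced from raw timing samples. This is a minimal sketch assuming each sample is a pair of a name and a duration in milliseconds. The action names are hypothetical, and the dashboards compute these values with InfluxDB queries rather than in Python.

```python
import heapq
from collections import defaultdict
from statistics import mean, median


def summarize(samples):
    """Group (name, duration_ms) samples and compute the table columns."""
    grouped = defaultdict(list)
    for name, duration_ms in samples:
        grouped[name].append(duration_ms)
    return {
        name: {
            "mean": mean(durations),
            "median": median(durations),
            "max": max(durations),
            "count": len(durations),
        }
        for name, durations in grouped.items()
    }


def slowest(samples, limit=20):
    """Return the `limit` most time-consuming samples, slowest first."""
    return heapq.nlargest(limit, samples, key=lambda sample: sample[1])


samples = [
    ("webui/package#show", 120.0),   # hypothetical controller#action names
    ("webui/package#show", 80.0),
    ("webui/project#index", 40.0),
]
print(summarize(samples)["webui/package#show"])
# {'mean': 100.0, 'median': 100.0, 'max': 120.0, 'count': 2}
```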

