Application Health Monitoring

Dani Donisa edited this page Jul 31, 2020 · 44 revisions

We are going to collect metrics about the usage of OBS, such as user logins, the creation of packages and projects, and the like. Since we are planning to refresh the OBS UI and might change and improve workflows in OBS, we want to be able to track whether those changes have any negative or positive effects.

Architectural overview

Our Application Health Monitoring (AHM) stack consists of:

RabbitMQ

The metrics we collect are sent to the metrics queue in RabbitMQ.

Telegraf

Telegraf fetches these metrics from RabbitMQ and writes them to InfluxDB.

InfluxDB

InfluxDB stores the time series data we collect in the telegraf database.

Grafana

Grafana is used to create graphs to visualize the collected data.
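To make the data flow more concrete, here is a minimal sketch of what a single metric might look like when it is published to the metrics queue. It assumes the metrics are encoded in InfluxDB line protocol, a format Telegraf can parse. The measurement, tag, and field names (login, interface, count) and the helper format_metric are made up for illustration and are not taken from the OBS code.

```shell
#!/usr/bin/env bash
# Format one data point in InfluxDB line protocol:
#   <measurement>,<tag_set> <field_set> <timestamp_in_nanoseconds>
# Assumption: all names and values below are hypothetical examples.
format_metric() {
  local measurement="$1" tags="$2" fields="$3" timestamp_ns="$4"
  printf '%s,%s %s %s\n' "$measurement" "$tags" "$fields" "$timestamp_ns"
}

# A hypothetical "one login through the webui" data point.
format_metric "login" "interface=webui" "count=1i" "1596153600000000000"
# prints: login,interface=webui count=1i 1596153600000000000
```

Tags (here interface) are indexed by InfluxDB and are what a Grafana selector would typically filter on, while fields carry the measured values.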

Development setup

Instructions for setting up the development environment for AHM can be found in our Docker documentation.

Prepare and start the container

rake docker:ahm:prepare
docker-compose -f docker-compose.ahm.yml -f docker-compose.yml up

Configure Grafana

Open the Grafana frontend at http://localhost:8000 and log in (admin/admin).

Add a new data source with the following settings:

Type: InfluxDB
URL: http://influx:8086
Database: telegraf
User: grafana
Password: grafana
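If you prefer to script this step, the same data source could probably be created through Grafana's HTTP API instead of the UI. The sketch below only builds the JSON payload from the settings listed above. The data source name "OBS metrics", the "proxy" access mode, and the helper datasource_payload are assumptions, and the exact field layout may differ between Grafana versions.

```shell
#!/usr/bin/env bash
# Print a JSON payload describing the InfluxDB data source.
# Assumption: "name" and "access" are not specified on this page.
datasource_payload() {
  cat <<'JSON'
{
  "name": "OBS metrics",
  "type": "influxdb",
  "access": "proxy",
  "url": "http://influx:8086",
  "database": "telegraf",
  "user": "grafana",
  "secureJsonData": { "password": "grafana" }
}
JSON
}

# Show the payload; nothing is sent anywhere at this point.
datasource_payload

# To actually create the data source in the development setup, one could run:
#   curl -u admin:admin -H 'Content-Type: application/json' \
#        -X POST http://localhost:8000/api/datasources \
#        -d "$(datasource_payload)"
```

Keeping the payload in a function makes it easy to inspect before sending it, which is useful when checking it against the Grafana version in the container.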

Production setup

The Grafana dashboards are hosted at https://obs-measure.opensuse.org/. You can log in with your GitHub account, which should give you the Editor role. The openSUSE RabbitMQ is running at https://rabbit.opensuse.org/.

Health Dashboards

Overview

This dashboard gives a general overview of the health status of the application. You can tell whether the application is up by looking at its panels.

Detailed error panels

This dashboard gives a detailed picture of the errors happening in the application. Each type of error has its own panel.

Performance Dashboards

This dashboard has a selector to choose which interface to show data for: webui or api.

Then the first four panels show:

  • Response time: Mean request response time for the selected interface.
  • SQL Time: Mean query execution time for the selected interface.
  • View Time: Mean view rendering time for the selected interface.
  • Total requests: Total number of requests performed for the selected interface.

Below that:

  • Response time: Tracks the controller response time for every action or request performed in the selected interface, giving three values: max, min and mean.
  • Database: Tracks the database response time for every request performed in the selected interface, giving three values: max, min and mean.
  • View: Tracks the rendering time of every view rendered in the selected interface, giving three values: max, min and mean.
  • Requests: List of the 20 most time-consuming requests. Displays controller and action names, as well as the associated Request ID and its maximum response time.
  • Actions: List of all controllers' actions with their corresponding response time (mean, median, and max) and the number of times they were called.
  • SQL: List of all performed SQL queries with their corresponding response time (mean, median, and max) and the number of times they were called.
  • Templates: List of all rendered templates/views with their corresponding response time (mean, median, and max) and the number of times they were called.
  • Backend: List of all the backend calls with their corresponding response time (mean, median, and max) and the number of times they were called.
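As a worked example of the statistics shown in these tables, the sketch below computes the call count, mean, median, and max for a handful of response times. The numbers and the helper summarize are purely illustrative. In the real dashboards these values are computed by InfluxDB and Grafana, not by a script like this.

```shell
#!/usr/bin/env bash
# Read one response time per line from stdin and print summary statistics.
# Assumption: the input values are hypothetical response times in milliseconds.
summarize() {
  LC_ALL=C sort -n | LC_ALL=C awk '
    { values[NR] = $1; sum += $1 }   # input is sorted, so values[NR] is the max
    END {
      if (NR == 0) exit 1            # nothing to summarize
      if (NR % 2) median = values[(NR + 1) / 2]
      else        median = (values[NR / 2] + values[NR / 2 + 1]) / 2
      printf "count=%d mean=%.1f median=%.1f max=%.1f\n", NR, sum / NR, median, values[NR]
    }'
}

printf '%s\n' 120 80 400 100 | summarize
# prints: count=4 mean=175.0 median=110.0 max=400.0
```

The single slow call (400) pulls the mean well above the median, which is why the tables show both values next to the max.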