Application Health Monitoring

Henne Vogelsang edited this page Aug 27, 2020 · 44 revisions

About

We collect metrics about the usage of OBS, such as user logins, the creation of packages and projects, and the like.

The monitoring dashboards are hosted at https://obs-measure.opensuse.org/. You can log in with your GitHub account and should get the Editor role. The openSUSE RabbitMQ is running at https://rabbit.opensuse.org/.

Two dashboards are particularly important: the Application Health Overview Dashboard and the Detailed Errors Dashboard.

Application Health Overview Dashboard

This dashboard gives a general overview of the health status of the application. You can tell whether the application is up by looking at the following panels:

Number of successful requests per minute

This panel tracks the application traffic from the application's point of view.

⚠️ If the number of successful requests drops too low, there may be a problem that prevents users from working. An alert is sent when this happens.

Error rates

This panel tracks requests that return an HTTP error status code.

If the number of errors gets too high, something is going wrong in the application. Our exception tracker collects some of these errors.

Authentication Failures

This panel monitors bursts of authentication failures within 10 minutes.
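As a rough illustration of what monitoring a burst within a 10-minute window can mean, the sketch below counts failures in a sliding window. It is not the query behind the panel: the `BurstDetector` class and the threshold value are hypothetical, and timestamps are assumed to arrive in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class BurstDetector:
    """Counts events inside a sliding time window (illustrative only)."""

    def __init__(self, window=timedelta(minutes=10), threshold=5):
        self.window = window
        self.threshold = threshold  # hypothetical value, not taken from the dashboard
        self.events = deque()

    def record(self, timestamp):
        """Register one failure; return True if the window now holds a burst."""
        self.events.append(timestamp)
        # Drop failures that are older than the window.
        while self.events and timestamp - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold


detector = BurstDetector(threshold=3)
start = datetime(2020, 8, 27, 12, 0)
# Three failures in quick succession, then one isolated failure much later.
results = [detector.record(start + timedelta(minutes=m)) for m in (0, 1, 2, 30)]
print(results)
```

Only the third failure is flagged here, because it is the only point at which three failures fall within the same 10 minutes.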

Request State Change

This panel tracks request creation and request state changes.

Projects / min

This panel tracks the number of projects created and destroyed per minute.

Packages / min

This panel tracks the number of packages created and destroyed per minute.

Total projects

This panel tracks the total number of projects that were created and destroyed.

Total packages

This panel tracks the total number of packages that were created and destroyed.

User Creation

This panel tracks the number of users that were created within an hour.

Beta Users

This panel tracks the total number of users who joined and left the beta program.

Detailed Errors Dashboard

This dashboard gives a detailed picture of the errors happening in the application. Each type of error has its own panel.

Internal server errors

This panel tracks 500 Internal Server Error responses, which correspond to unhandled exceptions raised by the application.

⚠️ An alert is sent when there are more than 10 errors per minute for 2 minutes.
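The alert condition can be read as "the per-minute error count stayed above 10 for two consecutive minutes". The sketch below shows that reading in plain Python. It is an interpretation rather than the actual alert rule, and the function name `should_alert` and its parameters are hypothetical.

```python
def should_alert(errors_per_minute, threshold=10, duration=2):
    """Return True if the last `duration` per-minute counts all exceed `threshold`.

    `errors_per_minute` is a list of error counts, one entry per minute,
    oldest first.
    """
    if len(errors_per_minute) < duration:
        # Not enough data yet to cover the whole duration.
        return False
    return all(count > threshold for count in errors_per_minute[-duration:])


print(should_alert([3, 12, 15]))   # both of the last two minutes are above 10
print(should_alert([12, 4, 15]))   # a single spike does not last 2 minutes
```

Requiring the condition to hold for the whole duration keeps a single short spike from triggering the alert.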

Exhaustive details for each exception can be found in our exception tracker.

Other error panels

Implementation

Our Application Health Monitoring (AHM) stack consists of:

RabbitMQ

The metrics we collect are sent to the metrics queue in RabbitMQ.
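To make the shape of such a message concrete, here is a minimal sketch that formats one metric as an InfluxDB line protocol string. This assumes the messages use line protocol, which is a common choice when Telegraf consumes a queue but is not confirmed by this page. The measurement, tag and field names are invented for illustration, and the helper handles only integer and float fields and a measurement name without special characters.

```python
def escape_tag(value):
    """Escape the characters that are special in tag keys and values."""
    return str(value).replace(",", r"\,").replace("=", r"\=").replace(" ", r"\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns=None):
    """Build one line protocol line: measurement,tags fields [timestamp]."""
    tag_part = "".join(
        f",{escape_tag(k)}={escape_tag(v)}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        # Integers carry an "i" suffix; floats are written as they are.
        f"{escape_tag(k)}={v}i" if isinstance(v, int) else f"{escape_tag(k)}={v}"
        for k, v in fields.items()
    )
    line = f"{measurement}{tag_part} {field_part}"
    if timestamp_ns is not None:
        line += f" {timestamp_ns}"
    return line


# Hypothetical metric: one login through a hypothetical "webui" access point.
line = to_line_protocol("login", {"access_point": "webui"}, {"value": 1})
print(line)  # login,access_point=webui value=1i
```

Publishing the resulting string to the queue would need an AMQP client and connection details that are not described here, so that step is left out.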

Some of those metrics are:

Telegraf

Telegraf fetches these metrics using the amqp_consumer input plugin and reports them to InfluxDB using the influxdb output plugin.

InfluxDB

InfluxDB stores the time series data we collect, in a database named telegraf.

Grafana

Grafana is used to create graphs to visualize the collected data.

Development Environment Setup

Instructions for setting up the development environment, including application health monitoring, can be found at Site-Reliability#development-environment.
