
Health Check

Introducing Health Check Tool

Health Check is a tool based on the "A Health Check tool for Uyuni" RFC.

Disconnected setup

The tool takes a supportconfig as input, reflecting the state of the Uyuni server at a certain point in time. This helps engineers and support staff analyze and debug issues.

The Health Check tool is currently disconnected from the Uyuni server. This means that users transfer the supportconfig to an environment that is not connected to the Uyuni server and run the health-check tool there. Health Check currently does not depend on any part of the Uyuni server other than the supportconfig.

Based on the supportconfig, use Health Check to:

  • Search and visualize errors in log files.
  • Visualize the state of the analyzed system.
  • Parse the configuration of the system and detect possibly incorrect configuration values.

Because Health Check is disconnected from a live Uyuni or SUSE Multi-Linux Manager node, all data is based on the supportconfig. When you modify your configuration and want to verify the correctness of the modification, you must:

  • Take a new supportconfig
  • Reprovision the Health Check tool (see the example below)
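
To reprovision against a newer supportconfig, stop the running containers and start them again with the new archive (the path below is a placeholder):

$ mgr-health-check stop
$ mgr-health-check -s /path/to/new/supportconfig start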

Architecture

Health Check tool consists of the following parts:

  • Loki - a log aggregation system that stores the log entries shipped by Promtail.
  • Promtail - a log parser that parses logs from a supportconfig and saves them in Loki.
  • Supportconfig-exporter - a tool that parses the server configuration from supportconfig and serves specified data by using an HTTP server.
  • Grafana - a visualization layer.
  • Manager - code that starts and stops the tool.
Source Code for Diagram
graph LR
    subgraph "Containers"
        promtail["Promptail"] --> loki["Loki"]
        grafana["Grafana"] --> loki
        grafana --> supportconfig-exporter["supportconfig-exporter"]
    end
    subgraph "filesystem"
      supportconfig["supportconfig directory"] --> promtail
      supportconfig --> supportconfig-exporter
    end
    user["User"] --> grafana

Currently, the tool uses the following ports:

Tool                     External Port   Internal Port
Grafana                  3000            3000
Loki                     3100            3100
Promtail                 9081            9081
supportconfig_exporter   9000            9000
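
As a quick sanity check after startup (assuming the default localhost bindings above), you can verify that the HTTP endpoints respond; Grafana serves /api/health and Loki serves /ready:

$ curl -s http://localhost:3000/api/health
$ curl -s http://localhost:3100/ready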

Note

All ports bind to localhost only. Use the -p or --public option to bind to all network interfaces:

$ mgr-health-check -s /path/to/supportconfig start -p

In such cases, be careful not to expose sensitive supportconfig data online.

Installation Process

From https://build.opensuse.org/project/show/systemsmanagement:Uyuni:healthcheck:Stable, add the repository that corresponds to your operating system and install the mgr-health-check package.

For example, in openSUSE Tumbleweed:

# zypper ar https://download.opensuse.org/repositories/systemsmanagement:/Uyuni:/healthcheck:/Stable/openSUSE_Tumbleweed/systemsmanagement:Uyuni:healthcheck:Stable.repo
# zypper ref
# zypper in mgr-health-check
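
To confirm that the package is installed (an optional quick check):

$ rpm -q mgr-health-check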

For other operating systems, add the corresponding repository from the project page linked above.

Usage

To start the tool, first generate a supportconfig from your Uyuni or SUSE Multi-Linux Manager server.

The supportconfig should be generated either from inside the "uyuni-server" container using the supportconfig command, or from the server host using mgradm support config.

Keep in mind that when you use mgradm support config to generate a supportconfig, the generated tarball contains a set of other tarballs inside it: one supportconfig from the server host itself and another from inside the "uyuni-server" container.

Be sure to pass the extracted "uyuni-server" container supportconfig when calling mgr-health-check.
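
A minimal extraction sketch, assuming example archive names (the actual filenames generated on your system will differ): unpack the outer tarball produced by mgradm, unpack the "uyuni-server" container supportconfig found inside it, and pass the resulting directory to mgr-health-check.

$ tar xf supportconfig.tar.gz
$ tar xf scc_uyuni-server.txz
$ mgr-health-check -s ./scc_uyuni-server start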

Note

The Health Check tool requires the SUSE Multi-Linux Manager supportconfig plugin for a large part of its functionality. Ensure you have installed the SUSE Multi-Linux Manager supportconfig plugin before generating the supportconfig.

Then, use the start command to start the Health Check tool:

$ mgr-health-check -s /path/to/supportconfig start

The previous command configures Grafana to filter logs from the past 7 days by default. Use the --since parameter to modify the default time window.

The following command configures the default Grafana window to one year:

$ mgr-health-check -s /path/to/supportconfig start --since 365
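
Alternatively, use the --from_datetime and --to_datetime options (see the start help below) to define an explicit window with ISO 8601 dates; the dates here are only an illustration:

$ mgr-health-check -s /path/to/supportconfig start --from_datetime 2025-01-01 --to_datetime 2025-03-31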

See the full help:

$ mgr-health-check --help
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                 Health Check                                 ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Usage: mgr-health-check [OPTIONS] COMMAND [ARGS]...

Options:
  -s, --supportconfig_path TEXT  Path to supportconfig path as the data source
  -v, --verbose                  Show more stdout, including image building
  --help                         Show this message and exit.

Commands:
  clean  Remove images for Health Check containers
  start  Start execution of Health Check
  stop   Stop execution of Health Check and remove containers

And the help of each command, for example:

$ mgr-health-check start --help
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                                Health Check                                ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Usage: mgr-health-check start [OPTIONS]

  Start execution of Health Check

  Build the necessary containers, deploy them, get the metrics and display
  them

Options:
  --since INTEGER       Show logs from last X days (default: 7)
  --from_datetime TEXT  Look for logs from this date (in ISO 8601 format)
  --to_datetime TEXT    Exclude logs after this date (in ISO 8601 format)
  -p, --public          Expose ports on all interfaces (0.0.0.0)
  --help                Show this message and exit.

Navigating Grafana

When Health Check finishes deploying its containers, open localhost:3000 in your web browser to see the Grafana interface.

Note

If you need to log in, use the admin:admin user and password credentials.

In the left Home menu, click Dashboards, then click Supportconfig with Logs. This is a pre-provisioned dashboard that contains data about logs, errors, and configuration parsed from the provided supportconfig directory.

If you see any alerts firing, click View alert rule next to the alert, then click Details. Alerts typically have a summary that explains why the alert is firing.

To stop the Health Check containers and remove them, run:

$ mgr-health-check stop

FAQ and Troubleshooting

I don't see any logs in Grafana

There are multiple causes for this issue, for example:

  • Promtail hasn't finished parsing the logs yet (see the check after this list).
  • Grafana hasn't finished streaming the logs from Loki yet.
  • Grafana's time window filters out the errors, for example most errors have a year-old timestamp while you are looking at a time window of 6 months.
  • There are no logs with error or warning in the supportconfig.
  • There is a bug in our collecting or filtering mechanisms.
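
To rule out the first two causes, you can inspect the containers directly; the container name health_check_promtail matches the one used later on this page, adjust if yours differs:

$ podman ps --format "{{.Names}}\t{{.Status}}"
$ podman logs --tail 20 health_check_promtail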

Note

Promtail only parses log files specific to Uyuni and SUSE Multi-Linux Manager, such as Salt logs, HTTPd logs, SUSE Multi-Linux Manager specific logs, and similar. Promtail does not parse all log files. For example, journalctl logs are not parsed. It is possible the system is experiencing errors even when Grafana shows no errors.

If you have ruled out all of these causes and an error you found in the supportconfig is still not displayed in Grafana when you believe it should be, please create an issue.

Health Check consumes too many system resources

This is a known problem. Several parts of Health Check can be resource intensive. Use the podman stats command to identify which part of Health Check consumes the most resources, for example:

$ podman stats --format "{{.Name}}\t{{.MemUsage}}\t{{.AVGCPU}}"
health-check-grafana			0B / 32.35GB	0.98%
health_check_loki			0B / 32.35GB	11.95%
health_check_supportconfig_exporter	0B / 32.35GB	0.04%
...output omitted...

Promtail Resource Consumption

Currently, Promtail contains a memory leak that eventually requires a restart of the Promtail container.

To work around this problem, after you start seeing logs and other data from the time frame in which you are interested, you can stop the Promtail container:

$ podman stop health_check_promtail

You can start the container again later if you need Promtail to resume parsing logs:

$ podman start health_check_promtail