This document provides an overview of AIStore (AIS) observability features, tools, and practices. AIS offers comprehensive observability through logs, metrics, and a CLI interface, enabling users to monitor, debug, and optimize their deployments.
AIS provides multiple layers of observability:
```
┌─────────────────────────────────┐
│       Visualization Layer       │
│  ┌───────────┐   ┌───────────┐  │
│  │  Grafana  │   │  Custom   │  │
│  │ Dashboard │   │    UIs    │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│        Collection Layer         │
│  ┌───────────┐   ┌───────────┐  │
│  │ Prometheus│   │  StatsD*  │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│      Instrumentation Layer      │
│  ┌───────────┐   ┌───────────┐  │
│  │  Metrics  │   │   Logs    │  │
│  │ Endpoints │   │           │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│          Access Layer           │
│  ┌───────────┐   ┌───────────┐  │
│  │    CLI    │   │   REST    │  │
│  │ Interface │   │   APIs    │  │
│  └───────────┘   └───────────┘  │
└─────────────────────────────────┘
```
(*) StatsD support will likely be removed in late 2025.
AIS began with StatsD for metrics collection but has evolved to primarily use Prometheus. Key points about this transition:
- Prometheus (and Grafana) is now the recommended monitoring system
- All new metric implementations use Prometheus exclusively
- The transition provides better scalability, more detailed metrics, variable labels for advanced filtering, and improved integration with modern observability stacks
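To scrape AIS metrics with Prometheus, point a scrape job at the cluster nodes. The sketch below is illustrative, not a definitive configuration: the job name, hostnames, and port are placeholders for your deployment's actual addresses.

```yaml
# Illustrative Prometheus scrape configuration for an AIS cluster.
# Hostnames and port are placeholders -- substitute your node addresses.
scrape_configs:
  - job_name: 'aistore'
    metrics_path: /metrics
    static_configs:
      - targets: ['ais-target-0:8081', 'ais-target-1:8081']
```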
| Method | Description | Use Cases | Documentation |
|---|---|---|---|
| CLI | Command-line tools for monitoring and troubleshooting | Quick checks, diagnostics, interactive troubleshooting | Observability: CLI |
| Logs | Detailed event logs with configurable verbosity | Debugging, audit trails, understanding system behavior | Observability: Logs |
| Prometheus | Time-series metrics exposed via HTTP endpoints | Performance monitoring, alerting, trend analysis | Observability: Prometheus |
| Metrics Reference | Metric groups, names, and descriptions | Quick search for a specific metric | Observability: Metrics Reference |
| Grafana | Visualization dashboards for AIS metrics | Visual monitoring, sharing operational status | Observability: Grafana |
| Kubernetes | Observability in Kubernetes deployments | Working with Kubernetes monitoring stacks | Observability: Kubernetes |
For Kubernetes deployments, AIS provides additional observability features designed to integrate with Kubernetes monitoring stacks.
There's a dedicated (and separate) GitHub repository that, in particular, provides Helm charts for AIS Cluster monitoring.
See the Kubernetes Observability document for details.
AIS exposes metrics across several categories:
- Cluster Health: Node status, membership changes
- Resource Usage: CPU, memory, disk utilization
- Performance: Throughput, latency, error counts
- Storage Operations: GET/PUT rates, object counts, error counts
- Errors: Network errors ("broken pipe", "connection reset"), timeouts ("deadline exceeded"), retries ("too-many-requests"), disk faults, OOM, out-of-space, and more
In addition, all supported jobs that read or write data report their progress in terms of object and byte counts.
Briefly, two CLI examples:
```console
$ ais performance latency --refresh 10 --regex get
```
| TARGET | AWS-GET(n) | AWS-GET(t) | GET(n) | GET(t) | GET(total/avg size) | RATELIM-RETRY-GET(n) | RATELIM-RETRY-GET(t) |
|:------:|:----------:|:----------:|:------:|:------:|:--------------------:|:---------------------:|:---------------------:|
| T1 | 800 | 180ms | 3200 | 25ms | 12GB / 3.75MB | 50 | 240ms |
| T2 | 1000 | 150ms | 4000 | 28ms | 15GB / 3.75MB | 70 | 230ms |
| T3 | 700 | 200ms | 2800 | 32ms | 10GB / 3.57MB | 40 | 215ms |
- **AWS-GET(n)** / **AWS-GET(t)**: Number and average latency of GET requests that actually hit the AWS backend.
- **GET(n)** / **GET(t)**: Number and average latency of *all* GET requests (including those served from local cache or in-cluster data).
- **GET(total/avg size)**: Approximate total data read and corresponding average object size.
- **RATELIM-RETRY-GET(n)** / **RATELIM-RETRY-GET(t)**: Number and average latency of GET requests retried due to hitting the rate limit.
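The columns in the sample table above are mutually consistent, which makes for a quick sanity check. For target T1, for example (assuming decimal units, 1 GB = 1000 MB):

```python
# Sanity-check the sample numbers for target T1 from the latency table above
# (illustrative values; decimal units assumed: 1 GB = 1000 MB).
total_gets = 3200        # GET(n): all GET requests, including in-cluster hits
backend_gets = 800       # AWS-GET(n): GETs that actually reached the AWS backend
total_bytes_mb = 12_000  # GET total size: 12 GB

avg_size_mb = total_bytes_mb / total_gets        # average object size
cache_hit_ratio = 1 - backend_gets / total_gets  # share served without hitting AWS

print(f"avg size: {avg_size_mb:.2f} MB")          # 3.75 MB -- matches the table
print(f"cache-hit ratio: {cache_hit_ratio:.0%}")  # 75%
```

The same arithmetic explains why overall GET latency (25ms) is far lower than AWS-GET latency (180ms): three out of four requests never leave the cluster.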
```console
$ ais show job prefetch --refresh 10
prefetch-objects[MV4ex8u6h] (run options: prefix:10, workers: 16, parallelism: w[16] chan-full[8,32])
NODE      ID         KIND                BUCKET             OBJECTS  BYTES      START     END  STATE
KactABCD  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  27       27.00MiB   18:28:55  -    Running
XXytEFGH  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  23       23.00MiB   18:28:55  -    Running
YMjtIJKL  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  41       41.00MiB   18:28:55  -    Running
oJXtMNOP  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  34       34.00MiB   18:28:55  -    Running
vWrtQRST  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  23       23.00MiB   18:28:55  -    Running
ybTtUVWX  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  31       31.00MiB   18:28:55  -    Running
                                         Total:             179      179.00MiB             ✓
```
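The cluster-wide totals in the last line are simply the per-node rows summed, as the following check of the sample output confirms:

```python
# Per-node progress rows from the sample `ais show job prefetch` output above.
objects = [27, 23, 41, 34, 23, 31]
bytes_mib = [27.0, 23.0, 41.0, 34.0, 23.0, 31.0]

print(sum(objects))    # 179 -- matches the "Total" line
print(sum(bytes_mib))  # 179.0 (MiB)
```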
- Configure appropriate log levels based on your deployment stage (development or production).
- Set up alerting for critical metrics using Prometheus AlertManager to proactively monitor system health.
- Implement regular dashboard reviews to analyze short- and long-term statistics and identify performance trends.
- View or download logs via Loki. You can also use the CLI commands `ais log` or `ais cluster download-logs` (use `--help` for details) to access logs for troubleshooting and analysis.