This document provides an overview of AIStore (AIS) observability features, tools, and practices. AIS offers comprehensive observability through logs, metrics, and a CLI interface, enabling users to monitor, debug, and optimize their deployments.
AIS provides multiple layers of observability:
```
┌─────────────────────────────────┐
│       Visualization Layer       │
│  ┌───────────┐   ┌───────────┐  │
│  │  Grafana  │   │  Custom   │  │
│  │ Dashboard │   │    UIs    │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│        Collection Layer         │
│  ┌───────────┐   ┌───────────┐  │
│  │ Prometheus│   │  StatsD*  │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│      Instrumentation Layer      │
│  ┌───────────┐   ┌───────────┐  │
│  │  Metrics  │   │   Logs    │  │
│  │ Endpoints │   │           │  │
│  └───────────┘   └───────────┘  │
├─────────────────────────────────┤
│          Access Layer           │
│  ┌───────────┐   ┌───────────┐  │
│  │    CLI    │   │   REST    │  │
│  │ Interface │   │   APIs    │  │
│  └───────────┘   └───────────┘  │
└─────────────────────────────────┘
```
(*) StatsD support will likely be removed in late 2025.
AIS began with StatsD for metrics collection but has evolved to primarily use Prometheus. Key points about this transition:
- Prometheus (and Grafana) is now the recommended monitoring system
- All new metric implementations use Prometheus exclusively
- The transition provides better scalability, more detailed metrics, variable labels for advanced filtering, and improved integration with modern observability stacks
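To scrape AIS metrics with Prometheus, point a scrape job at the cluster nodes. The sketch below is illustrative, not a definitive configuration: the job name, hostnames, and port are placeholders for your deployment's actual addresses.

```yaml
# Illustrative Prometheus scrape configuration for an AIS cluster.
# Hostnames and port are placeholders -- substitute your node addresses.
scrape_configs:
  - job_name: 'aistore'
    metrics_path: /metrics
    static_configs:
      - targets: ['ais-target-0:8081', 'ais-target-1:8081']
```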
| Method | Description | Use Cases | Documentation |
|---|---|---|---|
| CLI | Command-line tools for monitoring and troubleshooting | Quick checks, diagnostics, interactive troubleshooting | Observability: CLI |
| Logs | Detailed event logs with configurable verbosity | Debugging, audit trails, understanding system behavior | Observability: Logs |
| Prometheus | Time-series metrics exposed via HTTP endpoints | Performance monitoring, alerting, trend analysis | Observability: Prometheus |
| Metrics Reference | Metric groups, names, and descriptions | Quick search for a specific metric | Observability: Metrics Reference |
| Grafana | Visualization dashboards for AIS metrics | Visual monitoring, sharing operational status | Observability: Grafana |
| Kubernetes | Observability in Kubernetes deployments | Working with Kubernetes monitoring stacks | Observability: Kubernetes |
For Kubernetes deployments, AIS provides additional observability features designed to integrate with Kubernetes monitoring stacks.
There's a dedicated (and separate) GitHub repository that, in particular, provides Helm charts for AIS Cluster monitoring.
See the Kubernetes Observability document for details.
AIS exposes metrics across several categories:
- Cluster Health: Node status, membership changes
- Resource Usage: CPU, memory, disk utilization
- Performance: Throughput, latency, error counts
- Storage Operations: GET/PUT rates, object counts, error counts
- Errors: Network errors ("broken pipe", "connection reset"), timeouts ("deadline exceeded"), retries ("too-many-requests"), disk faults, OOM, out-of-space, and more
In addition, all supported jobs that read or write data report their progress in terms of object and byte counts.
Briefly, two CLI examples:
```console
$ ais performance latency --refresh 10 --regex get
```
| TARGET | AWS-GET(n) | AWS-GET(t) | GET(n) | GET(t) | GET(total/avg size) | RATELIM-RETRY-GET(n) | RATELIM-RETRY-GET(t) |
|:------:|:----------:|:----------:|:------:|:------:|:--------------------:|:---------------------:|:---------------------:|
| T1 | 800 | 180ms | 3200 | 25ms | 12GB / 3.75MB | 50 | 240ms |
| T2 | 1000 | 150ms | 4000 | 28ms | 15GB / 3.75MB | 70 | 230ms |
| T3 | 700 | 200ms | 2800 | 32ms | 10GB / 3.57MB | 40 | 215ms |
- **AWS-GET(n)** / **AWS-GET(t)**: Number and average latency of GET requests that actually hit the AWS backend.
- **GET(n)** / **GET(t)**: Number and average latency of *all* GET requests (including those served from local cache or in-cluster data).
- **GET(total/avg size)**: Approximate total data read and corresponding average object size.
- **RATELIM-RETRY-GET(n)** / **RATELIM-RETRY-GET(t)**: Number and average latency of GET requests retried due to hitting the rate limit.
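The columns in the sample table above are mutually consistent, which makes for a quick sanity check. For target T1, for example (assuming decimal units, 1 GB = 1000 MB):

```python
# Sanity-check the sample numbers for target T1 from the latency table above
# (illustrative values; decimal units assumed: 1 GB = 1000 MB).
total_gets = 3200        # GET(n): all GET requests, including in-cluster hits
backend_gets = 800       # AWS-GET(n): GETs that actually reached the AWS backend
total_bytes_mb = 12_000  # GET total size: 12 GB

avg_size_mb = total_bytes_mb / total_gets        # average object size
cache_hit_ratio = 1 - backend_gets / total_gets  # share served without hitting AWS

print(f"avg size: {avg_size_mb:.2f} MB")          # 3.75 MB -- matches the table
print(f"cache-hit ratio: {cache_hit_ratio:.0%}")  # 75%
```

The same arithmetic explains why overall GET latency (25ms) is far lower than AWS-GET latency (180ms): three out of four requests never leave the cluster.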
```console
$ ais show job prefetch --refresh 10
prefetch-objects[MV4ex8u6h] (run options: prefix:10, workers: 16, parallelism: w[16] chan-full[8,32])
NODE      ID         KIND                BUCKET             OBJECTS  BYTES      START     END  STATE
KactABCD  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  27       27.00MiB   18:28:55  -    Running
XXytEFGH  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  23       23.00MiB   18:28:55  -    Running
YMjtIJKL  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  41       41.00MiB   18:28:55  -    Running
oJXtMNOP  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  34       34.00MiB   18:28:55  -    Running
vWrtQRST  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  23       23.00MiB   18:28:55  -    Running
ybTtUVWX  MV4ex8u6h  prefetch-listrange  s3://cloud-bucket  31       31.00MiB   18:28:55  -    Running
                                         Total:             179      179.00MiB             ✓
```
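The cluster-wide totals in the last line are simply the per-node rows summed, as the following check of the sample output confirms:

```python
# Per-node progress rows from the sample `ais show job prefetch` output above.
objects = [27, 23, 41, 34, 23, 31]
bytes_mib = [27.0, 23.0, 41.0, 34.0, 23.0, 31.0]

print(sum(objects))    # 179 -- matches the "Total" line
print(sum(bytes_mib))  # 179.0 (MiB)
```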
- Configure appropriate log levels based on your deployment stage (development or production).
- Set up alerting for critical metrics using Prometheus AlertManager to proactively monitor system health.
- Implement regular dashboard reviews to analyze short- and long-term statistics and identify performance trends.
- View or download logs via Loki. You can also use the CLI commands `ais log` or `ais cluster download-logs` (use `--help` for details) to access logs for troubleshooting and analysis.