@snowmead (Contributor) commented Dec 3, 2025

Client Telemetry Infrastructure

Adds comprehensive Prometheus metrics instrumentation to the StorageHub client, enabling observability into storage provider operations, event handler lifecycle, file transfer throughput, and system resource utilization.

Design

The telemetry system is built around three core components:

  1. StorageHubMetrics - Central metrics registry containing all Prometheus counters, histograms, and gauges
  2. MetricsLink - Optional wrapper that enables zero-cost no-op when metrics are disabled
  3. LifecycleMetricRecorder - Trait for tracking event handler states (pending → success/failure)
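The no-op behavior of `MetricsLink` can be pictured as an `Option` around the registry. A minimal sketch, assuming a hypothetical `StorageHubMetrics` stand-in and a `report` helper (not necessarily the PR's actual API):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the real StorageHubMetrics registry.
pub struct StorageHubMetrics {
    pub storage_requests_total: AtomicU64,
}

/// Optional wrapper: constructed without a registry, every call is a
/// no-op, so disabled metrics cost nothing at call sites.
#[derive(Clone, Default)]
pub struct MetricsLink(Option<Arc<StorageHubMetrics>>);

impl MetricsLink {
    pub fn new(registry: Option<Arc<StorageHubMetrics>>) -> Self {
        MetricsLink(registry)
    }

    /// Run `f` only when metrics are enabled.
    pub fn report(&self, f: impl FnOnce(&StorageHubMetrics)) {
        if let Some(m) = &self.0 {
            f(m);
        }
    }
}
```

Call sites stay branch-free in source: they always call `report`, and the wrapper decides whether anything happens.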

Event Handler Lifecycle Metrics

The LifecycleMetricRecorder trait is injected into event bus listeners via the subscribe_actor_event_map! macro. This enables automatic tracking of:

| Metric | Description |
| --- | --- |
| `event_handler_pending_total` | Events received, handler starting |
| `event_handler_success_total` | Handler completed successfully |
| `event_handler_failure_total` | Handler failed with error |
| `event_handler_duration_seconds` | Handler execution time histogram |
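The recorder behind these metrics can be sketched as a trait with default no-op methods, so implementors override only what they need (the trait and `NoOpMetricRecorder` names mirror the PR; `CountingRecorder` is a hypothetical test double):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Sketch of the lifecycle recorder trait. Default no-op bodies mean a
/// recorder that only cares about, say, successes can skip the rest.
pub trait LifecycleMetricRecorder: Send + Sync {
    fn record_pending(&self, _event: &str) {}
    fn record_success(&self, _event: &str, _duration: Duration) {}
    fn record_failure(&self, _event: &str, _duration: Duration) {}
}

/// Recorder used when metrics are disabled: inherits every default no-op.
pub struct NoOpMetricRecorder;
impl LifecycleMetricRecorder for NoOpMetricRecorder {}

/// Hypothetical test double that counts pending/success transitions.
#[derive(Default)]
pub struct CountingRecorder {
    pub pending: AtomicU64,
    pub success: AtomicU64,
}

impl LifecycleMetricRecorder for CountingRecorder {
    fn record_pending(&self, _event: &str) {
        self.pending.fetch_add(1, Ordering::Relaxed);
    }
    fn record_success(&self, _event: &str, _duration: Duration) {
        self.success.fetch_add(1, Ordering::Relaxed);
    }
}
```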

Usage:

```rust
subscribe_actor_event_map!(
    service: &self.blockchain,
    spawner: &self.task_spawner,
    context: self.clone(),
    metrics: self.metrics.clone(),  // Enables automatic lifecycle metrics
    critical: true,
    [
        NewStorageRequest<Runtime> => BspUploadFileTask,
        ProcessConfirmStoringRequest => BspUploadFileTask,
        // ...
    ]
);
```

When the `metrics` argument is provided, the macro auto-generates an `EventMetricRecorder` for each mapping, with the `event` label derived from the event type name (converted to snake_case).
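The label derivation can be sketched as a small helper (a plausible reconstruction, not the macro's exact code; the real implementation may handle more edge cases):

```rust
/// Derive the `event` metric label from an event type name, e.g.
/// `NewStorageRequest<Runtime>` -> `new_storage_request`.
fn event_label(type_name: &str) -> String {
    // Strip any generic parameters, e.g. the `<Runtime>` suffix.
    let base = type_name.split('<').next().unwrap_or(type_name);
    let mut out = String::new();
    for (i, c) in base.chars().enumerate() {
        if c.is_ascii_uppercase() {
            // Insert an underscore before every uppercase letter except the first.
            if i > 0 {
                out.push('_');
            }
            out.push(c.to_ascii_lowercase());
        } else {
            out.push(c);
        }
    }
    out
}
```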

Task-Level Metrics

Tasks record domain-specific metrics using helper macros:

```rust
inc_counter!(
    metrics: self.storage_hub_handler.metrics(),
    storage_requests_total,
    "bsp",      // provider_type
    "accept"    // status
);

observe_histogram!(
    metrics: self.storage_hub_handler.metrics(),
    file_download_seconds,
    "success",
    download_duration.as_secs_f64()
);
```
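A possible shape for such a helper macro, expanding to a no-op when metrics are disabled. The HashMap-backed `Counters` store is a hypothetical stand-in for the real Prometheus counter vectors, and the real `inc_counter!` in `metrics.rs` may differ:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical counter store standing in for a Prometheus CounterVec.
#[derive(Default)]
pub struct Counters(Mutex<HashMap<(String, Vec<String>), u64>>);

impl Counters {
    pub fn inc(&self, name: &str, labels: &[&str]) {
        let key = (name.to_string(), labels.iter().map(|s| s.to_string()).collect());
        *self.0.lock().unwrap().entry(key).or_insert(0) += 1;
    }
    pub fn get(&self, name: &str, labels: &[&str]) -> u64 {
        let key: (String, Vec<String>) =
            (name.to_string(), labels.iter().map(|s| s.to_string()).collect());
        self.0.lock().unwrap().get(&key).copied().unwrap_or(0)
    }
}

/// Sketch of an `inc_counter!`-style helper: does nothing when the
/// metrics handle is `None`, mirroring the MetricsLink design.
macro_rules! inc_counter {
    (metrics: $m:expr, $name:ident, $($label:expr),+ $(,)?) => {
        if let Some(m) = $m.as_ref() {
            m.inc(stringify!($name), &[$($label),+]);
        }
    };
}
```

Passing the metric name as a bare identifier (turned into a string with `stringify!`) keeps call sites terse while avoiding typos in quoted strings.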

System Resource Metrics

A background task collects system and process resource metrics every 5 seconds using the sysinfo crate:

| Metric | Description |
| --- | --- |
| `storagehub_system_cpu_usage_percent` | System-wide CPU usage (0-100%, averaged across cores) |
| `storagehub_process_cpu_usage_percent` | Process CPU usage (can exceed 100% on multi-core) |
| `storagehub_system_memory_total_bytes` | Total system memory |
| `storagehub_system_memory_used_bytes` | Used system memory |
| `storagehub_system_memory_available_bytes` | Available system memory |
| `storagehub_process_memory_rss_bytes` | Process resident set size (RSS) |

The collector is spawned automatically when MetricsLink::new() is called with a registry, enabling zero-configuration resource monitoring.
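The collection loop can be sketched with a plain background thread and a hypothetical sampler trait standing in for the `sysinfo` calls (the trait, gauge type, and function names here are illustrative, not the PR's actual code):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical sampler; the real collector reads these values via the
/// `sysinfo` crate.
pub trait ResourceSampler: Send + Sync + 'static {
    fn process_rss_bytes(&self) -> u64;
}

/// Stand-in for a Prometheus gauge updated by the background task.
#[derive(Default)]
pub struct RssGauge(pub AtomicU64);

/// Spawn a background thread that samples on each tick (the PR uses a
/// 5-second interval) and updates the gauge. Sketch only: shutdown and
/// error handling are omitted.
pub fn spawn_collector(
    sampler: Arc<dyn ResourceSampler>,
    gauge: Arc<RssGauge>,
    interval: Duration,
) {
    thread::spawn(move || loop {
        gauge.0.store(sampler.process_rss_bytes(), Ordering::Relaxed);
        thread::sleep(interval);
    });
}
```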

Metrics Catalog

Storage Request Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `storage_requests_total` | `provider_type`, `status` | Storage request outcomes (accept/reject/fail) |
| `storage_request_retries_total` | `provider_type` | Proof-related retry attempts |

Proof Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `proofs_submitted_total` | `status` | Proof submission outcomes |
| `proof_submission_duration_seconds` | `status` | Time to submit proofs |

File Transfer Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `bytes_downloaded_total` | `status` | Download throughput |
| `bytes_uploaded_total` | `status` | Upload throughput |
| `chunks_downloaded_total` | `status` | Chunks downloaded |
| `chunks_uploaded_total` | `status` | Chunks uploaded |
| `file_download_seconds` | `status` | File download duration histogram |

Provider Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `slashing_events_total` | `role` | Slashing events by provider role |
| `bucket_move_total` | `status` | Bucket move operations |
| `user_evictions_total` | `status` | Insolvent user evictions |
| `forest_verifications_total` | `status` | Forest verification outcomes |
| `file_deletions_total` | `provider_type`, `status` | File deletion operations |

System Resource Metrics

| Metric | Description |
| --- | --- |
| `storagehub_system_cpu_usage_percent` | System-wide CPU usage percentage |
| `storagehub_process_cpu_usage_percent` | Process CPU usage percentage |
| `storagehub_system_memory_total_bytes` | Total system memory in bytes |
| `storagehub_system_memory_used_bytes` | Used system memory in bytes |
| `storagehub_system_memory_available_bytes` | Available system memory in bytes |
| `storagehub_process_memory_rss_bytes` | Process RSS memory in bytes |

Docker Infrastructure

Prometheus Integration

Each network type has a dedicated Prometheus configuration:

  • docker/bspnet-prometheus.yml - BSPNet scrape configs
  • docker/fullnet-prometheus.yml - FullNet scrape configs (BSPs, MSPs, Fisherman)

Prometheus scrapes metrics from provider containers on port 9615 (Substrate's default Prometheus port). The role label is automatically attached by Prometheus based on the scrape job configuration, enabling per-role filtering in dashboards.

Grafana Dashboards

Pre-built dashboards for each provider role:

| Dashboard | Panels |
| --- | --- |
| `bsp-dashboard.json` | Storage requests, proofs, downloads, event handlers, resource usage |
| `msp-dashboard.json` | Storage requests, uploads, bucket ops, event handlers, resource usage |
| `fisherman-dashboard.json` | Challenge verifications, slashing events, resource usage |

Each dashboard includes a Resource Usage section with:

  • CPU Usage timeseries (process and system)
  • Memory Usage timeseries (process RSS, system used/available/total)
  • Process CPU gauge
  • System CPU gauge
  • Process Memory stat
  • System Memory Usage gauge

Dashboards are automatically provisioned via docker/grafana/provisioning/.

  - Add metrics.rs with StorageHubMetrics struct and helper macros
  - Instrument BSP tasks: upload, proof submission, fees, deletion, bucket moves
  - Instrument MSP tasks: upload, deletion, distribution, fees, bucket moves
  - Instrument fisherman batch deletions and file downloads
  - Add prometheus.yml config and Docker integration
  - Add centralized test/util/prometheus.ts API
  - Add integration tests for all metrics

  Metrics tracked: storage requests, proofs, fees, deletions, bucket moves,
  file transfers, and download operations with status labels and histograms.
…r tracking

- Introduced new macros for metrics incrementing and histogram observation, allowing for cleaner and more consistent metric tracking across various tasks.
- Updated file download manager to utilize new macros for recording successful and failed download metrics.
- Enhanced proof generation task to track timing metrics for both success and failure scenarios.
- Improved storage request handling in upload tasks to increment metrics based on success or failure of confirmations.
- Refactored existing metric tracking code to reduce redundancy and improve readability.
Resolved conflicts by combining metrics instrumentation from feat/telemetry
with improved return messages from main in:
- bsp_upload_file.rs
- msp_delete_bucket.rs
- msp_distribute_file.rs
- msp_upload_file.rs
@snowmead snowmead changed the title feat(client): add Prometheus metrics instrumentation feat(client): Add Prometheus metrics instrumentation Dec 3, 2025
@snowmead snowmead changed the title feat(client): Add Prometheus metrics instrumentation feat: Add Prometheus metrics instrumentation Dec 3, 2025
@snowmead snowmead changed the title feat: Add Prometheus metrics instrumentation feat: add telemetry metrics instrumentation Dec 3, 2025
…some tasks, add telemetry integration package script
@snowmead snowmead added the B5-clientnoteworthy (changes should be mentioned in client-related release notes), D3-trivial 👶 (PR contains trivial changes that do not require an audit), and not-breaking (does not need to be mentioned in breaking changes) labels Dec 3, 2025
@snowmead snowmead requested a review from ffarall December 3, 2025 19:19
snowmead and others added 6 commits December 4, 2025 08:25
…ption

  Add metrics instrumentation for previously uncovered task event handlers:
  - bsp_upload_file: chunk upload success/failure counters
  - msp_retry_bucket_move: retry attempt counters
  - msp_verify_bucket_forests: verification counters with duration histogram
  - msp_stop_storing_insolvent_user: bucket deletion counters
  - sp_slash_provider: slash submission counters

  Update Grafana dashboards with new panels for chunk uploads (BSP) and
  forest verification/retries (MSP).
@snowmead snowmead requested a review from TDemeco December 5, 2025 13:18
snowmead and others added 12 commits December 17, 2025 15:24
Split the shared prometheus.yml into network-specific configs:
- bspnet-prometheus.yml: scrapes BSP and user nodes only
- fullnet-prometheus.yml: scrapes all nodes (BSP, MSPs, user, fisherman, indexer)

Update fullnet-base-template.yml to reference the new fullnet config.
Add Prometheus and Grafana services to bspnet-base-template.yml for
metrics collection during bspnet tests.

- Expose metrics ports: 9615 (BSP), 9618 (user)
- Add sh-prometheus service scraping bspnet nodes
- Add sh-grafana service with pre-configured dashboards
Rename `prometheus` config to `telemetry` and allow it to work with
both bspnet and fullnet (previously fullnet only).

- Add safety checks for optional services before adding --prometheus-external
- Update NetLaunchConfig type documentation
Remove unused helper methods from prometheus API:
- waitForReady (tests use waitForScrape directly)
- assertMetricIncremented, assertMetricAbove, assertMetricEquals
  (tests use getMetricValue with standard assertions)
- metrics record (tests query prometheus directly)

Also update telemetry config documentation to remove fullnet-only note.
Rename storage_request_seconds to storage_request_setup_seconds to
clarify it measures initial request handling (validation, setup),
not file transfer time.

Add new bytes_uploaded_total counter metric to track bytes received
from upload requests. Tracked in BSP's handle_remote_upload_request.

Improve documentation for all metrics with clearer descriptions of
what each metric measures and when it is incremented.

Update Grafana dashboard and TypeScript metric definitions accordingly.
Remove specialized metrics tests that had complex setup requirements
and were difficult to maintain:
- metrics-bucket-move.test.ts
- metrics-deletion.test.ts
- metrics-fees.test.ts
- metrics-fisherman.test.ts
- metrics-proofs.test.ts
- metrics-validation.test.ts

Keep metrics-basic.test.ts for core metric verification and add
metrics-bspnet.test.ts for testing telemetry in the bspnet environment.

Update metrics-basic.test.ts for the renamed storage_request_setup_seconds
metric.
Move the msp_storage_requests_total metric recording to occur after
checking the extrinsic dispatch result rather than after submission.
Success is now only recorded when no ExtrinsicFailed event is found,
and failure is recorded for submission errors, missing events, or
dispatch errors.
snowmead and others added 19 commits December 18, 2025 11:43
…work

Move metrics collection from individual task files into the event bus
infrastructure. The EventBusListener now automatically records lifecycle
metrics (pending/success/failure counts and duration) for all registered
event handlers, eliminating repetitive inc_counter! calls across tasks.

Key changes:
- Add LifecycleMetricRecorder trait and EventMetricRecorder implementation
- Extend subscribe_actor_event_map! macro with metrics parameter
- Replace 20+ task-specific counter metrics with unified event_handler_total
  and event_handler_seconds metrics labeled by handler name
- Update Grafana dashboards to use new centralized metric queries
- Remove duplicate metrics-bspnet test (consolidated into metrics-fullnet)
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Resolved modify/delete conflict by removing msp_verify_bucket_forests.rs
(task was removed in main as it's now handled directly)
Remove the msp_forest_verification_seconds metric and related dashboard
panels since the MspVerifyBucketForestsTask was removed in main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add CPU and memory monitoring using sysinfo crate:
- System-wide CPU usage percentage
- Process CPU usage percentage
- System memory (total, used, available)
- Process memory (RSS)

Background task collects metrics every 5 seconds.

Dashboard fixes:
- Fix JSON escaping for role label queries (107 instances)
- Remove max:100 constraint from process CPU panels (can exceed 100% on multi-core)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ribe_actor_event_map

Extract metric recorder expression into a variable to eliminate code duplication,
always passing it to subscribe_actor_event! macro instead of duplicating the
entire macro invocation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…tricRecorder

Add default no-op implementations to LifecycleMetricRecorder trait methods,
allowing implementors to only override the methods they need. Rename
NoMetricRecorder to NoOpMetricRecorder for clarity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Let subscribe_actor_event! use its internal NoOpMetricRecorder default
instead of explicitly passing it from subscribe_actor_event_map!.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove the storage request setup duration histogram as it provides
limited value. The event handler lifecycle metrics already track
overall request handling time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…s catalog

- Fix metric queries to use `event` label instead of `handler`
- Event names derived from event type (e.g., NewStorageRequest -> new_storage_request)
- Update ALL_STORAGEHUB_METRICS to match actual metrics in client/src/metrics.rs
- Add missing system resource gauges and event handler metrics
- Remove non-existent BSP/MSP counter metrics from catalog

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Check event handler histogram metrics for the main events triggered
during batch file uploads:
- new_storage_request
- remote_upload_request
- process_confirm_storing_request
- process_msp_respond_storing_request

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Update test to explicitly check both status labels:
- pending: incremented when event is received
- success: incremented when handler completes successfully

Also verify pending >= success (every success was first pending).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Move the histogram observation to where generate_key_proof is called,
eliminating the wrapper function pattern. This is cleaner and more
explicit about when timing occurs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Function was only used once, so inline its logic directly where the
base type name is extracted from the event type.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>