@snowmead (Contributor) commented Dec 3, 2025

Client Telemetry Infrastructure

Adds comprehensive Prometheus metrics instrumentation to the StorageHub client, enabling observability into storage provider operations, event handler lifecycle, file transfer throughput, and system resource utilization.

Design

The telemetry system is built around three core components:

  1. StorageHubMetrics - Central metrics registry containing all Prometheus counters, histograms, and gauges
  2. MetricsLink - Optional wrapper that enables zero-cost no-op when metrics are disabled
  3. LifecycleMetricRecorder - Trait for tracking event handler states (pending → success/failure)
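The no-op behavior of `MetricsLink` can be pictured as an `Option` around the registry. A minimal sketch, assuming a hypothetical `StorageHubMetrics` stand-in and a `report` helper (not necessarily the PR's actual API):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical stand-in for the real StorageHubMetrics registry.
pub struct StorageHubMetrics {
    pub storage_requests_total: AtomicU64,
}

/// Optional wrapper: constructed without a registry, every call is a
/// no-op, so disabled metrics cost nothing at call sites.
#[derive(Clone, Default)]
pub struct MetricsLink(Option<Arc<StorageHubMetrics>>);

impl MetricsLink {
    pub fn new(registry: Option<Arc<StorageHubMetrics>>) -> Self {
        MetricsLink(registry)
    }

    /// Run `f` only when metrics are enabled.
    pub fn report(&self, f: impl FnOnce(&StorageHubMetrics)) {
        if let Some(m) = &self.0 {
            f(m);
        }
    }
}
```

Call sites stay branch-free in source: they always call `report`, and the wrapper decides whether anything happens.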

Event Handler Lifecycle Metrics

The LifecycleMetricRecorder trait is injected into event bus listeners via the subscribe_actor_event_map! macro. This enables automatic tracking of:

| Metric | Description |
| --- | --- |
| `event_handler_pending_total` | Events received, handler starting |
| `event_handler_success_total` | Handler completed successfully |
| `event_handler_failure_total` | Handler failed with error |
| `event_handler_duration_seconds` | Handler execution time histogram |
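The recorder behind these metrics can be sketched as a trait with default no-op methods, so implementors override only what they need (the trait and `NoOpMetricRecorder` names mirror the PR; `CountingRecorder` is a hypothetical test double):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Sketch of the lifecycle recorder trait. Default no-op bodies mean a
/// recorder that only cares about, say, successes can skip the rest.
pub trait LifecycleMetricRecorder: Send + Sync {
    fn record_pending(&self, _event: &str) {}
    fn record_success(&self, _event: &str, _duration: Duration) {}
    fn record_failure(&self, _event: &str, _duration: Duration) {}
}

/// Recorder used when metrics are disabled: inherits every default no-op.
pub struct NoOpMetricRecorder;
impl LifecycleMetricRecorder for NoOpMetricRecorder {}

/// Hypothetical test double that counts pending/success transitions.
#[derive(Default)]
pub struct CountingRecorder {
    pub pending: AtomicU64,
    pub success: AtomicU64,
}

impl LifecycleMetricRecorder for CountingRecorder {
    fn record_pending(&self, _event: &str) {
        self.pending.fetch_add(1, Ordering::Relaxed);
    }
    fn record_success(&self, _event: &str, _duration: Duration) {
        self.success.fetch_add(1, Ordering::Relaxed);
    }
}
```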

Usage:

```rust
subscribe_actor_event_map!(
    service: &self.blockchain,
    spawner: &self.task_spawner,
    context: self.clone(),
    metrics: self.metrics.clone(),  // Enables automatic lifecycle metrics
    critical: true,
    [
        NewStorageRequest<Runtime> => BspUploadFileTask,
        ProcessConfirmStoringRequest => BspUploadFileTask,
        // ...
    ]
);
```

When the `metrics` argument is provided, the macro auto-generates an `EventMetricRecorder` for each mapping, with the `event` label derived from the event type name (converted to snake_case).
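The label derivation can be sketched as a small helper (a plausible reconstruction, not the macro's exact code; the real implementation may handle more edge cases):

```rust
/// Derive the `event` metric label from an event type name, e.g.
/// `NewStorageRequest<Runtime>` -> `new_storage_request`.
fn event_label(type_name: &str) -> String {
    // Strip any generic parameters, e.g. the `<Runtime>` suffix.
    let base = type_name.split('<').next().unwrap_or(type_name);
    let mut out = String::new();
    for (i, c) in base.chars().enumerate() {
        if c.is_ascii_uppercase() {
            // Insert an underscore before every uppercase letter except the first.
            if i > 0 {
                out.push('_');
            }
            out.push(c.to_ascii_lowercase());
        } else {
            out.push(c);
        }
    }
    out
}
```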

Task-Level Metrics

Tasks record domain-specific metrics using helper macros:

```rust
inc_counter!(
    metrics: self.storage_hub_handler.metrics(),
    storage_requests_total,
    "bsp",      // provider_type
    "accept"    // status
);

observe_histogram!(
    metrics: self.storage_hub_handler.metrics(),
    file_download_seconds,
    "success",
    download_duration.as_secs_f64()
);
```
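A possible shape for such a helper macro, expanding to a no-op when metrics are disabled. The HashMap-backed `Counters` store is a hypothetical stand-in for the real Prometheus counter vectors, and the real `inc_counter!` in `metrics.rs` may differ:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical counter store standing in for a Prometheus CounterVec.
#[derive(Default)]
pub struct Counters(Mutex<HashMap<(String, Vec<String>), u64>>);

impl Counters {
    pub fn inc(&self, name: &str, labels: &[&str]) {
        let key = (name.to_string(), labels.iter().map(|s| s.to_string()).collect());
        *self.0.lock().unwrap().entry(key).or_insert(0) += 1;
    }
    pub fn get(&self, name: &str, labels: &[&str]) -> u64 {
        let key: (String, Vec<String>) =
            (name.to_string(), labels.iter().map(|s| s.to_string()).collect());
        self.0.lock().unwrap().get(&key).copied().unwrap_or(0)
    }
}

/// Sketch of an `inc_counter!`-style helper: does nothing when the
/// metrics handle is `None`, mirroring the MetricsLink design.
macro_rules! inc_counter {
    (metrics: $m:expr, $name:ident, $($label:expr),+ $(,)?) => {
        if let Some(m) = $m.as_ref() {
            m.inc(stringify!($name), &[$($label),+]);
        }
    };
}
```

Passing the metric name as a bare identifier (turned into a string with `stringify!`) keeps call sites terse while avoiding typos in quoted strings.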

System Resource Metrics

A background task collects system and process resource metrics every 5 seconds using the sysinfo crate:

| Metric | Description |
| --- | --- |
| `storagehub_system_cpu_usage_percent` | System-wide CPU usage (0-100%, averaged across cores) |
| `storagehub_process_cpu_usage_percent` | Process CPU usage (can exceed 100% on multi-core) |
| `storagehub_system_memory_total_bytes` | Total system memory |
| `storagehub_system_memory_used_bytes` | Used system memory |
| `storagehub_system_memory_available_bytes` | Available system memory |
| `storagehub_process_memory_rss_bytes` | Process resident set size (RSS) |

The collector is spawned automatically when MetricsLink::new() is called with a registry, enabling zero-configuration resource monitoring.
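The collection loop can be sketched with a plain background thread and a hypothetical sampler trait standing in for the `sysinfo` calls (the trait, gauge type, and function names here are illustrative, not the PR's actual code):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical sampler; the real collector reads these values via the
/// `sysinfo` crate.
pub trait ResourceSampler: Send + Sync + 'static {
    fn process_rss_bytes(&self) -> u64;
}

/// Stand-in for a Prometheus gauge updated by the background task.
#[derive(Default)]
pub struct RssGauge(pub AtomicU64);

/// Spawn a background thread that samples on each tick (the PR uses a
/// 5-second interval) and updates the gauge. Sketch only: shutdown and
/// error handling are omitted.
pub fn spawn_collector(
    sampler: Arc<dyn ResourceSampler>,
    gauge: Arc<RssGauge>,
    interval: Duration,
) {
    thread::spawn(move || loop {
        gauge.0.store(sampler.process_rss_bytes(), Ordering::Relaxed);
        thread::sleep(interval);
    });
}
```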

Metrics Catalog

Storage Request Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `storage_requests_total` | `provider_type`, `status` | Storage request outcomes (accept/reject/fail) |
| `storage_request_retries_total` | `provider_type` | Proof-related retry attempts |

Proof Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `proofs_submitted_total` | `status` | Proof submission outcomes |
| `proof_submission_duration_seconds` | `status` | Time to submit proofs |

File Transfer Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `bytes_downloaded_total` | `status` | Download throughput |
| `bytes_uploaded_total` | `status` | Upload throughput |
| `chunks_downloaded_total` | `status` | Chunks downloaded |
| `chunks_uploaded_total` | `status` | Chunks uploaded |
| `file_download_seconds` | `status` | File download duration histogram |

Provider Metrics

| Metric | Labels | Description |
| --- | --- | --- |
| `slashing_events_total` | `role` | Slashing events by provider role |
| `bucket_move_total` | `status` | Bucket move operations |
| `user_evictions_total` | `status` | Insolvent user evictions |
| `forest_verifications_total` | `status` | Forest verification outcomes |
| `file_deletions_total` | `provider_type`, `status` | File deletion operations |

System Resource Metrics

| Metric | Description |
| --- | --- |
| `storagehub_system_cpu_usage_percent` | System-wide CPU usage percentage |
| `storagehub_process_cpu_usage_percent` | Process CPU usage percentage |
| `storagehub_system_memory_total_bytes` | Total system memory in bytes |
| `storagehub_system_memory_used_bytes` | Used system memory in bytes |
| `storagehub_system_memory_available_bytes` | Available system memory in bytes |
| `storagehub_process_memory_rss_bytes` | Process RSS memory in bytes |

Docker Infrastructure

Prometheus Integration

Each network type has a dedicated Prometheus configuration:

  • docker/bspnet-prometheus.yml - BSPNet scrape configs
  • docker/fullnet-prometheus.yml - FullNet scrape configs (BSPs, MSPs, Fisherman)

Prometheus scrapes metrics from provider containers on port 9615 (Substrate's default Prometheus port). The role label is automatically attached by Prometheus based on the scrape job configuration, enabling per-role filtering in dashboards.

Grafana Dashboards

Pre-built dashboards for each provider role:

| Dashboard | Panels |
| --- | --- |
| `bsp-dashboard.json` | Storage requests, proofs, downloads, event handlers, resource usage |
| `msp-dashboard.json` | Storage requests, uploads, bucket ops, event handlers, resource usage |
| `fisherman-dashboard.json` | Challenge verifications, slashing events, resource usage |

Each dashboard includes a Resource Usage section with:

  • CPU Usage timeseries (process and system)
  • Memory Usage timeseries (process RSS, system used/available/total)
  • Process CPU gauge
  • System CPU gauge
  • Process Memory stat
  • System Memory Usage gauge

Dashboards are automatically provisioned via docker/grafana/provisioning/.

  - Add metrics.rs with StorageHubMetrics struct and helper macros
  - Instrument BSP tasks: upload, proof submission, fees, deletion, bucket moves
  - Instrument MSP tasks: upload, deletion, distribution, fees, bucket moves
  - Instrument fisherman batch deletions and file downloads
  - Add prometheus.yml config and Docker integration
  - Add centralized test/util/prometheus.ts API
  - Add integration tests for all metrics

  Metrics tracked: storage requests, proofs, fees, deletions, bucket moves,
  file transfers, and download operations with status labels and histograms.
…r tracking

- Introduced new macros for metrics incrementing and histogram observation, allowing for cleaner and more consistent metric tracking across various tasks.
- Updated file download manager to utilize new macros for recording successful and failed download metrics.
- Enhanced proof generation task to track timing metrics for both success and failure scenarios.
- Improved storage request handling in upload tasks to increment metrics based on success or failure of confirmations.
- Refactored existing metric tracking code to reduce redundancy and improve readability.
Resolved conflicts by combining metrics instrumentation from feat/telemetry
with improved return messages from main in:
- bsp_upload_file.rs
- msp_delete_bucket.rs
- msp_distribute_file.rs
- msp_upload_file.rs
@snowmead snowmead changed the title feat(client): add Prometheus metrics instrumentation feat(client): Add Prometheus metrics instrumentation Dec 3, 2025
@snowmead snowmead changed the title feat(client): Add Prometheus metrics instrumentation feat: Add Prometheus metrics instrumentation Dec 3, 2025
@snowmead snowmead changed the title feat: Add Prometheus metrics instrumentation feat: add telemetry metrics instrumentation Dec 3, 2025
…some tasks, add telemetry integration package script
@snowmead snowmead added the B5-clientnoteworthy (changes should be mentioned in client-related release notes), D3-trivial 👶 (PR contains trivial changes that do not require an audit), and not-breaking (does not need to be mentioned in breaking changes) labels Dec 3, 2025
@snowmead snowmead requested a review from ffarall December 3, 2025 19:19
snowmead and others added 6 commits December 4, 2025 08:25
…ption

  Add metrics instrumentation for previously uncovered task event handlers:
  - bsp_upload_file: chunk upload success/failure counters
  - msp_retry_bucket_move: retry attempt counters
  - msp_verify_bucket_forests: verification counters with duration histogram
  - msp_stop_storing_insolvent_user: bucket deletion counters
  - sp_slash_provider: slash submission counters

  Update Grafana dashboards with new panels for chunk uploads (BSP) and
  forest verification/retries (MSP).
@snowmead snowmead requested a review from TDemeco December 5, 2025 13:18
snowmead and others added 12 commits December 17, 2025 15:24
Split the shared prometheus.yml into network-specific configs:
- bspnet-prometheus.yml: scrapes BSP and user nodes only
- fullnet-prometheus.yml: scrapes all nodes (BSP, MSPs, user, fisherman, indexer)

Update fullnet-base-template.yml to reference the new fullnet config.
Add Prometheus and Grafana services to bspnet-base-template.yml for
metrics collection during bspnet tests.

- Expose metrics ports: 9615 (BSP), 9618 (user)
- Add sh-prometheus service scraping bspnet nodes
- Add sh-grafana service with pre-configured dashboards
Rename `prometheus` config to `telemetry` and allow it to work with
both bspnet and fullnet (previously fullnet only).

- Add safety checks for optional services before adding --prometheus-external
- Update NetLaunchConfig type documentation
Remove unused helper methods from prometheus API:
- waitForReady (tests use waitForScrape directly)
- assertMetricIncremented, assertMetricAbove, assertMetricEquals
  (tests use getMetricValue with standard assertions)
- metrics record (tests query prometheus directly)

Also update telemetry config documentation to remove fullnet-only note.
Rename storage_request_seconds to storage_request_setup_seconds to
clarify it measures initial request handling (validation, setup),
not file transfer time.

Add new bytes_uploaded_total counter metric to track bytes received
from upload requests. Tracked in BSP's handle_remote_upload_request.

Improve documentation for all metrics with clearer descriptions of
what each metric measures and when it is incremented.

Update Grafana dashboard and TypeScript metric definitions accordingly.
Remove specialized metrics tests that had complex setup requirements
and were difficult to maintain:
- metrics-bucket-move.test.ts
- metrics-deletion.test.ts
- metrics-fees.test.ts
- metrics-fisherman.test.ts
- metrics-proofs.test.ts
- metrics-validation.test.ts

Keep metrics-basic.test.ts for core metric verification and add
metrics-bspnet.test.ts for testing telemetry in the bspnet environment.

Update metrics-basic.test.ts for the renamed storage_request_setup_seconds
metric.
Move the msp_storage_requests_total metric recording to occur after
checking the extrinsic dispatch result rather than after submission.
Success is now only recorded when no ExtrinsicFailed event is found,
and failure is recorded for submission errors, missing events, or
dispatch errors.
snowmead and others added 19 commits December 18, 2025 11:43
…work

Move metrics collection from individual task files into the event bus
infrastructure. The EventBusListener now automatically records lifecycle
metrics (pending/success/failure counts and duration) for all registered
event handlers, eliminating repetitive inc_counter! calls across tasks.

Key changes:
- Add LifecycleMetricRecorder trait and EventMetricRecorder implementation
- Extend subscribe_actor_event_map! macro with metrics parameter
- Replace 20+ task-specific counter metrics with unified event_handler_total
  and event_handler_seconds metrics labeled by handler name
- Update Grafana dashboards to use new centralized metric queries
- Remove duplicate metrics-bspnet test (consolidated into metrics-fullnet)
🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Resolved modify/delete conflict by removing msp_verify_bucket_forests.rs
(task was removed in main as it's now handled directly)
Remove the msp_forest_verification_seconds metric and related dashboard
panels since the MspVerifyBucketForestsTask was removed in main.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add CPU and memory monitoring using sysinfo crate:
- System-wide CPU usage percentage
- Process CPU usage percentage
- System memory (total, used, available)
- Process memory (RSS)

Background task collects metrics every 5 seconds.

Dashboard fixes:
- Fix JSON escaping for role label queries (107 instances)
- Remove max:100 constraint from process CPU panels (can exceed 100% on multi-core)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ribe_actor_event_map

Extract metric recorder expression into a variable to eliminate code duplication,
always passing it to subscribe_actor_event! macro instead of duplicating the
entire macro invocation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…tricRecorder

Add default no-op implementations to LifecycleMetricRecorder trait methods,
allowing implementors to only override the methods they need. Rename
NoMetricRecorder to NoOpMetricRecorder for clarity.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Let subscribe_actor_event! use its internal NoOpMetricRecorder default
instead of explicitly passing it from subscribe_actor_event_map!.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Remove the storage request setup duration histogram as it provides
limited value. The event handler lifecycle metrics already track
overall request handling time.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…s catalog

- Fix metric queries to use `event` label instead of `handler`
- Event names derived from event type (e.g., NewStorageRequest -> new_storage_request)
- Update ALL_STORAGEHUB_METRICS to match actual metrics in client/src/metrics.rs
- Add missing system resource gauges and event handler metrics
- Remove non-existent BSP/MSP counter metrics from catalog

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Check event handler histogram metrics for the main events triggered
during batch file uploads:
- new_storage_request
- remote_upload_request
- process_confirm_storing_request
- process_msp_respond_storing_request

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Update test to explicitly check both status labels:
- pending: incremented when event is received
- success: incremented when handler completes successfully

Also verify pending >= success (every success was first pending).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Move the histogram observation to where generate_key_proof is called,
eliminating the wrapper function pattern. This is cleaner and more
explicit about when timing occurs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Function was only used once, so inline its logic directly where the
base type name is extracted from the event type.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>