# feat: add telemetry metrics instrumentation #594
Open: snowmead wants to merge 41 commits into `main` from `feat/telemetry`.
## Conversation
- Add metrics.rs with StorageHubMetrics struct and helper macros
- Instrument BSP tasks: upload, proof submission, fees, deletion, bucket moves
- Instrument MSP tasks: upload, deletion, distribution, fees, bucket moves
- Instrument fisherman batch deletions and file downloads
- Add prometheus.yml config and Docker integration
- Add centralized test/util/prometheus.ts API
- Add integration tests for all metrics

Metrics tracked: storage requests, proofs, fees, deletions, bucket moves, file transfers, and download operations with status labels and histograms.
…r tracking

- Introduced new macros for metrics incrementing and histogram observation, allowing for cleaner and more consistent metric tracking across various tasks.
- Updated file download manager to utilize new macros for recording successful and failed download metrics.
- Enhanced proof generation task to track timing metrics for both success and failure scenarios.
- Improved storage request handling in upload tasks to increment metrics based on success or failure of confirmations.
- Refactored existing metric tracking code to reduce redundancy and improve readability.
Resolved conflicts by combining metrics instrumentation from feat/telemetry with improved return messages from main in:

- bsp_upload_file.rs
- msp_delete_bucket.rs
- msp_distribute_file.rs
- msp_upload_file.rs
…some tasks, add telemetry integration package script
…ption

Add metrics instrumentation for previously uncovered task event handlers:

- bsp_upload_file: chunk upload success/failure counters
- msp_retry_bucket_move: retry attempt counters
- msp_verify_bucket_forests: verification counters with duration histogram
- msp_stop_storing_insolvent_user: bucket deletion counters
- sp_slash_provider: slash submission counters

Update Grafana dashboards with new panels for chunk uploads (BSP) and forest verification/retries (MSP).
Split the shared prometheus.yml into network-specific configs:

- bspnet-prometheus.yml: scrapes BSP and user nodes only
- fullnet-prometheus.yml: scrapes all nodes (BSP, MSPs, user, fisherman, indexer)

Update fullnet-base-template.yml to reference the new fullnet config.
Add Prometheus and Grafana services to bspnet-base-template.yml for metrics collection during bspnet tests.

- Expose metrics ports: 9615 (BSP), 9618 (user)
- Add sh-prometheus service scraping bspnet nodes
- Add sh-grafana service with pre-configured dashboards
Rename `prometheus` config to `telemetry` and allow it to work with both bspnet and fullnet (previously fullnet only).

- Add safety checks for optional services before adding --prometheus-external
- Update NetLaunchConfig type documentation
Remove unused helper methods from the prometheus API:

- waitForReady (tests use waitForScrape directly)
- assertMetricIncremented, assertMetricAbove, assertMetricEquals (tests use getMetricValue with standard assertions)
- metrics record (tests query prometheus directly)

Also update the telemetry config documentation to remove the fullnet-only note.
Rename storage_request_seconds to storage_request_setup_seconds to clarify it measures initial request handling (validation, setup), not file transfer time.

Add new bytes_uploaded_total counter metric to track bytes received from upload requests. Tracked in BSP's handle_remote_upload_request.

Improve documentation for all metrics with clearer descriptions of what each metric measures and when it is incremented. Update Grafana dashboard and TypeScript metric definitions accordingly.
Remove specialized metrics tests that had complex setup requirements and were difficult to maintain:

- metrics-bucket-move.test.ts
- metrics-deletion.test.ts
- metrics-fees.test.ts
- metrics-fisherman.test.ts
- metrics-proofs.test.ts
- metrics-validation.test.ts

Keep metrics-basic.test.ts for core metric verification and add metrics-bspnet.test.ts for testing telemetry in the bspnet environment. Update metrics-basic.test.ts for the renamed storage_request_setup_seconds metric.
Move the msp_storage_requests_total metric recording to occur after checking the extrinsic dispatch result rather than after submission. Success is now only recorded when no ExtrinsicFailed event is found, and failure is recorded for submission errors, missing events, or dispatch errors.
…work

Move metrics collection from individual task files into the event bus infrastructure. The EventBusListener now automatically records lifecycle metrics (pending/success/failure counts and duration) for all registered event handlers, eliminating repetitive inc_counter! calls across tasks.

Key changes:

- Add LifecycleMetricRecorder trait and EventMetricRecorder implementation
- Extend subscribe_actor_event_map! macro with metrics parameter
- Replace 20+ task-specific counter metrics with unified event_handler_total and event_handler_seconds metrics labeled by handler name
- Update Grafana dashboards to use new centralized metric queries
- Remove duplicate metrics-bspnet test (consolidated into metrics-fullnet)
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <[email protected]>
Resolved modify/delete conflict by removing msp_verify_bucket_forests.rs (task was removed in main as it's now handled directly)
Remove the msp_forest_verification_seconds metric and related dashboard panels since the MspVerifyBucketForestsTask was removed in main.
Add CPU and memory monitoring using the sysinfo crate:

- System-wide CPU usage percentage
- Process CPU usage percentage
- System memory (total, used, available)
- Process memory (RSS)

A background task collects metrics every 5 seconds.

Dashboard fixes:

- Fix JSON escaping for role label queries (107 instances)
- Remove max:100 constraint from process CPU panels (can exceed 100% on multi-core)
…ribe_actor_event_map

Extract metric recorder expression into a variable to eliminate code duplication, always passing it to subscribe_actor_event! macro instead of duplicating the entire macro invocation.

…tricRecorder

Add default no-op implementations to LifecycleMetricRecorder trait methods, allowing implementors to only override the methods they need. Rename NoMetricRecorder to NoOpMetricRecorder for clarity.

Let subscribe_actor_event! use its internal NoOpMetricRecorder default instead of explicitly passing it from subscribe_actor_event_map!.

Remove the storage request setup duration histogram as it provides limited value. The event handler lifecycle metrics already track overall request handling time.
…s catalog

- Fix metric queries to use `event` label instead of `handler`
- Event names derived from event type (e.g., NewStorageRequest -> new_storage_request)
- Update ALL_STORAGEHUB_METRICS to match actual metrics in client/src/metrics.rs
- Add missing system resource gauges and event handler metrics
- Remove non-existent BSP/MSP counter metrics from catalog
Check event handler histogram metrics for the main events triggered during batch file uploads:

- new_storage_request
- remote_upload_request
- process_confirm_storing_request
- process_msp_respond_storing_request

Update test to explicitly check both status labels:

- pending: incremented when event is received
- success: incremented when handler completes successfully

Also verify pending >= success (every success was first pending).
Move the histogram observation to where generate_key_proof is called, eliminating the wrapper function pattern. This is cleaner and more explicit about when timing occurs.

The function was only used once, so inline its logic directly where the base type name is extracted from the event type.
## Labels

- `B5-clientnoteworthy`: Changes should be mentioned in client-related release notes
- `D3-trivial` 👶: PR contains trivial changes that do not require an audit
- `not-breaking`: Does not need to be mentioned in breaking changes
## Client Telemetry Infrastructure

Adds comprehensive Prometheus metrics instrumentation to the StorageHub client, enabling observability into storage provider operations, event handler lifecycle, file transfer throughput, and system resource utilization.

### Design
The telemetry system is built around three core components:

- `StorageHubMetrics`: central metrics registry containing all Prometheus counters, histograms, and gauges
- `MetricsLink`: optional wrapper that enables a zero-cost no-op when metrics are disabled
- `LifecycleMetricRecorder`: trait for tracking event handler states (pending → success/failure)

#### Event Handler Lifecycle Metrics
The `LifecycleMetricRecorder` trait is injected into event bus listeners via the `subscribe_actor_event_map!` macro. This enables automatic tracking of:

- `event_handler_pending_total`
- `event_handler_success_total`
- `event_handler_failure_total`
- `event_handler_duration_seconds`
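Based on the commit notes (default no-op methods, and the rename of NoMetricRecorder to NoOpMetricRecorder), the recorder trait likely has roughly the following shape. This is a minimal sketch: the method names, signatures, and the toy `CountingRecorder` are assumptions, not the PR's actual code.

```rust
use std::cell::Cell;
use std::time::Duration;

/// Sketch of a lifecycle recorder with no-op defaults, so implementors
/// override only the methods they need. Names and signatures are
/// illustrative assumptions based on the PR description.
pub trait LifecycleMetricRecorder {
    fn record_pending(&self) {}
    fn record_success(&self, _duration: Duration) {}
    fn record_failure(&self, _duration: Duration) {}
}

/// Recorder used when metrics are disabled: inherits every no-op default.
pub struct NoOpMetricRecorder;
impl LifecycleMetricRecorder for NoOpMetricRecorder {}

/// Toy recorder counting successful handler runs in memory
/// (a real implementation would write to a Prometheus counter).
pub struct CountingRecorder {
    pub success: Cell<u64>,
}

impl LifecycleMetricRecorder for CountingRecorder {
    fn record_success(&self, _duration: Duration) {
        self.success.set(self.success.get() + 1);
    }
}
```

The no-op defaults are what let `NoOpMetricRecorder` be a bare unit struct with an empty `impl`.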
When `metrics` is provided, the macro auto-generates an `EventMetricRecorder` for each mapping, with the `event` label derived from the event type name (converted to snake_case).

#### Task-Level Metrics
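The event-name derivation (e.g. NewStorageRequest becomes new_storage_request) amounts to a CamelCase-to-snake_case conversion. A stand-alone sketch of that step is shown below; it is not the PR's actual helper (a later commit inlined that logic), and it only handles plain ASCII CamelCase names.

```rust
/// Convert an event type name such as "NewStorageRequest" into the
/// snake_case value used for the `event` label ("new_storage_request").
/// Illustrative only.
fn event_label(type_name: &str) -> String {
    let mut out = String::with_capacity(type_name.len() + 4);
    for (i, c) in type_name.chars().enumerate() {
        if c.is_ascii_uppercase() {
            if i > 0 {
                out.push('_'); // word boundary before each capital
            }
            out.push(c.to_ascii_lowercase());
        } else {
            out.push(c);
        }
    }
    out
}
```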
Tasks record domain-specific metrics using helper macros:
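The commit history names `inc_counter!` and `observe_histogram!` as those helpers. A toy, in-memory approximation of the counter side is sketched below; the real macros record into the Prometheus registry through `MetricsLink`, and the exact invocation syntax here is an assumption.

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

// Toy in-memory store standing in for the Prometheus registry.
static COUNTERS: OnceLock<Mutex<HashMap<String, u64>>> = OnceLock::new();

fn counters() -> &'static Mutex<HashMap<String, u64>> {
    COUNTERS.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Increment a status-labeled counter, e.g.
/// inc_counter!("storage_requests_total", "success").
/// Illustrative macro shape, not the PR's actual implementation.
macro_rules! inc_counter {
    ($name:expr, $status:expr) => {{
        let key = format!("{}{{status=\"{}\"}}", $name, $status);
        *counters().lock().unwrap().entry(key).or_insert(0) += 1;
    }};
}
```

A task would call something like `inc_counter!("file_deletions_total", "failure")` on the relevant error path; `observe_histogram!` would follow the same shape but push a duration sample instead of incrementing.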
#### System Resource Metrics
A background task collects system and process resource metrics every 5 seconds using the `sysinfo` crate:

- `storagehub_system_cpu_usage_percent`
- `storagehub_process_cpu_usage_percent`
- `storagehub_system_memory_total_bytes`
- `storagehub_system_memory_used_bytes`
- `storagehub_system_memory_available_bytes`
- `storagehub_process_memory_rss_bytes`

The collector is spawned automatically when `MetricsLink::new()` is called with a registry, enabling zero-configuration resource monitoring.

### Metrics Catalog
#### Storage Request Metrics

| Metric | Labels |
| --- | --- |
| `storage_requests_total` | `provider_type`, `status` |
| `storage_request_retries_total` | `provider_type` |

#### Proof Metrics

| Metric | Labels |
| --- | --- |
| `proofs_submitted_total` | `status` |
| `proof_submission_duration_seconds` | `status` |

#### File Transfer Metrics

| Metric | Labels |
| --- | --- |
| `bytes_downloaded_total` | `status` |
| `bytes_uploaded_total` | `status` |
| `chunks_downloaded_total` | `status` |
| `chunks_uploaded_total` | `status` |
| `file_download_seconds` | `status` |

#### Provider Metrics

| Metric | Labels |
| --- | --- |
| `slashing_events_total` | `role` |
| `bucket_move_total` | `status` |
| `user_evictions_total` | `status` |
| `forest_verifications_total` | `status` |
| `file_deletions_total` | `provider_type`, `status` |

#### System Resource Metrics

| Metric | Labels |
| --- | --- |
| `storagehub_system_cpu_usage_percent` | (none) |
| `storagehub_process_cpu_usage_percent` | (none) |
| `storagehub_system_memory_total_bytes` | (none) |
| `storagehub_system_memory_used_bytes` | (none) |
| `storagehub_system_memory_available_bytes` | (none) |
| `storagehub_process_memory_rss_bytes` | (none) |

### Docker Infrastructure
#### Prometheus Integration

Each network type has a dedicated Prometheus configuration:

- `docker/bspnet-prometheus.yml`: BSPNet scrape configs
- `docker/fullnet-prometheus.yml`: FullNet scrape configs (BSPs, MSPs, Fisherman)

Prometheus scrapes metrics from provider containers on port `9615` (Substrate's default Prometheus port). The `role` label is automatically attached by Prometheus based on the scrape job configuration, enabling per-role filtering in dashboards.

#### Grafana Dashboards
Pre-built dashboards for each provider role:

- `bsp-dashboard.json`
- `msp-dashboard.json`
- `fisherman-dashboard.json`

Each dashboard includes a Resource Usage section.

Dashboards are automatically provisioned via `docker/grafana/provisioning/`.
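As a concrete illustration of the Prometheus integration described above, a bspnet scrape config might look like the sketch below. The service names, job names, and scrape interval are assumptions; only the ports (9615 for the BSP node, 9618 for the user node) and the per-job `role` label come from the PR description.

```yaml
# Illustrative sketch of docker/bspnet-prometheus.yml, not the actual file.
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: bsp
    static_configs:
      - targets: ["sh-bsp:9615"]   # BSP node metrics port
        labels:
          role: bsp                # role label attached per scrape job
  - job_name: user
    static_configs:
      - targets: ["sh-user:9618"]  # user node metrics port
        labels:
          role: user
```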