feat: add telemetry metrics instrumentation #594
Open: snowmead wants to merge 10 commits into main from feat/telemetry
- Add metrics.rs with StorageHubMetrics struct and helper macros
- Instrument BSP tasks: upload, proof submission, fees, deletion, bucket moves
- Instrument MSP tasks: upload, deletion, distribution, fees, bucket moves
- Instrument fisherman batch deletions and file downloads
- Add prometheus.yml config and Docker integration
- Add centralized test/util/prometheus.ts API
- Add integration tests for all metrics

Metrics tracked: storage requests, proofs, fees, deletions, bucket moves, file transfers, and download operations with status labels and histograms.

…r tracking
- Introduced new macros for metrics incrementing and histogram observation, allowing for cleaner and more consistent metric tracking across various tasks.
- Updated file download manager to utilize new macros for recording successful and failed download metrics.
- Enhanced proof generation task to track timing metrics for both success and failure scenarios.
- Improved storage request handling in upload tasks to increment metrics based on success or failure of confirmations.
- Refactored existing metric tracking code to reduce redundancy and improve readability.

Resolved conflicts by combining metrics instrumentation from feat/telemetry with improved return messages from main in:
- bsp_upload_file.rs
- msp_delete_bucket.rs
- msp_distribute_file.rs
- msp_upload_file.rs

…some tasks, add telemetry integration package script

…ption
Add metrics instrumentation for previously uncovered task event handlers:
- bsp_upload_file: chunk upload success/failure counters
- msp_retry_bucket_move: retry attempt counters
- msp_verify_bucket_forests: verification counters with duration histogram
- msp_stop_storing_insolvent_user: bucket deletion counters
- sp_slash_provider: slash submission counters

Update Grafana dashboards with new panels for chunk uploads (BSP) and forest verification/retries (MSP).
Labels
B5-clientnoteworthy
Changes should be mentioned in client-related release notes
D3-trivial👶
PR contains trivial changes that do not require an audit
not-breaking
Does not need to be mentioned in breaking changes
Summary
Pushes Prometheus metrics instrumentation across BSP, MSP, and Fisherman task handlers to ensure failure paths are properly tracked. Adds Grafana with pre-configured dashboards for visualizing StorageHub metrics when running integration tests with telemetry enabled.
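One way to make sure every failure path is counted is the wrapper pattern this PR describes for `bsp_download_file.rs`: an outer handler records the outcome once, and an inner function holds the logic and is free to return early. The sketch below is illustrative only. `StatusCounter` is a hypothetical in-memory stand-in for a labelled counter, and the function bodies and signatures are assumptions, not the code in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a counter labelled by status.
#[derive(Default)]
struct StatusCounter {
    counts: HashMap<&'static str, u64>,
}

impl StatusCounter {
    fn inc(&mut self, status: &'static str) {
        *self.counts.entry(status).or_insert(0) += 1;
    }

    fn get(&self, status: &str) -> u64 {
        self.counts.get(status).copied().unwrap_or(0)
    }
}

/// Inner function: holds the actual logic and may return early on any error.
/// The body is a placeholder; the real handler is not shown in this PR text.
fn handle_download_inner(file_key: &str) -> Result<usize, String> {
    if file_key.is_empty() {
        return Err("empty file key".to_string());
    }
    Ok(file_key.len())
}

/// Wrapper: the single place where success or failure is recorded,
/// so no early return inside the inner function can skip the metric.
fn handle_download(metrics: &mut StatusCounter, file_key: &str) -> Result<usize, String> {
    let result = handle_download_inner(file_key);
    match &result {
        Ok(_) => metrics.inc("success"),
        Err(_) => metrics.inc("failure"),
    }
    result
}

fn main() {
    let mut metrics = StatusCounter::default();
    let _ = handle_download(&mut metrics, "0xabc");
    let _ = handle_download(&mut metrics, "");
    println!(
        "success={} failure={}",
        metrics.get("success"),
        metrics.get("failure")
    );
}
```

Recording in one place trades a little indirection for the guarantee that new error branches added to the inner function are counted without extra instrumentation.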
Notable changes
Metrics Instrumentation
- `bsp_download_file.rs`: New metric `bsp_download_requests_total` with wrapper pattern: `handle_download` records success/failure metrics and delegates to `handle_download_inner` for actual logic.
- `bsp_delete_file.rs`: Refactored metric recording into `remove_file_from_file_storage` to centralize success/failure tracking. Added failure metrics on forest storage retrieval errors.
- `bsp_charge_fees.rs`: Added `insolvent_users_processed_total` failure counters on forest storage retrieval failures and forest root write tx errors.
- `bsp_submit_proof.rs`: Added `bsp_proofs_submitted_total` failure counters when forest root tx is already taken or forest storage retrieval fails.
- `bsp_upload_file.rs`: Added `bsp_storage_requests_total` failure counters on tx taken, wrong provider type, no BSP ID, empty proofs, and forest storage retrieval errors. Added `bsp_upload_chunks_received_total` for tracking chunk upload success/failure in `RemoteUploadRequest` handler.
- `msp_upload_file.rs`: Added `msp_storage_requests_total` failure counters on tx taken, wrong provider type, and no MSP ID errors.
- `msp_delete_file.rs`: Added `msp_files_deleted_total` failure counters on forest storage and metadata retrieval errors.
- `msp_move_bucket.rs`: Success metric now recorded after actual download completion rather than after on-chain confirmation. Added failure metrics for indexer disabled and download failures. Empty bucket (no files) correctly records success.
- `msp_retry_bucket_move.rs`: Added `msp_bucket_move_retries_total` for tracking retry attempt success/failure in `RetryBucketMoveDownload` handler.
- `msp_verify_bucket_forests.rs`: Added `msp_forest_verifications_total` counter and `msp_forest_verification_seconds` histogram for tracking forest verification operations with timing.
- `msp_stop_storing_insolvent_user.rs`: Added `msp_buckets_deleted_total` metrics for tracking bucket deletion success/failure in `FinalisedMspStopStoringBucketInsolventUser` handler.
- `sp_slash_provider.rs`: Added `sp_slash_submissions_total` for tracking slash extrinsic submission success/failure in `SlashableProvider` handler.

Grafana Integration
- … `fullnet-base-template.yml` on port 3030 with anonymous viewer access enabled.
- … `sh-prometheus:9090`.

Test Infrastructure
- … `prometheus/` to `telemetry/` for broader scope.
- … `test:telemetry` and `test:telemetry:only` scripts to `package.json`.
- … `bsp_download_requests_total` to metrics validation test.
- … `NODE_INFOS` constants.

Note for Node Operators
Download the prefabricated Grafana dashboards for your StorageHub node role to use as a starting baseline.
Snippet of MSP Grafana Dashboard
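For operators building their own panels, it may help to see what a status-labelled counter looks like when Prometheus scrapes it. The sketch below renders one counter in the Prometheus text exposition format using only the standard library. The label name `status` and the help text are assumptions (the PR text only says metrics carry status labels), and `render_counter` is a hypothetical helper, not part of this PR.

```rust
use std::collections::BTreeMap;

/// Render a counter with one series per status value in the Prometheus
/// text exposition format. The `status` label name is an assumption.
fn render_counter(name: &str, help: &str, by_status: &BTreeMap<&str, u64>) -> String {
    let mut out = String::new();
    out.push_str(&format!("# HELP {name} {help}\n"));
    out.push_str(&format!("# TYPE {name} counter\n"));
    // BTreeMap iterates in key order, so the output is deterministic.
    for (status, value) in by_status {
        out.push_str(&format!("{name}{{status=\"{status}\"}} {value}\n"));
    }
    out
}

fn main() {
    let mut by_status = BTreeMap::new();
    by_status.insert("success", 3u64);
    by_status.insert("failure", 1u64);

    // Metric name taken from the reference in this PR; help text is made up.
    let text = render_counter(
        "storagehub_bsp_download_requests_total",
        "Download requests handled by the BSP (illustrative help text)",
        &by_status,
    );
    print!("{text}");
}
```

Under the same assumption about the label name, a panel could then split success and failure by filtering or grouping on `status`.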
Metrics Reference
BSP Metrics
| Metric | Status values |
| --- | --- |
| `storagehub_bsp_storage_requests_total` | `pending`, `success`, `failure` |
| `storagehub_bsp_proofs_submitted_total` | `pending`, `success`, `failure` |
| `storagehub_bsp_fees_charged_total` | `success`, `failure` |
| `storagehub_bsp_files_deleted_total` | `success`, `failure` |
| `storagehub_bsp_bucket_moves_total` | `pending`, `success`, `failure` |
| `storagehub_bsp_download_requests_total` | `success`, `failure` |
| `storagehub_bsp_upload_chunks_received_total` | `success`, `failure` |
| `storagehub_bsp_proof_generation_seconds` | `success`, `failure` |

MSP Metrics

| Metric | Status values |
| --- | --- |
| `storagehub_msp_storage_requests_total` | `pending`, `success`, `failure` |
| `storagehub_msp_files_distributed_total` | `pending`, `success`, `failure` |
| `storagehub_msp_files_deleted_total` | `success`, `failure` |
| `storagehub_msp_buckets_deleted_total` | `success`, `failure` |
| `storagehub_msp_fees_charged_total` | `success`, `failure` |
| `storagehub_msp_bucket_moves_total` | `pending`, `success`, `failure` |
| `storagehub_msp_bucket_move_retries_total` | `success`, `failure` |
| `storagehub_msp_forest_verifications_total` | `success`, `failure` |
| `storagehub_msp_forest_verification_seconds` | `success`, `failure` |

SP Metrics

| Metric | Status values |
| --- | --- |
| `storagehub_sp_slash_submissions_total` | `success`, `failure` |

General Metrics

| Metric | Status values |
| --- | --- |
| `storagehub_storage_request_seconds` | `success`, `failure` |
| `storagehub_file_transfer_seconds` | `success`, `failure` |
| `storagehub_insolvent_users_processed_total` | `success`, `failure` |
| `storagehub_fisherman_batch_deletions_total` | `success`, `failure` |

Download Metrics

| Metric | Status values |
| --- | --- |
| `storagehub_bytes_downloaded_total` | `success`, `failure` |
| `storagehub_chunks_downloaded_total` | `success`, `failure` |
| `storagehub_file_download_seconds` | `success`, `failure` |
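The `_seconds` metrics above record durations for both outcomes. A minimal sketch of that idea, assuming cumulative histogram buckets and one histogram per status value, is shown below. `Histogram`, `OutcomeTimings`, and `timed` are hypothetical names for illustration and do not reflect the macros or types added in this PR.

```rust
use std::time::Instant;

/// Hypothetical stand-in for a histogram with cumulative buckets.
struct Histogram {
    upper_bounds: Vec<f64>,  // bucket upper bounds in seconds
    bucket_counts: Vec<u64>, // cumulative count of observations <= bound
    sum: f64,
    count: u64,
}

impl Histogram {
    fn new(upper_bounds: Vec<f64>) -> Self {
        let n = upper_bounds.len();
        Self { upper_bounds, bucket_counts: vec![0; n], sum: 0.0, count: 0 }
    }

    fn observe(&mut self, seconds: f64) {
        // Cumulative semantics: every bucket whose bound covers the value is bumped.
        for (bound, slot) in self.upper_bounds.iter().zip(self.bucket_counts.iter_mut()) {
            if seconds <= *bound {
                *slot += 1;
            }
        }
        self.sum += seconds;
        self.count += 1;
    }
}

/// One histogram per status value, mirroring the success/failure split.
struct OutcomeTimings {
    success: Histogram,
    failure: Histogram,
}

impl OutcomeTimings {
    fn new(bounds: &[f64]) -> Self {
        Self {
            success: Histogram::new(bounds.to_vec()),
            failure: Histogram::new(bounds.to_vec()),
        }
    }
}

/// Time an operation and record the duration under the matching outcome.
fn timed<T, E>(
    timings: &mut OutcomeTimings,
    op: impl FnOnce() -> Result<T, E>,
) -> Result<T, E> {
    let start = Instant::now();
    let result = op();
    let elapsed = start.elapsed().as_secs_f64();
    match &result {
        Ok(_) => timings.success.observe(elapsed),
        Err(_) => timings.failure.observe(elapsed),
    }
    result
}

fn main() {
    let mut timings = OutcomeTimings::new(&[0.1, 1.0, 10.0, f64::INFINITY]);
    let _ = timed(&mut timings, || Ok::<(), String>(()));
    let _ = timed(&mut timings, || Err::<(), String>("verification failed".to_string()));
    println!(
        "success: count={} sum={:.6}s, failure: count={} sum={:.6}s",
        timings.success.count, timings.success.sum,
        timings.failure.count, timings.failure.sum
    );
}
```

Timing failures as well as successes keeps slow error paths, such as timeouts, visible in the same panels as normal latency.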