Skip to content

feat: add borsh serialisation and deserialisation to peer_db peer claim source #7358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: development
Choose a base branch
from

Conversation

hansieodendaal
Copy link
Contributor

@hansieodendaal hansieodendaal commented Jul 21, 2025

Description

Added Borsh serialisation/deserialisation to the peer_db's MultiaddrWithStats peer source field in favour of using serde_json.

See #7306 - further system-level monitoring is needed to verify that this was the root cause.

Motivation and Context

Serde (serde_json) conversion was not always successful - validated peer info would be added to the database, then, very rarely, the data would be skewed when either written to or read from the database, so that peer validation fails due to claim signature validation failure, most likely the updated_at: DateTime<Utc> field.

It was easy to detect invalid peer info sent over the wire from sync peers as the invalidated data is present in almost all peer dbs, but difficult to sumulate invalid data in the local test peer db. Starting with a fresh peer db each time, initial peer sync had to be done many times to capture a bad peer db instance where validation afterwards would fail. In the case presented here, ~0.4% of the entries was invalid.

The example below shows a valid peer info added to the database, but when read back to validate, it fails with the 2nd address.

2025-07-21 12:06:42.378104700 [comms::dht] TRACE [Validate peer db] peer: '193cdfa4bda140a44b11785365' / '2225da7b8e01debb3bf65a263bcf47bd49f69773ce7b59c04ae43f5838bbd11b' __W MultiaddressesWithStats { addresses: [MultiaddrWithStats { address: "/ip4/127.0.0.1/tcp/41907", last_seen: None, connection_attempts: 0, avg_initial_dial_time: None, initial_dial_time_sample_count: 0, avg_latency: None, latency_sample_count: 0, last_attempted: None, last_failed_reason: None, quality_score: None, source: FromDiscovery { peer_identity_claim: PeerIdentityClaim { addresses: ["/ip4/127.0.0.1/tcp/41907"], features: PeerFeatures(0x0), signature: IdentitySignature { version: 0, signature: CompressedSchnorrSignature { public_nonce: 821e9d504cb756d64d2ed2f2612b5fdab8a627cbc1d7d7b24c5fc3ffb13e6a28, signature: RistrettoSecretKey(***), _phantom: PhantomData<tari_crypto::signatures::schnorr::SchnorrSigChallenge> }, updated_at: 2025-07-21T08:23:13Z } } } }, MultiaddrWithStats { address: "/ip4/127.0.0.1/tcp/43047", last_seen: None, connection_attempts: 0, avg_initial_dial_time: None, initial_dial_time_sample_count: 0, avg_latency: None, latency_sample_count: 0, last_attempted: None, last_failed_reason: None, quality_score: None, source: FromDiscovery { peer_identity_claim: PeerIdentityClaim { addresses: ["/ip4/127.0.0.1/tcp/43047"], features: PeerFeatures(0x0), signature: IdentitySignature { version: 0, signature: CompressedSchnorrSignature { public_nonce: 2e1b1a2dd32792a84ab6ee47baa5b66abe83ff950cd2410ff1ab2ecfedc05753, signature: RistrettoSecretKey(***), _phantom: PhantomData<tari_crypto::signatures::schnorr::SchnorrSigChallenge> }, updated_at: 2025-07-21T08:48:18Z } } } }] }
2025-07-21 12:06:42.378314700 [dht::network_discovery::peer_validator] TRACE Peer '193cdfa4bda140a44b11785365' / '2225da7b8e01debb3bf65a263bcf47bd49f69773ce7b59c04ae43f5838bbd11b' passed validation for claim: PeerIdentityClaim { addresses: ["/ip4/127.0.0.1/tcp/41907"], features: PeerFeatures(0x0), signature: IdentitySignature { version: 0, signature: CompressedSchnorrSignature { public_nonce: 821e9d504cb756d64d2ed2f2612b5fdab8a627cbc1d7d7b24c5fc3ffb13e6a28, signature: RistrettoSecretKey(***), _phantom: PhantomData<tari_crypto::signatures::schnorr::SchnorrSigChallenge> }, updated_at: 2025-07-21T08:23:13Z } }
2025-07-21 12:06:42.378393900 [dht::network_discovery::peer_validator] TRACE Peer '193cdfa4bda140a44b11785365' / '2225da7b8e01debb3bf65a263bcf47bd49f69773ce7b59c04ae43f5838bbd11b' FAILED validation for claim: PeerIdentityClaim { addresses: ["/ip4/127.0.0.1/tcp/43047"], features: PeerFeatures(0x0), signature: IdentitySignature { version: 0, signature: CompressedSchnorrSignature { public_nonce: 2e1b1a2dd32792a84ab6ee47baa5b66abe83ff950cd2410ff1ab2ecfedc05753, signature: RistrettoSecretKey(***), _phantom: PhantomData<tari_crypto::signatures::schnorr::SchnorrSigChallenge> }, updated_at: 2025-07-21T08:48:18Z } }, error: Peer signature was invalid for peer '193cdfa4bda140a44b11785365'
2025-07-21 12:06:42.378410300 [comms::dht] WARN  [Validate peer db] Peer validation failed for peer '193cdfa4bda140a44b11785365', deleting from peer db, Error: 'Peer signature was invalid for peer '193cdfa4bda140a44b11785365''

How Has This Been Tested?

Added unit tests.
System-level testing.
To be monitored.

What process can a PR reviewer use to test or verify this change?

Code review.
System-level testing.

Breaking Changes

  • None
  • Requires data directory on base node to be deleted
  • Requires hard fork
  • Other - Please specify

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Added support for Borsh serialization and deserialization to peer identity and address data, improving data handling efficiency.
    • Introduced new test utilities for peer creation and identity claims to streamline testing.
  • Bug Fixes

    • Ensured consistent serialization and deserialization of peer claims and addresses, verified by new automated tests.
  • Tests

    • Added comprehensive tests validating Borsh serialization round-trips for peer-related entities.
    • Added tests verifying database serialization consistency for peer claims.
  • Chores

    • Updated database schema and migration scripts to store peer address sources in binary format instead of text.

Added Borsh serialization/deserialisation to the peer_db's MultiaddrWithStats peer source field.
Serde conversion was not always successful - validated peer info would be added to the database,
then ~0.4% of the time the data would be skewed when read form the database do that peer
validation fails due to claim signature validation failure, most likely the
`updated_at: DateTime<Utc>` field.
Copy link
Contributor

coderabbitai bot commented Jul 21, 2025

Walkthrough

This change transitions the serialization of peer address sources in the peer manager from JSON (text) to Borsh (binary) format. It updates Rust structs, database schema, and migration scripts accordingly, and implements Borsh serialization/deserialization for relevant types. Comprehensive tests are added to ensure round-trip correctness and consistency.

Changes

File(s) Change Summary
comms/core/Cargo.toml Added borsh crate as a dependency.
comms/core/src/net_address/multiaddr_with_stats.rs Derived Borsh serialization for PeerAddressSource and added a serialization round-trip test.
comms/core/src/peer_manager/identity_signature.rs Implemented Borsh serialization/deserialization for IdentitySignature and added a test.
comms/core/src/peer_manager/peer_identity_claim.rs Implemented Borsh serialization/deserialization for PeerIdentityClaim, added test helper and test.
comms/core/src/peer_manager/manager.rs Added test-only function to create a peer with a claim for testing.
comms/core/src/peer_manager/mod.rs Exposed new test-only utilities for peer creation and claims.
comms/core/src/peer_manager/storage/database.rs Changed peer address source serialization from JSON to Borsh binary, updated struct fields and conversion logic.
comms/core/src/peer_manager/storage/schema.rs Changed multi_addresses.source column type from Text to Binary.
comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/up.sql Migration: dropped/recreated tables to update source column to BLOB (binary).
comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/down.sql Migration rollback: added comments, no action possible for binary-to-text conversion.

Sequence Diagram(s)

sequenceDiagram
    participant RustStruct as Rust Struct (PeerAddressSource, etc.)
    participant Borsh as Borsh Serializer
    participant DB as Database

    RustStruct->>Borsh: Serialize (to Vec<u8>)
    Borsh->>DB: Store as Binary (BLOB)
    DB->>Borsh: Retrieve Binary (BLOB)
    Borsh->>RustStruct: Deserialize (from Vec<u8>)
Loading

Estimated code review effort

4 (~90 minutes)

Possibly related PRs

  • feat: add sqlite peer_db #6963: Both PRs involve major changes to the peer database backend and serialization, likely affecting similar modules and migration strategies.

Suggested reviewers

  • SWvheerden

Poem

In binary fields the data hops,
From JSON’s text to Borsh it swaps.
Peer claims and sources, crisp and neat,
Now travel fast in binary fleet.
With every test a bunny cheers—
Serialization, no more fears!
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4e5bcfe and 1ba889c.

📒 Files selected for processing (1)
  • comms/core/src/peer_manager/storage/database.rs (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • comms/core/src/peer_manager/storage/database.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: test (testnet, esmeralda)
  • GitHub Check: test (nextnet, nextnet)
  • GitHub Check: test (mainnet, stagenet)
  • GitHub Check: ledger build tests
  • GitHub Check: cargo check with stable
  • GitHub Check: ci
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@hansieodendaal hansieodendaal changed the title feat: add borsh to peer_db peer claim source feat: add borsh serialisation and deserialisation to peer_db peer claim source Jul 21, 2025
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9a16964 and 4e5bcfe.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • comms/core/Cargo.toml (1 hunks)
  • comms/core/src/net_address/multiaddr_with_stats.rs (4 hunks)
  • comms/core/src/peer_manager/identity_signature.rs (3 hunks)
  • comms/core/src/peer_manager/manager.rs (2 hunks)
  • comms/core/src/peer_manager/mod.rs (1 hunks)
  • comms/core/src/peer_manager/peer_identity_claim.rs (2 hunks)
  • comms/core/src/peer_manager/storage/database.rs (9 hunks)
  • comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/down.sql (1 hunks)
  • comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/up.sql (1 hunks)
  • comms/core/src/peer_manager/storage/schema.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (10)
📓 Common learnings
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1655-1658
Timestamp: 2025-05-29T09:40:09.356Z
Learning: In the Tari codebase, node_id hex strings in the database are guaranteed to be valid because they can only be added via `update_peer_sql(peer: Peer)` which converts from valid NodeId objects, ensuring data integrity at the insertion layer.
Learnt from: hansieodendaal
PR: tari-project/tari#6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.
Learnt from: hansieodendaal
PR: tari-project/tari#7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the `hard_delete_all_stale_peers` method in `comms/core/src/peer_manager/storage/database.rs`, the SQL query intentionally uses exact equality (`peers.features = ?`) rather than bitwise operations (`peers.features & ? != 0`) when matching `COMMUNICATION_NODE` features. This is the intended behavior to match only peers with exactly the `COMMUNICATION_NODE` feature, excluding those with additional feature flags.
comms/core/src/peer_manager/storage/schema.rs (2)

Learnt from: hansieodendaal
PR: #6963
File: common_sqlite/src/error.rs:88-92
Timestamp: 2025-05-23T07:49:57.349Z
Learning: In the StorageError enum in common_sqlite/src/error.rs, the HexError variant should keep the manual From implementation rather than using #[from] attribute, as it stores a String representation of the error rather than the HexError type itself.

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/down.sql (1)

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

comms/core/src/net_address/multiaddr_with_stats.rs (8)

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: hansieodendaal
PR: #6963
File: comms/core/src/peer_manager/storage/migrations/2025-04-14-072200_initial/up.sql:24-41
Timestamp: 2025-05-02T14:07:10.892Z
Learning: The peer system design requires each network address to be uniquely associated with exactly one peer, and an address cannot be reused across multiple peers.

Learnt from: hansieodendaal
PR: #6963
File: comms/dht/src/proto/mod.rs:141-142
Timestamp: 2025-05-02T07:12:23.985Z
Learning: The PeerFeatures::from_bits_u32_truncate method truncates a u32 to u8 bits but can still return None if the resulting bits don't match any valid flags, making the error handling with .ok_or_else() necessary even after truncation.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:721-735
Timestamp: 2025-07-09T08:13:37.206Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the 10-second STREAM_ITEM_TIMEOUT and retry logic are intentionally designed to handle service conflicts where other services kill seed peer connections during seedstrap operations. The underlying discovery_peer/dial_peer API timeouts are too lenient for seedstrap use cases, so the more aggressive timeout with retry logic is appropriate and necessary.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/up.sql (3)

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1655-1658
Timestamp: 2025-05-29T09:40:09.356Z
Learning: In the Tari codebase, node_id hex strings in the database are guaranteed to be valid because they can only be added via update_peer_sql(peer: Peer) which converts from valid NodeId objects, ensuring data integrity at the insertion layer.

Learnt from: hansieodendaal
PR: #6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.

comms/core/src/peer_manager/peer_identity_claim.rs (4)

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

Learnt from: hansieodendaal
PR: #6963
File: comms/dht/src/proto/mod.rs:141-142
Timestamp: 2025-05-02T07:12:23.985Z
Learning: The PeerFeatures::from_bits_u32_truncate method truncates a u32 to u8 bits but can still return None if the resulting bits don't match any valid flags, making the error handling with .ok_or_else() necessary even after truncation.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

comms/core/src/peer_manager/identity_signature.rs (2)

Learnt from: hansieodendaal
PR: #7284
File: applications/minotari_console_wallet/src/automation/commands.rs:0-0
Timestamp: 2025-07-15T12:23:14.650Z
Learning: In applications/minotari_console_wallet/src/automation/commands.rs, the consistent error handling pattern for command execution is to use match statements that: 1) On success: log with debug!, print user feedback, and push tx_id to tx_ids vector for monitoring, 2) On error: print error message with eprintln! using the format "{CommandName} error! {}", rather than using .unwrap() which would panic.

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

comms/core/src/peer_manager/mod.rs (4)

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

comms/core/src/peer_manager/storage/database.rs (11)

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

Learnt from: hansieodendaal
PR: #6963
File: common_sqlite/src/error.rs:88-92
Timestamp: 2025-05-23T07:49:57.349Z
Learning: In the StorageError enum in common_sqlite/src/error.rs, the HexError variant should keep the manual From implementation rather than using #[from] attribute, as it stores a String representation of the error rather than the HexError type itself.

Learnt from: hansieodendaal
PR: #6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1655-1658
Timestamp: 2025-05-29T09:40:09.356Z
Learning: In the Tari codebase, node_id hex strings in the database are guaranteed to be valid because they can only be added via update_peer_sql(peer: Peer) which converts from valid NodeId objects, ensuring data integrity at the insertion layer.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the NUM_RETRIES logic in get_peers is specifically designed to handle peer connections that are closed while trying to RPC stream peer info, not general connection failures. The retry logic only applies when peers.is_empty() && !conn.is_connected() && attempt < NUM_RETRIES, which indicates a mid-stream disconnection.

Learnt from: hansieodendaal
PR: #6963
File: comms/core/src/peer_manager/storage/migrations/2025-04-14-072200_initial/up.sql:24-41
Timestamp: 2025-05-02T14:07:10.892Z
Learning: The peer system design requires each network address to be uniquely associated with exactly one peer, and an address cannot be reused across multiple peers.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.

Learnt from: hansieodendaal
PR: #6963
File: comms/dht/src/proto/mod.rs:141-142
Timestamp: 2025-05-02T07:12:23.985Z
Learning: The PeerFeatures::from_bits_u32_truncate method truncates a u32 to u8 bits but can still return None if the resulting bits don't match any valid flags, making the error handling with .ok_or_else() necessary even after truncation.

comms/core/src/peer_manager/manager.rs (7)

Learnt from: hansieodendaal
PR: #7123
File: comms/core/src/peer_manager/storage/database.rs:1517-1541
Timestamp: 2025-05-29T09:42:20.881Z
Learning: In the hard_delete_all_stale_peers method in comms/core/src/peer_manager/storage/database.rs, the SQL query intentionally uses exact equality (peers.features = ?) rather than bitwise operations (peers.features & ? != 0) when matching COMMUNICATION_NODE features. This is the intended behavior to match only peers with exactly the COMMUNICATION_NODE feature, excluding those with additional feature flags.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: SWvheerden
PR: #6951
File: base_layer/core/src/base_node/tari_pulse_service/mod.rs:327-352
Timestamp: 2025-04-16T07:06:53.981Z
Learning: The discovery_peer and dial_peer methods in the Tari codebase have built-in timeout mechanisms, so adding explicit timeouts with tokio::time::timeout is unnecessary.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the fetch_peers_from_connection and collect_peer_stream functions rely on RPC streaming, and when the main connection is closed by another process, collect_peer_stream times out after STREAM_ITEM_TIMEOUT because it cannot detect that the peer can no longer respond, returning an empty vector of peers. This is why the connection state check is important for the retry logic.

Learnt from: hansieodendaal
PR: #6963
File: comms/dht/src/proto/mod.rs:141-142
Timestamp: 2025-05-02T07:12:23.985Z
Learning: The PeerFeatures::from_bits_u32_truncate method truncates a u32 to u8 bits but can still return None if the resulting bits don't match any valid flags, making the error handling with .ok_or_else() necessary even after truncation.

Learnt from: hansieodendaal
PR: #7294
File: comms/dht/src/network_discovery/seed_strap.rs:352-456
Timestamp: 2025-07-09T08:33:29.320Z
Learning: In comms/dht/src/network_discovery/seed_strap.rs, the context.connectivity.dial_peer method should fail fast and return an error if a peer cannot be dialed, rather than requiring retry logic for general connection failures.

Learnt from: hansieodendaal
PR: #6963
File: comms/core/src/peer_manager/manager.rs:60-68
Timestamp: 2025-05-26T02:40:23.812Z
Learning: PeerDatabaseSql in the Tari codebase has been specifically refactored to handle concurrent access and mitigate blocking I/O concerns on async executor threads. The implementation has been tested under high load at both system level and through unit tests like test_concurrent_add_or_update_and_get_closest_peers which validates concurrent read/write operations.

🔇 Additional comments (22)
comms/core/Cargo.toml (1)

68-68: LGTM: Borsh dependency addition is appropriate.

The borsh = "1.5.7" dependency addition supports the transition from JSON to binary serialization for peer address sources, which should improve reliability and performance.

comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/down.sql (1)

1-5: LGTM: Appropriate handling of irreversible migration.

The comments correctly explain that rolling back from Binary to Text format is not feasible since the original JSON strings cannot be reliably reconstructed. This is a sensible approach for this type of data migration.

comms/core/src/peer_manager/storage/schema.rs (1)

46-46: LGTM: Schema change aligns with Borsh serialization migration.

The change from Text to Binary for the source column correctly supports the transition from JSON string to Borsh binary serialization format.

comms/core/src/peer_manager/mod.rs (2)

101-102: LGTM: Appropriate test utility export.

The addition of create_test_peer_with_claim export with proper #[cfg(test)] guarding provides necessary test utilities for Borsh serialization testing.


108-109: LGTM: Appropriate test utility export.

The addition of create_test_peer_identity_claim export with proper #[cfg(test)] guarding complements the testing infrastructure for peer identity claims.

comms/core/src/net_address/multiaddr_with_stats.rs (4)

14-14: LGTM: Appropriate Borsh imports.

The import of BorshDeserialize and BorshSerialize traits is necessary for the serialization implementation.


390-390: LGTM: Borsh derives added to PeerAddressSource.

Adding BorshSerialize, BorshDeserialize derives to the PeerAddressSource enum enables binary serialization, which is the core requirement for this migration.


471-475: LGTM: Appropriate test imports.

The test imports for BorshDeserialize, BorshSerialize, Multiaddr, and the test utility function are correctly added to support the new serialization test.


571-593: LGTM: Comprehensive Borsh serialization test.

The test performs 1000 iterations of round-trip serialization/deserialization testing, which provides excellent coverage to ensure the Borsh implementation is reliable and consistent. This addresses the core issue mentioned in the PR objectives where Serde conversion was occasionally unreliable.

comms/core/src/peer_manager/manager.rs (1)

404-449: LGTM! Well-structured test utility function.

The new create_test_peer_with_claim function properly creates a peer with a PeerIdentityClaim as the address source, which is essential for testing the Borsh serialization functionality. The implementation follows the existing pattern of create_test_peer while adding claim-based address source support.

comms/core/src/peer_manager/identity_signature.rs (2)

201-237: Solid Borsh serialization implementation.

The Borsh serialization/deserialization implementation for IdentitySignature is well-structured:

  • Correctly handles cryptographic key serialization using canonical byte representations
  • Properly serializes DateTime<Utc> by breaking it into timestamp and nanoseconds
  • Includes appropriate error handling converting parsing errors to IoError
  • Maintains consistent field ordering between serialization and deserialization

303-323: Comprehensive round-trip test for serialization correctness.

The test performs 1000 iterations of Borsh serialization/deserialization round-trips, which provides strong confidence in the consistency of the implementation. This directly addresses the core issue of serialization corruption mentioned in the PR objectives.

comms/core/src/peer_manager/peer_identity_claim.rs (3)

88-122: Well-implemented Borsh serialization with proper error handling.

The Borsh serialization implementation correctly handles the conversion of Multiaddr objects to/from strings and includes proper error handling for parsing failures and invalid feature bits. The approach of serializing addresses as strings is appropriate since Multiaddr doesn't natively support Borsh.


124-168: Excellent test utility with diverse address types.

The create_test_peer_identity_claim function creates comprehensive test data with IPv4, IPv6, and onion addresses, ensuring good test coverage across different multiaddr types. The proper signature generation makes this a valuable utility for testing claim-based functionality.


177-195: Thorough round-trip test matching the established pattern.

The test provides 1000 iterations of serialization/deserialization verification, which aligns with the testing approach used for IdentitySignature and provides strong confidence in the Borsh implementation correctness.

comms/core/src/peer_manager/storage/migrations/2025-07-21-170500_peer_address_source/up.sql (1)

1-54: Migration correctly handles column type change but results in data loss.

The migration appropriately drops and recreates the tables since changing from TEXT to BLOB cannot be reliably converted. All table structures, indexes, and constraints are properly preserved. However, this migration will result in complete loss of existing peer data.

Please verify that this data loss is acceptable for the deployment strategy. Consider if any data preservation steps are needed before applying this migration.

comms/core/src/peer_manager/storage/database.rs (6)

25-25: LGTM: Import addition for Borsh serialization

The addition of Borsh imports is appropriate for the serialization change.


1658-1658: Verify: Non-optional source field

The source field in NewMultiaddrWithStatsSql is now Vec<u8> (non-optional), while in UpdateMultiaddrWithStatsSql it remains optional (Option<Vec<u8>>). Is this intentional? Consider making it Option<Vec<u8>> for consistency and to handle cases where source data might be missing.


566-570: LGTM: Correct Borsh serialization implementation

The serialization of address.source() using Borsh is implemented correctly with proper error handling.


671-675: LGTM: Consistent serialization in update method

The Borsh serialization implementation is consistent with the add method.


1815-1815: LGTM: Correct Borsh deserialization

The deserialization using PeerAddressSource::deserialize_reader is implemented correctly with proper error propagation.


2466-2488: Excellent test coverage for serialization consistency

The test effectively validates the Borsh serialization/deserialization round-trip for 1000 peers with claims. This comprehensive test should help ensure the corruption issues mentioned in the PR objectives are resolved.

@SWvheerden
Copy link
Collaborator

it makes no sense that a json serialze/deserialze would fail without any error and create invalid data that can be read successfully.
I very much prefer us to keep that data json serialized because it makes it human readable,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants