log :: 2025‐06
It took a couple of minutes of discussion with RK to find the source of the "messages lost" issue when connecting nodes: the Actor library we use, acto, has bounded-size mailboxes for each spawned actor. When a mailbox is full, new incoming messages are dropped: they become dead letters and are never processed. While this might be consistent with the usual Actor semantics, it's quite inconvenient for us.
There are a variety of options to deal with the issue, which amounts to handling back-pressure:
- keep track of sent messages and require an `Ack` from the actor. This complicates the machinery, but is in accord with actors' look and feel
- provide a `try_send` method and retry sending the message if the mailbox is full
- increase the size of the mailbox!
We went for the last option as a stop-gap approach, looking forward to implementing the new p2p-based network approach discussed in the previous days.
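For reference, here is a rough sketch of what the `try_send`-and-retry option could look like, using a bounded tokio mpsc channel as a stand-in for acto's mailbox (this is not acto's API; channel size, backoff and message type are made up):

```rust
use std::time::Duration;
use tokio::sync::mpsc::{self, error::TrySendError};

#[tokio::main] // requires tokio with the "full" feature set
async fn main() {
    // A bounded "mailbox" of size 8, standing in for an actor's inbox.
    let (tx, mut rx) = mpsc::channel::<String>(8);

    // Sender side: retry with a small backoff instead of silently dropping
    // the message when the mailbox is full.
    let producer = tokio::spawn(async move {
        for i in 0..100 {
            let mut msg = format!("message {i}");
            loop {
                match tx.try_send(msg) {
                    Ok(()) => break,
                    Err(TrySendError::Full(returned)) => {
                        msg = returned; // keep the message and retry later
                        tokio::time::sleep(Duration::from_millis(10)).await;
                    }
                    Err(TrySendError::Closed(_)) => return,
                }
            }
        }
    });

    // "Actor" side: drain the mailbox until the sender is done.
    while let Some(msg) = rx.recv().await {
        println!("processing {msg}");
    }
    let _ = producer.await;
}
```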
- Collecting interesting metrics for consensus was relatively straightforward:
  - define a `ConsensusMetrics` structure containing a few `Gauge`s and `Counter`s of interest
  - initialise the object at startup when Telemetry is active
  - pass an `Option<ConsensusMetrics>` to all stages so that they can report their own metrics. For stages which are simple wrappers around domain services, the metrics are passed to the latter so that it's the responsibility of the "business logic" to set and update metrics
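For illustration, here is a minimal sketch of the shape of such a structure. The field names are made up and the gauges/counters are backed by plain atomics rather than the actual telemetry handles used in the node:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

// Hypothetical sketch: the real ConsensusMetrics would hold the telemetry
// crate's Counter/Gauge handles; atomics stand in for them here.
#[derive(Clone, Default)]
pub struct ConsensusMetrics {
    headers_validated: Arc<AtomicU64>, // Counter: total headers that passed validation
    chain_tip_slot: Arc<AtomicU64>,    // Gauge: slot number of the current tip
}

impl ConsensusMetrics {
    pub fn record_validated_header(&self, slot: u64) {
        self.headers_validated.fetch_add(1, Ordering::Relaxed);
        self.chain_tip_slot.store(slot, Ordering::Relaxed);
    }
}

// Each stage (or the domain service it wraps) receives an Option<ConsensusMetrics>,
// so reporting is a no-op when Telemetry is disabled.
pub fn report_validated_header(metrics: &Option<ConsensusMetrics>, slot: u64) {
    if let Some(m) = metrics {
        m.record_validated_header(slot);
    }
}
```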
I have spent time since the beginning of the week polishing the "Amaru as a (primitive) relay" demo for next Friday, which led me to implement a few minor features and investigate a couple of issues.
- Simulation tests were failing on my machine (#278), and only on my machine. After spending some time comparing the `pure-stage` traces as discussed previously, it became clear the simulation engine was sometimes not sending messages in the right order, which led to chain validation failures. SA independently investigated the issue himself and this effort led to PRs #293 and #294:
  - Generated input messages are sorted by (planned) arrival time, which was an `Instant::now()`; on a fast machine this could lead to 2 messages having the same timestamp and therefore to non-deterministic sorting.
  - The issue was fixed by introducing a delay for each message diffusion which follows an exponential distribution (a rough sketch of such a delay generator is shown after this list).
- In practice, `ChainSync` message arrivals follow 2 modes: when catching up, they arrive as a constant stream of messages and therefore at the speed of the network, whereas when at tip, they arrive at a random interval depending on block creation time and block diffusion. The former follows an exponential distribution with parameter λ = 1/20 corresponding to the expected block creation frequency, whereas the latter depends on a propagation model which could be ΔQ-based.
- I initially thought the validation failure observed in simulation testing and in the 2-node setup could be caused by a race condition between child validation and parent storage, but this wasn't the case. It nevertheless led me to reorganise the pipeline to store the received headers before validating them (see #295). This is currently done inefficiently and somewhat unsafely, as we probably want to avoid storing headers multiple times and to detect known invalid headers early.
- I started trying amaru-doctor to investigate the state of each of the nodes. Weirdly, the upstream node doesn't appear to have stored the header it has sent, nor any other. Or rather, I could not find it using amaru-doctor, and I am unsure whether this is a problem with the tool or with the DB.
- I then resorted to running the 2 nodes, dumping their logs as JSON:

  ```sh
  AMARU_SERVICE_NAME=amaru-node1 AMARU_OTLP_SPAN_URL=http://clermont:4317 AMARU_OTLP_METRIC_URL=http://clermont:4318/v1/metrics cargo run -- --with-json-traces daemon --network preprod --ledger-dir node1/ledger.db --chain-dir node1/chain.db --peer-address clermont:3001 --listen-address localhost:3000 > node1.log
  ```
  - Note that sending the logs to a file is annoying right now because the output is mangled with ANSI escape characters. `ripgrep` colorizes the output and adds line numbers only in the terminal, which is quite useful.
  - Noticed that `stage.ledger` does not trace the header hash and slot, so it's hard to correlate those traces; I added some fields to the instrumentation there.
  - Ultimately, we could observe a gap in `forward_chain` between the faulty child's operation handling and the previous operations:

    ```json
    {"timestamp":"2025-06-24T08:28:24.030110Z","level":"DEBUG","fields":{"message":"got op Forward { header: (2674874, (70264698, c3baea7ec628ae2da91bbae6e11c724b2103b2fe9ae741c063c0e78285c7e5cc)), tip: (2674874, (70264698, c3baea7ec628ae2da91bbae6e11c724b2103b2fe9ae741c063c0e78285c7e5cc)) }"},"target":"amaru::stages::consensus::forward_chain::client_protocol"}
    {"timestamp":"2025-06-24T08:28:24.030134Z","level":"DEBUG","fields":{"message":"got op Forward { header: (2677966, (70349466, ce251a72e20cd6202f217f45e0542ac5c92446901dd040ccb019f126441ea12f)), tip: (2677966, (70349466, ce251a72e20cd6202f217f45e0542ac5c92446901dd040ccb019f126441ea12f)) }"},"target":"amaru::stages::consensus::forward_chain::client_protocol"}
    ```

    Between slots 70264698 and 70349466 that's 84768 slots, and more than 3100 headers that do not appear in the `sending op` traces. Listing a few hashes between those 2 bounds:
    - 57d7445b2f2b18dea80664111d4ddb56c78bcb3d7486b1a1a43ca8ae160c8445
    - a9a9043106a384b2dc49f89b3eaed75097d446a1fffd98876c36edebb46224d1
    - 9bb187043f99e501c72fa99b37d64522da33d9313f16e4288c3294958b195c17

    and grepping the logs shows that they never get handled in the `client_protocol` actor.
- I commented on the issue and asked for RK's help.
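As an aside, here is a rough sketch of drawing per-message diffusion delays from an exponential distribution, as mentioned above (using `rand`/`rand_distr`; the λ = 1/20 rate mirrors the expected block interval, and the actual parameters used by the generator may differ):

```rust
use rand::{rngs::StdRng, SeedableRng};
use rand_distr::{Distribution, Exp};
use std::time::Duration;

fn main() {
    // Seedable RNG so the generated schedule stays reproducible for a given seed.
    let mut rng = StdRng::seed_from_u64(42);

    // Rate λ = 1/20: one event every 20 seconds on average.
    let exp = Exp::new(1.0 / 20.0).unwrap();

    // Draw a positive delay for each generated message so that two messages
    // (almost) never end up with the exact same planned arrival time.
    let delays: Vec<Duration> = (0..5)
        .map(|_| Duration::from_secs_f64(exp.sample(&mut rng)))
        .collect();

    println!("{delays:?}");
}
```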
This experience led me to the realisation that our metrics story for the consensus and stages pipeline wasn't great, so I started implementing a `ConsensusMetrics` structure and threading it through the pipeline to report various metrics, following the guidelines from EDR#7.
Discussing integration of @scarmuega's work on peer-to-peer networking into amaru:
- Pallas primitives could be used in another context -> generic interfaces to the network
- scope of the p2p pallas work?
- Network v2 provides a strict boundary between actual IO in the network and state
  - "business logic" communicates with state through commands and events
  - there's one piece of state for each pair of mini-protocol and peer
  - note that the address is part of the state for each peer
- The central concept is the Behaviour trait, which exposes a limited set of functions invoked by the manager (see the sketch after this list). Network2 strives to provide opinionated standard behaviours in `standard`
  - for example, the `InitiatorCommand` groups several mini-protocols in a single interface
  - behaviours are composable
  - the standard behaviour takes care of the p2p peer selection logic
  - how much distance should there be between the standard behaviour and an amaru-specific behaviour? How much of the standard behaviour can we reuse?
  - `schedule_io`/`apply_io` are called by the manager to signal incoming events or requests. Interestingly, the sent message is also passed back to `schedule_io`
- To deal with the initiator/responder distinction, there could be 2 different behaviours: 1 initiator and 1 responder
  - But note that the Haskell node will try to upgrade a single stream connection to full duplex depending on how the connection is initiated (initiator-only vs. initiator-and-responder)
  - Hence it makes sense to have a single behaviour for responder/initiator but with a composable approach
- what about backpressure?
  - could be handled by effects or by bounding the queue
- What about disconnects?
  - when getting a `PeerDisconnected` while a message is in flight, we have to tear down the state (and resend the message)
- next steps:
  - show @scarmuega around the Amaru code
  - implement the initiator side of Amaru networking (e.g. the `fetch_header` stage) using the new library
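To make the discussion concrete, here is a purely hypothetical sketch of what such a Behaviour trait could look like; the names and types are invented for illustration and do not reflect the actual pallas network v2 API:

```rust
use std::time::Instant;

// Stand-in types for illustration only.
pub struct PeerId(pub u64);

pub enum InitiatorCommand {
    RequestNextHeader(PeerId),
    FetchBlock(PeerId, Vec<u8>),
}

pub enum IoEvent {
    BytesReceived(PeerId, Vec<u8>),
    // The original outbound message is handed back so the behaviour can
    // update its per-peer state (or resend it after a disconnect).
    SendCompleted(PeerId, Vec<u8>),
    PeerDisconnected(PeerId),
}

pub enum IoAction {
    Send(PeerId, Vec<u8>),
    Dial(String),
}

pub trait Behaviour {
    /// Business logic asks for something to happen (e.g. fetch a header).
    fn handle_command(&mut self, cmd: InitiatorCommand);

    /// The manager signals that some IO happened or completed.
    fn apply_io(&mut self, event: IoEvent);

    /// The manager asks which IO should be scheduled next.
    fn schedule_io(&mut self, now: Instant) -> Vec<IoAction>;
}
```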
Spent the morning with RK, SA and AB troubleshooting a simulation test failure surfacing only on a specific machine (now reported as issue #278).
- First tried to reliably reproduce the observed failure (on macOS)
- This led us to add the ability to pass a `seed` value to the test generator. We need to be able to set that on the command line but it's currently only settable in code
- The failure is not deterministic even when setting the seed, so this should come from the underlying scheduling. We then wanted to compare the trace of execution of a successful and of a failed test.
  > [!WARNING]
  > "trace" is heavily overloaded in that context and we should find a way to distinguish between the various usages of the word:
  > - the trace of messages the simulation tester uses for simulation testing the SUT, and which is dumped upon failure
  > - the "trace" of log messages that's printed during test execution and comes from the `tracing` system
  > - the trace from the `TraceBuffer` that collects all inputs and outputs in the stage-graph runner
- We then wanted to record and compare the stage-graph execution traces, so we added to the simulation the code to capture the trace for all instances and dump those to a file (see the sketch after this list):
  - the `TraceBuffer` is shared through an `Arc<Mutex<TraceBuffer>>` reference and cloned for each node instance we launch
  - output is recorded in a timestamped file in raw CBOR format
- Once we got our hands on both a success and a failure dump, we realised our tooling was inadequate to easily compare those traces:
  - raw CBOR is useless so we need to transform it. We first used rq to generate JSON out of CBOR:

    ```sh
    rq -c -J < simulation/amaru-sim/success-1750320738.trace | jq -c
    ```
  - this works fine but rq is not maintained anymore and some of the representations are not great (e.g. bytestrings become arrays of ints)
  - moreover, the content of the messages is very noisy, e.g. with full bytes of headers, decoded bodies, etc.
  - we ended up using cbor-bin, which provides some filtering capabilities and led to much simpler traces (see the issue linked above)
- While a bit frustrating, the session proved fruitful in pinpointing issues with test execution and in learning how to leverage the simulator's and stage-graph's capabilities, but also in highlighting how useful it can be to explore various execution paths and identify bugs
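Here is a small sketch of the sharing mechanism mentioned above; `TraceBuffer` is a stand-in type here, not the actual pure-stage one:

```rust
use std::fs::File;
use std::io::Write;
use std::sync::{Arc, Mutex};
use std::time::{SystemTime, UNIX_EPOCH};

// Stand-in for the pure-stage TraceBuffer: the real one records every stage
// input/output; this one just accumulates raw CBOR entries.
#[derive(Default)]
struct TraceBuffer {
    entries: Vec<Vec<u8>>,
}

fn main() -> std::io::Result<()> {
    // One shared buffer, cloned (by reference) into each node instance we launch.
    let trace = Arc::new(Mutex::new(TraceBuffer::default()));

    for _node in 0..3 {
        let trace = Arc::clone(&trace);
        // Each node would hold its clone and push entries as it runs.
        trace.lock().unwrap().entries.push(vec![0x80]); // dummy entry: empty CBOR array
    }

    // On completion (or failure) dump the whole buffer to a timestamped file.
    let stamp = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().as_secs();
    let mut out = File::create(format!("simulation-{stamp}.trace"))?;
    for entry in &trace.lock().unwrap().entries {
        out.write_all(entry)?;
    }
    Ok(())
}
```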
This is a summary of what SA and RK concluded so far about what a good API should be.
Context: since the pure-stage library allows for modelling several connected stages that might share memory, one cannot simply assume that the processing of one message is a discrete event that produces output messages instantly (which is a simplification the current Haskell simulator makes). So the simulator needs to orchestrate not only the network, but also the thread scheduler, so to speak.
Here's a sketch of the algorithm for the simulator:
1. Ask all nodes when the next interesting point in time is;
2. Pop the world heap to figure out which message to deliver next;
3. If there are no interesting times from step 1, or they all happen after the next message is supposed to be delivered, then advance the time to the arrival time of the message on all nodes and deliver the popped message.
   Otherwise put all the interesting times into the world heap and recurse. (We need to make it so that if we ask for the next interesting points in time in step 1 again, the nodes return an empty list?)
So the API should be something like:

```
handle : ArrivalTime -> Message -> NextArrivalTime -> Vec<Message>
nextInteresting : Time
advanceTime : Time -> Vec<Message>
```

where `NextArrivalTime` is when the next message (after the one we just popped) arrives, the idea being that the node we are currently stepping is free to run all its yield points until that time (since nothing interesting is happening in the world until then anyway).
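Translated into Rust, that API could look roughly like the following; all names and types here are stand-ins, not the actual pure-stage interface (in particular `next_interesting` returns an `Option` to cover the "empty list" case mentioned above):

```rust
use std::time::Duration;

/// Simulated time since the start of the test run.
type Time = Duration;

#[derive(Clone, Debug)]
struct Message {
    destination: String,
    payload: Vec<u8>,
}

trait SimNode {
    /// Deliver `msg` at `arrival`; the node may run all its yield points up to
    /// `next_arrival`, returning every message it emitted in that window.
    fn handle(&mut self, arrival: Time, msg: Message, next_arrival: Time) -> Vec<Message>;

    /// The next point in simulated time at which this node wants to be woken
    /// (e.g. a timer firing), if any.
    fn next_interesting(&self) -> Option<Time>;

    /// Advance this node's clock to `now`, returning messages produced by
    /// timers or other internal work that became due.
    fn advance_time(&mut self, now: Time) -> Vec<Message>;
}
```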
At last we were able to have 2 Amaru nodes connected, with one node using the other node as its (only) peer:
```mermaid
graph LR;
    amaru1 --> amaru2;
    amaru2 --> cardano-node;
```
This mostly involved implementing the `block_fetch` server in the `forward_chain` stage of the consensus.
There is a Draft PR in progress which requires a bit of polishing before merge:
- add some tests for the `block_fetch` logic
- document the use case in the README
We can make `bootstrap` in order to query all the data needed for an Amaru node to start syncing on `preprod`, which means:
- download the gzipped CBOR-encoded ledger state, aka snapshots, from some external resource
- update the nonces for the snapshots' tip
- download enough headers (at least the tip) to get started syncing from another node
I have packaged all those steps into a single command that takes a network identifier and various optional parameters and generates a node's database into a single directory. The goal is to provide a unified interface that does not expose the details of the bootstrapping process to the user, as ultimately we would like to:
- support bootstrapping an Amaru node from its own generated snapshots
- support bootstrapping from Mithril snapshots
- easily create testnets
The process was relatively straightforward: I only had to refactor a bit the `import_xxx.run` functions to avoid using `Args` structures, which in the end might not be that smart a move. I only had issues implementing the decompression for the `import_headers` function, because of identical names from different modules or crates being available in the environment:
- the `async-compression` crate is used to wrap compression libraries into async IO functions, which originally were supported directly in those libraries (e.g. flate2) but whose support was removed a while ago
- this library supports 2 different async IO ecosystems, `futures-io` or `tokio`, both of which expose an `AsyncBufRead` trait, enabled through feature flags
- I of course initially activated the wrong feature flag, which led me to spend a couple of hours scratching my head over type errors. I was able to sort out the issue thanks to RK's help!
The introduction of `reqwest` led to a dependency on another library with an unsupported license:
```
% cargo deny check -- license
error[rejected]: failed to satisfy license requirements
  ┌─ registry+https://github.com/rust-lang/crates.io-index#webpki-roots@0.25.4:4:12
  │
4 │ license = "MPL-2.0"
  │            ━━━━━━━
  │            │
  │            license expression retrieved via Cargo.toml `license`
  │            rejected: license is not explicitly allowed
  │
  ├ MPL-2.0 - Mozilla Public License 2.0:
  ├   - OSI approved
  ├   - FSF Free/Libre
  ├   - Copyleft
  ├ webpki-roots v0.25.4
  └── reqwest v0.11.27
      └── amaru v0.1.0
          └── amaru-sim v0.1.0
licenses FAILED
```
This was solved simply by adding the license to the `.cargo/deny.toml` file.
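For reference, the change amounts to something along these lines (the other entries in the allow list are made up here; only the `MPL-2.0` addition matters):

```toml
# .cargo/deny.toml (excerpt): explicitly allow MPL-2.0 so webpki-roots passes the check
[licenses]
allow = [
    "Apache-2.0",
    "MIT",
    "MPL-2.0",
]
```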
Been having some discussions about observability and metrics with AB and RK, thought I'd jot down some thoughts about observability that are also useful for testing.
If we store the arrival time of when an item got enqueued to the inbox of a stage, we can calculate the latency (the time of being latent) by taking the current time when the stage dequeues it from the inbox and subtracting the arrival time.
We can then process the item and take the current time again after that; the difference is the service time.
The response time is latency + service time.
These three times can be stored in a histogram that isn't sampling, for example this. It's important that the histogram isn't sampling, because then we might lose the tails of the distributions. Prometheus samples and requires you to run a separate process to scrape the metrics, which is a pain when testing locally.
Other stuff to store per stage: queue length, total processed items, idle time.
We can then use Little's law from queuing theory to calculate, for example, the throughput or utilisation.
Having all these statistics inside data structures per stage basically gives us a built-in profiler. We can then expose the stats in different ways, e.g. let Prometheus sample them. Or build a visualiser similar to what they recently built for Clojure's async flow which pings each stage and asks for the stats and visualises them in real time.
(Also if we add traces that can be replayed to put the system back to a state when an error happened, it will probably be useful to have OS-level metrics from that time. Because the error might depend on something that happened in the environment that the trace cannot reproduce.)
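To make the idea concrete, here is a minimal sketch of an instrumented inbox; the types are made up (not the pure-stage ones) and a real implementation would feed the durations into a non-sampling histogram rather than keeping running totals:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

// An item is timestamped when enqueued so latency can be measured on dequeue.
struct Timed<T> {
    item: T,
    enqueued_at: Instant,
}

#[derive(Default)]
struct StageStats {
    processed: u64,
    total_latency: Duration, // time spent waiting in the inbox
    total_service: Duration, // time spent being processed
}

struct Inbox<T> {
    queue: VecDeque<Timed<T>>,
    stats: StageStats,
}

impl<T> Inbox<T> {
    fn enqueue(&mut self, item: T) {
        self.queue.push_back(Timed { item, enqueued_at: Instant::now() });
    }

    /// Dequeue one item, run the stage body on it, and record latency and
    /// service time; response time is simply the sum of the two.
    fn process<R>(&mut self, body: impl FnOnce(T) -> R) -> Option<R> {
        let timed = self.queue.pop_front()?;
        let start = Instant::now();
        self.stats.total_latency += start - timed.enqueued_at;
        let result = body(timed.item);
        self.stats.total_service += start.elapsed();
        self.stats.processed += 1;
        Some(result)
    }
}
```

From stats like these, Little's law (L = λ·W) then lets us derive, for example, the average number of in-flight items from the arrival rate and the mean response time.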
Finished chain sync message generator that uses a pre-generated block tree in Rust. Also paired with AB and RK to port the consensus pipeline, I think once that's done we can start doing basic simulation testing fully in Rust.
In parallel I've been trying to simplify and debug a Haskell prototype of a simulator that can do network-level fault injection. My hope is to try out some ideas there, so that by the time the Rust simulator catches up we can move over the ideas that work with minimal effort.
Discussion on stages PR
- The PR improves the previous way to define and link stages through various ergonomics tweaks, most notably in the way initial state is handled
  - We can now define breakpoints, which are just predicates over messages sent to a stage defining when it should stop
  - We can also override external effects, also using a function that can intercept specific messages and handle them
- `ExternalEffect` came out of the desire to be able to use async/await code outside of stages, e.g. to call into other effectful code, while still keeping track of side effects
  - In practice, when we want to ensure determinism in tests we can either resort to swapping a specific effect or inject an alternative implementation
- We decided to merge it and continue refactoring the amaru-sim code to use it, in order to fully validate that this is the way to go for the "full monty". The amaru-sim code only implements 3 stages but has to deal with storage
- We also discussed the main next step in this line of work, namely recording effects
- Recording effects will be very useful to troubleshoot production issues in a simulated environment: we could just replay the recorded trace in order to examine the state of the system
- In simulations, it's enough to record only inputs as everything is deterministic, so outputs naturally derive from inputs (e.g. command sourcing)
- Messages need to be serialised to be recorded, and obviously this has to be very efficient. Current best idea is to store them as CBOR bytes in a ring buffer, and persist them only on demand or in testing
- We discussed format and how to best handle serialisation for trace recording and human consumption. serde being the de facto standard in the Rust world and providing a unified approach to serialisation allowing for different consumers, we agreed it would be great to leverage it as much as possible.
- We could reuse the same serde `Serialize`/`Deserialize` traits to generate both CBOR and JSON, the latter being more amenable to human consumption, or possibly even other formats (see the sketch below).
- One of the shortcomings of minicbor is the need to write our own `Encoder`/`Decoder` implementations, which is quite annoying. It seems there exist options to reuse serde deriving macros, which would remove quite a lot of boilerplate.
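As a small illustration of that idea, the same derived traits can feed both a CBOR and a JSON encoder. `Message` is a made-up type, and `ciborium` is used here as one serde-compatible CBOR implementation (not necessarily the one we would pick):

```rust
use serde::{Deserialize, Serialize};

// serde derives give us both encodings "for free" (requires the "derive" feature).
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Message {
    slot: u64,
    header_hash: String,
}

fn main() {
    let msg = Message { slot: 70264698, header_hash: "c3baea7e".into() };

    // Compact CBOR bytes, e.g. for the ring buffer / trace recording...
    let mut cbor = Vec::new();
    ciborium::ser::into_writer(&msg, &mut cbor).unwrap();

    // ...and JSON from the very same derives, for human consumption.
    let json = serde_json::to_string_pretty(&msg).unwrap();
    println!("{} bytes of CBOR\n{json}", cbor.len());

    // Round-trip back from CBOR to check nothing is lost.
    let decoded: Message = ciborium::de::from_reader(cbor.as_slice()).unwrap();
    assert_eq!(decoded, msg);
}
```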