log :: 2025-02
Tip: VRF, Praos Nonce, Ouroboros, Chain Generation, Simulation Testing
These logs introduce VRF and Praos nonce updates (including test vectors), discuss reworks in the ouroboros crate for better validation logic, and outline ongoing simulation testing efforts (covering deterministic event loops, faults, workloads, etc.).
They also detail chain generation strategies for consensus testing, along with refinements in the consensus <-> ledger pipeline to handle block validation, rollback, and error handling.
The overarching themes involve improving correctness, determinism, and parallelism in both the consensus design and testing frameworks.
Researched the peculiarities of both versions (Praos and PraosBatchCompat) in cardano-base. Created a PR in cardano-base aiming to add golden tests for both VRFs -> https://github.com/IntersectMBO/cardano-base/pull/519
Also checked their implementations against the drafts they were based on: ver03 and ver13 for Praos and PraosBatchCompat, respectively.
I have also taken the generated test vectors from cardano-base and incorporated them into the txpipe/vrf repo -> https://github.com/txpipe/vrf/pull/3 for future, proper testing.
The highlight of the week is that I finished documenting how the main loop in the simulator works.
I also started a discussion about what the workload (client requests and properties) of the system is supposed to be. Since there doesn't seem to be a clear answer to this question yet, this is likely what I'll work on documenting next.
I also stumbled across a post about simulation testing in Rust, where somebody claims that using the "sans-io" technique is key. I don't fully understand sans-io, and the link the post refers to doesn't explain it too well, in my opinion. Here's another post that does a better job, but I still can't tell what exactly this would simplify/enable in a simulator.
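To make the idea concrete for myself, here is a minimal sketch of how I currently understand "sans-io" (this is my own illustration, not taken from either post): the protocol logic is a pure state machine that consumes events and emits commands, while all actual I/O, timers, and randomness live in a driver outside. The same logic can then be driven by a real network or by a deterministic simulator.

```rust
// Events fed into the state machine by the driver (network, timers, ...).
#[derive(Debug, PartialEq)]
enum Event {
    BytesReceived(Vec<u8>),
    TimerFired,
}

// Commands the state machine asks the driver to perform on its behalf.
#[derive(Debug, PartialEq)]
enum Command {
    Send(Vec<u8>),
    SetTimer(u64),
}

// A toy "echo" protocol: no sockets, no clocks, fully deterministic.
struct EchoProtocol;

impl EchoProtocol {
    fn handle(&mut self, event: Event) -> Vec<Command> {
        match event {
            // Echo received bytes back to the sender.
            Event::BytesReceived(bytes) => vec![Command::Send(bytes)],
            // Re-arm the timer when it fires.
            Event::TimerFired => vec![Command::SetTimer(1_000)],
        }
    }
}
```

If this reading is right, what it would buy a simulator is that the driver loop (which owns all non-determinism) can be swapped out without touching the protocol logic.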
When applying a block, we must compute the new evolving nonce:

- Extract the block's VRF output and prefix it with the UTF-8 character "N" ('N' for Nonce):

  tagged_vrf = "N" ++ block.vrf_output

- Compute $η$ (eta) from a range-extension of the tagged VRF output:

  eta = blake2b_256(blake2b_256(tagged_vrf))

- Combine eta with the current evolving nonce to obtain the new one:

  blake2b_256(block.evolving_nonce ++ eta)

- If block.slot + randomness_stabilization_window[1] < next_epoch.first_slot
  - then: set the candidate nonce to the newly computed evolving nonce
  - else: keep the candidate nonce unchanged

- Retain the block header's ancestor hash[2].
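The steps above can be sketched as follows. Note that `blake2b_256` here is a stub stand-in (NOT a real BLAKE2b-256 implementation, which a real node would use), and the struct/function names are mine, purely for illustration:

```rust
// STUB: placeholder for a real blake2b-256 digest, kept std-only so the
// sketch is self-contained. Do not use this as a hash function.
fn blake2b_256(data: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        out[i % 32] ^= b.wrapping_add(i as u8);
    }
    out
}

struct NonceState {
    evolving: [u8; 32],
    candidate: [u8; 32],
}

fn apply_block(
    state: &mut NonceState,
    vrf_output: &[u8],
    slot: u64,
    stabilization_window: u64,
    next_epoch_first_slot: u64,
) {
    // 1. Prefix the block's VRF output with "N" ('N' for Nonce).
    let mut tagged_vrf = b"N".to_vec();
    tagged_vrf.extend_from_slice(vrf_output);

    // 2. Range-extension: eta = blake2b_256(blake2b_256(tagged_vrf)).
    let eta = blake2b_256(&blake2b_256(&tagged_vrf));

    // 3. Combine eta with the current evolving nonce.
    let mut combined = state.evolving.to_vec();
    combined.extend_from_slice(&eta);
    state.evolving = blake2b_256(&combined);

    // 4. Only update the candidate nonce outside the stabilization window.
    if slot + stabilization_window < next_epoch_first_slot {
        state.candidate = state.evolving;
    }
}
```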
When crossing an epoch boundary, the final epoch nonce can be computed from:
- the candidate nonce; and
- the blake2b-256 hash digest of last block's ancestor header from two epochs ago.
If we refer to the current epoch as e, then we have roughly:

nonce(e) = blake2b_256(candidate_nonce(e - 1) | blake2b_256(last_block(e - 2).ancestor.header))
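The epoch-boundary computation is small enough to sketch directly (again with `blake2b_256` as a stub stand-in, not a real hash, and invented function names):

```rust
// STUB: placeholder for a real blake2b-256 digest; illustration only.
fn blake2b_256(data: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        out[i % 32] ^= b.wrapping_add(i as u8);
    }
    out
}

// nonce(e) = blake2b_256(candidate_nonce(e - 1)
//            | blake2b_256(last_block(e - 2).ancestor.header))
fn epoch_nonce(candidate_nonce_prev: &[u8; 32], ancestor_header: &[u8]) -> [u8; 32] {
    let mut input = candidate_nonce_prev.to_vec();
    input.extend_from_slice(&blake2b_256(ancestor_header));
    blake2b_256(&input)
}
```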
Note
[1] About randomness stabilization window
The randomness stabilization window corresponds to the number of slots before the next epoch's start during which VRF outputs do not contribute to the epoch randomness. This ensures that the randomness for the leader election is somewhat stable when the epoch starts (such that an adversary attempting to game the randomness at the end of an epoch cannot influence the schedule for the next epoch).
From Conway onwards, its value is set to:
From Shelley to Babbage, this value was set to
[2] About LABNonce
The Haskell consensus also keeps track of and rotates a value called the 'LABNonce', where LAB stands for 'Last Applied Block'. It corresponds to the (blake2b-256) hash of the *previous/parent block header hash* of the last applied block. This nonce is only used when crossing the _next_ epoch boundary, combined with the candidate nonce for the epoch that just ended.

Technically, we only need the hash of the ancestor of the last block of the previous epoch; but we can't know in advance which block is going to be the last. So in practice, we keep a rolling hash around, which we then fix at the epoch boundary (i.e. when the current epoch changes, indicating that we processed the last block of the previous epoch).
- Been working again on a chain generator in the ouroboros-consensus repository, to be used as test data for consensus.
- I switched to a strategy whereby the generator tracks SPOs and all the chains that are generated along the way, indexing produced headers by their slot. The goal is to generate block trees, an artifact useful to represent the evolution of a chain and the potential adversarial behaviour a node has to cope with.
- It turns out I am having a hard time generating forks longer than 1 block, which is a testimony to the robustness of Ouroboros but still annoying. Browsing the consensus code, I noticed they actually implemented some chain generators in the context of Genesis, so it should be possible to reuse those for Amaru. This is the direction I will take now, as we really need to make progress on this.
I spent some time between yesterday and this morning re-shaping the ouroboros crate a bit. The name is somewhat of a misnomer, because in the end there's quite little about Ouroboros in this crate. Rather, it's (now) focused on three areas:
- A module wrapping the crypto related to Verifiable Random Functions (VRF)
- A module wrapping the crypto related to Key Evolving Signatures (KES)
- A module performing different validations on a block header
The latter is arguably the "ouroboros" part. The previous design was somewhat unclear: it instantiated a "BlockValidator" wrapping a block header and provided a method to validate it along multiple dimensions (valid VRF, KES, certificates, plus a few static checks), dispatching sub-validations to internal functions.
The rework mainly consisted in exposing those internal functions as the chosen default API, while still providing a top-level assert_all entrypoint that bundles all validations together (and can choose, for example, to run them in parallel).
Doing so, I have made the errors (1) more granular and scoped to each kind of validation, and (2) more strongly typed, where before they were mostly namespaces with string values.
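The shape of the result looks roughly like this (a hypothetical sketch: the error variants, the Header fields and the stub checks are all made up for illustration, not the actual Amaru API):

```rust
// One scoped, strongly-typed error per kind of validation.
#[derive(Debug, PartialEq)]
enum VrfError {
    InvalidProof,
}

#[derive(Debug, PartialEq)]
enum KesError {
    InvalidSignature,
}

// The top-level error stays granular: it tells you which validation failed.
#[derive(Debug, PartialEq)]
enum ValidationError {
    Vrf(VrfError),
    Kes(KesError),
}

// Stand-in for a real block header; the booleans replace real crypto checks.
struct Header {
    vrf_ok: bool,
    kes_ok: bool,
}

// Each validation is exposed on its own, as part of the default API.
fn assert_vrf(header: &Header) -> Result<(), VrfError> {
    if header.vrf_ok { Ok(()) } else { Err(VrfError::InvalidProof) }
}

fn assert_kes(header: &Header) -> Result<(), KesError> {
    if header.kes_ok { Ok(()) } else { Err(KesError::InvalidSignature) }
}

// Top-level entrypoint bundling all validations together.
fn assert_all(header: &Header) -> Result<(), ValidationError> {
    assert_vrf(header).map_err(ValidationError::Vrf)?;
    assert_kes(header).map_err(ValidationError::Kes)?;
    Ok(())
}
```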
Besides, more work than I anticipated went into the vrf and kes modules, which felt somewhat inconsistent and unidiomatic. So I've opted to embrace some standard Rust traits such as From, TryFrom, AsRef and Deref as the basis for the API. In the end, those modules feel like they shouldn't exist within Amaru but should be upstreamed to Pallas and become the default API there.
Shelved the work on coverage and coverage guidance for now, because with the merge of AB's PR (https://github.com/pragma-org/amaru/pull/114/files), the work to combine his work and mine has taken priority. During the implementors' meeting we also decided to demo something in this regard on the next demo day.
AB mentioned that he still finds the implementation of the test driver unclear, so I started thinking about how to document it (a natural continuation of my previous posts).
We also had a meeting about doing a trial with Antithesis. Seems everybody thinks it's a good idea and we'll go ahead with it.
In the light of this, as well as me starting to work more closely with the consensus node, I got a question:
What is the workload (request generator and property) that is supposed to be checked?
In addition, and I raised this in the chat on two occasions:
What are the faults under which the above workload is valid?
I'm not saying we need the answer to these questions now, but it would be good to start working on defining these things in detail. Note that both Jepsen/Maelstrom and Antithesis have the concept of workload and faults, so these questions need to be answered if we want to get further with those tests.
The current plan for the demo is to use a generator which picks a path through a pregenerated "block tree" (because block forging isn't implemented yet) and have the property be: the "tip" we end up on is the same as the block we last picked in the generator.
This feels a bit off to me, because we are writing a system test without having the whole system. Also, the clients making the generated requests are the "block forging" nodes, rather than something that looks like a real client (from Jepsen's perspective), i.e. something that uses the system as a black box (sends an input and gets an output). The property we are checking is a liveness property, but what is it for? Liveness properties are typically there to support some safety property. What is the safety property?
I understand that maybe a safety property doesn't make sense at the current stage of the implementation (e.g. without "block forging"), but when I asked AB to sketch a future scenario where the whole system is implemented (what would a safety property be then, i.e. what guarantee can we give a client/user of the system that is useful for them to build something on top of?), I got:
The core property we need to test for is really the consensus property (or rather properties), the main one being: all honest nodes in the network ultimately converge to the same chain in no more than k blocks, given there's less than 50% adversarial stake
Which again is a liveness property, which by itself seems useless to a client/user. Or in other words: given this core property, what minimal use case can a client construct?
Compare this with:
- https://jepsen.io/analyses/tendermint-0-10-2 (comments)
- https://jepsen.io/analyses/radix-dlt-1.0-beta.35.1 (comments)
If you look under "test design", you see that in the Tendermint case the workloads were a "CAS register" and a "set of values"; these are standard workloads that already exist in Jepsen and are shared to test many different databases. For Radix, a sort of "bank account transfer" workload was used, which could be translated to the standard "list-append" workload.
On top of the guarantee that we have a working CAS register we can start building useful things, i.e. more complicated atomic data structures. Or if we know that we can transfer funds between accounts without creating or destroying funds in the process, then that's also useful as a basis to build upon.
I don't see what the equivalent in the Amaru case is. Again it doesn't have to be possible to run the workload with the current implementation, but where are we heading? What workload should eventually be possible to run?
I finished documenting the sources of non-determinism in the Maelstrom/Jepsen approach and sketched a high-level plan on how to fix the problem:
Other than that, I presented what I've done so far at the Amaru demo, and I reviewed Arnaud's first PR that tries to connect what he has done on consensus with what I've done on simulation testing.
I also started on the last Maelstrom example: Raft. I had to extend the node language with a PRNG (for random sleeps to avoid contention) and a way of getting the current time (to set/reset leader election deadlines).
Next up I'd like to work on adding a way to report coverage from within the SUT, and potentially use that to drive the generation (a la coverage-driven fuzzers).
Connecting simulation-testing and amaru (by AB)
Started work towards connecting simulation testing and amaru, first as a slow but "simple" out-of-process test environment that talks to the SUT through stdin/stdout.
I have created a new crate with a new executable amaru-sim for that purpose, and then spent some time trying to get the ingress part right. The maelstrom-node crate seems quite promising, as it already contains types and logic to pass messages around, connect nodes, run various parts asynchronously, etc. The protocol is quite simple, but I wasted some time fiddling with serde and how to deserialize the MessageBody, which can contain reserved fields plus any other fields for the payload.
Initially I had an enum Content with two variants, one for roll-forwards and one for rollbacks, but the MessageBody deserializer did not like the default layout, so I serialized those as untagged, which unfortunately does not work if there are multiple variants! I ended up with two different structs discriminated by a custom type field.
In the Observability ADR we advocated for the liberal use of spans to record execution traces for all Amaru components. While working on the multi-stage consensus, I realised that the way we are using them makes each stage its own silo, which is reflected in the Jaeger UI. It would be much more helpful to be able to trace and display all the steps across the various stages.
This was implemented by threading a Span structure along each Stage, which works but is quite inelegant. There should be a way to make that threading transparent to the developer, so that one does not have to pass it around in messages and could just pull it from the current stage.
Consensus pipeline has been split in two parts, with ledger's block validation in between, so it now looks like:
```mermaid
flowchart LR
    id1[receive & process header] --> id2[validate block]
    id2[validate block] --> id3[store & forward block]
```
The last stage is currently just used for logging the outcome of the block validation, but we'll next plug in block storage and peer forwarding, which is the most involved part, as we'll need to implement chain-sync and block-fetch protocol servers.
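For my own notes, the split can be pictured with a generic, std-only sketch (this is not the actual gasket implementation; the stage names and the placeholder validation predicate are invented): each stage runs on its own thread and hands work to the next over a channel.

```rust
use std::sync::mpsc;
use std::thread;

fn run_pipeline(headers: Vec<u64>) -> Vec<String> {
    let (tx1, rx1) = mpsc::channel();
    let (tx2, rx2) = mpsc::channel();

    // Stage 1: receive & process header (here: just feed them in).
    let stage1 = thread::spawn(move || {
        for header in headers {
            tx1.send(header).unwrap();
        }
        // tx1 is dropped here, closing the channel so stage 2 terminates.
    });

    // Stage 2: validate block (a placeholder predicate stands in for the
    // ledger's validation).
    let stage2 = thread::spawn(move || {
        for header in rx1 {
            let valid = header % 2 == 0;
            tx2.send((header, valid)).unwrap();
        }
    });

    // Stage 3: store & forward (currently: just log the outcome).
    let stage3 = thread::spawn(move || {
        rx2.iter()
            .map(|(h, ok)| format!("block {h}: {}", if ok { "stored" } else { "rejected" }))
            .collect::<Vec<_>>()
    });

    stage1.join().unwrap();
    stage2.join().unwrap();
    stage3.join().unwrap()
}
```

Channels preserve ordering per stage, so the outcome log is deterministic even though the stages run concurrently.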
A possible solution to the "error handling" problem we outlined yesterday would be to use some state-versioning scheme, similar to how MVCC works in databases. We could leverage the fact that headers contain a reference to their parent, and snapshot the state indexed by header hash:
- When a header $H_1$ is received, we validate it and compute a candidate chain, then download/validate the block, etc.
- The candidate chain $C_1$ is tracked in the ChainSelector as one possible future of the selected chain.
- When another header $H_2$ is received, there are several options:
  - the new header has $H_1$ as parent and $H_1$ processing has finished: we can proceed with $H_2$, creating a new candidate chain indexed by it
  - the new header has $H_1$ as parent and $H_1$ is still in the pipeline: this means we need to wait, so we retry the stage until $H_1$ is no longer in the pipeline
  - the new header does not have $H_1$ as parent: this means we have a possible fork; we proceed with $H_2$ processing, creating a new candidate chain indexed by it
- When header $H_1$ reaches the end of the pipeline, we need to atomically update the chain state from the candidate chain induced by it, depending on the longest-chain rule.
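A toy sketch of this versioning idea (all names and types here are invented for illustration; `Hash` stands in for a real 32-byte header hash): candidate state is snapshotted per header hash, and a header whose parent is still in flight must retry.

```rust
use std::collections::{HashMap, HashSet};

type Hash = u64; // stand-in for a real header hash

#[derive(Clone, Debug, PartialEq)]
struct Candidate {
    tip: Hash,
    length: u64,
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Proceed,
    Retry,
}

#[derive(Default)]
struct ChainSelector {
    selected: Option<Candidate>,
    candidates: HashMap<Hash, Candidate>,
    in_flight: HashSet<Hash>,
}

impl ChainSelector {
    // A new header arrives: snapshot a candidate chain keyed by its hash,
    // unless its parent is still being processed (then the stage retries).
    fn on_header(&mut self, hash: Hash, parent: Hash, length: u64) -> Outcome {
        if self.in_flight.contains(&parent) {
            return Outcome::Retry; // parent still in the pipeline: wait
        }
        self.in_flight.insert(hash);
        self.candidates.insert(hash, Candidate { tip: hash, length });
        Outcome::Proceed
    }

    // The header reached the end of the pipeline: atomically adopt its
    // candidate chain if the longest-chain rule says so.
    fn on_validated(&mut self, hash: Hash) {
        self.in_flight.remove(&hash);
        if let Some(candidate) = self.candidates.get(&hash) {
            let longer = self
                .selected
                .as_ref()
                .map_or(true, |s| candidate.length > s.length);
            if longer {
                self.selected = Some(candidate.clone());
            }
        }
    }
}
```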
S. suggested this is something Scott Wlaschin talked at length about, and that we make a small experiment to study the problem and how to solve it.
While working on this, I also tried to clean up and organise logs and spans in a better way. A small problem is that by splitting the pipeline into several stages, we lose the "automatic" span computations for the various steps, so we need to reconstruct that in some way, possibly by passing spans down through the various stages?
The main highlight for this week is that I finished documenting how Maelstrom works:
I've also added Maelstrom/FoundationDB style "workloads", which group generation of messages and the property to check into a separate construct.
I also implemented shrinking, so we can display minimal counterexamples when an error is found.
I also finished the slides for tomorrow's demo: https://excalidraw.com/#room=5a22d1da02dd75fff478,aRtMV0TItYb5nqGFqnYnWg
Finally, I had two meetings with Arnaud about pipelining consensus, which made me start thinking about:
- Can we extract a toy example which has the same characteristics?
- How can we introduce pipelining into the simulation without breaking determinism, especially if speculative execution is involved?
Started working on implementing block validation in the consensus pipeline. Currently, consensus just outputs the result of chain selection from the headers received (or rollbacks) as an event sent to the ledger, hence the actual validity of the block is not checked. The selected chain should only be considered a candidate until the block is deemed valid; then the chain can be selected and forwarded to downstream peers (e.g. followers).
- I leave aside the issue of consensus pipelining, which speeds up delivery of blocks by making them available to followers before they have been validated and requires specific conditions to be triggered.
The following diagram represents the main steps of the pipeline to validate a new block:

```mermaid
flowchart TD
    id1[receive header] --> id2[store header]
    id2[store header] --> id3[validate header]
    id3[validate header] --> id4[select chain]
    id4[select chain] --> id5[download block]
    id5[download block] --> id6[validate block]
    id6[validate block] --> id7[store block]
    id7[store block] --> id8[forward chain]
```
Currently, there's a single gasket stage for the whole of consensus, and the validation result of the block is ignored.
I want to implement the ledger's block validation as an intermediate processing stage, which implies splitting the current monolithic consensus process. This is relatively straightforward so far, but there's a particular challenge which I don't (yet) know how to solve with gasket: how to ensure that a failure to validate the block by the ledger cancels the whole pipeline?
In other words, the chain selection pipeline must be transactional: a failure in any step should roll back any change to the overall state. Moreover, this failure should be reported to the upstream network layer so that the "adversarial" peer can be sanctioned accordingly, usually by killing the connection.
An obvious option is to share some state between the various stages and have a downstream failure undo the changes to the state that happened before. This won't work if the steps are run in parallel, and will only lead to race conditions and concurrency bugs. In Haskell this issue is solved by working with Software Transactional Memory, but we don't have that available here!
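One direction I might explore (purely my speculation, not a decision) is optimistic commits over a versioned state, which gives a tiny slice of what STM offers: a stage reads a snapshot together with its version, and a commit only succeeds if nobody else committed in between; otherwise the stage must re-read and retry its step.

```rust
use std::sync::{Arc, Mutex};

// Shared state paired with a monotonically increasing version counter.
#[derive(Clone)]
struct Versioned<T: Clone> {
    inner: Arc<Mutex<(u64, T)>>,
}

impl<T: Clone> Versioned<T> {
    fn new(value: T) -> Self {
        Self { inner: Arc::new(Mutex::new((0, value))) }
    }

    // Snapshot the current value together with its version.
    fn read(&self) -> (u64, T) {
        self.inner.lock().unwrap().clone()
    }

    // Commit succeeds only if nothing changed since `seen_version`;
    // a stage observing `false` must re-read and retry its step.
    fn try_commit(&self, seen_version: u64, new_value: T) -> bool {
        let mut guard = self.inner.lock().unwrap();
        if guard.0 != seen_version {
            return false;
        }
        *guard = (seen_version + 1, new_value);
        true
    }
}
```

This only covers a single shared cell, not composable multi-variable transactions as STM does, so it is at best a starting point for the experiment.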