log :: 2025‐03
Tip

Simulation Testing, Deterministic Runtime, Stake Distribution, Ledger Validation, DReps
These logs detail new simulation testing progress (finalizing demos, forging a deterministic runtime, bridging with Amaru), improved ledger validation and data handling (DReps, multi-relations, reduced duplication), and refinements in telemetry/tracing.
They also describe efforts to remove hardcoded epoch/slot logic, integrate partial chain data for faster tests, and unify concurrency patterns.
Overall, this work aligns the system more closely with Haskell node implementations and sets the stage for more robust end-to-end testing.
This week has mostly been wrapping things up for the demo with AB tomorrow. Fixing small things, improving the UI, trying to find a small bug to inject to show what the test output looks like in case of a failure, and working on slides.
I also started planning a bit for what to do after the demo. I feel like if we want to take simulation testing seriously we need to start taking determinism seriously, and right now we don't have a plan for how to achieve determinism. We are building upon a foundation that simply isn't good enough for what we are saying we want to achieve. The longer we keep doing this the more expensive it will be to fix down the road.
Therefore I'm working on documenting the Maelstrom distributed system DSL and how to implement a deterministic runtime for it. Here's what I believe to be true:
- The DSL is expressive enough for the job, based on the fact that the Maelstrom DSL can be used to implement mini-Datalog and Raft (these are examples from the Maelstrom repo) and the fact that Kyle Kingsbury uses Maelstrom to teach distributed systems classes;
- The approach works, based on my prior experience in writing a simulator.
What I don't know (yet):
- Is the DSL too flexible for what we need? Or in other words: can we get away with implementing a subset of the DSL? I don't understand the Amaru project well enough to tell at this point;
- I know how to do it if everything is single-threaded, but for performance we'll want to introduce parallelism. I have some ideas about how to introduce parallelism without breaking determinism for some subsets of the DSL, but I can't tell if this will be flexible enough or performant enough.
Next week, after the demo, when I've hopefully managed to write something coherent about the topic, I'll share it and try to get a conversation going.
Simulation tests do not work with real network data, as we want to be able to produce "interesting" synthetic chains and blocks to test the behaviour of the node. But we still want to validate headers and blocks correctly, and of course detect invalid headers and blocks. The `slot-arithmetic` crate existed for that purpose but was so far unused, so what I did was:
- Add a `Testnet` item to `NetworkName`, so that one can run amaru (and the simulator) without historical data. The `Testnet` is parameterised by a network magic number.
- Also added the capability to load an `Era` from JSON; this could be useful for testing or when we want to run amaru on other networks. For example, we could run amaru in a testnet along with cardano-nodes and bootstrap it with their era history.
- Convert the JSON era history for preprod into a static `EraHistory` object that can be used everywhere.
- Ensure that one can convert from a `NetworkName` to an `EraHistory`.
  - This is only implemented for `Preprod` and `Testnet`s, with the latter being hardcoded to a single era with 1000 epochs.
- Moved various hardcoded functions from `amaru-kernel` to `slot-arithmetic`, into the `EraHistory` where they really belong.
- Threaded an `EraHistory` across the various layers, creating it from the `NetworkName` passed to `main` as an argument and then passing it to the various layers that need it, most notably:
  - `ledger/state.rs`, which can then pass it on to other substructures
  - `rocksdb` and the consensus store, which need it to compute nonces correctly
- I hit a small snag with the examples that were not compiled locally and failed on CI. It's nice there's a `make all` command that ensures everything compiles correctly.
- I found the (ab)use of `u64` values to represent slots, epochs, or time a bit confusing, so I tried to implement a newtype pattern in Rust, but quickly gave up as it gave rise to way too much hand-written code (a minimal sketch of what this could look like follows this list).
- I tested running moskstraumen against the new `simulator` binary and it worked fine: no need to hardcode values for slot conversions!
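For illustration, here is a minimal sketch (hypothetical names, not the actual amaru types) of the kind of newtype wrappers mentioned above, and of the hand-written boilerplate each operation ends up requiring:

// Hypothetical newtype wrappers around u64; not the actual amaru types.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Slot(pub u64);

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Epoch(pub u64);

impl Slot {
    // Every arithmetic operation has to be spelled out (or derived via extra
    // crates), which is the boilerplate that made the approach unattractive.
    pub fn elapsed_since(self, other: Slot) -> u64 {
        self.0.saturating_sub(other.0)
    }
}

impl From<u64> for Slot {
    fn from(n: u64) -> Self {
        Slot(n)
    }
}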
I spent some time working on clarifying the tracing implementation and making sure the overall span traces make sense and are useful.
The first step was to favour smaller functions and leverage the `instrument` attribute: manual `span` creation is too error prone and significantly obfuscates the code.
Another step was to remove `parent` from spans, as they should be inherited. This removes the need to pass spans around, cleaning up the API.
The one thing needed to enable this is to make sure parent spans are kept when crossing threads. gasket handles this internally, so without modifying gasket this has to be done manually on the amaru side when crossing stages.
Ultimately, the TL;DR when adding new traces (illustrated by the sketch after this list) is:
- only use `instrument` with `skip_all`
- explicitly add `fields`
- make sure `fields` are JSON-compliant strings
- no explicit `target`
- new fields can be added dynamically to the current span (created by the closest `instrument`), but they have to be listed in `instrument`
- events can be added via the `trace!` (and family) macros
- do not abuse `instrument`, as it clogs the system and makes it harder to understand the flow
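A minimal sketch of these guidelines, assuming the `tracing` crate and a hypothetical `validate_header` function (the names and fields are illustrative, not the actual amaru code):

use tracing::{instrument, trace, Span};

// Instrument with skip_all and explicit fields only; no manual span creation.
#[instrument(skip_all, fields(header.slot = %slot, header.hash = tracing::field::Empty))]
fn validate_header(slot: u64, raw_header: &[u8]) {
    // Fields declared in `instrument` (even as Empty) can be recorded later
    // on the current span.
    Span::current().record("header.hash", "d34db33f");

    // Events are added with the trace!/debug!/... macros rather than spans.
    trace!(size = raw_header.len(), "validating header");
}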
A few important things for you to note:
- There's now a preparation step, ahead of the validation. This step is used to figure out what data are going to be needed by the validation context. At this point, since the rules only require information about inputs (for key witness validations), we only construct a context for those.
- The preparation of the validation context is entirely decoupled from the actual pre-fetching of the underlying data. Right now, we define two kinds of Preparation/Validation contexts:
  - A fake one, suitable for testing, which requires the fixture data to be provided upfront. Used for example here or here.
  - A simple one, currently used during normal operation. Calls to `require` will mark data as needed, and I've introduced a step for pre-fetching the underlying data and constructing a valid simple validation context. For now, the `simple` one is extremely basic and implements just what's needed to get the integration tests to pass. A next step (from my end) will be to fully port the current state management to this new approach, very likely just re-using what we have now.
- Then, the validation code now successfully uses the provided context to (a) resolve inputs from it and (b) produce outputs which may be needed by further validations. At this point, we do not consume inputs, so double-spending is effectively possible :). We shall do this at the end of every transaction, of course.
- With these last changes, we're able to run the integration tests and actually validate signatures on PreProd as part of the ledger validations 🎉! Note that this wasn't the case before, since we would simply ignore any such validation of any unknown input... Generally speaking, the code in the rules crucially misses tests. At the very least now, the integration tests will catch a few things, but they aren't sufficient -- they should remain only a last rampart. I've added an example of a validation yielding a `MissingVKeyWitness` and, moving forward, I'd expect more of those for each possible validation failure -- and this, before we introduce any new rules (that'll give us some time in the meantime to finish merging the state & validations!).
- I've also reworked some errors, to make the rules a bit more composable. The idea is to structure errors in the same way we structure rules: instead of having "low-level rules" yield a plain `TransactionRuleViolation`, which would rapidly grow quite large, they can yield an error that's more restricted to the thing they're validating. So far, I've done it in two places (three if we count the unification of bootstrap/vkey witnesses verification):
  - `InvalidVKeyWitnesses` now takes an array of `InvalidVKeyWitness` errors, which are produced by verify_ed25519_signature.
  - `InvalidOutputs` now takes an array of `InvalidOutput` errors, which are produced by single-output validations. We only have one for now, regarding the min ada value.

  Remark that I am no longer cloning the underlying data as part of the error, but instead prefer reporting errors using an index/position of the corresponding element within the transaction. To avoid repeating this over and over, I've introduced a little helper structure that makes the construction of such errors straightforward.
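To illustrate the shape this gives to errors, here is a rough sketch; the helper name (`WithPosition`) and the exact variants are hypothetical, and only meant to show the idea of nesting restricted errors and reporting positions instead of cloning the underlying data:

// Hypothetical illustration of nesting restricted errors per rule.
pub enum TransactionRuleViolation {
    InvalidVKeyWitnesses { invalid_witnesses: Vec<WithPosition<InvalidVKeyWitness>> },
    InvalidOutputs { invalid_outputs: Vec<WithPosition<InvalidOutput>> },
    // ... other high-level rule violations
}

pub enum InvalidVKeyWitness {
    InvalidSignature,
}

pub enum InvalidOutput {
    OutputTooSmall { minimum_required_value: u64 },
}

// Small helper so errors point at the element's position within the
// transaction rather than carrying a clone of the data itself.
pub struct WithPosition<E> {
    pub position: usize,
    pub element: E,
}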
- Supported AB with various small fixes to get the simulation testing demo working.
Introduced myself to RK and tried to get a DST discussion going by sending him some stuff I've written.
Had a kick-off meeting regarding Antithesis, where AB reminded me of the fact there is an example in their docs (I had forgotten about this).
Otherwise I've started documenting two things: workloads (and faults), and the Maelstrom DSL and how to implement a deterministic runtime for it. While trying to explain how to implement the runtime, I realised that my current implementation is too complicated to explain easily, so I started thinking about how it can be simplified. I think this will be important when discussing with RK how to make the consensus node runtime deterministic.
- Made some progress towards being able to test consensus through moskstraumen walking a block tree and simulating interactions with upstream peers (and checking the SUT selects the right chain along the way).
- Ditched gasket, which does not add value at this stage and makes the code more complicated than it should be: the `run_simulator` is just an `async` function that loops over lines read from `stdin`, passes the relevant messages to the `Consensus` for forwarding or rolling back, and then outputs the resulting `BlockValidatedEvent` to `stdout` using the same format as the input.
- I wasted some time struggling with I/O and reading lines: I had created two `BufReader` structures reading `stdin()` in different functions, one to read the init message which gives the list of upstream nodes' "addresses", and one for looping over other input messages. Turns out this is not a good idea, as it seems the first one swallows all the input there, or locks it, so the second one does not get a chance to read anything.
- Ultimately created a `MessageReader` trait with 2 implementations in order to test the reading logic: one implementation depending on `stdin` and the other one simply reading a list of `String`s (a sketch of what this could look like follows this list). This is a bit of a mess and I suspect there already exist plenty of tools for this that I am not aware of...
- Once the input part was sorted out, I hit issues validating the header when computing nonces: nonce computation assumes things about the current slot and epoch (e.g. that it's past the Byron and Shelley hard forks, which is hardcoded to the `preprod` value).
  - TODO: we should have network-specific values for all those things, or at least the possibility to swap the "real" network with a test one.
  - Note I could have just mocked the `ChainStore` trait which provides nonces computation; I might do it later down the road, as it removes the dependency on the underlying `RocksDB` storage while adding another layer of mocks.
  - I ended up manually hardcoding the epoch value and `import-nonces` with all zeros, and defaulting to all-zeros if the header has no parent, thus denoting the first header after genesis.
- Next steps to complete the whole journey are:
  - implement remaining methods from `FakeStakeDistribution`, similar to the existing mock
  - mock `block_fetch`. Ideally, this should be part of the protocol interacting with the tester, but for now let's keep it simple
  - handle the `ValidateBlockEvent` emitted by the consensus once it has a chain candidate and has downloaded the block. This last one is a bit tricky and annoying because the events do not hold the block header, only the body, and we need the header when forwarding. I was pondering whether I should read the header from the `ChainStore` with the point, or if I should more simply add the header to the `Validate` event.
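A minimal sketch of what such a `MessageReader` trait could look like (hypothetical method names; the actual amaru code may differ), with one implementation backed by `stdin` and one replaying canned messages for tests:

use std::io::{self, BufRead};

pub trait MessageReader {
    /// Return the next input line, or None when the input is exhausted.
    fn next_message(&mut self) -> Option<String>;
}

/// "Real" implementation, reading lines from stdin.
pub struct StdinReader {
    lines: io::Lines<io::StdinLock<'static>>,
}

impl StdinReader {
    pub fn new() -> Self {
        StdinReader { lines: io::stdin().lock().lines() }
    }
}

impl MessageReader for StdinReader {
    fn next_message(&mut self) -> Option<String> {
        self.lines.next().and_then(|line| line.ok())
    }
}

/// Test implementation, replaying a fixed list of messages.
pub struct CannedReader {
    messages: std::vec::IntoIter<String>,
}

impl CannedReader {
    pub fn new(messages: Vec<String>) -> Self {
        CannedReader { messages: messages.into_iter() }
    }
}

impl MessageReader for CannedReader {
    fn next_message(&mut self) -> Option<String> {
        self.messages.next()
    }
}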
We took a first bite at implementing governance, starting with the tracking of delegate representatives (a.k.a. DReps). We initially thought this would be relatively easy, but it had its share of problems.
Like a stake pool, a DRep can have many delegators, and a delegator can have at most one DRep. Fundamentally, we have something like:
erDiagram
Account }o--|| DRep : "delegate vote"
Account }o--|| Stake-Pool : "delegate consensus"
Hence, the first intuition is to store the DRep relationship next to the Pool relationship within the Account entity. So far so good. However, our storage model is deferred: updates to apply to the persistent store appear first as deltas in a volatile queue. For a one-to-many entity relationship like this, we use:
- For new relations: key:value map, where the key represents an Account's stake credential and the value is an optional entity.
- For account removals: a key set, containing the stake credentials of the accounts being removed.
This works well when tracking a single relation, but we ran into issues as soon as we added a second relation. The main problem came from our internal representation of those relations:
pub struct DiffRelation<K: Ord, L, R> {
pub registered: BTreeMap<K, Relation<L, R>>,
pub unregistered: BTreeSet<K>,
}
pub struct Relation<L, R> {
pub left: Option<L>,
pub right: Option<R>,
}
By holding an `Option` for the `left` and `right` relations, we cannot distinguish between an absence of change and the need to remove a relation. For example, if we register or re-register a new entity and then introduce a left-relation, we end up with a relation holding `Some(L)` & `None`. In this case, I need to set or reset the `right` relation to `None`, since the entry is either new or re-registered (thus invalidating any previous relation).
However, I cannot distinguish it from the case where I simply introduce a left-relation on an existing entity, which also ends up yielding a pair `Some(L)` and `None`. Here though, `None` does not indicate a reset of the right relation, but rather an absence of action. To solve this ambiguity, we introduced the following:
pub enum Resettable<A> {
Set(A),
Reset,
Unchanged,
}
pub struct Relation<L, R> {
pub left: Resettable<L>,
pub right: Resettable<R>,
}
This correctly allows us to represent the first example as `Set(L)` & `Reset`, while the second example becomes `Set(L)` & `Unchanged`, removing the ambiguity. We could also have introduced separate maps for the left and right relations, so that the absence of an entry in a map would indicate the absence of changes, but this `Resettable` structure somewhat simplifies the implementation and testing underneath.
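As an illustration, folding a `Resettable` delta into the currently stored relation makes the three cases explicit (the `apply` helper below is hypothetical, not necessarily the actual amaru code):

impl<A> Resettable<A> {
    // Hypothetical helper: fold a delta into the currently stored relation.
    pub fn apply(self, current: Option<A>) -> Option<A> {
        match self {
            Resettable::Set(new) => Some(new), // establish or replace the relation
            Resettable::Reset => None,         // explicitly remove the relation
            Resettable::Unchanged => current,  // leave whatever was there before
        }
    }
}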
A great part of testing the ledger state revolves around comparing epoch snapshots with similar snapshots produced by a Haskell node; the Haskell implementation acts as an oracle for us. However, since we use different data formats and different serialisation strategies, we need to produce conformance snapshots that sit somewhere in-between both implementations. So far, we have three kinds of snapshots:
- Stake distribution snapshots: for account balances, stake pool balances & parameters as well as account <-> stake pool relations.
- Rewards summary snapshots: for intermediate rewards calculations (e.g. pools efficiency, total rewards, available rewards, ...)
- DReps snapshots: for relations accounts <-> dreps, and (eventually) drep states.
The last kind of snapshot was introduced recently to ensure conformance with the Haskell node. To produce it, we initially used information coming from the local-state-query protocol's `GetDRepState` query. This query returns, for each DRep, a set of delegators. However, after some headache, we realized that the delegators returned by this query were incoherent: the same delegator would sometimes be present in several DRep entries. After some digging, we found IntersectMBO/cardano-ledger#4772.
The issue and its fix make the problem clear: the DRep delegators weren't properly cleaned up in the early days of the Haskell implementation, effectively causing a memory leak. The Haskell node, however, also maintains a mapping in the other direction, from account to DRep, which it uses internally as a source of truth, so the overall state of the system wasn't compromised. Thus, we resorted to using another local-state query, `GetFilteredVoteDelegatees`, which yields the correct mapping from accounts to DReps.
I was able to do a first run of the `simulator` driven by Moskstraumen 🚀
However, it seems the executable is crashing and the output is not very clear:
% RUST_BACKTRACE=1 cabal run blackbox-test -- /Users/arnaud/projects/amaru/amaru/./target/debug/simulator amaru 1 --peer-address=127.0.0.1:3000 --stake-distribution-file data/stake.json
thread 'thread 'mainmain' panicked at ' panicked at simulation/amaru-sim/src/bin/amaru-sim/simulator.rssimulation/amaru-sim/src/bin/amaru-sim/simulator.rs::99:29:
unable to open chain store at ./chain.dbthread '
main' panicked at simulation/amaru-sim/src/bin/amaru-sim/simulator.rs:thread 'main99' panicked at :29:
unable to open chain store at ./chain.db
99simulation/amaru-sim/src/bin/amaru-sim/simulator.rs::29:
unable to open chain store at ./chain.db
stack backtrace:
99:29:
unable to open chain store at ./chain.db
stack backtrace:
stack backtrace:
stack backtrace:
0: _rust_begin_unwind
1: core::panicking::panic_fmt
2: simulator::simulator::bootstrap
3: tokio::runtime::park::CachedParkThread::block_on
4: tokio::runtime::context::runtime::enter_runtime
5: tokio::runtime::runtime::Runtime::block_on
6: simulator::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
0: _rust_begin_unwind
1: core::panicking::panic_fmt
2: simulator::simulator::bootstrap
3: tokio::runtime::park::CachedParkThread::block_on
4: tokio::runtime::context::runtime::enter_runtime
5: tokio::runtime::runtime::Runtime::block_on
6: simulator::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
0: _rust_begin_unwind
1: core::panicking::panic_fmt
2: simulator::simulator::bootstrap
3: tokio::runtime::park::CachedParkThread::block_on
4: tokio::runtime::context::runtime::enter_runtime
5: tokio::runtime::runtime::Runtime::block_on
6: simulator::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
0: _rust_begin_unwind
1: core::panicking::panic_fmt
2: simulator::simulator::bootstrap
3: tokio::runtime::park::CachedParkThread::block_on
4: tokio::runtime::context::runtime::enter_runtime
5: tokio::runtime::runtime::Runtime::block_on
6: simulator::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
blackbox-test: fd:5: Data.ByteString.hGetLine: end of file
{"timestamp":"2025-03-11T17:43:07.629613Z","level":"INFO","fields":{"message":"stage bootstrap ok"},"target":"gasket::runtime","span":{"stage":"pull","name":"stage"},"spans":[{"stage":"pull","name":"stage"}]}
{"timestamp":"2025-03-11T17:43:07.629641Z","level":"INFO","fields":{"message":"switching stage phase","prev_phase":"Bootstrap","next_phase":"Working"},"target":"gasket::runtime","span":{"stage":"pull","name":"stage"},"spans":[{"stage":"pull","name":"stage"}]}
...
It does not crash when running standalone and is able to open the DB, so I'm unsure what's going on here. Perhaps something with process control, or the way tokio is initialized? But that seems weird... Next step: try to just wrap the executable from the Haskell side and send it canned messages. There's probably a need for some integration tests on the simulator side.
I also bit the bullet and implemented a change in `chain_selector` to allow having `Genesis` or `Origin` as the `tip` of the chain selection, which will mostly be the case, at least initially, for simulations. This rippled all over the chain selection process but did not entail drastic refactorings, and led me to some nice refactoring of the `find_best_chain` function, which was the only place where handling `Genesis` was slightly annoying.
Interestingly, the Haskell node has a dedicated type `WithOrigin`, isomorphic to `Maybe`, that's used everywhere to wrap types that don't exist at `Genesis`. Perhaps something we should consider?
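For reference, a Rust counterpart would be a tiny enum isomorphic to `Option`; this is only a sketch of the idea, not an existing amaru type:

// Sketch: a Rust counterpart to Haskell's WithOrigin, isomorphic to Option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WithOrigin<T> {
    Origin,
    At(T),
}

impl<T> WithOrigin<T> {
    pub fn to_option(self) -> Option<T> {
        match self {
            WithOrigin::Origin => None,
            WithOrigin::At(t) => Some(t),
        }
    }
}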
Fake stake distribution [#140]
In order to be able to test the consensus pipeline, whether inside a simulated environment (with a Moskstraumen/Maelstrom driver connected to stdin/stdout) or with jepsen (with synthetic workloads representing various possible scenarios), we need to be able to provide a test double for:
- the `HasStakeDistribution` interface, which is used to validate headers
- the ledger stage, which receives blocks to validate
- the block fetch stage, which calls upstream server(s) to retrieve the header's body once it's been validated.
Today I implemented the stake distribution part: the simulator takes a path to a stake distribution file containing data about the known stake pools, generated from the Haskell code that's also used to generate the block tree we'd be testing the consensus with. Took me some time to get up to speed again and reconcile the work-in-progress branch I had, but I finally nailed it down. I got hit by a mistake I made on the Haskell side: the `PoolId` was not serialised correctly, I was actually serialising the VRF cold key hash. This became obvious when I tried to look up the pool on the Rust side from the `PoolId`: the `PoolId` is a 28-byte long hash, whereas the VRF key hash is 32 bytes! I know I should have written a proper roundtrip property: always implement FromJSON/ToJSON together to ensure consistency of your representation.
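The same lesson applies on the Rust side; here is a hedged sketch of the kind of JSON roundtrip test the lesson points at, using serde_json and a hypothetical stand-in type (not the real stake distribution types):

#[cfg(test)]
mod tests {
    use serde::{Deserialize, Serialize};

    // Hypothetical stand-in for a stake distribution entry; the real types differ.
    #[derive(Debug, PartialEq, Serialize, Deserialize)]
    struct PoolEntry {
        pool_id: String,
        stake: u64,
    }

    #[test]
    fn pool_entry_json_roundtrip() {
        let entry = PoolEntry { pool_id: "pool1abc".to_string(), stake: 42 };
        let json = serde_json::to_string(&entry).expect("serialization failed");
        let decoded: PoolEntry = serde_json::from_str(&json).expect("deserialization failed");
        // Serializing and parsing must agree on the representation.
        assert_eq!(entry, decoded);
    }
}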
Next steps:
- fix the chain selector creation process, in order to be able to start from `Genesis`, which is not currently possible as we expect a concrete `Header` resolved from a `Point`
- move the block fetching part out of the way, into its own stage, in order to be able to fake it too.
- connect to SA's tester 🚀
We've spent some time this week setting up an integration environment for end-to-end tests. The main motivation was to ensure we could automate the snapshot checks that are the primary validation mechanism for the ledger state. The challenge is that those snapshot tests require running Amaru against the PreProd network for a few epochs, and then running the tests using the snapshots produced while syncing with the network.
There's no out-of-the-box way to run Amaru for a fixed time or until a specific event, but thanks to JSON traces being readily available, we can instrument Amaru to do just that with a couple of lines of bash.
AMARU_TRACE="amaru=debug" cargo run -- --with-json-traces daemon --peer-address=$AMARU_PEER_ADDRESS --network=preprod | while read line; do
EVENT=$(echo $line | jq -r '.fields.message' 2>/dev/null)
SPAN=$(echo $line | jq -r '.spans[0].name' 2>/dev/null)
if [ "$EVENT" == "exit" ] && [ "$SPAN" == "snapshot" ]; then
EPOCH=$(echo $line | jq -r '.spans[0].epoch' 2>/dev/null)
if [ "$EPOCH" == "$TARGET_EPOCH" ]; then
echo "Target epoch reached, stopping the process."
pkill -INT -P $$
break
fi
fi
done
The second challenge we faced was to get a reliable and fast-enough connection to a (Haskell) cardano-node to synchronize from. While there are public peers available, they are heavily rate-limited, and synchronizing the dozen or so epochs required by the test scenario would make the test too slow.
So instead, we've decided to synchronize from a local node running in a container. For this to work, however, the node needs to be somewhat synchronized already. This being a recurring problem for any client application wanting to perform e2e tests, we've re-used an existing solution to synchronize a node in a nightly workflow:
uses: CardanoSolutions/[email protected]
with:
db-dir: ${{ runner.temp }}/db-${{ matrix.network }}
network: ${{ matrix.network }}
version: ${{ matrix.ogmios_version }}_${{ matrix.cardano_node_version }}
synchronization-level: ${{ inputs.synchronization-level || 1 }}
To make all of this work, we've used a few tricks:
- We've used Github caches to avoid having to re-synchronize the node on each test run. Instead, we have a cron job running twice a day that refreshes a cache. The cache is then used by tests in a read-only fashion:

id: cache
uses: actions/cache@v4
with:
  path: ${{ runner.temp }}/db-${{ matrix.network }}
  key: cardano-node-ogmios-${{ matrix.network }}
  restore-keys: |
    cardano-node-ogmios-${{ matrix.network }}
- Github has some restrictions regarding cache accesses, but caches can be re-used across workflows. The cache's restore-key must obviously match on the consumer workflow, but so must the path into which the cache is downloaded: somehow, the output path is part of the cache invalidation strategy.
- Running a containerized (Haskell) cardano-node is relatively easy once one knows what the right options are and where to find the configuration files:

docker pull ghcr.io/intersectmbo/cardano-node:${{ matrix.cardano_node_version }}
make HASKELL_NODE_CONFIG_DIR=cardano-node-config NETWORK=${{ matrix.network }} download-haskell-config
docker run -d --name cardano-node \
  -v ${{ runner.temp }}/db-${{ matrix.network }}:/db \
  -v ${{ runner.temp }}/ipc:/ipc \
  -v ./cardano-node-config:/config \
  -v ./cardano-node-config:/genesis \
  -p 3001:3001 \
  ghcr.io/intersectmbo/cardano-node:${{ matrix.cardano_node_version }} run \
    --config /config/config.json \
    --database-path /db \
    --socket-path /ipc/node.socket \
    --topology /config/topology.json
- There are limits to both the maximum cache storage available per repository (10GB) and the Github runner hardware, in particular the available filesystem space (14GB). On PreProd, the node database already weighs around 11GB, leaving little space for (1) growth and (2) the rest of the build (e.g. cargo dependencies). Plus, it creates needlessly large caches that are also slow(er) to download and uncompress. Since Amaru doesn't do a full chain-sync, we can actually operate from a partial immutable database by removing immutable chunks prior to a certain point (roughly everything before chunk 3150). This saves us about 8GB:

name: prune node db
shell: bash
working-directory: ${{ runner.temp }}/db-${{ matrix.network }}
run: |
  rm -f immutable/00*.chunk immutable/01*.chunk immutable/02*.chunk immutable/030*.chunk
As the README shows, bootstrapping Amaru is becoming more and more complicated. To ease the setup, we've captured each of the bootstrapping steps in a Makefile, which comes in handy to define the Github workflow. A nice "help" is shown by default on `make`:
❯ make
Targets:
bootstrap: Bootstrap the node
dev: Compile and run for development with default options
download-haskell-config: Download Cardano Haskell configuration for $NETWORK
import-headers: Import headers from $AMARU_PEER_ADDRESS for demo
import-nonces: Import PreProd nonces for demo
import-snapshots: Import PreProd snapshots for demo
snapshots: Download snapshots
Configuration:
AMARU_PEER_ADDRESS ?= 127.0.0.1:3000
HASKELL_NODE_CONFIG_DIR ?= cardano-node-config
HASKELL_NODE_CONFIG_SOURCE := https://book.world.dev.cardano.org/environments
NETWORK ?= preprod
The last piece of the puzzle was to get the snapshot tests runnable in parallel, since this is what `cargo test` typically does. The issue is that each test depends on 2 on-disk snapshots, and may thus open concurrent, conflicting connections to the same store, causing the tests to fail arbitrarily. So instead, we now maintain a "pool" of re-usable read-only connections in a thread-safe manner, which ensures that tests can be invoked in any order, with non-conflicting ones even running in parallel:
pub static CONNECTIONS: LazyLock<Mutex<BTreeMap<Epoch, Arc<RocksDB>>>> =
LazyLock::new(|| Mutex::new(BTreeMap::new()));
fn db(epoch: Epoch) -> Arc<impl Snapshot + Send + Sync> {
let mut connections = CONNECTIONS.lock().unwrap();
connections
.entry(epoch)
.or_insert_with(|| {
Arc::new(
RocksDB::for_epoch_with(&LEDGER_DB, epoch).unwrap_or_else(|_| {
panic!("Failed to open ledger snapshot for epoch {}", epoch)
}),
)
})
.clone()
}
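A snapshot test can then simply grab its (shared, read-only) connection from that pool; a hypothetical example, assuming `Epoch` can be converted from a `u64`:

#[test]
fn stake_distribution_snapshot_epoch_168() {
    // Going through the shared pool means two tests touching the same epoch
    // reuse a single RocksDB handle instead of opening conflicting connections.
    let epoch: Epoch = 168u64.into();
    let snapshot = db(epoch);
    // ... query `snapshot` and compare against the expected conformance snapshot
}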
I started working on connecting the simulation testing work with Amaru.
So far, here is what I've got: given AB's prefabricated chain.json block tree, I can produce a sequence of sync protocol calls (Fwd and Bwd requests, where Bwd requests are issued when we reach dead ends in the tree and there's a possibility to backtrack). These calls are not actually made against the Amaru node yet, since that functionality isn't working yet.
Initially my plan was to do this in Clojure using Maelstrom, but given that the outcome of last week's discussion is that there's no "normal" workload to speak of at the moment, I opted for doing it in Haskell and using the simulation testing prototype (Moskstraumen) instead, since this gives us more flexibility to create "special" workloads if necessary.
There's some glue code still needed to connect the tests and the Amaru node, but all the pieces are there. My plan is to start with the Amaru echo example, since that passes the Maelstrom tests and therefore it should also be able to pass the Moskstraumen tests. That will prove that all the glue works, and then whenever the Amaru sync protocol works we can trivially add that to the test suite.
I also discussed future steps with AB, which include generalising to multiple clients walking the block tree, partitions that can be simulated by clients that stop sending messages, and correctness criteria in terms of slots. But this is all still a bit abstract.