Lodestar Zig Roadmap Discussions
Recording/Transcript: https://drive.google.com/drive/folders/1EpLxzKetQuMr8klth1z5ZAoZmtnUacGu?usp=sharing
- Cayman reported significant progress on the napi-z library, which is being developed to provide Zig-based native bindings similar to their existing bun-ffi-z library.
- A key difference between napi-rs and C++ NAPI workflows is in how the native binaries are built and shipped.
- The key advancement involves implementing platform-specific binary publishing capabilities, mirroring the approach used by napi-rs, where pre-built binaries are published for each target platform rather than compiling native bindings during npm install.
- A pull request has been opened against the hashtree-z library for testing, though it currently fails on ARM64 Mac builds due to lack of access to that platform.
- The library maintains a principled foundation by mapping the C library with a more user-friendly Zig interface, including wrappers for function creation and type conversion between NAPI values and Zig types.
- Nazar revealed he had been independently working on a similar native binding project before discovering the existing napi-z effort.
- Additionally, he developed CPU features bindings that support Bun, Deno, and Node.js, replacing their previous single-runtime dependency on Google's C implementation.
- The team discussed potentially moving this CPU features binding to the ChainSafe organization, though they noted that the underlying need may be eliminated since hashtree now includes fallback implementations, reducing dependency on CPU feature detection.
- Current State: Using Google’s C implementation of CPU features via native bindings
- Intermediate Step: Multi-runtime bindings while maintaining C dependency
- Future Vision: Pure Zig implementation leveraging Zig’s built-in CPU feature detection (`builtin.cpu`)
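A minimal sketch of what that future vision could rely on, assuming compile-time detection through `builtin.cpu` (the AVX2 check is illustrative):

```zig
const std = @import("std");
const builtin = @import("builtin");

pub fn main() void {
    // builtin.cpu describes the CPU the binary is compiled for; with
    // -mcpu=native it reflects the build host, so feature checks resolve
    // at compile time instead of going through runtime C bindings.
    const has_avx2 = comptime builtin.cpu.arch == .x86_64 and
        std.Target.x86.featureSetHas(builtin.cpu.features, .avx2);
    std.debug.print("compiled with AVX2: {}\n", .{has_avx2});
}
```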
- Tuyen Nguyen presented concerning performance findings regarding native beacon state property access, describing it as "really really expensive".
- This challenges their initial assumption of seamless interaction between the runtime and Zig, suggesting they need to minimize calls into the beacon state across the binding boundary and instead implement separate caches in the beacon chain, similar to Lighthouse's architecture.
- Lodestar’s Current Approach: Heavy reliance on beacon state for data access and storage
- Lighthouse’s Approach: Minimal beacon state usage with separate caches in the beacon chain
- This comparison suggests that Lodestar’s current architecture may be inherently inefficient for their Zig integration goals.
- Cayman provided perspective on when binding overhead becomes problematic, noting that it typically only matters for operations occurring thousands of times per second, such as gossip validation with attestations.
- For less frequent operations like API access, the binding overhead should not be significant.
- Is bun-ffi really a true bottleneck?
- Tuyen’s hashtree benchmark represented a “worst case possible” scenario because individual hash operations are close to no-ops, meaning the benchmark primarily measured binding overhead rather than realistic workload performance.
- Quantitative Thresholds: Cayman established specific performance criteria, noting that binding overhead only becomes problematic at “thousands of times per second” with “several hundred nanoseconds” overhead per operation.
- Cayman identified specific high-risk scenarios where binding overhead could become problematic:
- High-Risk: Gossip validation with attestations, processing thousands per second
- Low-Risk: API access operations that occur infrequently
- This differentiation suggests a selective optimization strategy rather than wholesale architectural changes.
- Tuyen outlined a concrete optimization approach for attestation validation:
- Direct beacon chain access for beacon committee data
- Configuration and indexing lookups without beacon state traversal
- Leveraging existing shuffling cache in the beacon chain
- This represents a surgical approach to reducing binding overhead by minimizing cross-boundary calls.
Hop Reduction Strategy
- They aim to reduce “three hops or four hops” to “one hop”, indicating significant performance improvements are achievable through call-path optimization without fundamental architectural changes.
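As a rough illustration of the one-hop idea (a sketch only; `ShufflingCache` and `getBeaconCommittee` are hypothetical names, not Lodestar's actual API), a beacon-chain-level cache can hand back a committee in a single native call instead of several chained lookups through the beacon state:

```zig
const std = @import("std");

const ShufflingCache = struct {
    /// Flat validator indices for one epoch, plus per-committee offsets.
    committees: []const u32,
    offsets: []const u32,

    /// One hop: the caller gets the committee slice in a single call,
    /// instead of walking state -> epoch context -> shuffling from JS.
    pub fn getBeaconCommittee(self: *const ShufflingCache, committee: usize) []const u32 {
        return self.committees[self.offsets[committee]..self.offsets[committee + 1]];
    }
};

test "one-hop committee lookup" {
    const cache = ShufflingCache{
        .committees = &.{ 10, 11, 12, 13 },
        .offsets = &.{ 0, 2, 4 },
    };
    try std.testing.expectEqualSlices(u32, &.{ 12, 13 }, cache.getBeaconCommittee(1));
}
```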
Memory Management and Lifecycle Challenges
- Cayman identified a critical architectural dependency in their current approach:
- Current System: All cache items attached to beacon state or epoch cache objects for automatic lifetime management
- Advantage: Simplified memory cleanup through reference counting tied to parent objects
- Mechanism: Cache lifetimes bound together through parent object disposal
- The proposed shift to separate beacon chain caches introduces significant complexity:
- New Challenge: Individual cache lifetime management without parent object coordination
- Risk: More complex memory management patterns
- Benefit: Improved performance and reduced memory footprint
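A minimal sketch of the current parent-tied pattern, assuming a reference-counted parent object (all types and fields here are illustrative, not Lodestar's actual code):

```zig
const std = @import("std");

// Illustrative only: children are freed together with the refcounted
// parent, so callers never manage individual cache lifetimes.
const EpochCache = struct {
    ref_count: std.atomic.Value(u32),
    shuffling: []u32,

    pub fn create(allocator: std.mem.Allocator, validator_count: usize) !*EpochCache {
        const self = try allocator.create(EpochCache);
        errdefer allocator.destroy(self);
        self.* = .{
            .ref_count = std.atomic.Value(u32).init(1),
            .shuffling = try allocator.alloc(u32, validator_count),
        };
        return self;
    }

    pub fn acquire(self: *EpochCache) void {
        _ = self.ref_count.fetchAdd(1, .monotonic);
    }

    pub fn release(self: *EpochCache, allocator: std.mem.Allocator) void {
        if (self.ref_count.fetchSub(1, .acq_rel) == 1) {
            allocator.free(self.shuffling); // children die with the parent
            allocator.destroy(self);
        }
    }
};
```

Moving caches out of the beacon state means each such object needs its own `release` discipline (or an owning beacon-chain structure), which is the complexity being weighed here.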
Storage Capacity Considerations
- Tuyen referenced discussions with Lion about fundamental capacity constraints:
- Problem: Beacon state objects are “too big” limiting stored object quantity
- Current Constraint: Limited historical data storage due to beacon state size
- Alternative Approach: Beacon chain cache layouts enabling “a lot of shuffling in the past” storage
- The beacon chain cache approach offers superior historical data retention:
- Higher storage capacity for historical shuffling data
- Better scalability for long-term cache requirements
- Reduced memory pressure from oversized beacon state objects
- Cayman revealed that much of the required infrastructure already exists:
- Available Caches:
- Pubkey index and index pubkey caches
- Shuffling cache
- Various other specialized caches
- Missing Components:
- Decision routes in fork choice
- Proposer cache
Timeline and Implementation Concerns
- Tuyen emphasized the urgency of architectural decisions, noting that previous small changes like shuffling took “one month”.
- This timeline constraint suggests that early architectural decisions will significantly impact development velocity.
- Risk Assessment - The conversation reveals two competing risks:
- Performance Risk: Continuing with current beacon state-heavy approach
- Complexity Risk: Implementing individual cache lifetime management
- Optimization Strategy - Rather than wholesale architectural changes, the discussion points toward selective optimization:
- Target high-frequency operations like gossip validation
- Maintain current architecture for low-frequency operations
- Leverage existing cache infrastructure more efficiently
- Matthew Keil raises the fundamental question:
- If the caches are moved off beacon state, should they be centrally stored in a global singleton object passed everywhere (a “global context,” like the chain object)?
- He questions if that is any different from passing a pointer to a cache-enabled beacon state—since in either case, one ends up with a “large object with lots of pointers,” but all are passed by reference, so memory footprint is minimal and passing cost is negligible.
- Tuyen Nguyen and Cayman Nava elaborate:
- Keeping everything inside beacon state offers maintainability and convenient access, which can make maintainers default to storing too much there, possibly at the expense of architectural soundness.
- If everything is passed as part of a global context, it’s explicit but can become unwieldy if functions only use a small subset of what’s inside.
- Cayman notes that at the JS layer such a pattern (an object with pointers/globals) is almost unavoidable, but at the native/Zig layer there’s a choice: pass only the explicit references that are needed.
- Matthew (drawing on C language experience) and Cayman agree:
- The cost in Zig (or C) of passing large structs by reference is minimal—just a pointer copy, not moving the data.
- In traditional C/low-level design, it’s normal to keep everything in a singleton context struct and pass a reference.
- Nazar objects strongly:
- Putting “everything” in a single global context is a security and maintainability liability: functions that shouldn’t have access to some internals do have access, leading to potential bugs and side effects.
- Instead, each function should only be given what it needs: explicit references as parameters, not access to the whole global state.
- Ekaterina Riazantseva notes that languages (like Rust) often clone objects for safety—to avoid pointers that allow unwanted mutations.
- Cloning can be expensive if objects are large.
- But passing by `const` reference (Zig’s equivalent of a read-only reference) can provide safety guarantees with low overhead.
- Cayman agrees: functions can use `const` references and need not rely on mutation or access to global state.
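A small sketch of the trade-off, using hypothetical types: passing a large context by `*const` reference costs one pointer copy and rules out mutation, while Nazar's alternative passes only the narrow values a function needs:

```zig
const std = @import("std");

// Hypothetical stand-in for "the chain object": many caches and config
// fields; its size does not affect the cost of passing a pointer to it.
const ChainContext = struct {
    genesis_time: u64,
    current_epoch: u64,
    // ... caches, config, etc.
};

// Option A: whole context by read-only reference (one pointer copy).
fn currentEpochA(ctx: *const ChainContext) u64 {
    return ctx.current_epoch;
}

// Option B (Nazar's preference): dependencies explicit in the signature.
fn currentEpochB(current_epoch: u64) u64 {
    return current_epoch;
}

test "both styles read the same data" {
    const ctx = ChainContext{ .genesis_time = 0, .current_epoch = 7 };
    try std.testing.expectEqual(currentEpochA(&ctx), currentEpochB(ctx.current_epoch));
}
```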
- Cayman explains that currently for large caches (e.g., pub key index cache), the codebase already acts like these are singletons/globals—variables at the top level, referenced everywhere, especially exposed to JS via bindings.
- Matthew says: this is mirrored on the JS side too—JS code refers to the same handle pointing at the same native singleton, so the pattern persists across boundaries.
- Nazar remains opposed to using native globals, advocating for pure functions relying only on explicit references, with no direct or implicit access to globals (citing historical challenges with side effects in impure/implicit code).
- The group discusses what happens if the context/singleton also needs to store a memory allocator (common for caches needing dynamic memory):
- Nazar questions if it’s necessary to include an allocator (behavior) when sometimes a pointer to the data alone should be sufficient.
- Cayman/Bing: If a cache or function needs an allocator for updates, and global state is discouraged, the allocator reference must also be passed explicitly.
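Explicit allocator passing is already idiomatic Zig, as in the standard library's unmanaged containers; a minimal sketch (the `Cache` type is hypothetical):

```zig
const std = @import("std");

// The allocator is a parameter of each mutating operation (the
// std.ArrayListUnmanaged convention), not hidden state in a global context.
const Cache = struct {
    entries: std.ArrayListUnmanaged(u64) = .{},

    pub fn add(self: *Cache, allocator: std.mem.Allocator, value: u64) !void {
        try self.entries.append(allocator, value);
    }

    pub fn deinit(self: *Cache, allocator: std.mem.Allocator) void {
        self.entries.deinit(allocator);
    }
};

test "explicit allocator passing" {
    var cache = Cache{};
    defer cache.deinit(std.testing.allocator);
    try cache.add(std.testing.allocator, 42);
    try std.testing.expectEqual(@as(u64, 42), cache.entries.items[0]);
}
```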
Resolution and Next Steps: Focus on Concrete Design
- Cayman proposes the group move from broad theoretical discussion to specifics:
- Create a design document or skeleton outlining the exact interfaces/patterns to be supported (e.g., state transition + caches in v1).
- Decide, for each case, if a global pattern, explicit parameter passing, or hybrid is best—then make these choices explicit in written designs.
| Pattern | Pros/Arguments | Cons/Arguments |
|---|---|---|
| Singleton/global context | Simplifies access, like C style, easy for JS | Security, hidden dependencies, possible side effects |
| Explicit reference-passing | Pure functions, explicit deps, safer | Boilerplate, possible verbosity |
| Mixing allocators in context | Direct allocation management | Blurs data vs. behavior, explicitness questioned |
- Cayman Nava emphasized the need for a concrete design document outlining the binding interface for their first iteration, focusing on state transition plus caches as the initial implementation target. He suggested studying well-architected Zig codebases like Tigerbeetle and Ghostty for architectural guidance, contrasting them with less organized projects like Bun.
This conversation between Matthew Keil and Cayman Nava centers on lessons from Lighthouse's architecture and how to design Lodestar's Zig and JavaScript bindings for both current interoperability and future pure-Zig clients.
Looking to Lighthouse for Inspiration
- Both agree Lighthouse's structure—with well-organized classes/structs and clear module boundaries—offers relevant architectural patterns, even if Teku and others differ.
- However, Lighthouse operates without the cross-language binding layer they need for Lodestar/JS, so not all patterns translate directly.
Separation of Concerns: Native vs. JS Layer
- Matthew emphasizes that how things are structured at the native/Zig layer must be settled before considering the JS interface.
- Cayman agrees and suggests that lower layers (state transition, caches, etc.) should only care about internal logic and not about how the top-level orchestration happens, whether in JS-land or in a future Zig-only client.
Composability and Abstraction
- Cayman likens function organization to "Lego blocks," advocating for maximum composability and minimal coupling: the way you assemble and glue blocks together at the top should not dictate details at the bottom.
- He notes that the bindings layer (to JS) is a layer on top of reusable Zig systems. It's vital to avoid locking in design meant only for JS interop into lower-level state transition logic.
Forward Compatibility and Avoiding Premature Constraints
- Matthew is thinking ahead to a "full Zig client" (native, beacon chain, caches), wondering if current design will serve them there.
- Cayman responds that for pure Zig, the design space is much broader. The current need is to get the boundaries and bindings layer right for JS while ensuring the Zig code for things like state transition, Merkle trees, and caches does not become polluted with binding-layer concerns.
Reference Counting and Opaque Handles
- They discuss reference counting and life-cycle management:
- Cayman: Ref counting is only needed in some low-level, resource-managed Zig structures (like a Merkle node pool); most caches, beacon blocks, etc. don’t require it.
- Once a JS binding is involved, opaque tokens/handles are used: the JS layer holds a handle (pointer) to a native structure, and the VM/bindings manage reference lifetime—not the Zig code itself.
- Matthew argues that as long as the cache is properly locked (mutex) on updates and read access is synchronized, the same Zig cache code can be wrapped with JS handles or used natively in a pure Zig client.
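A minimal sketch of the opaque-handle pattern described above (function names are hypothetical): the core type stays free of binding concerns, while a thin exported layer creates and destroys instances whose pointers JS holds as opaque handles:

```zig
const std = @import("std");

// Core Zig type: no binding-layer concerns, no ref counting.
const ShufflingCache = struct {
    epoch: u64 = 0,
};

// Binding layer: JS keeps the returned pointer as an opaque handle and
// calls destroy when the VM-side wrapper is finalized.
export fn shuffling_cache_create() ?*ShufflingCache {
    const cache = std.heap.c_allocator.create(ShufflingCache) catch return null;
    cache.* = .{};
    return cache;
}

export fn shuffling_cache_destroy(cache: *ShufflingCache) void {
    std.heap.c_allocator.destroy(cache);
}
```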
Technical Consensus
- The recommended approach is to write core logic in Zig without binding-specific (JS) constraints or ref-counting unless necessary for native resource management.
- Expose these as opaque pointers/tokens with accessor methods to JS; let the JS VM manage lifetime/handles.
- With this separation, the same core code will be reusable for a future all-Zig beacon chain client.
- Strong modularity and composability (Lego block analogy) at the underlying logic level.
- Binding-specific logic (handles, reference management) must be kept shallow, at the interop boundary, not in core state or cache logic.
- Future-proofing: do not pollute Zig code with constraints required only by current JS interoperability.
Three Zig Object Access Patterns Identified (Matthew)
- Singletons: Native objects that exist only once (e.g., large caches), usually referenced globally.
- Non-singleton, Non-refcounted Objects: Regular native objects that, while not globally unique, do not need reference counting.
- Refcounted/Pool Objects: Special cases, e.g., the hash pool, where explicit tracking and cleanup (ref counting, circular buffers) are necessary to manage resources safely.
Separation of Zig Module vs. Binding Layer (Cayman, supported by Matthew)
- Core Zig modules should be written for "Zig consumption"—concerned with correctness, clarity, and idiomatic practices.
- The binding layer is a separate concern: it adapts those modules for JS or C, possibly translating or transforming data as needed.
- This distinction is crucial for maintainability and for applying best practices from the Zig ecosystem.
Data Sharing Restrictions and API Layer
- Nazar points out that, except for a handful of types the ABI doesn't support (e.g., arrays of null-terminated strings), the internal shape of Zig structs doesn’t constrain what is shared with C/JS.
- The binding layer must translate unsupported types into acceptable forms (e.g., array list in Zig becomes a single comma-separated string for C/JS).
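For example (a sketch; `joinForFfi` is a hypothetical helper built on the standard library), a list of strings can be flattened into one comma-separated, null-terminated string at the binding layer:

```zig
const std = @import("std");

/// Binding-layer translation: a slice of strings becomes a single
/// comma-separated, null-terminated string that C/JS can consume.
fn joinForFfi(allocator: std.mem.Allocator, items: []const []const u8) ![:0]u8 {
    return std.mem.joinZ(allocator, ",", items);
}

test "joinForFfi flattens a string list" {
    const s = try joinForFfi(std.testing.allocator, &.{ "a", "bb", "ccc" });
    defer std.testing.allocator.free(s);
    try std.testing.expectEqualStrings("a,bb,ccc", s);
}
```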
Best Practices: Zig Data Structures and Dev Experience
- Zig’s native data structures (like `ArrayList`) are great for internal use, and efficient.
- When sharing with C/JS, choose the simplest format that meets functional and performance needs.
- Focus first on writing unit-tested, idiomatic Zig code.
Binding Layer Design Can Proceed In Parallel
- Core logic (e.g., state transition, getter/setter logic) can be developed or ported to Zig directly from the TypeScript reference.
- Meanwhile, the binding layer (function signatures, wrappers, what gets exposed, and how handles are managed) can be designed and iterated independently.
- This parallelization allows for rapid prototyping and refinement.
Stick to Spec, Avoid Premature Optimization or Divergence
- Bing and others caution that major architectural changes should be aligned with the official spec (or, if inspired by Lighthouse or others, done with care).
- Naive rewrites may already be sufficient, and performance/architecture tweaks should be considered incrementally.
- Cayman starts by clarifying that while the binding architecture shouldn’t differ dramatically between NAPI (Node.js n-api) and bun-ffi (Bun’s FFI), for consistency and coordination, the team should choose one as the primary target for their initial binding implementation.
- Nazar describes his recent success with a pattern: first build the Zig module using standard library idioms, then create a Bun FFI binding as an intermediate C-ABI-friendly layer, and finally layer the Node.js binding over that.
- This approach simplifies the interface, as C ABI types are easier to target and adapt.
- Nazar strongly recommends starting with bun-ffi, targeting a C-ABI, instead of beginning with NAPI and then later trying to backport to bun-ffi.
- The general agreement is:
- Build the Zig library in a modular, idiomatic way.
- Add a C ABI FFI layer for Bun.
- Use that as a basis for any needed Node.js bindings via NAPI later.
- Cayman notes that bun-ffi actually supports passing N-API value types, and points out that async, promises, callbacks, and other non-FFI-supported types are possible this way, but it’s “not thread safe” and more complex than standard FFI.
- Decision: Settle on bun-ffi as the initial target, with room to adapt to NAPI later if needed.
- Cayman calls out the risk: they can’t actually run the full Lodestar codebase on Bun yet, so some test coverage will have to happen in isolation (e.g., running ETH spec tests on state transitions in Zig via bun ffi, but not the whole node).
Recording/Transcript: https://drive.google.com/drive/folders/1tXIxf_iOMGNLaIv1DRnUoWhLj3PERhDq?usp=sharing
- zbuild is a good fit for “vanilla” Zig libs (e.g. SSZ).
- Poor fit for target-specific code (hash-tree-root, BLST, Snappy bindings).
- Generates fallback build.zig/.zon, so easy to abandon if it becomes a problem
- Adopt zbuild for new or simple libs; avoid for native/asm or multi-arch code.
- Requires further issue discussion: https://github.com/ChainSafe/zbuild/issues/1
- Prioritise making beacon state-transition native.
- Minimal JS ↔ Zig interface is a single `process_block(state*, block_bytes)` call.
- Requires native Merkle-tree BeaconState + all epoch caches (proposer, committee, shuffling, etc.).
- Other consumers (gossip-validation, fork-choice, regen) must read these caches via bindings.
- Continue building native state-transition and cache structures; expose only required getters/setters first.
- Tuyen will finish cache implementation & add Zig spec-tests — est. 1 month for unit tests + 1 month for full spec-tests (optimistic timeline).
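A sketch of what that minimal surface might look like under the agreed layering (core Zig logic plus a thin C-ABI wrapper; every name and signature here is illustrative, not a committed design):

```zig
const std = @import("std");

const BeaconState = struct {
    slot: u64 = 0,
    // ... native Merkle-tree state and epoch caches would live here ...
};

// Core Zig API: idiomatic, unit-testable, no binding concerns.
fn processBlock(state: *BeaconState, block_bytes: []const u8) !void {
    if (block_bytes.len == 0) return error.EmptyBlock;
    state.slot += 1; // placeholder for the real state transition
}

// Thin C-ABI wrapper that bun-ffi (and later NAPI) can target.
export fn process_block(state: *BeaconState, ptr: [*]const u8, len: usize) c_int {
    processBlock(state, ptr[0..len]) catch return -1;
    return 0;
}
```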
- Bun-FFI easier & familiar; risk of tying client to Bun runtime.
- N-API heavier but Node-stable; unknown performance until benchmarked.
- Large surface (≥5 k SSZ types) makes hand-written bindings infeasible → code-gen likely.
- Defer the decision, but a choice must be made within ~2 months (by mid-Sep) or the November integration slips.
- Explore tiny PoC library in N-API to gauge DX & perf (no one assigned yet).
- Parallel task: Try to make Lodestar run under Bun today to unblock Bun-FFI testing.
- Whack-a-mole vs Architectural Redesign
- Hot-spot driven (“moles”) is practical, but may miss systemic wins (data-flow, thread layout).
- After native state-transition lands we must re-profile; new bottlenecks unknown.
- Use perf metrics to guide; re-evaluate design once native caches are in.
- Add fine-grained metrics inside Zig state-transition for future profiling (no one assigned yet).
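One possible shape for that instrumentation (a sketch; how measurements reach Lodestar's metrics pipeline is left open) is `std.time.Timer` around individual transition steps:

```zig
const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();

    // ... run one state-transition step here ...
    std.time.sleep(1 * std.time.ns_per_ms); // stand-in for real work

    const elapsed_ns = timer.read();
    std.debug.print("step took {d} ns\n", .{elapsed_ns});
}
```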
- Next unlocks: Fork-choice, Gossip validation, Block production.
- Rough vision diagram shows JS network & API workers around fully-native chain core (see Vision document):
- Thought experiment, not committed to yet. Focus resources on initial integration.
- Bun integration running in Lodestar by the November on-site meetup.
- Bi-weekly roadmap syncs will continue.