Skip to content

feat: deferral extension CUDA tracegen#2601

Open
stephenh-axiom-xyz wants to merge 2 commits intodevelop-v2.0.0-betafrom
feat/deferral-extension-cuda-tracegen
Open

feat: deferral extension CUDA tracegen#2601
stephenh-axiom-xyz wants to merge 2 commits intodevelop-v2.0.0-betafrom
feat/deferral-extension-cuda-tracegen

Conversation

@stephenh-axiom-xyz
Copy link
Contributor

Resolves INT-6241.

Overview

This PR adds CUDA trace generation support for the deferral extension and wires it into the deferral GPU prover path. It also updates test call sites and CUDA workflows so the new path is exercised in CI.

1) Deferral Extension: CUDA Tracegen Implementation

Files

  • extensions/deferral/circuit/build.rs
  • extensions/deferral/circuit/Cargo.toml
  • extensions/deferral/circuit/src/lib.rs
  • extensions/deferral/circuit/src/cuda_abi.rs
  • extensions/deferral/circuit/src/extension/mod.rs
  • extensions/deferral/circuit/src/extension/cuda.rs
  • extensions/deferral/circuit/src/call/mod.rs
  • extensions/deferral/circuit/src/call/cuda.rs
  • extensions/deferral/circuit/src/count/mod.rs
  • extensions/deferral/circuit/src/count/cuda.rs
  • extensions/deferral/circuit/src/output/mod.rs
  • extensions/deferral/circuit/src/output/cuda.rs
  • extensions/deferral/circuit/src/poseidon2/mod.rs
  • extensions/deferral/circuit/src/poseidon2/cuda.rs
  • extensions/deferral/circuit/cuda/include/def_types.h
  • extensions/deferral/circuit/cuda/include/def_poseidon2_buffer.cuh
  • extensions/deferral/circuit/cuda/src/call.cu
  • extensions/deferral/circuit/cuda/src/count.cu
  • extensions/deferral/circuit/cuda/src/output.cu
  • extensions/deferral/circuit/cuda/src/poseidon2.cu
  • extensions/deferral/circuit/cuda/src/canonicity.cuh

What changed

  • Re-enabled and completed CUDA build integration for the deferral circuit crate.
  • Added CUDA feature dependency on openvm-rv32im-circuit/cuda for GPU builder/prover compatibility.
  • Added Rust CUDA ABI bindings for deferral count, call, output, and poseidon2.
  • Added GPU chip implementations for deferral call/count/output/poseidon2 trace generation.
  • Added CUDA kernels for call, output, count, and poseidon2 trace rows.
  • Added shared deferral Poseidon2 record buffer with atomic append and multiplicity tracking.
  • Added GPU-side deduplication of poseidon2 records (sort + reduce-by-key) before final poseidon2 tracegen.
  • Added canonicity helper logic for BabyBear byte decomposition checks used by deferral call/output tracegen.
  • Completed GPU prover extension wiring in DeferralGpuProverExt so the deferral extension now installs all required GPU chips.

Reviewer focus

  • Check the buffer lifecycle/reset behavior for shared GPU state (count, poseidon2 record index/counters).
  • Check call/output kernels for consistency with existing CPU trace semantics (memory aux timestamps, bitwise/range checker interactions, poseidon2 recording, canonicity aux columns).
  • Check poseidon2 dedup+multiplicity logic for correctness and alignment with AIR expectations.

2) Builder API and Call-Site Updates

Files

  • extensions/deferral/circuit/src/extension/mod.rs
  • crates/continuations/src/tests/e2e.rs
  • extensions/deferral/tests/src/lib.rs
  • crates/sdk/src/tests.rs

What changed

  • Renamed the CPU-only builder type to Rv32DeferralCpuBuilder.
  • Added a feature-dependent public alias Rv32DeferralBuilder that resolves to Rv32DeferralGpuBuilder with CUDA and Rv32DeferralCpuBuilder without CUDA.
  • Updated tests to use Rv32DeferralBuilder instead of a CPU-only builder.
  • Removed the #[cfg(not(feature = "cuda"))] gate from deferral integration tests so they can run in CUDA builds.
  • Updated SDK deferral test to use Sdk::new(...) instead of CpuSdk::new(...), removing a CUDA-related TODO.

Reviewer focus

  • Confirm API rename/aliasing is intentional and does not break expected external usage patterns.
  • Confirm updated tests now exercise the CUDA path where intended.

3) CUDA Workflow Updates

Files

  • .github/workflows/extension-tests.cuda.yml
  • .github/workflows/continuations.cuda.yml
  • .github/workflows/sdk.cuda.yml

What changed

  • Added deferral to CUDA extension test matrix and paths-filter in extension-tests.cuda.yml.
  • Enabled root-prover alongside cuda in continuations CUDA workflow.
  • Removed --profile=heavy from SDK CUDA workflow nextest invocation (kept --test-threads=1).

Reviewer focus

  • Confirm CI trigger/filter behavior now includes deferral extension CUDA changes.
  • Confirm continuations CUDA workflow feature set (cuda,root-prover) matches intended test surface.

@github-actions

This comment was marked as outdated.

@github-actions
Copy link

github-actions bot commented Mar 20, 2026

Claude finished @stephenh-axiom-xyz's task in 8m 24s —— View job


Code Review: Deferral Extension CUDA Tracegen

  • Review CUDA kernels (call, output, count, poseidon2)
  • Review Rust CUDA ABI bindings and GPU chip implementations
  • Review GPU prover extension wiring and builder API changes
  • Review test updates and CI workflow changes
  • Final summary

Overall this is a well-structured PR that follows the established CUDA tracegen patterns in the codebase. The kernel/Rust split is clean, ABI bindings are consistent, and the GPU prover extension wiring correctly mirrors the CPU path. Below are my findings.

Issues

1. In-place ReduceByKey aliasing — poseidon2.cu:184-195

cub::DeviceReduce::ReduceByKey(
    d_temp_storage, temp_storage_bytes,
    d_records_fp16, d_records_fp16,  // same buffer for input and output keys
    d_counts, d_counts,              // same buffer for input and output values
    ...
);

CUB's ReduceByKey documentation states input/output iterators must not alias. However, I checked and crates/vm/cuda/src/system/poseidon2.cu:152-163 uses the exact same in-place aliasing pattern. So this is an established convention in this codebase. If this is intentionally relied upon, a comment documenting the assumption would help (applies to both locations).

2. Canonicity assert(found) only in CUDA_DEBUGcanonicity.cuh:55-57

#ifdef CUDA_DEBUG
assert(found);
#endif

If all bytes of the input match BABY_BEAR_ORDER_BE, found remains false and to_range_check stays 0. The AIR should reject such inputs anyway (since the diff_marker and diff_val columns would all be zero, which won't satisfy the constraint sum(diff_marker) == 1). Still, a unconditional assert here would provide a clearer error message vs. a mysterious AIR constraint failure. If omitting the check for performance reasons, a brief comment explaining why would be helpful.

3. Poseidon2 buffer capacity formula — poseidon2/cuda.rs:121-123

pub fn poseidon2_buffer_capacity(max_trace_height: usize) -> usize {
    max_trace_height.next_power_of_two() * 2 * (DIGEST_SIZE * 2)
}

This yields 32 * max_trace_height. The call chip produces 2 poseidon2 records per row, and the output chip produces 1 per row. Maximum concurrent records would be 3 * max_trace_height. The current formula provides ~10x headroom, which is safe but may waste GPU memory on systems processing many segments. A comment explaining the rationale (or a tighter bound) would help maintainers.

4. Missing height bounds guard in call.cu:335-341 and output.cu:130-136

const uint32_t row_idx = blockIdx.x * blockDim.x + threadIdx.x;
RowSlice row(trace + row_idx, height);

if (row_idx >= num_records) {
    row.fill_zero(0, sizeof(DeferralCallCols<uint8_t>));
    return;
}

When kernel_launch_params(height) rounds up to block boundaries, threads with row_idx >= height will still execute fill_zero, writing beyond the allocated trace. However, this is the same pattern used throughout the codebase (e.g., rv32im/circuit/cuda/src/alu.cu), so it's either safe by convention (e.g., trace buffers are always over-allocated) or a systemic issue. Not specific to this PR.

Minor Observations

5. poseidon2_buffer.cuh:45 — bare assert vs CUDA_DEBUG

assert(record_idx < capacity && "DeferralPoseidon2Buffer overflow");

This uses an unconditional assert while most other bounds checks use CUDA_DEBUG-gated asserts. Since buffer overflow is a correctness-critical check, keeping it unconditional seems intentional and good. A brief comment noting this is deliberate would clarify.

6. Unused OUTPUT_LEN_NUM_BYTES constant — def_types.h:13

This is defined but never directly referenced in the CUDA code. Actually, it IS used transitively: OUTPUT_LEN_NUM_BYTESOUTPUT_TOTAL_BYTESOUTPUT_TOTAL_MEMORY_OPS, which is used throughout the call and output kernels. No action needed.

7. Builder aliasing and API changes — extension/mod.rs:42-52

The cfg_if! aliasing is clean:

cfg_if::cfg_if! {
    if #[cfg(feature = "cuda")] {
        pub use self::cuda::Rv32DeferralGpuBuilder as Rv32DeferralBuilder;
    } else {
        pub use self::Rv32DeferralCpuBuilder as Rv32DeferralBuilder;
    }
}

Tests correctly use Rv32DeferralBuilder which resolves to the right implementation. The rename from the old builder name to Rv32DeferralCpuBuilder follows the Rv32* naming convention.

Things That Look Good

  • Canonicity logic (canonicity.cuh) correctly mirrors the CPU-side generate_subrow: big-endian comparison, first-diff tracking, range check value computation.
  • Memory aux timestamp sequencing in call.cu and output.cu correctly uses incrementing timestamps matching CPU execution order.
  • Atomic poseidon2 buffer (def_poseidon2_buffer.cuh) — atomicAdd is the right primitive for concurrent record insertion.
  • Count chip reset in count/cuda.rs:38-40fill_zero() after tracegen correctly resets for the next segment.
  • CUB dedup strategy (sort + reduce-by-key) is efficient and appropriate for GPU dedup.
  • GPU prover extension wiring (extension/cuda.rs) correctly registers all four chips (count, poseidon2, call, output) in the same AIR order as the CPU path.
  • CI changes correctly add deferral to CUDA test matrix, enable root-prover for continuations, and the --profile=heavy removal in sdk.cuda.yml is a good cleanup (it was redundant with --test-threads=1).
  • Shared buffer architecture — having call and output chips write into a shared poseidon2 buffer via Arc<DeviceBuffer>, with dedup happening at poseidon2 tracegen time, is a clean design.

Summary

The PR is solid and follows established patterns. The main item worth addressing is #2 (canonicity assert gating) — while the AIR constraints should catch invalid inputs, making the assert unconditional would provide better error diagnostics. Items #1 and #3 are worth documenting with comments. Everything else looks correct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant