perf(v2-rc.1): optimize keccakf cuda kernel by shuklaayush · Pull Request #2599 · openvm-org/openvm

shuklaayush · 2026-03-20T16:15:34Z

Summary

Redesign keccakf_perm_tracegen as a two-phase kernel to fix a memory coalescing bottleneck. The original single-kernel design assigned 24 consecutive rows per thread, causing adjacent warp threads to store 24 rows apart in the column-major trace — wasting ~97% of memory bandwidth. The new design separates state computation (thread-per-permutation) from trace writing (thread-per-row), so adjacent threads write adjacent rows and stores coalesce. ~58x kernel speedup (9.2s → 160ms total GPU time across 138 invocations on the reth benchmark).

Changes

p3_keccakf.cuh:

Add apply_round_in_place: keccak-f round function (theta/rho/pi/chi/iota) without trace column writes, used by phase 1
Add RHO_PI_CYCLE_IDX/RHO_PI_CYCLE_ROT constants for in-place rho/pi permutation cycle
generate_trace_row_for_round: replace state_b[5][5] with in-place rho/pi, replace state_c_prime[5] with scalar d, in-place chi with 2 temps per row

keccakf_perm.cu:

Phase 1 kernel (keccakf_perm_phase1): one thread per permutation computes all 24 keccak-f rounds, stores 25-lane u64 round-input state to a scratch buffer (~4.8 KB/permutation)
Phase 2 kernel (keccakf_perm_phase2): one thread per row loads round state from scratch, writes all 2634 trace columns with coalesced stores
Remove initial_state[5][5], fill_zero, and prev-row global readback
Unify valid/dummy code paths (dummy rows use zero-initialized state)

cuda_abi.rs / mod.rs:

Plumb scratch buffer (DeviceBuffer<u64>) through FFI to the CUDA kernels
Scratch overhead is ~1.9% of the trace matrix (4800 bytes/perm for scratch vs 252864 bytes/perm for the trace), freed after tracegen completes

Reth benchmark comparison

Resolves INT-6958

shuklaayush · 2026-03-20T17:03:15Z

@codex review

chatgpt-codex-connector · 2026-03-20T17:10:50Z

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

jonathanpwang · 2026-03-20T18:50:33Z

lint failing

@gaxiom can review, but also I think it is likely fine given both claude & codex approved

shuklaayush · 2026-03-20T19:43:37Z

lint failing

that's because of cargo audit which will be addressed in another pr

github-actions · 2026-03-20T22:23:52Z

Claude finished @shuklaayush's task in 1m 27s —— View job

Read changed files and base versions
Analyze optimization correctness
Post review feedback

Code Review: perf: split keccakf into two-phase kernel

I've reviewed all 3 commits against develop-v2.0.0-rc.1. The PR contains two complementary optimizations: (1) register pressure reduction via in-place keccak-f steps, and (2) a two-phase kernel split for coalesced column-major writes.

Correctness Verification

Theta simplification (p3_keccakf.cuh:152-163): Correct. C[x] ^ C'[x] = C[x] ^ (C[x] ^ d) = d. Computing scalar d = C[x-1] ^ ROTL(C[x+1], 1) directly is algebraically equivalent. The c_prime column still correctly writes state_c[x] ^ d.

Rho/Pi cycle (p3_keccakf.cuh:167-175): Correct. Spot-checked:

cycle[0]=1 (x=1,y=0) → dest (0,2) = flat 10 = cycle[23], R[1][0]=1 = RHO_PI_CYCLE_ROT[23] ✓
cycle[1]=6 (x=1,y=1) → dest (1,0) = flat 1 = cycle[0], R[1][1]=44 = RHO_PI_CYCLE_ROT[0] ✓
(0,0) correctly excluded as fixed point ✓

Chi step (p3_keccakf.cuh:178-186): Correct. t0/t1 save lanes 0,1 before writes. Lanes 2-4 are read before being overwritten. Wrap-around uses saved temps.

Two-phase split (keccakf_perm.cu):

Phase 1: stores round-input state to scratch before each round, then advances in-place — correct
Phase 2: loads round state, writes inner.a directly from it (eliminating prev-row global readback), recomputes one round for intermediates — correct
inner.preimage reads from d_round_states at round-0 offset of each perm — correct since round-0 input = preimage

fill_zero removal: Safe — every field explicitly written: preimage, a, all intermediates via generate_trace_row_for_round, _export, timestamp.

Dummy handling: Zero-initialized state in phase 1 → valid keccak-f trace. Phase 2 writes _export=0, timestamp=0 for dummies. ✓

Rust FFI (cuda_abi.rs, mod.rs): Scratch buffer sized blocks_to_fill * NUM_ROUNDS * 25 u64s, matching C-side assertion. as_mut_ptr() + .len() passed through correctly.

Observations

Code duplication between apply_round_in_place and generate_trace_row_for_round: Both implement the same theta/rho-pi/chi/iota logic — one without trace writes (phase 1) and one with (phase 2). If either is updated without the other, they'll silently diverge. Consider adding a comment cross-referencing the two functions, or a brief note that they must stay in sync.
Scratch memory budget (mod.rs:167): 4800 bytes/perm. For 2^20 rows (~43K perms), that's ~200 MB. The comment notes this is ~1.9% of trace size, which is good. Worth documenting an upper bound if memory pressure becomes a concern with very large traces.
Phase 1 d_round_states write pattern: Each thread writes 24×25 = 600 u64s sequentially. Adjacent threads write to offsets 600 apart, so these stores are not coalesced. Since phase 1 is compute-bound (24 rounds of keccak-f per thread), this is likely fine — the stores are amortized over the computation. But worth noting if phase 1 ever becomes a bottleneck.
preimage_limbs aliasing in phase 2 (keccakf_perm.cu:115): d_round_states is declared const *__restrict__ but preimage_limbs aliases into it via reinterpret_cast. This is fine since the restrict only applies to d_trace, not between different read-only pointers. Just noting for clarity.
Minor: d_round_states mutability (cuda_abi.rs:131): d_round_states.as_mut_ptr() is called on a buffer that phase 2 only reads. This works because the C side takes uint64_t * for phase 1 and uint64_t const * for phase 2, but the Rust FFI declaration uses a single *mut u64. Could add a comment noting phase 2 only reads.

Summary

The two-phase split is a clean architectural improvement — phase 1 is compute-bound (keccak-f rounds), phase 2 is memory-bound (trace writes), and separating them enables coalesced writes in the bottleneck phase. Combined with the register pressure reductions (~110 registers saved), this should yield a significant speedup.

LGTM — correctness is solid across all transformations.

shuklaayush · 2026-03-20T22:24:18Z

@codex review

chatgpt-codex-connector · 2026-03-20T22:30:36Z

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

github-actions · 2026-03-21T04:05:14Z

Claude finished @jonathanpwang's task in 3m 11s —— View job

Read changed files and base versions
Analyze optimization correctness
Post review feedback

Code Review: Two-Phase Keccak-f Kernel Split

I've reviewed all changes across the 3 commits against develop-v2.0.0-rc.1. The optimization is well-motivated and correctly implemented.

Correctness Analysis

Two-phase split architecture — Correct. Phase 1 (keccakf_perm_phase1) stores the 25-lane u64 round-input state to scratch before each apply_round_in_place call. Phase 2 (keccakf_perm_phase2) loads that state, writes inner.a from it, then calls generate_trace_row_for_round which recomputes one round to materialize all intermediate columns. This eliminates the old prev-row global readback pattern entirely.

fill_zero removal (keccakf_perm.cu) — Safe. I traced every field of KeccakfPermCols:

inner.step_flags — generate_trace_row_for_round does COL_FILL_ZERO then sets the active flag
inner._export, timestamp — explicitly written for all rows (lines 126-132)
inner.preimage — written from d_round_states at round-0 offset (line 116)
inner.a — written from loaded current_state (line 120)
inner.c, c_prime, a_prime, a_prime_prime, a_prime_prime_0_0_bits, a_prime_prime_prime_0_0_limbs — all written by generate_trace_row_for_round

All columns accounted for; no uninitialized memory.

Theta simplification (p3_keccakf.cuh:152-163) — Correct. C[x] ^ C'[x] = C[x] ^ (C[x] ^ d) = d where d = C[x-1] ^ ROTL(C[x+1], 1). The A'[x,y] ^= d is algebraically equivalent to the old A'[x,y] ^= C[x] ^ C'[x].

In-place rho/pi (p3_keccakf.cuh:167-175) — Correct. The 24-element cycle correctly excludes (0,0) as a fixed point. Manually verified sample entries: cycle[0]=1 (x=1,y=0) → dest=10 = cycle[23], R[1][0]=1 = RHO_PI_CYCLE_ROT[23] ✓.

In-place chi (p3_keccakf.cuh:178-186) — Correct. t0/t1 save lanes 0,1 before they're overwritten. Lanes 2-4 are read before being written. The wrap-around at lanes 3,4 correctly uses the saved temps.

Dummy row handling — Correct. Zero-initialized state in phase 1 produces a valid keccak-f trace (keccak-f of all-zeros is well-defined). Phase 2 writes _export=0, timestamp=0 for perm_idx >= num_records.

preimage aliasing (keccakf_perm.cu:114-116) — Safe. d_round_states is const *__restrict__ in phase 2, and preimage_limbs aliases into it as read-only. The __restrict__ applies to d_trace, not between different read-only pointers.

Rust FFI (cuda_abi.rs, mod.rs) — Correct. Scratch buffer sized blocks_to_fill * NUM_ROUNDS * 25 u64s matches the C-side assertion at line 151-153 of keccakf_perm.cu.

Observations (non-blocking)

Code duplication between apply_round_in_place and generate_trace_row_for_round: Both implement the same theta/rho-pi/chi/iota logic. If either is updated without the other, they'll silently diverge and produce incorrect proofs. Consider adding a comment cross-referencing the two functions (e.g. // IMPORTANT: must stay in sync with generate_trace_row_for_round).
Phase 1 store pattern: Each thread writes 24×25 = 600 u64s sequentially, with adjacent threads 600 apart — not coalesced. This is fine since phase 1 is compute-bound (24 rounds of keccak-f amortize the stores), but worth noting if phase 1 ever becomes a bottleneck.
d_round_states mutability (cuda_abi.rs:134): as_mut_ptr() is called on a buffer that phase 2 only reads. The C-side correctly types it as uint64_t const * for phase 2, but the Rust FFI uses a single *mut u64 for both phases. Minor — could add a comment clarifying phase 2 is read-only.

Verdict

LGTM — The two-phase split is a clean, well-motivated optimization. Correctness is solid across all transformations. The ~58x kernel speedup with only ~1.9% additional memory overhead is excellent.