Skip to content

perf(v2-rc.1): optimize keccakf cuda kernel#2599

Open
shuklaayush wants to merge 2 commits intodevelop-v2.0.0-rc.1from
perf/optimize-keccakf-kernel
Open

perf(v2-rc.1): optimize keccakf cuda kernel#2599
shuklaayush wants to merge 2 commits intodevelop-v2.0.0-rc.1from
perf/optimize-keccakf-kernel

Conversation

@shuklaayush
Copy link
Collaborator

@shuklaayush shuklaayush commented Mar 20, 2026

Summary

Redesign keccakf_perm_tracegen as a two-phase kernel to fix a memory coalescing bottleneck. The original single-kernel design assigned 24 consecutive rows per thread, causing adjacent warp threads to store 24 rows apart in the column-major trace — wasting ~97% of memory bandwidth. The new design separates state computation (thread-per-permutation) from trace writing (thread-per-row), so adjacent threads write adjacent rows and stores coalesce. ~58x kernel speedup (9.2s → 160ms total GPU time across 138 invocations on the reth benchmark).

Changes

p3_keccakf.cuh:

  • Add apply_round_in_place: keccak-f round function (theta/rho/pi/chi/iota) without trace column writes, used by phase 1
  • Add RHO_PI_CYCLE_IDX/RHO_PI_CYCLE_ROT constants for in-place rho/pi permutation cycle
  • generate_trace_row_for_round: replace state_b[5][5] with in-place rho/pi, replace state_c_prime[5] with scalar d, in-place chi with 2 temps per row

keccakf_perm.cu:

  • Phase 1 kernel (keccakf_perm_phase1): one thread per permutation computes all 24 keccak-f rounds, stores 25-lane u64 round-input state to a scratch buffer (~4.8 KB/permutation)
  • Phase 2 kernel (keccakf_perm_phase2): one thread per row loads round state from scratch, writes all 2634 trace columns with coalesced stores
  • Remove initial_state[5][5], fill_zero, and prev-row global readback
  • Unify valid/dummy code paths (dummy rows use zero-initialized state)

cuda_abi.rs / mod.rs:

  • Plumb scratch buffer (DeviceBuffer<u64>) through FFI to the CUDA kernels
  • Scratch overhead is ~1.9% of the trace matrix (4800 bytes/perm for scratch vs 252864 bytes/perm for the trace), freed after tracegen completes

Reth benchmark comparison

Resolves INT-6958

@github-actions

This comment was marked as outdated.

@shuklaayush shuklaayush force-pushed the perf/optimize-keccakf-kernel branch from 321a83f to 25261d1 Compare March 20, 2026 16:53
@github-actions

This comment was marked as outdated.

@shuklaayush
Copy link
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. Bravo.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@branch-rebase-bot branch-rebase-bot bot force-pushed the develop-v2.0.0-rc.1 branch from f6a3b41 to 3b2b382 Compare March 20, 2026 17:24
@shuklaayush shuklaayush force-pushed the perf/optimize-keccakf-kernel branch 2 times, most recently from bdf84f2 to 35b0525 Compare March 20, 2026 17:26
@github-actions

This comment was marked as outdated.

@shuklaayush shuklaayush marked this pull request as ready for review March 20, 2026 17:56
@github-actions

This comment was marked as outdated.

@shuklaayush shuklaayush changed the title perf: optimize keccakf cuda kernel perf(v2-rc.1): optimize keccakf cuda kernel Mar 20, 2026
@jonathanpwang jonathanpwang requested review from gaxiom and removed request for jonathanpwang March 20, 2026 18:49
@jonathanpwang
Copy link
Contributor

lint failing

@gaxiom can review, but also I think it is likely fine given both claude & codex approved

@shuklaayush
Copy link
Collaborator Author

lint failing

that's because of cargo audit which will be addressed in another pr

@branch-rebase-bot branch-rebase-bot bot force-pushed the develop-v2.0.0-rc.1 branch from 3b2b382 to 4911c15 Compare March 20, 2026 20:13
@shuklaayush shuklaayush force-pushed the perf/optimize-keccakf-kernel branch from 35b0525 to b4a1873 Compare March 20, 2026 20:19
@github-actions

This comment was marked as outdated.

@github-actions

This comment was marked as outdated.

@shuklaayush shuklaayush force-pushed the perf/optimize-keccakf-kernel branch from 0ff54d0 to d3cc15c Compare March 20, 2026 22:23
@github-actions
Copy link

github-actions bot commented Mar 20, 2026

Claude finished @shuklaayush's task in 1m 27s —— View job


  • Read changed files and base versions
  • Analyze optimization correctness
  • Post review feedback

Code Review: perf: split keccakf into two-phase kernel

I've reviewed all 3 commits against develop-v2.0.0-rc.1. The PR contains two complementary optimizations: (1) register pressure reduction via in-place keccak-f steps, and (2) a two-phase kernel split for coalesced column-major writes.


Correctness Verification

Theta simplification (p3_keccakf.cuh:152-163): Correct. C[x] ^ C'[x] = C[x] ^ (C[x] ^ d) = d. Computing scalar d = C[x-1] ^ ROTL(C[x+1], 1) directly is algebraically equivalent. The c_prime column still correctly writes state_c[x] ^ d.

Rho/Pi cycle (p3_keccakf.cuh:167-175): Correct. Spot-checked:

  • cycle[0]=1 (x=1,y=0) → dest (0,2) = flat 10 = cycle[23], R[1][0]=1 = RHO_PI_CYCLE_ROT[23]
  • cycle[1]=6 (x=1,y=1) → dest (1,0) = flat 1 = cycle[0], R[1][1]=44 = RHO_PI_CYCLE_ROT[0]
  • (0,0) correctly excluded as fixed point ✓

Chi step (p3_keccakf.cuh:178-186): Correct. t0/t1 save lanes 0,1 before writes. Lanes 2-4 are read before being overwritten. Wrap-around uses saved temps.

Two-phase split (keccakf_perm.cu):

  • Phase 1: stores round-input state to scratch before each round, then advances in-place — correct
  • Phase 2: loads round state, writes inner.a directly from it (eliminating prev-row global readback), recomputes one round for intermediates — correct
  • inner.preimage reads from d_round_states at round-0 offset of each perm — correct since round-0 input = preimage

fill_zero removal: Safe — every field explicitly written: preimage, a, all intermediates via generate_trace_row_for_round, _export, timestamp.

Dummy handling: Zero-initialized state in phase 1 → valid keccak-f trace. Phase 2 writes _export=0, timestamp=0 for dummies. ✓

Rust FFI (cuda_abi.rs, mod.rs): Scratch buffer sized blocks_to_fill * NUM_ROUNDS * 25 u64s, matching C-side assertion. as_mut_ptr() + .len() passed through correctly.


Observations

  1. Code duplication between apply_round_in_place and generate_trace_row_for_round: Both implement the same theta/rho-pi/chi/iota logic — one without trace writes (phase 1) and one with (phase 2). If either is updated without the other, they'll silently diverge. Consider adding a comment cross-referencing the two functions, or a brief note that they must stay in sync.

  2. Scratch memory budget (mod.rs:167): 4800 bytes/perm. For 2^20 rows (~43K perms), that's ~200 MB. The comment notes this is ~1.9% of trace size, which is good. Worth documenting an upper bound if memory pressure becomes a concern with very large traces.

  3. Phase 1 d_round_states write pattern: Each thread writes 24×25 = 600 u64s sequentially. Adjacent threads write to offsets 600 apart, so these stores are not coalesced. Since phase 1 is compute-bound (24 rounds of keccak-f per thread), this is likely fine — the stores are amortized over the computation. But worth noting if phase 1 ever becomes a bottleneck.

  4. preimage_limbs aliasing in phase 2 (keccakf_perm.cu:115): d_round_states is declared const *__restrict__ but preimage_limbs aliases into it via reinterpret_cast. This is fine since the restrict only applies to d_trace, not between different read-only pointers. Just noting for clarity.

  5. Minor: d_round_states mutability (cuda_abi.rs:131): d_round_states.as_mut_ptr() is called on a buffer that phase 2 only reads. This works because the C side takes uint64_t * for phase 1 and uint64_t const * for phase 2, but the Rust FFI declaration uses a single *mut u64. Could add a comment noting phase 2 only reads.


Summary

The two-phase split is a clean architectural improvement — phase 1 is compute-bound (keccak-f rounds), phase 2 is memory-bound (trace writes), and separating them enables coalesced writes in the bottleneck phase. Combined with the register pressure reductions (~110 registers saved), this should yield a significant speedup.

LGTM — correctness is solid across all transformations.

@shuklaayush
Copy link
Collaborator Author

@codex review

@chatgpt-codex-connector
Copy link

Codex Review: Didn't find any major issues. What shall we delve into next?

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@jpw-axiom jpw-axiom force-pushed the develop-v2.0.0-rc.1 branch from 4911c15 to cf270ee Compare March 21, 2026 04:02
@jonathanpwang jonathanpwang force-pushed the perf/optimize-keccakf-kernel branch from d3cc15c to 455ad1b Compare March 21, 2026 04:05
@github-actions
Copy link

github-actions bot commented Mar 21, 2026

Claude finished @jonathanpwang's task in 3m 11s —— View job


  • Read changed files and base versions
  • Analyze optimization correctness
  • Post review feedback

Code Review: Two-Phase Keccak-f Kernel Split

I've reviewed all changes across the 3 commits against develop-v2.0.0-rc.1. The optimization is well-motivated and correctly implemented.


Correctness Analysis

Two-phase split architecture — Correct. Phase 1 (keccakf_perm_phase1) stores the 25-lane u64 round-input state to scratch before each apply_round_in_place call. Phase 2 (keccakf_perm_phase2) loads that state, writes inner.a from it, then calls generate_trace_row_for_round which recomputes one round to materialize all intermediate columns. This eliminates the old prev-row global readback pattern entirely.

fill_zero removal (keccakf_perm.cu) — Safe. I traced every field of KeccakfPermCols:

  • inner.step_flagsgenerate_trace_row_for_round does COL_FILL_ZERO then sets the active flag
  • inner._export, timestamp — explicitly written for all rows (lines 126-132)
  • inner.preimage — written from d_round_states at round-0 offset (line 116)
  • inner.a — written from loaded current_state (line 120)
  • inner.c, c_prime, a_prime, a_prime_prime, a_prime_prime_0_0_bits, a_prime_prime_prime_0_0_limbs — all written by generate_trace_row_for_round

All columns accounted for; no uninitialized memory.

Theta simplification (p3_keccakf.cuh:152-163) — Correct. C[x] ^ C'[x] = C[x] ^ (C[x] ^ d) = d where d = C[x-1] ^ ROTL(C[x+1], 1). The A'[x,y] ^= d is algebraically equivalent to the old A'[x,y] ^= C[x] ^ C'[x].

In-place rho/pi (p3_keccakf.cuh:167-175) — Correct. The 24-element cycle correctly excludes (0,0) as a fixed point. Manually verified sample entries: cycle[0]=1 (x=1,y=0) → dest=10 = cycle[23], R[1][0]=1 = RHO_PI_CYCLE_ROT[23] ✓.

In-place chi (p3_keccakf.cuh:178-186) — Correct. t0/t1 save lanes 0,1 before they're overwritten. Lanes 2-4 are read before being written. The wrap-around at lanes 3,4 correctly uses the saved temps.

Dummy row handling — Correct. Zero-initialized state in phase 1 produces a valid keccak-f trace (keccak-f of all-zeros is well-defined). Phase 2 writes _export=0, timestamp=0 for perm_idx >= num_records.

preimage aliasing (keccakf_perm.cu:114-116) — Safe. d_round_states is const *__restrict__ in phase 2, and preimage_limbs aliases into it as read-only. The __restrict__ applies to d_trace, not between different read-only pointers.

Rust FFI (cuda_abi.rs, mod.rs) — Correct. Scratch buffer sized blocks_to_fill * NUM_ROUNDS * 25 u64s matches the C-side assertion at line 151-153 of keccakf_perm.cu.


Observations (non-blocking)

  1. Code duplication between apply_round_in_place and generate_trace_row_for_round: Both implement the same theta/rho-pi/chi/iota logic. If either is updated without the other, they'll silently diverge and produce incorrect proofs. Consider adding a comment cross-referencing the two functions (e.g. // IMPORTANT: must stay in sync with generate_trace_row_for_round).

  2. Phase 1 store pattern: Each thread writes 24×25 = 600 u64s sequentially, with adjacent threads 600 apart — not coalesced. This is fine since phase 1 is compute-bound (24 rounds of keccak-f amortize the stores), but worth noting if phase 1 ever becomes a bottleneck.

  3. d_round_states mutability (cuda_abi.rs:134): as_mut_ptr() is called on a buffer that phase 2 only reads. The C-side correctly types it as uint64_t const * for phase 2, but the Rust FFI uses a single *mut u64 for both phases. Minor — could add a comment clarifying phase 2 is read-only.


Verdict

LGTM — The two-phase split is a clean, well-motivated optimization. Correctness is solid across all transformations. The ~58x kernel speedup with only ~1.9% additional memory overhead is excellent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants