perf(v2-rc.1): optimize keccakf cuda kernel #2599
shuklaayush wants to merge 2 commits into develop-v2.0.0-rc.1
@codex review
Codex Review: Didn't find any major issues. Bravo.
Lint is failing. @gaxiom can review, but I also think it is likely fine given that both Claude and Codex approved.
that's because of
Claude finished @shuklaayush's task in 1m 27s
Code Review: perf: split keccakf into two-phase kernel
I've reviewed all 3 commits against
Correctness Verification
- Theta simplification (…)
- Rho/Pi cycle (…)
- Chi step (…)
- Two-phase split (…)
- Dummy handling: Zero-initialized state in phase 1 → valid keccak-f trace. Phase 2 writes
- Rust FFI (…)
Observations
Summary
The two-phase split is a clean architectural improvement: phase 1 is compute-bound (keccak-f rounds), phase 2 is memory-bound (trace writes), and separating them enables coalesced writes in the bottleneck phase. Combined with the register pressure reductions (~110 registers saved), this should yield a significant speedup. LGTM: correctness is solid across all transformations.
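For readers following the register-pressure point, here is a minimal host-side C++ sketch of two of the in-place steps named in the PR description: the rho/pi cycle and chi with two temporaries per row. It is not the PR's CUDA code. It assumes lane (x, y) is stored at index x + 5*y and takes the rotation offsets and lane walk from the Keccak specification; `kCycleIdx`, `kCycleRot`, and all function names are hypothetical stand-ins, and the PR's actual `RHO_PI_CYCLE_IDX` / `RHO_PI_CYCLE_ROT` values and state layout may differ.

```cpp
#include <array>
#include <cstdint>

using State = std::array<uint64_t, 25>;  // assumption: lane (x, y) at index x + 5*y

// Hypothetical cycle tables: destination lane and rotation for each step of
// the 24-lane rho/pi cycle that starts at lane 1 (lane 0 never moves).
constexpr int kCycleIdx[24] = {10, 7,  11, 17, 18, 3,  5,  16, 8,  21, 24, 4,
                               15, 23, 19, 13, 12, 2,  20, 14, 22, 9,  6,  1};
constexpr int kCycleRot[24] = {1,  3,  6,  10, 15, 21, 28, 36, 45, 55, 2,  14,
                               27, 41, 56, 8,  25, 43, 62, 18, 39, 61, 20, 44};

inline uint64_t rotl64(uint64_t v, int n) {
    n &= 63;
    return n == 0 ? v : (v << n) | (v >> (64 - n));
}

// In-place rho/pi: follow the cycle with one carried lane instead of
// materialising a second 5x5 array.
inline void rho_pi_in_place(State& s) {
    uint64_t carry = s[1];
    for (int i = 0; i < 24; ++i) {
        const int dst = kCycleIdx[i];
        const uint64_t displaced = s[dst];
        s[dst] = rotl64(carry, kCycleRot[i]);
        carry = displaced;
    }
}

// In-place chi: only the first two lanes of a row must be saved, because
// lanes 3 and 4 are the only ones that read values already overwritten.
inline void chi_in_place(State& s) {
    for (int y = 0; y < 5; ++y) {
        uint64_t* r = &s[5 * y];
        const uint64_t b0 = r[0], b1 = r[1];
        r[0] ^= ~r[1] & r[2];
        r[1] ^= ~r[2] & r[3];
        r[2] ^= ~r[3] & r[4];
        r[3] ^= ~r[4] & b0;
        r[4] ^= ~b0 & b1;
    }
}

// Out-of-place references written from the specification, for comparison.
inline State rho_pi_reference(const State& a) {
    State b{};
    b[0] = a[0];
    int x = 1, y = 0;
    for (int t = 0; t < 24; ++t) {
        const int rot = ((t + 1) * (t + 2) / 2) % 64;  // rho offset for lane (x, y)
        const int nx = y, ny = (2 * x + 3 * y) % 5;    // pi destination of lane (x, y)
        b[nx + 5 * ny] = rotl64(a[x + 5 * y], rot);
        x = nx;
        y = ny;
    }
    return b;
}

inline State chi_reference(const State& b) {
    State a{};
    for (int y = 0; y < 5; ++y)
        for (int x = 0; x < 5; ++x)
            a[x + 5 * y] = b[x + 5 * y] ^ (~b[(x + 1) % 5 + 5 * y] & b[(x + 2) % 5 + 5 * y]);
    return a;
}
```

Comparing each in-place function against its reference on an arbitrary state is a cheap way to check this kind of refactor; the in-place forms need one or two live temporaries where the references need a full second state.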
@codex review
Codex Review: Didn't find any major issues. What shall we delve into next?
Claude finished @jonathanpwang's task in 3m 11s
Code Review: Two-Phase Keccak-f Kernel Split
I've reviewed all changes across the 3 commits against
Correctness Analysis
- Two-phase split architecture: Correct. Phase 1 (…) All columns accounted for; no uninitialized memory.
- Theta simplification (…)
- In-place rho/pi (…)
- In-place chi (…)
- Dummy row handling: Correct. Zero-initialized state in phase 1 produces a valid keccak-f trace (keccak-f of all-zeros is well-defined). Phase 2 writes
- Rust FFI (…)
Observations (non-blocking)
Verdict
LGTM. The two-phase split is a clean, well-motivated optimization. Correctness is solid across all transformations. The ~58x kernel speedup with only ~1.9% additional memory overhead is excellent.
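The speedup and overhead figures in this verdict can be roughly reproduced from numbers in the PR description (24 rounds, 25 u64 lanes, 2634 trace columns, 9.2s → 160ms). The sketch below is only a back-of-the-envelope check: the 4-byte trace cell size is an assumption that the PR text does not state, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Figures quoted in the PR description.
constexpr std::size_t kRounds = 24;          // keccak-f rounds, one trace row each
constexpr std::size_t kLanes = 25;           // u64 lanes per round-input state
constexpr std::size_t kTraceColumns = 2634;  // trace width written by phase 2

// Assumption (not stated in the PR): each trace cell is a 4-byte field element.
constexpr std::size_t kCellBytes = 4;

// Scratch written by phase 1 for one permutation: 24 round-input states.
constexpr std::size_t scratch_bytes_per_perm() {
    return kRounds * kLanes * sizeof(std::uint64_t);
}

// Trace written by phase 2 for one permutation: 24 rows of 2634 cells.
constexpr std::size_t trace_bytes_per_perm() {
    return kRounds * kTraceColumns * kCellBytes;
}

// Extra memory for scratch, relative to the trace itself.
constexpr double scratch_overhead() {
    return static_cast<double>(scratch_bytes_per_perm()) /
           static_cast<double>(trace_bytes_per_perm());
}

// Ratio of total GPU time before and after, in seconds.
constexpr double kernel_speedup(double before_s, double after_s) {
    return before_s / after_s;
}
```

Under the 4-byte assumption this gives 4800 bytes of scratch per permutation against 252864 bytes of trace, a ratio of about 1.9%, and `kernel_speedup(9.2, 0.160)` is 57.5, consistent with the quoted ~58x.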
Summary
Redesign keccakf_perm_tracegen as a two-phase kernel to fix a memory coalescing bottleneck. The original single-kernel design assigned 24 consecutive rows per thread, causing adjacent warp threads to store 24 rows apart in the column-major trace and wasting ~97% of memory bandwidth. The new design separates state computation (thread-per-permutation) from trace writing (thread-per-row), so adjacent threads write adjacent rows and stores coalesce. ~58x kernel speedup (9.2s → 160ms total GPU time across 138 invocations on the reth benchmark).
Changes
p3_keccakf.cuh:
- apply_round_in_place: keccak-f round function (theta/rho/pi/chi/iota) without trace column writes, used by phase 1
- RHO_PI_CYCLE_IDX / RHO_PI_CYCLE_ROT constants for in-place rho/pi permutation cycle
- generate_trace_row_for_round: replace state_b[5][5] with in-place rho/pi, replace state_c_prime[5] with scalar d, in-place chi with 2 temps per row
keccakf_perm.cu:
- Phase 1 (keccakf_perm_phase1): one thread per permutation computes all 24 keccak-f rounds, stores 25-lane u64 round-input state to a scratch buffer (~4.8 KB/permutation)
- Phase 2 (keccakf_perm_phase2): one thread per row loads round state from scratch, writes all 2634 trace columns with coalesced stores
- … initial_state[5][5], fill_zero, and prev-row global readback
cuda_abi.rs / mod.rs:
- … (DeviceBuffer<u64>) through FFI to the CUDA kernels
Reth benchmark comparison
Resolves INT-6958
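The coalescing argument in the summary can be illustrated with a small host-side C++ model that counts how many memory segments one warp-wide store into a single trace column touches under each thread-to-row mapping. This is a sketch under assumptions, not the kernel code: it assumes 4-byte trace cells, a segment-aligned column base, and a 32-thread warp, and every name is hypothetical.

```cpp
#include <cstddef>
#include <set>

constexpr std::size_t kWarpSize = 32;
constexpr std::size_t kRowsPerPerm = 24;  // one trace row per keccak-f round
constexpr std::size_t kCellBytes = 4;     // assumption: 4-byte trace cells

// Original mapping: thread tid owns 24 consecutive rows and writes them one
// round at a time, so neighbouring threads are 24 rows apart.
inline std::size_t row_thread_per_permutation(std::size_t tid, std::size_t round) {
    return tid * kRowsPerPerm + round;
}

// New phase-2 mapping: thread tid owns exactly one row.
inline std::size_t row_thread_per_row(std::size_t tid) { return tid; }

// Number of distinct segment_bytes-sized segments touched when every thread
// in a warp stores one cell of the same column. In a column-major trace the
// byte offset inside a column is row * kCellBytes.
template <typename RowOf>
std::size_t segments_touched(RowOf row_of, std::size_t segment_bytes) {
    std::set<std::size_t> segments;
    for (std::size_t tid = 0; tid < kWarpSize; ++tid)
        segments.insert(row_of(tid) * kCellBytes / segment_bytes);
    return segments.size();
}
```

In this model the original mapping spreads a warp's 128 useful bytes over 24 separate 128-byte segments, while the thread-per-row mapping fits them in one. The exact waste percentage depends on the hardware's transaction granularity, so the model shows the shape of the problem and does not derive the ~97% figure.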