The sequence for calculating the start of the next iteration has high latency,
so moving it to earlier in the loop improves IPC. Results from Zen 4 suggest
~10% better IPC and throughput across the board.
The kernel had a long dependency chain, and `vpbroadcastd`, `vpcmpleud`, and
`kmovw` all have fairly high latencies, especially on Zen 4 (Ice Lake is a few
cycles shorter).
With the old code, even passing in `-march=znver4 -mtune=znver4` isn't
enough for the compilers to fully move this sequence before the intersection
subroutine.
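
For illustration only, here is a minimal scalar sketch of the reordering,
assuming a block-wise sorted-set intersection. The actual kernel is not shown
in this message, so `BLOCK`, `intersect_block`, and `intersect_sorted` are
hypothetical names, and the scalar `<=` comparisons only stand in for the
`vpbroadcastd` / `vpcmpleud` / `kmovw` sequence. The point is the source order:
the start of the next iteration is computed before the intersection subroutine
runs, so its latency can overlap with that work.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 4 /* hypothetical block width; stands in for one vector register */

/* Hypothetical intersection subroutine: copies values present in both blocks. */
static size_t intersect_block(const uint32_t *a, const uint32_t *b, uint32_t *out)
{
    size_t n = 0;
    for (size_t x = 0; x < BLOCK; x++)
        for (size_t y = 0; y < BLOCK; y++)
            if (a[x] == b[y])
                out[n++] = a[x];
    return n;
}

/* Assumes both inputs are sorted, duplicate-free, and a multiple of BLOCK long.
 * `out` must have room for min(na, nb) values. Returns the number written. */
size_t intersect_sorted(const uint32_t *a, size_t na,
                        const uint32_t *b, size_t nb, uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;
    while (i < na && j < nb) {
        /* Hoisted: work out where the next iteration starts first.
         * This depends only on the last element of each block. */
        uint32_t a_max = a[i + BLOCK - 1];
        uint32_t b_max = b[j + BLOCK - 1];
        size_t next_i = i + (a_max <= b_max ? BLOCK : 0);
        size_t next_j = j + (b_max <= a_max ? BLOCK : 0);

        /* The expensive part runs after the advance has been issued. */
        n += intersect_block(a + i, b + j, out + n);

        i = next_i;
        j = next_j;
    }
    return n;
}
```

In scalar C a compiler is free to reorder these statements anyway; the sketch
only shows the intended ordering, not the measured effect.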