We've known for a while that the existing inliner + `mem2reg` setup can lead to long compile times (e.g. #851), but we've never had a simple stress test to measure that behavior and, e.g., validate potential solutions against; we only had full projects.
Thankfully, @schell published the `crabslab` crate, whose `#[derive(...)]` proc macro generates what is effectively a "shallow deserialization from GPU buffer" method, and I was able to turn that into a benchmark:
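To see why such a derive stresses the inliner, here is a hypothetical sketch (not `crabslab`'s actual API; the `ReadFromSlab` trait and all names are invented for illustration) of what this kind of derive expands to: each field is read by calling the field type's own reader, so nesting pair-like types n levels deep produces a call tree with 2ⁿ leaves.

```rust
// Hypothetical stand-in for a "shallow deserialization from GPU buffer" trait.
trait ReadFromSlab: Sized {
    fn read(slab: &[u32], at: &mut usize) -> Self;
}

// Leaf case: read one word and advance the cursor.
impl ReadFromSlab for u32 {
    fn read(slab: &[u32], at: &mut usize) -> Self {
        let v = slab[*at];
        *at += 1;
        v
    }
}

struct Pair<T>(T, T);

// What a derive would generate for `Pair<T>`: one call per field,
// so the call tree doubles at every level of nesting.
impl<T: ReadFromSlab> ReadFromSlab for Pair<T> {
    fn read(slab: &[u32], at: &mut usize) -> Self {
        Pair(T::read(slab, at), T::read(slab, at))
    }
}

fn main() {
    let slab = [1, 2, 3, 4];
    let mut at = 0;
    // `Pair<Pair<u32>>` makes 4 leaf reads; n levels of nesting make 2ⁿ.
    let p: Pair<Pair<u32>> = ReadFromSlab::read(&slab, &mut at);
    println!("{} {}", (p.0).0, (p.1).1); // prints: 1 4
}
```

All of these leaf calls get force-inlined on the SPIR-V side, which is what the benchmark below measures.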
(all times in seconds)

| | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(3.3ⁿ) | 0.664 | 0.965 | 2.135 | 9.124 | 46.472 | 247.335 |
| post-inline `mem2reg` | O(5.4ⁿ) | 0.054 | 0.245 | 1.173 | 7.584 | 43.371 | 239.904 |
| `spirv-opt` | O(2.2ⁿ) | 0.081 | 0.169 | 0.351 | 0.767 | 1.783 | 4.397 |
| `inline` | O(3.4ⁿ) | 0 | 0.007 | 0.020 | 0.067 | 0.234 | 0.959 |
| SPIR-V -> SPIR-T | O(2ⁿ) | 0.005 | 0.008 | 0.014 | 0.032 | 0.071 | 0.167 |
If you don't mind the very rudimentary curve-fitting (I've also simplified the row names from their `-Z time-passes` forms), you should be able to notice two trends:

- **~2ⁿ**: the amount of SPIR-V generated (as observed by SPIR-V -> SPIR-T and `spirv-opt`)
  - this is intended for this test: there should be 2ⁿ leaf calls generated and inlined
  - the inliner itself should also fit here, but it's not bottom-up, so it presumably has extra inefficiencies
    - while working on the fix, I saw the amount of debuginfo generated; that is likely a lot of the cost
- **>4ⁿ**: post-inline `mem2reg` is at least (2ⁿ)², i.e. quadratic (or worse) in the amount of SPIR-V
  - we more or less knew this, but this test is simple enough that it shouldn't have any `mem2reg` work left!
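As a sanity check on those exponents: the rudimentary fit can be reproduced from the ratio of the first and last measurements (my own sketch; the issue doesn't say exactly how the fit was computed, but this matches the reported bases):

```rust
// Assuming t(n) ≈ c·bⁿ, the base is b ≈ (t(n_last)/t(n_first))^(1/(n_last−n_first)).
fn fitted_base(times: &[(u32, f64)]) -> f64 {
    let (n0, t0) = times[0];
    let (n1, t1) = times[times.len() - 1];
    (t1 / t0).powf(1.0 / (n1 - n0) as f64)
}

fn main() {
    // "total" and "post-inline mem2reg" rows from the table above (seconds).
    let total = [(4, 0.664), (5, 0.965), (6, 2.135), (7, 9.124), (8, 46.472), (9, 247.335)];
    let mem2reg = [(4, 0.054), (5, 0.245), (6, 1.173), (7, 7.584), (8, 43.371), (9, 239.904)];
    println!("total:   O({:.1}ⁿ)", fitted_base(&total));   // ≈ 3.3
    println!("mem2reg: O({:.1}ⁿ)", fitted_base(&mem2reg)); // ≈ 5.4
}
```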
What happened? We forgot to switch the inliner over to `OpPhi`s for its return value dataflow, so to this day it generates `OpVariable`s (with `OpStore`s replacing callee returns, and `OpLoad`s at the call site); see `rust-gpu/crates/rustc_codegen_spirv/src/linker/inline.rs`, lines 658 to 664 (at 8678d58).

Some quick hacky test (using `OpUndef`), for two known projects, ended up making `mem2reg` much cheaper:

- 730s -> ~50s for @hatoo's rene (#851)
- 150s -> ~5s on @schell's more recent renderling (at schell/renderling@d9f4d6f)
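For illustration, here's roughly what the two lowerings look like in SPIR-V pseudo-assembly (a hand-written sketch with invented `%` names, not the inliner's actual output):

```
; today: the inlined callee's return value goes through a function-local variable
%ret_var = OpVariable %_ptr_Function_float Function
           ...                          ; inlined callee body
           OpStore %ret_var %value_a    ; replaces `OpReturnValue %value_a`
           OpBranch %merge
           ...
           OpStore %ret_var %value_b    ; replaces another `OpReturnValue`
           OpBranch %merge
 %merge  = OpLabel
 %ret    = OpLoad %float %ret_var       ; the call site reads the variable back

; intended fix: pure dataflow via a phi, no variable left for mem2reg to clean up
 %merge  = OpLabel
 %ret    = OpPhi %float %value_a %block_a %value_b %block_b
```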
(That is, if we fix this bug, it could bring some projects from minutes to seconds: for them, `mem2reg` was spinning its wheels that entire time, due to those `OpVariable`s generated by the inliner, instead of actually helping.)
Since this is caused by the inliner itself, and we have to force-inline calls taking pointers into buffers (because SPIR-V doesn't allow such pointers to be passed to calls), I repro'd with just `#[derive(Clone)]` too:
(all times in seconds)

| | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(1.7ⁿ) | 0.543 | 0.567 | 0.625 | 0.875 | 1.952 | 7.683 |
| post-inline `mem2reg` | O(4.8ⁿ) | 0 | 0.013 | 0.059 | 0.264 | 1.225 | 6.695 |
| `spirv-opt` | O(1.9ⁿ) | 0.009 | 0.012 | 0.022 | 0.046 | 0.096 | 0.204 |
| `inline` | O(3ⁿ) | 0 | 0 | 0 | 0.009 | 0.024 | 0.080 |
| SPIR-V -> SPIR-T | O(1.7ⁿ) | 0.003 | 0.004 | 0.007 | 0.010 | 0.019 | 0.047 |
That one is fast enough that it deserved more columns, but I'm not messing with `jq`/sorting any further.
There is, however, a very compact testcase that can be generated from it:
- on `main`, it takes 692s (~11.5min) in `mem2reg`, and ~11.7s everywhere else
- with the local hacky workaround, it's down to ~6.2s in total
  - alright, that should be impossible; even the inlining is faster, so the hack is doing too much
  - then again, variables do require weirder handling, and the inliner isn't bottom-up, so maybe?
  - either way, anywhere between 6 and 12 seconds should be possible with the true `OpPhi` fix
And if a 100x speedup isn't impressive enough (or 11-12 minutes not slow enough for a CI timeout), you can always bump it further: a `type D13<T> = D<D12<T>>;` should still take less than a minute once fixed, but anywhere from 45 minutes to a whole hour on `main` (I am not further delaying this issue just to prove that, though).
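For reference, the doubling shape such a testcase would take can be sketched like this (a guessed reconstruction from the `D12`/`D13` names; the actual generated file may differ): each extra `type` alias doubles the number of leaf `clone()` calls the inliner has to flatten.

```rust
// Guessed shape of the compact stress test: `D<T>` holds two `T`s, so cloning
// a value at nesting depth k requires 2ᵏ leaf clones once fully inlined.
#[derive(Clone)]
struct Leaf(u32);

#[derive(Clone)]
struct D<T>(T, T);

type D0<T> = D<T>;
type D1<T> = D<D0<T>>;
type D2<T> = D<D1<T>>;
// ...continuing the chain up to `type D13<T> = D<D12<T>>;` for the slow case

fn main() {
    let v: D2<Leaf> = D(
        D(D(Leaf(1), Leaf(2)), D(Leaf(3), Leaf(4))),
        D(D(Leaf(5), Leaf(6)), D(Leaf(7), Leaf(8))),
    );
    let w = v.clone(); // 8 leaf clones here; D13 would need 2¹³ = 8192
    println!("{}", (((w.0).0).0).0); // prints: 1
}
```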