Description
We've known for a while that the existing inliner+`mem2reg` setup can lead to long compile times (e.g. #851), but we've never had a simple stress test to measure that behavior and e.g. validate potential solutions against, only full projects.

Thankfully, @schell published a crabslab crate with a `#[derive(...)]` proc macro which generates what is effectively a "shallow deserialization from GPU buffer" method, and I was able to turn that into a benchmark:
|  | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(3.3ⁿ) | 0.664 | 0.965 | 2.135 | 9.124 | 46.472 | 247.335 |
| post-inline `mem2reg` | O(5.4ⁿ) | 0.054 | 0.245 | 1.173 | 7.584 | 43.371 | 239.904 |
| `spirv-opt` | O(2.2ⁿ) | 0.081 | 0.169 | 0.351 | 0.767 | 1.783 | 4.397 |
| inline | O(3.4ⁿ) | 0 | 0.007 | 0.020 | 0.067 | 0.234 | 0.959 |
| SPIR-V -> SPIR-T | O(2ⁿ) | 0.005 | 0.008 | 0.014 | 0.032 | 0.071 | 0.167 |

(all times in seconds)
If you don't mind the very rudimentary curve-fitting (I've also simplified the row names from their `-Z time-passes` ones), what you should be able to notice is that there are two trends:
- ~2ⁿ: the amount of SPIR-V generated (as observed by `SPIR-V -> SPIR-T` and `spirv-opt`)
  - this is intended for this test: there should be 2ⁿ leaf calls generated and inlined (see the sketch below)
  - the inliner itself should also fit here, but it's not bottom-up, so it presumably has extra inefficiencies
    - while working on the fix, I saw how much debuginfo gets generated, and that is likely a lot of the cost
- \>4ⁿ: post-inline `mem2reg` is at least (2ⁿ)², i.e. quadratic (or worse) in the amount of SPIR-V
  - we more or less knew this, but this test is simple enough that it shouldn't have any `mem2reg` work left!
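
As a rough illustration of where those 2ⁿ leaf calls come from (my own sketch with invented names, not crabslab's actual API nor the real derive expansion): for a nested pair-like type of depth n, a "shallow read from buffer" method that recurses into both halves bottoms out in 2ⁿ leaf reads, each of which is a call that has to be inlined:

```rust
// Hypothetical sketch only: the rough shape of a "read each field out of a
// GPU buffer" method, of the kind the crabslab-style derive generates.
pub trait ReadFromSlab: Sized {
    fn read(slab: &[u32], offset: &mut usize) -> Self;
}

impl ReadFromSlab for f32 {
    // Leaf read: one call per leaf, i.e. 2ⁿ calls for a depth-n pair type.
    fn read(slab: &[u32], offset: &mut usize) -> Self {
        let v = f32::from_bits(slab[*offset]);
        *offset += 1;
        v
    }
}

pub struct Pair<T>(T, T);

impl<T: ReadFromSlab> ReadFromSlab for Pair<T> {
    // Each nesting level doubles the number of leaf `read` calls below it.
    fn read(slab: &[u32], offset: &mut usize) -> Self {
        Pair(T::read(slab, offset), T::read(slab, offset))
    }
}
```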
What happened? We forgot to switch the inliner over to `OpPhi`s for its return value dataflow, so to this day it generates `OpVariable`s (w/ `OpStore`s replacing callee returns, and `OpLoad`s at the call sites):

rust-gpu/crates/rustc_codegen_spirv/src/linker/inline.rs, lines 658 to 664 (at 8678d58)
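
To make the two shapes concrete, here is a rough Rust-level analogy (mine, not the actual inliner output nor real SPIR-V): today the inlined callee's return value takes a round-trip through a stack slot, whereas the intended fix expresses it as plain value dataflow.

```rust
// What the inliner effectively produces today (Rust analogy only): the return
// value of the inlined callee is funneled through a variable...
pub fn inlined_via_variable(cond: bool) -> u32 {
    let ret_slot: u32;      // OpVariable introduced by the inliner
    if cond {
        ret_slot = 1;       // OpStore replacing `return 1` in the callee
    } else {
        ret_slot = 2;       // OpStore replacing `return 2` in the callee
    }
    ret_slot                // OpLoad at the original call site
}

// ...whereas with the intended `OpPhi`-based approach it is pure dataflow
// merged at the post-call block, leaving nothing for `mem2reg` to clean up.
pub fn inlined_via_phi(cond: bool) -> u32 {
    if cond { 1 } else { 2 } // becomes an OpPhi selecting between 1 and 2
}
```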
Some quick hacky test (using `OpUndef`), for two known projects, ended up making `mem2reg`:
- 15x faster (~730s -> ~50s) for @hatoo's rene
  - see also `rustc_codegen_spirv` taking a long time processing my (large) shader (#851)
  - EDIT: initially this said "3x", but that was an invalid comparison and Rust-GPU `main` is really slow
- 30x faster (~150s -> ~5s) on @schell's more recent renderling (at schell/renderling@d9f4d6f)

(that is, if we fix this bug, it could bring some projects from minutes to seconds - for them, `mem2reg` was spinning its wheels that entire time, due to those `OpVariable`s generated by the inliner, instead of actually helping)
Since this is caused by the inliner itself, and we have to force-inline calls taking pointers into buffers (due to SPIR-V not allowing them to be passed to calls), I repro'd with just `#[derive(Clone)]` too:
|  | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(1.7ⁿ) | 0.543 | 0.567 | 0.625 | 0.875 | 1.952 | 7.683 |
| post-inline `mem2reg` | O(4.8ⁿ) | 0 | 0.013 | 0.059 | 0.264 | 1.225 | 6.695 |
| `spirv-opt` | O(1.9ⁿ) | 0.009 | 0.012 | 0.022 | 0.046 | 0.096 | 0.204 |
| inline | O(3ⁿ) | 0 | 0 | 0 | 0.009 | 0.024 | 0.080 |
| SPIR-V -> SPIR-T | O(1.7ⁿ) | 0.003 | 0.004 | 0.007 | 0.010 | 0.019 | 0.047 |
That one is fast enough that it deserved more columns, but I'm not messing with `jq`/sorting any further.
There is, however, a very compact testcase that can be generated from it:
```rust
use spirv_std::spirv;

#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D12<f32>,
    out: &mut D12<f32>,
) {
    *out = buf.clone();
}
```
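
For a sense of scale (my own arithmetic, not a number from the measurements): each `D` level doubles the leaf count, so `D12<f32>` holds 2¹² = 4096 `f32` leaves, i.e. the force-inlined `clone` leaves on the order of 4096 per-leaf loads/stores (plus the inliner-introduced `OpVariable`s) for `mem2reg` to wade through.

```rust
// Sanity arithmetic only (not part of the benchmark): leaf count and raw data
// size of `D12<f32>` from the testcase above.
fn main() {
    let depth = 12u32;
    let leaves = 1usize << depth; // each `D` level doubles the leaves
    assert_eq!(leaves, 4096);
    let bytes = leaves * core::mem::size_of::<f32>();
    assert_eq!(bytes, 16 * 1024); // 16 KiB copied field-by-field
}
```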
- on `main`, it takes 692s (~11.5min) in `mem2reg`, and ~11.7s everywhere else
- with the local hacky workaround, it's down to ~6.2s in total
  - alright, that should be impossible, even the inlining is faster, the hack is doing too much
  - then again, variables do require weirder handling, and the inliner isn't bottom-up, so maybe?
  - either way, anywhere between 6 and 12 seconds should be possible with the true `OpPhi` fix
And if a 100x speedup isn't impressive enough (or 11-12 minutes not slow enough for a CI timeout), you can always bump it further: a `type D13<T> = D<D12<T>>;` should still take less than a minute once fixed, but anywhere from 45 minutes to a whole hour on `main` (I am not further delaying this issue just to prove that, though).
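
For completeness, the bumped variant as a standalone testcase would look like this (my own assembly of the pieces already shown above, replacing the `D12` entry point; the timing for it is only the rough estimate from the previous paragraph, not a new measurement):

```rust
use spirv_std::spirv;

#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;
type D13<T> = D<D12<T>>; // one level deeper: 2^13 = 8192 `f32` leaves

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D13<f32>,
    out: &mut D13<f32>,
) {
    *out = buf.clone();
}
```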