Inliner->mem2reg can quadratically amplify already-exponential inlining (turning seconds into minutes). #1136

Open
eddyb opened this issue Feb 20, 2024 · 1 comment
eddyb commented Feb 20, 2024

We've known for a while that the existing inliner+mem2reg setup can lead to long compile times (e.g. #851), but we've never had a simple stress test to measure that behavior and e.g. validate potential solutions against, only full projects.

Thankfully, @schell published a `crabslab` crate with a `#[derive(...)]` proc macro which generates what is effectively a "shallow deserialization from GPU buffer" method, and I was able to turn that into a benchmark:

| pass (s) | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(3.3ⁿ) | 0.664 | 0.965 | 2.135 | 9.124 | 46.472 | 247.335 |
| post-inline mem2reg | O(5.4ⁿ) | 0.054 | 0.245 | 1.173 | 7.584 | 43.371 | 239.904 |
| spirv-opt | O(2.2ⁿ) | 0.081 | 0.169 | 0.351 | 0.767 | 1.783 | 4.397 |
| inline | O(3.4ⁿ) | 0 | 0.007 | 0.020 | 0.067 | 0.234 | 0.959 |
| SPIR-V -> SPIR-T | O(2ⁿ) | 0.005 | 0.008 | 0.014 | 0.032 | 0.071 | 0.167 |

If you don't mind the very rudimentary curve-fitting (and note I've simplified the row names from the `-Z time-passes` output), two trends stand out:

  • ~2ⁿ: the amount of SPIR-V generated (as observed by SPIR-V -> SPIR-T and spirv-opt)
    • this is intended for this test: there should be 2ⁿ leaf calls generated and inlined
    • the inliner itself should also fit here, but it's not bottom-up, so it presumably has extra inefficiencies
      • while working on the fix, I saw how much debuginfo gets generated; that is likely a lot of the cost
  • >4ⁿ: post-inline mem2reg is at least (2ⁿ)² = 4ⁿ, i.e. quadratic (or worse) in the amount of SPIR-V
    • we more or less knew this, but this test is simple enough that it shouldn't leave any mem2reg work at all!
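The rudimentary fit can be reproduced from the table alone; this is an illustrative sketch (not the fitting actually used for the numbers above), estimating the base `b` of `t(n) ≈ c·bⁿ` as the geometric mean of consecutive timing ratios:

```rust
// Estimate b in t(n) ≈ c·bⁿ from the timings for consecutive n.
// Illustrative reconstruction only, not the original fitting script.
fn fit_base(times: &[f64]) -> f64 {
    // geometric mean of t(n+1)/t(n), which telescopes to this:
    (times[times.len() - 1] / times[0]).powf(1.0 / (times.len() - 1) as f64)
}

fn main() {
    let total = [0.664, 0.965, 2.135, 9.124, 46.472, 247.335];
    let mem2reg = [0.054, 0.245, 1.173, 7.584, 43.371, 239.904];
    println!("total:   O({:.1}ⁿ)", fit_base(&total));
    println!("mem2reg: O({:.1}ⁿ)", fit_base(&mem2reg));
}
```

On the `total` and `post-inline mem2reg` rows this comes out near the O(3.3ⁿ) and O(5.4ⁿ) shown in the table.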

What happened? We forgot to switch the inliner over to OpPhis for its return value dataflow, so to this day it generates OpVariables (with OpStores replacing callee returns, and OpLoads at the call site):

```rust
if let Some(call_result_type) = call_result_type {
    // Generate the storage space for the return value: Do this *after* the split above,
    // because if block_idx=0, inserting a variable here shifts call_index.
    insert_opvariables(
        &mut caller.blocks[0],
        [Instruction::new(
            Op::Variable,
            // ...
```
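To illustrate the difference in plain Rust terms (this is an analogy for readers, not the actual rspirv rewrite): the current lowering funnels every callee return through a mutable local (OpVariable + OpStore/OpLoad), while the OpPhi approach merges the return values directly at the join point, with no memory round-trip:

```rust
// Plain-Rust analogy of the two lowerings (hypothetical names, for illustration only).

// What the inliner generates today: a stack slot for the return value.
fn lowered_with_variable(cond: bool) -> u32 {
    let ret: u32;       // OpVariable in the caller's entry block
    if cond {
        ret = 1;        // OpStore replacing `return 1` in the callee
    } else {
        ret = 2;        // OpStore replacing `return 2` in the callee
    }
    ret                 // OpLoad at the call site
}

// The intended lowering: the merge block picks the value directly.
fn lowered_with_phi(cond: bool) -> u32 {
    if cond { 1 } else { 2 }    // OpPhi at the merge block
}
```

Turning the first form into the second is exactly the job mem2reg ends up doing for every inlined call, which is why generating the OpPhi form directly removes all that work.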

A quick hacky test (using OpUndef), for two known projects, ended up making mem2reg:

(that is, if we fix this bug, it could bring some projects from minutes to seconds - for them, mem2reg was spinning its wheels that entire time, due to those OpVariables generated by the inliner, instead of actually helping)


Since this is caused by the inliner itself, and we have to force-inline calls taking pointers into buffers (due to SPIR-V not allowing them to be passed to calls), I repro'd with just `#[derive(Clone)]` too:

| pass (s) | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(1.7ⁿ) | 0.543 | 0.567 | 0.625 | 0.875 | 1.952 | 7.683 |
| post-inline mem2reg | O(4.8ⁿ) | 0 | 0.013 | 0.059 | 0.264 | 1.225 | 6.695 |
| spirv-opt | O(1.9ⁿ) | 0.009 | 0.012 | 0.022 | 0.046 | 0.096 | 0.204 |
| inline | O(3ⁿ) | 0 | 0 | 0 | 0.009 | 0.024 | 0.080 |
| SPIR-V -> SPIR-T | O(1.7ⁿ) | 0.003 | 0.004 | 0.007 | 0.010 | 0.019 | 0.047 |

That one is fast enough that it deserved more columns, but I'm not messing with `jq`/sorting any further.

There is, however, a very compact testcase that can be generated from it:

```rust
#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D12<f32>,
    out: &mut D12<f32>,
) {
    *out = buf.clone();
}
```

  • on main, it takes 692s (~11.5min) in mem2reg, and ~11.7s everywhere else
  • with the local hacky workaround, it's down to ~6.2s in total
    • alright, that should be impossible - even the inlining is faster, so the hack must be doing too much
    • then again, variables do require weirder handling, and the inliner isn't bottom-up, so maybe?
    • either way, anywhere between 6 and 12 seconds should be possible with the true OpPhi fix
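For reference, the blow-up in the testcase above is easy to quantify (a sketch with a hypothetical helper, just counting leaves): each `D<T>` level doubles the number of `T` fields, so `D12<f32>` holds 2¹² floats, and the force-inlined `Clone` expands into that many leaf calls - the ~2ⁿ trend in the tables:

```rust
// Each D<T> level doubles the leaf count: a depth-n nesting of D<> around f32
// contains 2ⁿ f32 fields (hypothetical helper, for illustration).
fn leaves(depth: u32) -> u64 {
    1u64 << depth
}

fn main() {
    assert_eq!(leaves(4), 16);    // D4<f32>
    assert_eq!(leaves(12), 4096); // D12<f32>: 4096 f32s, one inlined leaf call each
}
```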

And if a 100x speedup isn't impressive enough (or 11-12 minutes not slow enough for a CI timeout), you can always bump it further: a `type D13<T> = D<D12<T>>;` should still take less than a minute once fixed, but anywhere from 45 minutes to a whole hour on main (I am not further delaying this issue just to prove that, though).
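That D13 guess is consistent with the measurements above; a back-of-envelope sketch, assuming mem2reg cost is roughly quadratic in SPIR-V size (so one more `D<>` level, which doubles the code, roughly quadruples mem2reg):

```rust
fn main() {
    let d12_mem2reg_secs = 692.0; // measured on main, from the testcase above
    // one more D<> level doubles the SPIR-V; quadratic mem2reg then ~4x's the time
    let d13_estimate_mins = d12_mem2reg_secs * 4.0 / 60.0;
    println!("{d13_estimate_mins:.0} min"); // ≈46 min, within the 45min-1h guess
}
```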

@eddyb self-assigned this Feb 20, 2024