Description
We've known for a while that the existing inliner+`mem2reg` setup can lead to long compile times (e.g. #851), but we've never had a simple stress test to measure that behavior and e.g. validate potential solutions against, only full projects.

Thankfully, @schell published a crabslab crate with a `#[derive(...)]` proc macro which generates what is effectively a "shallow deserialization from GPU buffer" method, and I was able to turn that into a benchmark:
|  | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(3.3ⁿ) | 0.664 | 0.965 | 2.135 | 9.124 | 46.472 | 247.335 |
| post-inline `mem2reg` | O(5.4ⁿ) | 0.054 | 0.245 | 1.173 | 7.584 | 43.371 | 239.904 |
| `spirv-opt` | O(2.2ⁿ) | 0.081 | 0.169 | 0.351 | 0.767 | 1.783 | 4.397 |
| inline | O(3.4ⁿ) | 0 | 0.007 | 0.020 | 0.067 | 0.234 | 0.959 |
| SPIR-V -> SPIR-T | O(2ⁿ) | 0.005 | 0.008 | 0.014 | 0.032 | 0.071 | 0.167 |

(all times in seconds)
If you don't mind the very rudimentary curve-fitting (I've also simplified the row names from their `-Z time-passes` ones), what you should be able to notice is that there are two trends:
- ~2ⁿ: the amount of SPIR-V generated (as observed by `SPIR-V -> SPIR-T` and `spirv-opt`)
  - this is intended for this test: there should be 2ⁿ leaf calls generated and inlined (see the sketch below)
  - the inliner itself should also fit here, but it's not bottom-up, so it presumably has extra inefficiencies
    - while working on the fix, I saw how much debuginfo gets generated, and that is likely a lot of the cost
- \>4ⁿ: post-inline `mem2reg` is at least (2ⁿ)², i.e. quadratic (or worse) in the amount of SPIR-V
  - we more or less knew this, but this test is simple enough that it shouldn't have any `mem2reg` work left!
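
As a rough illustration of where those 2ⁿ leaf calls come from (my own sketch with invented names, not crabslab's actual API nor the real derive expansion): for a nested pair-like type of depth n, a "shallow read from buffer" method that recurses into both halves bottoms out in 2ⁿ leaf reads, each of which is a call that has to be inlined:

```rust
// Hypothetical sketch only: the rough shape of a "read each field out of a
// GPU buffer" method, of the kind the crabslab-style derive generates.
pub trait ReadFromSlab: Sized {
    fn read(slab: &[u32], offset: &mut usize) -> Self;
}

impl ReadFromSlab for f32 {
    // Leaf read: one call per leaf, i.e. 2ⁿ calls for a depth-n pair type.
    fn read(slab: &[u32], offset: &mut usize) -> Self {
        let v = f32::from_bits(slab[*offset]);
        *offset += 1;
        v
    }
}

pub struct Pair<T>(T, T);

impl<T: ReadFromSlab> ReadFromSlab for Pair<T> {
    // Each nesting level doubles the number of leaf `read` calls below it.
    fn read(slab: &[u32], offset: &mut usize) -> Self {
        Pair(T::read(slab, offset), T::read(slab, offset))
    }
}
```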
What happened? We forgot to switch the inliner over to `OpPhi`s for its return value dataflow, so to this day it generates `OpVariable`s (w/ `OpStore`s replacing callee returns, and `OpLoad`s at the call sites):

rust-gpu/crates/rustc_codegen_spirv/src/linker/inline.rs, lines 658 to 664 (at 8678d58)
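
To make the two shapes concrete, here is a rough Rust-level analogy (mine, not the actual inliner output nor real SPIR-V): today the inlined callee's return value takes a round-trip through a stack slot, whereas the intended fix expresses it as plain value dataflow.

```rust
// What the inliner effectively produces today (Rust analogy only): the return
// value of the inlined callee is funneled through a variable...
pub fn inlined_via_variable(cond: bool) -> u32 {
    let ret_slot: u32;      // OpVariable introduced by the inliner
    if cond {
        ret_slot = 1;       // OpStore replacing `return 1` in the callee
    } else {
        ret_slot = 2;       // OpStore replacing `return 2` in the callee
    }
    ret_slot                // OpLoad at the original call site
}

// ...whereas with the intended `OpPhi`-based approach it is pure dataflow
// merged at the post-call block, leaving nothing for `mem2reg` to clean up.
pub fn inlined_via_phi(cond: bool) -> u32 {
    if cond { 1 } else { 2 } // becomes an OpPhi selecting between 1 and 2
}
```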
Some quick hacky test (using `OpUndef`), for two known projects, ended up making `mem2reg`:
- 15x faster (~730s -> ~50s) for @hatoo's rene
  - see also `rustc_codegen_spirv` taking a long time processing my (large) shader (#851)
  - EDIT: initially this said "3x", but that was an invalid comparison and Rust-GPU `main` is really slow
- 30x faster (~150s -> ~5s) on @schell's more recent renderling (at schell/renderling@d9f4d6f)

(that is, if we fix this bug, it could bring some projects from minutes to seconds - for them, `mem2reg` was spinning its wheels that entire time, due to those `OpVariable`s generated by the inliner, instead of actually helping)
Since this is caused by the inliner itself, and we have to force-inline calls taking pointers into buffers (due to SPIR-V not allowing them to be passed to calls), I repro'd with just `#[derive(Clone)]` too:
|  | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(1.7ⁿ) | 0.543 | 0.567 | 0.625 | 0.875 | 1.952 | 7.683 |
| post-inline `mem2reg` | O(4.8ⁿ) | 0 | 0.013 | 0.059 | 0.264 | 1.225 | 6.695 |
| `spirv-opt` | O(1.9ⁿ) | 0.009 | 0.012 | 0.022 | 0.046 | 0.096 | 0.204 |
| inline | O(3ⁿ) | 0 | 0 | 0 | 0.009 | 0.024 | 0.080 |
| SPIR-V -> SPIR-T | O(1.7ⁿ) | 0.003 | 0.004 | 0.007 | 0.010 | 0.019 | 0.047 |
That one is fast enough that it deserved more columns, but I'm not messing with `jq`/sorting any further.
There is, however, a very compact testcase that can be generated from it:
```rust
use spirv_std::spirv;

#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D12<f32>,
    out: &mut D12<f32>,
) {
    *out = buf.clone();
}
```
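
For a sense of scale (my own arithmetic, not a number from the measurements): each `D` level doubles the leaf count, so `D12<f32>` holds 2¹² = 4096 `f32` leaves, i.e. the force-inlined `clone` leaves on the order of 4096 per-leaf loads/stores (plus the inliner-introduced `OpVariable`s) for `mem2reg` to wade through.

```rust
// Sanity arithmetic only (not part of the benchmark): leaf count and raw data
// size of `D12<f32>` from the testcase above.
fn main() {
    let depth = 12u32;
    let leaves = 1usize << depth; // each `D` level doubles the leaves
    assert_eq!(leaves, 4096);
    let bytes = leaves * core::mem::size_of::<f32>();
    assert_eq!(bytes, 16 * 1024); // 16 KiB copied field-by-field
}
```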
- on `main`, it takes 692s (~11.5min) in `mem2reg`, and ~11.7s everywhere else
- with the local hacky workaround, it's down to ~6.2s in total
  - alright, that should be impossible, even the inlining is faster, the hack is doing too much
  - then again, variables do require weirder handling, and the inliner isn't bottom-up, so maybe?
  - either way, anywhere between 6 and 12 seconds should be possible with the true `OpPhi` fix
And if a 100x speedup isn't impressive enough (or 11-12 minutes not slow enough for a CI timeout), you can always bump it further: a `type D13<T> = D<D12<T>>;` should still take less than a minute once fixed, but anywhere from 45 minutes to a whole hour on `main` (I am not further delaying this issue just to prove that, though).
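
For completeness, the bumped variant as a standalone testcase would look like this (my own assembly of the pieces already shown above, replacing the `D12` entry point; the timing for it is only the rough estimate from the previous paragraph, not a new measurement):

```rust
use spirv_std::spirv;

#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;
type D13<T> = D<D12<T>>; // one level deeper: 2^13 = 8192 `f32` leaves

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D13<f32>,
    out: &mut D13<f32>,
) {
    *out = buf.clone();
}
```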