
Commit a4001e9

Comments on tight fusion of primitives with JIT

1 parent 45376b4 commit a4001e9

File tree: 2 files changed, +30 −4 lines changed

docs/implementation/compile/fusion.html (+13 −2)
@@ -8,8 +8,8 @@
<h1 id="loop-fusion-in-array-languages"><a class="header" href="#loop-fusion-in-array-languages">Loop fusion in array languages</a></h1>
<p>Interpreted array languages have a major problem. Let's say you evaluate some arithmetic on a few arrays. Perhaps the first operation adds two arrays. It will loop over them, ideally adding numbers a vector register at a time, and write the results to an array. Maybe next it will check if the result is more than 10. So it'll read vectors from the result, compare to 10, pack to bit booleans, and write to another array. Each primitive has been implemented well but the combination is already far from optimal! The first result array isn't needed: it would be much better to compare each added vector to 10 right when it's produced. The extra store and load (and index arithmetic) are instructions that we don't need, and the extra memory use also creates cache pressure that slows the program down even more.</p>
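<p>As a minimal sketch in C (plain scalar loops, ignoring SIMD; the function names are made up for illustration), the unfused evaluation materializes the sums while the fused loop never stores them:</p>
<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

// Unfused: each primitive is its own loop over a materialized result.
void add_arrays(double *r, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i &lt; n; i++) r[i] = a[i] + b[i];  // store n doubles
}
void gt10(uint8_t *r, const double *a, size_t n) {
    for (size_t i = 0; i &lt; n; i++) r[i] = a[i] &gt; 10;  // reload n doubles
}

// Fused: each comparison consumes a sum the moment it's produced,
// so the intermediate array (and its cache pressure) disappears.
void add_gt10(uint8_t *r, const double *a, const double *b, size_t n) {
    for (size_t i = 0; i &lt; n; i++) r[i] = (a[i] + b[i]) &gt; 10;
}
</code></pre>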
<p>Scalar languages don't have this problem. The programmer just writes the addition and comparison in a loop, the compiler compiles it, and every comparison naturally follows the corresponding addition. More modern languages might prefer approaches like iterators that abstract away the looping but still have the semantics of a fused loop. But an iterator call, let's say <code><span class='Value'>zipwith</span><span class='Paren'>(</span><span class='Function'>+</span><span class='Separator'>,</span> <span class='Value'>a.iter</span><span class='Paren'>()</span><span class='Separator'>,</span> <span class='Value'>b.iter</span><span class='Paren'>())</span><span class='Value'>.map</span><span class='Paren'>(</span><span class='Function'>&gt;</span><span class='Number'>10</span><span class='Paren'>)</span></code> to make up some syntax, has a pretty obvious array equivalent, and if the functions are pure the different semantics don't matter! This has led to several compiled array languages like <a href="https://www.snakeisland.com/apexup.htm">APEX</a> that work on the principle of re-interpreting the scalar parts of array operations in a way that fuses naturally.</p>
-
<p>Scalar compilation gives up many advantages inherent to array programming, a topic I discussed more broadly <a href="intro.html">here</a>. The obvious complaint is that you lose the vector instructions, but that's easy enough to dismiss. Any decent C compiler can auto-vectorize a loop, and so could an array compiler. But arithmetic is rarely the bottleneck, so let's say that the comparison's result will be used to filter a third array, that is, the expression is now <code><span class='Paren'>(</span><span class='Number'>10</span><span class='Function'>&lt;</span><span class='Value'>a</span><span class='Function'>+</span><span class='Value'>b</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Value'>c</span></code>. Filtering doesn't auto-vectorize! Two vectors of input will produce an output with a different, unknown size, which is enough to throw off the analysis. At least the C compilers I've dealt with will fall back to producing completely scalar code. Depending on type, this can actually be slower than CBQN's un-fused, but SIMD, primitives.</p>
-
<p>This example doesn't entirely reveal the extent of the problem (for one thing, writing filter's result a partial vector at a time isn't bad, the real difficulty would be fusing it with more arithmetic later on). But hopefully it gives a sense of the issues that arise. I believe fusing operations without losing CBQN's powerful single-primitive operations will require a system that considers not just the possibility of fusing at the scalar level but several layers, from a single vector up to larger blocks of memory.</p>
+
<p>Scalar compilation gives up many advantages inherent to array programming, a topic I discussed more broadly <a href="intro.html">here</a>. The obvious complaint is that you lose the vector instructions, but that's easy enough to dismiss. Any decent C compiler can auto-vectorize a loop, and so could an array compiler. But arithmetic is rarely the bottleneck, so let's say that the comparison's result will be used to filter a third array, that is, the expression is now <code><span class='Paren'>(</span><span class='Number'>10</span><span class='Function'>&lt;</span><span class='Value'>a</span><span class='Function'>+</span><span class='Value'>b</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Value'>c</span></code>. Filtering doesn't auto-vectorize! At least the C compilers I've dealt with will fall back to producing completely scalar code. Depending on type, this can actually be slower than CBQN's un-fused, but SIMD, primitives.</p>
+
<p>In this case the problem is more that compilers don't <em>know</em> how to vectorize <a href="../primitive/replicate.html#booleans">filtering</a>, since in AVX-512 at least there's an instruction to do it and write a partial result. Fusing the result into more arithmetic later would be a more fundamental difficulty, because a round of Replicate produces an unknown number of values so it can't be directly paired with an input vector. What we need is a way to cut the fusion at this point, writing to memory as an escape. I believe array languages are best served by two levels of fusion, a looser level that ensures this memory use isn't excessive, and tighter fusion at the level of registers.</p>
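<p>A sketch of that fused filter with AVX-512 intrinsics (assuming 8-byte floats and a length that's a multiple of 8; remainder handling omitted): this computes <code>(10&lt;a+b)/c</code> while writing only the compressed result.</p>
<pre><code>#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;

// (10 &lt; a+b) / c: compare each vector of sums against 10, then
// vcompresspd writes just the selected elements of c, contiguously.
size_t filter_gt10(double *out, const double *a, const double *b,
                   const double *c, size_t n) {
    double *o = out;
    __m512d ten = _mm512_set1_pd(10.0);
    for (size_t i = 0; i &lt; n; i += 8) {
        __m512d s = _mm512_add_pd(_mm512_loadu_pd(a + i),
                                  _mm512_loadu_pd(b + i));
        __mmask8 m = _mm512_cmp_pd_mask(s, ten, _CMP_GT_OQ);
        _mm512_mask_compressstoreu_pd(o, m, _mm512_loadu_pd(c + i));
        o += _mm_popcnt_u32(m);  // advance by however many were kept
    }
    return (size_t)(o - out);    // the result length isn't known up front
}
</code></pre>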
<h2 id="blocking-and-cache-levels"><a class="header" href="#blocking-and-cache-levels">Blocking and cache levels</a></h2>
<p>The loosest form of loop fusion goes by various names such as blocking, chunking, or tiling. Instead of running each primitive on the entire array, we run it on a block of a few kilobytes. Looping within a block stays separate, but the outer layer of looping over blocks can be fused. So in <code><span class='Paren'>(</span><span class='Number'>10</span><span class='Function'>&lt;</span><span class='Value'>a</span><span class='Function'>+</span><span class='Value'>b</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Value'>c</span></code> we'd add blocks from <code><span class='Value'>a</span></code> and <code><span class='Value'>b</span></code>, compare each one to 10, and use the result to filter a block of <code><span class='Value'>c</span></code>, before moving on to the next set of blocks. This has the advantage that it doesn't actually require compilation, as blocks can still be processed with pre-compiled functions. It has the disadvantage that each block operation still reads and writes to memory—hang on, what problem are we actually trying to solve here?</p>
<p>For basic arithmetic, working from memory is a big relative cost, because even at the fastest cache level a load or store costs about as much as the arithmetic itself. Heavier primitives like scans, filtering, transpose, or searching in a short list do a lot more work in between, so if load and store <em>only</em> cost about as much as arithmetic that's actually pretty good. But large arrays don't fit in the fastest cache level. If a primitive writes a large result, then by the time it's done only a little piece at the end is still saved in L1. The next primitive will start at the beginning and miss L1 entirely! If the interpreter instead works in blocks that are significantly smaller than the L1 cache, accesses between primitives should stay within L1, and only the boundary of fusion, meaning the initial reads and final write, can be slow.</p>
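<p>A sketch of the block-level driver for <code>(10&lt;a+b)/c</code>, with made-up names for the pre-compiled per-block kernels (nothing here is CBQN's actual interface):</p>
<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define BLOCK 2048  // elements per block, sized well under L1

// Hypothetical pre-compiled kernels, each looping over one block.
void   add_block(double *r, const double *a, const double *b, size_t n);
void   gt_block(uint8_t *r, const double *a, double x, size_t n);  // r[i] = a[i] &gt; x
size_t filter_block(double *r, const uint8_t *m, const double *c, size_t n);

// (10 &lt; a+b) / c, block by block: sum and mask are small scratch
// buffers that stay hot in cache across all three kernels.
size_t eval(double *out, const double *a, const double *b,
            const double *c, size_t n) {
    double  sum[BLOCK];
    uint8_t mask[BLOCK];
    size_t written = 0;
    for (size_t i = 0; i &lt; n; i += BLOCK) {
        size_t len = n - i &lt; BLOCK ? n - i : BLOCK;
        add_block(sum, a + i, b + i, len);
        gt_block(mask, sum, 10.0, len);
        written += filter_block(out + written, mask, c + i, len);
    }
    return written;
}
</code></pre>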
@@ -24,3 +24,14 @@ <h2 id="blocking-and-cache-levels"><a class="header" href="#blocking-and-cache-l
</ul>
<p>Multidimensional operations are a whole new world of trouble. With a 2D transpose, for example, you probably want to work on square-ish blocks. The short side should be at least a cache line long to avoid re-reading or re-writing cache lines. Tiling like this is also okay for shifts, scans, and folds in either direction, but in some cases maybe it would be better for a block to be a section of a row, or even a column.</p>
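<p>For reference, a plain tiled transpose in C (the square tile size is an assumption; as noted, the best block shape depends on element width and cache-line size):</p>
<pre><code>#include &lt;stddef.h&gt;

#define TILE 8  // 8 doubles: one 64-byte cache line per tile row

// Transpose an h×w row-major matrix into w×h in square-ish tiles,
// so both the reads and the writes touch whole cache lines.
void transpose(double *dst, const double *src, size_t h, size_t w) {
    for (size_t bi = 0; bi &lt; h; bi += TILE)
        for (size_t bj = 0; bj &lt; w; bj += TILE) {
            size_t ei = bi + TILE &lt; h ? bi + TILE : h;
            size_t ej = bj + TILE &lt; w ? bj + TILE : w;
            for (size_t i = bi; i &lt; ei; i++)
                for (size_t j = bj; j &lt; ej; j++)
                    dst[j * h + i] = src[i * w + j];
        }
}
</code></pre>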
<p>A computation that can be blocked but can't be freely reordered because of side effects, <code><span class='Function'>•Show</span><span class='Modifier'>¨</span></code> for example, can be fused with primitives if the elements are passed to it in the right order. But two such functions can't be fused because the first needs to run on every block before the second gets any. Fusion needs to be cut at some point between them, perhaps in a place where memory use is lowest. And a function that can't be blocked at all obviously can't be fused, but there may still be some value in reordering relative to primitives: for example <code><span class='Paren'>(</span><span class='Function'>F</span> <span class='Value'>c</span><span class='Paren'>)</span> <span class='Function'>×</span> <span class='Value'>a</span><span class='Function'>+</span><span class='Value'>b</span></code> is defined to compute <code><span class='Function'>+</span></code> before <code><span class='Function'>F</span></code> but doing them in the other order has the same result and allows <code><span class='Function'>+</span></code> and <code><span class='Function'>×</span></code> to be fused. This should only be done if any reordered primitives (<code><span class='Function'>+</span></code> here) are known not to have errors, to avoid calling <code><span class='Function'>F</span></code> and then throwing an error that should have happened first.</p>
+
<p>Blocking has a natural benefit for adaptive algorithms, which is that pattern checks will apply at the block level rather than the whole array level. For example, for filtering <code><span class='Value'>a</span><span class='Function'>/</span><span class='Value'>b</span></code>, if the number of result elements is small enough, then a sparse algorithm can skip past 0s in <code><span class='Value'>a</span></code> and get the result faster. If this number is tested per-block, the implementation can take advantage of sparse regions of <code><span class='Value'>a</span></code> even if it's not sparse overall.</p>
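<p>A sketch of the per-block check, with <code>a</code> stored one byte per element for simplicity (a real bit-boolean sparse path would skip whole zero words, and the threshold here is an arbitrary placeholder):</p>
<pre><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

// Hypothetical dense kernel, e.g. a SIMD compress loop.
size_t filter_dense(double *out, const uint8_t *a, const double *b, size_t n);

size_t filter_block(double *out, const uint8_t *a, const double *b, size_t n) {
    size_t ones = 0;
    for (size_t i = 0; i &lt; n; i++) ones += a[i];  // per-block pattern check
    if (ones * 16 &lt; n) {          // mostly 0s: take the sparse path
        size_t w = 0;
        for (size_t i = 0; i &lt; n; i++)
            if (a[i]) out[w++] = b[i];
        return w;
    }
    return filter_dense(out, a, b, n);  // otherwise the dense kernel
}
</code></pre>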
+
<h2 id="low-level-fusion"><a class="header" href="#low-level-fusion">Low-level fusion</a></h2>
+
<p>With a JIT compiler we can begin fusing smaller loops, to eliminate loads and stores entirely. Of course we'd rather fuse vector loops and not scalar ones. There are a few domains where this is easy, for example arithmetic on floats. But for the most part we quickly run into issues with types:</p>
+
<ul>
+
<li>Most arithmetic can overflow. How often do you need to check for overflow, and what do you do when a check fails? How do you combine the results when one iteration overflows and another doesn't? (One possible check is sketched after this list.)</li>
+
<li>Mixed types mean a different number of elements will fit into each register. So, if the calculation initially works on 2-byte ints but then a division switches it to floats, do we do a full 2-byte vector and then 4 copies of the float method, which might spill? Or only fill part of a 2-byte vector?</li>
+
<li>When comparisons give you booleans, do you pack the results together to handle more with one instruction, or leave them at the width of the compared elements?</li>
+
</ul>
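<p>On the overflow bullet, one well-known check for 16-bit addition compares the wrapping sum with the saturating sum; this is just a sketch of the technique, not any particular implementation's policy:</p>
<pre><code>#include &lt;immintrin.h&gt;
#include &lt;stdbool.h&gt;

// Add 16 lanes of int16; report whether any lane overflowed.
static bool add_i16_checked(__m256i *r, __m256i a, __m256i b) {
    __m256i wrap = _mm256_add_epi16(a, b);   // wrapping sum
    __m256i sat  = _mm256_adds_epi16(a, b);  // saturating sum
    *r = wrap;
    // The two sums differ exactly in the lanes that overflowed.
    return _mm256_movemask_epi8(_mm256_cmpeq_epi16(wrap, sat)) != -1;
}
</code></pre>
<p>What to do on a failed check, per the bullet, is the harder part: redo the block at a wider type, or switch the rest of the loop over entirely.</p>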
+
<p>The answers depend on which types are used and how much. A wrong answer for one step is not a big deal, but a badly wrong answer, like failing to pack booleans when they make up 10 out of 15 steps, might mean the fused loop would be better off being cut in half.</p>
+
<p>Folds and scans should be fusable when they have nice SIMD implementations (but overflow for scans becomes quite a gnarly problem). Folds are particularly valuable because of the small output, meaning an expression ending in a fold might need essentially no writes. Simpler non-arithmetic functions can be compiled profitably, for example consider <code><span class='Function'>⌽»↕</span><span class='Value'>n</span></code> which has no loops but would benefit from a fused implementation (albeit even more from being converted into arithmetic <code><span class='Number'>0</span><span class='Function'>⌈</span><span class='Paren'>(</span><span class='Value'>n</span><span class='Function'>-</span><span class='Number'>2</span><span class='Paren'>)</span><span class='Function'>-↕</span><span class='Value'>n</span></code>). There are a fairly limited number of these and they look easy enough to handle, even though shifts and reverse will require crossing vector and lane boundaries.</p>
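<p>For instance, a fused sum <code>+´a+b</code> (a sketch with AVX2 doubles; note that a SIMD fold necessarily reassociates the additions) reads both arguments once and writes nothing:</p>
<pre><code>#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;

// +´ a+b with no intermediate array: sums feed two independent
// accumulators (two to hide some of the FP add latency).
double sum_add(const double *a, const double *b, size_t n) {
    __m256d acc0 = _mm256_setzero_pd(), acc1 = _mm256_setzero_pd();
    size_t i = 0;
    for (; i + 8 &lt;= n; i += 8) {
        acc0 = _mm256_add_pd(acc0, _mm256_add_pd(_mm256_loadu_pd(a + i),
                                                 _mm256_loadu_pd(b + i)));
        acc1 = _mm256_add_pd(acc1, _mm256_add_pd(_mm256_loadu_pd(a + i + 4),
                                                 _mm256_loadu_pd(b + i + 4)));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, _mm256_add_pd(acc0, acc1));
    double s = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i &lt; n; i++) s += a[i] + b[i];  // scalar tail
    return s;
}
</code></pre>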
+
<p>Selection and search primitives can be partly fused. The indexed-into argument (values for selection; searched-in values) needs to be known in advance. In some cases the primitive actually vectorizes in the other argument, with shuffle-based methods like in-register lookup tables and binary search. Otherwise it probably has to be evaluated with scalar code, or gather instructions, which take vector arguments but execute as a sequence of loads. But at worst you unpack the input vector into scalars and pack the result back into vectors. You'll still get the normal benefits of fusion, and maybe the surrounding actually-SIMD code will run while waiting on memory. For searches that build a table, this step could similarly be fused into the computation of the searched-in argument.</p>
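<p>To illustrate the shuffle-based case: selection from a table of at most 16 byte values never has to leave registers (SSSE3 <code>pshufb</code>; a sketch rather than CBQN's actual selection code, with indices assumed in range):</p>
<pre><code>#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

// r[i] = tab[idx[i]] for a tab of up to 16 bytes: the whole table
// lives in one register and pshufb performs 16 lookups at a time.
void select16(uint8_t *r, const uint8_t tab[16],
              const uint8_t *idx, size_t n) {
    __m128i t = _mm_loadu_si128((const __m128i *)tab);
    size_t i = 0;
    for (; i + 16 &lt;= n; i += 16) {
        __m128i ix = _mm_loadu_si128((const __m128i *)(idx + i));
        _mm_storeu_si128((__m128i *)(r + i), _mm_shuffle_epi8(t, ix));
    }
    for (; i &lt; n; i++) r[i] = tab[idx[i]];  // scalar tail
}
</code></pre>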
