
Commit c13c9a1

Links to loop fusion page

1 parent a4001e9 commit c13c9a1

8 files changed: +10 −8 lines changed

docs/implementation/compile/fusion.html

+2-2
@@ -30,8 +30,8 @@ <h2 id="low-level-fusion"><a class="header" href="#low-level-fusion">Low-level f
 <ul>
 <li>Most arithmetic can overflow. How often do you need to check for overflow and what do you do when it fails? How do you combine the results when one iteration overflows and another doesn't?</li>
 <li>Mixed types mean a different number of elements will fit into each register. So, if the calculation initially works on 2-byte ints but then a division switches it to floats, do we do a full 2-byte vector and then 4 copies of the float method, which might spill? Or only fill part of a 2-byte vector?</li>
-<li>When comparisons give you booleans, do you them results together to handle more with one instruction, or leave them at the width of the compared elements?</li>
+<li>When comparisons give you booleans, do you pack results together to handle more with one instruction, or leave them at the width of the compared elements?</li>
 </ul>
 <p>The answers depend on which types are used and how much. A wrong answer for one step is not a big deal, but a badly wrong answer, like failing to pack booleans when they make up 10 out of 15 steps, might mean the fused loop would be better off being cut in half.</p>
 <p>Folds and scans should be fusable when they have nice SIMD implementations (but overflow for scans becomes quite a gnarly problem). Folds are particularly valuable because of the small output, meaning an expression ending in a fold might need essentially no writes. Simpler non-arithmetic functions can be compiled profitably, for example consider <code><span class='Function'>⌽»↕</span><span class='Value'>n</span></code> which has no loops but would benefit from a fused implementation (albeit, even more from being converted into arithmetic <code><span class='Number'>0</span><span class='Function'>⌈</span><span class='Paren'>(</span><span class='Value'>n</span><span class='Function'>-</span><span class='Number'>2</span><span class='Paren'>)</span><span class='Function'>-↕</span><span class='Value'>n</span></code>). There are a pretty limited number of these and they look pretty easy to handle, even though shifts and reverse will require crossing vector and lane boundaries.</p>
-<p>Selection and search primitives can be partly fused. The indexed-into argument (values for selection; searched-in values) needs to be known in advance. In some cases the primitive actually vectorizes in the other argument, with shuffle-based methods like in-register lookup tables and binary search. Otherwise it probably has to be evaluated with scalar code, or gather instructions which run on vectors but run as a sequence of loads. But at worst you unpack the input vector into scalars and pack the result back into vectors. You'll still get the normal benefits of fusion and maybe the surrounding actually-SIMD code will run while waiting on memory. For searches that build a table, this step could similarly be fused into the computation of the searched-in argument.</p>
+<p>Selection and search primitives can be partly fused. The indexed-into argument (values for selection; searched-in values) needs to be known in advance. In some cases the primitive actually vectorizes in the other argument, with shuffle-based methods like in-register lookup tables and binary search. Otherwise it probably has to be evaluated with scalar code, or gather instructions which run on vectors but run as a sequence of loads. But at worst you unpack the input vector into scalars and pack the result back into vectors. You'll still get the normal benefits of fusion and maybe the surrounding actually-SIMD code will run while waiting on memory. For searches that build a table, this step could similarly be fused into the computation of the searched-in argument. Furthermore there are some possible ideas with sorted arguments: both sides can be fused in a selection where the indices are known to be sorted, or a search where both arguments are sorted.</p>

docs/implementation/compile/index.html

+1
@@ -10,6 +10,7 @@ <h1 id="optimizing-compilation-notes"><a class="header" href="#optimizing-compil
 <ul>
 <li><a href="intro.html">Array language compilation in context</a>, an introduction to the subject</li>
 <li><a href="dynamic.html">Dynamic compilation</a>, discussing high-level strategies</li>
+<li><a href="fusion.html">Loop fusion in array languages</a></li>
 </ul>
 <p>The self-hosted bytecode compiler isn't really documented, but there are some related resources elsewhere. I held a few early chat discussions on building an array-based compiler, but aborted these because the interactive format wasn't doing too much.</p>
 <ul>

docs/implementation/compile/intro.html

+1-1
@@ -25,7 +25,7 @@ <h3 id="and-which-is-better"><a class="header" href="#and-which-is-better">And w
 <p>Because static typing lets the programmer say exactly what type the data should have, it ought to be faster, right? Well that's a bold hypothesis, C Programming Language, given that you spent the entire 90s working to convince programmers that it's better to let the compiler choose which assembly instructions are run and in what order.</p>
 <p>Array type detection can be both fast and accurate. First off, scanning a BQN array to get the appropriate type is very quick to do with vector instructions, particularly if it's an integer array. But also, this isn't usually necessary, because most primitives have an obvious result type. Structural manipulation leaves the type unchanged, and arithmetic can be done with built-in overflow checking, adding about 10% overhead to the operation in a typical case.</p>
 <p>And on the other side, it's hard to pick a static type that always works. Most times your forum software is run, each user will have under <code><span class='Number'>2</span><span class='Function'>⋆</span><span class='Number'>15</span></code> posts. But not always, and using a short integer for the post count wouldn't be safe. With dynamically typed arrays, your program has the performance of a fast type when possible but scales to a larger one when necessary. So it can be both faster and more reliable.</p>
-<p>However, with current array implementation technology, these advantages only apply to array-style code, when you're calling primitives that each do a lot of work. In order to get many small primitive calls working quickly, you need a compiler. Compilers have another significant benefit in the form of <a href="https://en.wikipedia.org/wiki/Loop_fusion"><strong>loop fusion</strong></a>, allowing multiple primitives to be executed in one round without writing intermediate results. This is most important for arithmetic on arrays, where the cost of doing the actual operation is much less than the cost of reading and writing the results if evaluated using SIMD instructions. I think the benefits of fusion are often overstated because it's rare for such simple operations to make up a large portion of program runtime, but there's no doubt that it can provide some benefit for most array-oriented programs and a large benefit for some of them.</p>
+<p>However, with current array implementation technology, these advantages only apply to array-style code, when you're calling primitives that each do a lot of work. In order to get many small primitive calls working quickly, you need a compiler. Compilers have another significant benefit in the form of <a href="https://en.wikipedia.org/wiki/Loop_fusion"><strong>loop fusion</strong></a>, allowing multiple primitives to be executed in one round without writing intermediate results. This is most important for arithmetic on arrays, where the cost of doing the actual operation is much less than the cost of reading and writing the results if evaluated using SIMD instructions. I think the benefits of fusion are often overstated because it's rare for such simple operations to make up a large portion of program runtime, but there's no doubt that it can provide some benefit for most array-oriented programs and a large benefit for some of them. <a href="fusion.html">This page</a> has more about the topic.</p>
 <h2 id="optimizing-array-primitives"><a class="header" href="#optimizing-array-primitives">Optimizing array primitives</a></h2>
 <p>There is a bit more to say while we're still in interpreter-land (the title says &quot;compilation in context&quot;, but I'm sorry to inform you that this page is heavy on &quot;context&quot;, but not so hot on &quot;compilation&quot;, and frankly lukewarm on &quot;in&quot;). The function <code><span class='Function'>+</span><span class='Modifier'>´</span></code> isn't a primitive, it's two!</p>
 <p>The way that <code><span class='Function'>+</span><span class='Modifier'>´</span></code> is evaluated using specialized code is that <code><span class='Modifier'>´</span></code>, when invoked, checks whether its operand is one of a few known cases (at the time of writing, <code><span class='Function'>+×⌊⌈∧∨</span></code>). If so, it checks the type and applies special code accordingly. Arguably, this is a very rudimentary form of just-in-time compilation for the function <code><span class='Function'>+</span><span class='Modifier'>´</span></code>, as it takes a program that would apply <code><span class='Function'>+</span></code> many times and transforms it to special pre-compiled code that doesn't call the <code><span class='Function'>+</span></code> function. However, it's pretty different from what a compiled language would do in that this function is never associated with the object representing <code><span class='Function'>+</span><span class='Modifier'>´</span></code>, so that <code><span class='Function'>+</span><span class='Modifier'>´</span></code> is re-compiled each time it's run.</p>

docs/implementation/versusc.html

+1-1
@@ -172,7 +172,7 @@ <h3 id="high-level-versus-low-level"><a class="header" href="#high-level-versus-
 <p>Unlike BQN, C doesn't just fill in the details of what you told it to do. Auto-vectorization is an attempt to build some high-level understanding and use it to change over to a different low-level implementation. This is much harder than just implementing primitives and I think that's the main reason you won't see a C compiler do something like transposing a matrix with a SIMD kernel. C also has limitations on how it can rearrange memory accesses. A common one is that it can't read extra memory because this might segfault, so if you write a scalar search to find the first 0 in an array it's actually not legal to rewrite this to a vector search that might read past that 0.</p>
 <p>On the topic of memory, it's a very simple structure—the whole world, just a sequence of bytes!—but it's also mutinous I mean mutable. If you call an unknown function in C, it could write anywhere, so the compiler no longer knows the value of any part of memory. If you write to an unknown pointer, and pointers are hard to know mind you, it could change any part of memory too. This leads to a whole category of optimization problems known as <a href="https://en.wikipedia.org/wiki/Pointer_aliasing#Conflicts_with_optimization">pointer aliasing</a>, where something as simple as adding one to a bunch of values with a source and destination pointer can't be vectorized unless the pointers are known to not overlap.</p>
 <h4 id="fusion-versus-fission"><a class="header" href="#fusion-versus-fission">Fusion versus fission</a></h4>
-<p>I view getting the balance between <a href="https://en.wikipedia.org/wiki/Loop_fission_and_fusion">loop fusion and fission</a> right as a sort of holy grail of array programming. I so wish I could say &quot;we've already got one!&quot;. Nope, as it stands, C chooses fusion and BQN chooses fission. That is, a C programmer usually writes one loop with lots of stuff in it, but each BQN primitive is like a loop, making a BQN program a series of loops. But the best approaches usually have more complicated shapes. Some loops can be fused at the level of a vector instruction, and this is where C auto-vectorization works great and BQN is worst with lots of extra loads and stores. Loops involving filtering or other data movement might not be tightly fusable; auto-vectorization gives up and CBQN looks great in comparison. But it's still missing out on any instruction-level fusion that <em>can</em> be done (<code><span class='Value'>a</span><span class='Function'>/</span><span class='Value'>b</span><span class='Function'>+</span><span class='Value'>c</span><span class='Function'>×</span><span class='Value'>d</span></code> won't fuse <code><span class='Value'>b</span><span class='Function'>+</span><span class='Value'>c</span><span class='Function'>×</span><span class='Value'>d</span></code>), and if the arrays are large it's missing out on looser-grained fusion that would make better use of caches. It's a complicated topic; I should probably write a separate page on it.</p>
+<p>I view getting the balance between <a href="https://en.wikipedia.org/wiki/Loop_fission_and_fusion">loop fusion and fission</a> right as a sort of holy grail of array programming. I so wish I could say &quot;we've already got one!&quot;. Nope, as it stands, C chooses fusion and BQN chooses fission. That is, a C programmer usually writes one loop with lots of stuff in it, but each BQN primitive is like a loop, making a BQN program a series of loops. But the best approaches usually have more complicated shapes. Some loops can be fused at the level of a vector instruction, and this is where C auto-vectorization works great and BQN is worst with lots of extra loads and stores. Loops involving filtering or other data movement might not be tightly fusable; auto-vectorization gives up and CBQN looks great in comparison. But it's still missing out on any instruction-level fusion that <em>can</em> be done (<code><span class='Value'>a</span><span class='Function'>/</span><span class='Value'>b</span><span class='Function'>+</span><span class='Value'>c</span><span class='Function'>×</span><span class='Value'>d</span></code> won't fuse <code><span class='Value'>b</span><span class='Function'>+</span><span class='Value'>c</span><span class='Function'>×</span><span class='Value'>d</span></code>), and if the arrays are large it's missing out on looser-grained fusion that would make better use of caches. I've written more about the problem and approaches to it that BQN might take on <a href="compile/fusion.html">another page</a>.</p>
 <h3 id="dynamic-versus-static"><a class="header" href="#dynamic-versus-static">Dynamic versus static</a></h3>
 <p>A C compiler decides what it's going to do at compile time, before it's even caught a whiff of the data that'll be processed (all right, profile-guided optimization is a decent sniff in that direction, but no touching). CBQN decides what to do again every time a primitive is called. This has some overhead, but it also means these calls can adapt to conditions as they change.</p>
 <p>An example is selection, <code><span class='Function'>⊏</span></code>. If you select from any old array of 1-byte values, it'll pick one element at a time (okay, call a gather instruction that then loads one at a time) which I measure at 0.2ns per selection. If you select from a <em>small</em> array, say 32 values or less, CBQN will load them into vector registers and do the selection with shuffle instructions, 0.04ns per selection. That includes a range check, that C is supposedly speeding your code up by ignoring! By having the high-level information of a known right argument range, and checking it dynamically, BQN goes much faster in certain cases.</p>

implementation/compile/README.md

+1
@@ -6,6 +6,7 @@ Pages in this directory discuss advanced compilation strategies for BQN, that is
 
 - [Array language compilation in context](intro.md), an introduction to the subject
 - [Dynamic compilation](dynamic.md), discussing high-level strategies
+- [Loop fusion in array languages](fusion.md)
 
 The self-hosted bytecode compiler isn't really documented, but there are some related resources elsewhere. I held a few early chat discussions on building an array-based compiler, but aborted these because the interactive format wasn't doing too much.
 

implementation/compile/fusion.md

+2-2
@@ -36,10 +36,10 @@ Blocking has a natural benefit for adaptive algorithms, which is that pattern ch
 With a JIT compiler we can begin fusing smaller loops, to eliminate loads and stores entirely. Of course we'd rather fuse vector loops and not scalar ones. There are a few domains where this is easy, for example arithmetic on floats. But for the most part we quickly run into issues with types:
 - Most arithmetic can overflow. How often do you need to check for overflow and what do you do when it fails? How do you combine the results when one iteration overflows and another doesn't?
 - Mixed types mean a different number of elements will fit into each register. So, if the calculation initially works on 2-byte ints but then a division switches it to floats, do we do a full 2-byte vector and then 4 copies of the float method, which might spill? Or only fill part of a 2-byte vector?
-- When comparisons give you booleans, do you them results together to handle more with one instruction, or leave them at the width of the compared elements?
+- When comparisons give you booleans, do you pack results together to handle more with one instruction, or leave them at the width of the compared elements?
 
 The answers depend on which types are used and how much. A wrong answer for one step is not a big deal, but a badly wrong answer, like failing to pack booleans when they make up 10 out of 15 steps, might mean the fused loop would be better off being cut in half.
 
 Folds and scans should be fusable when they have nice SIMD implementations (but overflow for scans becomes quite a gnarly problem). Folds are particularly valuable because of the small output, meaning an expression ending in a fold might need essentially no writes. Simpler non-arithmetic functions can be compiled profitably, for example consider `⌽»↕n` which has no loops but would benefit from a fused implementation (albeit, even more from being converted into arithmetic `0⌈(n-2)-↕n`). There are a pretty limited number of these and they look pretty easy to handle, even though shifts and reverse will require crossing vector and lane boundaries.
 
-Selection and search primitives can be partly fused. The indexed-into argument (values for selection; searched-in values) needs to be known in advance. In some cases the primitive actually vectorizes in the other argument, with shuffle-based methods like in-register lookup tables and binary search. Otherwise it probably has to be evaluated with scalar code, or gather instructions which run on vectors but run as a sequence of loads. But at worst you unpack the input vector into scalars and pack the result back into vectors. You'll still get the normal benefits of fusion and maybe the surrounding actually-SIMD code will run while waiting on memory. For searches that build a table, this step could similarly be fused into the computation of the searched-in argument.
+Selection and search primitives can be partly fused. The indexed-into argument (values for selection; searched-in values) needs to be known in advance. In some cases the primitive actually vectorizes in the other argument, with shuffle-based methods like in-register lookup tables and binary search. Otherwise it probably has to be evaluated with scalar code, or gather instructions which run on vectors but run as a sequence of loads. But at worst you unpack the input vector into scalars and pack the result back into vectors. You'll still get the normal benefits of fusion and maybe the surrounding actually-SIMD code will run while waiting on memory. For searches that build a table, this step could similarly be fused into the computation of the searched-in argument. Furthermore there are some possible ideas with sorted arguments: both sides can be fused in a selection where the indices are known to be sorted, or a search where both arguments are sorted.
