Introduction to page on loop fusion

mlochbaum · mlochbaum · commit 3cf0e051532b · 2025-01-27T10:35:50.000-05:00
diff --git a/docs/implementation/compile/fusion.html b/docs/implementation/compile/fusion.html
@@ -0,0 +1,12 @@
+<head>
+  <meta charset="utf-8">
+  <link href="../../favicon.ico" rel="shortcut icon" type="image/x-icon"/>
+  <link href="../../style.css" rel="stylesheet"/>
+  <title>BQN: Loop fusion in array languages</title>
+</head>
+<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">compile</a></div>
+<h1 id="loop-fusion-in-array-languages"><a class="header" href="#loop-fusion-in-array-languages">Loop fusion in array languages</a></h1>
+<p>Interpreted array languages have a major problem. Let's say you evaluate some arithmetic on a few arrays. Perhaps the first operation adds two arrays. It will loop over them, ideally adding numbers a vector register at a time, and write the results to an array. Maybe next it will check if the result is more than 10. So it'll read vectors from the result, compare to 10, pack to bit booleans, and write to another array. Each primitive has been implemented well, but the combination is already far from optimal! The first result array isn't needed: it would be much better to compare each added vector to 10 right when it's produced. The extra store and load (and index arithmetic) are instructions that we don't need, and by using extra memory we can also create cache pressure that slows down the program even more.</p>
+<p>Scalar languages don't have this problem! The programmer just writes the addition and comparison in a loop, the compiler compiles it, and every comparison naturally follows the corresponding addition. More modern languages might prefer approaches like iterators that abstract away the looping but still have the semantics of a fused loop. But an iterator call, let's say <code><span class='Value'>zipwith</span><span class='Paren'>(</span><span class='Function'>+</span><span class='Separator'>,</span> <span class='Value'>a.iter</span><span class='Paren'>()</span><span class='Separator'>,</span> <span class='Value'>b.iter</span><span class='Paren'>())</span><span class='Value'>.map</span><span class='Paren'>(</span><span class='Function'>&gt;</span><span class='Number'>10</span><span class='Paren'>)</span></code> to make up some syntax, has a pretty obvious array equivalent, and if the functions are pure the different semantics don't matter! This has led to several compiled array languages like <a href="https://www.snakeisland.com/apexup.htm">APEX</a> that work on the principle of re-interpreting the scalar parts of array operations in a way that fuses naturally.</p>
+<p>Scalar compilation gives up many advantages inherent to array programming, a topic I discussed more broadly <a href="intro.html">here</a>. The obvious complaint is that you lose the vector instructions, but that's easy enough to dismiss. Any decent C compiler can auto-vectorize a loop, and so could an array compiler. But arithmetic is rarely the bottleneck, so let's say that the comparison's result will be used to filter a third array, that is, the expression is now <code><span class='Paren'>(</span><span class='Number'>10</span><span class='Function'>&lt;</span><span class='Value'>a</span><span class='Function'>+</span><span class='Value'>b</span><span class='Paren'>)</span><span class='Function'>/</span><span class='Value'>c</span></code>. Filtering doesn't auto-vectorize! Two vectors of input will produce an output with a different, unknown size, which is enough to throw off the analysis. At least the C compilers I've dealt with will fall back to producing completely scalar code. Depending on type, this can actually be slower than CBQN's primitives, which are un-fused but use SIMD.</p>
+<p>This example doesn't entirely reveal the extent of the problem (for one thing, writing filter's result a partial vector at a time isn't bad; the real difficulty would be fusing it with more arithmetic later on). But hopefully it gives a sense of the issues that arise. I believe fusing operations without losing CBQN's powerful single-primitive operations will require a system that considers fusing not just at the scalar level but at several layers, from a single vector up to larger blocks of memory.</p>
diff --git a/implementation/compile/fusion.md b/implementation/compile/fusion.md
@@ -0,0 +1,11 @@
+*View this file with results and syntax highlighting [here](https://mlochbaum.github.io/BQN/implementation/compile/fusion.html).*
+
+# Loop fusion in array languages
+
+Interpreted array languages have a major problem. Let's say you evaluate some arithmetic on a few arrays. Perhaps the first operation adds two arrays. It will loop over them, ideally adding numbers a vector register at a time, and write the results to an array. Maybe next it will check if the result is more than 10. So it'll read vectors from the result, compare to 10, pack to bit booleans, and write to another array. Each primitive has been implemented well, but the combination is already far from optimal! The first result array isn't needed: it would be much better to compare each added vector to 10 right when it's produced. The extra store and load (and index arithmetic) are instructions that we don't need, and by using extra memory we can also create cache pressure that slows down the program even more.
+
+Scalar languages don't have this problem! The programmer just writes the addition and comparison in a loop, the compiler compiles it, and every comparison naturally follows the corresponding addition. More modern languages might prefer approaches like iterators that abstract away the looping but still have the semantics of a fused loop. But an iterator call, let's say `zipwith(+, a.iter(), b.iter()).map(>10)` to make up some syntax, has a pretty obvious array equivalent, and if the functions are pure the different semantics don't matter! This has led to several compiled array languages like [APEX](https://www.snakeisland.com/apexup.htm) that work on the principle of re-interpreting the scalar parts of array operations in a way that fuses naturally.
+
+Scalar compilation gives up many advantages inherent to array programming, a topic I discussed more broadly [here](intro.md). The obvious complaint is that you lose the vector instructions, but that's easy enough to dismiss. Any decent C compiler can auto-vectorize a loop, and so could an array compiler. But arithmetic is rarely the bottleneck, so let's say that the comparison's result will be used to filter a third array, that is, the expression is now `(10<a+b)/c`. Filtering doesn't auto-vectorize! Two vectors of input will produce an output with a different, unknown size, which is enough to throw off the analysis. At least the C compilers I've dealt with will fall back to producing completely scalar code. Depending on type, this can actually be slower than CBQN's primitives, which are un-fused but use SIMD.
+
+This example doesn't entirely reveal the extent of the problem (for one thing, writing filter's result a partial vector at a time isn't bad; the real difficulty would be fusing it with more arithmetic later on). But hopefully it gives a sense of the issues that arise. I believe fusing operations without losing CBQN's powerful single-primitive operations will require a system that considers fusing not just at the scalar level but at several layers, from a single vector up to larger blocks of memory.
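
To make the add-then-compare example from the page concrete, here is a rough C sketch of the un-fused and fused forms, plus the scalar-style filter. It is only an illustration under stated assumptions, not how CBQN or any particular compiler implements these operations: the function names are made up, elements are assumed to be 32-bit integers small enough that the sum doesn't overflow, and booleans are stored one per byte rather than packed to bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Un-fused: one loop per primitive, as an interpreter would run them.
 * The caller must supply `sum`, a full temporary array that exists
 * only to carry values from the first loop to the second. */
void add_then_compare(const int32_t *a, const int32_t *b,
                      int32_t *sum, uint8_t *gt, size_t n) {
    for (size_t i = 0; i < n; i++) sum[i] = a[i] + b[i]; /* store each sum */
    for (size_t i = 0; i < n; i++) gt[i] = sum[i] > 10;  /* load it again  */
}

/* Fused: each sum is compared as soon as it is produced, so the
 * temporary array and its extra store and load disappear. */
void add_compare_fused(const int32_t *a, const int32_t *b,
                       uint8_t *gt, size_t n) {
    for (size_t i = 0; i < n; i++) gt[i] = (a[i] + b[i]) > 10;
}

/* Fused filter: keep c[i] wherever a[i] + b[i] is more than 10.
 * The output position k depends on every earlier comparison, so the
 * number of results per input vector is not known ahead of time.
 * `out` must have room for n elements. Returns the result count. */
size_t filter_fused(const int32_t *a, const int32_t *b,
                    const int32_t *c, int32_t *out, size_t n) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] + b[i] > 10) out[k++] = c[i];
    }
    return k;
}
```

The first two functions compute the same result, and loops of that shape are the kind a C compiler can usually auto-vectorize. The data-dependent index `k` in `filter_fused` is what makes the last loop harder to vectorize automatically.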