|
7 | 7 | <div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
|
8 | 8 | <h1 id="implementation-of-fold-and-scan"><a class="header" href="#implementation-of-fold-and-scan">Implementation of Fold and Scan</a></h1>
|
9 | 9 | <p>Folds and scans with some arithmetic primitives like <code><span class='Function'>+</span></code>, <code><span class='Function'>⌈</span></code>, and boolean <code><span class='Function'>≠</span></code> are staples of array programming. Fortunately these cases are also suitable for SIMD implementation. There is also the minor note that it's worth optimizing folds with <code><span class='Function'>⊣</span></code> or <code><span class='Function'>⊢</span></code> that give the first and last element (or cell), and the scan <code><span class='Function'>⊣</span><span class='Modifier'>`</span></code> broadcasts the first cell to the entire array, which has some uses like <code><span class='Function'>⊣</span><span class='Modifier'>`</span><span class='Modifier2'>⊸</span><span class='Function'>≡</span></code> to test if all cells match.</p>
|
| 10 | +<p>When implementing SIMD scans, it's crucial to allow processing of a word or vector to begin before the previous one is completed, to benefit from instruction-level parallelism. Generally a good way to do this is to perform a scan just on that unit, and only then fix it up with a carry. For example, for <code><span class='Function'>+</span><span class='Modifier'>`</span></code>, scan within a vector, then broadcast the last element of the previous vector of sums and add it to the scan.</p> |
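<p>The scan-then-carry structure can be sketched in plain C, with a block of four elements standing in for a vector register. This is an illustration under assumptions, not any implementation's actual code: <code>LANES</code> and <code>plus_scan</code> are hypothetical names, the length is assumed to be a multiple of <code>LANES</code>, and unsigned elements are used so that overflow wraps.</p>

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* hypothetical vector width in elements */

/* In-place +` (prefix sum). n is assumed to be a multiple of LANES. */
void plus_scan(uint32_t *a, size_t n) {
    uint32_t carry = 0;  /* last sum of the previous block */
    for (size_t i = 0; i < n; i += LANES) {
        uint32_t *v = a + i;
        /* Scan within the block with shifts of 1 and 2 elements.
           Nothing here depends on carry, so this work can overlap
           with the fix-up of earlier blocks. */
        for (size_t s = 1; s < LANES; s *= 2)
            for (size_t j = LANES; j-- > s; )
                v[j] += v[j - s];
        /* Fix up: broadcast the previous block's last sum and add it. */
        for (size_t j = 0; j < LANES; j++)
            v[j] += carry;
        carry = v[LANES - 1];
    }
}
```

<p>Only the final add and the update of <code>carry</code> form a chain from one block to the next; the in-block scan is independent work.</p>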
10 | 11 | <p>My talk "Implementing Reduction" (<a href="https://dyalog.tv/Dyalog19/?v=TqmpSP8Knvg">video</a>, <a href="https://www.dyalog.com/uploads/conference/dyalog19/presentations/D09_Implementing_Reduction.zip">slides</a>) quickly covers some ideas about folding, particularly on high-rank arrays. The slides have illustrations of some extra algorithms not discussed in the talk.</p>
|
11 | 12 | <h2 id="associative-arithmetic"><a class="header" href="#associative-arithmetic">Associative arithmetic</a></h2>
|
12 | 13 | <p>The arithmetic operations <code><span class='Function'>+×</span></code> on integers, and <code><span class='Function'>⌈⌊</span></code> on all types, are associative and commutative (and for <code><span class='Value'>•math.</span><span class='Function'>Sum</span></code>, float addition may be considered associative for optimization). This allows for folds and scans to be reordered in a way that's suitable for SIMD evaluation, where without some insight into the operand function they would be inherently sequential. Also, <code><span class='Function'>-</span><span class='Modifier'>´</span></code> can be performed by negating every other value then summing, and monadic <code><span class='Function'>¬</span><span class='Modifier'>´</span></code> is <code><span class='Brace'>{</span><span class='Paren'>(</span><span class='Function'>¬</span><span class='Number'>2</span><span class='Function'>|≠</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>+-</span><span class='Modifier'>´</span><span class='Value'>𝕩</span><span class='Brace'>}</span></code>.</p>
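<p>As a small check of the two identities, here is a sketch in C comparing sequential right folds against the reordered forms. The function names are hypothetical, dyadic <code>¬</code> is taken as <code>1+𝕨-𝕩</code>, and the reference fold for <code>¬</code> assumes at least one element.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Reference -´ : right fold x0 - (x1 - (x2 - ...)), inherently sequential. */
int64_t minus_fold_ref(const int32_t *x, size_t n) {
    int64_t r = 0;
    for (size_t i = n; i-- > 0; ) r = x[i] - r;
    return r;
}

/* Reordered -´ : sum even-indexed values, subtract odd-indexed ones.
   The two accumulators are independent, as SIMD lanes would be. */
int64_t minus_fold_alt(const int32_t *x, size_t n) {
    int64_t even = 0, odd = 0;
    for (size_t i = 0; i + 1 < n; i += 2) { even += x[i]; odd += x[i + 1]; }
    if (n & 1) even += x[n - 1];
    return even - odd;
}

/* Reference ¬´ with dyadic ¬ as 1+w-x; assumes n >= 1. */
int64_t not_fold_ref(const int32_t *x, size_t n) {
    int64_t r = x[n - 1];
    for (size_t i = n - 1; i-- > 0; ) r = 1 + x[i] - r;
    return r;
}

/* Reordered ¬´ : (¬2|≠x) + -´x, that is, add 1 when the length is even. */
int64_t not_fold_alt(const int32_t *x, size_t n) {
    return (int64_t)(n % 2 == 0) + minus_fold_alt(x, n);
}
```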
|
@@ -49,7 +50,17 @@ <h2 id="booleans"><a class="header" href="#booleans">Booleans</a></h2>
|
49 | 50 | </tr>
|
50 | 51 | </tbody>
|
51 | 52 | </table>
|
52 |
| -<p>Boolean scans are more varied. For <code><span class='Function'>∨</span></code>, the result switches from all <code><span class='Number'>0</span></code> to all <code><span class='Number'>1</span></code> after the first <code><span class='Number'>1</span></code>, and the other way around for <code><span class='Function'>∧</span></code>. For <code><span class='Function'>≠</span></code>, the associative optimization gives a word-at-a-time algorithm with power-of-two shifts, and other possibilities with architecture support are <a href="#xor-scan">discussed below</a>. The scan <code><span class='Function'><</span><span class='Modifier'>`</span></code> turns off every other 1 in groups of 1s. It's used in simdjson for backslash escaping, and they <a href="https://github.com/simdjson/simdjson/blob/ac78c62/src/generic/stage1/json_escape_scanner.h#L96">describe in detail</a> a method that uses subtraction for carrying. And <code><span class='Function'>></span><span class='Modifier'>`</span></code> flips to all 0 at the first bit if it's a 0 or the <em>second</em> 1 bit otherwise. <code><span class='Function'>≤</span><span class='Modifier'>`</span></code> is <code><span class='Function'><</span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>, and <code><span class='Function'>≥</span><span class='Modifier'>`</span></code> is <code><span class='Function'>></span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>.</p> |
| 53 | +<p>Boolean scans are more varied. For <code><span class='Function'>∨</span></code>, the result switches from all <code><span class='Number'>0</span></code> to all <code><span class='Number'>1</span></code> after the first <code><span class='Number'>1</span></code>, and the other way around for <code><span class='Function'>∧</span></code>. For <code><span class='Function'>≠</span></code>, the associative optimization gives a word-at-a-time algorithm with power-of-two shifts, and other possibilities with architecture support are <a href="#xor-scan">discussed below</a>. The scan <code><span class='Function'><</span><span class='Modifier'>`</span></code> turns off every other 1 in groups of 1s. It's used in simdjson for backslash escaping, and they <a href="https://github.com/simdjson/simdjson/blob/ac78c62/src/generic/stage1/json_escape_scanner.h#L96">describe in detail</a> a method that uses subtraction for carrying. And <code><span class='Function'>></span><span class='Modifier'>`</span></code> flips to all 0 at the first bit if it's a 0 or the <em>second</em> 1 bit otherwise; <code><span class='Function'>∧</span><span class='Modifier'>`</span><span class='Function'>¬</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Number'>1</span><span class='Modifier2'>⊸</span><span class='Function'>↓</span><span class='Paren'>)</span></code> is one implementation. <code><span class='Function'>≤</span><span class='Modifier'>`</span></code> is <code><span class='Function'><</span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>, and <code><span class='Function'>≥</span><span class='Modifier'>`</span></code> is <code><span class='Function'>></span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>.</p> |
| 54 | +<p>Simple sequences for a few scans are given below. <code><span class='Function'>∧</span><span class='Modifier'>`</span></code> and <code><span class='Function'>∨</span><span class='Modifier'>`</span></code> on lists only call for one evaluation where the first 0 or 1 bit is found, but also have nice segmented forms that can be used for a row-scan like <code><span class='Function'>∧</span><span class='Modifier'>`˘</span></code>. Here <code><span class='Value'>even</span></code> is the even bits <code><span class='Number'>0x555</span><span class='Value'>…</span></code>, and <code><span class='Value'>odd</span></code> is the odd bits <code><span class='Value'>even</span><span class='Function'><<</span><span class='Number'>1</span></code> or <code><span class='Number'>0xAAA</span><span class='Value'>…</span></code>.</p> |
| 55 | +<table> |
| 56 | +<tr><th>Scan</th><th>C code (word)</th><th>C code (segment starts <code><span class='Value'>m</span></code>)</th></tr> |
| 57 | +<tr><td align="center"><code><span class='Function'>∧</span><span class='Modifier'>`</span></code></td><td><code><span class='Value'>x</span> <span class='Value'>&~</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'>+</span><span class='Number'>1</span><span class='Paren'>)</span></code></td><td><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Paren'>(</span><span class='Value'>x</span> <span class='Value'>&~</span> <span class='Value'>m</span><span class='Paren'>)</span> <span class='Function'>>></span> <span class='Number'>1</span><span class='Head'>;</span> <span class='Paren'>(</span><span class='Value'>x</span> <span class='Function'>-</span> <span class='Value'>t</span><span class='Paren'>)</span> <span class='Value'>^</span> <span class='Value'>t</span></code></td></tr> |
| 58 | +<tr><td align="center"><code><span class='Function'>∨</span><span class='Modifier'>`</span></code></td><td><code><span class='Value'>x</span> <span class='Function'>|</span> <span class='Function'>-</span><span class='Value'>x</span></code> </td><td><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Paren'>(</span><span class='Value'>x</span> <span class='Function'>|</span> <span class='Value'>m</span><span class='Paren'>)</span> <span class='Function'>>></span> <span class='Number'>1</span><span class='Head'>;</span> <span class='Paren'>(</span><span class='Value'>t</span> <span class='Function'>-</span> <span class='Value'>x</span><span class='Paren'>)</span> <span class='Value'>^</span> <span class='Value'>t</span></code></td></tr> |
| 59 | +<tr><td align="center"><code><span class='Function'><</span><span class='Modifier'>`</span></code></td><td colspan=2><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Value'>odd</span> <span class='Function'>|</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'><<</span><span class='Number'>1</span><span class='Paren'>)</span><span class='Head'>;</span> <span class='Value'>x</span> <span class='Value'>&</span> <span class='Paren'>(</span><span class='Value'>odd</span> <span class='Value'>^</span> <span class='Paren'>(</span><span class='Value'>t</span> <span class='Function'>-</span> <span class='Value'>x</span><span class='Paren'>))</span></code></td></tr> |
| 60 | +<tr><td align="center"><code><span class='Function'>≤</span><span class='Modifier'>`</span></code></td><td colspan=2><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Value'>even</span> <span class='Value'>&</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'><<</span><span class='Number'>1</span><span class='Paren'>)</span><span class='Head'>;</span> <span class='Value'>x</span> <span class='Function'>|</span> <span class='Paren'>(</span><span class='Value'>odd</span> <span class='Value'>^</span> <span class='Paren'>(</span><span class='Value'>t</span> <span class='Function'>-</span> <span class='Value'>x</span><span class='Paren'>))</span></code></td></tr> |
| 61 | +</table> |
| 62 | + |
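<p>Word-level sequences like these are easy to get subtly wrong, so it helps to compare them with a bit-at-a-time reference. The sketch below assumes 64-bit words with the first element in bit 0 (least significant); <code>scan_ref</code> and the other function names are hypothetical. Here and-scan keeps the trailing 1s of <code>x</code>, and or-scan sets everything from the lowest 1 bit upward.</p>

```c
#include <stdint.h>

typedef uint64_t u64;
#define EVEN 0x5555555555555555ull
#define ODD  0xAAAAAAAAAAAAAAAAull

/* Whole-word scans */
u64 and_scan(u64 x) { return x & ~(x + 1); }   /* trailing 1s of x */
u64 or_scan(u64 x)  { return x | -x; }         /* lowest 1 bit and up */
u64 lt_scan(u64 x)  { u64 t = ODD  | (x << 1); return x & (ODD ^ (t - x)); }
u64 le_scan(u64 x)  { u64 t = EVEN & (x << 1); return x | (ODD ^ (t - x)); }

/* Segmented scans; m has a 1 at the first bit of each segment */
u64 and_scan_seg(u64 x, u64 m) { u64 t = (x & ~m) >> 1; return (x - t) ^ t; }
u64 or_scan_seg(u64 x, u64 m)  { u64 t = (x | m) >> 1;  return (t - x) ^ t; }

/* Combining functions for the reference: f(previous result, current bit) */
int f_and(int p, int c) { return p & c; }
int f_or(int p, int c)  { return p | c; }
int f_lt(int p, int c)  { return p < c; }
int f_le(int p, int c)  { return p <= c; }

/* Reference scan, one bit at a time. A segment starts at bit 0 and
   wherever m has a 1; the first bit of a segment is copied as is. */
u64 scan_ref(u64 x, u64 m, int (*f)(int, int)) {
    u64 r = 0;
    int prev = 0;
    for (int i = 0; i < 64; i++) {
        int cur = (int)((x >> i) & 1);
        prev = (i == 0 || ((m >> i) & 1)) ? cur : f(prev, cur);
        r |= (u64)prev << i;
    }
    return r;
}
```

<p>For example, <code>and_scan_seg(x, 0x0101010101010101)</code> should agree with <code>scan_ref(x, 0x0101010101010101, f_and)</code>, an and-scan on each row of length 8.</p>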
| 63 | +<p>Handling carries in lists for <code><span class='Function'><</span><span class='Modifier'>`</span></code> and <code><span class='Function'>≤</span><span class='Modifier'>`</span></code> is possible by modifying <code><span class='Value'>x</span><span class='Function'><<</span><span class='Number'>1</span></code>, but for shorter dependency chains it's better to modify the result. For example, for <code><span class='Function'><</span><span class='Modifier'>`</span></code>, a carry of 1 means all the result bits corresponding to trailing 1s in <code><span class='Value'>x</span></code> need to be flipped. If the result from the previous word is <code><span class='Value'>c</span></code> with type <code><span class='Value'>u64</span></code>, the result should be xor-ed with <code><span class='Function'>-</span><span class='Paren'>(</span><span class='Value'>c</span><span class='Function'>>></span><span class='Number'>63</span><span class='Paren'>)</span> <span class='Value'>&</span> <span class='Value'>x</span> <span class='Value'>&~</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'>+</span><span class='Number'>1</span><span class='Paren'>)</span></code>, where the shared <code><span class='Value'>x</span> <span class='Value'>&</span></code> can be factored out. Note that for <code><span class='Function'>≤</span><span class='Modifier'>`</span></code>, the "passive" bit is 1 and so -1 is the right initial carry.</p> |
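<p>A minimal sketch of the result-fixing approach for the less-than scan over a list of words, with the shared <code>x &amp;</code> factored out. The name <code>lt_scan_words</code> is hypothetical, and bit 0 of each word is assumed to come first.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;
#define ODD 0xAAAAAAAAAAAAAAAAull

/* Less-than scan over n words of bits, writing the result to r. */
void lt_scan_words(const u64 *x, u64 *r, size_t n) {
    u64 c = 0;  /* previous result word; its top bit is the carry */
    for (size_t i = 0; i < n; i++) {
        u64 w = x[i];
        /* Scan of this word alone, before the final "w &".
           None of this depends on c. */
        u64 t = ODD | (w << 1);
        u64 s = ODD ^ (t - w);
        /* With a carry of 1, flip the result under the trailing 1s of w.
           ~(w + 1) covers them once the shared "w &" is applied. */
        u64 fix = -(c >> 63) & ~(w + 1);
        c = w & (s ^ fix);
        r[i] = c;
    }
}
```

<p>The only cross-word dependency is the short chain from <code>c</code> through the shift, negate, and, xor, and final and.</p>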
53 | 64 | <h3 id="xor-scan"><a class="header" href="#xor-scan">Xor scan</a></h3>
|
54 | 65 | <p>The scan <code><span class='Function'>≠</span><span class='Modifier'>`</span></code> has the ordinary implementation using power-of-two shifts, covered in Hacker's Delight section 5-2, "Parity". Broadcast the carry to the entire word with a signed shift and xor into the next word after scanning it.</p>
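<p>A sketch of the shift-based xor scan with carry, assuming 64-bit words with the first element in bit 0. Function names are hypothetical. The carry is written as <code>-(s &gt;&gt; 63)</code>, which gives the same all-0s or all-1s word as an arithmetic right shift by 63 without relying on how the compiler shifts negative values.</p>

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Xor scan within one word: after the step with shift k, each bit holds
   the xor of itself and the 2k-1 bits below it. */
u64 xor_scan_word(u64 x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

/* Xor scan over n words. Each word is scanned on its own, then the
   broadcast carry from the previous result is xor-ed in. */
void xor_scan_words(const u64 *x, u64 *r, size_t n) {
    u64 carry = 0;  /* all 0s or all 1s */
    for (size_t i = 0; i < n; i++) {
        u64 s = xor_scan_word(x[i]) ^ carry;
        r[i] = s;
        carry = -(s >> 63);  /* broadcast the top result bit */
    }
}
```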
|
55 | 66 | <p>If available, carry-less multiply (clmul) can also be used to scan a word, by multiplying by the all-1s word, a trick explained <a href="https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq">here</a>. The 128-bit result has an inclusive scan in the low 64 bits and a reverse exclusive scan in the high 64 bits (the top bit is always 0). This is useful because xor-ing high with low gives a word of all carry bits. And the clmul method also works for high-rank <code><span class='Function'>≠</span><span class='Modifier'>`</span></code> if the row length <code><span class='Value'>l</span></code> is a divisor of 64, by choosing a mask where every <code><span class='Value'>l</span></code>-th bit is set. Then the high-low trick is much more important because shifting doesn't give a valid carry! For strides of 8 or more, this method might not be faster than AVX2 using element-level operations, but hey, it's free.</p>
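<p>To see the high-low trick without depending on a particular instruction set, the sketch below uses a slow, portable bit-by-bit carry-less multiply as a stand-in for the hardware instruction. <code>clmul64</code>, <code>clmul_result</code>, and <code>xor_scan_clmul</code> are hypothetical names; real code would call the platform's clmul intrinsic instead.</p>

```c
#include <stdint.h>

typedef uint64_t u64;
typedef struct { u64 lo, hi; } clmul_result;

/* Portable 64x64 -> 128 bit carry-less multiply (shift and xor). */
clmul_result clmul64(u64 a, u64 b) {
    clmul_result r = {0, 0};
    for (int i = 0; i < 64; i++) {
        if ((a >> i) & 1) {
            r.lo ^= b << i;
            if (i) r.hi ^= b >> (64 - i);  /* skip i == 0: shift by 64 is undefined */
        }
    }
    return r;
}

/* Xor scan of one word by multiplying with all 1s. The low half is the
   inclusive scan. Xor-ing the halves gives the parity of the whole word
   in every bit, that is, the carry for the next word already broadcast. */
u64 xor_scan_clmul(u64 x, u64 *carry) {
    clmul_result p = clmul64(x, ~(u64)0);
    *carry = p.lo ^ p.hi;
    return p.lo;
}
```

<p>For a list, the returned scan would still be xor-ed with the carry from the previous word, and that carry xor-ed with <code>*carry</code> to get the next one.</p>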
|
|