Commit 8d259dd

SWAR code for boolean scans ∧∨<≤
1 parent 83b2a30 commit 8d259dd

2 files changed, +27 -2 lines changed

docs/implementation/primitive/fold.html

+12-1
@@ -7,6 +7,7 @@
77
<div class="nav">(<a href="https://github.com/mlochbaum/BQN">github</a>) / <a href="../../index.html">BQN</a> / <a href="../index.html">implementation</a> / <a href="index.html">primitive</a></div>
88
<h1 id="implementation-of-fold-and-scan"><a class="header" href="#implementation-of-fold-and-scan">Implementation of Fold and Scan</a></h1>
99
<p>Folds and scans with some arithmetic primitives like <code><span class='Function'>+</span></code>, <code><span class='Function'></span></code>, and boolean <code><span class='Function'></span></code> are staples of array programming. Fortunately these cases are also suitable for SIMD implementation. There is also the minor note that it's worth optimizing folds with <code><span class='Function'>⊣</span></code> or <code><span class='Function'>⊢</span></code> that give the first and last element (or cell), and the scan <code><span class='Function'>⊣</span><span class='Modifier'>`</span></code> broadcasts the first cell to the entire array, which has some uses like <code><span class='Function'>⊣</span><span class='Modifier'>`</span><span class='Modifier2'>⊸</span><span class='Function'>≡</span></code> to test if all cells match.</p>
10+
<p>When implementing SIMD scans, it's crucial to allow processing of a word or vector to begin before the previous one is completed, to benefit from instruction-level parallelism. Generally a good way to do this is to perform a scan just on that unit, and only then fix it up with a carry. For example, for <code><span class='Function'>+</span><span class='Modifier'>`</span></code>, scan within a vector, then broadcast the last element of the previous vector of sums and add it to the scan.</p>
1011
<p>My talk &quot;Implementing Reduction&quot; (<a href="https://dyalog.tv/Dyalog19/?v=TqmpSP8Knvg">video</a>, <a href="https://www.dyalog.com/uploads/conference/dyalog19/presentations/D09_Implementing_Reduction.zip">slides</a>) quickly covers some ideas about folding, particularly on high-rank arrays. The slides have illustrations of some extra algorithms not discussed in the talk.</p>
1112
<h2 id="associative-arithmetic"><a class="header" href="#associative-arithmetic">Associative arithmetic</a></h2>
1213
<p>The arithmetic operations <code><span class='Function'></span></code> on integers, and <code><span class='Function'>⌈⌊</span></code> on all types, are associative and commutative (and for <code><span class='Value'>•math.</span><span class='Function'>Sum</span></code>, float addition may be considered commutative for optimization). This allows for folds and scans to be reordered in a way that's suitable for SIMD evaluation, where without some insight into the operand function they would be inherently sequential. Also, <code><span class='Function'>-</span><span class='Modifier'>´</span></code> can be performed by negating every other value then summing, and monadic <code><span class='Function'>¬</span><span class='Modifier'>´</span></code> is <code><span class='Brace'>{</span><span class='Paren'>(</span><span class='Function'>¬</span><span class='Number'>2</span><span class='Function'>|≠</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>+-</span><span class='Modifier'>´</span><span class='Value'>𝕩</span><span class='Brace'>}</span></code>.</p>
@@ -49,7 +50,17 @@ <h2 id="booleans"><a class="header" href="#booleans">Booleans</a></h2>
4950
</tr>
5051
</tbody>
5152
</table>
52-
<p>Boolean scans are more varied. For <code><span class='Function'>∨</span></code>, the result switches from all <code><span class='Number'>0</span></code> to all <code><span class='Number'>1</span></code> after the first <code><span class='Number'>1</span></code>, and the other way around for <code><span class='Function'>∧</span></code>. For <code><span class='Function'>≠</span></code>, the associative optimization gives a word-at-a-time algorithm with power-of-two shifts, and other possibilities with architecture support are <a href="#xor-scan">discussed below</a>. The scan <code><span class='Function'>&lt;</span><span class='Modifier'>`</span></code> turns off every other 1 in groups of 1s. It's used in simdjson for backslash escaping, and they <a href="https://github.com/simdjson/simdjson/blob/ac78c62/src/generic/stage1/json_escape_scanner.h#L96">describe in detail</a> a method that uses subtraction for carrying. And <code><span class='Function'>&gt;</span><span class='Modifier'>`</span></code> flips to all 0 at the first bit if it's a 0 or the <em>second</em> 1 bit otherwise. <code><span class='Function'>≤</span><span class='Modifier'>`</span></code> is <code><span class='Function'>&lt;</span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>, and <code><span class='Function'>≥</span><span class='Modifier'>`</span></code> is <code><span class='Function'>&gt;</span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>.</p>
53+
<p>Boolean scans are more varied. For <code><span class='Function'>∨</span></code>, the result switches from all <code><span class='Number'>0</span></code> to all <code><span class='Number'>1</span></code> after the first <code><span class='Number'>1</span></code>, and the other way around for <code><span class='Function'>∧</span></code>. For <code><span class='Function'>≠</span></code>, the associative optimization gives a word-at-a-time algorithm with power-of-two shifts, and other possibilities with architecture support are <a href="#xor-scan">discussed below</a>. The scan <code><span class='Function'>&lt;</span><span class='Modifier'>`</span></code> turns off every other 1 in groups of 1s. It's used in simdjson for backslash escaping, and they <a href="https://github.com/simdjson/simdjson/blob/ac78c62/src/generic/stage1/json_escape_scanner.h#L96">describe in detail</a> a method that uses subtraction for carrying. And <code><span class='Function'>&gt;</span><span class='Modifier'>`</span></code> flips to all 0 at the first bit if it's a 0 or the <em>second</em> 1 bit otherwise; <code><span class='Function'>∧</span><span class='Modifier'>`</span><span class='Function'>¬</span><span class='Modifier2'>⌾</span><span class='Paren'>(</span><span class='Number'>1</span><span class='Modifier2'>⊸</span><span class='Function'>↓</span><span class='Paren'>)</span></code> is one implementation. <code><span class='Function'>≤</span><span class='Modifier'>`</span></code> is <code><span class='Function'>&lt;</span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>, and <code><span class='Function'>≥</span><span class='Modifier'>`</span></code> is <code><span class='Function'>&gt;</span><span class='Modifier'>`</span><span class='Modifier2'>⌾</span><span class='Function'>¬</span></code>.</p>
54+
<p>Simple sequences for a few scans are given below. <code><span class='Function'>∧</span><span class='Modifier'>`</span></code> and <code><span class='Function'>∨</span><span class='Modifier'>`</span></code> on lists only call for one evaluation where the first 0 or 1 bit is found, but also have nice segmented forms that can be used for a row-scan like <code><span class='Function'>∧</span><span class='Modifier'>`˘</span></code>. Here <code><span class='Value'>even</span></code> is the even bits <code><span class='Number'>0x555</span><span class='Value'>…</span></code>, and <code><span class='Value'>odd</span></code> is the odd bits <code><span class='Value'>even</span><span class='Function'>&lt;&lt;</span><span class='Number'>1</span></code> or <code><span class='Number'>0xAAA</span><span class='Value'>…</span></code>.</p>
55+
<table>
56+
<tr><th>Scan</th><th>C code (word)</th><th>C code (segment starts <code><span class='Value'>m</span></code>)</th></tr>
57+
<tr><td align="center"><code><span class='Function'>∧</span><span class='Modifier'>`</span></code></td><td><code><span class='Value'>x</span> <span class='Value'>&~</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'>+</span><span class='Number'>1</span><span class='Paren'>)</span></code></td><td><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Paren'>(</span><span class='Value'>x</span> <span class='Value'>&~</span> <span class='Value'>m</span><span class='Paren'>)</span> <span class='Function'>>></span> <span class='Number'>1</span><span class='Head'>;</span> <span class='Paren'>(</span><span class='Value'>x</span> <span class='Function'>-</span> <span class='Value'>t</span><span class='Paren'>)</span> <span class='Value'>^</span> <span class='Value'>t</span></code></td></tr>
58+
<tr><td align="center"><code><span class='Function'>∨</span><span class='Modifier'>`</span></code></td><td><code><span class='Value'>x</span> <span class='Function'>|</span> <span class='Function'>-</span><span class='Value'>x</span></code> </td><td><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Paren'>(</span><span class='Value'>x</span> <span class='Function'>|</span> <span class='Value'>m</span><span class='Paren'>)</span> <span class='Function'>>></span> <span class='Number'>1</span><span class='Head'>;</span> <span class='Paren'>(</span><span class='Value'>t</span> <span class='Function'>-</span> <span class='Value'>x</span><span class='Paren'>)</span> <span class='Value'>^</span> <span class='Value'>t</span></code></td></tr>
59+
<tr><td align="center"><code><span class='Function'><</span><span class='Modifier'>`</span></code></td><td colspan=2><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Value'>odd</span> <span class='Function'>|</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'><<</span><span class='Number'>1</span><span class='Paren'>)</span><span class='Head'>;</span> <span class='Value'>x</span> <span class='Value'>&</span> <span class='Paren'>(</span><span class='Value'>odd</span> <span class='Value'>^</span> <span class='Paren'>(</span><span class='Value'>t</span> <span class='Function'>-</span> <span class='Value'>x</span><span class='Paren'>))</span></code></td></tr>
60+
<tr><td align="center"><code><span class='Function'>≤</span><span class='Modifier'>`</span></code></td><td colspan=2><code><span class='Value'>t</span> <span class='Function'>=</span> <span class='Value'>even</span> <span class='Value'>&</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'><<</span><span class='Number'>1</span><span class='Paren'>)</span><span class='Head'>;</span> <span class='Value'>x</span> <span class='Function'>|</span> <span class='Paren'>(</span><span class='Value'>odd</span> <span class='Value'>^</span> <span class='Paren'>(</span><span class='Value'>t</span> <span class='Function'>-</span> <span class='Value'>x</span><span class='Paren'>))</span></code></td></tr>
61+
</table>
62+
63+
<p>Handling carries in lists for <code><span class='Function'>&lt;</span><span class='Modifier'>`</span></code> and <code><span class='Function'>≤</span><span class='Modifier'>`</span></code> is possible by modifying <code><span class='Value'>x</span><span class='Function'>&lt;&lt;</span><span class='Number'>1</span></code>, but for shorter dependency chains you modify the result. For example, for <code><span class='Function'>&lt;</span><span class='Modifier'>`</span></code>, a carry of 1 means all the result bits corresponding to trailing 1s in <code><span class='Value'>x</span></code> need to be flipped. If the result from the previous word is <code><span class='Value'>c</span></code> with type <code><span class='Value'>u64</span></code>, the result should be xor-ed with <code><span class='Function'>-</span><span class='Paren'>(</span><span class='Value'>c</span><span class='Function'>&gt;&gt;</span><span class='Number'>63</span><span class='Paren'>)</span> <span class='Value'>&amp;</span> <span class='Value'>x</span> <span class='Value'>&amp;~</span> <span class='Paren'>(</span><span class='Value'>x</span><span class='Function'>+</span><span class='Number'>1</span><span class='Paren'>)</span></code>, where the shared <code><span class='Value'>x</span> <span class='Value'>&amp;</span></code> can be factored out. Note that for <code><span class='Function'>≤</span><span class='Modifier'>`</span></code>, the &quot;passive&quot; bit is 1 and so -1 is the right initial carry.</p>
5364
<h3 id="xor-scan"><a class="header" href="#xor-scan">Xor scan</a></h3>
5465
<p>The scan <code><span class='Function'>≠</span><span class='Modifier'>`</span></code> has the ordinary implementation using power-of-two shifts, covered in Hacker's Delight section 5-2, &quot;Parity&quot;. Broadcast the carry to the entire word with a signed shift and xor into the next word after scanning it.</p>
5566
<p>If available, carry-less multiply (clmul) can also be used to scan a word, by multiplying by the all-1s word, a trick explained <a href="https://branchfree.org/2019/03/06/code-fragment-finding-quote-pairs-with-carry-less-multiply-pclmulqdq">here</a>. The 128-bit result has an inclusive scan in the low 64 bits and a reverse exclusive scan in the high 64 bits (the top bit is always 0). This is useful because xor-ing high with low gives a word of all carry bits. And the clmul method also works for high-rank <code><span class='Function'></span><span class='Modifier'>`</span></code> if the row length <code><span class='Value'>l</span></code> is a divisor of 64, by choosing a mask where every <code><span class='Value'>l</span></code>-th bit is set. Then the high-low trick is much more important because shifting doesn't give a valid carry! For strides of 8 or more, this method might not be faster than AVX2 using element-level operations, but hey, it's free.</p>

implementation/primitive/fold.md

+15-1
@@ -4,6 +4,8 @@
44

55
Folds and scans with some arithmetic primitives like `+`, ``, and boolean `` are staples of array programming. Fortunately these cases are also suitable for SIMD implementation. There is also the minor note that it's worth optimizing folds with `⊣` or `⊢` that give the first and last element (or cell), and the scan `` ⊣` `` broadcasts the first cell to the entire array, which has some uses like ``⊣`⊸≡`` to test if all cells match.
66

7+
When implementing SIMD scans, it's crucial to allow processing of a word or vector to begin before the previous one is completed, to benefit from instruction-level parallelism. Generally a good way to do this is to perform a scan just on that unit, and only then fix it up with a carry. For example, for `` +` ``, scan within a vector, then broadcast the last element of the previous vector of sums and add it to the scan.
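
As a rough illustration of the scan-then-carry structure described above, the following C sketch scans each block on its own and only afterwards adds the carry from the previous block. The block width `LANES`, the name `plus_scan`, and the use of plain loops in place of real SIMD instructions are all assumptions made for this example; they are not taken from any actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };  /* hypothetical vector width, in elements */

/* Inclusive +-scan of x[0..n) into r[0..n), wrapping modulo 2^32. */
void plus_scan(const uint32_t *x, uint32_t *r, size_t n) {
    assert(n == 0 || (x != NULL && r != NULL));
    uint32_t carry = 0;  /* last sum of the previous block */
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        uint32_t v[LANES];
        for (int j = 0; j < LANES; j++) v[j] = x[i + j];
        /* Scan inside the block only; this part never reads carry. */
        for (int s = 1; s < LANES; s *= 2)
            for (int j = LANES - 1; j >= s; j--) v[j] += v[j - s];
        /* Fix up: add the broadcast carry to every lane. */
        for (int j = 0; j < LANES; j++) r[i + j] = v[j] + carry;
        carry = r[i + LANES - 1];
    }
    /* Scalar tail for the last n % LANES elements. */
    for (; i < n; i++) { carry += x[i]; r[i] = carry; }
}
```

Since the in-block scan does not depend on `carry`, work on one block can overlap with the fix-up of the previous one; only a single addition per block sits on the loop-carried dependency chain.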
8+
79
My talk "Implementing Reduction" ([video](https://dyalog.tv/Dyalog19/?v=TqmpSP8Knvg), [slides](https://www.dyalog.com/uploads/conference/dyalog19/presentations/D09_Implementing_Reduction.zip)) quickly covers some ideas about folding, particularly on high-rank arrays. The slides have illustrations of some extra algorithms not discussed in the talk.
810

911
## Associative arithmetic
@@ -29,7 +31,19 @@ Other folds `∧∨<>≤≥` can be shortcut: they depend only on the first inst
2931
| `` | `2\|𝕩⊐0`
3032
| `≥´` | `¬2\|𝕩⊐1`
3133

32-
Boolean scans are more varied. For `∨`, the result switches from all `0` to all `1` after the first `1`, and the other way around for `∧`. For `≠`, the associative optimization gives a word-at-a-time algorithm with power-of-two shifts, and other possibilities with architecture support are [discussed below](#xor-scan). The scan `` <` `` turns off every other 1 in groups of 1s. It's used in simdjson for backslash escaping, and they [describe in detail](https://github.com/simdjson/simdjson/blob/ac78c62/src/generic/stage1/json_escape_scanner.h#L96) a method that uses subtraction for carrying. And `` >` `` flips to all 0 at the first bit if it's a 0 or the *second* 1 bit otherwise. `` ≤` `` is ``<`⌾¬``, and `` ≥` `` is ``>`⌾¬``.
34+
Boolean scans are more varied. For `∨`, the result switches from all `0` to all `1` after the first `1`, and the other way around for `∧`. For `≠`, the associative optimization gives a word-at-a-time algorithm with power-of-two shifts, and other possibilities with architecture support are [discussed below](#xor-scan). The scan `` <` `` turns off every other 1 in groups of 1s. It's used in simdjson for backslash escaping, and they [describe in detail](https://github.com/simdjson/simdjson/blob/ac78c62/src/generic/stage1/json_escape_scanner.h#L96) a method that uses subtraction for carrying. And `` >` `` flips to all 0 at the first bit if it's a 0 or the *second* 1 bit otherwise; ``∧`¬⌾(1⊸↓)`` is one implementation. `` ≤` `` is ``<`⌾¬``, and `` ≥` `` is ``>`⌾¬``.
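
The word-at-a-time xor scan with power-of-two shifts mentioned above might look like the following in C. This is only a sketch: it assumes the first list element sits in bit 0 of each 64-bit word, and the names `xor_scan_word` and `xor_scan` are made up for the example. The carry is broadcast by negating the top bit, which has the same effect as a signed shift by 63 without relying on implementation-defined behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Inclusive xor-scan within one word, bit 0 first:
   result bit i is the parity of input bits 0..i. */
static uint64_t xor_scan_word(uint64_t x) {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    return x;
}

/* Xor-scan over n words, carrying the running parity between words. */
void xor_scan(const uint64_t *x, uint64_t *r, size_t n) {
    assert(n == 0 || (x != NULL && r != NULL));
    uint64_t carry = 0;  /* all 0s or all 1s */
    for (size_t i = 0; i < n; i++) {
        /* Scan the word on its own, then fix it up with the carry. */
        uint64_t w = xor_scan_word(x[i]) ^ carry;
        r[i] = w;
        carry = -(w >> 63);  /* broadcast the top result bit */
    }
}
```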
35+
36+
Simple sequences for a few scans are given below. `` ∧` `` and `` ∨` `` on lists only call for one evaluation where the first 0 or 1 bit is found, but also have nice segmented forms that can be used for a row-scan like ``∧`˘``. Here `even` is the even bits `0x555…`, and `odd` is the odd bits `even<<1` or `0xAAA…`.
37+
38+
<table>
39+
<tr><th>Scan</th><th>C code (word)</th><th>C code (segment starts <code>m</code>)</th></tr>
40+
<tr><td align="center"><code>∧`</code></td><td><code>x &~ (x+1)</code></td><td><code>t = (x &~ m) >> 1; (x - t) ^ t</code></td></tr>
41+
<tr><td align="center"><code>∨`</code></td><td><code>x | -x</code>    </td><td><code>t = (x | m) >> 1; (t - x) ^ t</code></td></tr>
42+
<tr><td align="center"><code><`</code></td><td colspan=2><code>t = odd | (x<<1); x & (odd ^ (t - x))</code></td></tr>
43+
<tr><td align="center"><code>≤`</code></td><td colspan=2><code>t = even & (x<<1); x | (odd ^ (t - x))</code></td></tr>
44+
</table>
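
To make the segmented column concrete, here is a small C sketch of the two segmented forms next to a bit-at-a-time reference they can be checked against. It assumes the first element of a list is bit 0 of a 64-bit word and that `m` has a 1 at the first bit of every segment, including bit 0. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Segmented and-scan: within each segment, keep the leading run of 1s. */
uint64_t and_scan_seg(uint64_t x, uint64_t m) {
    uint64_t t = (x & ~m) >> 1;
    return (x - t) ^ t;
}

/* Segmented or-scan: within each segment, set everything from the first 1 on. */
uint64_t or_scan_seg(uint64_t x, uint64_t m) {
    uint64_t t = (x | m) >> 1;
    return (t - x) ^ t;
}

/* Slow reference: restart the accumulator at every segment start. */
uint64_t scan_seg_ref(uint64_t x, uint64_t m, int is_or) {
    assert(m & 1);  /* bit 0 always starts a segment */
    uint64_t r = 0, acc = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t b = (x >> i) & 1;
        if ((m >> i) & 1) acc = b;
        else acc = is_or ? (acc | b) : (acc & b);
        r |= acc << i;
    }
    return r;
}
```

For example, with `m = 0x11` (a 4-bit segment followed by one 60-bit segment) and `x = 0x27`, `and_scan_seg` returns `0x07` and `or_scan_seg` returns `0xFFFFFFFFFFFFFFEF`.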
45+
46+
Handling carries in lists for `` <` `` and `` ≤` `` is possible by modifying `x<<1`, but for shorter dependency chains you modify the result. For example, for `` <` ``, a carry of 1 means all the result bits corresponding to trailing 1s in `x` need to be flipped. If the result from the previous word is `c` with type `u64`, the result should be xor-ed with `-(c>>63) & x &~ (x+1)`, where the shared `x &` can be factored out. Note that for `` ≤` ``, the "passive" bit is 1 and so -1 is the right initial carry.
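
Putting the word kernel for `` <` `` together with the result fix-up described above gives roughly the following loop. This is a sketch under the same bit-0-first assumption; the names `lt_scan_word` and `lt_scan` are invented for the example, and the shared `x &` is left unfactored for readability.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static const uint64_t ODD = 0xAAAAAAAAAAAAAAAAull;  /* odd bit positions */

/* <` within one word, assuming no incoming carry:
   keeps every other 1 in each run of 1s, starting with the first. */
static uint64_t lt_scan_word(uint64_t x) {
    uint64_t t = ODD | (x << 1);
    return x & (ODD ^ (t - x));
}

/* <` over n words; the top bit of each result word carries into the next. */
void lt_scan(const uint64_t *x, uint64_t *r, size_t n) {
    assert(n == 0 || (x != NULL && r != NULL));
    uint64_t c = 0;  /* previous result word, 0 before the first */
    for (size_t i = 0; i < n; i++) {
        uint64_t w = x[i];
        /* With a carry of 1, flip the result under the trailing 1s of w. */
        uint64_t flip = -(c >> 63) & w & ~(w + 1);
        c = lt_scan_word(w) ^ flip;
        r[i] = c;
    }
}
```

The kernel `lt_scan_word(w)` does not depend on `c`, so it can be computed for the next word while the current fix-up is still in flight.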
3347

3448
### Xor scan
3549
