<p>Boolean folds on short rows can be implemented as a segmented scan, or windowed reduction, followed by extracting the appropriate bit from each row. The extraction is the hard part. While it's a special case of <a href="take.html#bit-interleaving-and-uninterleaving">bit uninterleaving</a>, it's better to implement it with a more specialized method. <a href="https://orlp.net/blog/extracting-depositing-bits/">Here's a post</a> on how you might do this extraction on a single word. Believe it or not, for multiple words even the pext-based method is beaten soundly by some generic code! Okay, for even widths it requires a little cheating with SSE2 auto-vectorization. For an odd width, say <code><span class='Value'>f</span></code>, there's a complicated but powerful method relying on the fact that in the first <code><span class='Value'>f</span></code> input words, the row boundaries cover each position within a word exactly once (this follows from the Chinese remainder theorem, since an odd number is relatively prime to each power of two). So the idea is to mask out these bits and combine them into a single word, then un-permute to put them in the right order in the result. There are a lot of complications, so it's described in <a href="#the-modular-bit-permutation">its own section</a>.</p>
<h2 id="the-modular-bit-permutation"><a class="header" href="#the-modular-bit-permutation">The modular bit permutation</a></h2>
<p>This section describes how to perform and use the permutation sending the bit at position <code><span class='Value'>n</span><span class='Function'>|</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span></code> to position <code><span class='Value'>i</span></code> within each group of <code><span class='Value'>n</span><span class='Gets'>←</span><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>k</span></code> bits, where <code><span class='Value'>f</span></code> is odd. It's done by a series of swaps, conditionally exchanging pairs of bits separated by a power of two, starting at <code><span class='Value'>n</span><span class='Function'>÷</span><span class='Number'>2</span></code> and ending at 2. Each swap is a self-inverse, so doing them in the opposite order results in the opposite permutation taking position <code><span class='Value'>i</span></code> to <code><span class='Value'>n</span><span class='Function'>|</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span></code>.</p>
-
<p>The direction we focus on here can extract one bit from every <code><span class='Value'>f</span></code>, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way, which can be used for take-cells but is most powerful in <a href="replicate.html#constant-replicate">Replicate by constant</a> since this also applies to broadcasting as used in Table and leading axis extension.</p>
+
<p>The direction we focus on here can extract one bit from every <code><span class='Value'>f</span></code>, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way: this most directly applies to <a href="take.html#bit-interleaving-and-uninterleaving">take-cells</a> but also works for <a href="replicate.html#constant-replicate">Replicate by constant</a>, and thus broadcasting for Table and leading axis extension.</p>
<h3 id="decomposing-into-swaps"><a class="header" href="#decomposing-into-swaps">Decomposing into swaps</a></h3>
<p>First we'll prove that a modular permutation does actually decompose into swap operations. Here's the intuitive case: consider the permutation where index <code><span class='Value'>i</span></code> has value <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> (meaning, that's the original index of the bit that ends up at <code><span class='Value'>i</span></code>). At positions <code><span class='Value'>i</span></code> and <code><span class='Number'>8</span><span class='Function'>+</span><span class='Value'>i</span></code>, <code><span class='Value'>i</span><span class='Function'>&lt;</span><span class='Number'>8</span></code>, we have <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> and <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Number'>8</span><span class='Function'>+</span><span class='Value'>i</span><span class='Paren'>)</span></code> or <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>8</span><span class='Function'>+</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code>. These values are different, but both are congruent to <code><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> (mod 8), so one of them is <code><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> and the other is <code><span class='Number'>8</span><span class='Function'>+</span><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code>.
These are the values at positions <code><span class='Value'>i</span></code> and <code><span class='Number'>8</span><span class='Function'>+</span><span class='Value'>i</span></code> in the permutation that applies <code><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> within each byte, so to extend that permutation from size 8 to size 16 what we need to do is swap these bits if <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> isn't equal to <code><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code>.</p>
<p>To handle it more rigorously, suppose we have performed our permutation of size <code><span class='Value'>h</span></code> so the value at <code><span class='Value'>i</span></code> is <code><span class='Paren'>(</span><span class='Value'>i</span><span class='Function'>-</span><span class='Value'>h</span><span class='Function'>|</span><span class='Value'>i</span><span class='Paren'>)</span><span class='Function'>+</span><span class='Value'>h</span><span class='Function'>|</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span></code> and want to extend this to size <code><span class='Value'>l</span><span class='Gets'>←</span><span class='Number'>2</span><span class='Function'>×</span><span class='Value'>h</span></code>. Define <code><span class='Function'>B</span><span class='Gets'>←</span><span class='Brace'>{</span><span class='Paren'>(</span><span class='Value'>l</span><span class='Function'>|</span><span class='Value'>𝕩</span><span class='Paren'>)</span><span class='Function'>-</span><span class='Value'>h</span><span class='Function'>|</span><span class='Value'>𝕩</span><span class='Brace'>}</span></code>, noting that <code><span class='Value'>h</span><span class='Function'>|B</span><span class='Value'>𝕩</span></code> is always 0. We will show that the value to be moved to <code><span class='Value'>i</span></code> appears at <code><span class='Value'>j</span><span class='Gets'>←</span><span class='Paren'>(</span><span class='Value'>i</span><span class='Function'>-</span><span class='Function'>B</span><span class='Value'>i</span><span class='Paren'>)</span><span class='Function'>+</span><span class='Paren'>(</span><span class='Function'>B</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span><span class='Paren'>)</span></code>.
Since <code><span class='Value'>h</span><span class='Function'>|</span><span class='Value'>j</span></code> is <code><span class='Value'>h</span><span class='Function'>|</span><span class='Value'>i</span></code> after dropping <code><span class='Function'>B</span></code> terms, we have:</p>
<p>The total data to permute width <code><span class='Value'>l</span></code> is 2+4+…<code><span class='Value'>l</span><span class='Function'>÷</span><span class='Number'>2</span></code> bits, or <code><span class='Value'>l</span><span class='Function'>-</span><span class='Number'>2</span></code>. It can be precomputed for each odd factor <code><span class='Value'>f</span><span class='Function'>&lt;</span><span class='Value'>l</span></code> (which covers larger factors too, since <code><span class='Value'>f</span><span class='Function'>+</span><span class='Value'>k</span><span class='Function'>×</span><span class='Value'>l</span></code> permutes as <code><span class='Value'>f</span></code>). Then it just needs to be read from a table and unpacked into individual mask vectors. These mask vectors could also be computed directly with multiplication and some bit shuffling; I'm not sure how this would compare in speed.</p>
<p>The bits to be passed into the modular permutation need to be collected from the argument (possibly after some processing), one bit out of each <code><span class='Value'>f</span></code>. Or, in the other direction, they need to be distributed to the result. This can be done by generating a bitmask of the required position in each register. Then an argument register is and-ed with the bitmask and or-ed into a running total. But generating the bitmask is slow. For example, with row size under 64, updating the mask <code><span class='Value'>m</span></code> for the next word is <code><span class='Value'>m</span><span class='Function'>&gt;&gt;</span><span class='Value'>r</span><span class='Function'>|</span><span class='Value'>m</span><span class='Function'>&lt;&lt;</span><span class='Value'>l</span></code> for appropriately-chosen shifts <code><span class='Value'>l</span></code> and <code><span class='Value'>r</span></code>: this is a lot of instructions at each step! For small factors, an unrolled loop with saved masks works; for larger factors, it gets to be a lot of code, and eventually you'll run out of registers.</p>
<p>Since one modular permutation is needed for every <code><span class='Value'>f</span></code> expanded registers, a better approach is to structure it as a loop of length <code><span class='Value'>f</span></code> and unroll this loop. An unrolled iteration handling 4 adjacent registers works with a mask that combines the selected bits from all those registers, and at the end of the iteration it's advanced by 4 steps, which is the same operation as advancing once, just with different shifts. So it covers registers 0|1|2|3 in the first iteration, then 4|5|6|7, and so on. In addition to this "horizontal" mask we need 4 pre-computed "vertical" masks to distinguish registers within an iteration: one mask combines register 0 of each iteration, 0|4|8|…, another does 1|5|9|…, and so on. Because the per-register masks don't overlap, the intersection of one horizontal and one vertical mask correctly handles a particular register. The unrolled iteration applies the vertical mask to each of the 4 registers, and the horizontal one to them as a whole. So:</p>
<ul>
<li>When extracting, add <code><span class='Value'>h</span><span class='Value'>&amp;</span><span class='Paren'>((</span><span class='Value'>i0&amp;v0</span><span class='Paren'>)</span><span class='Function'>|</span><span class='Value'>...</span><span class='Function'>|</span><span class='Paren'>(</span><span class='Value'>i3&amp;v3</span><span class='Paren'>))</span></code> to the running total.</li>
implementation/primitive/fold.md
+64-1
@@ -191,10 +191,46 @@ Boolean folds on short rows can be implemented as a segmented scan, or windowed
This section describes how to perform and use the permutation sending the bit at position `n|f×i` to position `i` within each group of `n←2⋆k` bits, where `f` is odd. It's done by a series of swaps, conditionally exchanging pairs of bits separated by a power of two, starting at `n÷2` and ending at 2. Each swap is a self-inverse, so doing them in the opposite order results in the opposite permutation taking position `i` to `n|f×i`.
-
The direction we focus on here can extract one bit from every `f`, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way, which can be used for take-cells but is most powerful in [Replicate by constant](replicate.md#constant-replicate) since this also applies to broadcasting as used in Table and leading axis extension.
+
The direction we focus on here can extract one bit from every `f`, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way: this most directly applies to [take-cells](take.md#bit-interleaving-and-uninterleaving) but also works for [Replicate by constant](replicate.md#constant-replicate), and thus broadcasting for Table and leading axis extension.
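
As a rough illustration of the two directions, here is a minimal Python sketch (not the implementation's code). It assumes one particular way to derive the swap masks: for the step at distance `h` within groups of `l = 2×h` bits, the pair at positions `q` and `q+h` is exchanged when some `i<h` has `l|f×i` equal to `q+h`. The names `swap_masks`, `delta_swap`, `extract`, `spread`, and `extract_direct` are made up for this example, and the result is checked against the direct definition of the permutation.

```python
def swap_masks(f, n):
    """Masks for the swap steps at distances n/2, n/4, ..., 2 (assumed derivation).

    Each mask marks the lower position of every pair to exchange and is
    repeated across all groups of 2*h bits within the n-bit word.
    """
    assert f % 2 == 1 and n >= 2 and n & (n - 1) == 0
    steps = []
    h = n // 2
    while h >= 2:
        l = 2 * h
        m = 0
        for i in range(h):
            v = (f * i) % l
            if v >= h:            # l|f×i lands in the upper half: swap needed
                m |= 1 << (v - h)
        full = 0
        for g in range(0, n, l):  # same condition in every group of l bits
            full |= m << g
        steps.append((h, full))
        h //= 2
    return steps


def delta_swap(x, mask, h):
    """Exchange bit q with bit q+h for every q set in mask."""
    t = (x ^ (x >> h)) & mask
    return x ^ t ^ (t << h)


def extract(x, f, n):
    """Move the bit at position n|f×i to position i (distances n/2 down to 2)."""
    for h, mask in swap_masks(f, n):
        x = delta_swap(x, mask, h)
    return x


def spread(x, f, n):
    """Inverse direction: the same swaps in the opposite order."""
    for h, mask in reversed(swap_masks(f, n)):
        x = delta_swap(x, mask, h)
    return x


def extract_direct(x, f, n):
    """Reference: result bit i is input bit n|f×i."""
    return sum(((x >> ((f * i) % n)) & 1) << i for i in range(n))


x = 0x0123456789ABCDEF
for f in (3, 5, 21, 63):
    assert extract(x, f, 64) == extract_direct(x, f, 64)
    assert spread(extract(x, f, 64), f, 64) == x
```

Each step's mask carries `h` bits of information before it is repeated across groups, so a real implementation would presumably load the masks from a table rather than rebuild them per call as this sketch does.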
"opacity=0.25|stroke-width=4" Ge (Line dׯ0.5‿0.42⊸+)¨ ∾⟨
+
(2⊸× ⋈˜⊸≍¨ ¯0.8⋈¨y⊏˜{2|𝕩?0;1+𝕊𝕩÷2}¨) 1↓↕n÷2
+
(0‿n≍⋈˜)¨ y
+
⟩
+
((⟨1,0.5+rh⟩+d⊸×)¨¯0.5⋈¨y) Text¨ FmtNum 2⋆1+↕ln
+
⟩
+
"stroke-width=2.5" Ge 1↓lgs Ge¨ (⥊(>+2×<)˝˘2↕s) ⊔ lines
+
"stroke-width=0.2" Ge rgs Ge¨< (Rect (rd-0.2)⊸(-˜≍2×⊣))¨ np
+
bg Ge (2×rh×n÷˜1+s) (Line (0⋈⊣)(⊢≍˘-˜)((-1⊸+)⌾⊑rd)⊸+)¨ np
+
np Text¨ FmtNum s
+
⟩
+
}
+
-->
+
First we'll prove that a modular permutation does actually decompose into swap operations. Here's the intuitive case: consider the permutation where index `i` has value `16|5×i` (meaning, that's the original index of the bit that ends up at `i`). At positions `i` and `8+i`, `i<8`, we have `16|5×i` and `16|5×(8+i)` or `16|8+5×i`. These values are different, but both are congruent to `5×i` (mod 8), so one of them is `8|5×i` and the other is `8+8|5×i`. These are the values at positions `i` and `8+i` in the permutation that applies `8|5×i` within each byte, so to extend that permutation from size 8 to size 16 what we need to do is swap these bits if `16|5×i` isn't equal to `8|5×i`.
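
The `16|5×i` case is small enough to check by brute force. A short Python sketch (the variable names are only for illustration) that walks over each pair of positions `i` and `8+i`, confirms that the pair holds `8|5×i` and `8+8|5×i` in some order, and records which pairs need swapping:

```python
n, h, f = 16, 8, 5

swaps = []
for i in range(h):
    lo = (f * i) % n          # value at position i in the size-16 permutation
    hi = (f * (h + i)) % n    # value at position 8+i
    base = (f * i) % h        # value at position i in the size-8 permutation
    # The pair always holds base and 8+base, in one order or the other.
    assert {lo, hi} == {base, h + base}
    if lo != base:            # swap exactly when 16|5×i differs from 8|5×i
        swaps.append(i)
    print(i, lo, hi, "swap" if lo != base else "keep")
```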
To handle it more rigorously, suppose we have performed our permutation of size `h` so the value at `i` is `(i - h|i) + h|f×i` and want to extend this to size `l ← 2×h`. Define `B ← {(l|𝕩) - h|𝕩}`, noting that `h|B𝕩` is always 0. We will show that the value to be moved to `i` appears at `j ← (i - B i) + (B f×i)`. Since `h|j` is `h|i` after dropping `B` terms, we have:
@@ -244,6 +280,33 @@ The total data to permute width `l` is 2+4+…`l÷2` bits, or `l-2`. It can be p
The bits to be passed into the modular permutation need to be collected from the argument (possibly after some processing), one bit out of each `f`. Or, in the other direction, they need to be distributed to the result. This can be done by generating a bitmask of the required position in each register. Then an argument register is and-ed with the bitmask and or-ed into a running total. But generating the bitmask is slow. For example, with row size under 64, updating the mask `m` for the next word is `m>>r | m<<l` for appropriately-chosen shifts `l` and `r`: this is a lot of instructions at each step! For small factors, an unrolled loop with saved masks works; for larger factors, it gets to be a lot of code, and eventually you'll run out of registers.
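
To make the mask update concrete, here is a minimal Python sketch of the straightforward collection loop, not the implementation's code. It assumes 64-bit words, an odd row size `f` under 64, and that the wanted bit is the first of each row (global positions 0, `f`, `2×f`, …); under those assumptions the shifts come out as `r = 64 mod f` and `l = f - r`. The function names are hypothetical, and the final un-permute is done by direct indexing rather than by swaps.

```python
import random

W = 64                      # word size in bits
ALL = (1 << W) - 1


def collect(words, f):
    """OR together the bits at global positions 0, f, 2f, ... of f words."""
    assert f % 2 == 1 and 1 < f < W and len(words) == f
    r = W % f               # selected positions move down by r in the next word
    l = f - r               # m<<l supplies the positions that m>>r misses
    m = sum(1 << p for p in range(0, W, f))   # mask for the first word
    total = 0
    for w in words:
        total |= w & m      # and with the mask, or into the running total
        m = ((m >> r) | (m << l)) & ALL
    return total


def unpermute(x, f):
    """Modular permutation by direct indexing: result bit i is bit 64|f×i."""
    return sum(((x >> ((f * i) % W)) & 1) << i for i in range(W))


def extract_naive(words, f):
    """Reference: read bit f×k of the concatenated words for each k < 64."""
    big = sum(w << (W * j) for j, w in enumerate(words))
    return sum(((big >> (f * k)) & 1) << k for k in range(W))


rng = random.Random(0)
for f in (3, 5, 7, 21, 63):
    words = [rng.getrandbits(W) for _ in range(f)]
    assert unpermute(collect(words, f), f) == extract_naive(words, f)
```

The check passes because the `f` masks never overlap and together cover all 64 positions, so the or-ed total loses nothing before the permutation puts the bits in order.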
+
<!--GEN
+
{
+
g ← "fill=currentColor|text-anchor=middle|font-family=BQN,monospace"
t ← ⟨"&"⟩ ∾ (0=↕4) (∾⟜"|"⊸∾´ +⟜1⊸↑∾⟨"…"⟩∾-⟜2⊸↑)¨ <˘⍉𝕩
+
Or t ∾ {< ∾⟜"|…|"⊸∾´ ↑¨˜⟜(-⊸⋈⌈○≠´) 0‿¯1⊏𝕩}⊸∾˘ 𝕩
+
} ⌽‿4 ⥊ FmtNum ↕21
+
⟩
+
}
+
-->
+
Since one modular permutation is needed for every `f` expanded registers, a better approach is to structure it as a loop of length `f` and unroll this loop. An unrolled iteration handling 4 adjacent registers works with a mask that combines the selected bits from all those registers, and at the end of the iteration it's advanced by 4 steps, which is the same operation as advancing once, just with different shifts. So it covers registers 0|1|2|3 in the first iteration, then 4|5|6|7, and so on. In addition to this "horizontal" mask we need 4 pre-computed "vertical" masks to distinguish registers within an iteration: one mask combines register 0 of each iteration, 0|4|8|…, another does 1|5|9|…, and so on. Because the per-register masks don't overlap, the intersection of one horizontal and one vertical mask correctly handles a particular register. The unrolled iteration applies the vertical mask to each of the 4 registers, and the horizontal one to them as a whole. So:
- When extracting, add `h & ((i0&v0) | ... | (i3&v3))` to the running total.
- When depositing, set `c = h & p`, and use `c&v0`, ... `c&v3`.
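
The two bullets can be sketched in Python as follows. This is only an illustration under assumptions: 64-bit "registers", an odd `f` under 64, and selected bits at global positions 0, `f`, `2×f`, …. For simplicity the horizontal mask is rebuilt from per-register masks in each iteration instead of being advanced with shifts, and a short final iteration is handled by slicing, which is one possible way to treat the tail and not necessarily what an implementation would do. All function names are hypothetical.

```python
import random

W = 64
ALL = (1 << W) - 1
U = 4                       # registers per unrolled iteration


def word_masks(f):
    """Selected positions within each of f consecutive words."""
    r, masks, m = W % f, [], sum(1 << p for p in range(0, W, f))
    for _ in range(f):
        masks.append(m)
        m = ((m >> r) | (m << (f - r))) & ALL
    return masks


def or_all(xs):
    out = 0
    for x in xs:
        out |= x
    return out


def extract_unrolled(words, f):
    """Running total of h & ((i0&v0) | ... | (i3&v3)) over all iterations."""
    masks = word_masks(f)
    v = [or_all(masks[j::U]) for j in range(U)]     # vertical masks
    total = 0
    for t in range(0, f, U):
        h = or_all(masks[t:t + U])                  # horizontal mask
        acc = or_all(w & vj for w, vj in zip(words[t:t + U], v))
        total |= h & acc
    return total


def deposit_unrolled(p, f):
    """For each iteration set c = h & p, then emit c&v0, ... c&v3."""
    masks = word_masks(f)
    v = [or_all(masks[j::U]) for j in range(U)]
    out = []
    for t in range(0, f, U):
        c = or_all(masks[t:t + U]) & p
        out.extend(c & vj for vj in v[:min(U, f - t)])
    return out


rng = random.Random(1)
for f in (3, 5, 21, 63):
    ms = word_masks(f)
    words = [rng.getrandbits(W) for _ in range(f)]
    # Same result as masking every register with its own mask.
    assert extract_unrolled(words, f) == or_all(w & m for w, m in zip(words, ms))
    p = rng.getrandbits(W)
    assert deposit_unrolled(p, f) == [p & m for m in ms]
```

The assertions compare against masking each register with its own mask, which is the result the horizontal and vertical masks are meant to reproduce with fewer mask updates.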