mlochbaum · Mar 8, 2025
diff --git a/docs/implementation/primitive/fold.html b/docs/implementation/primitive/fold.html
+377 -1
diff --git a/implementation/primitive/fold.md b/implementation/primitive/fold.md
+64 -1
@@ -390,8 +390,321 @@ <h3 id="boolean-fold-cells"><a class="header" href="#boolean-fold-cells">Boolean
 <p>Boolean folds on short rows can be implemented as a segmented scan, or windowed reduction, followed by extracting the appropriate bit from each row. The extraction is the hard part. While it's a special case of <a href="take.html#bit-interleaving-and-uninterleaving">bit uninterleaving</a>, it's better to implement it with a more specialized method. <a href="https://orlp.net/blog/extracting-depositing-bits/">Here's a post</a> on how you might do this extraction on a single word. Believe it or not, for multiple words even the pext-based method is beaten soundly by some generic code! Okay, for even widths it requires a little cheating with SSE2 auto-vectorization. For an odd width, say <code><span class='Value'>f</span></code>, there's a complicated but powerful method relying on the fact that in the first <code><span class='Value'>f</span></code> input words, the row boundaries cover each position within a word exactly once (this follows from the Chinese remainder theorem, since an odd number is relatively prime to each power of two). So the idea is to mask out these bits and combine them into a single word, then un-permute to put them in the right order in the result. There are a lot of complications, so it's described in <a href="#the-modular-bit-permutation">its own section</a>.</p>
 <h2 id="the-modular-bit-permutation"><a class="header" href="#the-modular-bit-permutation">The modular bit permutation</a></h2>
 <p>This section describes how to perform and use the permutation sending the bit at position <code><span class='Value'>n</span><span class='Function'>|</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span></code> to position <code><span class='Value'>i</span></code> within each group of <code><span class='Value'>n</span><span class='Gets'>←</span><span class='Number'>2</span><span class='Function'>⋆</span><span class='Value'>k</span></code> bits, where <code><span class='Value'>f</span></code> is odd. It's done by a series of swaps, conditionally exchanging pairs of bits separated by a power of two, starting at <code><span class='Value'>n</span><span class='Function'>÷</span><span class='Number'>2</span></code> and ending at 2. Each swap is a self-inverse, so doing them in the opposite order results in the opposite permutation taking position <code><span class='Value'>i</span></code> to <code><span class='Value'>n</span><span class='Function'>|</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span></code>.</p>
-<p>The direction we focus on here can extract one bit from every <code><span class='Value'>f</span></code>, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way, which can be used for take-cells but is most powerful in <a href="replicate.html#constant-replicate">Replicate by constant</a> since this also applies to broadcasting as used in Table and leading axis extension.</p>
+<p>The direction we focus on here can extract one bit from every <code><span class='Value'>f</span></code>, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way: this most directly applies to <a href="take.html#bit-interleaving-and-uninterleaving">take-cells</a> but also works for <a href="replicate.html#constant-replicate">Replicate by constant</a>, and thus broadcasting for Table and leading axis extension.</p>
 <h3 id="decomposing-into-swaps"><a class="header" href="#decomposing-into-swaps">Decomposing into swaps</a></h3>
+<svg viewBox='-105.4 -44.2 712.8 311.842'>
+  <g fill='currentColor' stroke-linecap='round' text-anchor='middle' font-family='BQN,monospace'>
+    <rect class='code' stroke-width='1.5' rx='12' x='-41.4' y='-36.2' width='584.8' height='295.842'/>
+    <g class='yellow' text-anchor='end' font-size='14'>
+      <g opacity='0.25' stroke-width='4'>
+        <line x1='51' x2='51' y1='-21.28' y2='23.52'/>
+        <line x1='119' x2='119' y1='-21.28' y2='79.52'/>
+        <line x1='187' x2='187' y1='-21.28' y2='23.52'/>
+        <line x1='255' x2='255' y1='-21.28' y2='150.64'/>
+        <line x1='323' x2='323' y1='-21.28' y2='23.52'/>
+        <line x1='391' x2='391' y1='-21.28' y2='79.52'/>
+        <line x1='459' x2='459' y1='-21.28' y2='23.52'/>
+        <line x1='-17' x2='527' y1='23.52' y2='23.52'/>
+        <line x1='-17' x2='527' y1='79.52' y2='79.52'/>
+        <line x1='-17' x2='527' y1='150.64' y2='150.64'/>
+        <line x1='-17' x2='527' y1='240.962' y2='240.962'/>
+      </g>
+      <text dy='0.32em' x='-16' y='13.5'>2</text>
+      <text dy='0.32em' x='-16' y='69.5'>4</text>
+      <text dy='0.32em' x='-16' y='140.62'>8</text>
+      <text dy='0.32em' x='-16' y='230.942'>16</text>
+    </g>
+    <g stroke-width='2.5'>
+      <g class='purple'>
+        <line x1='102' x2='34' y1='11.76' y2='44.24'/>
+        <line x1='238' x2='170' y1='11.76' y2='44.24'/>
+        <line x1='374' x2='306' y1='11.76' y2='44.24'/>
+        <line x1='510' x2='442' y1='11.76' y2='44.24'/>
+        <line x1='204' x2='68' y1='67.76' y2='115.36'/>
+        <line x1='476' x2='340' y1='67.76' y2='115.36'/>
+        <line x1='374' x2='102' y1='138.88' y2='205.682'/>
+        <line x1='408' x2='136' y1='138.88' y2='205.682'/>
+        <line x1='442' x2='170' y1='138.88' y2='205.682'/>
+      </g>
+      <g class='red'>
+        <line x1='34' x2='102' y1='11.76' y2='44.24'/>
+        <line x1='170' x2='238' y1='11.76' y2='44.24'/>
+        <line x1='306' x2='374' y1='11.76' y2='44.24'/>
+        <line x1='442' x2='510' y1='11.76' y2='44.24'/>
+        <line x1='68' x2='204' y1='67.76' y2='115.36'/>
+        <line x1='340' x2='476' y1='67.76' y2='115.36'/>
+        <line x1='102' x2='374' y1='138.88' y2='205.682'/>
+        <line x1='136' x2='408' y1='138.88' y2='205.682'/>
+        <line x1='170' x2='442' y1='138.88' y2='205.682'/>
+      </g>
+    </g>
+    <g stroke-width='0.2'>
+      <g class='code'>
+        <rect x='-9.8' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='24.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='58.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='92.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='126.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='160.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='194.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='228.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='262.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='296.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='330.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='364.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='398.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='432.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='466.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='500.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='-9.8' y='43.2' width='21.6' height='25.6'/>
+        <rect x='24.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='58.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='92.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='126.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='160.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='194.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='228.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='262.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='296.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='330.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='364.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='398.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='432.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='466.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='500.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='-9.8' y='114.32' width='21.6' height='25.6'/>
+        <rect x='24.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='58.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='92.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='126.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='160.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='194.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='228.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='262.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='296.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='330.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='364.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='398.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='432.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='466.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='500.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='-9.8' y='204.642' width='21.6' height='25.6'/>
+        <rect x='24.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='58.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='92.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='126.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='160.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='194.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='228.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='262.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='296.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='330.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='364.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='398.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='432.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='466.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='500.2' y='204.642' width='21.6' height='25.6'/>
+      </g>
+      <g fill='none' stroke='currentColor'>
+        <rect x='-9.8' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='24.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='58.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='92.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='126.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='160.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='194.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='228.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='262.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='296.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='330.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='364.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='398.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='432.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='466.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='500.2' y='-12.8' width='21.6' height='25.6'/>
+        <rect x='-9.8' y='43.2' width='21.6' height='25.6'/>
+        <rect x='24.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='58.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='92.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='126.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='160.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='194.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='228.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='262.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='296.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='330.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='364.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='398.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='432.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='466.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='500.2' y='43.2' width='21.6' height='25.6'/>
+        <rect x='-9.8' y='114.32' width='21.6' height='25.6'/>
+        <rect x='24.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='58.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='92.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='126.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='160.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='194.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='228.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='262.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='296.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='330.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='364.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='398.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='432.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='466.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='500.2' y='114.32' width='21.6' height='25.6'/>
+        <rect x='-9.8' y='204.642' width='21.6' height='25.6'/>
+        <rect x='24.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='58.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='92.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='126.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='160.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='194.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='228.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='262.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='296.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='330.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='364.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='398.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='432.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='466.2' y='204.642' width='21.6' height='25.6'/>
+        <rect x='500.2' y='204.642' width='21.6' height='25.6'/>
+      </g>
+    </g>
+    <g stroke-width='4' stroke='currentColor' stroke-linecap='butt'>
+      <line x1='-11' x2='-11' y1='13' y2='11.375'/>
+      <line x1='23' x2='23' y1='13' y2='9.75'/>
+      <line x1='57' x2='57' y1='13' y2='8.125'/>
+      <line x1='91' x2='91' y1='13' y2='6.5'/>
+      <line x1='125' x2='125' y1='13' y2='4.875'/>
+      <line x1='159' x2='159' y1='13' y2='3.25'/>
+      <line x1='193' x2='193' y1='13' y2='1.625'/>
+      <line x1='227' x2='227' y1='13' y2='0'/>
+      <line x1='261' x2='261' y1='13' y2='-1.625'/>
+      <line x1='295' x2='295' y1='13' y2='-3.25'/>
+      <line x1='329' x2='329' y1='13' y2='-4.875'/>
+      <line x1='363' x2='363' y1='13' y2='-6.5'/>
+      <line x1='397' x2='397' y1='13' y2='-8.125'/>
+      <line x1='431' x2='431' y1='13' y2='-9.75'/>
+      <line x1='465' x2='465' y1='13' y2='-11.375'/>
+      <line x1='499' x2='499' y1='13' y2='-13'/>
+      <line x1='-11' x2='-11' y1='69' y2='67.375'/>
+      <line x1='23' x2='23' y1='69' y2='62.5'/>
+      <line x1='57' x2='57' y1='69' y2='64.125'/>
+      <line x1='91' x2='91' y1='69' y2='65.75'/>
+      <line x1='125' x2='125' y1='69' y2='60.875'/>
+      <line x1='159' x2='159' y1='69' y2='56'/>
+      <line x1='193' x2='193' y1='69' y2='57.625'/>
+      <line x1='227' x2='227' y1='69' y2='59.25'/>
+      <line x1='261' x2='261' y1='69' y2='54.375'/>
+      <line x1='295' x2='295' y1='69' y2='49.5'/>
+      <line x1='329' x2='329' y1='69' y2='51.125'/>
+      <line x1='363' x2='363' y1='69' y2='52.75'/>
+      <line x1='397' x2='397' y1='69' y2='47.875'/>
+      <line x1='431' x2='431' y1='69' y2='43'/>
+      <line x1='465' x2='465' y1='69' y2='44.625'/>
+      <line x1='499' x2='499' y1='69' y2='46.25'/>
+      <line x1='-11' x2='-11' y1='140.12' y2='138.495'/>
+      <line x1='23' x2='23' y1='140.12' y2='133.62'/>
+      <line x1='57' x2='57' y1='140.12' y2='128.745'/>
+      <line x1='91' x2='91' y1='140.12' y2='136.87'/>
+      <line x1='125' x2='125' y1='140.12' y2='131.995'/>
+      <line x1='159' x2='159' y1='140.12' y2='127.12'/>
+      <line x1='193' x2='193' y1='140.12' y2='135.245'/>
+      <line x1='227' x2='227' y1='140.12' y2='130.37'/>
+      <line x1='261' x2='261' y1='140.12' y2='125.495'/>
+      <line x1='295' x2='295' y1='140.12' y2='120.62'/>
+      <line x1='329' x2='329' y1='140.12' y2='115.745'/>
+      <line x1='363' x2='363' y1='140.12' y2='123.87'/>
+      <line x1='397' x2='397' y1='140.12' y2='118.995'/>
+      <line x1='431' x2='431' y1='140.12' y2='114.12'/>
+      <line x1='465' x2='465' y1='140.12' y2='122.245'/>
+      <line x1='499' x2='499' y1='140.12' y2='117.37'/>
+      <line x1='-11' x2='-11' y1='230.442' y2='228.817'/>
+      <line x1='23' x2='23' y1='230.442' y2='223.942'/>
+      <line x1='57' x2='57' y1='230.442' y2='219.067'/>
+      <line x1='91' x2='91' y1='230.442' y2='214.192'/>
+      <line x1='125' x2='125' y1='230.442' y2='209.317'/>
+      <line x1='159' x2='159' y1='230.442' y2='204.442'/>
+      <line x1='193' x2='193' y1='230.442' y2='225.567'/>
+      <line x1='227' x2='227' y1='230.442' y2='220.692'/>
+      <line x1='261' x2='261' y1='230.442' y2='215.817'/>
+      <line x1='295' x2='295' y1='230.442' y2='210.942'/>
+      <line x1='329' x2='329' y1='230.442' y2='206.067'/>
+      <line x1='363' x2='363' y1='230.442' y2='227.192'/>
+      <line x1='397' x2='397' y1='230.442' y2='222.317'/>
+      <line x1='431' x2='431' y1='230.442' y2='217.442'/>
+      <line x1='465' x2='465' y1='230.442' y2='212.567'/>
+      <line x1='499' x2='499' y1='230.442' y2='207.692'/>
+    </g>
+    <text dy='0.32em' x='1' y='0'>0</text>
+    <text dy='0.32em' x='35' y='0'>1</text>
+    <text dy='0.32em' x='69' y='0'>2</text>
+    <text dy='0.32em' x='103' y='0'>3</text>
+    <text dy='0.32em' x='137' y='0'>4</text>
+    <text dy='0.32em' x='171' y='0'>5</text>
+    <text dy='0.32em' x='205' y='0'>6</text>
+    <text dy='0.32em' x='239' y='0'>7</text>
+    <text dy='0.32em' x='273' y='0'>8</text>
+    <text dy='0.32em' x='307' y='0'>9</text>
+    <text dy='0.32em' x='341' y='0'>10</text>
+    <text dy='0.32em' x='375' y='0'>11</text>
+    <text dy='0.32em' x='409' y='0'>12</text>
+    <text dy='0.32em' x='443' y='0'>13</text>
+    <text dy='0.32em' x='477' y='0'>14</text>
+    <text dy='0.32em' x='511' y='0'>15</text>
+    <text dy='0.32em' x='1' y='56'>0</text>
+    <text dy='0.32em' x='35' y='56'>3</text>
+    <text dy='0.32em' x='69' y='56'>2</text>
+    <text dy='0.32em' x='103' y='56'>1</text>
+    <text dy='0.32em' x='137' y='56'>4</text>
+    <text dy='0.32em' x='171' y='56'>7</text>
+    <text dy='0.32em' x='205' y='56'>6</text>
+    <text dy='0.32em' x='239' y='56'>5</text>
+    <text dy='0.32em' x='273' y='56'>8</text>
+    <text dy='0.32em' x='307' y='56'>11</text>
+    <text dy='0.32em' x='341' y='56'>10</text>
+    <text dy='0.32em' x='375' y='56'>9</text>
+    <text dy='0.32em' x='409' y='56'>12</text>
+    <text dy='0.32em' x='443' y='56'>15</text>
+    <text dy='0.32em' x='477' y='56'>14</text>
+    <text dy='0.32em' x='511' y='56'>13</text>
+    <text dy='0.32em' x='1' y='127.12'>0</text>
+    <text dy='0.32em' x='35' y='127.12'>3</text>
+    <text dy='0.32em' x='69' y='127.12'>6</text>
+    <text dy='0.32em' x='103' y='127.12'>1</text>
+    <text dy='0.32em' x='137' y='127.12'>4</text>
+    <text dy='0.32em' x='171' y='127.12'>7</text>
+    <text dy='0.32em' x='205' y='127.12'>2</text>
+    <text dy='0.32em' x='239' y='127.12'>5</text>
+    <text dy='0.32em' x='273' y='127.12'>8</text>
+    <text dy='0.32em' x='307' y='127.12'>11</text>
+    <text dy='0.32em' x='341' y='127.12'>14</text>
+    <text dy='0.32em' x='375' y='127.12'>9</text>
+    <text dy='0.32em' x='409' y='127.12'>12</text>
+    <text dy='0.32em' x='443' y='127.12'>15</text>
+    <text dy='0.32em' x='477' y='127.12'>10</text>
+    <text dy='0.32em' x='511' y='127.12'>13</text>
+    <text dy='0.32em' x='1' y='217.442'>0</text>
+    <text dy='0.32em' x='35' y='217.442'>3</text>
+    <text dy='0.32em' x='69' y='217.442'>6</text>
+    <text dy='0.32em' x='103' y='217.442'>9</text>
+    <text dy='0.32em' x='137' y='217.442'>12</text>
+    <text dy='0.32em' x='171' y='217.442'>15</text>
+    <text dy='0.32em' x='205' y='217.442'>2</text>
+    <text dy='0.32em' x='239' y='217.442'>5</text>
+    <text dy='0.32em' x='273' y='217.442'>8</text>
+    <text dy='0.32em' x='307' y='217.442'>11</text>
+    <text dy='0.32em' x='341' y='217.442'>14</text>
+    <text dy='0.32em' x='375' y='217.442'>1</text>
+    <text dy='0.32em' x='409' y='217.442'>4</text>
+    <text dy='0.32em' x='443' y='217.442'>7</text>
+    <text dy='0.32em' x='477' y='217.442'>10</text>
+    <text dy='0.32em' x='511' y='217.442'>13</text>
+  </g>
+</svg>
+
 <p>First we'll prove that a modular permutation does actually decompose into swap operations. Here's the intuitive case: consider the permutation where index <code><span class='Value'>i</span></code> has value <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> (meaning, that's the original index of the bit that ends up at <code><span class='Value'>i</span></code>). At positions <code><span class='Value'>i</span></code> and <code><span class='Number'>8</span><span class='Function'>+</span><span class='Value'>i</span></code>, <code><span class='Value'>i</span><span class='Function'>&lt;</span><span class='Number'>8</span></code>, we have <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> and <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Paren'>(</span><span class='Number'>8</span><span class='Function'>+</span><span class='Value'>i</span><span class='Paren'>)</span></code> or <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>8</span><span class='Function'>+</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code>. 
These values are different, but both are congruent to <code><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> (mod 8), so one of them is <code><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> and the other is <code><span class='Number'>8</span><span class='Function'>+</span><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code>. These are the values at positions <code><span class='Value'>i</span></code> and <code><span class='Number'>8</span><span class='Function'>+</span><span class='Value'>i</span></code> in the permutation that applies <code><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> within each byte, so to extend that permutation from size 8 to size 16 what we need to do is swap these bits if <code><span class='Number'>16</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code> isn't equal to <code><span class='Number'>8</span><span class='Function'>|</span><span class='Number'>5</span><span class='Function'>×</span><span class='Value'>i</span></code>.</p>
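<p>The extension step can be checked numerically. Below is a minimal Python sketch, not the implementation this page describes: the function name <code>modular_permutation_by_swaps</code> is made up for illustration, and the permutation is modelled as a list of source indices (as in the diagram above) rather than as packed bits. It starts from the identity and, for each size <code>h</code> = 2, 4, … up to <code>n÷2</code>, swaps positions <code>i</code> and <code>h+i</code> exactly when <code>f×i</code> has different residues modulo <code>2×h</code> and modulo <code>h</code>.</p>

```python
def modular_permutation_by_swaps(f, n):
    """Build the list p with p[i] == (f*i) % n using only conditional swaps.

    Assumptions for this sketch: f is odd, n is a power of two (n >= 2),
    and p[i] holds the original index of the bit that ends up at i.
    """
    assert f % 2 == 1 and n >= 2 and n & (n - 1) == 0
    perm = list(range(n))      # identity: already the size-2 permutation
    h = 2
    while h < n:
        l = 2 * h              # extend from size h to size l
        for i in range(n):
            # i is the lower position of the pair (i, i+h) within its block of l
            if i % l < h and (f * i) % l != (f * i) % h:
                perm[i], perm[i + h] = perm[i + h], perm[i]
        h = l
    return perm


if __name__ == "__main__":
    print(modular_permutation_by_swaps(3, 16))
```

<p>For <code>f</code> = 3 and <code>n</code> = 16 this gives 0 3 6 9 12 15 2 5 8 11 14 1 4 7 10 13, the last row of the diagram. A real implementation would apply each step to whole registers with a precomputed swap mask instead of looping over positions.</p>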
 <p>To handle it more rigorously, suppose we have performed our permutation of size <code><span class='Value'>h</span></code> so the value at <code><span class='Value'>i</span></code> is <code><span class='Paren'>(</span><span class='Value'>i</span> <span class='Function'>-</span> <span class='Value'>h</span><span class='Function'>|</span><span class='Value'>i</span><span class='Paren'>)</span> <span class='Function'>+</span> <span class='Value'>h</span><span class='Function'>|</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span></code> and want to extend this to size <code><span class='Value'>l</span> <span class='Gets'>←</span> <span class='Number'>2</span><span class='Function'>×</span><span class='Value'>h</span></code>. Define <code><span class='Function'>B</span> <span class='Gets'>←</span> <span class='Brace'>{</span><span class='Paren'>(</span><span class='Value'>l</span><span class='Function'>|</span><span class='Value'>𝕩</span><span class='Paren'>)</span> <span class='Function'>-</span> <span class='Value'>h</span><span class='Function'>|</span><span class='Value'>𝕩</span><span class='Brace'>}</span></code>, noting that <code><span class='Value'>h</span><span class='Function'>|B</span><span class='Value'>𝕩</span></code> is always 0. We will show that the value to be moved to <code><span class='Value'>i</span></code> appears at <code><span class='Value'>j</span> <span class='Gets'>←</span> <span class='Paren'>(</span><span class='Value'>i</span> <span class='Function'>-</span> <span class='Function'>B</span> <span class='Value'>i</span><span class='Paren'>)</span> <span class='Function'>+</span> <span class='Paren'>(</span><span class='Function'>B</span> <span class='Value'>f</span><span class='Function'>×</span><span class='Value'>i</span><span class='Paren'>)</span></code>. 
Since <code><span class='Value'>h</span><span class='Function'>|</span><span class='Value'>j</span></code> is <code><span class='Value'>h</span><span class='Function'>|</span><span class='Value'>i</span></code> after dropping <code><span class='Function'>B</span></code> terms, we have:</p>
 <pre>   <span class='Paren'>(</span><span class='Value'>j</span> <span class='Function'>-</span> <span class='Value'>h</span><span class='Function'>|</span><span class='Value'>j</span><span class='Paren'>)</span> <span class='Function'>+</span> <span class='Value'>h</span><span class='Function'>|</span><span class='Value'>f</span><span class='Function'>×</span><span class='Value'>j</span>
@@ -447,6 +760,69 @@ <h3 id="evaluating-swaps"><a class="header" href="#evaluating-swaps">Evaluating
 <p>The total data to permute width <code><span class='Value'>l</span></code> is 2+4+…<code><span class='Value'>l</span><span class='Function'>÷</span><span class='Number'>2</span></code> bits, or <code><span class='Value'>l</span><span class='Function'>-</span><span class='Number'>2</span></code>. It can be precomputed for each odd factor <code><span class='Value'>f</span><span class='Function'>&lt;</span><span class='Value'>l</span></code> (which covers larger factors too, since <code><span class='Value'>f</span><span class='Function'>+</span><span class='Value'>k</span><span class='Function'>×</span><span class='Value'>l</span></code> permutes as <code><span class='Value'>f</span></code>). Then it just needs to be read from a table and unpacked into individual mask vectors. These mask vectors could also be computed directly with multiplication and some bit shuffling; I'm not sure how this would compare in speed.</p>
 <h3 id="collecting-bits"><a class="header" href="#collecting-bits">Collecting bits</a></h3>
 <p>The bits to be passed into the modular permutation need to be collected from the argument (possibly after some processing), one bit out of each <code><span class='Value'>f</span></code>. Or, in the other direction, they need to be distributed to the result. This can be done by generating a bitmask of the required position in each register. Then an argument register is and-ed with the bitmask and or-ed into a running total. But generating the bitmask is slow. For example, with row size under 64, updating the mask <code><span class='Value'>m</span></code> for the next word is <code><span class='Value'>m</span><span class='Function'>&gt;&gt;</span><span class='Value'>r</span> <span class='Function'>|</span> <span class='Value'>m</span><span class='Function'>&lt;&lt;</span><span class='Value'>l</span></code> for appropriately-chosen shifts <code><span class='Value'>l</span></code> and <code><span class='Value'>r</span></code>: this is a lot of instructions at each step! For small factors, an unrolled loop with saved masks works; for larger factors, it gets to be a lot of code, and eventually you'll run out of registers.</p>
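<p>As a rough illustration of the straightforward mask-per-word approach, here is a Python sketch that extracts the first bit of each <code>f</code>-bit row from <code>f</code> 64-bit words. Everything in it is an assumption made for the example: the names <code>first_mask</code>, <code>next_mask</code> and <code>extract_column0</code>, the least-significant-bit-first bit order, the shift amounts <code>r</code> = 64 mod <code>f</code> and <code>l</code> = <code>f</code> minus <code>r</code>, and the final un-permutation done as a plain per-bit gather instead of the swap network.</p>

```python
WORD = 64
FULL = (1 << WORD) - 1


def first_mask(f):
    """Mask for word 0: one bit at the start of every f-bit row."""
    return sum(1 << p for p in range(0, WORD, f))


def next_mask(m, f):
    """Advance the mask to the next word: m>>r | m<<l, truncated to 64 bits."""
    r = WORD % f   # how far row starts drift per word (assumed LSB-first order)
    l = f - r
    return ((m >> r) | (m << l)) & FULL


def extract_column0(words, f):
    """Given f words holding 64 rows of f bits, return bit 0 of each row."""
    assert f % 2 == 1 and 1 < f < WORD and len(words) == f
    m = first_mask(f)
    combined = 0
    for w in words:
        combined |= w & m      # and with the mask, or into the running total
        m = next_mask(m, f)
    # Row i starts at global bit f*i, so its bit sits at position (f*i) % 64
    # of `combined`. Gathering it to position i is the modular permutation.
    result = 0
    for i in range(WORD):
        result |= ((combined >> ((f * i) % WORD)) & 1) << i
    return result
```

<p>Because <code>f</code> is odd, the <code>f</code> masks are disjoint and together cover all 64 positions, which is why a single combined word is enough. The cost the paragraph above complains about is visible in <code>next_mask</code>: two shifts and an or for every input word.</p>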
+<svg viewBox='-192 -8 816 314.8'>
+  <g fill='currentColor' text-anchor='middle' font-family='BQN,monospace'>
+    <rect class='code' stroke-width='1.5' rx='12' x='0' y='0' width='432' height='298.8'/>
+    <g stroke-width='10' stroke='#521f5e' opacity='0.1'>
+      <line x1='18' x2='414' y1='90' y2='90'/>
+      <line x1='18' x2='414' y1='126' y2='126'/>
+      <line x1='18' x2='414' y1='162' y2='162'/>
+      <line x1='18' x2='414' y1='198' y2='198'/>
+      <line x1='18' x2='414' y1='234' y2='234'/>
+    </g>
+    <g stroke-width='10' stroke='#991814' opacity='0.25'>
+      <line x1='18' x2='414' y1='270' y2='270'/>
+    </g>
+    <g stroke-width='10' stroke='#7f651c' opacity='0.1'>
+      <line x1='158.4' x2='158.4' y1='8' y2='290.8'/>
+      <line x1='230.4' x2='230.4' y1='8' y2='290.8'/>
+      <line x1='302.4' x2='302.4' y1='8' y2='290.8'/>
+      <line x1='374.4' x2='374.4' y1='8' y2='290.8'/>
+    </g>
+    <g font-size='28'>
+      <text dy='0.32em' x='64.8' y='39.6'>&amp;</text>
+    </g>
+    <g font-size='18'>
+      <text dy='0.32em' x='158.4' y='28.6'>0|4|…|20</text>
+      <text dy='0.32em' x='230.4' y='50.6'>1|…|17|0</text>
+      <text dy='0.32em' x='302.4' y='28.6'>2|…|18|1</text>
+      <text dy='0.32em' x='374.4' y='50.6'>3|…|19|2</text>
+      <text dy='0.32em' x='64.8' y='90'>0|…|3</text>
+      <text dy='0.32em' x='64.8' y='126'>4|…|7</text>
+      <text dy='0.32em' x='64.8' y='162'> 8|…|11</text>
+      <text dy='0.32em' x='64.8' y='198'>12|…|15</text>
+      <text dy='0.32em' x='64.8' y='234'>16|…|19</text>
+      <text dy='0.32em' x='64.8' y='270'>20|…|2 </text>
+    </g>
+    <g font-size='20'>
+      <text dy='0.32em' x='158.4' y='90'>0</text>
+      <text dy='0.32em' x='230.4' y='90'>1</text>
+      <text dy='0.32em' x='302.4' y='90'>2</text>
+      <text dy='0.32em' x='374.4' y='90'>3</text>
+      <text dy='0.32em' x='158.4' y='126'>4</text>
+      <text dy='0.32em' x='230.4' y='126'>5</text>
+      <text dy='0.32em' x='302.4' y='126'>6</text>
+      <text dy='0.32em' x='374.4' y='126'>7</text>
+      <text dy='0.32em' x='158.4' y='162'>8</text>
+      <text dy='0.32em' x='230.4' y='162'>9</text>
+      <text dy='0.32em' x='302.4' y='162'>10</text>
+      <text dy='0.32em' x='374.4' y='162'>11</text>
+      <text dy='0.32em' x='158.4' y='198'>12</text>
+      <text dy='0.32em' x='230.4' y='198'>13</text>
+      <text dy='0.32em' x='302.4' y='198'>14</text>
+      <text dy='0.32em' x='374.4' y='198'>15</text>
+      <text dy='0.32em' x='158.4' y='234'>16</text>
+      <text dy='0.32em' x='230.4' y='234'>17</text>
+      <text dy='0.32em' x='302.4' y='234'>18</text>
+      <text dy='0.32em' x='374.4' y='234'>19</text>
+      <text dy='0.32em' x='158.4' y='270'>20|0</text>
+      <text dy='0.32em' x='230.4' y='270'>0|1</text>
+      <text dy='0.32em' x='302.4' y='270'>1|2</text>
+      <text dy='0.32em' x='374.4' y='270'>2</text>
+    </g>
+  </g>
+</svg>
+
 <p>Since one modular permutation is needed for every <code><span class='Value'>f</span></code> expanded registers, a better approach is to structure it as a loop of length <code><span class='Value'>f</span></code> and unroll this loop. An unrolled iteration handling 4 adjacent registers works with a mask that combines the selected bits from all those registers, and at the end of the iteration it's advanced by 4 steps; this is the same operation as advancing once, just with different shifts. So this mask contains iterations 0|1|2|3, then 4|5|6|7, and so on. In addition to this &quot;horizontal&quot; mask we need 4 pre-computed &quot;vertical&quot; masks to distinguish within an iteration: one mask combines register 0 of each iteration 0|4|8|…, another does 1|5|9|…, and so on. So the intersection of one horizontal and one vertical mask correctly handles a particular register. The unrolled iteration applies the vertical mask to each of the 4 registers, and the horizontal one to them as a whole. So:</p>
 <ul>
 <li>When extracting, add <code><span class='Value'>h</span> <span class='Value'>&amp;</span> <span class='Paren'>((</span><span class='Value'>i0&amp;v0</span><span class='Paren'>)</span> <span class='Function'>|</span> <span class='Value'>...</span> <span class='Function'>|</span> <span class='Paren'>(</span><span class='Value'>i3&amp;v3</span><span class='Paren'>))</span></code> to the running total.</li>
 
@@ -191,10 +191,46 @@ Boolean folds on short rows can be implemented as a segmented scan, or windowed
 
 This section describes how to perform and use the permutation sending the bit at position `n|f×i` to position `i` within each group of `n←2⋆k` bits, where `f` is odd. It's done by a series of swaps, conditionally exchanging pairs of bits separated by a power of two, starting at `n÷2` and ending at 2. Each swap is a self-inverse, so doing them in the opposite order results in the opposite permutation taking position `i` to `n|f×i`.
 
-The direction we focus on here can extract one bit from every `f`, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way, which can be used for take-cells but is most powerful in [Replicate by constant](replicate.md#constant-replicate) since this also applies to broadcasting as used in Table and leading axis extension.
+The direction we focus on here can extract one bit from every `f`, so it's useful for boolean fold-cells and select-cells picking out a single column. In the other direction, it can spread bits out in the same way: this most directly applies to [take-cells](take.md#bit-interleaving-and-uninterleaving) but also works for [Replicate by constant](replicate.md#constant-replicate), and thus broadcasting for Table and leading axis extension.
 
 ### Decomposing into swaps
 
+<!--GEN
+{
+lgs ← "stroke=currentColor|opacity=0.05"‿"class=purple"‿"class=red"
+rgs ← "class=code"‿"fill=none|stroke=currentColor"
+bg ← "stroke-width=4|stroke=currentColor|stroke-linecap=butt"
+_step ← { h 𝕗_𝕣 i:
+  l ← 2×h ⋄ B ← {(l|𝕩) - h|𝕩}
+  (i - B i) + B 𝕗×i
+}
+s ← > ⊏˜` ss ← (2⋆↕ln) 3 _step¨ <↕n←2⋆ln←4
+
+d ← 34‿56
+y ← +`»1.27⋆↕≠s
+np ← ⋈˜⌜´ 0‿1+(⌽d)×⟨y,↕n⟩
+dim ← ¯4‿3⊸+⌾⊏ d⊸×˘ 1.1‿0.7 (-≍+˜)⊸+ 0¨⊸≍ ⟨n-1, ⊢´y⟩
+rh ← ⊢´ rd ← 11‿13
+
+lines ← Line¨ ⥊≍¨´<⎉1¨d×⟨↕∘≠⊸(≍˘)˘1↓>ss, (⋈⟜-0.21)⊸+˘2↕y⟩
+
+(⥊64‿8(-≍+˜)⊸+dim) SVG g Ge ⟨
+  rc Rect dim
+  "class=yellow|text-anchor=end|font-size=14" Ge ⟨
+    "opacity=0.25|stroke-width=4" Ge (Line d×¯0.5‿0.42⊸+)¨ ∾⟨
+      (2⊸× ⋈˜⊸≍¨ ¯0.8⋈¨y⊏˜{2|𝕩?0;1+𝕊𝕩÷2}¨) 1↓↕n÷2
+      (0‿n≍⋈˜)¨ y
+    ⟩
+    ((⟨1,0.5+rh⟩+d⊸×)¨¯0.5⋈¨y) Text¨ FmtNum 2⋆1+↕ln
+  ⟩
+  "stroke-width=2.5" Ge 1↓lgs Ge¨ (⥊(>+2×<)˝˘2↕s) ⊔ lines
+  "stroke-width=0.2" Ge   rgs Ge¨< (Rect (rd-0.2)⊸(-˜≍2×⊣))¨ np
+  bg Ge (2×rh×n÷˜1+s) (Line (0⋈⊣)(⊢≍˘-˜)((-1⊸+)⌾⊑rd)⊸+)¨ np
+  np Text¨ FmtNum s
+⟩
+}
+-->
+
 First we'll prove that a modular permutation does actually decompose into swap operations. Here's the intuitive case: consider the permutation where index `i` has value `16|5×i` (meaning, that's the original index of the bit that ends up at `i`). At positions `i` and `8+i`, `i<8`, we have `16|5×i` and `16|5×(8+i)` or `16|8+5×i`. These values are different, but both are congruent to `5×i` (mod 8), so one of them is `8|5×i` and the other is `8+8|5×i`. These are the values at positions `i` and `8+i` in the permutation that applies `8|5×i` within each byte, so to extend that permutation from size 8 to size 16 what we need to do is swap these bits if `16|5×i` isn't equal to `8|5×i`.
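
As a quick check of the intuitive case, here is a minimal Python sketch. It is not code from any BQN implementation, and the names `modular_perm` and `extend_by_swaps` are made up for illustration. It works on lists of source indices rather than packed bits: it starts from the size-`h` permutation applied within each half, then swaps positions `i` and `h+i` exactly when `(2×h)|f×i` differs from `h|f×i`.

```python
def modular_perm(f, n):
    """Value at index i is (f*i) mod n: the source index of the bit that lands at i."""
    return [(f * i) % n for i in range(n)]


def extend_by_swaps(f, h):
    """Build the size-2h permutation from the size-h one using conditional swaps.

    Assumes f is odd and h is a power of two (hypothetical helper, for illustration).
    """
    half = modular_perm(f, h)
    # The size-h permutation applied separately to the low and high halves.
    perm = half + [h + v for v in half]
    for i in range(h):
        # Swap the pair (i, h+i) when the wider residue differs from the narrow one.
        if (f * i) % (2 * h) != (f * i) % h:
            perm[i], perm[h + i] = perm[h + i], perm[i]
    return perm


# The example from the text: extend 8|5×i within each byte to 16|5×i.
print(extend_by_swaps(5, 8) == modular_perm(5, 16))
```

The same comparison holds for any odd `f`, since `f×(h+i)` is congruent to `h + f×i` modulo `2×h`.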
 
 To handle it more rigorously, suppose we have performed our permutation of size `h` so the value at `i` is `(i - h|i) + h|f×i` and want to extend this to size `l ← 2×h`. Define `B ← {(l|𝕩) - h|𝕩}`, noting that `h|B𝕩` is always 0. We will show that the value to be moved to `i` appears at `j ← (i - B i) + (B f×i)`. Since `h|j` is `h|i` after dropping `B` terms, we have:
@@ -244,6 +280,33 @@ The total data to permute width `l` is 2+4+…`l÷2` bits, or `l-2`. It can be p
 
 The bits to be passed into the modular permutation need to be collected from the argument (possibly after some processing), one bit out of each `f`. Or, in the other direction, they need to be distributed to the result. This can be done by generating a bitmask of the required position in each register. Then an argument register is and-ed with the bitmask and or-ed into a running total. But generating the bitmask is slow. For example, with row size under 64, updating the mask `m` for the next word is `m>>r | m<<l` for appropriately-chosen shifts `l` and `r`: this is a lot of instructions at each step! For small factors, an unrolled loop with saved masks works; for larger factors, it gets to be a lot of code, and eventually you'll run out of registers.
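
The straightforward version described above might be sketched as follows in Python, with integers masked to 64 bits standing in for registers. This is an illustration under assumptions, not the actual implementation: `word_mask`, `next_mask`, and `gather` are hypothetical names, and the shifts are taken to be `r = 64 mod f` and `l = f - r`, which is one choice that works for odd `f` below 64.

```python
WORD = 64
FULL = (1 << WORD) - 1


def word_mask(f, w):
    """Mask of bits p in word w whose global position 64*w + p is a multiple of f."""
    return sum(1 << p for p in range(WORD) if (WORD * w + p) % f == 0)


def next_mask(m, f):
    """Advance the mask to the next word as m>>r | m<<l (assumes f odd, 1 < f < 64)."""
    r = WORD % f   # the pattern moves down by 64 mod f positions
    l = f - r      # the same pattern one period higher fills in the top bits
    return ((m >> r) | (m << l)) & FULL


def gather(words, f):
    """And each of f words with its mask and or the result into a running total."""
    m = word_mask(f, 0)
    total = 0
    for w in range(f):
        total |= words[w] & m
        m = next_mask(m, f)
    return total
```

After `gather`, the bit selected at global position `f×j` sits at position `64|f×j` of the total, which is the input layout the modular permutation expects.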
 
+<!--GEN
+{
+g ← "fill=currentColor|text-anchor=middle|font-family=BQN,monospace"
+lc ← ("stroke-width"‿"10" ∾ "stroke"‿"opacity"≍˘⊢)¨ ⟨
+  "#521f5e"‿"0.1", "#991814"‿"0.25", "#7f651c"‿"0.1"
+⟩
+
+d ← 72‿36
+txy ← tx‿ty ← d × 0.9‿1.1 ∾¨ 2.2‿2.5 + ↕¨4‿6
+rd ← 0¨⊸≍ dimx‿dimy ← (0.8×d) + ⊢´¨txy
+tp ← (0⋈¨4⥊-⊸⋈11)⊸+⌾(1↓⊏) ⍉⋈⌜´txy
+
+(⥊ 192‿8 (-≍+˜)⊸+ rd) SVG g Ge ⟨
+  rc Rect rd
+  lc "g"⊸Attr⊸Enc¨ Line¨¨ ((¯1(↓⋈↑)⊑)∾1⊸↓) ⟨
+    ((18(⊣⋈-˜)dimx)˙≍⋈˜)¨ 1↓ty
+    (⋈˜≍( 8(⊣⋈-˜)dimy)˙)¨ 1↓tx
+  ⟩
+  "28"‿"18"‿"20" "font-size="⊸∾⊸Ge¨ (+⌜´0<↕¨∘≢)⊸⊔ tp Text¨ {
+    Or ← 1⊸⌽⊸(∾⟜"|"⊸∾˜¨⌾(3↑1⊸↓))⌾(¯1⊸⊏)
+    t ← ⟨"&"⟩ ∾ (0=↕4) (∾⟜"|"⊸∾´ +⟜1⊸↑∾⟨"…"⟩∾-⟜2⊸↑)¨ <˘⍉𝕩
+    Or t ∾ {< ∾⟜"|…|"⊸∾´ ↑¨˜⟜(-⊸⋈⌈○≠´) 0‿¯1⊏𝕩}⊸∾˘ 𝕩
+  } ⌽‿4 ⥊ FmtNum ↕21
+⟩
+}
+-->
+
 Since one modular permutation is needed for every `f` expanded registers, a better approach is to structure it as a loop of length `f` and unroll this loop. An unrolled iteration handling 4 adjacent registers works with a mask that combines the selected bits from all those registers, and at the end of the iteration it's advanced by 4 steps; this is the same operation as advancing once, just with different shifts. So this mask contains iterations 0|1|2|3, then 4|5|6|7, and so on. In addition to this "horizontal" mask we need 4 pre-computed "vertical" masks to distinguish within an iteration: one mask combines register 0 of each iteration 0|4|8|…, another does 1|5|9|…, and so on. So the intersection of one horizontal and one vertical mask correctly handles a particular register. The unrolled iteration applies the vertical mask to each of the 4 registers, and the horizontal one to them as a whole. So:
 - When extracting, add `h & ((i0&v0) | ... | (i3&v3))` to the running total.
 - When depositing, set `c = h & p`, and use `c&v0`, ... `c&v3`.
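
The extracting bullet can be illustrated with a simplified Python sketch. It makes several assumptions that are not from the source: the names `word_mask`, `advance`, and `extract_group` are hypothetical, the vertical masks cover only the complete 4-register iterations, and the leftover registers at the end of the group are handled with their own masks instead of the wrap-around shown in the last row of the figure.

```python
WORD = 64
FULL = (1 << WORD) - 1


def word_mask(f, w):
    """Mask of bits p in word w whose global position 64*w + p is a multiple of f."""
    return sum(1 << p for p in range(WORD) if (WORD * w + p) % f == 0)


def advance(m, f, steps):
    """Move a mask forward by `steps` words; only the shift amounts depend on `steps`."""
    r = (WORD * steps) % f
    return ((m >> r) | (m << (f - r))) & FULL


def extract_group(words, f):
    """Gather the selected bits of f words into one word (assumes f odd, 1 < f < 64)."""
    full = f // 4                       # number of complete 4-register iterations
    v = [0, 0, 0, 0]                    # pre-computed vertical masks
    for t in range(full):
        for k in range(4):
            v[k] |= word_mask(f, 4 * t + k)
    h = 0                               # horizontal mask for iteration 0
    for k in range(4):
        h |= word_mask(f, k)
    total = 0
    for t in range(full):
        i0, i1, i2, i3 = words[4 * t : 4 * t + 4]
        total |= h & ((i0 & v[0]) | (i1 & v[1]) | (i2 & v[2]) | (i3 & v[3]))
        h = advance(h, f, 4)            # one shift pair advances by 4 registers
    # Simplification (not from the source): leftover registers use direct masks.
    for w in range(4 * full, f):
        total |= words[w] & word_mask(f, w)
    return total
```

This relies on the per-register masks within one group of `f` words being disjoint, so that `h & v[k]` in iteration `t` is exactly the mask for register `4×t + k`. Depositing would run the same loop in reverse, forming `c = h & p` once per iteration and writing `c & v[k]` to each register.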