Skip to content

Conversation

@marmitar
Copy link
Contributor

@marmitar marmitar commented Jul 27, 2025

Branchless Square Root

Optimized sqrt implementation in Yul that avoids implicit and explicit branches, which are relatively expensive on the EVM. It eliminates Solidity's division checks, reuses and improves msb, and expresses conditionals arithmetically. The result is 49% lower gas cost on average and a constant 407 gas per call.

Estimated Gas Cost

Old New %diff
Maximum 807 407 -49.6%
Average over uint256 798.1 407 -49.0%
Standard Deviation 9.8 0 -100%
Minimum ($x > 0$) 638 407 -36.2%
$x = 0$ 74 407 +450%

Code Transformations

Branchless log2 (-40 gas on average)

The prior code branched once per region in uint256, costing ~14 gas (PUSH/JUMPI/JUMPDEST) per branch plus 24-30 gas when taken. About half of those branches fire on average, so switching to msb() already saves gas, even though msb() does one extra iteration (msb0), which sqrt discards.

Slight regression at this step: inputs that previously short-circuited (including values around 1e18) go from ~700 to ~754 gas because early-outs are gone. Acceptable, because this change enables the msb() optimization below, which cuts 60 gas and is required to reach the final 407 gas.

Reorder instructions on msb (-75 gas)

msb() previously updted x as it progressed. On a stack machine (EVM) this forces extra SWAP/DUP traffic. We can compute x >> result every iteration using the same number of shr(). Saves 75 gas.

Branchless "perfect square" condition (-42 gas on average)

The final iterate is always either $\lfloor\sqrt{x}\rfloor$ or $\lfloor\sqrt{x}\rfloor + 1$. So we replace result = min(result, x / result) with a boolean subtract: result -= (result > x / result) ? 1 : 0. Yul implementation drops gas from 679-687 to 641.

Unchecked division (-196 gas)

Solidity still inserts a division-by-zero check, even in unchecked blocks (Checked or Unchecked Arithmetic). The only way to avoid it is to use assembly directly, which yields the biggest win here: -196 gas.

Skip condition for zero (-38 gas on average)

With division in Yul, we drop the x == 0 branch and rely on the EVM semantics div(a, 0) = 0 (EVM Codes - DIV). This saves 38 gas for every nonzero input; $x = 0$ regresses by +333. This is intentional: inputs are expected to be non-trivial most of the time ($\geq 90\%$), so the average gas cost improves.

Optimized inlined msb (-21 gas, not implemented)

Inlining msb() into sqrt and skipping msb0 saves 21 gas, but the readability hit isn't worth it IMO, so I left it out. I can bring it back if you like it.

@PaulRBerg
Copy link
Owner

Hey @marmitar, thank you very much for this PR. I will review it during the weekend!

@marmitar marmitar force-pushed the perf/branchless-sqrt branch 4 times, most recently from 6bf6caf to fc31231 Compare July 29, 2025 00:36
@marmitar
Copy link
Contributor Author

@PaulRBerg I have another even more optimized implementation using De Bruijn sequences. It's a bit more involved, but the gas costs goes to 346. I could propose that instead, if you prefer.

De Bruijn implementation
/// @notice Calculates the square root of x using the Babylonian method.
///
/// @dev See https://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Babylonian_method.
///
/// Notes:
/// - If x is not a perfect square, the result is rounded down.
/// - Credits to OpenZeppelin for the explanations in comments below.
///
/// @param x The uint256 number for which to calculate the square root.
/// @return result The result as a uint256.
/// @custom:smtchecker abstract-function-nondet
function sqrt(uint256 x) pure returns (uint256 result) {
    // For our first guess, we find the most significant *byte* of x and use its value and position
    // to approximate the square root of x.
    //
    // For this, we want to find $k \in [0,255]$ and $n \in {0,8,...,248}$ such that $x \approx k 2^n$.
    // We can find $n$ by doing five steps of the `msb()` algorithm ($n = 8 floor(msb(x) / 8)$), and
    // then we also have $k = floor(x / 2^n)$.
    //
    // Once we have those values, the square root can be approximated by $sqrt(x) \approx sqrt(k 2^n) =
    // sqrt(k) 2^{n/2}$. For $sqrt(k)$, we use a lookup table that fits in a 32-byte word, which means
    // that we'll need to use the top 5 bits of $k$ for indexing, instead of the full 8 bits, so
    // $i = k >> 3$. Because of this, each position in the table must have the average square root for
    // all bytes that it covers:
    //
    // $$
    // table[i] = round(1/8 sum_{t=0}^7 sqrt(8i+t))
    // $$
    //
    // The table is encoded big-endian so `byte(i, table)` returns entry `i`. This process will produce
    // a good initial guess for $sqrt(x)$, with at least one correct bit.
    assembly ("memory-safe") {
        let n := shl(7, lt(0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF, x))
        n := or(n, shl(6, lt(0xFFFFFFFFFFFFFFFF, shr(n, x))))
        n := or(n, shl(5, lt(0xFFFFFFFF, shr(n, x))))
        n := or(n, shl(4, lt(0xFFFF, shr(n, x))))
        n := or(n, shl(3, lt(0xFF, shr(n, x))))

        let table := 0x02030405060707080809090A0A0A0B0B0B0C0C0C0D0D0D0E0E0E0F0F0F0F1010
        let i := shr(3, shr(n, x))
        result := shl(shr(1, n), byte(i, table))
    }

    // At this point, `result` is an estimation with at least one bit of precision. We know the true value has at
    // most 128 bits, since it is the square root of a uint256. Newton's method converges quadratically (precision
    // doubles at every iteration). We thus need at most 7 iterations to turn our partial result with one bit of
    // precision into the expected uint128 result.
    assembly ("memory-safe") {
        // note: division by zero in EVM returns zero
        result := shr(1, add(result, div(x, result)))
        result := shr(1, add(result, div(x, result)))
        result := shr(1, add(result, div(x, result)))
        result := shr(1, add(result, div(x, result)))
        result := shr(1, add(result, div(x, result)))
        result := shr(1, add(result, div(x, result)))
        result := shr(1, add(result, div(x, result)))

        // If x is not a perfect square, round the result toward zero.
        result := sub(result, gt(result, div(x, result)))
    }
}

@PaulRBerg
Copy link
Owner

Thanks @marmitar. I suggest putting the De Bruihn implementation in another branch so we can compare the complexity. I would also suggest keeping the original implementation in the tests directory so that we can compare it to the new implementation from this PR (for regression testing).

@marmitar
Copy link
Contributor Author

marmitar commented Jul 29, 2025

Oh, I just noticed that there are no fuzz tests for sqrt() nor msb(). I'll open a PR for them, soon. As for using the old implementation as reference, I don't think it's a good idea. These mathematical functions have very well defined properties that can be used to validate their correctness. It's faster and safer then relying on previous implementations.

For msb():

$$ x = 0 \implies \text{msb}(x) = 0 $$ $$ x > 0 \implies x \gg \text{msb}(x) = 1 $$

For sqrt():

$$ \left\lfloor\sqrt{x}\right\rfloor \leq \sqrt{x} < \left\lfloor\sqrt{x}\right\rfloor + 1 $$ $$ \left\lfloor\sqrt{x}\right\rfloor^2 \leq x < \left(\left\lfloor\sqrt{x}\right\rfloor + 1\right)^2 $$

Also, that's what integer square root implementations do, see Python's isqrt.

@PaulRBerg
Copy link
Owner

Yeah, fuzz those for those functions would be helpful.

It would be good idea to use established mathematical properties to test out those functions, but I would still keep a copy of the original implementations simply because a lot of PRBMath users are using them now, and in this way, we can stress test both the old version and the new version. We could do something like:

old implementation == new implementation
old implementation ~ mathematical properties
new implementation ~ mathematical properties

Where ~ means 'adheres to'.

That said, I can understand if you don't have the time to write the tests like this. Feel free to implement the simple tests for the new version, and I can handle the rest later.

@marmitar marmitar force-pushed the perf/branchless-sqrt branch from fc31231 to 57b01fa Compare July 31, 2025 01:03
@marmitar
Copy link
Contributor Author

Ok, I'll do that.

In the meantime, I was tracking this gas variance on msb() implementation that didn't make sense to me. Turns out solc is better at optimizing the comparison ladder with lt()s than gt()s. Also, avoided some swaps by recomputing x >> result. -15 gas for free.

@PaulRBerg
Copy link
Owner

Thanks @marmitar, will review this over the weekend.

Yeah, gas golfing is difficult with the latest versions of Solidity, especially when --via-ir is used.

@marmitar marmitar force-pushed the perf/branchless-sqrt branch from 57b01fa to b5a3acc Compare August 2, 2025 01:24
@marmitar
Copy link
Contributor Author

marmitar commented Aug 2, 2025

I opened #265 with property-based tests, rebased this on top of that one, and added the regression tests here.

@marmitar marmitar force-pushed the perf/branchless-sqrt branch from b5a3acc to dc36e28 Compare August 2, 2025 19:47
@PaulRBerg
Copy link
Owner

tyvm @marmitar.

Apologies for the delay - I got overrun this weekend with life admin. I will review this week!

@marmitar
Copy link
Contributor Author

marmitar commented Aug 4, 2025

Oh, no need to hurry. Do it when you have time.

@marmitar marmitar force-pushed the perf/branchless-sqrt branch from dc36e28 to 79a6727 Compare August 30, 2025 20:56
Estimated gas reduction from 798.1 to 407 gas.
@marmitar marmitar force-pushed the perf/branchless-sqrt branch from 79a6727 to a0dff59 Compare August 30, 2025 20:59
@marmitar
Copy link
Contributor Author

Rebased it to main and updated the reference implementation of msb used in tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants