# Sudoku solver - Zig Implementation

## Solving Sudoku Grids on Windows 64

See documentation on how to use this program [here](https://github.com/nilostolte/Sudoku/tree/main/documentation).

## Optimizations done in this version

I have done several different optimizations in Zig. Some are either not possible or not portable
in C. Many of them may be available in GCC but not in other compilers. On the other hand, every
optimization made here in the Zig version is portable, although some may not yield optimal
performance on platforms other than x64.

### Using a linear grid

In Zig, the grid is represented linearly by an array containing the 81 elements of the grid, stored
line by line contiguously:

``` Zig
    var grid = [_]u8{0} ** 81; // Sudoku grid stored linearly here
```

This configuration increases cache coherency and avoids the indirections needed to access elements via
pointers, as is usually done with matrices and as was also done in previous versions of this Zig code. The
linear storage doesn't come for free, since it implies additional operations in the `solve` function to
cope with this configuration.

The most notable is maintaining not only the line and column of an element (the `i` and `j` variables), but
also its index (the `index` variable) in the linear grid.

Additional operations are needed to recover `index` when backtracking, by recalculating it
from the previous line and column values popped from the stack. Here one needs to multiply `i` (the
current line) by 9, to jump over the previous lines, and add `j`, the current column:

``` Zig
    index = @shlExact(i,3) + i + j; // i * 9 + j, since (i << 3) + i == i * 9
```

Since backtracking occurs less often than the other parts of the loop, these extra operations
don't impact the performance in a noticeable way.

The most frequent extra operation required by the linear grid configuration is the addition that
increments `index`, besides the usual `j` increment at the end of the loop, just before
testing for a line change and for the end of the loop:

``` Zig
    index += 1; // advance to the next position in grid
    j += 1;     // advance to the next column
```
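
For illustration, here is a minimal, self-contained sketch of how a loop can keep `i`, `j` and `index` in
sync over the linear grid. It only shows the traversal skeleton; the actual `solve` function does more work
and may be organized differently:

``` Zig
const std = @import("std");

pub fn main() void {
    var grid = [_]u8{0} ** 81; // Sudoku grid stored linearly
    var i: u8 = 0; // current line
    var j: u8 = 0; // current column
    var index: u8 = 0; // current position in the linear grid
    while (true) {
        grid[index] = 1; // placeholder for the real work done on grid[index]
        index += 1; // advance to the next position in grid
        j += 1; // advance to the next column
        if (j == 9) { // line change
            j = 0;
            i += 1;
            if (i == 9) break; // past the last line: end of the grid
        }
    }
    std.debug.print("visited {} elements\n", .{index});
}
```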

Fortunately, the time spent in the extra operations didn't outweigh the time gained with the
linear grid storage. Fewer indirections and more coherency when accessing the elements one after
the other in sequence, as done here, amply justified the cost of the extra operations. It's clear that
the less one needs to read values stored in memory, the better the solver performs. Focusing on
that unveiled quite a few surprises once values were calculated dynamically instead of read back
from memory.

### Calculating the grid element value from bit representation using @popcount

Each grid element value (0 to 9) is represented in binary as shown in the table below, to speed up
checking the occupation sets.

| Element Value | Binary Representation | Hexadecimal | Decimal |
| :-----------: | :-------------------: | :---------: | :-----: |
| 0             | **000000000**         | 0x000       | 0       |
| 1             | **000000001**         | 0x001       | 1       |
| 2             | **000000010**         | 0x002       | 2       |
| 3             | **000000100**         | 0x004       | 4       |
| 4             | **000001000**         | 0x008       | 8       |
| 5             | **000010000**         | 0x010       | 16      |
| 6             | **000100000**         | 0x020       | 32      |
| 7             | **001000000**         | 0x040       | 64      |
| 8             | **010000000**         | 0x080       | 128     |
| 9             | **100000000**         | 0x100       | 256     |

In practice, one never uses zero, because in Sudoku zero represents an empty element, an element not yet
filled with an estimated value by the solver. All estimated values are therefore between 1 and 9.
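
To illustrate why this representation speeds up the checks, here is a minimal sketch assuming an occupation
set stored as one 9-bit mask per line; the names `lineSet` and `code` are only illustrative and are not
taken from the solver:

``` Zig
const std = @import("std");

pub fn main() void {
    // Hypothetical occupation set of one line: the values 3 and 5 are already used.
    var lineSet: u16 = 0x004 | 0x010;
    const code: u16 = 0x010; // binary representation of the value 5
    if (lineSet & code != 0) { // membership test is a single AND
        std.debug.print("5 is already used in this line\n", .{});
    }
    lineSet |= 0x001; // marking the value 1 as used is a single OR
    std.debug.print("line set is now 0x{X:0>3}\n", .{lineSet});
}
```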

It's easy to convert a value `n`, where:

``` Zig
    n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}
```

to its binary representation. If `code` is the binary representation of `n`, one can calculate it this way:

``` Zig
    code = 1 << (n-1)
```

But it's not simple to obtain `n` from `code`, unless one uses the popcount assembly instruction.

Since the popcount instruction counts the number of one bits in an integer, one can calculate `n` this
way in Zig:

``` Zig
    n = @popCount(code - 1) + 1
```
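
Here is a small standalone round trip of the two formulas above (not part of the solver), just to show
that they are inverses of each other for the values 1 to 9:

``` Zig
const std = @import("std");

pub fn main() void {
    var n: u16 = 1;
    while (n <= 9) : (n += 1) {
        const shift: u4 = @intCast(n - 1);
        const code = @as(u16, 1) << shift; // n -> binary representation
        const back = @popCount(code - 1) + 1; // binary representation -> n
        std.debug.print("n = {}, code = 0x{X:0>3}, recovered = {}\n", .{ n, code, back });
    }
}
```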

Substituting this code in the Zig version of the Sudoku solver produced a noticeable optimization. The
`@popCount` built-in actually generates a single assembly instruction, as shown here:

<p align="center">
  <img src="https://github.com/user-attachments/assets/ba6d2502-1c3b-4276-83cd-6f06a3476bcf" width="400">
</p>

### Actually calculating a division by 3 instead of using tables

This was one of the most surprising optimizations of them all. In Sudoku one needs to calculate
in which 3x3 subgrid (which I called a "cell", although in Sudoku the term cell usually refers to any of
the 81 grid elements) an element belongs, in order to check whether an estimated value for this element is
already used somewhere in its subgrid.

This is normally done by first calculating the following two integer truncating divisions:

``` Zig
    @divTrunc(i, 3)
    @divTrunc(j, 3)
```
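
For illustration, here is one way these two quotients can be combined into a single subgrid number (0 to 8,
counted line by line); the exact way the solver uses them may differ:

``` Zig
const std = @import("std");

pub fn main() void {
    const i: u8 = 7; // line of the element
    const j: u8 = 4; // column of the element
    const cell = @divTrunc(i, 3) * 3 + @divTrunc(j, 3); // 3x3 subgrid number
    std.debug.print("element ({}, {}) belongs to subgrid {}\n", .{ i, j, cell });
}
```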

Initially, I was doing this using a table, since I estimated that divisions would be too slow.

But I decided to try doing the division explicitly, as shown above, and I was quite surprised to
see that a significant speed-up was obtained. That puzzled me and led me to investigate
what was going on under the hood.

What I found was that the generated assembly code was actually only doing an integer multiplication
followed by a shift, as shown below.

<p align="center">
  <img src="https://github.com/user-attachments/assets/2396d038-f5ff-4f23-a8f5-abe180350a62" width="400">
</p>

I kind of understood that it was multiplying the value by a fixed-point representation of ⅓, but to me
that could never result in an exact integer corresponding to the quotient. Well, it turns
out it can.

The math behind it is called modular arithmetic. I didn't dive into it in depth, but the demonstration on
[this site](https://www.pagetable.com/?p=23) is pretty clear, although I just browsed through it. It's
indeed basically a fixed-point notation in binary (0xAAAB in the code corresponds to
~0.3333, but shifted left in binary); the arithmetic, however, is not approximate as one would
normally assume. It's demonstrably exact.
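
As a sanity check, the identity can be verified exhaustively for the small range of values used here. The
constant 0xAAAB and the 17-bit shift below follow the usual 16-bit division-by-3 pattern described in the
article; the exact constants emitted by the compiler may differ:

``` Zig
const std = @import("std");

pub fn main() void {
    var x: u32 = 0;
    while (x <= 80) : (x += 1) { // 0..80 covers every index and coordinate in the grid
        const approx = (x * 0xAAAB) >> 17; // multiply by ~1/3 in fixed point, then shift
        const exact = x / 3; // true integer division
        std.debug.assert(approx == exact);
    }
    std.debug.print("multiply-and-shift matches x / 3 for all x in 0..80\n", .{});
}
```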

### Use of prefetch

Prefetch is a very interesting resource for increasing memory cache coherency. One can't use it in many
places in the same context. In this code I used it before entering the loop and at the end of the loop
to keep `grid[index]` in cache memory. I just tweaked some values and it actually produced faster
executions.
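
For reference, here is a hedged sketch of how `@prefetch` can be applied to the grid. The options shown
(read intent, highest locality, data cache) are illustrative defaults and not necessarily the ones used in
the solver:

``` Zig
const std = @import("std");

pub fn main() void {
    const grid = [_]u8{0} ** 81;
    var sum: u32 = 0;
    var index: usize = 0;
    // Before entering the loop: hint that grid[index] will be needed soon.
    @prefetch(&grid[index], .{ .rw = .read, .locality = 3, .cache = .data });
    while (index < grid.len) : (index += 1) {
        sum += grid[index]; // placeholder for the real work on grid[index]
        // At the end of the loop body: hint that the next element will be needed.
        if (index + 1 < grid.len) {
            @prefetch(&grid[index + 1], .{ .rw = .read, .locality = 3, .cache = .data });
        }
    }
    std.debug.print("sum = {}\n", .{sum});
}
```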