# Sudoku solver - Zig Implementation

## Solving Sudoku Grids on Windows 64

See documentation on how to use this program [here](https://github.com/nilostolte/Sudoku/tree/main/documentation).

## Optimizations done in this version

I have done several different optimizations in Zig. Some of them are either not possible or not
portable in C. Many of them may be available in gcc but not in other compilers. On the other hand,
every optimization made here in the Zig version is portable, although some may not deliver the
best performance on platforms other than x64.

### Using a linear grid

In Zig the grid matrix is stored linearly in an array containing the 81 elements of the grid,
line by line, contiguously:

``` Zig
var grid = [_]u8{0} ** 81; // Sudoku grid stored linearly here
```

This configuration increases cache coherency and avoids the indirections needed to access the elements
via pointers, as is usually done with matrices and as was also done in previous versions of this Zig
code. The linear storage doesn't come for free, since it implies additional operations in the `solve`
function to cope with this configuration.

The most notable is maintaining not only the line and column of an element (the `i` and `j` variables),
but also its index (the `index` variable) in the linear grid.

Additional operations are needed to recover `index` whenever backtracking, by recalculating it
from the previous line and column values popped from the stack. Here one needs to multiply `i` (the
current line) by 9, to jump over the previous lines, and add `j`, the current column:

``` Zig
index = @shlExact(i, 3) + i + j; // i*8 + i + j, that is, i*9 + j
```

Since backtracking occurs less often than the other parts of the loop, these extra operations
don't impact the performance in a noticeable way.

The most frequent operation impacting the linear grid configuration is an extra addition to
increment `index`, besides the usual `j` increment at the end of the loop, just before
testing for a line change and for the end of the loop:

``` Zig
index += 1; // advance to the next position in grid
j += 1; // advance to the next column
```
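
For illustration, here is a minimal sketch of how the column increment, the line-change test and the
end-of-the-loop test can fit together. This is a simplified walk over the grid, not the solver's actual
loop (which also has to handle backtracking):

``` Zig
// Sketch only: walking the 81 elements while keeping i, j and index in sync.
var i: usize = 0;
var j: usize = 0;
var index: usize = 0;
while (true) {
    // ... work on grid[index] ...
    index += 1; // advance to the next position in grid
    j += 1; // advance to the next column
    if (j == 9) { // line change: wrap to the first column of the next line
        j = 0;
        i += 1;
        if (i == 9) break; // past the last line: the grid has been fully visited
    }
}
```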

Fortunately, the time spent in these extra operations didn't outweigh the time gained with the
linear grid storage. Fewer indirections and better coherency when accessing the elements one after
the other in sequence, as done here, amply justified the cost of the extra operations. It's clear that
the less one needs to read values stored in memory, the better the solver performs. Focusing on
that unveiled quite a few surprises once values were calculated dynamically instead of being read
from memory as precalculated values.

### Calculating the grid element value from its bit representation using @popCount

Each grid element value (0 to 9) is represented in binary as shown in the table below to speed up
checking the occupation sets.

| Element Value | Binary Representation | Hexadecimal | Decimal |
| :-----------: | :-------------------: | :---------: | :-----: |
| 0 | **000000000** | 0x000 | 0 |
| 1 | **000000001** | 0x001 | 1 |
| 2 | **000000010** | 0x002 | 2 |
| 3 | **000000100** | 0x004 | 4 |
| 4 | **000001000** | 0x008 | 8 |
| 5 | **000010000** | 0x010 | 16 |
| 6 | **000100000** | 0x020 | 32 |
| 7 | **001000000** | 0x040 | 64 |
| 8 | **010000000** | 0x080 | 128 |
| 9 | **100000000** | 0x100 | 256 |

In practice, zero is never used because in Sudoku zero represents an empty element, one not yet
filled with an estimated value by the solver. All estimated values are therefore between 1 and 9.
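
This one-hot encoding is what makes the occupation checks cheap: testing whether a candidate value is
already used in a line, a column or a 3x3 subgrid reduces to a single bitwise AND. Here is a minimal
sketch of the idea; the names `rows`, `cols`, `cells` and `isFree` are illustrative, not necessarily the
ones used in the solver:

``` Zig
// Sketch only: one 9-bit occupation mask per line, per column and per 3x3 subgrid.
// `code` is the one-hot representation of a candidate value for element (i, j).
fn isFree(rows: []const u16, cols: []const u16, cells: []const u16,
    i: usize, j: usize, code: u16) bool {
    const cell = @divTrunc(i, 3) * 3 + @divTrunc(j, 3); // which 3x3 subgrid
    return (rows[i] | cols[j] | cells[cell]) & code == 0; // one AND against all three sets
}
```

Marking a value as used is the dual operation: OR `code` into the three masks.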

It's easy to convert a value `n` into its binary representation, where:

``` Zig
n ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9}
```

If `code` is the binary representation of `n`, one can calculate `code` in this way:

``` Zig
code = 1 << (n-1)
```

But it's not simple to obtain `n` back from `code`, unless one uses the popcount assembly instruction.

Since the popcount instruction counts the number of ones in an integer binary value, one can calculate
`n` in this way in Zig:

``` Zig
n = @popCount(code - 1) + 1
```

Substituting this code in the Zig version of the Sudoku solver produced a noticeable optimization. The
@popCount built-in actually generates a single Assembler instruction, as shown here:

<p align="center">
<img src="https://github.com/user-attachments/assets/ba6d2502-1c3b-4276-83cd-6f06a3476bcf" width="400">
</p>
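
As a quick sanity check, the round trip between `n` and `code` can be verified with a small standalone
test (recent Zig syntax; not part of the solver itself):

``` Zig
const std = @import("std");

test "round trip between n and its one-hot code" {
    var n: u4 = 1;
    while (n <= 9) : (n += 1) {
        const code = @as(u16, 1) << (n - 1); // code = 1 << (n-1)
        const back = @popCount(code - 1) + 1; // recover n from code
        try std.testing.expectEqual(@as(u16, n), back);
    }
}
```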

### Actually calculating a division by 3 instead of using tables

This was one of the most surprising optimizations of them all. In Sudoku one needs to calculate in
which 3x3 subgrid an element lies (I called these subgrids "cells", although in Sudoku "cell" usually
refers to any of the 81 grid elements) in order to check whether an estimated value for this element
is already used somewhere in its subgrid.

This is normally done by first calculating the following two integer truncating divisions:

``` Zig
@divTrunc(i, 3)
@divTrunc(j, 3)
```

Initially, I was doing this using a table, since I estimated that divisions would be too slow.

But I decided to try doing the division explicitly, as shown above, and I was quite surprised to
see that a significant speedup was obtained. That puzzled me, and it triggered me to investigate
what was going on under the hood.

What I found was that the Assembler code produced was actually only doing an integer multiplication
followed by a shift operation, as shown below.

<p align="center">
<img src="https://github.com/user-attachments/assets/2396d038-f5ff-4f23-a8f5-abe180350a62" width="400">
</p>

I kind of understood that it was multiplying the value by a fixed-point representation of ⅓, but to me
that could never result in an exact integer quotient. Well, it turns out it can.

The math behind it is called Modular Arithmetic. I didn't dive in depth, but the demonstration on
[this site](https://www.pagetable.com/?p=23) is pretty clear, although I just browsed through it. It's
indeed basically a fixed-point notation in binary (the 0xAAAB in the code corresponds to
~0.3333, but shifted left in binary); the arithmetic, however, is not approximate as one would
normally assume. It is demonstrably exact.
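
To illustrate that exactness, here is a small standalone test for the 16-bit case covered by the linked
article (the compiler may pick a different constant and shift for other operand widths): multiplying by
0xAAAB and shifting right by 17 matches truncating division by 3 for every 16-bit value:

``` Zig
const std = @import("std");

test "multiply-and-shift equals truncating division by 3" {
    var x: u32 = 0;
    while (x <= 0xFFFF) : (x += 1) { // every 16-bit value
        const q = (x * 0xAAAB) >> 17; // fixed-point multiply by ~1/3, then shift
        try std.testing.expectEqual(x / 3, q);
    }
}
```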

### Use of prefetch

Prefetch is a very interesting resource for increasing memory cache coherency. One can't use it in many
places within the same context. In this code I used it before entering the loop and at the end of the
loop, to keep `grid[index]` in the cache memory. I just tweaked some values and it actually produced
faster executions.
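
For reference, a prefetch in Zig is issued with the `@prefetch` built-in. A minimal sketch of the kind
of call meant here; the exact options are illustrative, not necessarily the ones used in the solver:

``` Zig
// Hint the CPU to bring grid[index] into the data cache before it is needed.
// .rw = .read: only a read is intended; .locality = 3: keep it in cache if possible.
@prefetch(&grid[index], .{ .rw = .read, .locality = 3, .cache = .data });
```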
