Textual approach can work #7

david-vanderson · 2023-09-29T01:45:47Z

This implements a textual approach to making universal headers. We diff the different versions and add #if/#endif blocks. You end up with stuff like:

extern ssize_t readlinkat (int __fd, const char *__restrict __path,
      char *__restrict __buf, size_t __len)
#if defined x86_64-linux-gnu.2.32 OR _ZIG_UH_TEST
     __THROW __nonnull ((2, 3)) __wur __attr_access ((__read_only__, 3, 4));
#endif //x86_64-linux-gnu.2.32|
#if defined x86_64-linux-gnu.2.33 OR _ZIG_UH_TEST
     __THROW __nonnull ((2, 3)) __wur __attr_access ((__write_only__, 3, 4));
#endif //x86_64-linux-gnu.2.33|

@andrewrk talked about universal headers being needed to solve some problem, but I don't understand what you are looking for besides size reduction. Does this strategy help you?

For size reduction, combining these headers together saves about 90%

dvanderson@librem14:~/code/universal-headers$ du -shc headers/i*
1.1M	headers/i386-linux-musl
3.1M	headers/i486-linux-gnu.2.32
3.2M	headers/i486-linux-gnu.2.33
3.2M	headers/i486-linux-gnu.2.34
3.3M	headers/i486-linux-gnu.2.35
3.3M	headers/i486-linux-gnu.2.36
3.1M	headers/i586-linux-gnu.2.32
3.2M	headers/i586-linux-gnu.2.33
3.2M	headers/i586-linux-gnu.2.34
3.3M	headers/i586-linux-gnu.2.35
3.3M	headers/i586-linux-gnu.2.36
3.1M	headers/i686-linux-gnu.2.32
3.2M	headers/i686-linux-gnu.2.33
3.2M	headers/i686-linux-gnu.2.34
3.3M	headers/i686-linux-gnu.2.35
3.3M	headers/i686-linux-gnu.2.36
3.1M	headers/ia64-linux-gnu.2.32
3.1M	headers/ia64-linux-gnu.2.33
3.2M	headers/ia64-linux-gnu.2.34
3.2M	headers/ia64-linux-gnu.2.35
3.3M	headers/ia64-linux-gnu.2.36
65M	total
dvanderson@librem14:~/code/universal-headers$ du -sh uh_headers/
6.3M	uh_headers/

Combining less similar headers gives less savings (2/3 for these)

dvanderson@librem14:~/code/universal-headers$ du -shc headers/x86*
27M	headers/x86_64-freebsd.12.3-gnu
28M	headers/x86_64-freebsd.13.1-gnu
3.1M	headers/x86_64-linux-gnu.2.32
3.2M	headers/x86_64-linux-gnu.2.33
3.2M	headers/x86_64-linux-gnu.2.34
3.3M	headers/x86_64-linux-gnu.2.35
3.3M	headers/x86_64-linux-gnu.2.36
3.1M	headers/x86_64-linux-gnu-x32.2.32
3.2M	headers/x86_64-linux-gnu-x32.2.33
3.2M	headers/x86_64-linux-gnu-x32.2.34
3.3M	headers/x86_64-linux-gnu-x32.2.35
3.3M	headers/x86_64-linux-gnu-x32.2.36
1.1M	headers/x86_64-linux-musl
6.4M	headers/x86_64-macos.11-none
6.4M	headers/x86_64-macos.12-none
6.5M	headers/x86_64-macos.13-none
27M	headers/x86-freebsd.12.3-gnu
28M	headers/x86-freebsd.13.1-gnu
161M	total
dvanderson@librem14:~/code/universal-headers$ du -sh uh_headers/
54M	uh_headers/

Look at src/textdiff/example.bash for how I'm using/testing it.

One of the nice things about this strategy is testability - we can pull a specific version from the universal headers and we should get back the same files as that original version (modulo whitespace).

The rest of this description is some details about how it works and what problems I ran into.

Strategy is to maintain 2 work-in-progress files for each header. One has all the lines from all the versions. The other keeps the version list for each line in the first file. Here's how it looks in vimdiff

  82669376 extern ssize_t readlinkat (int __fd, const char *__restrict __pat│  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|                         
  82669376       char *__restrict __buf, size_t __len)                      │  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|                         
  82669376      __THROW __nonnull ((2, 3)) __wur __attr_access ((__read_only│  x86_64-linux-gnu.2.32|                                               
  82669376      __THROW __nonnull ((2, 3)) __wur __attr_access ((__write_onl│  x86_64-linux-gnu.2.33|                                               
  82669376 #endif                                                           │  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|

To add a new version to the work-in-progress we

normalize whitespace (convert tabs to spaces)
add context (prepend each line with a hash - see below for why)
diff it against the current work-in-progress (literally call diff -wdN)
update both work-in-progress files

Once all the versions have been folded in, outputHeaders goes through the work-in-progress files and adds #if/#endif blocks where the versions change.

Why the context hash stuff?

Since we are adding our own #if/#endif blocks, we need to ensure ours don't interleave with existing ones. It's okay if everything is nested, but if version A had #endif and version B had #endif /* new comment */ then we can't combine in the naive way:

#if defined A
#endif
#endif /* version A */
#if defined B
#endif /* new comment */
#endif /* version B */

The solution here is to preprocess the file and prepend each line with a hash to keep #if/#endif blocks from separate versions from becoming entangled. This can also happen with comments, because you can't put functioning #if/#endif in the middle of /* */ comments.

Some random things I did not expect:

there are lines inside /* */ comments that start with #ifndef because the comment is giving an example (so the code needs to track if it is inside a comment or not)
there can be whitespace between # and if

Testing

Right now each #if that we add gets an extra _ZIG_UH_TEST added to the end. This allows testHeaders to only evaluate those #if/#endif blocks and keep all the rest.

I've run every header dir successfully in at least one batch.

Stuff that could be improved

context hash probably should grab continuation lines (sometimes the #if condition is split onto multiple lines)
instead of maintaining comments, strip them?
do something intelligent with sym links rather than copying them?

andrewrk · 2023-10-12T00:08:46Z

Thanks for exploring this! These are some promising results indeed.

To answer your question, yes, size reduction is the main goal here. I want to ship potentially even a lot more sets of headers than these along with the Zig compiler, and want to keep the installation size down to a reasonable amount. At the same time, I don't want to ship files that are compressed with a generic file compression strategy because I want to keep these properties for the Zig installation:

One can edit any source file from lib/ and the changes are reflected immediately.
C compilation errors that point to header file source locations point to a location inside the zig installation rather than inside a cache directory.

Furthermore, I think that providing a set of "universal headers" as a standalone deliverable could be something interesting to offer in and of itself. Other projects might like to benefit from it, and perhaps contribute to this project in return.

This patch is making me think, maybe we could go with this textual approach you have explored, but with an additional preprocessing step to try to normalize them even more? For example, you're already normalizing the whitespace, but what if we tried to normalize the order of top-level declarations, too?

To flesh out the idea a little more, start with some more straightforward normalizations:

remove all parameter names
remove all comments
remove empty lines
remove whitespace between # and if (and similar whitespace)
flatten out all multi-line #define statements into single-line
normalize #if defined(A) and #if !defined(A) to #ifdef A and #ifndef A respectively (since the latter are fewer bytes)
remove extern keyword from function declarations

Finally, the complicated part. Mark every declaration as having a set of input names and output names. For example, #define a b+c has the input set of {b, c} and output set of {a}. A #include would have a set of inputs for all inputs inside it, recursively, and a set of outputs for all outputs inside it, recursively.

Our goal is to perform a sort of the declarations for each file, such that the sorted result has minimal textual differences across header files.

For a given header file, for each named declaration, we will create two sort keys: primary, and then the name itself.

For the primary, create a bitset with a bit for every header file of that name (e.g. a bit for every stdio.h file in the headers), populate that bit based on whether or not the name exists. This bitset, interpreted as an unsigned integer, is now used as the primary sort key.

#ifdef and similar constructs are treated as having the set of inputs for all inputs inside it, recursively, and a set of outputs for all outputs inside it, recursively. Then the same sorting is recursively applied inside of them.

The idea here is that after this aggressive normalization is performed, textual approach could work even better, especially when combining less similar headers.

What do you think about this idea?

david-vanderson · 2023-10-12T13:32:40Z

Thanks for the explanation! I think I understand what you are going for and agree with those installation properties.

The main worry I have is ensuring correctness through the transformation - to maintain the ability to partially evaluate a universal header for a given version and verify it's the "same" as the original header. I think there's also some utility to having the universal header be somewhat readable by us in the future debugging things, but that worry is more vague.

My guess is that a few of those normalizations (like removing comments and empty lines) will be both easy to verify and give us most of the savings.

Let me implement the easy ones, think more about the harder ones, and come back with more info.

andrewrk · 2023-10-12T16:58:21Z

The main worry I have is ensuring correctness through the transformation - to maintain the ability to partially evaluate a universal header for a given version and verify it's the "same" as the original header.

On that note, @marler8997 has been working on a verification tool.

david-vanderson · 2023-10-20T15:32:32Z

Progress update. Added stripping comments and newlines. Can now compile all headers together:

$ du -sh headers/
1.8G	headers/
$ du -sh uh_headers/
149M	uh_headers/
$ grep -rh ZIG_UH uh_headers/* | wc -c
52635229

We are down from 1.8G to 150M. And of that, about 1/3 is just the overhead of the huge extra #if lines. They look like this:

#if defined aarch64_be-linux-gnu.2.32 OR defined aarch64_be-linux-gnu.2.33 OR defined aarch64_be-linux-gnu.2.34 OR ...

I'm thinking of doing an extra processing step at the end to collapse some of those. So it could detect that the #if has all of the aarch64_be* and replace it with a single aarch64_be_all. The downside here is that mapping would need to be known to zig so that when you ask for aarch64_be-linux-gnu.2.32 it automatically includes aarch64_be_all.

The code could generate an extra header file that looks like this:

#if defined aarch64_be-linux-gnu.2.32 OR defined aarch64_be-linux-gnu.2.33 OR defined aarch64_be-linux-gnu.2.34 ...
#define aarch64_be_all
#endif

And then add that header to the top of every file I guess. Would that make sense as a way to organize these?

I'll go down that route unless you have a better plan.

Just eyeballing the headers, I don't think doing more sophisticated minimization will buy much savings, and they all require a much deeper understanding of C/C++ syntax. Let's see if this is good enough.

andrewrk · 2023-10-20T19:14:18Z

Here's a fun point of comparison: Zig's current installation size for libc headers is 91M. So you have already achieved increasing the number from 56 to 451 different libc header sets, with only 1.6x the size. I think this is within shippable range, however, I still think we can do even better.

As for reducing large sets of #if together: It's ok to introduce more requirements on what must be defined by the compiler in order to use these headers. For example, for aarch64, I think it should only check the __aarch64__ preprocessor definition, and rely on the compiler to properly set that.

Maybe we can have a look at some common, large sets of if conditions and see if there are any similar simplifications that can be made?

kristoff-it · 2023-10-20T19:35:58Z

This is awesome.

The clause reduction problem is text-book SAT, but you can probably get 90% close to the best theoretical solution just by applying some good old-fashioned human knowledge.

I do think that reordering definitions (what Andrew suggested a few comments ago) might also prove to be a worthwhile reduction (as it acts like a full removal of a #if clause), but I personally am not fully familiar with how much more complicated it would make things.

david-vanderson · 2023-10-23T02:25:09Z

Down to 103M.

Take a look at src/textdiff/reductions.zig - each entry lists all the headers that go together and the last line is what we replace them with. Some headers are in multiple reductions. With this set, there aren't that many remaining huge #if statments (judging from grep -rn ZIG_UH uh_headers/).

The replacements themselves are completely UNTESTED, and I made up some of them. Just want to see if this is the right path before digging into exactly what they should be. There will be a large cleanup pass if this works.

What do you think?

andrewrk · 2023-10-24T18:11:06Z

Wow, incredible! That's only a 13% increase from status quo!

I'll make some time this week to tinker with this.

david-vanderson · 2023-10-24T19:35:01Z

Thanks! I've added src/textdiff/HOWTO.txt to help get you going.

motiejus · 2023-10-25T05:50:40Z

This is mindblowing/amazing, depending on how one sees things. I am watching this closely now. Good job, @david-vanderson !

Somewhat unrelated question before this gets rolled out: how far back do we want to go with regards to glibc? RHEL7 ships with 2.17, the oldest "supported" distro I am aware of; it's EOL will be on 2024-06-30. Next one is RHEL 8, which ships glibc 2.28.

motiejus · 2023-10-25T06:00:54Z

❤️ from the past: ziglang/zig#15573

#if (defined __FreeBSD__ AND defined __GLIBC__) OR __GLIBC_MINOR__ == 32 OR __GLIBC_MINOR__ == 33 OR _ZIG_UH_
#define dn_expand  __dn_expand
#define dn_skipname  __dn_skipname
#endif /*_ZIG_UH_*/

david-vanderson added 8 commits October 19, 2023 19:43

textual combination strategy

5b5d00c

working on x86*

b1faa27

make example output to uh_headers

11cfb49

add some comments

e3c556b

only use /* */ comments in output

a0e83af

tweak bash script to accept header dirs on command line

e5eccc6

filter c-style comments and empty lines, bug fix

b80deb4

filter // comments

1d5a1fb

david-vanderson force-pushed the master branch from 251d33d to 1d5a1fb Compare October 20, 2023 15:02

david-vanderson added 2 commits October 22, 2023 22:02

add reductions.zig that reduces huge #ifs

e10250a

update testHeaders for reductions

a178a9e

add HOWTO

1604b4e

andrewrk mentioned this pull request Nov 24, 2023

Fix libc_nonshared.a ziglang/zig#17702

Merged

ghost mentioned this pull request Feb 1, 2024

zig tracking issue uber/hermetic_cc_toolchain#153

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Textual approach can work #7

Textual approach can work #7

david-vanderson commented Sep 29, 2023

andrewrk commented Oct 12, 2023 •

edited

Loading

david-vanderson commented Oct 12, 2023

andrewrk commented Oct 12, 2023

david-vanderson commented Oct 20, 2023

andrewrk commented Oct 20, 2023

kristoff-it commented Oct 20, 2023

david-vanderson commented Oct 23, 2023

andrewrk commented Oct 24, 2023

david-vanderson commented Oct 24, 2023

motiejus commented Oct 25, 2023

motiejus commented Oct 25, 2023

Textual approach can work #7

Are you sure you want to change the base?

Textual approach can work #7

Conversation

david-vanderson commented Sep 29, 2023

The rest of this description is some details about how it works and what problems I ran into.

Why the context hash stuff?

Testing

Stuff that could be improved

andrewrk commented Oct 12, 2023 • edited Loading

david-vanderson commented Oct 12, 2023

andrewrk commented Oct 12, 2023

david-vanderson commented Oct 20, 2023

andrewrk commented Oct 20, 2023

kristoff-it commented Oct 20, 2023

david-vanderson commented Oct 23, 2023

andrewrk commented Oct 24, 2023

david-vanderson commented Oct 24, 2023

motiejus commented Oct 25, 2023

motiejus commented Oct 25, 2023

andrewrk commented Oct 12, 2023 •

edited

Loading