Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Textual approach can work #7

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

david-vanderson
Copy link

This implements a textual approach to making universal headers. We diff the different versions and add #if/#endif blocks. You end up with stuff like:

extern ssize_t readlinkat (int __fd, const char *__restrict __path,
      char *__restrict __buf, size_t __len)
#if defined x86_64-linux-gnu.2.32 OR _ZIG_UH_TEST
     __THROW __nonnull ((2, 3)) __wur __attr_access ((__read_only__, 3, 4));
#endif //x86_64-linux-gnu.2.32|
#if defined x86_64-linux-gnu.2.33 OR _ZIG_UH_TEST
     __THROW __nonnull ((2, 3)) __wur __attr_access ((__write_only__, 3, 4));
#endif //x86_64-linux-gnu.2.33|

@andrewrk talked about universal headers being needed to solve some problem, but I don't understand what you are looking for besides size reduction. Does this strategy help you?

For size reduction, combining these headers together saves about 90%

dvanderson@librem14:~/code/universal-headers$ du -shc headers/i*
1.1M	headers/i386-linux-musl
3.1M	headers/i486-linux-gnu.2.32
3.2M	headers/i486-linux-gnu.2.33
3.2M	headers/i486-linux-gnu.2.34
3.3M	headers/i486-linux-gnu.2.35
3.3M	headers/i486-linux-gnu.2.36
3.1M	headers/i586-linux-gnu.2.32
3.2M	headers/i586-linux-gnu.2.33
3.2M	headers/i586-linux-gnu.2.34
3.3M	headers/i586-linux-gnu.2.35
3.3M	headers/i586-linux-gnu.2.36
3.1M	headers/i686-linux-gnu.2.32
3.2M	headers/i686-linux-gnu.2.33
3.2M	headers/i686-linux-gnu.2.34
3.3M	headers/i686-linux-gnu.2.35
3.3M	headers/i686-linux-gnu.2.36
3.1M	headers/ia64-linux-gnu.2.32
3.1M	headers/ia64-linux-gnu.2.33
3.2M	headers/ia64-linux-gnu.2.34
3.2M	headers/ia64-linux-gnu.2.35
3.3M	headers/ia64-linux-gnu.2.36
65M	total
dvanderson@librem14:~/code/universal-headers$ du -sh uh_headers/
6.3M	uh_headers/

Combining less similar headers gives less savings (2/3 for these)

dvanderson@librem14:~/code/universal-headers$ du -shc headers/x86*
27M	headers/x86_64-freebsd.12.3-gnu
28M	headers/x86_64-freebsd.13.1-gnu
3.1M	headers/x86_64-linux-gnu.2.32
3.2M	headers/x86_64-linux-gnu.2.33
3.2M	headers/x86_64-linux-gnu.2.34
3.3M	headers/x86_64-linux-gnu.2.35
3.3M	headers/x86_64-linux-gnu.2.36
3.1M	headers/x86_64-linux-gnu-x32.2.32
3.2M	headers/x86_64-linux-gnu-x32.2.33
3.2M	headers/x86_64-linux-gnu-x32.2.34
3.3M	headers/x86_64-linux-gnu-x32.2.35
3.3M	headers/x86_64-linux-gnu-x32.2.36
1.1M	headers/x86_64-linux-musl
6.4M	headers/x86_64-macos.11-none
6.4M	headers/x86_64-macos.12-none
6.5M	headers/x86_64-macos.13-none
27M	headers/x86-freebsd.12.3-gnu
28M	headers/x86-freebsd.13.1-gnu
161M	total
dvanderson@librem14:~/code/universal-headers$ du -sh uh_headers/
54M	uh_headers/

Look at src/textdiff/example.bash for how I'm using/testing it.

One of the nice things about this strategy is testability - we can pull a specific version from the universal headers and we should get back the same files as that original version (modulo whitespace).

The rest of this description is some details about how it works and what problems I ran into.

Strategy is to maintain 2 work-in-progress files for each header. One has all the lines from all the versions. The other keeps the version list for each line in the first file. Here's how it looks in vimdiff

  82669376 extern ssize_t readlinkat (int __fd, const char *__restrict __pat│  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|                         
  82669376       char *__restrict __buf, size_t __len)                      │  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33|                         
  82669376      __THROW __nonnull ((2, 3)) __wur __attr_access ((__read_only│  x86_64-linux-gnu.2.32|                                               
  82669376      __THROW __nonnull ((2, 3)) __wur __attr_access ((__write_onl│  x86_64-linux-gnu.2.33|                                               
  82669376 #endif                                                           │  x86_64-linux-gnu.2.32|x86_64-linux-gnu.2.33| 

To add a new version to the work-in-progress we

  • normalize whitespace (convert tabs to spaces)
  • add context (prepend each line with a hash - see below for why)
  • diff it against the current work-in-progress (literally call diff -wdN)
  • update both work-in-progress files

Once all the versions have been folded in, outputHeaders goes through the work-in-progress files and adds #if/#endif blocks where the versions change.

Why the context hash stuff?

Since we are adding our own #if/#endif blocks, we need to ensure ours don't interleave with existing ones. It's okay if everything is nested, but if version A had #endif and version B had #endif /* new comment */ then we can't combine in the naive way:

#if defined A
#endif
#endif /* version A */
#if defined B
#endif /* new comment */
#endif /* version B */

The solution here is to preprocess the file and prepend each line with a hash to keep #if/#endif blocks from separate versions from becoming entangled. This can also happen with comments, because you can't put functioning #if/#endif in the middle of /* */ comments.

Some random things I did not expect:

  • there are lines inside /* */ comments that start with #ifndef because the comment is giving an example (so the code needs to track if it is inside a comment or not)
  • there can be whitespace between # and if

Testing

Right now each #if that we add gets an extra _ZIG_UH_TEST added to the end. This allows testHeaders to only evaluate those #if/#endif blocks and keep all the rest.

I've run every header dir successfully in at least one batch.

Stuff that could be improved

  • context hash probably should grab continuation lines (sometimes the #if condition is split onto multiple lines)
  • instead of maintaining comments, strip them?
  • do something intelligent with sym links rather than copying them?

@andrewrk
Copy link
Member

andrewrk commented Oct 12, 2023

Thanks for exploring this! These are some promising results indeed.

To answer your question, yes, size reduction is the main goal here. I want to ship potentially even a lot more sets of headers than these along with the Zig compiler, and want to keep the installation size down to a reasonable amount. At the same time, I don't want to ship files that are compressed with a generic file compression strategy because I want to keep these properties for the Zig installation:

  • One can edit any source file from lib/ and the changes are reflected immediately.
  • C compilation errors that point to header file source locations point to a location inside the zig installation rather than inside a cache directory.

Furthermore, I think that providing a set of "universal headers" as a standalone deliverable could be something interesting to offer in and of itself. Other projects might like to benefit from it, and perhaps contribute to this project in return.

This patch is making me think, maybe we could go with this textual approach you have explored, but with an additional preprocessing step to try to normalize them even more? For example, you're already normalizing the whitespace, but what if we tried to normalize the order of top-level declarations, too?

To flesh out the idea a little more, start with some more straightforward normalizations:

  • remove all parameter names
  • remove all comments
  • remove empty lines
  • remove whitespace between # and if (and similar whitespace)
  • flatten out all multi-line #define statements into single-line
  • normalize #if defined(A) and #if !defined(A) to #ifdef A and #ifndef A respectively (since the latter are fewer bytes)
  • remove extern keyword from function declarations

Finally, the complicated part. Mark every declaration as having a set of input names and output names. For example, #define a b+c has the input set of {b, c} and output set of {a}. A #include would have a set of inputs for all inputs inside it, recursively, and a set of outputs for all outputs inside it, recursively.

Our goal is to perform a sort of the declarations for each file, such that the sorted result has minimal textual differences across header files.

For a given header file, for each named declaration, we will create two sort keys: primary, and then the name itself.

For the primary, create a bitset with a bit for every header file of that name (e.g. a bit for every stdio.h file in the headers), populate that bit based on whether or not the name exists. This bitset, interpreted as an unsigned integer, is now used as the primary sort key.

#ifdef and similar constructs are treated as having the set of inputs for all inputs inside it, recursively, and a set of outputs for all outputs inside it, recursively. Then the same sorting is recursively applied inside of them.

The idea here is that after this aggressive normalization is performed, textual approach could work even better, especially when combining less similar headers.

What do you think about this idea?

@david-vanderson
Copy link
Author

Thanks for the explanation! I think I understand what you are going for and agree with those installation properties.

The main worry I have is ensuring correctness through the transformation - to maintain the ability to partially evaluate a universal header for a given version and verify it's the "same" as the original header. I think there's also some utility to having the universal header be somewhat readable by us in the future debugging things, but that worry is more vague.

My guess is that a few of those normalizations (like removing comments and empty lines) will be both easy to verify and give us most of the savings.

Let me implement the easy ones, think more about the harder ones, and come back with more info.

@andrewrk
Copy link
Member

The main worry I have is ensuring correctness through the transformation - to maintain the ability to partially evaluate a universal header for a given version and verify it's the "same" as the original header.

On that note, @marler8997 has been working on a verification tool.

@david-vanderson
Copy link
Author

Progress update. Added stripping comments and newlines. Can now compile all headers together:

$ du -sh headers/
1.8G	headers/
$ du -sh uh_headers/
149M	uh_headers/
$ grep -rh ZIG_UH uh_headers/* | wc -c
52635229

We are down from 1.8G to 150M. And of that, about 1/3 is just the overhead of the huge extra #if lines. They look like this:

#if defined aarch64_be-linux-gnu.2.32 OR defined aarch64_be-linux-gnu.2.33 OR defined aarch64_be-linux-gnu.2.34 OR ...

I'm thinking of doing an extra processing step at the end to collapse some of those. So it could detect that the #if has all of the aarch64_be* and replace it with a single aarch64_be_all. The downside here is that mapping would need to be known to zig so that when you ask for aarch64_be-linux-gnu.2.32 it automatically includes aarch64_be_all.

The code could generate an extra header file that looks like this:

#if defined aarch64_be-linux-gnu.2.32 OR defined aarch64_be-linux-gnu.2.33 OR defined aarch64_be-linux-gnu.2.34 ...
#define aarch64_be_all
#endif

And then add that header to the top of every file I guess. Would that make sense as a way to organize these?

I'll go down that route unless you have a better plan.

Just eyeballing the headers, I don't think doing more sophisticated minimization will buy much savings, and they all require a much deeper understanding of C/C++ syntax. Let's see if this is good enough.

@andrewrk
Copy link
Member

Here's a fun point of comparison: Zig's current installation size for libc headers is 91M. So you have already achieved increasing the number from 56 to 451 different libc header sets, with only 1.6x the size. I think this is within shippable range, however, I still think we can do even better.

As for reducing large sets of #if together: It's ok to introduce more requirements on what must be defined by the compiler in order to use these headers. For example, for aarch64, I think it should only check the __aarch64__ preprocessor definition, and rely on the compiler to properly set that.

Maybe we can have a look at some common, large sets of if conditions and see if there are any similar simplifications that can be made?

@kristoff-it
Copy link
Member

This is awesome.

The clause reduction problem is text-book SAT, but you can probably get 90% close to the best theoretical solution just by applying some good old-fashioned human knowledge.

I do think that reordering definitions (what Andrew suggested a few comments ago) might also prove to be a worthwhile reduction (as it acts like a full removal of a #if clause), but I personally am not fully familiar with how much more complicated it would make things.

@david-vanderson
Copy link
Author

Down to 103M.

Take a look at src/textdiff/reductions.zig - each entry lists all the headers that go together and the last line is what we replace them with. Some headers are in multiple reductions. With this set, there aren't that many remaining huge #if statments (judging from grep -rn ZIG_UH uh_headers/).

The replacements themselves are completely UNTESTED, and I made up some of them. Just want to see if this is the right path before digging into exactly what they should be. There will be a large cleanup pass if this works.

What do you think?

@andrewrk
Copy link
Member

Wow, incredible! That's only a 13% increase from status quo!

I'll make some time this week to tinker with this.

@david-vanderson
Copy link
Author

Thanks! I've added src/textdiff/HOWTO.txt to help get you going.

@motiejus
Copy link

This is mindblowing/amazing, depending on how one sees things. I am watching this closely now. Good job, @david-vanderson !

Somewhat unrelated question before this gets rolled out: how far back do we want to go with regards to glibc? RHEL7 ships with 2.17, the oldest "supported" distro I am aware of; it's EOL will be on 2024-06-30. Next one is RHEL 8, which ships glibc 2.28.

@motiejus
Copy link

❤️ from the past: ziglang/zig#15573

#if (defined __FreeBSD__ AND defined __GLIBC__) OR __GLIBC_MINOR__ == 32 OR __GLIBC_MINOR__ == 33 OR _ZIG_UH_
#define dn_expand  __dn_expand
#define dn_skipname  __dn_skipname
#endif /*_ZIG_UH_*/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants